// transcript — 7060 segments
0:00 Welcome to the Local AI Masterclass
0:02 Welcome to the video that I have put the most effort into creating by far on my
0:08 channel to date. This is the local AI master class and we're going to dive
0:12 into everything you need to know about local AI. What it is, why it's so
0:16 important for you no matter what you are building with AI. How you can run your
0:20 own large language models and self-host your own infrastructure. How you can
0:26 build 100% private and offline AI agents and deploy them to the cloud in a secure
0:31 way. I have everything that you need to get started here even if you haven't
0:35 done anything with local AI before and I take things pretty far. There is a lot
0:38 of value packed into this for you. So, buckle up, enjoy the ride and follow
0:41 Agenda for the Local AI Masterclass
0:43 along as well. So, first things first, let's start with an agenda for the
0:47 master class. There are so many things that I cannot wait to share with you.
0:51 And I have very detailed chapters for this YouTube video so you can easily
0:55 navigate between everything that I'm going to show you. I just want to make
0:59 it super easy for you to get exactly what you want out of this master class.
1:03 Nothing more, nothing less. We'll start by diving into what is local AI and I
1:07 have a quick demo to make this very, very hands-on. And then with that, we'll
1:12 get into the why. Why local AI? Why should you care about it? Why do I
1:16 believe so firmly that it is the future of AI? I'll dive into all my reasoning
1:20 there. And then we'll get into hardware requirements because these local LLMs
1:24 are beasts and you have to have specific hardware to be able to run them. So I'll
1:28 dive into all that based on different large language models and some
1:32 alternatives as well. Then we'll get into all of the tricky stuff. There are
1:35 a few things that are usually pretty daunting for people. So I want to break
1:39 down those barriers just to make you super confident running your own local
1:43 LLMs and infrastructure. And then with that, we'll get into how you can use
local AI anywhere. Because Ollama and other solutions for running your own
1:52 large language models, they are OpenAI API compatible. I'll get into what that
1:56 means when we get to this point. But basically, any agents that you already
2:00 have running with Python or N8N, whatever. If you're using OpenAI or
Gemini or Anthropic, you can very easily swap them to use local AI instead. So
2:09 you can turn your existing agents into ones that are 100% offline, free, and
2:15 private. And then with that, we will get into the local AI package. This is a set
2:20 of services that I've curated for you to run your entire local AI infrastructure
2:25 like your UI, your database, your large language models, and a lot more. This is
2:28 where we really start to build out our full infrastructure. I'll walk you
2:32 through setting up the local AI package, getting into the nitty-gritty details to
2:35 make sure that you have everything set up at this point. And then once we have
2:40 that set up, we can dive into building a fully local AI agent with N8N. And then
2:44 we'll transition that same agent into Python as well. So that you'll see once
2:49 we have the local AI package set up, how you can build a 100% offline and private
2:54 agent both with no code and with code. And then we'll take those agents and
2:58 deploy them to the cloud, specifically on the Digital Ocean platform. But I'll
3:02 walk you through a process that you can use no matter the cloud provider that
you are using. And we'll deploy things in a very secure way, both for the
package, our infrastructure, and the AI agent itself. And then last, I want
3:14 to end with some additional resources just to make sure you have everything
3:17 that you need to take this master class forward and really use this to build any
3:21 AI agent that you could possibly want 100% local. And also, if you are
3:26 interested in mastering more than just local AI, but building entire AI agents
3:31 in a local AI environment or even with cloud AI, definitely check out
dynamis.ai. This is my community for early AI adopters just like yourself.
3:40 And a big part of this community is the AI agent mastery course where I dive
3:45 super deep into my full process for building AI agents. I'm talking planning
and prototyping and coding and using AI coding assistants and building full
3:55 frontends for our AI agents and securing things and deploying things. There's a
3:59 lot more coming soon for this course as well. I'm very actively working on it.
4:03 And a big part of this course is the complete agent that I build throughout
4:07 it. I build both with cloud AI and local AI. And so this master class will help
4:11 you get very very comfortable with local AI. But when it comes to building
4:15 complex agents and really getting deep into building out agents, then you
4:19 definitely want to check out the AI agent mastery course here in Dynamis.AI.
4:23 So with that, let's get back to the master class, diving into what local AI
4:28 is all about. Let's start by laying the foundation. What is local AI in the
4:32 first place? Well, very simply put, local AI is running your own large
4:36 language models and infrastructure like your database and your UI entirely on
your own machine, 100% offline. So think about when you typically want
to build an AI agent. You need a large language model, maybe like GPT-4.1
or Claude 4, and then you need something like your database, like Supabase, and
you need a way to create a user interface. You have all these different
4:58 components for your agent and typically you're using APIs to access things that
5:03 are hosted on your behalf. But with local AI, we can take all of these
5:07 things completely in our own control running them ourselves. So this is
5:11 possible through open-source large language models and software. So
5:15 everything is running on your own hardware instead of you paying for APIs.
So we run the large language model ourselves on our own machine instead of
5:25 paying for the OpenAI API for example. And so for large language models, there
5:29 are thousands of different open source large language models available for us
5:32 to use in a lot of different ways. And some of these you've probably heard of
before, like DeepSeek R1, Qwen 3, Mistral 3.1, Llama 4. These are just a
5:41 couple of examples of the most popular ones that you've probably heard of
5:44 before. We'll be tinkering around with using some of these in this master
5:48 class. And then we also have open-source software. So all of our infrastructure
that goes along with our agents and LLMs, things like Ollama for running our
LLMs, Supabase for our database, N8N for our no-code/low-code workflow automations,
and Open WebUI to have a nice user interface to talk to our agents and
6:07 LLMs. And we'll dive into using all of these as well. Now, because local AI
6:12 means running large language models on our own computer, it's not as easy as
just going to claude.ai or chatgpt.com and typing in a prompt. We have to
6:22 actually install something, but it still is very easy to get started. So, let me
6:25 show you right now with a hands-on example. So, here we are within the
website for Ollama. This is just ollama.com. I'll have a link to this in the
description of the video. This is one of the open-source platforms that allows
6:39 us to very easily download and run local large language models. And so, you just
6:42 have to go to their homepage here and click on this nice big download button.
6:45 You can install it for Windows, Mac or Linux. It really works for any operating
6:49 system. Then once you have it up and running on your machine, you can open up
6:53 any terminal. Like I'm on Windows here, so I'm in a PowerShell session and I can
run Ollama commands now to do things like view the models that I have available on
7:03 my machine. I can download models and I can run them as well. And the way that I
7:08 know how to pull and run specific models is I can just go to this models tab in
7:12 their navigation and I can browse and filter through all of the open source
7:16 LLMs that are available to me like DeepSeek R1. Almost everyone is familiar
7:20 with DeepSeek. It just totally blew up back in February and March. We have
Gemma 3, Qwen 3, Llama 4, a few of them that I mentioned earlier when we had the
7:30 presentation up. And so we can click into any one of these like I can go into
7:35 DeepSeek R1 for example and then I have the command right here that I can copy
7:39 to download and run this specific model in my terminal. And there are a lot of
7:44 different model variants of DeepSeek R1. So we'll get into different sizes and
7:47 hardware requirements and what that all means in a little bit, but I'll just
7:50 take one of them and run it as an example. So I'll just do a really small
7:54 one right now. I'll do a 1.5 billion parameter large language model. And
7:58 again, I'll explain what that means in a little bit. I can copy this command.
It's just ollama run and then the unique ID of this large language model. So I'll
8:06 go back into my terminal. I'll clear it here and then paste in this command. And
8:09 so first it's going to have to pull this large language model. And the total size
8:15 for this is 1.1 GB. And so it'll have to download it. And then because I used the
8:20 run command, it will immediately get me into a chat interface with the model
once it's downloaded. Also, if you don't want to run it right away, you
just want to install it, you can do ollama pull instead of ollama run. And
then again, to view the models that you have installed already,
you can just do the ollama list command like I did earlier. And so, right now,
8:39 I'll pause and come back once it's installed in about 30 seconds. All
8:43 right, it is now installed. And now I can just send in a message like hello.
8:47 And then boom, we are now talking to a large language model. But instead of it
8:50 being hosted somewhere else and we're just using a website, this is running on
8:54 my own infrastructure, the large language model and all the billions of
8:59 parameters are getting loaded onto my graphics card and running the inference.
9:02 That's what it's called when we're generating a response from the LLM
directly within this terminal here. And so I can ask another question like
9:13 what is the best GPU right now? We'll see what it says. So it's thinking
first. This is actually a thinking model. DeepSeek R1 is a reasoning LLM.
And then it gives us an answer: its top GPU models today are the 3080 and the RX 6700.
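As an aside, this same chat can be driven from code: Ollama serves a local REST API on port 11434 by default. Here's a minimal sketch, assuming the server is running and the 1.5 billion parameter DeepSeek R1 model from above has already been pulled:

```python
import json
import urllib.request

# Minimal sketch: ask a locally running Ollama server a question.
# Assumes Ollama is up on its default port 11434 and the model
# has already been pulled with `ollama pull` or `ollama run`.
def ask_ollama(prompt: str, model: str = "deepseek-r1:1.5b") -> str:
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON response instead of a token stream
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# With the server running, you could then do:
# print(ask_ollama("What is the best GPU right now?"))
```

This is the same request the terminal chat is making under the hood; we'll use this idea properly later when building agents in Python.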
9:27 Obviously, we have a training cutoff for local large language models just like we
9:30 do with ones in the cloud like GPT. And so the information is a little outdated
9:34 here, but yeah, this is a good answer. So we have a large language model that
9:38 we're talking to directly on our machine. And then to close out of this,
9:42 I can just do control D or command D on Mac. And if I do list, we have all the
9:47 other models that you saw earlier, plus now this one that I just installed. So
these are all available for me to run again just with that ollama run command.
9:54 And it won't have to reinstall if you already have it installed. Run just
9:58 installs it if you don't have it yet already. So that is just a quick demo of
using Ollama. We'll dive a lot more into Ollama later, like how we can actually
10:07 use it within our Python code and within our N8N workflows. This is just our
10:11 Why Local AI? (Local AI vs. Cloud AI)
10:11 quick way to try it out within the terminal. Now, to really get into why we
10:15 should care about local AI now that we know what it is, I want to cover the
10:19 pros and cons of local AI and what I like to call cloud AI. That's just when
10:23 you're paying for things to be hosted for you, like using Claude or Gemini or
10:28 using the cloud version of N8N instead of hosting it yourself. And I also want
10:32 to cover the advantages of each because I don't want to sugarcoat things and
10:35 just hype up this master class by telling you that you should always use
10:39 local AI. That is certainly not the case. There is a time and place for both
10:43 of these categories here, but there are so many use cases where local AI is
10:50 absolutely crucial. You have no idea how many businesses I have talked to that
10:54 are willing to put tens of thousands of dollars into running their own LLMs and
10:58 infrastructure because privacy and security is so crucial for the things
11:02 that they're building with AI. And that actually gets into the first advantage
11:06 here of local AI, which is privacy and security. You can run things 100%
11:11 offline. The data that you're giving to your LLMs as prompts, it now doesn't
11:16 leave your hardware. It stays entirely within your own control. And for a lot
11:20 of businesses, that is 100% crucial, especially when they're in highly
regulated industries like the health industry, finance, even real estate.
11:28 Like there's so many use cases where you're working with intellectual
11:32 property or just really sensitive information. You don't want to be
sending your data off to an LLM provider like Google or OpenAI or Anthropic. And
11:41 so as a business owner, you should definitely be paying attention to this
11:45 if you are working with automation use cases where you're dealing with any kind
11:48 of sensitive data. And then also if you're a freelancer, you're starting an
11:52 AI automation agency, anything where you're building for other businesses,
11:55 you are going to have so many opportunities open up to you when you're
11:59 able to work with local AI because you can handle those use cases now where
12:03 they need to work with sensitive data and you can't just go and use the OpenAI
12:08 API. And that is the main advantage of local AI. It is a very big deal. But
12:11 there are a few other things that are worth focusing on as well. Starting with
model fine-tuning, you can take any open-source large language model and
12:20 add additional training on top with your own data. Basically making it a domain
12:24 expert on your business or the problem that you are solving. It's so so
12:29 powerful. You can make models through fine-tuning more powerful than the best
12:33 of the best in the cloud depending on what you are able to fine-tune with
12:38 depending on the data that you have. And you can do fine-tuning with some cloud
12:42 models like with GPT, but your options are pretty limited and it can be quite
12:46 expensive. And so it definitely is a huge advantage to local AI. And local AI
in general can be very cost-effective, including the infrastructure as well. So
12:56 your LLMs and your infrastructure. You run it all yourself and you pay for
13:00 nothing besides the electricity bill if it's running on your computer at your
13:04 house or if you have some private server in the cloud. You just have to pay for
that server and that's it. There's no N8N bill, no Supabase bill, no OpenAI
13:13 bill. You can save a lot of money. It's really, really nice. And on top of that,
13:17 when everything is running on your own infrastructure, the agents that you
13:22 create can run on the same server, the same place as your infrastructure. And
13:26 so it can actually be faster because you don't have network delays calling APIs
13:30 for all your different services for your LLMs and your database and things like
13:34 that. And then with that, we can now get into the advantages of cloud AI.
13:38 Starting with it's a lot easier to set up. There's a reason why I have to have
13:42 this master class for you in the first place. There are some initial hurdles
13:47 that we have to jump over to really have everything fully set up for our local
13:51 LLMs and infrastructure. And you just don't have that with cloud AI because
13:54 you can very simply call into these APIs. You just have to sign up and get
13:58 an API key and that's about it. So, it certainly is easier to get up and
14:02 running and there's less maintenance overall because they are hosting things
for you. Supabase is hosting the database for you. OpenAI is hosting the
14:09 LLM for you. So, you don't have to manage things on your own hardware. With
14:13 Local AI, you have to apply patches and updates if you have a private server in
14:17 the cloud. You have to manage your own hardware if you're running on your own
14:20 computer, making sure that it's on 24/7, if you want your database on 24/7, that
14:24 kind of thing. It's just less maintenance with cloud AI. And then
14:28 probably the biggest advantage of cloud AI overall is that you have better
models available to you. Claude 4 Sonnet or Opus, for example, is more powerful
14:40 than any local AI that you could run. So we have this gap here and this gap was a
lot bigger at one point, even a year ago. The best cloud LLMs absolutely crushed
the best local LLMs, and that gap is starting to diminish. And so I really
14:55 see a future where that gap is diminished entirely and all the best
14:59 local LLMs are actually on par with the best cloud ones. That's the future I
see. That's why I think that local AI is such
15:07 a big deal because the advantages of local AI, those are just going to get
15:12 more prevalent over time when businesses realize they really want private and
15:15 secure solutions. And then the advantages of cloud AI, I think those
are actually going to diminish over time. That's the key: minimal setup,
15:24 less maintenance. Well, those advantages are going to go away as we have
15:28 platforms and better instructions and solutions to make the setup and
15:32 maintenance easier for local AI and we have the gap that's continuing to
15:35 diminish between the power of these LLMs. All these advantages are going to
15:39 actually go away and then it'll just completely make sense to use local AI
15:43 honestly probably for like every single solution in the future. That's really
15:47 what I see us heading towards. And then the last advantage to cloud AI which
15:51 also I think will go away over time is that you have some features out of the
box like you have memory that's built directly into ChatGPT. Gemini has web
15:59 search baked in even when you use it through the API like these kind of
16:02 capabilities that are out of the box that you have to implement yourself with
16:07 local AI maybe as tools for your agent and you can definitely do that but it is
16:10 nice that these things are out of the box for cloud AI. So those are the pros
16:15 and cons between the two. I hope that this makes it very clear for you to pick
16:19 right now for your own use case. Should I implement local AI or cloud AI? A lot
16:24 of it comes down to the security and privacy requirements for your use case.
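To make the "OpenAI API compatible" point from earlier concrete: switching an existing agent from cloud to local is usually just a base-URL change, because Ollama exposes the same chat-completions route that the OpenAI API uses. Here's a minimal standard-library sketch; the port is Ollama's default, and the model names are just examples of ones you'd have access to. If you use the official openai Python package, the equivalent swap is simply passing base_url to the client constructor.

```python
import json
import urllib.request

# Build the chat-completions request an OpenAI-style client sends.
# Swapping cloud -> local means changing only the base URL (and using
# whatever model name you've pulled locally).
def chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Same request shape, different target (a real OpenAI call would also
# need an Authorization header with your API key):
cloud = chat_request("https://api.openai.com/v1", "gpt-4.1", "Hello!")
local = chat_request("http://localhost:11434/v1", "deepseek-r1:1.5b", "Hello!")
print(local.full_url)  # http://localhost:11434/v1/chat/completions

# Sending the local one (requires Ollama running):
# with urllib.request.urlopen(local) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

The body and response shape are identical in both cases, which is exactly why existing OpenAI-based agents can go 100% local with so little rework.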
16:26 Hardware Requirements for Local LLMs
16:28 Now, the next big thing that we need to talk about for local AI is hardware
16:33 requirements. Cuz here's the thing, large language models are very resource
intensive. You can't just run any LLM on any computer. And the reason for that is
16:43 large language models are made up of billions or even trillions of numbers
16:47 called parameters. And they're all connected together in a web that looks
16:51 kind of like this. This is a very simplified view with just a few
16:54 parameters here. But each of the parameters are nodes and they're
16:58 connected together. The input layer is where our prompt comes in and our prompt
17:02 is fed through all these hidden layers and then we have the output at the end.
17:05 This is the response we get back from the LLM. But like I said, this is a very
simplified view. GPT-4, for example, like you can see on the right-hand side, is
17:15 estimated to have 1.4 trillion parameters. And so, if you want to fit
17:20 an entire large language model into your graphics card, you have to store all of
17:25 these numbers. And even though we can handle gigabytes at a time in our
graphics cards through what is called VRAM, storing billions or trillions of
17:34 numbers is absolutely insane. And so that's why large language models, you
17:37 actually have to have a pretty good graphics card if you want to run some of
the best ones. And so looking at Ollama here, when we see these different sizes,
17:47 going back to their model list, like 1.5 billion parameters or 27 billion
17:51 parameters, there are different sizes for the local LLMs. Obviously, the
larger a local LLM that you are running, the more performance you are going to
17:59 get, but you are going to be limited to what you are capable of running with
18:04 your graphics card or your hardware. So, with that in mind, I now want to dive
18:07 into the nitty-gritty details with you so you know exactly the kind of models
18:11 that you can run, the kind of speeds you can expect depending on your hardware.
18:15 And if you want to invest in new hardware to run local AI, I've got some
18:19 recommendations as well. So there are generally four primary size ranges for
18:25 large language models based on the speed and the power that you are looking for.
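The per-size VRAM figures that follow can be ballparked with a quick calculation: at Q4 quantization each parameter takes about half a byte, plus some runtime overhead for things like the KV cache. The 15% overhead factor below is my own rough fudge to line up with typical real-world usage, not an exact formula:

```python
# Back-of-the-envelope VRAM estimate for a quantized model.
# bits=4 matches the Q4 quantization assumed throughout this section;
# the 1.15 overhead factor (KV cache, runtime buffers) is a rough
# approximation of mine, not an exact rule.
def estimate_vram_gb(billions_of_params: float, bits: int = 4,
                     overhead: float = 1.15) -> float:
    bytes_per_param = bits / 8  # 4 bits = 0.5 bytes per parameter
    return billions_of_params * bytes_per_param * overhead

for size in (7, 14, 32, 70):
    print(f"{size}B parameters at Q4: ~{estimate_vram_gb(size):.0f} GB VRAM")
# roughly: 7B ~ 4 GB, 14B ~ 8 GB, 32B ~ 18 GB, 70B ~ 40 GB
```

Those rough numbers line up with the size ranges discussed next, and the same arithmetic explains why an 8-bit quantization roughly doubles the requirement.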
18:29 You have models that are around seven or 8 billion parameters. Those are
18:33 generally the smallest that I'd recommend trying to run. There are a lot
18:37 of smaller LLMs available like 1 billion parameters or three billion parameters,
18:41 but I'm so unimpressed when I use those LLMs that I don't really want to focus
18:45 on them here. 7 billion parameters is still tiny compared to the large cloud
AI models like Claude or GPT, but you can get pretty good results with them
18:55 for just simple chat use cases. And so for these models, assuming a Q4
18:59 quantization, which I'll get into quantization in a little bit, it's
19:02 basically just a way to make the LLM a lot smaller without hurting performance
19:06 that much, a 7 billion parameter model will need about four to 5 GB of VRAM on
19:11 your graphics card. And so if you have something like a 3060 Ti from Nvidia
19:16 with 8 GB of VRAM, you can very comfortably run a 7 billion parameter
19:20 model and you can expect to get very roughly around 25 to 35 tokens per
19:27 second. A token is roughly equivalent to a word. And so your local large language
19:31 model at 7 billion parameters with this graphics card will get about 25 to 35
19:37 words per second out on the screen to you being streamed out. And then if you
19:42 use much more powerful hardware like a 3090 to run a 7 billion parameter model
19:47 then you'll just jack up the speed a lot more. So that's 7 billion or 8 billion
19:51 parameters. Another very common size is something around 14 billion parameters.
19:57 This will take about 8 to 10 GB of VRAM. And so just a couple of options for
this. You have the 4070 Ti Super, which has 16 GB of VRAM, or you could go as
20:08 low as 12 GB of VRAM with the 3080 Ti. And you could expect to get about 15 to
20:13 25 words per second. And then this is where you start to get into basic tool
20:17 calling. So I find that when you are building with a 7 billion parameter
20:21 model, they don't do tool calling very well. So you can't really build that
20:25 powerful of agents around a 7 billion parameter model. But once you get to
20:29 something around 14 billion parameters, that's when I see agents being able to
20:33 really accept instructions well around tools and system prompts and leveraging
20:37 tools to do things on our behalf. That's when we can really start to use LLMs to
20:42 make things that are agentic. And then the next big category of LLMs
is somewhere between 30 and 34 billion parameters. You see a lot of LLMs that
fall in that size range. This will typically need 16 to 20 GB of VRAM.
20:57 And so a 3090 is a really good example of a graphics card that can run this. It
21:03 has 24 GB of VRAM. I actually have two 3090s myself. And I'll have a link to my
21:08 exact PC that I built for running local AI in the description of this video. So
I have two 3090s, which we'll need in a second for a 70 billion parameter model, but
21:16 one is enough for a 32 billion parameter model. And then also Macs with their new
21:22 M4 chips are very powerful with their unified memory architecture. So if you
21:27 get a Mac M4 Pro with 24 GB of unified memory, you can also run 32 billion
21:32 parameter models. Now the speed isn't going to be the best necessarily, and
21:36 again, this does depend a lot on your computer overall, but you can expect
21:40 something around 10 to 15, maybe up to 20 tokens per second. and 32 billion
21:46 parameters is when you really start to see LLMs that are actually pretty
21:50 impressive. Like 7 billion and 14 billion, they are disappointing quite a
21:54 bit. I'll be totally honest. Especially when you try to use them with more
21:58 complicated agentic tasks. 32 billion when you start to get into this range is
when I'm actually genuinely impressed. I'm like, "Oh, this is actually pretty
22:05 close to the performance of some of the best cloud AI." And then 70 billion
parameters. This is going to take about 35 to 40 GB of VRAM. For most consumer
GPUs like 3090s and 4090s, even 5090s, that's not actually enough VRAM. And so
22:22 this is when you have to start to split a large language model across multiple
22:28 GPUs which solutions like Olama will actually help you do this right out of
22:31 the box. So it's not this insane setup even though it might feel kind of
22:35 daunting like oh I have to split the layers of my LLM between GPUs. It's not
actually that complicated. And so two 3090s or two 4090s, that will be necessary.
Or you could have more of an enterprise-grade GPU like an H100. So
Nvidia has a lot of these non-consumer grade GPUs that have a lot more VRAM to
22:53 handle things like 70 billion parameter models. And the speed won't be the best
if you're using something like two 3090s, especially because performance is hurt
when you have to split an LLM between GPUs. You could expect something like 8
to 12 tokens per second. And obviously, if you have the most complex
agents and you're really trying to match the performance of cloud AI as
much as possible, that's when you'd want to use a 70 billion parameter model. And
23:16 then if you're investing in hardware to run local AI, I have a couple of quick
23:20 recommendations here. And a lot of this depends on the size of the model that's
23:24 going to be good enough for your use case. And so I'll dive into some
23:28 alternatives for running local AI directly if you want to do testing
23:32 before you buy infrastructure. I'll get into that in a little bit, but
23:35 recommended builds. If you want to spend around $800 to build a PC, I'd recommend
getting a 4060 Ti graphics card and then 32 GB of RAM. If you want to spend
23:47 $2,000, I'd recommend either getting a PC with a 3090 and 64 GB of RAM or
23:53 getting that Mac M4 Pro with 24 GB of unified memory. And then lastly, if you
23:58 want to spend $4,000, which is about what I spent for my PC, then I'd
24:02 recommend getting two 3090 graphics cards, and I got both of mine used for
around $700 each. And then also getting 128 GB of RAM, or you can get a
24:15 Mac M4 Max with 64 GB of unified memory. So, I wanted to really get into the
24:19 nitty-gritty details there. So, I know I spent a good amount of time diving into
24:22 super specific numbers, but I hope this is really helpful for you. No matter the
24:26 large language model or your hardware, you now know generally where you're at
24:30 for what you can run. So, to go along with that information overload, I want
24:34 to give you some specifics, individual LLMs that you can try right now based on
24:38 the size range that you know will work for your hardware. So, just a couple of
recommendations here. The first one that I want to focus on is DeepSeek R1. This
24:47 is the most popular local LLM ever. It completely blew up a few months ago. And
24:52 the best part about DeepSeek R1 is they have an option that fits into each of
24:56 the size ranges that I just covered in that chart. So they have a 7 billion
25:00 parameter, 14, 32, and 70. The exact numbers that I mentioned earlier. And
25:05 then there is also the full real version of R1, which is 671 billion parameters.
25:10 I'm sorry though, you probably don't have the hardware to run that unless
25:13 you're spending tens of thousands on your infrastructure. So, probably stick
25:16 with one of these based on your graphics card or if you have a Mac computer, pick
25:19 the one that'll work for you and just try it out. You can click on any one of
25:23 these sizes here. And then here's your command to download and run it. And this
25:28 is defaulting to a Q4 quantization, which is what I was assuming in the
25:31 chart earlier. And again, I will cover what that actually means in a little bit
here. The other one that I want to focus on here is Qwen 3. This is a lot newer.
Qwen 3 is so good. And they don't have a 70 billion parameter option, but they do
have all the other sizes that fit into those ranges that I mentioned
25:49 earlier. Like they got 8 billion, 14 billion, and 32 billion parameters. And
25:52 the same kind of deal where you click on the size that you want and you've got
25:55 your command to install it here. And this is a reasoning LLM just like
26:01 DeepSeek R1. And then the other one that I want to mention here is Mistral Small.
26:05 I've had really good results with this as well. There are less options here,
26:08 but you've got 22 or 24 billion parameters, which is going to work well
26:12 with a 3090 graphics card or if you have a Mac M4 Pro with 24 GB of unified
26:18 memory. Really, really good model. And then also, there is a version of it that
is fine-tuned for coding specifically, called Devstral, which is another
26:26 really cool LLM worth checking out as well if you have the hardware to run it.
So, that is everything for just general recommendations for local LLMs to try
right now. This is the part of the master class that is going to become
outdated the fastest because there are new local LLMs coming out every single
month. I don't really know how long my recommendations will last for. But in
general, you can just go to the model list in Ollama, search through them,
find one that has the size that works with your graphics card, and just give it
a shot. You can install it and run it very easily with Ollama. And the other
thing that I want to mention here is you don't always have to run open-source
large language models yourself. You can use a platform like Open Router. You can
just go to openrouter.ai, sign up, add in some API credits. You can try these
open-source LLMs yourself, maybe if you want to see what's powerful enough for
27:15 your agents before you invest in hardware to actually run them yourself.
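Since Open Router also speaks the OpenAI-style chat-completions API, trying a hosted open-source model before buying hardware is the same kind of request, just aimed at their endpoint with an API key. A hedged sketch: the model id "qwen/qwen3-32b" is my best guess at their naming, so check their model page for the exact id, and the key below is a placeholder.

```python
import json
import urllib.request

OPENROUTER_KEY = "YOUR_API_KEY_HERE"  # placeholder; create one at openrouter.ai

# Same OpenAI-style body as a local Ollama call; only the URL and the
# auth header differ. The model id is an assumption -- copy the exact
# id from Open Router's model page.
def openrouter_request(model: str, prompt: str) -> urllib.request.Request:
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {OPENROUTER_KEY}",
        },
    )

req = openrouter_request("qwen/qwen3-32b", "Hello!")
print(req.full_url)
```

If the model performs well hosted, the same open-source weights pulled locally through Ollama should behave comparably on your own hardware, which is exactly the try-before-you-buy workflow described here.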
And so within Open Router, I can just search for Qwen here. And I can go down
to Qwen and I can go to 32 billion. They have a free offering as well that
27:26 doesn't have the best rate limits. So I'll just go to this one right here,
Qwen 3 32B. So I can try the model out through Open Router. They actually host
it for me. So it's an open-source non-local version, but now I can try it
27:39 in my agents to see if this is good. And then if it's good, it's like, okay, now
27:43 I want to buy a 3090 graphics card so that I can install it directly through
27:47 um Olama instead. And so the 32 billion quen 3 is exactly what we're seeing here
27:51 in open router. And there are other platforms like Grock as well where you
27:55 can run these open source large language models um not on your own infrastructure
27:58 if you just want to do some testing before beforehand or whatever that might
28:01 be. So I wanted to call that out as an alternative as well. But yeah, that's
28:04 everything for my general recommendations for LLMs to try and use
28:09 in your agents. All right, it is time to take a quick breather. This is
28:12 everything that we've covered already in our master class. What is local AI? Why
28:17 we care about it? Why it's the future and hardware requirements. And I really
28:20 wanted to dive deep into this stuff because it sets the stage for everything
28:24 that we do when we actually build agents and deploy our infrastructure. And so
28:28 the last thing that I want to do with you before we really start to get into
28:33 building agents and setting up our package is I want to talk about some of
28:37 the tricky stuff that is usually pretty daunting for anyone getting into local
28:42 AI. I'm talking about things like offloading models, quantization, and environment
28:47 variables to handle things like flash attention, all the stuff that is really
28:51 important that I want to break down simply for you so you can feel confident
28:55 that you have everything set up right, that you know what goes into using local
29:00 LLMs. The first big concept to focus on here is quantization. And this is
29:04 crucial. It's how we can make large language models a lot smaller so they
29:10 can fit on our GPUs without hurting performance too much. We are lowering
29:14 the model precision here. What that basically means is that
29:18 each of our parameters, all of the numbers for our LLMs, are 16 bits
29:23 at full size, but we can lower the precision of each of those parameters to
29:28 8, 4, or 2 bits. Don't worry if you don't understand the technicalities of
29:31 that. Basically, it comes down to LLMs are just billions of numbers. That's the
29:35 parameters that we already covered. And we can make these numbers less precise
29:40 or smaller without losing much performance. So, we can fit larger LLMs
29:45 within a GPU that normally wouldn't even be close to running the full-size model.
29:50 Like with 32 billion parameter LLMs, for example, I was assuming a Q4
29:55 quantization like four bit per parameter in that diagram earlier. If you had the
30:00 full 16 bit parameter for the 32 billion parameter LLM, there's no way it could
30:06 fit on your Mac or your 3090 GPU, but we can use quantization to make it
30:10 possible. It's like rounding a number with a long decimal to something
30:15 like 10.44 instead of something with like 10 decimal places, but we're
30:19 doing it for each of the billions of parameters, those numbers that we have.
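To make the size math concrete, here is a quick back-of-the-envelope sketch in Python. The 20% overhead factor is my own rough assumption for the KV cache and runtime, not an official formula:

```python
def model_size_gb(params_billion: float, bits_per_param: int,
                  overhead: float = 1.2) -> float:
    """Rough memory estimate: parameters * bits per parameter / 8 bits per byte,
    padded ~20% for the KV cache and runtime (an assumed fudge factor)."""
    return params_billion * bits_per_param / 8 * overhead

# A 32 billion parameter model at full 16-bit precision vs. Q4 quantization:
print(round(model_size_gb(32, 16), 1))  # 76.8 GB -> nowhere near a 24 GB 3090
print(round(model_size_gb(32, 4), 1))   # 19.2 GB -> fits in 24 GB of VRAM
```

This is why Q4 is the sweet spot: same parameter count, roughly a quarter of the memory.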
30:23 And so just to give you a visual representation of this, you can also
30:27 quantize images just like you can quantize LLMs. And so we have our full
30:31 scale image on the left-hand side here comparing it to different levels of
30:35 quantization. We have 16-bit, 8-bit, and 4-bit. And you can see that at first with
30:40 a 16-bit quantization, it almost looks the same. But then once we go down to
30:44 4-bit, you can very much see that we have a huge loss in quality for the image.
30:49 Now with images, it's more extreme than with LLMs. When we do an 8-bit or 4-bit
30:54 quantization of an LLM, we don't lose nearly as much performance as we lose
30:58 quality with images. And so that's why it's so useful for us. And so I have
31:01 a table just to kind of describe what this looks like. So FP16, that's the
31:07 16-bit precision that all LLMs have as a base. That is the full size. The speed
31:11 is obviously going to be very slow because the model is a lot bigger, but
31:16 your quality is perfect compared to what it could be. I mean, obviously that
31:18 doesn't mean that you're going to get perfect answers all the time. I'm just
31:22 saying it's the 100% result from this LLM. And then going down to a Q8
31:28 precision, it's half the size. The speed is going to be a lot better. And
31:33 the quality is near-perfect. So it's not like performance is cut in half just
31:37 because size is. You still have the same number of parameters. Each one is just a
31:42 bit less precise. And so you're still going to get almost the same results.
31:47 And then going down to Q4, 4-bit, it's a fourth of the size. It's going to be very
31:52 fast compared to 16-bit. And the quality is still going to be great. Now, these
31:57 numbers are vague on purpose. There's no precise way for me to
32:01 quantify the exact difference, especially because it changes per LLM
32:05 and your hardware and everything like that. So, I'm just being very general
32:09 here. And then once you get to Q2, the size goes down a lot. It's going to
32:13 be very, very fast, but usually your performance starts to go down quite a
32:17 bit once you go down to Q2. And then like the note that I have in the bottom
32:22 left here, a Q4 quantization is generally the best balance. And so when
32:26 you are thinking to yourself, which large language model should I run? What
32:31 size should I use? My rule of thumb is to pick the largest large language model
32:37 that can work with your hardware with a Q4 quantization. That is why I assumed
32:42 that in the table earlier. And then also, like we saw in Ollama earlier, it always
32:47 defaults to a Q4 quantization because the 16-bit is just so big compared to Q4
32:52 that you couldn't even run most of the LLMs yourself. And a Q4 of a 32 billion
32:59 parameter model is still going to be a lot more powerful than the full 7
33:02 billion or 14 billion parameter model because you don't actually
33:07 lose that much performance. So that is quantization. So just to make this very
33:11 practical for you, I'm back here in the model list for Qwen 3. We have all these
33:15 models that don't specify a quantization, but we can see that it
33:20 defaults to Q4 because if I click on any one of them, the quantization right here
33:26 is Q4_K_M. And don't worry about the K_M. That's just a way to group
33:30 parameters. You have K_S, K_M, and K_L. It's kind of outside of the scope of
33:33 what really matters for you. The big thing is the Q4, the actual number
33:38 here. So Q4 quantization is the default for Qwen 3 32B and really any model in
33:44 Ollama. But if we want to see the other quantized variants and we want to run
33:48 them, we can click on "View all". This is available no matter the LLM that
33:52 you're seeing in Ollama. Now we can scroll through and see all the levels of
33:56 quantization for each of the parameter sizes for Qwen 3. So, if I scroll all
34:01 the way down, the absolute biggest version of Qwen 3 that I can run is the
34:08 full 16-bit of the 235 billion parameter Qwen 3. And it is a whopping 470 GB just
34:14 to install this. And there is no way that you're ever going to lay hands on
34:17 infrastructure to run this unless you're working for a very large enterprise. But
34:22 I can go down here, let's say, to 14 billion parameters and I can run the Q4
34:27 like this. So, you can click on any one that you want to run. Let's say I
34:30 want to run Q8. I can click on this and then I have the command to pull and run
34:35 this specific quantization of the 14 billion parameter model. So each of the
34:39 quantized variants has a unique ID within Ollama. So you can very
34:42 specifically choose the one that you want. Again, my general recommendation is
34:47 just to go with what Ollama recommends, which is defaulting to
34:51 Q4. Like if I go to DeepSeek R1, you can see that it also defaults to Q4 no matter
34:56 the size that I pick. But if you do want to explore different quantizations, you
35:00 want to try to run the absolute full model for maybe something smaller like 7
35:04 billion or 14 billion, you can definitely do that through Ollama and
35:08 really any other provider of local LLMs. So that is everything for quantization.
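On the command line, pulling a specific quantized variant looks something like this. The tag names are illustrative; check the model's "View all" page in the Ollama library for the exact tags:

```shell
# Pulling the library default gets the Q4_K_M quantization:
ollama pull qwen3:32b

# Or pick an explicit quantized variant by its tag, e.g. an 8-bit 14B build:
ollama pull qwen3:14b-q8_0
ollama run qwen3:14b-q8_0
```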
35:12 It's important to know how that works, but yes, generally stick with a Q4 of
35:17 the largest LLM that you can run. The next concept that is very important to
35:21 understand is offloading. Offloading is simply splitting the layers of your
35:26 large language model between your GPU and your CPU and RAM. It's kind of
35:30 crazy, but large language models don't have to fit entirely in your GPU. All
35:36 large language models can be split into layers, layers of the different weights,
35:40 and you can have some of it running on your GPU, so it's stored in your VRAM
35:45 and computed by the GPU, and then some of the large language model stored in
35:50 your RAM, computed by the CPU. Now, this does hurt performance a lot. And so,
35:55 generally, you want to avoid offloading if you can. You want to be able to fit
35:59 everything in your GPU, which by the way, the context, like your prompts for
36:04 your local LLMs, that is also stored in VRAM. And so, sometimes you'll see what
36:08 happens when you have very long conversations for a large language model
36:13 that barely fit in your GPU. That'll actually tip it over the edge. So, it
36:16 starts to offload some of it to the CPU and RAM. So keep that in mind when you
36:19 have longer conversations and all of a sudden things get really slow, you know
36:24 that offloading is happening. Sometimes this is necessary though as context
36:28 grows. And if you're only offloading a little bit of the LLM or a little bit of
36:32 the conversation, whatever to the CPU and RAM, it won't affect performance
36:36 that much. And so sometimes if you're trying to squeeze the biggest size you
36:41 can into your machine for an LLM, you can take advantage of offloading to run
36:45 something bigger or have a much larger conversation. Just know that usually it
36:49 kind of sucks. Like when I have offloading start to happen, my machine
36:53 gets bogged down and the responses are a lot slower. It's really not fun, but it
36:59 is possible. And fun fact, by the way, if your GPU is full and your CPU and RAM
37:04 is full, you can actually offload to storage, like literally using your hard
37:07 drive or SSD. That's when it's like incredibly slow and just terrible. But
37:11 just fun fact, you can actually do that. Now, the very last thing that I want to
37:15 cover before we dive into some code, setting up the local AI package, and
37:21 building out some agents is a few very crucial parameters, environment
37:25 variables for Ollama. So, these are environment variables that you can set
37:28 on your machine just like any other, based on your operating system. And
37:32 Ollama does have an FAQ for setting up some of these things, which I'll link to
37:36 in the description as well. But yeah, these are a bit more technical, so
37:40 people skip past setting this stuff up a lot, but it's actually really, really
37:44 important to make things very efficient when running local LLMs. So the first
37:49 environment variable is flash attention. You want to set this to one or true.
37:54 When you have this set to true, it's going to make the attention calculation
37:59 a lot more efficient. It sounds fancy, but basically large language models when
38:04 they are generating a response, they have to calculate which parts of your
38:08 prompt to pay the most attention to. That's the calculation. And you can make
38:12 it a lot more efficient without losing much performance at all by setting
38:16 flash attention to true. And then for another optimization,
38:21 just like we can quantize the LLM itself, you can also quantize or
38:27 compress the context. So your system prompt, the tool descriptions, your
38:31 prompt and conversation history, all that context that's being sent to your
38:36 LLM, you can quantize that as well. So Q4 is my general recommendation for
38:41 quantizing LLMs. Q8 is the general recommendation for quantizing the
38:46 context memory. It's a very simplified explanation, but it's really, really
38:50 useful because a long conversation can also take a lot of VRAM, just like a larger
38:55 LLM. And so it's good to compress that. And then the third environment variable,
38:58 this is actually probably the most crucial one to set up for Ollama. There
39:02 is this crazy thing. I don't know why Ollama does it, but by default, they
39:07 limit every single large language model to 2,048 tokens for the context limit,
39:13 which is just tiny compared to, you know, Gemini being 1 million tokens and
39:17 Claude being 200,000 tokens. They handle very, very large prompts. And a
39:21 lot of local large language models can also handle large prompts. But Ollama
39:25 will limit you to 2,048 tokens by default. And so you have to override that
39:30 yourself with this environment variable. And so generally I recommend starting
39:34 with about 8,192 tokens. You can move this all the way up to
39:38 something like 32,000 tokens if your local large language model supports
39:42 that. And if you view the model page on Ollama, you can see the context length
39:46 that's supported by the LLM. But you definitely want to, you know, jack this
39:50 up from just 2,048 because a lot of times when you have longer
39:53 conversations, you're going to get past 2,048 tokens very, very quickly. So, do
39:57 not miss this. If your large language model is starting to go completely off
40:02 the rails and ignore your system prompt and forget that it has these tools that
40:06 you gave it, it's probably because you reached the context length. And so, just
40:10 keep that in mind. I see people miss this a lot. And then the very last
40:14 environment variable, probably the least important out of all these four:
40:18 if you're running a lot of different large language models at once and you're
40:22 trying to shove them all into your GPU, a lot of times you can have issues. And so
40:25 in Ollama, you can limit the number of models that are allowed to be in your
40:29 memory at a single time. With this one, typically you want to set it to either
40:33 one or two. Definitely set this to just one if you are using large language
40:37 models that just barely fit on your GPU, like it's going to fit exactly into
40:41 your VRAM and you're not going to have room for another large language model.
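These Ollama environment variables can be set like any others on your OS; here is a sketch for Linux or macOS. The variable names come from the Ollama FAQ and may differ between versions, so verify them there:

```shell
# More efficient attention calculation
export OLLAMA_FLASH_ATTENTION=1
# Quantize the context (KV cache) to 8-bit
export OLLAMA_KV_CACHE_TYPE=q8_0
# Raise the tiny default context limit to 8,192 tokens
export OLLAMA_CONTEXT_LENGTH=8192
# Only keep one model loaded in VRAM at a time
export OLLAMA_MAX_LOADED_MODELS=1
```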
40:44 But if you are running more smaller ones and maybe you could actually fit two on
40:48 your GPU with the VRAM that you have, you can set this to two. So again, more
40:52 technical overall, but it's very important to have these right. And we'll
40:55 get into the local AI package where I already have these set up in the
40:59 configuration. And then by the way, this is the Ollama FAQ that I referenced a
41:02 minute ago that I'll have linked in the description. And so there's actually a
41:06 lot of good things to read into here, like being able to verify that your GPU
41:10 is compatible with Ollama, or how you can tell if the model's actually loaded on
41:13 your GPU. So, a lot of sanity-check things that they walk you through in the
41:17 FAQ as well. It also talks about environment variables, which I just
41:20 covered. And so, they've got some instructions here depending on your OS
41:23 how to get those set up. So, if there's anything that's confusing to you, this
41:26 is a very good resource to start with. So, I'm trying to make it possible for
41:30 you to look into things further if there's anything that doesn't quite make
41:33 sense for what I explained here. And of course, always let me know in the
41:35 comments if you have any questions on this stuff as well, especially the more
41:39 technical stuff that I just got to cover because it's so important even though I
41:43 know we really want to dive into the meat of things, which we are actually
41:47 going to do now. All right, here is everything that we have covered at this
41:50 point. And congratulations if you have made it this far because I covered all
41:55 the tricky stuff with quantization and the hardware requirements and offloading
41:59 and some of our little configuration and parameters. So, if you got all of that,
42:03 the rest of it is going to be a walk in the park as we start to dive into code,
42:07 getting all of our local AI set up and building out some agents. You understand
42:10 the foundation now that we're going to build on top of to make some cool stuff.
42:15 And so, now the next thing that we're going to do is talk about how we can use
42:19 local AI anywhere. We're going to dive into OpenAI compatibility and I'll show
42:23 you an example. We can take something that is using OpenAI right now and
42:27 transform it into something that is using Ollama and a local LLM. So, we'll
42:31 actually dive into some code here. And I've got my fair share of no code stuff
42:35 in this master class as well, but I want to focus on both because I think it's
42:38 really important to use both code and no code whenever applicable. And that
42:42 applies to local AI just like building agents in general. So, I've already
42:45 promised a couple of times that I would dive into OpenAI API compatibility, what
42:50 it is, and why it's so important. And we're going to dive into this now so you
42:54 can really start to see how you can take existing agents and transform them into
42:59 being 100% local with local large language models without really having to
43:03 touch the code or your workflow at all. It is a beautiful thing because OpenAI
43:10 has created a standard for exposing large language models through an API.
43:14 It's called the chat completions API. It's kind of like how model context
43:19 protocol MCP is a standard for connecting agents to tools. The chat
43:23 completions API is a standard for exposing large language models over an
43:28 API. So you have this common endpoint along with a few other ones that all of
43:35 these providers implement. This is the way to access the large language model
43:39 to get a response based on some conversation history that you pass in.
43:43 So, Ollama is implementing this as of February. We have other providers too:
43:49 Gemini is OpenAI-compatible, Groq is, and so is OpenRouter, which we saw earlier.
43:53 Almost every single provider is OpenAI API-compatible. And so, not only is it
43:57 very easy to swap between large language models within a specific provider, it's
44:02 also very easy to swap between providers entirely. You can go from Gemini to
44:09 OpenAI, or OpenAI to Ollama, or OpenAI to Groq, just by changing basically one
44:13 piece of configuration pointing to a different base URL as it is called. So
44:18 you can access that provider and then the actual API endpoint that you hit
44:22 once you are connected to that specific provider is always the exact same and
44:26 the response that you get back is also always the exact same. And so Ollama has
44:31 this implemented now. And I'll link to this article in the description as well
44:33 if you want to read through it, because they have a really neat Python example.
44:37 It shows that when we create an OpenAI client, the only thing we have to do
44:42 to connect to Ollama instead of OpenAI is change this base URL. So now we are
44:47 pointing to Ollama, hosted locally, instead of pointing to the URL for
44:51 OpenAI, where we'd reach out over the internet and talk to their LLMs. And
44:55 then with Ollama, you don't actually need an API key because everything's running
44:58 locally. So you just need some placeholder value here. But there is no
45:02 authentication going on. You can set that up; I'm not going to dive into
45:05 that right now. But by default, because it's all just running locally, you don't
45:09 even need an API key to connect to Ollama. And then once we have our OpenAI
45:13 client set up that is actually talking to Ollama, not OpenAI, we can use it in
45:18 exactly the same way. But now we can specify a model that we have downloaded
45:22 locally already through Ollama. We pass in our conversation history in the same way
45:27 and we access the response, like the content the AI produced, the token usage,
45:31 all those things that we get back from the response, in the same way.
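To illustrate why the swap is so trivial, here is a minimal sketch (my own illustration, not the article's code): the endpoint path and request body are identical for every OpenAI-compatible provider, so only the base URL changes. The model names here are just examples.

```python
import json

def chat_request(base_url: str, model: str, messages: list) -> tuple:
    """Build the URL and JSON body for a chat completions call.
    The /chat/completions path is the same for every compatible provider."""
    url = base_url.rstrip("/") + "/chat/completions"
    body = json.dumps({"model": model, "messages": messages}).encode()
    return url, body

msgs = [{"role": "user", "content": "Hello!"}]
openai_url, _ = chat_request("https://api.openai.com/v1", "gpt-4.1-nano", msgs)
ollama_url, _ = chat_request("http://localhost:11434/v1", "qwen3:14b", msgs)
# Only the host differs; the endpoint and the payload shape are identical.
```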
45:34 They've got a JavaScript example as well. They have a couple of examples
45:38 using different frameworks like the Vercel AI SDK and AutoGen. Really, any
45:44 AI agent framework can work with OpenAI API compatibility, making it very easy
45:47 to swap between these different providers. Like Pydantic AI, my favorite
45:52 AI agent framework, also supports OpenAI API compatibility. So you can easily
45:57 swap between these different providers within your Pydantic AI agents. And
46:02 so what I have for you now is two code bases that I want to cover. The first
46:07 one is the local AI package, which we'll dive into in a little bit. But right
46:13 now, we have all of the agents that we are going to be creating in this master
46:17 class. So I have a couple for N8N that are also available in this repository.
46:21 And then a couple of scripts that I want to share with you as well. And so the
46:25 very first thing that I want to show you is this simple script that I have called
46:31 OpenAI compatible demo. And so you can download this repository. I'll have this
46:34 linked in the description as well. There's instructions for downloading and
46:38 setting up everything in here. And this is all 100% local AI. And so with that,
46:43 I'm going to go over into my Windsurf here where I have this OpenAI compatible
46:47 demo set up. So I've got a comment at the top reminding us what the OpenAI API
46:52 compatibility looks like. We set our base URL to point to Ollama hosted
46:59 locally, and it's hosted on port 11434 by default. So I can actually show you
47:02 this. I have Ollama running in a Docker container, which we're going to dive
47:05 into when we set up the local AI package, but you can see that it is
47:11 being exposed on port 11434. And by the way, you can see the
47:14 127.0.0.1 in that URL that I have highlighted here; that is synonymous with localhost.
47:21 And so this right here, you could also replace with 127.0.0.1.
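If you want to sanity-check that endpoint yourself, a raw curl request might look like this (the model name is just an example; it needs to be one you have already pulled):

```shell
# Hit Ollama's OpenAI-compatible endpoint directly on the default port:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:14b",
    "messages": [{"role": "user", "content": "Say hello in five words."}]
  }'
```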
47:25 Just a little tidbit there. It's not super important. I just typically leave
47:28 it as localhost. And then you can change the port as well. I'm just sticking to
47:32 what the default is. And then again, we don't need to set our API key. We can
47:36 just set it to any value that we want here. We just need some placeholder even
47:39 though there is no authentication with Ollama unless you configure
47:43 that. So that's OpenAI compatibility. And the important thing with this script
47:47 here is that I have two different configurations: one for
47:51 talking to OpenAI and one for Ollama. So with OpenAI, we set our base
47:57 URL to point to api.openai.com. We have our OpenAI API key set in our
48:01 environment variables. So you can just set all your environment variables here
48:05 and then rename this to .env. I've got instructions for that in the readme, of
48:08 course. And then going back to the script, we are using GPT-4.1 Nano for our
48:13 large language model. It's something super fast and cheap. And then for our
48:17 Ollama configuration, we are setting the base URL here, localhost:11434,
48:22 or just whatever we have set in our environment variables. Same thing for
48:26 the API key. And then same thing for our large language model. And what I'm going
48:31 to be using in this case is Qwen 3 14B. That is one of the large language models
48:34 that I showed you on the Ollama website. Definitely a smaller one
48:38 compared to what I could run, but I just want to run something fast. And very
48:41 small large language models are great for simple tasks like summarization or
48:45 just basic chat. And that's what I'm going to be using here just for a simple
48:49 demo. And so whether it's enabled or not, this configuration is just based on
48:53 what we have set for our environment variables. And the important thing here
48:59 is the code that runs for each of these configurations just as we go through
49:03 this demo is exactly the same. We are parameterizing the configuration for the
49:08 base URL and API key. So we are setting up the exact same OpenAI client just
49:13 like we saw in the Ollama article, just changing the base URL and API key.
49:17 And so then, for example, when we use it right here, it's
49:22 client.chat.completions.create, calling the exact same function no matter if we're
49:26 using OpenAI or Ollama. And then we're handling the response in the same way as
49:31 well. And so I'll go back to my terminal now. And so I went through all the steps
49:34 already to set up my virtual environment, install all of my dependencies. And so now I can run the
49:40 command OpenAI compatible demo. And now it's going to present the two
49:43 configuration options for me. And so I can run through OpenAI. So we'll go
49:46 ahead and do that first. And these two demos are going to look exactly the
49:50 same, but that is the point. And so we have our base URL here for OpenAI. We
49:54 have a basic example of a completion with GPT-4.1 Nano. There we go. So this
49:59 is the model that was used. Here are the number of tokens. And this is our
50:03 response. And then I can press enter to see a streaming response now as well. So
50:07 we saw it type out our answer in real time. And then I can press enter one
50:10 more time. This is the last part of the demo: a multi-turn conversation.
50:14 So we got a couple of messages here in our conversation history. So very nice
50:19 and simple. The point here is to now show you that I can run this and select
50:23 Ollama now instead, and everything is going to look exactly the same, and all
50:27 of the code is the same as well. It is only our configuration that is
50:31 different. And so it will take a little bit when you first run this because
50:36 Ollama has to load the large language model into your GPU. And so, going to the
50:42 logs for Ollama, I can show you what this looks like. And so when we first
50:47 make a request when Qwen 3 14B is not loaded into our GPU yet, you're going to
50:52 see a lot of logs come in here, and you'll have this container up and
50:54 running when you have the local AI package, which we'll cover in a little
50:57 bit. So it shows all the metadata about our model, like that it's Qwen 3 14B. We can
51:05 see here that we have a Q4_K_M quantization like we saw on the Ollama
51:09 website. What other information do we have here? There's just so much to
51:14 digest here. Yeah, another really important thing is we have the
51:19 context length. I have that set to 8,192, just like I recommended in the
51:22 environment variables. And then we can see that we offloaded all of the layers
51:26 to the GPU. So I don't have to do any offloading to the CPU or the RAM. I can
51:30 keep everything in the GPU, which is certainly ideal, like I said, to make
51:34 sure this is actually fast. And then when we get a response from Qwen 3 14B,
51:41 we are calling the v1/chat/completions endpoint because it is OpenAI
51:46 API-compatible. So that exact endpoint that we hit for OpenAI is the one that we are
51:50 hitting here with a large language model that is running entirely on our computer
51:54 in Ollama. And so the response I get back, it's actually a reasoning LLM as
51:58 well. So we even have the thinking tokens here, which is super cool. And so
52:02 we got our response. It's just printing out the first part of it here just to
52:04 keep it short. And then I can press enter. And we can see a streaming demo
52:08 as well. And it's going to be a lot faster this time because we do already
52:11 have the model loaded into our GPU. And so that first request when it first has
52:15 to load a model is always the slower one. And then it's faster going forward
52:19 once that model is already loaded in our GPU. And then as long as we don't swap
52:24 to another large language model and use that one, then it will remain in our GPU
52:28 for some time. And so then all of our responses after are faster. And then we
52:33 just have the last part of our demo here with a multi-turn conversation. So we
52:37 can see conversation history in action as well, just not with streaming here.
52:40 And everything's a bit slower with this large language model because
52:43 it is a reasoning one. And so if you want faster
52:47 inference, you can always use a non-reasoning local LLM like Mistral or
52:52 Gemma, for example. So that is our very simple demo showing how this works. I
52:55 hope that you can see with this, and again, this works with other AI agent
52:59 frameworks like Agno or Pydantic AI or CrewAI as well. They all work in
53:03 this way, where you can use OpenAI API compatibility to swap between providers
53:08 so easily so you don't have to recreate things to use local AI and that's
53:11 something so important that I want to communicate with you because if I'm the
53:15 one introducing you to local AI I also want to show you how it can very easily
53:19 fit into your existing systems and automations. All right. Now, we have
53:23 gotten to the part of the local AI master class that I'm actually the most
53:27 excited for because over the past months, I have very much been pouring my
53:31 heart and soul into building up something to make it infinitely easier
53:35 for you to get everything up and running for local AI. And that is the local AI
53:40 package. And so, right now, we're going to walk through installing it step by
53:44 step. I don't want you to miss anything here because it's so important to get
53:47 this up and running, get it all working well. Because if you have the local AI
53:51 package running on your machine and everything is working, you don't need
53:55 anything else to start building AI agents running 100% offline and
53:59 completely private. And so here's the thing. At this point, we've been
54:04 focusing mostly on Ollama and running our local large language models. But there's
54:08 a whole other component to local AI that I introduced at the start of the
54:13 master class: our infrastructure. Things like our database, local and
54:18 private web search, our user interface, agent monitoring. We have all these
54:23 other open-source platforms that we also want to run along with our large
54:27 language models and the local AI package is the solution to bring all of that
54:32 together curated for you to install in just a few steps. So, here is the GitHub
54:37 repository for the local AI package. I'll have this linked in the description
54:41 below. Just to be very clear, there are two GitHub repos for this master class.
54:45 We have this one that we covered earlier. This has our N8N and Python
54:49 agents that we'll cover in a bit, as well as the OpenAI compatible demo that
54:53 we saw earlier. So, you want to have this cloned and the local AI package as
54:57 well. Very easy to get both up and running. And if you scroll down in the
55:02 local AI package, I have very comprehensive instructions for setting
55:06 up everything, including how to deploy it to a private server in the cloud,
55:10 which we'll get into at the end of this master class, and a troubleshooting
55:13 section at the bottom. So, everything that I'm about to walk you through here,
55:17 there's instructions in the readme as well if you just want to circle back to
55:21 clarify anything. Also, I dive into all of the platforms that are included in
55:26 the local AI package. And this is very important because like I said, when you
55:30 want to build a 100% offline and private AI agent, it's a lot more than just the
55:35 large language model. You have all of the accompanying infrastructure like
55:39 your database and your UI. And so I have all of that included. First, I have
55:44 n8n, which is our low/no-code workflow automation platform. We'll be building
55:48 an agent with n8n in the local AI package in a little bit once we have it
55:52 set up. We have Supabase for our open-source database. We have Ollama. Of
55:56 course, we want to have this in the package as well for our LLMs. Open WebUI,
56:01 which gives us a ChatGPT-like interface for us to talk to our LLMs and
56:06 have things like conversation history. Very, very nice. So, we're looking at
56:09 this right here. This is included in the package. Then we have Flowise. It's
56:13 similar to n8n. It's another really good tool to build AI agents with no/low
56:18 code. Qdrant, which is an open-source vector database. Neo4j, which is a
56:25 knowledge graph engine. Then SearXNG for open-source, completely free and
56:31 private web search. Caddy, which is going to be very important for us once
56:35 we deploy the local AI package to the cloud and we actually want to have
56:38 domains for our different services like n8n and Open WebUI. And then the last
56:42 thing is Langfuse, an open-source LLM engineering platform that helps
56:47 us with agent observability. Now, some of these services are outside the scope
56:52 for this local AI master class. I don't want to spend a half hour on every
56:56 single one of these services and make this a 10-hour video. I will be focusing
57:02 in this video on n8n, Supabase, Ollama, Open WebUI, SearXNG, and then Caddy once
57:08 we deploy everything to the cloud. So, I do cover about half of these services.
57:12 And the other thing that I want to touch on here is that there are quite a few
57:16 things included here. And so you do need about 8 GB of RAM on your machine or
57:21 your cloud server to run everything. It is pretty big overall. And so you can
57:26 remove certain things. If you don't want Qdrant and Langfuse, for example,
57:30 you can take those out of the package. More on that later. It doesn't have to
57:34 be super bloated; you can whittle this down to what you need. But yeah, there's
57:37 a lot of different things that go into building AI agents. And so I have all of
57:40 these services here so that no matter what you need, I've got you covered. And
57:44 so with that, we can now move on to installing the local AI package. And
57:48 these instructions will work for you on any operating system, any computer. Even
57:52 if you don't have a really good GPU to run local large language models, you
57:56 still could always use OpenAI or Anthropic, something like that, and then
58:00 run everything else locally to save on costs or just to have everything running
58:04 on your computer. And so there are a couple of prerequisites that you have to
58:08 have before you can do the instructions below. You need Python so you can run
58:12 the start script that boots everything up. Git or GitHub desktop so you can
58:16 clone this GitHub repository, bring it all onto your own machine. And then you
58:21 want Docker or Docker Desktop. And so I've got links for all of these. Docker
58:25 and Docker Desktop we need because all of these local AI services that I've
58:29 curated for you, they all run as individual Docker containers that are
58:34 all combined together in a stack. And so I'll actually show you this is the end
58:36 result once we have everything up and running within your Docker Desktop. You
58:41 have this local AI Docker Compose stack that has all of the services running in
58:45 tandem, like Supabase, Redis, n8n, Flowise, Caddy, and Neo4j. All of
58:50 these are running within this stack. That is what we're working towards right
58:54 now. And so make sure you have all these things installed. I've got links that'll
58:57 take you to installing no matter your operating system. Very easy to get all
59:01 of this up and running on your machine. Then we can move on to our first command
59:05 here, which is to clone this GitHub repository, bringing all of this code on
59:10 your machine so you can get everything running. And so you want to open up a
59:14 new terminal. So I've got a new PowerShell session open here. Going to
59:18 paste in this command. And I'm going to be doing this completely from scratch
59:22 with you. So, you clone the repo, and then I'm just going to change my directory
59:26 into the local AI package folder, which was just created by this git clone command. So
59:31 those are the first two steps. The next thing is we have to configure all of our
59:36 environment variables. And believe it or not, this is actually the longest part
59:41 of the process. And once we have this taken care of, it's a breeze getting the
59:44 rest of this up and running. But there's a lot of configuration that we have to
59:49 set up for our different services, like credentials for logging into our
59:54 Supabase dashboard or Neo4j, and things like our Supabase anonymous key and
59:59 service role key. All these things we have to configure. And so within our terminal
60:04 here, you can run code . to open this within VS Code or Windsurf. I'll open this in
60:09 Windsurf. You just want to open up this folder within your IDE; the specific
60:14 IDE that you use really doesn't matter. You just want to get to this .env.example
60:20 file. I'm going to copy it and then I'm going to paste it. And then I'm going to
60:24 rename this to .env. So, we're taking the example file and turning it into a .env file. So,
60:31 you want to make sure that you copy it and rename it like this. Then we can go
60:36 ahead and start setting all of our configuration. And I'll even zoom in on
60:39 this just so that it's very easy for you to see everything that we are setting up
60:44 here. So, first up, we have a couple of credentials for N8N. We have our
60:49 encryption key and our JWT secret. And it's very easy to generate these. In
60:53 fact, we'll be doing this a couple of times, but we'll use this openssl
60:58 command to generate a random 32-character alphanumeric string that
61:02 we're going to use for things like our encryption key and JWT secret. And so,
61:08 OpenSSL is a command that is available by default on Linux and macOS.
61:12 You can just open up any terminal and run this command and it'll spit out a
61:16 long string that you can then just paste in for this value. For Windows, you
61:20 can't just open up any terminal and use OpenSSL, but you can use Git Bash, which
61:26 is going to come with GitHub Desktop when you install it. And so, I'll go
61:29 ahead and just search for that. If you just go to your search bar on your
61:32 bottom left on Windows and search for Git Bash, it's going to open up this
61:37 terminal like this. And so, I can go ahead and copy this command, go in here,
61:42 and paste it in. And then I can run it. And boom, there we go. I
61:45 know it's really small for you to see right now. I'm going to go ahead and
61:48 copy this because this is now the value that I can use for my encryption key.
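As a quick reference, generating one of these 32-character secrets boils down to a single command. The exact flags here are my sketch (16 random bytes rendered as 32 hex characters); the repo's .env.example documents the exact command it expects:

```shell
# OpenSSL: built in on Linux/macOS; on Windows, run it inside Git Bash.
openssl rand -hex 16

# Python fallback if OpenSSL isn't available on your machine.
python3 -c "import secrets; print(secrets.token_hex(16))"
```

Run it once per secret, so the encryption key and the JWT secret each get a unique value.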
61:52 And then you want to do the exact same thing to generate a JWT secret. And then
61:57 the other way that you can do this if you don't want to install git bash or
62:01 it's not working for whatever reason, you can use Python to generate this as
62:05 well. So I can just copy this command and then I can go into the terminal here
62:10 and I can just paste this in. And so it's going to just like with OpenSSL
62:15 generate this random 32 character string that I can copy and then use for my JWT
62:20 secret. There we go. And so I am going to get in the weeds a little bit here
62:24 with each of these different parameters, but I really want to make sure that I'm
62:27 clear on how to set up everything for you so you can really walk through this
62:31 step by step with me. And like I said, setting up the environment variables is
62:35 the longest part by far for getting the local AI package set up. So if you bear
62:39 with me on this, you get through this configuration, you will have everything
62:43 running that you need for local AI for the LLMs and your infrastructure. So
62:47 that's everything for n8n. Now we have some secrets for Supabase. And there
62:53 are some instructions in the Supabase documentation for how to get some of
62:57 these values. So it's this link right here, which I have open in my
63:01 browser. We'll reference this in a little bit. But first, we can
63:05 set up a couple of other things. The first thing we need to define is our
63:11 Postgres password. Supabase uses Postgres under the hood for the
63:14 database. And so, we want to set a password here that we'll use to connect
63:19 to Postgres within n8n, or in a connection string that we have for our Python code,
63:23 whatever that might be. And this value can be really anything that you want.
63:26 Just note that you have to be very careful about using special characters like
63:31 percent symbols. So if you ever have any issues with Postgres, it's probably
63:36 because you have special characters that are throwing it off. That's something
63:39 that I've seen happen quite a few times. And so, like I said, I want to mention
63:42 troubleshooting steps to make sure that everything is very clear for you. So
63:47 for this Postgres password here, I'm just going to say test Postgres pass.
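On that note about special characters: the reason something like % causes trouble is that Postgres connection URIs treat it as an escape character, so it has to be percent-encoded. A quick way to see the encoded form of a password ('p%ss' below is just a hypothetical example):

```shell
# Percent-encode a string for safe use inside a connection URI.
python3 -c "from urllib.parse import quote; print(quote('p%ss', safe=''))"
# prints: p%25ss
```

Sticking to plain letters and digits, like the test password here, sidesteps the issue entirely.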
63:51 I'm just going to give some kind of random value here. Just end with a
63:54 couple of numbers. I don't care that I'm exposing this information to you because
63:58 this is a local AI package. These passwords are for services that never
64:02 leave my computer. So, it's not like you could hack me by connecting to anything
64:08 here. And then we have a JWT secret. And this is where we get into this link
64:13 right here in the Supabase docs. They walk you through generating a JWT
64:18 secret and then using that to create both your anonymous and your service
64:22 role keys. If you're familiar with Supabase at all, we need both of these
64:27 pieces of information. The anonymous key is what we share with our front end. This
64:30 is our public key. And then the service role key has all permissions for
64:33 Supabase. We'll use this in our backends for things like our agents. And
64:39 so you can just go ahead and copy this JWT secret.
64:43 And then you can paste it in right here. This is 32 characters long, just
64:47 like the things that we generated with OpenSSL. I'm just going to use
64:52 exactly what Supabase tells me to. And then what you can do with this is you
64:56 can select the anonymous key, click on generate JWT, and then I can copy this
65:02 value, and then I will paste this in for my anonymous token. And so I'm just
65:06 replacing the default value there for the anonymous key. And then going back
65:10 and selecting the service key, I'm going to generate that one as well. It
65:13 looks very similar. They'll always start with "eyJ", but these values are different
65:18 if you go towards the end. And so I'll go ahead and paste this in for my service
65:22 role key. Boom. There we go. All right. And then for the Supabase dashboard
65:27 that we'll log into to see our tables and our SQL editor and authentication
65:31 and everything like that, we have our username here, which I'm just going to
65:35 keep as supabase. And then for the password, I can just say test supabase
65:39 pass. I'll just kind of use that as my common nomenclature here for my
65:42 passwords, because I don't really care what that is right now. And then the last
65:45 thing that we have to set up is our pooler tenant ID. It's not really
65:49 important to dive into what exactly this means. Just know that you can set this
65:52 to really anything that you want. I typically will just choose four digits
65:57 here, like 1000, for my pooler tenant ID. So that is everything that we need for
66:01 Supabase. And actually, most of the configuration is for Supabase. Then we
66:06 have Neo4j. This is really simple. You can leave Neo4j as the username, and
66:11 then I'll just say test Neo4j pass for my password here. So you just set the
66:15 password for the knowledge graph, and even if you're not using Neo4j, you still have
66:19 to set this. But yeah, it just takes two seconds. Then we have Langfuse. This is
66:23 for agent observability. We have a few secrets that we need here. And for these
66:28 values, they can really just be whatever you want. It doesn't matter, because
66:31 these are just passwords, just like we had passwords for things like Neo4j.
66:35 So I can just say test ClickHouse pass, and then I can do test Minio pass. And
66:43 it really doesn't matter here. A random Langfuse salt. I'm just using
66:47 throwaway values here. You probably want something more secure in
66:51 this case, but I'm just doing something as a placeholder for now.
66:56 There we go. Okay, good. And then the last thing that we need
66:59 for Langfuse is an encryption key. And this is also generated with OpenSSL like
67:04 we did for the n8n credentials. And so I'll go back to my Git Bash terminal.
67:08 And again, you can do this with Python as well. I'll just run the exact same
67:12 command. I'll get a different value this time. And so I'll go ahead and copy
67:16 that. You could technically use the same value over and over if you wanted to,
67:20 but obviously it's way more secure to use a different value for each of the
67:24 encryption keys that you generate with OpenSSL. So there we go. That is our
67:28 encryption key. And that is actually everything that we have to set up for
67:32 our environment variables when we are just running the local AI package on our
67:37 computer. Once we deploy it to the cloud and we actually want domains for our
67:41 different services like Open WebUI and n8n, then we'll have to set up Caddy. So
67:45 this is where we'll dive into domains and we'll get into this at the end of
67:49 the master class here. But everything past this point for environment
67:54 variables is completely optional. You can leave all of this exactly as it is
67:59 and everything will work. Most of this is just extra configuration for
68:03 Supabase. Supabase is definitely the biggest service included in
68:08 this list of curated services. And so, there are a lot of
68:11 different configuration options you can play around with if you want to dive
68:15 more into this. You can definitely look at the same documentation page that we
68:19 were using for the Supabase secrets. You can scroll through this if
68:22 you want to learn more, like setting up email authentication or Google
68:27 authentication, diving more into all of those different configuration options
68:32 for Supabase. I'm not going to get into all
68:36 of this right now, because the core of getting Supabase up and running we
68:40 already have taken care of with the credentials that we set up at the top
68:44 right here. These are just the base things, and that's
68:47 what we'll stick to right now. So that is everything for our environment
68:51 variables. So then, going back to our readme, which I have open directly in
68:55 Windsurf now instead of my browser, we have finished our configuration. And I do
68:59 have a note here that you want to set things up for Caddy if you're deploying
69:03 to production. Obviously, we're doing that later, not right now, like I said. And
1:09:07 Customizing the Local AI Package
69:07 so with that we are good to start everything. Now before we spin up the
69:12 entire local AI package there is one thing that I want to cover. It's
69:14 important to cover this before we run things. If you don't want to run
69:19 everything in the package, because it is a lot, like maybe you only want to use half
69:22 of these services and you don't want Neo4j and Langfuse and Flowise right
69:28 now, there are two options that you have. The easiest one right now is to go
69:33 into the Docker Compose file. This is the main file where all of the services
69:38 are curated together, and you can just remove the services that you don't want
69:42 to include. So, for example, if you don't want Qdrant right now (it is
69:46 actually one of the larger services, around 600 megabytes of RAM just
69:50 having it running), you can search for Qdrant, and you can just go ahead and
69:54 delete this service from the stack like that. Boom. Now I don't have Qdrant.
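For a sense of what you're deleting, a service entry in docker-compose.yml looks something like this (the fields shown are illustrative, not the package's exact definition). Removing the whole block, plus its named volume under the top-level volumes: key, takes it out of the stack:

```yaml
  qdrant:
    image: qdrant/qdrant
    restart: unless-stopped
    ports:
      - "6333:6333"
    volumes:
      - qdrant_storage:/qdrant/storage
```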
69:59 It won't spin up as a part of the stack anymore. And then I also have a volume
70:03 for Qdrant, so you can remove that as well. Volumes, by the way, are how we are
70:08 able to persist data for these containers. So if we tear down
70:12 everything and then we spin it back up, we still are going to have our open web
70:17 UI conversations and our n8n workflows, everything in Supabase, like all that
70:21 is still going to be saved because we're storing it all in volumes. So we can do
70:25 whatever the heck we want with these containers. We can tear them down. We
70:28 can update them, which I'll show you how to do later. We can spin it back up. And
70:32 all of our data will always be persisted. So you don't have to worry
70:35 about losing information. And you can always back things up if you want to be
70:39 really secure, but I've never done that before and I've been updating this
70:42 package for months and months and months and all of my workflows from 6 months
70:46 ago are still there. I haven't lost anything. And so that's just a quick
70:50 caveat there for how you can remove services if you want. And then another
70:54 thing that we don't have available yet, but I'm very excited to
70:58 talk about right now; it's in beta. Me and
71:03 one other guy on my Dynamous team, Thomas, who has a
71:07 YouTube channel as well (he's a great guy), are working together on this.
71:09 He's actually been putting in most of the work, creating a front-end
71:13 application for us to manage our local AI package. And one of the big things
71:17 with this is that we're going to make it possible for you to toggle on and off
71:22 the services that you want to have within your local AI package. So you can
71:27 very much customize the package to the services that you want to run. So you
71:31 can keep it lightweight just to the things you care about. Also, we'll be
71:34 able to manage environment variables and monitor the containers. Not all of this
71:38 is up and running at this point, but this is in beta. We're working on it.
71:41 I'm really excited for this. So, not available yet, but once this is
71:44 available, this will be a really good way for you to customize the package to
71:47 your needs. So, you don't have to go and edit the docker compose file directly.
71:51 So, that's something that I just wanted to get out of the way now. But, we can
71:57 start and actually execute our package now. Get all these containers up and
1:11:59 Running the Local AI Package
72:02 running. So the command that you run to start the local AI package is different
72:07 depending on your operating system and the hardware that you have. So for
72:13 example, if you are an Nvidia GPU user, you want to run this start_services.py
72:18 script. This boots up all of the containers, and you want to specifically
72:23 pass in the profile gpu-nvidia. This is going to start Ollama in a way where the
72:29 Ollama container is able to leverage your GPU automatically. And then if you are
72:34 using an AMD GPU and you're on Linux, then you can run it this way. Which by
72:38 the way, unfortunately, if you have an AMD GPU on Windows, you aren't able to
72:46 run Ollama in a container. And it's the same thing with Mac computers.
72:49 Unfortunately, like you see right here, you cannot expose your GPU to the Docker
72:55 instance. And so if you have an AMD GPU on Windows or are running on a Mac, you cannot
73:01 run Ollama in the local AI package. You just have to install it on your own
73:04 machine, like I already showed you in this master class, and then you'll just
73:08 run everything else through the local AI package, and those services can go out to
73:12 your machine and communicate with Ollama directly. So, just a small limitation for
73:17 Mac and AMD on Windows. But if you're running on Linux or an Nvidia GPU on
73:21 Windows like I'm using, then you can go ahead and run this command right here.
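For reference, the hardware profiles can be invoked like this (profile names are assumed from the walkthrough; check the repo's README for the exact flags):

```shell
python start_services.py --profile gpu-nvidia   # NVIDIA GPU
python start_services.py --profile gpu-amd      # AMD GPU (Linux only)
python start_services.py --profile cpu          # run Ollama on the CPU
python start_services.py --profile none         # don't start Ollama in the stack
```

You run exactly one of these, matching your hardware.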
73:27 So if you can't run a GPU in the Ollama container, then you can always just
73:32 start in CPU mode, or you can run with a profile of none. This will actually make
73:36 it so that Ollama never starts in the local AI package, so you can just
73:40 leverage the Ollama that you already have running on your computer, like I showed
73:43 you how to install already. So, just a couple of small caveats that I really
73:46 want to hit on there. I need to make sure that you're using the right
73:51 command. And so, in my case, I'm Nvidia on Windows. So, I'm going to copy this
73:55 command. Go back over into my terminal. I'll just clear it here. So, we have a
73:59 blank slate. And I'll paste in this command. And so, it's going to do quite
74:02 a few things initially. First, it's going to clone the Supabase repository,
74:07 because Supabase actually manages its stack in a separate place. And so, we
74:10 have to pull that in. Then there's some configuration for SearXNG for our local
74:17 and private web search. And then I have a couple of warnings here saying that
74:20 the Flowise username and password are not set. By the way, if
74:24 you want to set the Flowise username and password, it's optional, but you can
74:29 do that if I scroll down right here. So you can set these values, and that will
74:32 actually make those warnings go away, but you can also just ignore them. So
74:35 anyway, I just wanted to mention that really quickly. But now what's happening
74:39 here is it starts by running all of the Supabase containers. And so there's
74:44 quite a bit that goes into Supabase, like I said. So we're running all of
74:47 that. It's getting all that spun up. And then once we run all of these, it's
74:51 going to move on to deploying the rest of our stack. And if you're running this
74:55 for the very first time, it will take a while to download all of these images.
74:59 They're not super small. There's a lot of infrastructure that we're starting up
75:03 here. And so it'll take a bit. You just have to be patient. Maybe go grab your
75:06 coffee or make your next meal, whatever that is. And then everything will be up
75:09 and running once you are back. And so yeah, now you can see that we are
75:13 running the rest of the containers here. Um, and so we'll just wait for that to
75:16 be done. And then I'll show you what that looks like in Docker Desktop as
75:19 well. And so I'll give it a second here just to finish. Looks like my
75:24 terminal glitched a little bit; I was scrolling, and it kind of broke it
75:27 a bit. But anyway, everything is up and running now. It'll look like this, where
75:30 it'll say all of the containers are healthy or running or started. And then
75:34 if I go into Docker Desktop and I expand the local AI Compose stack, you want to
75:39 make sure that you have a green dot for everything except for the Ollama pull and
75:45 n8n import. These just run once initially and then they shut down, because
75:49 they're responsible for pulling some things for our local AI package. And so
75:53 yeah, I've got green dots for everything except for those two right here. Now I'm
75:57 leaving this in here intentionally actually because there is a bug with
76:02 Supabase specifically if you are on Windows. So you'll see this issue where
76:08 the Supabase pooler is constantly restarting, and that also affects n8n,
76:12 because n8n relies on the Supabase pooler, so it's constantly restarting as
76:17 well. If you see this problem, I actually talk about it in the
76:21 troubleshooting section of the readme. If you scroll all the way down: if the
76:24 Supabase pooler is restarting, you can check out this GitHub issue. And so I
76:29 linked to this right here, and he tells you exactly which file you want to
76:33 change. It's this one right here: docker/volumes/pooler/pooler.exs.
76:39 And you need to change the file to end in lf. And so I'll show you what I mean
76:43 by that. I'll show you exactly how to do this. It's like a super tiny random
76:47 thing, but this has tripped up so many people. So I want to include this
76:51 explicitly in the master class here. So you want to go within the supabase
76:56 folder, within docker/volumes, and then it's within pooler, and then we have
77:00 pooler.exs. And basically, no matter your IDE, you can see the CRLF in the bottom right here.
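If you'd rather do this from the terminal than click in the IDE status bar, a one-line sed performs the same CRLF-to-LF conversion (the path below is assembled from the folders just mentioned, so adjust it if your layout differs):

```shell
# Strip the trailing carriage return from every line, editing in place.
# (On macOS sed, the flag takes an argument: sed -i '' 's/\r$//' <file>)
sed -i 's/\r$//' supabase/docker/volumes/pooler/pooler.exs
```

The save is implicit with -i; then re-run the start script so the pooler container picks up the fixed file.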
77:08 You want to click on this, change it to LF, and then make sure that
77:13 you save the file. Very easy to fix that. And then what you can do is you
77:19 can run the exact same command to spin everything up again. And so I'm going to
77:22 do this now. It's going to go through all the same steps. It'll be faster this
77:25 time because you already have everything pulled. And this, by the way, is how you
77:28 can just restart everything really quickly if you want to enforce new
77:31 environment variables or anything like that. So I want to include that
77:35 explicitly um for that reason as well. And I'll go ahead and close out of this.
77:39 And while this is all restarting, the other thing that I want to show you
77:42 in the readme is that I also have instructions for upgrading the containers in the local AI package. So
77:49 when n8n has an update or Supabase has an update, it is your responsibility,
77:53 because you're managing the infrastructure, to update things yourself. And so you very simply just
77:58 have to run these three commands to update everything. You want to tear down
78:03 all of the containers, making sure you specify your profile, like gpu-nvidia, and
78:09 then you want to pull all of the latest containers, again specifying your
78:13 profile. And then once you do those two things, you'll have the most up-to-date
78:17 versions of the containers downloaded. So you can go ahead and run the start
78:21 services with your profile just like we just did to restart things. Very easy to
78:25 update everything. And even though we are completely tearing down our
78:29 containers here before we upgrade them, we aren't losing any information because
78:33 we are persisting things in the volumes that we have set up at the top of our
78:37 Docker Compose stack. And so this is where we store all of our data in our
78:41 database and workflows. All these things are persisted, so we don't have
78:45 to worry about losing them. Very easy to upgrade things and you still get to keep
78:48 everything. You don't have to make backups and things like that unless you
78:54 just want to be ultra safe. So now we can go back to our Docker Desktop, and
79:00 we've got green dots for everything now, since we fixed that pooler.exs issue.
79:04 The only things that we don't have green dots for are the n8n import and
79:08 our Ollama pull, because like I said, those are the two things that
79:11 just have to run at the beginning, and then they aren't ongoing processes like
79:16 the rest of our services. So, we have everything up and running. And if there
79:22 is anything that is a white dot besides the Ollama pull or n8n import, or if there's
79:27 anything that is constantly restarting, just feel free to post a comment and
79:31 I'll definitely be sure to help you out. And then also check out the
79:34 troubleshooting section as well. One thing that I'll mention really quick is
79:37 sometimes your n8n will constantly restart, and it'll say something like the
79:42 n8n encryption key doesn't match what you have in the config. And the big
79:46 thing to keep in mind for that is you want to make sure that you set this
79:51 value for the encryption key before you ever run it for the first time.
79:53 Otherwise, it's going to generate some random default value and then if you
79:56 change this later, it won't match with what it expects. And so, yeah, my big
79:59 recommendation is like make sure you have everything set up in your
80:03 environment variables before you ever run the start services for the first
80:08 time. This should be run once you have your environment variables set up.
80:11 Otherwise, you risk any of these services creating default values that
80:14 then wouldn't match with the keys and things that you set up later. And so
80:18 with that, we can now go into our browser and actually explore all of
80:22 these local AI services that we have running on our computer now. Now over in
1:20:24 Testing Our Local AI Services
80:26 our browser, we can start visiting the different services that we have spun up.
80:30 Like here is n8n. You just have to go to localhost port 5678. It'll have you
80:35 create a local account when you first visit it. And then you'll have this
80:38 workflow view that should look very familiar to you if you have used n8n
80:41 in the past. And then we have Open WebUI at localhost port 8080. This is our
80:47 ChatGPT-like interface where we can directly talk to all of the models that we have
80:52 pulled in our Ollama container. Really, really neat. And then we have localhost
80:57 port 8000 for our Supabase dashboard. The sign-in definitely isn't pretty
81:00 compared to the managed version of Supabase. But once you enter the
81:04 username and password that you set in the environment variables for the
81:07 dashboard, then you have the very typical view where we have our tables
81:11 and we've got our SQL editor. Everything that you're familiar with from
81:14 Supabase. And that's the key thing with all these different services. They all
81:18 will look exactly the same for you, pretty much. Another one, for example:
81:25 if I go to localhost port 3000, we have Langfuse. This is for agent
81:28 observability and monitoring. And this is something I'm not going to dive into
81:31 in this master class. Like I said, I'm not covering all the services. But yeah,
81:34 I just want to show that like every single one of these pretty much you can
81:39 access in your browser. And by the way, the way that we know the specific port
81:43 to access for each of these services is by taking a look at either what it tells
81:48 us in Docker Desktop. So, like, we can see that Neo4j is, let's see, port
81:53 7474. For SearXNG, it's port 8081. For Flowise, it's port 3001. What's one
82:01 that we've seen already? Yeah, like Open WebUI is port 8080. So,
82:06 the port on the left is the one that we access in our browser. And then the port
82:11 on the right is what's mapped on the container. So, when we visit port 8080
82:16 on our computer, that goes into port 8080 on the container. And that's what
82:21 we have exposed. The other way that you can see the port that you need to use is
82:24 just by taking a look at this docker compose file. And you don't need to have
82:29 like a super good understanding of this docker compose file. But if you want to
82:33 customize your stack or even help me by making contributions to local AI
82:36 package, this is the main place to make changes. And so for example, I can go
82:41 down to Flowise, and I can see that the port is 3001. Or if I go down to, let's say, n8n, we can
82:49 see that the port is 5678. And so the port is always going to be
82:52 there somewhere in the service that you have set up. Like for the Langfuse
82:56 worker, it's 3030. That's more of a behind-the-scenes kind of service. But
82:59 let me just find one more example for you here. Redis, for
83:04 example, is 6379. So you can see the ports in the Docker Compose as well. I
83:07 just want to call it out just to at least get you a little bit comfortable
83:11 and familiar with the Docker Compose file in case you want to customize
83:14 things. But the main thing is just leveraging what you see here in Docker
83:18 Desktop. Last thing in Docker Desktop really quickly, if you want to bring
83:21 more local large language models into the mix, you can do it without having to
83:25 restart anything. You just have to find the Ollama container in the Docker
83:29 Compose stack. Head on over to the exec tab. And now here we can run any
83:33 commands that we'd want. We're directly within the container here. And we can
83:36 use Ollama commands just like we did earlier on our host machine. And so for
83:40 example, I ran ollama list already. So I can see the large language models that
83:43 have already been pulled in my Ollama container. If I want to pull more, I can
83:48 just do ollama pull and then find the ID for the model I want to use on the Ollama
83:52 website. And like I said, you don't have to restart anything. If I pull it here,
83:56 it's now in the container and I can immediately start using it in Open Web
84:00 UI or N8N. We'll see that in a little bit. And so that's just really important
84:03 because a lot of times you're going to want to start to use different large
84:06 language models and you don't want to have to restart anything. The ones that
84:11 are pulled into the container by default are determined by this line right
84:15 here. So if you want to change the ones that are pulled by default: I just have
84:20 Qwen 2.5 7B Instruct, a really small, lightweight one, that I have
84:23 brought into your Ollama container by default. If you want to add in
84:27 different ones, you can just update this line right here to include multiple
84:32 ollama pull commands. And that way you can bring in Qwen 3 or Mistral Small 3.1,
84:36 whatever you want. This is just the one I have by default. And then all the
84:40 other ones that you saw in my list here, I've pulled myself. All right. Now that
1:24:41 Testing Ollama within Open WebUI
84:45 we have the local AI package up and running, it is time to build some
84:50 agents. Now, we get to use our local AI package to actually build out an
84:53 application. And so, I'm going to start by introducing you to Open Web UI, and
84:58 we'll use it to talk to our Ollama LLM. So, we have an application kind of right
85:02 out of the box for us. Then I'll dive into building a local AI agent with N8N,
85:08 even connecting it to Open Web UI. So we have this custom agent that we built in
85:12 N8N and then we immediately have a really nice UI to chat with it. And then
85:16 we'll transition to Python building the exact same agent in Python as well. Like
85:21 I said, I want to focus on both no code and code to really make this a complete
85:24 master class so that whether you want to build with N8N or Python, you can see
85:28 how to connect to our different services that we have running locally like
85:32 Supabase and SearXNG and Open Web UI. So, we'll cover all of that and then I'll
85:36 get into deployments after this. But yeah, let's go ahead right now focus on
85:40 open web UI and building out some agents. So, back over in Open Web UI,
85:45 remember this is localhost port 8080. You want to set up your connection to
85:49 Ollama so we can start talking with our local LLMs in this nice interface. And
85:54 so bottom left, go to the admin panel, then go to settings and then the
85:58 connections tab. Here we can set up our connections both to OpenAI with our API
86:02 key, which we're not going to do right now, but then also the Ollama API. This
86:07 is what we want to set up. Now, usually by default, this value is just
86:12 localhost. And this is actually wrong. This is something that is so important
86:16 to understand. And this will apply when we set up credentials in N8N and Python
86:20 as well. When you are within a container, localhost means that you are
86:26 referencing still within the container. Open Web UI needs to reach out to the
86:32 Ollama container, not itself. So localhost is not correct here. This is
86:36 generally the default just because Open Web UI assumes that you're running on
86:39 your machine and so then you would also have Ollama running on your machine. So
86:42 localhost usually works when you're outside of containers. But here we have
86:47 to change this. This is super important to get right. And so there are two
86:51 options we have. If you are running on a Mac or AMD on Windows and you want to
86:56 use Ollama running on your machine, not within a container, then you want to use
87:00 host.docker.internal. This is the way in Docker to tell the container to look outside to the host
87:07 machine where you're running the containers, where Ollama is running
87:11 separately. Very important to know that. And then if you are running Ollama in a
87:16 container, like I am doing (I have Ollama running in my Docker Desktop), you want
87:21 to change this to ollama. You're specifically calling out the service
87:26 that is running the Ollama container in your Docker Compose stack. And the way
87:30 that we know that this is the name specifically is because we just go back
87:35 to our all-important Docker Compose file: ollama. So whenever there's an x and a
87:40 dash, you just ignore that; it's just the thing after it. So ollama is the name
87:45 of our service running the container. And then if we wanted to connect to
87:49 something else like Flowise, flowise is the name of the service. Open Web UI,
87:55 it's open-web-ui. All of these top-level keywords, these are the names when we
87:59 want our containers to be talking to each other. And all of this is possible
88:03 because they are within the same Docker network. And so I'll just show you that
88:06 so you know what I'm talking about here. If I go back to Docker Desktop, we have
88:11 this local AI Compose stack. All of these containers can now communicate
88:14 internally with each other by referencing the names like redis or
88:19 searxng. So, we'll be seeing that a lot when we're building out our agents as
88:22 well. So, I wanted to spend a couple minutes to focus on that. And so, you
88:25 can go ahead and click on save in the very bottom right. I know my face is
88:28 covering this right now, but you have a save button here. Make sure you actually
88:32 do that. Um, and for this API key, I don't know why it's asking me to fill it
88:34 out. I don't really care about connecting to OpenAI. So I'll just put
88:37 some random value there and click save. And then boom, there we go. We are good.
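That localhost-versus-container rule is worth internalizing, because it comes back when we set up credentials in N8N and Python too. A minimal sketch of the decision, assuming Ollama's default port 11434 and the service name ollama from the Compose stack:

```python
def ollama_base_url(ollama_in_container: bool) -> str:
    """Pick the Ollama base URL for a service that itself runs in Docker.

    Inside a container, "localhost" refers to the container itself, so we
    either target the `ollama` service on the shared Docker network or reach
    out to the host via the special `host.docker.internal` hostname.
    """
    if ollama_in_container:
        # Ollama runs as the `ollama` service in the same Compose stack.
        return "http://ollama:11434"
    # Ollama runs directly on the host machine (e.g. Mac, or AMD on Windows).
    return "http://host.docker.internal:11434"
```

The same two URLs are exactly what goes into the connection settings for any service in the stack that needs to talk to Ollama.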
88:40 And then a lot of times with open web UI, it also helps to refresh otherwise
88:43 it doesn't load the models for some reason. So I just did a refresh of the
88:49 site here. Control F5. And then now we can select all of the local LLMs that we
88:53 have pulled in our Ollama container. And so for example, I can do Qwen 2.5 7B.
88:57 That's the one that I just have by default. I can say hello. And it's going
89:02 to take a little bit cuz it has to load this model onto my GPU just like we saw
89:06 with Qwen 3 earlier. But then in a second here we'll get a response. And
89:10 there are actually multiple calls that are being done here. We have one to get
89:15 our response, one to get a title for our conversation on the left-hand side. And
89:19 then also if you click on the three dots here, you can see that it created a
89:23 couple of tags for this conversation. So a couple of things that are fired off all
89:26 at once there. And I can test conversation history. What did I just
89:30 say? So yeah, I mean everything's working really well here. We have chat
89:34 history, conversation history on the left-hand side. There's so much that we
89:37 get out of the box. And so I wanted to show you this really quickly. Now we can
89:41 move on to building an agent in N8N. And I'll even show you how to connect it to
89:46 Open Web UI as well through this N8N agent connector. Really exciting stuff.
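As an aside, the model dropdown we just used in Open Web UI is ultimately backed by Ollama's GET /api/tags endpoint, so you can query the same list yourself. A small sketch, assuming the default port and the response shape from Ollama's API docs:

```python
import json
import urllib.request


def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from an Ollama GET /api/tags response body."""
    return [model["name"] for model in payload.get("models", [])]


def list_ollama_models(base_url: str = "http://localhost:11434") -> list[str]:
    """List the models already pulled into Ollama via its /api/tags endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return parse_model_names(json.load(resp))
```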
1:29:49 Building a Local n8n AI Agent
89:49 So let's get right into it. So I'm going to start really simple here by building
89:53 a basic agent. The main thing that I want to focus on is just connecting to
89:57 our different local AI services. So I am going to assume that you have a basic
90:01 knowledge of N8N here because this is not an N8N master class. And so I'm
90:04 starting with a chat trigger so we can talk to our agent directly in the UI.
90:08 We'll connect this to open web UI in a bit as well. And then I want to connect
90:14 an AI agent node. And so what we want to do is connect Ollama for the chat model
90:18 and then local Supabase for our conversation history, our agent memory.
90:22 And so for the chat model I'm going to do the Ollama chat model. I'm going to create
90:26 brand new credentials. You can see me do this from scratch. The URL that you want
90:32 for the base URL is exactly the same as what we just entered into open web UI.
90:35 And so if you are running Ollama on your host machine, like on AMD on Windows, or
90:40 you are running on a Mac, or you just don't want to run the Ollama container,
90:46 then it is host.docker.internal. And then if you are referencing the
90:49 Ollama container, we just reference ollama. That's the name of the service
90:53 running the Ollama container in our stack. And then the port is 11434 by
90:57 default. And you can test this connection. So it'll do a quick ping to
91:01 the container to make sure that we are good to go. And I'll even show you what
91:04 that looks like. So right here in my Ollama container, I have the logs up. And
91:09 the last two requests were just a simple get request to the root endpoint. We
91:13 have two of those right here. And if I click on retry and I go back to the
91:18 logs, boom, we are at three now. So it made three requests. So it's just making
91:22 that simple ping each time to make sure the container is available. And so I'm
91:26 going to go ahead and click on save and then close out. So now we have our
91:29 credentials and then we can automatically select the model that we
91:33 have loaded now in our container. And so just to keep things really lightweight,
91:36 I'm going to go with the 7 billion parameter model right now from Qwen 2.5.
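That credential test we watched in the logs is just a lightweight GET against the Ollama base URL, which answers with "Ollama is running". A sketch of the same check in Python:

```python
import urllib.request


def ping_ollama(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers at base_url.

    Ollama replies to a plain GET on its root endpoint, mirroring the quick
    ping we just saw N8N's credential test make in the container logs.
    """
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, DNS failure, or timeout: the server isn't reachable.
        return False
```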
91:40 Cool. All right. So that is everything that we need to connect Ollama. It is
91:44 that easy. And then we could even test it right now. So, I'm going to go ahead
91:47 and save this workflow. And I'm going to just say hello. And uh we don't need the
91:52 conversation history or tools or anything at this point. We're already
91:55 getting a response here from the LLM. It's working on loading the model into
91:59 my GPU as we speak. And so there we go. We got our answer looking really good.
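For reference, a chat turn like that "hello" is a single POST to Ollama's /api/chat endpoint when you call it directly. A sketch in plain Python; the model tag below is just an example:

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str) -> dict:
    """Shape a request body for Ollama's POST /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one complete JSON response instead of a stream
    }


def chat_with_ollama(base_url: str, model: str, prompt: str) -> str:
    """Send the prompt to Ollama and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(build_chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)["message"]["content"]
```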
92:04 Cool. So now we can add memory as well. So I'm going to add Postgres because,
92:08 remember, Supabase uses Postgres under the hood. And then I'm going to create
92:12 brand new credentials here. And this is actually probably the hardest one to set
92:16 up out of all of the credentials for connecting to our local AI services. And
92:20 so I'm going to show you what the Docker Compose file looks like just that it's
92:24 clear how I'm getting these different values. And so I'll point out all of
92:28 them. So the first one, for our host, it is db, because this is the name of the
92:35 specific Supabase service that is the underlying Postgres
92:38 database. And I can show you how I got that really quick. If you go to the
92:42 supabase folder that we pull when we run that start services script, I go to
92:47 docker and then docker compose. If I search for db and there's quite a few
92:53 dependencies on db here. So let me find the actual reference to it. Where is db?
92:58 Here we go. So yeah, it's really short. db is the name of our service that
93:03 actually is the Supabase DB. So this is the container name; this is what
93:07 you'll see in docker desktop. But then this is the underlying service that we
93:10 want to reference when we have our containers communicating with each
93:14 other. Like in this case we have our N8N container talking to our Supabase
93:18 database container. And then the database and username are both going to
93:22 be postgres. Those are the values that we have by default. If you scroll down a
93:26 bit in the .env, you can see these right here. The Postgres database is
93:29 postgres and the user is also postgres. And you can customize these
93:33 things but these are some of the optional parameters that I didn't touch
93:36 in the setup with you. And so you can just leave those as is. Now the
93:41 Postgres password, this is one of them that we set. That was the first
93:44 Supabase value that we set there. Make sure that matches what you have in
93:48 the .env. And then everything else you can kind of leave as the defaults here. The port is
93:53 going to be 5432. So that is everything for setting up our connection to
93:57 Postgres. You can test this connection as well. And then we can move on to
94:01 adding in some tools and things like that as well. But yeah, this is like the
94:06 very first basic version of the agent that I wanted to show you. And hopefully
94:10 with this you can see how, no matter the service that you have running in the
94:13 local AI package, it's very easy to figure out how to connect to it, both
94:17 with the help of N8N because N8N always makes it really easy to connect to
94:20 things. Then also just knowing that like you just have to reference that service
94:25 name that we have for the container in the Docker Compose stack. That's how we
94:28 can talk to it. So you could add in Qdrant or you could add in Langfuse.
94:31 Like you can connect anything that you want into our agent here. And so now we
94:36 have conversation history. Next up, I want to show you how to build a bit more
94:40 of a complicated agent with N8N using some tools. And then also I'm going to
94:44 show you how to connect it to Open Web UI. And so right now this is a live
94:47 demo. Instead of connecting to one of the Ollama LLMs, I'm going straight to
94:53 N8N. I have this custom N8N agent connector. And so we are talking to this
94:57 agent that I'll show you how to build in a little bit. This one has a tool to use
95:02 SearXNG for local and private web search. This is one of the platforms that we
95:06 have included in the local AI package. And so this response is going to take a
95:10 little bit here because it has to search the web. And the response that it
95:13 generates with this question is pretty long. Like there we go. Okay. So we got
95:16 the answer. It's pretty long. But yeah, we are able to search the internet now
95:21 with a local agent. N8N connected to open web UI. We're getting pretty fancy
95:25 here. And we also have the title that was generated on the left. And then we
95:30 have the tags here as well. And so the way that this all works, I'm going to
95:33 start by explaining how we can connect N8N to Open Web UI. And this is just
95:37 crucial. Makes it so easy for us to test agents locally as we are developing
95:42 them. And so if you go to the settings and the admin panel in the bottom left
95:47 and go to functions, open web UI has this thing called functions which gives
95:51 us the ability to add in custom functionality kind of as like custom
95:56 models that we can then use like you saw with the N8N agent connector. And so
96:02 what I have here is this thing that I call the N8N pipe. And I'll have a link
96:05 to this in the description as well. I created this myself and I uploaded it to
96:10 the open web UI directory of functions. And so you can go to this link right
96:14 here. You can even just Google the N8N pipe for open web UI. And then you click
96:19 on this get button. It'll just have you enter in the URL for your open web UI.
96:23 So I can just like paste in this right here. Click on import to open web UI and
96:28 it'll automatically redirect you to your open web UI instance. So you'll have
96:33 this function now. And we don't have to dive into the code for how all
96:36 of this works. I worked pretty hard to create this for you. Uh actually quite a
96:40 while ago I made this. And the thing that we need to care about is
96:44 configuring this to talk to our N8N agent. And so if you click on the
96:49 valves, the setting icon in the top right, there are a few values that we
96:54 have to set. And so now I'm going to go over to showing you how to build things
96:57 in N8N. Then all of this will click and it'll make sense. Right now, looking at
97:00 these values, you're probably like, how the heck do I get all of these? But
97:02 don't worry, we'll dive into all of that. But first, let's go into our N8N
97:07 agent. I'll explain how all of this works. So, first of all, we have our
97:12 chat trigger that gives us the ability to communicate with our agent very
97:16 easily in the workflow. We have a new trigger now for the web hook. And so,
97:22 this is turning our agent into an API endpoint. So, we're able to talk to it
97:27 with other services like open web UI. And so to configure the web hook here,
97:31 you want to make sure that it is a post request type. And then you can define a
97:35 custom path here. Whatever you set here is going to determine what our URL is.
97:40 So we have our test URL. And then also if you toggle the workflow to active,
97:44 this is really important. The workflow in N8N does have to be active. Then you
97:49 have access to this production URL. And this is actually the first value that we
97:54 need to set within the valves for this open web UI function. We have our N8N
97:59 URL. And because this is a container talking to another container, we don't
98:03 actually want to use this localhost value that it has here for us. We want
98:08 to specify N8N because N8N again is the name of the service running the N8N
98:13 container in our Docker Compose stack. So N8N port 5678. And then this is the
98:18 custom URL that we can determine based on this. And then the other thing that
98:23 we want to do is set up header authentication. We don't want to expose
98:27 this endpoint without any kind of security. And so we want to set up some
98:31 authentication. And so you can select header off from the authentication
98:34 dropdown. And then for the credentials here, I'll just create brand new ones to
98:38 show you what this looks like. The name needs to be Authorization with a capital
98:43 A. This has to be very specific. The name in the top left, the name of
98:46 your credentials, can be whatever you want, but this has to be
98:51 Authorization. And then the value here, the way that we want to format this is:
98:55 it's going to be Bearer, then a space, and then whatever you want
99:00 your bearer token to be. So this is what you get to define, but it needs to start
99:05 with Bearer, capital B, and a space. And then whatever you type after Bearer and
99:09 the space, this goes in as the N8N bearer token. So you don't include Bearer and a
99:13 space here, because it's just assumed that it's going to be
99:16 prefixed with that. So you just type in, like, test-auth is what I
99:21 have. So my bearer token is Bearer test-auth, like that. And then this is what I
99:24 enter in for this field. Now I already have mine set up. So I'm just going to
99:27 go ahead and close out of this. And then the last thing that we have to set up
99:30 for the web hook. And don't worry, this is the node that we spend the most time
99:33 with. You want to go to the drop down here and change this to respond using
99:38 the Respond to Webhook node. Very important, because then at the end of our
99:40 workflow and we get the response from our agent, we're going to send that back
99:45 to whatever requested our API which is going to be open web UI in this case.
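Putting those webhook pieces together, here is a sketch of the request that ends up hitting N8N: a POST with a JSON body and the Authorization header. The webhook path and token below are placeholders, and the chatInput/sessionId field names match what the workflow expects:

```python
import json
import urllib.request


def build_agent_request(webhook_url: str, bearer_token: str,
                        chat_input: str, session_id: str) -> urllib.request.Request:
    """Build the POST that a client like Open Web UI sends to the N8N webhook.

    The header must be exactly `Authorization: Bearer <token>` to pass the
    header auth we configured on the webhook node.
    """
    body = json.dumps({"chatInput": chat_input, "sessionId": session_id}).encode()
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {bearer_token}",
        },
        method="POST",
    )


def call_n8n_agent(request: urllib.request.Request) -> dict:
    """Fire the request and decode the JSON body from Respond to Webhook."""
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.load(resp)
```

If the token or the Authorization header name is off, the webhook's header auth simply rejects the request.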
99:48 And so that's everything for our configuration for the web hook. Now the
99:53 next thing that we have to do is we have to determine is open web UI sending in a
99:58 request to get a response for our main agent or is it just looking to generate
100:03 that conversation title or the tags for our conversation? Because like we were
100:06 looking at earlier, I'm going to close out of this for now and go back to a
100:10 conversation, our last conversation here. We get our main response, but then
100:15 also there is a request to an LLM to create a very simple title for our
100:19 conversation and the tags that we can see in the top right. And so our N8N
100:24 workflow actually gets invoked three separate times for just the first
100:30 message in a new conversation. And so we need to determine, are we getting a main
100:34 response? Like should we go to our main agent or should we just go to a simple
100:39 LLM that I have set up here to help generate the tags or title? And so the
100:44 way that we can determine that is whenever Open Web UI is requesting
100:47 something like a title for a conversation, it always prefixes the
100:53 prompt with three pound symbols, a space, and then the word task. And so we
100:58 can key off of this. If the prompt starts with this, and that prompt just
101:02 is coming in from our web hook here. If it does start with it, then we're just
101:06 going to go to this simple LLM. We're just going to be using Qwen 2.5 14B
101:11 Instruct. We have no tools, no memory or anything like our main agent, because
101:14 we're just very simply going to generate that title or the tags. And I can even
101:19 show you in the execution history what that looks like. So in this case, we
101:22 have our web hook that comes in. The chat input starts with the triple pound
101:28 and task. And so sure enough, we are deeming it to be a metadata request, as
101:32 I'm calling it. And so then it goes down to this LLM that is just
101:36 generating some text here. We just have this JSON response with the tags for the
101:41 conversation, technology, hardware, and gaming. So we're asking about the price
101:45 of the 5090 GPU. And then we do the exact same thing to also generate the
101:51 title GPU specs. And so exactly what we see here is the title of this last
101:55 conversation. So I hope that makes sense. And then if it doesn't start with
101:59 the triple pound and task, so it's actually a real user request, then we go to our
102:03 main agent. We don't want our main agent to have to handle those super simple
102:06 tasks. This would actually be the perfect
102:11 case to use a super tiny LLM, even something like DeepSeek R1 1.5B,
102:15 because it's just such a simple task. Otherwise, though, we are going to
102:20 go to our main agent. And so I'm not going to dive into like all these nodes
102:24 in a ton of detail, but basically we are expecting the chat input to contain
102:30 the prompt for our agent. And the way that we know to expect chat input
102:35 specifically is because going back to the settings for the function here with
102:38 the valves, we are saying right here chat input. So you want to make sure
102:43 that the value that you put in here for input matches exactly with what you are
102:48 expecting from our web hook. And so chat input is the one that I have by default.
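That branch, metadata request versus real user message, boils down to a single prefix check on the incoming prompt, since the "### Task" prefix is what Open Web UI prepends. In Python it is one line:

```python
def is_metadata_request(chat_input: str) -> bool:
    """True when Open Web UI is asking for a title/tags rather than sending a
    real user message: those prompts start with three pound symbols, a space,
    and the word Task."""
    return chat_input.startswith("### Task")
```

This mirrors the If node in the N8N workflow: metadata requests route to the small, tool-free LLM, everything else goes to the main agent.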
102:51 So you can just copy me if you want. Then we go into our agent where we're
102:55 hooked into Ollama and we've got our local Supabase. I already showed you
102:58 how to connect up all this and that looks exactly the same. The only thing
103:01 that is different now is we have a single tool to search the web with
103:07 SearXNG. So it's a web search tool. I have a description here just telling it what
103:11 is going to get back from using this tool. And then for the workflow ID, this
103:16 is: if I go to add a node here and I just go for workflow tools, the Call N8N
103:23 Workflow Tool. So this is basically taking an N8N workflow and using it as a
103:28 tool for our agent. So this is the node that we have right here. But then I'm
103:32 referencing the ID of this N8N workflow, this ID, because I'm going to just
103:37 call the subworkflow that I have defined below. And again, I don't want to dive
103:40 into all the details of N8N right now and how this all works, but the agent is
103:44 going to decide the query. What should I search the web with? It decides that and
103:49 then it invokes this sub workflow here where we have our call to SearXNG. So the
103:54 name of the container service in our Docker Compose stack is just searxng,
103:59 and it runs on port 8080. And then if you look at the SearXNG documentation, you
104:03 can look at how to invoke their API and things like this. So I'm just doing a
104:07 simple search here and then there are a few different nodes because what I want
104:10 to do is I want to split out the search results. And actually I can show you this by
104:14 going to an execution history where we're actually using this tool. So take
104:18 a look at this. So in this case the LLM decided to invoke this tool and the
104:23 query that it decided is current price of the 5090 GPU. So this is going along
104:28 with the conversation that we had last in Open Web UI. We get some results from
104:33 SearXNG, which is just going to be a bunch of different websites. And so, we don't
104:37 have the answer quite yet. We just have a bunch of resources that can help us
104:41 get there. And so, I'm going to split out. So, we have a bunch of different
104:45 websites. We're going to now limit to just one. I just want to pull one
104:48 website right now just to keep it really, really simple because now we're
104:52 going to actually visit that website. I'm going to make an HTTP request to
104:57 this website, which yeah, I mean, if it's literally an Nvidia official site
105:01 for the 5090, like this definitely has the information that we need. We're
105:04 going to make a request to it, and then we're also going to use this HTML node
105:08 to make sure that we are only selecting the body of the site. So, we take out
105:12 all the footers and headers and all that junk. So, we just have the key
105:15 information. And then that is what we aggregate and then return back to our AI
105:19 agent. So it now has the content, the core content of this website to get us
105:24 that answer. That is how we invoke our web search tool. And then at the very
105:29 end, we're just going to set this output field. And that's going to be the
105:33 response that we got back either from like generating a title or calling our
105:37 main agent. And this is really important: the output field specifically,
105:41 whatever we call it here, we have to make sure that it corresponds to this
105:46 value as the last thing we have to set for the settings for our open web UI
105:50 function. So output here has to match with output here because that is what
105:55 we're going to return in this respond to web hook. Whatever open web UI gets back
105:59 it's getting back from what we return right here. So that is everything for
106:03 our agent. I could probably dive in quite a bit more into explaining how
106:06 this all works and building out a lot more complex agents, which I definitely
106:10 do with local AI in the Dynamis AI agent mastery course. So check that out if you
106:13 are interested. I just wanted to give you a simple example here showing how we
106:17 can talk to our different services like Ollama, Supabase, and SearXNG. And then
106:22 also open web UI as well. So once you have all these settings set, make sure
106:26 of course that you click on save. It's very, very important. These two things
106:29 at the bottom don't really matter, by the way. But yeah, click on save once
106:32 you have all of the settings there. And then you can go ahead and have a
106:36 conversation with your agent just like I did when I was demoing things before we
106:40 dove into the workflow. And by the way, this N8N agent that works with Open Web
106:44 UI, I have as a template for you. You can go ahead and download that in this
106:48 GitHub repository where I'm storing all the agents for this masterclass. So we
106:52 have the JSON for it right here. You can go ahead and download this file. Go into
106:57 your N8N instance. Click on the three dots in the top right once you've
107:01 created a new workflow. Import from file and then you can bring in that JSON
107:03 workflow. You'll just have to set up all your own credentials for things like
107:07 Ollama and Supabase and SearXNG. But then you'll be good to go and you can just go
107:11 through the same process that I did setting up the function in open web UI
107:15 and within like 15 minutes you'll have everything up and running to talk
1:47:18 Building a Local Python AI Agent
107:20 to N8N in open web UI. Next up I want to create now the Python version of our
107:25 local AI agent. And so this is going to be a one-to-one translation. Exactly what
107:30 we built here in N8N, we are now going to do in Python. So I can show you how to
107:34 work with both no-code and code with our local AI package. And so this GitHub
107:39 repo that has the N8N workflow we were just looking at and that OpenAI
107:43 compatible demo we saw earlier, this has pretty much everything for the agent. So
107:46 most of this repository is for this agent that we're about to dive into now
107:50 with Python. And in this readme here, I have very detailed instructions for
107:54 setting up everything. And a lot of what we do with the Python agent, especially
107:57 when we are configuring our environment variables, it's going to look very
108:01 similar to a lot of those values that we set in N8N. Like we have our base URL
108:05 here, which you'd want to set to something like http://ollama:11434. We just
108:09 need to add this /v1 on the end, which I guess is a little bit different,
108:14 but yeah, I've got instructions here for setting up all of our environment
108:18 variables, our API key, which you can actually use OpenAI or Open Router as
108:22 well with this agent, taking advantage of the OpenAI API compatibility. This is
108:27 a live example of this because you can change the base URL, API key, and the
108:31 LLM choice to something from Open Router or OpenAI, and then you're good to go
108:35 immediately. It's really, really easy. We will be using Olama in this case, of
108:39 course, though. And then you want to set your superb basease URL and service key.
108:42 You can get that from your environment variables. Same thing with SearXNG with
108:47 that base URL. We'll set that just like we did in N8N. We have our bearer token,
108:51 which in our case was test-auth; it's just whatever comes after the Bearer and the
108:55 space. And then the OpenAI API key you can ignore. That's just for the
108:58 compatible demo that we saw earlier. This is everything that we need for our
109:02 main agent now. And so we're using a Python library called FastAPI to turn
109:09 our AI agent into an API endpoint just like we did in N8N. And so FastAPI is
109:13 kind of what gives us this web hook both with the entry point and the exit for
109:17 our agent and then everything in between is going to be the logic where we are
109:21 using our agent. And I'm going to be using Pydantic AI. It's my favorite AI
109:25 agent framework with Python right now. Makes it really easy to set up agents,
109:29 and we'll dive into that here. And I don't want to get into the
109:32 nitty-gritty of the Python code here because this isn't a master class on
109:36 specifically building agents. I really just want to show you how we can be
109:40 connecting to our local AI services. This agent is 100% offline. Like I could
109:46 cut the internet to my machine and still use everything here. So we create our
109:50 Supabase client and the instance of our FastAPI app. I have some models
109:55 here that define the requests coming in and the responses going out. So we have
109:59 the chat input and the session ID just like we saw in N8N. And then the output
110:04 is going to be this output field. And so that corresponds to exactly what we're
110:07 expecting with those settings that we set up in the function in open web UI.
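As a rough sketch of those input and output shapes: the real agent defines them as Pydantic models for FastAPI, but plain dataclasses are used here just to illustrate the JSON structure. The field names (chat_input, session_id, output) follow what's described in the video.

```python
import json
from dataclasses import dataclass, asdict

# Sketch only: the actual agent uses Pydantic models with FastAPI.
# Field names match the video's description of the request/response.

@dataclass
class ChatRequest:
    chat_input: str   # the latest user message
    session_id: str   # identifies the conversation history

@dataclass
class ChatResponse:
    output: str       # the single field open web UI reads back

req = ChatRequest(chat_input="What is the price?", session_id="abc-123")
print(json.dumps(asdict(req)))  # the JSON body the webhook expects
```

This is the same contract the open web UI function was configured with earlier: two fields in, one field out.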
110:11 So this Python agent is also going to work directly with open web UI. And then
110:16 we have some dependencies for our Pydantic AI agent because it needs an
110:21 HTTP client and the SearXNG base URL to make those requests for the web search
110:25 tool. And then we're setting up our model here. It's an OpenAI model, but we
110:29 can override the base URL and API key to communicate with Ollama or Open Router as
110:34 well, like we will be doing. And then we create our Pydantic AI agent, just giving it
110:38 that model based on our environment variables. I've got a very simple system
110:43 prompt and then the dependencies here because we need that HTTP client to talk
110:48 to SearXNG. And then I'm just allowing it to retry twice. So if there's any kind
110:51 of error that comes up, the agent can retry automatically, which is one of the
110:55 really awesome things that we have in Pydantic AI. And then I'm also creating a
111:01 second agent here. This is the agent that is going to be responsible like we
111:05 have in N8N for handling the metadata for open web UI, like conversation titles
111:10 and tags for our conversation. It's an entirely separate agent because
111:14 it just has another system prompt. In this case, I'm doing something
111:18 really simple here. We don't have any dependencies for this agent because it's
111:21 not going to be using the web search tool. And then for the model, I'm just
111:25 using the exact same model that we have for our primary agent. But like I shared
111:30 with N8N, you could make it so that this is like a much smaller model, like a one
111:33 or three billion parameter model because the task is just so basic or maybe like
111:38 a 7 billion parameter model. So you can tweak that if you want. Just for
111:41 simplicity's sake, I'm using the same LLM for both of these agents.
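To make the two-agent setup concrete, here's a minimal sketch. The class below is an illustrative stand-in for Pydantic AI's Agent, not its real API; it just shows that both agents share one model and differ only in system prompt and tools.

```python
from dataclasses import dataclass, field

# Illustrative stand-in for Pydantic AI's Agent class (not the real API).
@dataclass
class AgentSketch:
    model: str
    system_prompt: str
    tools: list = field(default_factory=list)
    retries: int = 2  # the video allows the agent to retry twice on errors

LLM_CHOICE = "qwen3:14b"  # hypothetical value of an LLM choice env variable

primary_agent = AgentSketch(
    model=LLM_CHOICE,
    system_prompt="You are a helpful assistant with web search.",
    tools=["web_search"],
)
metadata_agent = AgentSketch(
    model=LLM_CHOICE,  # same LLM for simplicity; could be a smaller model
    system_prompt="Generate a short title or tags for this conversation.",
)
```

The point is the shape: one model client, two system prompts, and only the primary agent gets the web search tool.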
111:46 And then we get to our web search tool. So in Pydantic AI, the way that you give a
111:51 tool to your agent is you write @, then the name of your agent, then
111:55 .tool, and then the function that you define below this is going to be
112:00 given as a tool to the agent. And the description that we have in the
112:04 docstring here is given as a part of the prompt to your agent. So it knows
112:09 when and how to use this tool. The exact details of how
112:13 we're using SearXNG here I won't dive into, but it is the exact same as what
112:17 we did in N8N, where we make that request to the search endpoint of SearXNG. We
112:22 go through the page results here. We limit to just the top three results, or I
112:26 could even change this to make it simpler and use just the top result, so we
112:30 have the smallest prompt possible for the LLM. And then we get the content of
112:34 that page specifically. And then we return that to our AI agent as some
112:40 JSON here. So now once it invokes this tool, it has a full page back with
112:44 information to help answer the question from the user. It has that web search
112:47 complete now. And then we have some security here to make sure that the
112:51 bearer token matches what comes into our API endpoint. So that's that header
112:55 authentication that we set up in N8N. So this part right here where we're
112:59 verifying the header authentication that corresponds to this verify token
113:03 function. And then we have a function to fetch conversation history and one to store a
113:08 new message in conversation history. Both of these are just making requests
113:12 to our locally hosted Supabase using that Supabase client that we created
113:16 above. And then we have the definition for our actual API endpoint. And so in
113:23 N8N we were using invoke N8N agent for our path to our agent. So this was our
113:29 production URL. In this FastAPI app, our endpoint is
113:34 /invoke-python-agent. And we're specifically expecting the chat input
113:39 and session ID. So that is our chat request type right here. And
113:43 then sorry I highlighted the wrong thing. We have our response model here
113:46 that has the output field. So we're defining the exact types for the inputs
113:50 and the outputs for this API endpoint. And then we're also using this verify
113:55 token to protect our endpoint at the start. And then the key thing here, if
113:59 the chat input starts with that task prefix, then we're going to call our metadata
114:02 agent. And so it's just going to spit out the title or the tags, whatever that
114:06 might be. Otherwise, we're going to fetch the conversation history, format
114:11 that for Pydantic AI, store the user's message so that we have that
114:15 conversation history stored, create our dependencies so that we can communicate
114:20 with SearXNG, and then we'll just do agent.run. We'll pass in the latest
114:24 message from the user, the past conversation history, and the
114:27 dependencies that we created so it can use those when it invokes the web search
114:31 tool. Then we just get the response back from the agent, and you can
114:34 print that out in the terminal as well, and then we'll just store it in
114:38 Supabase and return the output field. So I'm going kind of fast here.
114:42 There are definitely a lot more videos on my channel where I break down in more
114:45 detail building out agents with Pydantic AI and turning them into API endpoints
114:49 and things like that. But yeah, I'm just going a little bit faster here.
114:52 And then the last thing is with any kind of exception that we encounter, we're
114:55 just going to return a response to the front end saying that there was an issue
114:59 and then specifying what that is. And then we are using Uvicorn to host our
115:06 API endpoint specifically on port 8055. So that is everything for our Python
115:11 agent exactly the same as what we set up in N8N. And now going to the readme
115:16 here, I'll open up the preview. The way that we can run this agent, we just have
115:20 to open up a terminal just like we did with the OpenAI compatible demo. I've
115:25 got instructions here for setting up the database table, which is using
115:30 the same table as the one in N8N, and N8N creates it automatically. So, if you
115:34 have already been using the N8N agent, you don't actually have to run this SQL
115:38 here. And then you want to set up your environment variables like
115:42 we covered, activate your virtual environment, and install the requirements
115:46 there. And then you can go ahead and run the command python main.py.
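Once it's running, a request to the endpoint looks roughly like this. The URL, port, path, and token mirror the setup described in the video, but treat them as placeholders for your own values.

```python
import json
import urllib.request

# Build (but don't send) a request to the locally running agent.
# URL, path, and token are placeholders mirroring the video's setup.
def build_invoke_request(chat_input: str, session_id: str,
                         token: str = "your-bearer-token",
                         url: str = "http://localhost:8055/invoke-python-agent"):
    body = json.dumps({"chat_input": chat_input, "session_id": session_id}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # must match the server's token
        },
    )

req = build_invoke_request("How much does it cost?", "session-1")
```

Sending it with `urllib.request.urlopen(req)` should return a JSON body containing the single `output` field.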
115:52 And so this will start the API endpoint. So it'll just be hanging here because
115:56 now it's waiting for requests to come in on port 8055. And so what I can do is
116:03 I can go back to open web UI. I can go to the admin panel functions. Go to the
116:08 settings. I can now change this URL. So everything else is the same. I have my
116:11 bearer token, the input field, and the output field the same as N8N. The only
116:16 thing I have to change now is my URL. And so I know this is an N8N pipe and I
116:20 have N8N in the name everywhere, but this does work with just any API
116:23 endpoint that we have created with this format here. And so I'm going to say for
116:27 my URL, it's actually going to be host.docker.internal, because I have my API endpoint for
116:33 Python running outside on my host machine. So I need my open web UI
116:38 container to go outside to my host machine. And then specifically the port
116:43 is going to be 8055. And then for the endpoint here, I'm going to
116:46 delete this webhook path because ours is invoke-python-agent. Take a look at that. All right.
116:52 Boom. So I'm going to go ahead and save this. And then I can go over to my chat.
116:57 And it says N8N agent connector here still. But this is actually talking to
117:00 my Python agent now. So I'll go ahead and start by asking it the exact same
117:05 question that I asked the N8N agent. And I do have this pipe set up to always say
117:08 that it's calling N8N, but this is indeed calling our Python API endpoint.
117:12 And we can see that now. So there we go. We got all the requests coming in, the
117:16 response from the agent, and then also the metadata for the title and the tags
117:20 for the conversation. Take a look at that. So we got our title here. We have
117:24 our tags, and then we have our answer. It's a starting price of $2,000, though
117:28 it's a lot more right now. The starting price is kind of
117:32 misleading, but yeah, this is a good answer, and it did use SearXNG to do
117:36 that web search for us. This is really, really neat. Now, the last thing that we
1:57:37 Containerizing our Local Python Agent
24:29 Specific Local LLM Recommendations
24:30 for what you can run. So, to go along with that information overload, I want
24:34 to give you some specifics, individual LLMs that you can try right now based on
24:38 the size range that you know will work for your hardware. So, just a couple of
24:42 recommendations here. The first one that I want to focus on is DeepSeek R1. This
24:47 is the most popular local LLM ever. It completely blew up a few months ago. And
24:52 the best part about DeepSeek R1 is they have an option that fits into each of
24:56 the size ranges that I just covered in that chart. So they have a 7 billion
25:00 parameter, 14, 32, and 70. The exact numbers that I mentioned earlier. And
25:05 then there is also the full real version of R1, which is 671 billion parameters.
25:10 I'm sorry though, you probably don't have the hardware to run that unless
25:13 you're spending tens of thousands on your infrastructure. So, probably stick
25:16 with one of these based on your graphics card or if you have a Mac computer, pick
25:19 the one that'll work for you and just try it out. You can click on any one of
25:23 these sizes here. And then here's your command to download and run it. And this
25:28 is defaulting to a Q4 quantization, which is what I was assuming in the
25:31 chart earlier. And again, I will cover what that actually means in a little bit
25:35 here. The other one that I want to focus on is Qwen 3. This is a lot newer.
25:41 Qwen 3 is so good. And they don't have a 70 billion parameter option, but they do
25:45 have all the other sizes that fit into those ranges that I mentioned
25:49 earlier. Like they got 8 billion, 14 billion, and 32 billion parameters. And
25:52 the same kind of deal where you click on the size that you want and you've got
25:55 your command to install it here. And this is a reasoning LLM just like
26:01 DeepSeek R1. And then the other one that I want to mention here is Mistral Small.
26:05 I've had really good results with this as well. There are less options here,
26:08 but you've got 22 or 24 billion parameters, which is going to work well
26:12 with a 3090 graphics card or if you have a Mac M4 Pro with 24 GB of unified
26:18 memory. Really, really good model. And then also, there is a version of it that
26:22 is fine-tuned for coding specifically called Devstral, which is another
26:26 really cool LLM worth checking out as well if you have the hardware to run it.
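As a rough sense of scale for these recommended sizes, you can estimate a Q4 download at about half a gigabyte per billion parameters plus some overhead. The overhead factor here is my own back-of-the-envelope approximation, not an official formula; check the actual sizes on each model's Ollama page.

```python
# Back-of-the-envelope estimate, not an official formula: Q4 stores roughly
# 4 bits (0.5 bytes) per parameter, plus ~20% overhead for metadata and
# layers that aren't quantized as aggressively.
def approx_q4_gigabytes(billions_of_params: float, overhead: float = 1.2) -> float:
    return billions_of_params * 0.5 * overhead

for size in (7, 14, 32, 70):  # the size ranges mentioned in the video
    print(f"{size}B at Q4 ≈ {approx_q4_gigabytes(size):.1f} GB")
```

That puts a 7B model around 4 GB and a 32B model around 19 GB, which lines up with why a 24 GB card like a 3090 tops out around the 32B class.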
26:30 So, that is everything for just general recommendations for local LLMs to try
26:34 right now. This is the part of the master class that is going to become
26:38 outdated the fastest because there are new local LLMs coming out every single
26:42 month. I don't really know how long my recommendations will last. But in
26:45 general, you can just go to the model list in Ollama, search around,
26:49 find one that has a size that works with your graphics card, and just give it
26:52 a shot. You can install it and run it very easily with Ollama. And the other
26:57 thing that I want to mention here is you don't always have to run open-source
27:01 large language models yourself. You can use a platform like Open Router. You can
27:05 just go to openrouter.ai, sign up, and add in some API credits. You can try these
27:10 open-source LLMs there, maybe if you want to see what's powerful enough for
27:15 your agents before you invest in hardware to actually run them yourself.
27:18 And so within Open Router, I can just search for Qwen here. And I can go down
27:23 to Qwen and I can go to 32 billion. They have a free offering as well that
27:26 doesn't have the best rate limits. So I'll just go to this one right here,
27:31 Qwen 3 32B. So I can try the model out through Open Router. They actually host
27:35 it for me. So it's an open-source, non-local version, but now I can try it
27:39 in my agents to see if it's good. And if it is, then it's like, okay, now
27:43 I want to buy a 3090 graphics card so that I can install it directly through
27:47 Ollama instead. And so the 32 billion Qwen 3 is exactly what we're seeing here
27:51 in Open Router. And there are other platforms like Groq as well where you
27:55 can run these open-source large language models not on your own infrastructure
27:58 if you just want to do some testing beforehand or whatever that might
28:01 be. So I wanted to call that out as an alternative as well. But yeah, that's
28:04 everything for my general recommendations for LLMs to try and use
28:07 Quantization (Run Bigger LLMs)
28:09 in your agents. All right, it is time to take a quick breather. This is
28:12 everything that we've covered already in our master class. What is local AI? Why
28:17 we care about it? Why it's the future and hardware requirements. And I really
28:20 wanted to dive deep into this stuff because it sets the stage for everything
28:24 that we do when we actually build agents and deploy our infrastructure. And so
28:28 the last thing that I want to do with you before we really start to get into
28:33 building agents and setting up our package is I want to talk about some of
28:37 the tricky stuff that is usually pretty daunting for anyone getting into local
28:42 AI. I'm talking things like offloading models, quantization, environment
28:47 variables to handle things like flash attention, all the stuff that is really
28:51 important that I want to break down simply for you so you can feel confident
28:55 that you have everything set up right, that you know what goes into using local
29:00 LLMs. The first big concept to focus on here is quantization. And this is
29:04 crucial. It's how we can make large language models a lot smaller so they
29:10 can fit on our GPUs without hurting performance too much. We are lowering
29:14 the model precision here. What that basically means is we have
29:18 each of our parameters, all of the numbers for our LLMs, at 16 bits
29:23 at the full size, but we can lower the precision of each of those parameters to
29:28 8, 4, or 2 bits. Don't worry if you don't understand the technicalities of
29:31 that. Basically, it comes down to LLMs are just billions of numbers. That's the
29:35 parameters that we already covered. And we can make these numbers less precise
29:40 or smaller without losing much performance. So, we can fit larger LLMs
29:45 within a GPU that normally wouldn't even be close to running the full-size model.
29:50 Like with 32 billion parameter LLMs, for example, I was assuming a Q4
29:55 quantization, like 4 bits per parameter, in that diagram earlier. If you had the
30:00 full 16-bit parameters for the 32 billion parameter LLM, there's no way it could
30:06 fit on your Mac or your 3090 GPU, but we can use quantization to make it
30:10 possible. It's like rounding a number that has a long decimal to something
30:15 like 10.44 instead of something that has like 10 decimal places, but we're
30:19 doing it for each of the billions of parameters, those numbers that we have.
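The arithmetic behind that is simple: one billion parameters at one byte per parameter is about 1 GB of weights, so the precision directly sets the model's footprint. A quick sketch (weights only; context and overhead come on top):

```python
# Weights-only size of a model at a given precision.
# 1 billion parameters at 1 byte each ≈ 1 GB (ignoring overhead).
def weight_gigabytes(params_billion: float, bits_per_param: int) -> float:
    bytes_per_param = bits_per_param / 8
    return params_billion * bytes_per_param

for bits in (16, 8, 4, 2):
    print(f"32B model at {bits}-bit: {weight_gigabytes(32, bits):.0f} GB")
# A 32B model needs ~64 GB at FP16 but only ~16 GB at Q4,
# which is how it can fit on a 24 GB card like a 3090.
```

This is exactly the claim in the video: the full 16-bit 32B model has no chance on a 24 GB GPU, while the Q4 version fits.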
30:23 And so just to give you a visual representation of this, you can also
30:27 quantize images just like you can quantize LLMs. And so we have our full
30:31 scale image on the left-hand side here, comparing it to different levels of
30:35 quantization. We have 16-bit, 8-bit, and 4-bit. And you can see that at first with
30:40 a 16-bit quantization, it almost looks the same. But then once we go down to
30:44 4-bit, you can very much see that we have a huge loss in quality for the image.
30:49 Now with images, it's more extreme than with LLMs. When we do an 8-bit or 4-bit
30:54 quantization of an LLM, we don't actually lose that much performance the way we
30:58 lose a lot of quality with images. And so that's why it's so useful for us. And so I have
31:01 a table just to kind of describe what this looks like. So FP16, that's the
31:07 16-bit precision that all LLMs have as a base. That is the full size. The speed
31:11 is obviously going to be very slow because the model is a lot bigger, but
31:16 your quality is perfect compared to what it could be. I mean, obviously that
31:18 doesn't mean that you're going to get perfect answers all the time. I'm just
31:22 saying it's the 100% results from this LLM. And then going down to a Q8
31:28 precision, so it's half the size. The speed is going to be a lot better. And
31:33 the quality is near-perfect. So it's not like performance is cut in half just
31:37 because size is. You still have the same number of parameters. Each one is just a
31:42 bit less precise. And so you're still going to get almost the same results.
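You can see the same effect on a single number. This toy function snaps a value onto an n-bit grid over a fixed range, loosely analogous to (though much simpler than) how real weight quantization works:

```python
# Toy illustration: snap a value onto an n-bit grid between lo and hi.
# Real LLM quantization is block-wise and more sophisticated than this.
def quantize(value: float, bits: int, lo: float = -1.0, hi: float = 1.0) -> float:
    levels = 2 ** bits - 1                  # number of steps on the grid
    step = (hi - lo) / levels
    return lo + round((value - lo) / step) * step

x = 0.123456
for bits in (8, 4, 2):
    print(f"{bits}-bit: {quantize(x, bits):+.4f}")
```

At 8 bits the value barely moves; at 2 bits the rounding error gets large, mirroring the quality drop described in the table.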
31:47 And then going down to a Q4, 4-bit, it's a fourth the size. It's going to be very
31:52 fast compared to 16 bit. And the quality is still going to be great. Now, these
31:57 numbers are very vague on purpose. There's not really a way for me to
32:01 quantify exactly the difference, especially because it changes per LLM
32:05 and your hardware and everything like that. So, I'm just being very general
32:09 here. And then once you get to Q2, the size goes down a lot. It's going to
32:13 be very very fast, but usually your performance starts to go down quite a
32:17 bit once you go down to a Q2. And then like the note that I have in the bottom
32:22 left here, a Q4 quantization is generally the best balance. And so when
32:26 you are thinking to yourself, which large language model should I run? What
32:31 size should I use? My rule of thumb is to pick the largest large language model
32:37 that can work with your hardware with a Q4 quantization. That is why I assumed
32:42 that in the table earlier. And then also, like we saw in Ollama earlier, it always
32:47 defaults to a Q4 quantization, because the 16-bit is just so big compared to Q4
32:52 that most of the LLMs you couldn't even run yourself. And a Q4 of a 32 billion
32:59 parameter model is still going to be a lot more powerful than the full 7
33:02 billion or 14 billion parameter model because you don't actually
33:07 lose that much performance. So that is quantization. So just to make this very
33:11 practical for you, I'm back here in the model list for Qwen 3. We have all these
33:15 models that don't specify a quantization, but we can see that it
33:20 defaults to Q4 because if I click on any one of them, the quantization right here
33:26 is Q4_K_M. And don't worry about the K_M. That's just a way to group
33:30 parameters. You have K_S, K_M, and K_L. It's kind of outside of the scope of
33:33 what really matters for you. The big thing is the Q4, the actual number
33:38 here. So Q4 quantization is the default for Qwen 3 32B and really any model in
33:44 Ollama. But if we want to see the other quantized variants and we want to run
33:48 them, you can click on the view all. This is available no matter the LLM that
33:52 you're seeing in Ollama. Now we can scroll through and see all the levels of
33:56 quantization for each of the parameter sizes for Qwen 3. So, if I scroll all
34:01 the way down, the absolute biggest version of Qwen 3 that I can run is the
34:08 full 16-bit of the 235 billion parameter Qwen 3. And it is a whopping 470 GB just
34:14 to install this. And there is no way that you're ever going to lay hands on
34:17 infrastructure to run this unless you're working for a very large enterprise. But
34:22 I can go down here, let's say, to 14 billion parameters and I can run the Q4
34:27 like this. So, you can click on any one that you want to run. Like let's say I
34:30 want to run Q8. I can click on this and then I have the command to pull and run
34:35 this specific quantization of the 14 billion parameter model. So each of the
34:39 quantized variants have a unique tag within Ollama, so you can very
34:42 specifically choose the one that you want. Again, my general recommendation is
34:47 just to go with what Ollama recommends, which is just defaulting to
34:51 Q4. Like if I go to DeepSeek R1, you can see that it also defaults to Q4 no matter
34:56 the size that I pick. But if you do want to explore different quantizations, you
35:00 want to try to run the absolute full model for maybe something smaller like 7
35:04 billion or 14 billion parameters, you can definitely do that through Ollama and
35:08 really any other provider of local LLMs. So that is everything for quantization.
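The rule of thumb from the video, pick the largest model whose Q4 fits your VRAM, can be sketched as a tiny helper. The 1.2x headroom factor is my own guess to leave room for context, not a measured number.

```python
# Rule-of-thumb helper: largest model whose Q4 weights fit in VRAM,
# with a guessed 1.2x headroom factor to leave room for context.
SIZES_B = [7, 8, 14, 32, 70]  # parameter counts mentioned in the video

def largest_q4_that_fits(vram_gb: float, headroom: float = 1.2):
    fitting = [b for b in SIZES_B if b * 0.5 * headroom <= vram_gb]
    return max(fitting) if fitting else None

print(largest_q4_that_fits(24))  # a 24 GB card like a 3090
```

For a 24 GB card this lands on the 32B class, matching the 3090 recommendation earlier in the video.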
35:12 It's important to know how that works, but yes, generally stick with a Q4 of
35:17 the largest LLM that you can run. The next concept that is very important to
35:21 understand is offloading. Offloading is just splitting the layers of your
35:26 large language model between your GPU and your CPU and RAM. It's kind of
35:30 crazy, but large language models don't have to fit entirely in your GPU. All
35:36 large language models can be split into layers, layers of the different weights,
35:40 and you can have some of it running on your GPU. So, it's stored in your VRAM
35:45 and computed by the GPU. And then some of the large language model is stored in
35:50 your RAM, computed by the CPU. Now, this does hurt performance a lot. And so,
35:55 generally, you want to avoid offloading if you can. You want to be able to fit
35:59 everything in your GPU, which by the way, the context, like your prompts for
36:04 your local LLMs, that is also stored in VRAM. And so, sometimes you'll see what
36:08 happens when you have very long conversations with a large language model
36:13 that barely fits in your GPU. That'll actually tip it over the edge. So, it
36:16 starts to offload some of it to the CPU and RAM. So keep that in mind when you
36:19 have longer conversations and all of a sudden things get really slow, you know
36:24 that offloading is happening. Sometimes this is necessary though as context
36:28 grows. And if you're only offloading a little bit of the LLM or a little bit of
36:32 the conversation, whatever, to the CPU and RAM, it won't affect performance
36:36 that much. And so sometimes if you're trying to squeeze the biggest size you
36:41 can into your machine for an LLM, you can take advantage of offloading to run
36:45 something bigger or have a much larger conversation. Just know that usually it
36:49 kind of sucks. Like when I have offloading start to happen, my machine
36:53 gets bogged down and the responses are a lot slower. It's really not fun, but it
36:59 is possible. And fun fact, by the way, if your GPU is full and your CPU and RAM
37:04 are full, you can actually offload to storage, like literally using your hard
37:07 drive or SSD. That's when it's like incredibly slow and just terrible. But
37:11 just fun fact, you can actually do that. Now, the very last thing that I want to
37:15 cover before we dive into some code, setting up the local AI package, and
37:21 building out some agents is a few very crucial parameters, environment
37:25 variables for Ollama. So, these are environment variables that you can set
37:28 on your machine just like any other based on your operating system. And
37:32 Ollama does have an FAQ for setting up some of these things, which I'll link to
37:36 in the description as well. But yeah, these are a bit more technical, so
37:40 people skip past setting this stuff up a lot, but it's actually really, really
37:44 important to make things very efficient when running local LLMs. So the first
37:49 environment variable is flash attention. You want to set this to one or true.
37:54 When you have this set to true, it's going to make the attention calculation
37:59 a lot more efficient. It sounds fancy, but basically large language models when
38:04 they are generating a response, they have to calculate which parts of your
38:08 prompt to pay the most attention to. That's the calculation. And you can make
38:12 it a lot more efficient without losing much performance at all by setting up
38:16 the flash attention, setting that to true. And then for another optimization,
38:21 just like we can quantize the LLM itself, you can also quantize or
38:27 compress the context. So your system prompt, the tool descriptions, your
38:31 prompt and conversation history, all that context that's being sent to your
38:36 LLM, you can quantize that as well. So Q4 is my general recommendation for
38:41 quantizing LLMs. Q8 is the general recommendation for quantizing the
38:46 context memory. It's a very simplified explanation, but it's really, really
38:50 useful because a long conversation can also take a lot of VRAM, just like a larger
38:55 LLM. And so it's good to compress that. And then the third environment variable,
38:58 this is actually probably the most crucial one to set up for Ollama. There
39:02 is this crazy thing. I don't know why Ollama does it, but by default, they
39:07 limit every single large language model to 2,048 tokens for the context limit,
39:13 which is just tiny compared to, you know, Gemini being 1 million tokens and
39:17 Claude being 200,000 tokens. Like, they handle very, very large prompts. And a
39:21 lot of local large language models can also handle large prompts. But Ollama
39:25 will limit you by default to 2,048 tokens. And so you have to override that
39:30 yourself with this environment variable. And so generally I recommend starting
39:34 with about 8,000 tokens to start. You can move this all the way up to
39:38 something like 32,000 tokens if your local large language model supports
39:42 that. And if you view the model page on Ollama, you can see the context length
39:46 that's supported by the LLM. But you definitely want to, you know, jack this
39:50 up from just 2,048 because a lot of times when you have longer
39:53 conversations, you're going to get past 2,048 tokens very, very quickly. So, do
39:57 not miss this. If your large language model is starting to go completely off
40:02 the rails and ignore your system prompt and forget that it has these tools that
40:06 you gave it, it's probably because you reached the context length. And so, just
40:10 keep that in mind. I see people miss this a lot. And then the very last
40:14 environment variable, probably the least important out of all these four,
40:18 but if you're running a lot of different large language models at once and you're
40:22 trying to shove them all in your GPU, a lot of times you can have issues. And so
40:25 in Ollama, you can limit the number of models that are allowed to be in your
40:29 memory at a single time. With this one, typically you want to set this to either
40:33 one or two. Definitely set this to just one if you are using large language
40:37 models that basically fill your GPU. Like it's going to fit exactly into
40:41 your VRAM and you're not going to have room for another large language model.
40:44 But if you are running smaller ones and maybe you could actually fit two on
40:48 your GPU with the VRAM that you have, you can set this to two. So again, more
40:52 technical overall, but it's very important to have these right. And we'll
40:55 get into the local AI package where I already have these set up in the
40:59 configuration. And then by the way, this is the Ollama FAQ that I referenced a
41:02 minute ago that I'll have linked in the description. And so there's actually a
41:06 lot of good things to read into here, like being able to verify that your GPU
41:10 is compatible with Ollama, or how you can tell if the model's actually loaded on
41:13 your GPU? So, a lot of like sanity check things that they walk you through in the
41:17 FAQ as well. Also talking about environment variables, which I just
41:20 covered. And so, they've got some instructions here depending on your OS
41:23 how to get those set up. So, if there's anything that's confusing to you, this
41:26 is a very good resource to start with. So, I'm trying to make it possible for
41:30 you to look into things further if there's anything that doesn't quite make
41:33 sense for what I explained here. And of course, always let me know in the
41:35 comments if you have any questions on this stuff as well, especially the more
41:39 technical stuff that I just got to cover because it's so important even though I
41:43 know we really want to dive into the meat of things, which we are actually
41:47 going to do now. All right, here is everything that we have covered at this
41:50 point. And congratulations if you have made it this far because I covered all
41:55 the tricky stuff with quantization and the hardware requirements and offloading
41:59 and some of our little configuration and parameters. So, if you got all of that,
42:03 the rest of it is going to be a walk in the park as we start to dive into code,
42:07 getting all of our local AI set up and building out some agents. You understand
42:10 the foundation now that we're going to build on top of to make some cool stuff.
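Before moving on, the four Ollama environment variables from the previous section can be collected in one place. The variable names here follow Ollama's FAQ at the time of writing, so double-check them for your Ollama version, and note that they have to be set in the environment of the Ollama server process itself, not in your agent.

```python
import os

# The four Ollama environment variables discussed earlier. Names follow
# Ollama's FAQ at the time of writing; verify them for your version, and
# note they must be set where the Ollama server runs, not in your agent.
os.environ["OLLAMA_FLASH_ATTENTION"] = "1"      # efficient attention calculation
os.environ["OLLAMA_KV_CACHE_TYPE"] = "q8_0"     # quantize context memory to Q8
os.environ["OLLAMA_CONTEXT_LENGTH"] = "8192"    # raise the small default context
os.environ["OLLAMA_MAX_LOADED_MODELS"] = "1"    # one model in VRAM at a time
```

In the local AI package covered later, these are already wired into the configuration for you.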
42:15 And so, now the next thing that we're going to do is talk about how we can use
42:19 local AI anywhere. We're going to dive into OpenAI compatibility and I'll show
42:23 you an example. We can take something that is using OpenAI right now,
42:27 transform it into something that is using Ollama and a local LLM. So, we'll
42:31 actually dive into some code here. And I've got my fair share of no code stuff
42:35 in this master class as well, but I want to focus on both because I think it's
42:38 really important to use both code and no code whenever applicable. And that
42:42 applies to local AI just like building agents in general. So, I've already
42:45 promised a couple of times that I would dive into OpenAI API compatibility, what
42:50 it is, and why it's so important. And we're going to dive into this now so you
42:54 can really start to see how you can take existing agents and transform them into
42:59 being 100% local with local large language models without really having to
43:03 touch the code or your workflow at all. It is a beautiful thing because OpenAI
43:10 has created a standard for exposing large language models through an API.
43:14 It's called the chat completions API. It's kind of like how model context
43:19 protocol MCP is a standard for connecting agents to tools. The chat
43:23 completions API is a standard for exposing large language models over an
43:28 API. So you have this common endpoint along with a few other ones that all of
43:35 these providers implement. This is the way to access the large language model
43:39 to get a response based on some conversation history that you pass in.
43:43 So, Ollama has implemented this as of February. We have other providers too:
43:49 Gemini is OpenAI compatible, Groq is, and so is Open Router, which we saw earlier.
43:53 Almost every single provider is OpenAI API compatible. And so, not only is it
43:57 very easy to swap between large language models within a specific provider, it's
44:02 also very easy to swap between providers entirely. You can go from Gemini to
44:09 OpenAI, or OpenAI to Ollama, or OpenAI to Groq, just by changing basically one
44:13 piece of configuration pointing to a different base URL as it is called. So
44:18 you can access that provider and then the actual API endpoint that you hit
44:22 once you are connected to that specific provider is always the exact same and
44:26 the response that you get back is also always the exact same. And so Ollama has
44:31 this implemented now. And I'll link to this article in the description as well
44:33 if you want to read through this because they have a really neat Python example.
44:37 It shows where we create an OpenAI client and the only thing we have to do
44:42 to connect to Ollama instead of OpenAI is change this base URL. So now we are
44:47 pointing to Ollama that is hosted locally instead of pointing to the URL for
44:51 OpenAI. So we'd reach out to them over the internet and talk to their LLMs. And
44:55 then with Ollama, you don't actually need an API key because everything's running
44:58 locally. So you just need some placeholder value here. But there is no
45:02 authentication that is going on. You can set that up. I'm not going to dive into
45:05 that right now. But by default, because it's all just running locally, you don't
45:09 even need an API key to connect to Ollama. And then once we have our OpenAI
45:13 client set up that is actually talking to Ollama, not OpenAI, we can use it in
45:18 exactly the same way. But now we can specify a model that we have downloaded
45:22 locally already through Ollama. We pass in our conversation history in the same way
45:27 and we access the response, the content the AI produced, the token usage,
45:31 like all those things that we get back from the response in the same way.
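Concretely, the contract looks like this. Here is a stdlib-only sketch where only the base URL differs per provider; the endpoint path and request body shape are identical everywhere (the model name is just an example):

```python
import json

# The only provider-specific piece is the base URL; the endpoint path
# and the JSON body shape are the same everywhere.
PROVIDERS = {
    "openai": "https://api.openai.com/v1",
    "ollama": "http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
}

def chat_request(provider: str, model: str, messages: list):
    """Build the (url, json_body) for a chat completions call."""
    url = f"{PROVIDERS[provider]}/chat/completions"
    body = json.dumps({"model": model, "messages": messages})
    return url, body

url, _ = chat_request("ollama", "qwen3:14b",
                      [{"role": "user", "content": "Hello!"}])
print(url)  # http://localhost:11434/v1/chat/completions
```

Swap `"ollama"` for `"openai"` and only the URL changes; the rest of the request is untouched, which is exactly why providers are interchangeable.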
45:34 They've got a JavaScript example as well. They have a couple of examples
45:38 using different frameworks like the Vercel AI SDK and AutoGen. Really any
45:44 AI agent framework can work with OpenAI API compatibility to make it very easy
45:47 to swap between these different providers. Pydantic AI, my favorite
45:52 AI agent framework, also supports OpenAI API compatibility. So you can easily
45:57 within your Pydantic AI agents swap between these different providers. And
46:02 so what I have for you now is two code bases that I want to cover. The first
46:07 one is the local AI package, which we'll dive into in a little bit. But right
46:13 now, we have all of the agents that we are going to be creating in this master
46:17 class. So I have a couple for N8N that are also available in this repository.
46:21 And then a couple of scripts that I want to share with you as well. And so the
46:25 very first thing that I want to show you is this simple script that I have called
46:31 OpenAI compatible demo. And so you can download this repository. I'll have this
46:34 linked in the description as well. There's instructions for downloading and
46:38 setting up everything in here. And this is all 100% local AI. And so with that,
46:43 I'm going to go over into my windsurf here where I have this OpenAI compatible
46:47 demo set up. So I've got a comment at the top reminding us what the OpenAI API
46:52 compatibility looks like. We set our base URL to point to Ollama hosted
46:59 locally and it's hosted on port 11434 by default. So I can actually show you
47:02 this. I have Ollama running in a Docker container, which we're going to dive
47:05 into when we set up the local AI package, but you can see that it is
47:11 being exposed on port 11434. And by the way, you can see the
47:14 127.0.0.1 in that URL that I have highlighted here, that is synonymous with localhost.
47:21 And so this right here, you could also replace with 127.0.0.1.
47:25 Just a little tidbit there. It's not super important. I just typically leave
47:28 it as localhost. And then you can change the port as well. I'm just sticking to
47:32 what the default is. And then again, we don't need to set our API key. We can
47:36 just set it to any value that we want here. We just need some placeholder even
47:39 though there is no authentication with Ollama for real unless you configure
47:43 that. So that's OpenAI compatibility. And the important thing with this script
47:47 here is I have two different configurations here. I have one for
47:51 talking to OpenAI and then one for Ollama. So with OpenAI, we set our base
47:57 URL to point to api.openai.com. We have our OpenAI API key set in our
48:01 environment variables. So you can just set all your environment variables here
48:05 and then rename this to .env. I've got instructions for that in the readme of
48:08 course. And then going back to the script, we are using GPT-4.1 Nano for our
48:13 large language model. That's something super fast and cheap. And then for our
48:17 Ollama configuration, we are setting the base URL here, localhost:11434,
48:22 or just whatever we have set in our environment variables. Same thing for
48:26 the API key. And then same thing for our large language model. And what I'm going
48:31 to be using in this case is Qwen 3 14B. That is one of the large language models
48:34 that I showed you within the Ollama website. Definitely a smaller one
48:38 compared to what I could run, but I just want to run something fast. And very
48:41 small large language models are great for simple tasks like summarization or
48:45 just basic chat. And that's what I'm going to be using here just for a simple
48:49 demo. And so whether it's enabled or not, this configuration is just based on
48:53 what we have set for our environment variables. And the important thing here
48:59 is the code that runs for each of these configurations just as we go through
49:03 this demo is exactly the same. We are parameterizing the configuration for the
49:08 base URL and API key. So we are setting up the exact same OpenAI client just
49:13 like we saw in the Olama article but just changing the base URL and API key.
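That parameterized-config pattern can be sketched like this. This is a dependency-free stand-in: the env var names and the dict returned by make_client are illustrative, and the real script builds an actual OpenAI client where the comment indicates:

```python
import os
from dataclasses import dataclass

@dataclass
class LLMConfig:
    base_url: str
    api_key: str
    model: str

# Hypothetical env var names for illustration; the repo's .env may differ.
OPENAI = LLMConfig(
    base_url="https://api.openai.com/v1",
    api_key=os.getenv("OPENAI_API_KEY", ""),
    model="gpt-4.1-nano",
)
OLLAMA = LLMConfig(
    base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434/v1"),
    api_key="ollama",  # placeholder; Ollama has no auth by default
    model="qwen3:14b",
)

def make_client(cfg: LLMConfig):
    # With the real openai package this would be:
    #   from openai import OpenAI
    #   return OpenAI(base_url=cfg.base_url, api_key=cfg.api_key)
    # Returned as a dict here so the sketch stays dependency-free.
    return {"base_url": cfg.base_url, "api_key": cfg.api_key}

client = make_client(OLLAMA)
print(client["base_url"])
```

Everything downstream of make_client is identical for both configs; only the LLMConfig you pass in changes.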
49:17 And so then, for example, when we use it right here, it's
49:22 client.chat.completions.create, calling the exact same function no matter if we're
49:26 using OpenAI or Ollama. And then we're handling the response in the same way as
49:31 well. And so I'll go back to my terminal now. And so I went through all the steps
49:34 already to set up my virtual environment, install all of my dependencies. And so now I can run the
49:40 command OpenAI compatible demo. And now it's going to present the two
49:43 configuration options for me. And so I can run through OpenAI. So we'll go
49:46 ahead and do that first. And these two demos are going to look exactly the
49:50 same, but that is the point. And so we have our base URL here for OpenAI. We
49:54 have a basic example of a completion with GPT-4.1 Nano. There we go. So this
49:59 is the model that was used. Here are the number of tokens. And this is our
50:03 response. And then I can press enter to see a streaming response now as well. So
50:07 we saw it type out our answer in real time. And then I can press enter one
50:10 more time. This is the last part of the demo: a multi-turn conversation.
50:14 So we got a couple of messages here in our conversation history. So very nice
50:19 and simple. The point here is to now show you that I can run this and select
50:23 Ollama now instead and everything is going to look exactly the same and all
50:27 of the code is the same as well. It is only our configuration that is
50:31 different. And so it will take a little bit when you first run this because
50:36 Ollama has to load the large language model into your GPU. And so going to the
50:42 logs for Ollama, I can show you what this looks like here. And so when we first
50:47 make a request when Qwen 3 14B is not loaded into our GPU yet, you're going to
50:52 see a lot of logs come in here. And you'll have this container up and
50:54 running when you have the local AI package which we'll cover in a little
50:57 bit. So it shows all the metadata about our model, like it's Qwen 3 14B. We can
51:05 see here that we have a Q4_K_M quantization like we saw on the Ollama
51:09 website. What other information do we have here? There's just so much to
51:14 digest here. Yeah, another really important thing is we have the
51:19 context length. I have that set to 8,192 just like I recommended in the
51:22 environment variables. And then we can see that we offloaded all of the layers
51:26 to the GPU. So I don't have to do any offloading to the CPU or the RAM. I can
51:30 keep everything in the GPU, which is certainly ideal, like I said, to make
51:34 sure this is actually fast. And then when we get a response from Qwen 3 14B,
51:41 we are calling the v1/chat/completions endpoint because it is OpenAI API
51:46 compatible. So that exact endpoint that we hit for OpenAI is the one that we are
51:50 hitting here with a large language model that is running entirely on our computer
51:54 in Ollama. And so the response I get back, it's actually a reasoning LLM as
51:58 well. So we even have the thinking tokens here, which is super cool. And so
52:02 we got our response. It's just printing out the first part of it here just to
52:04 keep it short. And then I can press enter. And we can see a streaming demo
52:08 as well. And it's going to be a lot faster this time because we do already
52:11 have the model loaded into our GPU. And so that first request when it first has
52:15 to load a model is always the slower one. And then it's faster going forward
52:19 once that model is already loaded in our GPU. And then as long as we don't swap
52:24 to another large language model and use that one, then it will remain in our GPU
52:28 for some time. And so then all of our responses after are faster. And then we
52:33 just have the last part of our demo here with a multi-turn conversation. So we
52:37 can see conversation history in action as well, just not with streaming here.
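The multi-turn part boils down to maintaining a growing list of role/content messages and re-sending the whole list on each request, since the chat completions API itself is stateless. A minimal sketch:

```python
# Conversation history in the chat completions format: a list of
# role/content messages that the client re-sends every request.
history = [{"role": "system", "content": "You are a helpful assistant."}]

def add_turn(history, user_msg, assistant_msg):
    """Append one user turn and the assistant's reply to the history."""
    history.append({"role": "user", "content": user_msg})
    history.append({"role": "assistant", "content": assistant_msg})
    return history

add_turn(history, "What is 2+2?", "4")
add_turn(history, "And doubled?", "8")  # the model saw the prior turns
print(len(history))  # 5 messages: 1 system + 2 user + 2 assistant
```

This shape is identical whether the list is sent to OpenAI or to Ollama, which is why the demo's multi-turn code doesn't change between providers.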
52:40 And everything's a bit slower with this large language model because
52:43 it is a reasoning one. And so certainly, if you want faster
52:47 inference, you can always use a non-reasoning local LLM like Mistral or
52:52 Gemma for example. So that is our very simple demo showing how this works. I
52:55 hope that you can see with this and again this works with other AI agent
52:59 frameworks like Agno or Pydantic AI or CrewAI as well. They all work in
53:03 this way where you can use OpenAI API compatibility to swap between providers
53:08 so easily, so you don't have to recreate things to use local AI. And that's
53:11 something so important that I want to communicate with you because if I'm the
53:15 one introducing you to local AI I also want to show you how it can very easily
53:19 fit into your existing systems and automations. All right. Now, we have
53:23 gotten to the part of the local AI master class that I'm actually the most
53:27 excited for because over the past months, I have very much been pouring my
53:31 heart and soul into building up something to make it infinitely easier
53:35 for you to get everything up and running for local AI. And that is the local AI
53:40 package. And so, right now, we're going to walk through installing it step by
53:44 step. I don't want you to miss anything here because it's so important to get
53:47 this up and running, get it all working well. Because if you have the local AI
53:51 package running on your machine and everything is working, you don't need
53:55 anything else to start building AI agents running 100% offline and
53:59 completely private. And so here's the thing. At this point, we've been
54:04 focusing mostly on Ollama and running our local large language models. But there's
54:08 the whole other component to local AI that I introduced at the start of the
54:13 master class for our infrastructure: things like our database and local and
54:18 private web search, our user interface, agent monitoring. We have all these
54:23 other open-source platforms that we also want to run along with our large
54:27 language models and the local AI package is the solution to bring all of that
54:32 together curated for you to install in just a few steps. So, here is the GitHub
54:37 repository for the local AI package. I'll have this linked in the description
54:41 below. Just to be very clear, there are two GitHub repos for this master class.
54:45 We have this one that we covered earlier. This has our N8N and Python
54:49 agents that we'll cover in a bit, as well as the OpenAI compatible demo that
54:53 we saw earlier. So, you want to have this cloned and the local AI package as
54:57 well. Very easy to get both up and running. And if you scroll down in the
55:02 local AI package, I have very comprehensive instructions for setting
55:06 up everything, including how to deploy it to a private server in the cloud,
55:10 which we'll get into at the end of this master class, and a troubleshooting
55:13 section at the bottom. So, everything that I'm about to walk you through here,
55:17 there's instructions in the readme as well if you just want to circle back to
55:21 clarify anything. Also, I dive into all of the platforms that are included in
55:26 the local AI package. And this is very important because like I said, when you
55:30 want to build a 100% offline and private AI agent, it's a lot more than just the
55:35 large language model. You have all of the accompanying infrastructure like
55:39 your database and your UI. And so I have all that included. First of all, I have
55:44 N8N, that is our low/no-code workflow automation platform. We'll be building
55:48 an agent with N8N in the local AI package in a little bit once we have it
55:52 set up. We have Supabase for our open-source database. We have Ollama. Of
55:56 course, we want to have this in the package as well for our LLMs. Open Web
56:01 UI, which gives us a ChatGPT-like interface for us to talk to our LLMs and
56:06 have things like conversation history. Very, very nice. So, we're looking at
56:09 this right here. This is included in the package. Then we have Flowise. It's
56:13 similar to N8N. It's another really good tool to build AI agents with no/low
56:18 code. Qdrant, which is an open-source vector database. Neo4j, which is a
56:25 knowledge graph engine. And then SearXNG for open-source, completely free and
56:31 private web search. Caddy, which is going to be very important for us once
56:35 we deploy the local AI package to the cloud and we actually want to have
56:38 domains for our different services like N8N and Open WebUI. And then the last
56:42 thing is Langfuse. This is an open-source LLM engineering platform. It helps
56:47 us with agent observability. Now, some of these services are outside of the scope
56:52 for this local AI master class. I don't want to spend a half hour on every
56:56 single one of these services and make this a 10-hour video. I will be focusing
57:02 in this video on N8N, Supabase, Ollama, Open WebUI, SearXNG, and then Caddy once
57:08 we deploy everything to the cloud. So, I do cover like half of these services.
57:12 And the other thing that I want to touch on here is that there are quite a few
57:16 things included here. And so you do need about 8 GB of RAM on your machine or
57:21 your cloud server to run everything. It is pretty big overall. And so you can
57:26 remove certain things like if you don't want Qdrant and Langfuse for example,
57:30 you can take those out of the package. More on that later. It doesn't have to
57:34 be super bloated, you can whittle this down to what you need. But yeah, there's
57:37 a lot of different things that go into building AI agents. And so I have all of
57:40 these services here so that no matter what you need, I've got you covered. And
57:44 so with that, we can now move on to installing the local AI package. And
57:48 these instructions will work for you on any operating system, any computer. Even
57:52 if you don't have a really good GPU to run local large language models, you
57:56 still could always use OpenAI or Anthropic, something like that, and then
58:00 run everything else locally to save on costs or just to have everything running
58:04 on your computer. And so there are a couple of prerequisites that you have to
58:08 have before you can do the instructions below. You need Python so you can run
58:12 the start script that boots everything up. Git or GitHub desktop so you can
58:16 clone this GitHub repository, bring it all onto your own machine. And then you
58:21 want Docker or Docker Desktop. And so I've got links for all of these. Docker
58:25 and Docker Desktop we need because all of these local AI services that I've
58:29 curated for you, they all run as individual Docker containers that are
58:34 all combined together in a stack. And so I'll actually show you this is the end
58:36 result once we have everything up and running within your Docker Desktop. You
58:41 have this local AI Docker Compose stack that has all of the services running in
58:45 tandem, like Supabase and Redis and N8N and Flowise, Caddy, Neo4j. All of
58:50 these are running within this stack. That is what we're working towards right
58:54 now. And so make sure you have all these things installed. I've got links that'll
58:57 take you to installing no matter your operating system. Very easy to get all
59:01 of this up and running on your machine. Then we can move on to our first command
59:05 here, which is to clone this GitHub repository, bringing all of this code on
59:10 your machine so you can get everything running. And so you want to open up a
59:14 new terminal. So I've got a new PowerShell session open here. Going to
59:18 paste in this command. And I'm going to be doing this completely from scratch
59:22 with you. So you clone the repo and then I'm just going to change my directory
59:26 into local AI package, which was just created from this git clone command. So
59:31 those are the first two steps. The next thing is we have to configure all of our
59:36 environment variables. And believe it or not, this is actually the longest part
59:41 of the process. And once we have this taken care of, it's a breeze getting the
59:44 rest of this up and running. But there's a lot of configuration that we have to
59:49 set up for our different services like credentials for logging into our
59:54 Supabase dashboard or Neo4j, things like our Supabase anonymous key and
59:59 private key. All these things we have to configure. And so within our terminal
60:04 here, you can do code dot to open this within VS Code or Windsurf. Open this in
60:09 Windsurf. You just want to open up this folder within your IDE, and the specific
60:14 IDE that you use really doesn't matter. You just want to get to this .env.example
60:20 here. I'm going to copy it and then I'm going to paste it. And then I'm going to
60:24 rename this to .env. So we're taking the example, turning it into a .env file. So,
60:31 you want to make sure that you copy it and rename it like this. Then we can go
60:36 ahead and start setting all of our configuration. And I'll even zoom in on
60:39 this just so that it's very easy for you to see everything that we are setting up
60:44 here. So, first up, we have a couple of credentials for N8N. We have our
60:49 encryption key and our JWT secret. And it's very easy to generate these. In
60:53 fact, we'll be doing this a couple of times, but we'll use this OpenSSL
60:58 command to generate a random 32-character alphanumeric string that
61:02 we're going to use for things like our encryption key and JWT secret. And so,
61:08 OpenSSL is a command that is available for you by default on Linux and Macs.
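If you'd rather skip OpenSSL entirely, Python's stdlib can generate an equivalent secret. This mirrors the Python alternative mentioned below, though the exact one-liner in the README may differ:

```python
import secrets

# Roughly equivalent to `openssl rand -hex 16`: 16 random bytes,
# printed as 32 hex characters.
key = secrets.token_hex(16)
print(key)       # a different 32-character hex string every run
print(len(key))  # 32
```

Run it once per secret you need (encryption key, JWT secret, and so on) so each value is independent.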
61:12 You can just open up any terminal and run this command and it'll spit out a
61:16 long string that you can then just paste in for this value. For Windows, you
61:20 can't just open up any terminal and use OpenSSL, but you can use Git Bash, which
61:26 is going to come with GitHub Desktop when you install it. And so, I'll go
61:29 ahead and just search for that. If you just go to your search bar on your
61:32 bottom left on Windows and search for Git Bash, it's going to open up this
61:37 terminal like this. And so, I can go ahead and copy this command, go in here,
61:42 and paste it in. And then I can run it. And then, boom, there we go. This is I
61:45 know it's really small for you to see right now. I'm going to go ahead and
61:48 copy this because this is now the value that I can use for my encryption key.
61:52 And then you want to do the exact same thing to generate a JWT secret. And then
61:57 the other way that you can do this if you don't want to install git bash or
62:01 it's not working for whatever reason, you can use Python to generate this as
62:05 well. So I can just copy this command and then I can go into the terminal here
62:10 and I can just paste this in. And so it's going to just like with OpenSSL
62:15 generate this random 32 character string that I can copy and then use for my JWT
62:20 secret. There we go. And so I am going to get in the weeds a little bit here
62:24 with each of these different parameters, but I really want to make sure that I'm
62:27 clear on how to set up everything for you so you can really walk through this
62:31 step by step with me. And like I said, setting up the environment variables is
62:35 the longest part by far for getting the local AI package set up. So if you bear
62:39 with me on this, you get through this configuration, you will have everything
62:43 running that you need for local AI for the LLMs and your infrastructure. So
62:47 that's everything for N8N. Now we have some secrets for Supabase. And there
62:53 are some instructions in the Supabase documentation for how to get some of
62:57 these values. So it's this link right here, which I have open up on my
63:01 browser. So we'll reference this in a little bit here. But first, we can
63:05 set up a couple of other things. The first thing we need to define is our
63:11 Postgres password. So, Supabase uses Postgres under the hood for the
63:14 database. And so, we want to set a password here that we'll use to connect
63:19 to Postgres within N8N or a connection string that we have for our Python code,
63:23 whatever that might be. And this value can be really anything that you want.
63:26 Just note that you have to be very careful about using special characters like
63:31 percent symbols. So if you ever have any issues with Postgres, it's probably
63:36 because you have special characters that are throwing it off. That's something
63:39 that I've seen happen quite a few times. And so like I said, I want to mention
63:42 troubleshooting steps and things to make sure that it is very clear for you. So
63:47 for this Postgres password here, I'm just going to say test Postgres pass.
63:51 I'm just going to give some kind of random value here. Just end with a
63:54 couple of numbers. I don't care that I'm exposing this information to you because
63:58 this is a local AI package. These passwords are for services that never
64:02 leave my computer. So, it's not like you could hack me by connecting to anything
64:08 here. And then we have a JWT secret. And this is where we get into this link
64:13 right here in the Supabase docs. And so they walk you through generating a JWT
64:18 secret and then using that to create both your anonymous and your service
64:22 role keys. If you're familiar with Supabase at all, we need both of these
64:27 pieces of information. The anonymous key is what we share to our front end. This
64:30 is our public key. And then the service role key has all permissions for
64:33 Supabase. We'll use this in our backends for things like our agents. And
64:39 so you can just go ahead and copy this JWT secret.
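For the curious, those anon and service_role keys are just JWTs, HS256-signed tokens whose payload names the role, which is why they both start with "ey". A rough stdlib sketch (the claim fields are simplified; use the generator in the Supabase docs for real keys):

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_key(jwt_secret: str, role: str) -> str:
    """Build a minimal HS256 JWT carrying the given role.
    Simplified sketch: real Supabase keys also carry iss/iat/exp claims."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps({"role": role}).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = b64url(hmac.new(jwt_secret.encode(), signing_input,
                          hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

anon_key = make_key("your-32-char-jwt-secret", "anon")
print(anon_key[:2])  # ey
```

Because the header always base64-encodes to the same prefix, every JWT starts with "ey"; the role claim and signature are what make the anon and service_role keys differ toward the end.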
64:43 And then you can paste this in right here. This is 32 characters long just
64:47 like the things that we generated with OpenSSL. I'm just going to be using
64:52 exactly what Supabase tells me to. And then what you can do with this is you
64:56 can select the anonymous key. Click on generate JWT and then I can copy this
65:02 value and then I will paste this for my anonymous token. And so I'm just
65:06 replacing the default value there for the anonymous key. And then going back
65:10 and selecting the service key, I'm going to generate that one as well. So it
65:13 looks very similar. They'll always start with ey, but these values are different
65:18 if you go towards the end. And so I'll go ahead and paste this for my service
65:22 role key. Boom. There we go. All right. And then for the Supabase dashboard
65:27 that we'll log into to see our tables and our SQL editor and authentication
65:31 and everything like that, we have our username here, which I'm just going to
65:35 keep as supabase. And then for the password, I can just say test supabase
65:39 pass. I'll just kind of use that as my common nomenclature here for my
65:42 passwords cuz I don't really care what that is right now. And then the last
65:45 thing that we have to set up is our pooler tenant ID. And it's not really
65:49 important to dive into what exactly this means. Just know that you can set this
65:52 to really anything that you want. Like I typically will just choose four digits
65:57 here like 1000 for my pooler tenant ID. So that is everything that we need for
66:01 Supabase. And actually most of the configuration is for Supabase. Then we
66:06 have Neo4j. This is really simple. You can leave Neo4j for the username and
66:11 then I'll just say test Neo4j pass for my password here. So you just set the
66:15 password for the knowledge graph, and even if you're not using Neo4j you still have
66:19 to set this but yeah it just takes two seconds. Then we have langfuse. This is
66:23 for agent observability. We have a few secrets that we need here. And for these
66:28 values they can really just be whatever you want. It doesn't matter because
66:31 these are just passwords just like we had passwords for things like Neo4j.
66:35 So I can just say test ClickHouse pass. And then I can do test MinIO pass. And
66:43 I mean it really doesn't matter here. Random Langfuse salt. I'm just doing
66:47 completely whack values here. You probably want something more secure in
66:51 this case, but I'm just doing something as a placeholder for now.
66:56 Yeah, there we go. Okay, good. And then the last thing that we need
66:59 for Langfuse is an encryption key. And this is also generated with OpenSSL like
67:04 we did for the N8N credentials. And so I'll go back to my git bash terminal.
67:08 And again, you can do this with Python as well. I'll just run the exact same
67:12 command. I'll get a different value this time. And so I'll go ahead and copy
67:16 that. You could technically use the same value over and over if you wanted to,
67:20 but obviously it's way more secure to use a different value for each of the
67:24 encryption keys that you generate with OpenSSL. So there we go. That is our
67:28 encryption key. And that is actually everything that we have to set up for
67:32 our environment variables when we are just running the local AI package on our
67:37 computer. Once we deploy it to the cloud and we actually want domains for our
67:41 different services like Open WebUI and N8N, then we'll have to set up Caddy. So
67:45 this is where we'll dive into domains and we'll get into this at the end of
67:49 the master class here. But everything past this point for environment
67:54 variables is completely optional. You can leave all of this exactly as it is
67:59 and everything will work. Most of this is just extra configuration for
68:03 Supabase. So, Supabase is definitely the biggest service that's included in
68:08 this list of curated services for you. And so, there's a lot of
68:11 different configuration things you can play around with if you want to dive
68:15 more into this. You can definitely look at the same documentation page that we
68:19 were using for the Supabase secrets. And so, you can scroll through this if
68:22 you want to learn more, like setting up email authentication or Google
68:27 authentication, diving more into all of those different configuration things
68:32 for Supabase if you want to dive more into that. I'm not going to get into all
68:36 of this right now because the core of getting Supabase up and running we
68:40 already have taken care of with the credentials that we set up at the top
68:44 right here. These are just the base things, and so that's
68:47 what we'll stick to right now. So that is everything for our environment
68:51 variables. So then going back to our readme now which I have open directly in
68:55 Windsurf now instead of my browser. We have finished our configuration and I do
68:59 have a note here that you want to set things up for Caddy if you're deploying
69:03 to production. Obviously we're doing that later not right now like I said and
69:07 so with that we are good to start everything. Now before we spin up the
69:12 entire local AI package there is one thing that I want to cover. It's
69:14 important to cover this before we run things. If you don't want to run
69:19 everything in the package cuz it is a lot like maybe you only want to use half
69:22 of these services and you don't want Neo4j and Langfuse and Flowwise right
69:28 now. There are two options that you have. The easiest one right now is to go
69:33 into the docker compose file. This is the main file where all of the services
69:38 are curated together and you can just remove the services that you don't want
69:42 to include. So, for example, if you don't want Qdrant right now, cuz it is
69:46 actually one of the larger services. It's like 600 megabytes of RAM just
69:50 having this running, you can search for Qdrant, and you can just go ahead and
69:54 delete this service from the stack like that. Boom. Now I don't have Qdrant.
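For reference, a service entry in the Compose file looks roughly like this (the image and volume names here are illustrative; check the actual docker-compose.yml):

```yaml
services:
  qdrant:
    image: qdrant/qdrant
    restart: unless-stopped
    volumes:
      - qdrant_storage:/qdrant/storage   # named volume persists the data

volumes:
  qdrant_storage:
```

To remove a service cleanly, delete both its block under `services:` and its entry under `volumes:`.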
69:59 It won't spin up as a part of the stack anymore. And then also I have a volume
70:03 for Qdrant. So, you can remove that as well. Volumes, by the way, is how we are
70:08 able to persist data for these containers. So if we tear down
70:12 everything and then we spin it back up, we still are going to have our open web
70:17 UI conversations and our N8N workflows, everything in Superbase, like all that
70:21 is still going to be saved because we're storing it all in volumes. So we can do
70:25 whatever the heck we want with these containers. We can tear them down. We
70:28 can update them, which I'll show you how to do later. We can spin it back up. And
70:32 all of our data will always be persisted. So you don't have to worry
70:35 about losing information. And you can always back things up if you want to be
70:39 really secure, but I've never done that before and I've been updating this
70:42 package for months and months and months and all of my workflows from 6 months
70:46 ago are still there. I haven't lost anything. And so that's just a quick
70:50 caveat there for how you can remove services if you want. And then another
70:54 thing that we don't have available yet, but I'm very excited to, you know, kind
70:58 of talk about this right now. It's in beta right now. We are creating it, me and
71:03 one other guy that's actually on my Dynamist team, Thomas. He's got a
71:07 YouTube channel as well. He's a great guy. We're working together on this.
71:09 He's actually been putting in most of the work creating a front-end
71:13 application for us to manage our local AI package. And one of the big things
71:17 with this is that we're going to make it possible for you to toggle on and off
71:22 the services that you want to have within your local AI package. So you can
71:27 very much customize the package to the services that you want to run. So you
71:31 can keep it lightweight just to the things you care about. Also, we'll be
71:34 able to manage environment variables and monitor the containers. Not all of this
71:38 is up and running at this point, but this is in beta. We're working on it.
71:41 I'm really excited for this. So, not available yet, but once this is
71:44 available, this will be a really good way for you to customize the package to
71:47 your needs. So, you don't have to go and edit the docker compose file directly.
71:51 So, that's something that I just wanted to get out of the way now. But, we can
71:57 start and actually execute our package now. Get all these containers up and
72:02 running. So the command that you run to start the local AI package is different
72:07 depending on your operating system and the hardware that you have. So for
72:13 example, if you are an Nvidia GPU user, you want to run this start_services.py
72:18 script. This boots up all of the containers and you want to specifically
72:23 pass in the profile of gpu-nvidia. This is going to start Ollama in a way where the
72:29 Ollama container is able to leverage your GPU automatically. And then if you are
72:34 using an AMD GPU and you're on Linux, then you can run it this way. Which by
72:38 the way, unfortunately, if you have an AMD GPU on Windows, you aren't able to
72:46 run Ollama in a container. And it's the same thing with Mac computers.
72:49 Unfortunately, like you see right here, you cannot expose your GPU to the Docker
72:55 instance. And so if you have an AMD GPU on Windows or are running on Mac, you cannot
73:01 run Ollama in the local AI package. You just have to install it on your own
73:04 machine like I already showed you in this master class and then you'll just
73:08 run everything else through the local AI package and they can actually go out to
73:12 your machine and communicate with Ollama directly. So just a small limitation for
73:17 Mac and AMD on Windows. But if you're running on Linux or an Nvidia GPU on
73:21 Windows like I'm using, then you can go ahead and run this command right here.
73:27 So if you can't run a GPU in the Ollama container, then you can always just
73:32 start in CPU mode or you can run with a profile of none. This will actually make
73:36 it so that Ollama never starts in the local AI package. So you can just
73:40 leverage the Ollama that you have already running on your computer like I showed
73:43 you how to install already. So, just a couple of small caveats that I really
73:46 want to hit on there. I need to make sure that you're using the right
73:51 command. And so, in my case, I'm Nvidia on Windows. So, I'm going to copy this
73:55 command. Go back over into my terminal. I'll just clear it here. So, we have a
73:59 blank slate. And I'll paste in this command. And so, it's going to do quite
74:02 a few things initially. First, it's going to clone the Supabase repository
74:07 because Supabase actually manages the stack in a separate place. And so, we
74:10 have to pull that in. Then there's some configuration for SearXNG for our local
74:17 and private web search. And then I have a couple of warnings here saying that
74:20 the Flowise username and password are not set. By the way, if
74:24 you want to set the Flowise username and password, it's optional, but you can
74:29 do that if I scroll down right here. So you can set these values, those will
74:32 actually make those warnings go away, but you can also ignore them, too. So
74:35 anyway, I just wanted to mention that really quickly. But now what's happening
74:39 here is it starts by running all of the Supabase containers. And so there's
74:44 quite a bit that goes into Supabase, like I said. So we're running all of
74:47 that. It's getting all that spun up. And then once we run all of these, it's
74:51 going to move on to deploying the rest of our stack. And if you're running this
74:55 for the very first time, it will take a while to download all of these images.
74:59 They're not super small. There's a lot of infrastructure that we're starting up
75:03 here. And so it'll take a bit. You just have to be patient. Maybe go grab your
75:06 coffee or make your next meal, whatever that is. And then everything will be up
75:09 and running once you are back. And so yeah, now you can see that we are
75:13 running the rest of the containers here. Um, and so we'll just wait for that to
75:16 be done. And then I'll show you what that looks like in Docker Desktop as
75:19 well. And so I'll give it a second here just to finish. Uh, looks like my
75:24 terminal glitched a little bit. Like I was scrolling and so it kind of broke it
75:27 a bit. But anyway, everything is up and running now. It'll look like this where
75:30 it'll say all of the containers are healthy or running or started. And then
75:34 if I go into Docker Desktop and I expand the local AI Compose stack, you want to
75:39 make sure that you have a green dot for everything except for the Ollama pull and
75:45 N8N import. These just run once initially and then they go down because
75:49 they're responsible for pulling some things for our local AI package. And so
75:53 yeah, I've got green dots for everything except for two right here. Now I'm
75:57 leaving this in here intentionally actually because there is a bug with
76:02 Supabase specifically if you are on Windows. So you'll see this issue where
76:08 the Supabase pooler is constantly restarting and that also affects N8N
76:12 because N8N relies on the Supabase pooler. So it's constantly restarting as
76:17 well. If you see this problem, I actually talk about this in the
76:21 troubleshooting section of the readme. If you scroll all the way down, if the
76:24 Supabase pooler is restarting, you can check out this GitHub issue. And so I
76:29 linked to this right here, and he tells you exactly which file you want to
76:33 change. It's this one right here: docker/volumes/pooler/pooler.exs.
76:39 And you need to change the file to use LF line endings. And so I'll show you what I mean
76:43 by that. I'll show you exactly how to do this. It's like a super tiny random
76:47 thing, but this has tripped up so many people. So I want to include this
76:51 explicitly in the master class here. So you want to go within the supabase
76:56 folder, within docker/volumes, and then it's within pooler, and then we have
77:00 pooler.exs, and basically no matter your IDE, you can see the CRLF in the bottom right here.
77:08 You want to click on this and then change it to LF and then make sure that
77:13 you save this file. Very easy to fix that. And then what you can do is you
77:19 can run the exact same command to spin everything up again. And so I'm going to
77:22 do this now. It's going to go through all the same steps. It'll be faster this
77:25 time because you already have everything pulled. And this, by the way, is how you
77:28 can just restart everything really quickly if you want to enforce new
77:31 environment variables or anything like that. So I want to include that
77:35 explicitly um for that reason as well. And I'll go ahead and close out of this.
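If you'd rather script that line-ending fix than click around in an IDE, the swap is just a byte replacement. A minimal Python sketch; the commented path is illustrative, matching where the file lives in the package:

```python
from pathlib import Path

def crlf_to_lf(path: Path) -> None:
    """Rewrite a file in place with Unix (LF) line endings."""
    path.write_bytes(path.read_bytes().replace(b"\r\n", b"\n"))

# Example (illustrative path):
# crlf_to_lf(Path("supabase/docker/volumes/pooler/pooler.exs"))
```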
77:39 And while this is all restarting, the other thing that I want to show you
77:42 in the readme is I also have instructions for upgrading the containers in the local AI package. So
77:49 when N8N has an update or Supabase has an update, it is your responsibility
77:53 because you're managing the infrastructure to update things yourself. And so you very simply just
77:58 have to run these three commands to update everything. You want to tear down
78:03 all of the containers and make sure you specify your profile like GPU Nvidia and
78:09 then you want to pull all of the latest containers and again specifying your
78:13 profile. And then once you do those two things, you'll have the most up-to-date
78:17 versions of the containers downloaded. So you can go ahead and run the start
78:21 services with your profile just like we just did to restart things. Very easy to
78:25 update everything. And even though we are completely tearing down our
78:29 containers here before we upgrade them, we aren't losing any information because
78:33 we are persisting things in the volumes that we have set up at the top of our
78:37 Docker Compose stack. And so this is where we store all of our data in our
78:41 database and workflows. All these things are persisted. So we don't have
78:45 to worry about losing them. Very easy to upgrade things and you still get to keep
78:48 everything. You don't have to make backups and things like that unless you
78:54 just want to be ultra ultra safe. So now we can go back to our Docker desktop and
79:00 we've got green dots for everything now since we fixed that pooler.exs issue.
79:04 The only things that we don't have green dots for are the N8N import and then we
79:08 have our Ollama pull as well because, like I said, those are the two things that
79:11 just have to run at the beginning and then they aren't ongoing processes like
79:16 the rest of our services. So, we have everything up and running. And if there
79:22 is anything that is a white dot besides Ollama pull or N8N import or if there's
79:27 anything that is constantly restarting, just feel free to post a comment and
79:31 I'll definitely be sure to help you out. And then also check out the
79:34 troubleshooting section as well. One thing that I'll mention really quick is
79:37 sometimes your N8N will constantly restart and it'll say something like the
79:42 N8N encryption key doesn't match what you have in the config. And the big
79:46 thing to keep in mind for that is you want to make sure that you set this
79:51 value for the encryption key before you ever run it for the first time.
79:53 Otherwise, it's going to generate some random default value and then if you
79:56 change this later, it won't match with what it expects. And so, yeah, my big
79:59 recommendation is like make sure you have everything set up in your
80:03 environment variables before you ever run the start services for the first
80:08 time. This should be run once you have your environment variables set up.
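To make that concrete, here's a small sketch of a pre-flight check you could run before the first start. The key names are illustrative; consult the package's .env.example for the actual variables:

```python
# Illustrative key names -- check the package's .env.example for the
# real variables (n8n encryption key, Postgres password, etc.).
REQUIRED_KEYS = ["N8N_ENCRYPTION_KEY", "POSTGRES_PASSWORD"]

def missing_env(env: dict) -> list:
    """Return required keys that are unset or empty, so you can set them
    before services generate mismatched defaults on first run."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

# Example: missing_env(dict(os.environ)) before running start_services.py
```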
80:11 Otherwise, you risk any of these services creating default values that
80:14 then wouldn't match with the keys and things that you set up later. And so
80:18 with that, we can now go into our browser and actually explore all of
80:22 these local AI services that we have running on our computer now. Now over in
80:26 our browser, we can start visiting the different services that we have spun up.
80:30 Like here is N8N. You just have to go to localhost port 5678. It'll have you
80:35 create a local account when you first visit it. And then you'll have this
80:38 workflow view that should look very familiar to you if you have used N8N
80:41 in the past. And then we have Open Web UI at localhost port 8080. This is our
80:47 ChatGPT-like interface where we can directly talk to all of the models that we have
80:52 pulled in our Ollama container. Really, really neat. And then we have localhost
80:57 port 8000 for our Supabase dashboard. The sign-in definitely isn't pretty
81:00 compared to the managed version of Supabase. But once you enter in your
81:04 username and password that you have set for the environment variables for the
81:07 dashboard, then you have the very typical view where we have our tables
81:11 and we've got our SQL editor. Everything that you're familiar with with
81:14 Supabase. And that's the key thing with all these different services. They all
81:18 will look pretty much the exact same for you. Like another one, for example,
81:25 if I go to localhost port 3000, we have Langfuse. This is for agent
81:28 observability and monitoring. And this is something I'm not going to dive into
81:31 in this master class. Like I said, I'm not covering all the services. But yeah,
81:34 I just want to show that like every single one of these pretty much you can
81:39 access in your browser. And by the way, the way that we know the specific port
81:43 to access for each of these services is by taking a look at either what it tells
81:48 us in Docker Desktop. So like we can see that Neo4j is, let's see, port
81:53 7474. For SearXNG, it's port 8081. For Flowise, it's port 3001. What's one
82:01 that we've seen already? Yeah, like Open Web UI is port 8080. So,
82:06 the port on the left is the one that we access in our browser. And then the port
82:11 on the right is what's mapped on the container. So, when we visit port 8080
82:16 on our computer, that goes into port 8080 on the container. And that's what
82:21 we have exposed. The other way that you can see the port that you need to use is
82:24 just by taking a look at this docker compose file. And you don't need to have
82:29 like a super good understanding of this docker compose file. But if you want to
82:33 customize your stack or even help me by making contributions to the local AI
82:36 package, this is the main place to make changes. And so for example, I can go
82:41 down to Flowise and I can see that the port is 3001. Or if I go down to let's say N8N, we can
82:49 see that the port is 5678. And so the port is always going to be
82:52 there somewhere in the service that you have set up. Like for the Langfuse
82:56 worker, it's 3030. That's more of a behind-the-scenes kind of service. But
82:59 let me just find one more example for you here. Yeah, like Redis for
83:04 example is 6379. So you can see the ports in the Docker Compose as well. I
83:07 just want to call it out just to at least get you a little bit comfortable
83:11 and familiar with the Docker Compose file in case you want to customize
83:14 things. But the main thing is just leveraging what you see here in Docker
83:18 Desktop. Last thing in Docker Desktop really quickly, if you want to bring
83:21 more local large language models into the mix, you can do it without having to
83:25 restart anything. You just have to find the Ollama container in the Docker
83:29 Compose stack. Head on over to the exec tab. And now here we can run any
83:33 commands that we'd want. We're directly within the container here. And we can
83:36 use Ollama commands just like we did earlier on our host machine. And so for
83:40 example, I ran ollama list already. So I can see the large language models that
83:43 have already been pulled in my Ollama container. If I want to pull more, I can
83:48 just do ollama pull and then find that ID for the model I want to use on the Ollama
83:52 website. And like I said, you don't have to restart anything. If I pull it here,
83:56 it's now in the container and I can immediately start using it in Open Web
84:00 UI or N8N. We'll see that in a little bit. And so that's just really important
84:03 because a lot of times you're going to want to start to use different large
84:06 language models and you don't want to have to restart anything. The ones that
84:11 are brought into the machine by default are determined by this line right
84:15 here. So if you want to change the ones that are pulled by default, I just have
84:20 Qwen 2.5 7B Instruct, a really small, lightweight one that I have
84:23 brought into your Ollama container by default. If you want to add in
84:27 different ones, you can just update this line right here to include multiple
84:32 ollama pulls. And so that way you can bring in Qwen 3 or Mistral Small 3.1,
84:36 whatever you want. This is just the one I have by default. And then all the
84:40 other ones that you saw in my list here, I've pulled myself. All right. Now that
84:45 we have the local AI package up and running, it is time to build some
84:50 agents. Now, we get to use our local AI package to actually build out an
84:53 application. And so, I'm going to start by introducing you to Open Web UI, and
84:58 we'll use it to talk to our Ollama LLM. So, we have an application kind of right
85:02 out of the box for us. Then I'll dive into building a local AI agent with N8N,
85:08 even connecting it to Open Web UI. So we have this custom agent that we built in
85:12 N8N and then we immediately have a really nice UI to chat with it. And then
85:16 we'll transition to Python building the exact same agent in Python as well. Like
85:21 I said, I want to focus on both no code and code to really make this a complete
85:24 master class so that whether you want to build with N8N or Python, you can see
85:28 how to connect to our different services that we have running locally like
85:32 Supabase and SearXNG and Open Web UI. So, we'll cover all of that and then I'll
85:36 get into deployments after this. But yeah, let's go ahead right now focus on
85:40 open web UI and building out some agents. So, back over in Open Web UI,
85:45 remember this is localhost port 8080. You want to set up your connection to
85:49 Ollama so we can start talking with our local LLMs in this nice interface. And
85:54 so bottom left, go to the admin panel, then go to settings and then the
85:58 connections tab. Here we can set up our connections both to OpenAI with our API
86:02 key, which we're not going to do right now, but then also the Ollama API. This
86:07 is what we want to set up. Now, usually by default, this value is just
86:12 localhost. And this is actually wrong. This is something that is so important
86:16 to understand. And this will apply when we set up credentials in N8N and Python
86:20 as well. When you are within a container, localhost means that you are
86:26 referencing still within the container. Open Web UI needs to reach out to the
86:32 Ollama container, not itself. So localhost is not correct here. This is
86:36 generally the default just because Open Web UI assumes that you're running on
86:39 your machine, and so then you would also have Ollama running on your machine. So
86:42 localhost usually works when you're outside of containers. But here we have
86:47 to change this. This is super important to get right. And so there are two
86:51 options we have. If you are running on a Mac or AMD on Windows and you want to
86:56 use Ollama running on your machine, not within a container, then you want to use
87:00 host.docker.internal. This is the way in Docker to tell the container to look outside to the host
87:07 machine, where you're running the containers and running Ollama
87:11 separately. Very important to know that. And then if you are running Ollama in the
87:16 container like I am doing. I have Ollama running in my Docker Desktop. You want
87:21 to change this to ollama. You're specifically calling out the service
87:26 that is running the Ollama container in your Docker Compose stack. And the way
87:30 that we know that this is the name specifically is because we just go back
87:35 to our all-important Docker Compose file. Ollama. So whenever there's an x and a
87:40 dash prefix, you just ignore that. It's just the thing after it. So, ollama is the name
87:45 of our service running the container. And then if we wanted to connect to
87:49 something else like Flowise, flowise is the name of the service. Open Web UI,
87:55 it's open-webui. All of these top-level keywords, these are the names when we
87:59 want our containers to be talking to each other. And all of this is possible
88:03 because they are within the same Docker network. And so I'll just show you that
88:06 so you know what I'm talking about here. If I go back to Docker Desktop, we have
88:11 this local AI Compose stack. All of these containers can now communicate
88:14 internally with each other by referencing the names like redis or
88:19 searxng. So, we'll be seeing that a lot when we're building out our agents as
88:22 well. So, I wanted to spend a couple minutes to focus on that. And so, you
88:25 can go ahead and click on save in the very bottom right. I know my face is
88:28 covering this right now, but you have a save button here. Make sure you actually
88:32 do that. And for this API key, I don't know why it's asking me to fill it
88:34 out. I don't really care about connecting to OpenAI. So I'll just put
88:37 some random value there and click save. And then boom, there we go. We are good.
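The addressing rule we just applied, plain localhost on the host, the service name between containers, and host.docker.internal from a container out to the host, can be sketched like this (service name ollama and default port 11434 as discussed):

```python
def ollama_base_url(caller_in_container: bool, ollama_in_compose: bool) -> str:
    """Base URL an app should use to reach Ollama on its default port.

    - Caller on the host itself: plain localhost.
    - Container -> Ollama container on the same Compose network: service name.
    - Container -> Ollama installed on the host (Mac, AMD on Windows):
      host.docker.internal.
    """
    if not caller_in_container:
        return "http://localhost:11434"
    if ollama_in_compose:
        return "http://ollama:11434"
    return "http://host.docker.internal:11434"
```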
88:40 And then a lot of times with open web UI, it also helps to refresh otherwise
88:43 it doesn't load the models for some reason. So I just did a refresh of the
88:49 site here. Control F5. And then now we can select all of the local LLMs that we
88:53 have pulled in our Ollama container. And so for example, I can do Qwen 2.5 7B.
88:57 That's the one that I just have by default. I can say hello. And it's going
89:02 to take a little bit because it has to load this model onto my GPU just like we saw
89:06 with Qwen 3 earlier. But then in a second here we'll get a response. And
89:10 there are actually multiple calls that are being done here. We have one to get
89:15 our response, one to get a title for our conversation on the left-hand side. And
89:19 then also if you click on the three dots here, you can see that it created a
89:23 couple of tags for this conversation. So couple of things that are fired off all
89:26 at once there. And I can test conversation history. What did I just
89:30 say? So yeah, I mean everything's working really well here. We have chat
89:34 history, conversation history on the left-hand side. There's so much that we
89:37 get out of the box. And so I wanted to show you this really quickly. Now we can
89:41 move on to building an agent in N8N. And I'll even show you how to connect it to
89:46 Open Web UI as well through this N8N agent connector. Really exciting stuff.
89:49 So let's get right into it. So I'm going to start really simple here by building
89:53 a basic agent. The main thing that I want to focus on is just connecting to
89:57 our different local AI services. So I am going to assume that you have a basic
90:01 knowledge of N8N here because this is not an N8N master class. And so I'm
90:04 starting with a chat trigger so we can talk to our agent directly in the UI.
90:08 We'll connect this to open web UI in a bit as well. And then I want to connect
90:14 an AI agent node. And so what we want to do is connect Ollama for the chat model and
90:18 then local Supabase for our conversation history, our agent memory.
90:22 And so for the chat model I'm going to do the Ollama chat model. I'm going to create
90:26 brand new credentials. You can see me do this from scratch. The URL that you want
90:32 for the base URL is exactly the same as what we just entered into open web UI.
90:35 And so if you are running Ollama on your host machine, like with an AMD GPU on Windows or
90:40 you are running on a Mac or you just don't want to run the Ollama container,
90:46 then it is host.docker.internal. And then if you are referencing the
90:49 Ollama container, we just reference ollama. That's the name of the service
90:53 running the Ollama container in our stack. And then the port is 11434 by
90:57 default. And you can test this connection. So it'll do a quick ping to
91:01 the container to make sure that we are good to go. And I'll even show you what
91:04 that looks like. So right here in my Ollama container, I have the logs up. And
91:09 the last two requests were just a simple get request to the root endpoint. We
91:13 have two of those right here. And if I click on retry and I go back to the
91:18 logs, boom, we are at three now. So it made three requests. So it's just making
91:22 that simple ping each time to make sure the container is available. And so I'm
91:26 going to go ahead and click on save and then close out. So now we have our
91:29 credentials and then we can automatically select the model that we
91:33 have loaded now in our container. And so just to keep things really lightweight,
91:36 I'm going to go with the 7 billion parameter model right now from Quen 2.5.
91:40 Cool. All right. So that is everything that we need to connect Ollama. It is
91:44 that easy. And then we could even test it right now. So, I'm going to go ahead
91:47 and save this workflow. And I'm going to just say hello. And uh we don't need the
91:52 conversation history or tools or anything at this point. We're already
91:55 getting a response here from the LLM. It's working on loading the model into
91:59 my GPU as we speak. And so there we go. We got our answer looking really good.
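By the way, the credential test we ran a moment ago is nothing magic: it's just an HTTP GET against Ollama's root endpoint. Roughly, in Python:

```python
import urllib.request

def ping_ollama(base_url: str = "http://ollama:11434") -> bool:
    """Return True if Ollama answers a GET on its root endpoint,
    mimicking the simple ping the credential test performs."""
    try:
        with urllib.request.urlopen(base_url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False
```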
92:04 Cool. So now we can add memory as well. So I'm going to add Postgres because,
92:08 remember, Supabase uses Postgres under the hood. And then I'm going to create
92:12 brand new credentials here. And this is actually probably the hardest one to set
92:16 up out of all of the credentials for connecting to our local AI service. And
92:20 so I'm going to show you what the Docker Compose file looks like just that it's
92:24 clear how I'm getting these different values. And so I'll point out all of
92:28 them. So the first one, our host, is db because this is the name of the
92:35 specific Supabase service that we have that is the underlying Postgres
92:38 database. And I can show you how I got that really quick. If you go to the
92:42 supabase folder that we pull when we run that start_services script, I go to
92:47 docker and then docker-compose. If I search for db, there are quite a few
92:53 dependencies on db here. So let me find the actual reference to it. Where is db?
92:58 Here we go. So yeah, it's really short. db is the name of our service that
93:03 actually is the Supabase DB. So this is the container name, and this is what
93:07 you'll see in docker desktop. But then this is the underlying service that we
93:10 want to reference when we have our containers communicating with each
93:14 other. Like in this case we have our N8N container talking to our Supabase
93:18 database container. And then the database and username are both going to
93:22 be postgres. Those are the values that we have by default. If you scroll down a
93:26 bit in the .env you can see these right here. The Postgres database is
93:29 postgres and the user is also postgres. And you can customize these
93:33 things, but these are some of the optional parameters that I didn't touch
93:36 in the setup with you. And so you can just leave those as is. Now the
93:41 Postgres password, this is one that we set. That was the first
93:44 Supabase value that we set there. Make sure you have that match what you have in
93:48 the .env. And then everything else you can kind of leave as the defaults here. The port is
93:53 going to be 5432. So that is everything for setting up our connection to
93:57 Postgres. You can test this connection as well. And then we can move on to
94:01 adding in some tools and things like that as well. But yeah, this is like the
94:06 very first basic version of the agent that I wanted to show you. And hopefully
94:10 with this you can see how no matter the service that you have running in the
94:13 local AI package, it's very easy to figure out how to connect to it, both
94:17 with the help of N8N because N8N always makes it really easy to connect to
94:20 things. Then also just knowing that like you just have to reference that service
94:25 name that we have for the container in the Docker Compose stack. That's how we
94:28 can talk to it. So you could add in Qdrant or you could add in Langfuse.
94:31 Like you can connect anything that you want into our agent here. And so now we
94:36 have conversation history. Next up, I want to show you how to build a bit more
94:40 of a complicated agent with N8N using some tools. And then also I'm going to
94:44 show you how to connect it to Open Web UI. And so right now this is a live
94:47 demo. Instead of connecting to one of the Ollama LLMs, I'm going straight to
94:53 N8N. I have this custom N8N agent connector. And so we are talking to this
94:57 agent that I'll show you how to build in a little bit. This one has a tool to use
95:02 SearXNG for local and private web search. This is one of the platforms that we
95:06 have included in the local AI package. And so this response is going to take a
95:10 little bit here because it has to search the web. And the response that it
95:13 generates with this question is pretty long. Like there we go. Okay. So we got
95:16 the answer. It's pretty long. But yeah, we are able to search the internet now
95:21 with a local agent. N8N connected to open web UI. We're getting pretty fancy
95:25 here. And we also have the title that was generated on the left. And then we
95:30 have the tags here as well. And so the way that this all works, I'm going to
95:33 start by explaining how we can connect N8N to Open Web UI. And this is just
95:37 crucial. Makes it so easy for us to test agents locally as we are developing
95:42 them. And so if you go to the settings and the admin panel in the bottom left
95:47 and go to functions, open web UI has this thing called functions which gives
95:51 us the ability to add in custom functionality kind of as like custom
95:56 models that we can then use like you saw with the N8N agent connector. And so
96:02 what I have here is this thing that I call the N8N pipe. And I'll have a link
96:05 to this in the description as well. I created this myself and I uploaded it to
96:10 the open web UI directory of functions. And so you can go to this link right
96:14 here. You can even just Google the N8N pipe for open web UI. And then you click
96:19 on this get button. It'll just have you enter in the URL for your open web UI.
96:23 So I can just like paste in this right here. Click on import to open web UI and
96:28 it'll automatically redirect you to your open web UI instance. So you'll have
96:33 this function now. And we don't have to dive into the code for how all
96:36 of this works. I worked pretty hard to create this for you, actually quite a
96:40 while ago. And the thing that we need to care about is
96:44 configuring this to talk to our N8N agent. And so if you click on the
96:49 valves, the setting icon in the top right, there are a few values that we
96:54 have to set. And so now I'm going to go over to showing you how to build things
96:57 in N8N. Then all of this will click and it'll make sense. Right now, looking at
97:00 these values, you're probably like, how the heck do I get all of these? But
97:02 don't worry, we'll dive into all of that. But first, let's go into our N8N
97:07 agent. I'll explain how all of this works. So, first of all, we have our
97:12 chat trigger that gives us the ability to communicate with our agent very
97:16 easily in the workflow. We have a new trigger now for the web hook. And so,
97:22 this is turning our agent into an API endpoint. So, we're able to talk to it
97:27 with other services like open web UI. And so to configure the web hook here,
97:31 you want to make sure that it is a post request type. And then you can define a
97:35 custom path here. Whatever you set here is going to determine what our URL is.
97:40 So we have our test URL. And then also if you toggle the workflow to active,
97:44 this is really important. The workflow in N8N does have to be active. Then you
97:49 have access to this production URL. And this is actually the first value that we
97:54 need to set within the valves for this open web UI function. We have our N8N
97:59 URL. And because this is a container talking to another container, we don't
98:03 actually want to use this localhost value that it has here for us. We want
98:08 to specify n8n because n8n again is the name of the service running the N8N
98:13 container in our Docker Compose stack. So n8n, port 5678. And then this is the
98:18 custom URL that we can determine based on this. And then the other thing that
98:23 we want to do is set up header authentication. We don't want to expose
98:27 this endpoint without any kind of security. And so we want to set up some
98:31 authentication. And so you can select Header Auth from the authentication
98:34 dropdown. And then for the credentials here, I'll just create brand new ones to
98:38 show you what this looks like. The name needs to be Authorization with a capital
98:43 A. This has to be very specific. The name in the top left, the name of
98:46 your credentials, can be whatever you want, but this has to be
98:51 authorization. And then for the value here, the way that we want to format this
98:55 is "Bearer", then a space, and then whatever you want your bearer token to be.
99:00 So the token is what you get to define, but the value needs to start with
99:05 "Bearer", capital B, and a space. And then whatever you type after "Bearer "
99:09 goes in as the N8N bearer token in the valves. So there you don't include the
99:13 "Bearer " prefix, because it's just assumed that the value is going to be
99:16 prefixed with that. So you just type in something like "test auth", which is what
99:21 I have. So my header value is "Bearer test auth", and the token alone is what I
99:24 enter in for the valve field. Now, I already have mine set up, so I'm just going to
99:27 go ahead and close out of this. And then the last thing that we have to set up
99:30 for the web hook. And don't worry, this is the node that we spend the most time
99:33 with. You want to go to the dropdown here and change this to respond using
99:38 the Respond to Webhook node. Very important, because then at the end of our
99:40 workflow and we get the response from our agent, we're going to send that back
99:45 to whatever requested our API which is going to be open web UI in this case.
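To pull the webhook pieces together, here's a small Python sketch of both sides of that exchange. It is illustrative only: the path /webhook/invoke-n8n-agent, the field names chatInput and sessionId, and the "test auth" token are assumptions mirroring the values described in this walkthrough, not guaranteed defaults.

```python
import json
import urllib.request

def build_invoke_request(base_url: str, token: str, chat_input: str,
                         session_id: str) -> urllib.request.Request:
    """Build the POST that a client (like Open Web UI) sends to the N8N webhook."""
    body = json.dumps({"chatInput": chat_input, "sessionId": session_id}).encode()
    return urllib.request.Request(
        base_url + "/webhook/invoke-n8n-agent",  # hypothetical production path
        data=body,
        headers={
            "Content-Type": "application/json",
            # The header value is "Bearer", a space, then your token.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )

def verify_bearer(header_value: str, expected_token: str) -> bool:
    """Receiving side: accept only the exact 'Bearer <token>' string."""
    return header_value == f"Bearer {expected_token}"

req = build_invoke_request("http://n8n:5678", "test auth", "Hi there!", "session-1")
# You would pass `req` to urllib.request.urlopen(req) to actually call the agent.
```

Note that the client sends the full "Bearer test auth" string, while the N8N credential and the valve each store only the token after the space, which is exactly the mismatch people trip over.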
99:48 And so that's everything for our configuration for the web hook. Now the
99:53 next thing that we have to do is we have to determine is open web UI sending in a
99:58 request to get a response for our main agent or is it just looking to generate
100:03 that conversation title or the tags for our conversation? Because like we were
100:06 looking at earlier, I'm going to close out of this for now and go back to a
100:10 conversation, our last conversation here. We get our main response, but then
100:15 also there is a request to an LLM to create a very simple title for our
100:19 conversation and the tags that we can see in the top right. And so our N8N
100:24 workflow actually gets invoked three separate times for just the first
100:30 message in a new conversation. And so we need to determine, are we getting a main
100:34 response? Like should we go to our main agent or should we just go to a simple
100:39 LLM that I have set up here to help generate the tags or title? And so the
100:44 way that we can determine that is whenever Open Web UI is requesting
100:47 something like a title for a conversation, it always prefixes the
100:53 prompt with three pound symbols, a space, and then the word task. And so we
100:58 can key off of this. If the prompt starts with this, and that prompt just
101:02 is coming in from our web hook here. If it does start with it, then we're just
101:06 going to go to this simple LLM. We're just going to be using Qwen 2.5 14B
101:11 instruct. We have no tools, no memory or anything like our main agent because
101:14 we're just very simply going to generate that title or the tags. And I can even
101:19 show you in the execution history what that looks like. So in this case, we
101:22 have our web hook that comes in. The chat input starts with the triple pound
101:28 and "Task". And so, sure enough, we deem it to be a metadata request, as
101:32 I'm calling it. And so it then goes down to this LLM that is just
101:36 generating some text here. We just have this JSON response with the tags for the
101:41 conversation, technology, hardware, and gaming. So we're asking about the price
101:45 of the 5090 GPU. And then we do the exact same thing to also generate the
101:51 title GPU specs. And so exactly what we see here is the title of this last
101:55 conversation. So I hope that makes sense. And then, if it doesn't start with
101:59 "Task" and the triple pound, then it's actually our request, and we go to our
102:03 main agent. We don't want our main agent to have to handle those super simple
102:06 tasks. You can also just use a really tiny LLM. Like, this would be the perfect
102:11 case to actually use a super tiny LLM, even something like DeepSeek R1 1.5B,
102:15 because it's just such a simple task. Otherwise, though, we are going to
102:20 go to our main agent. And so I'm not going to dive into like all these nodes
102:24 in a ton of detail, but basically we are expecting the chat input to contain
102:30 the prompt for our agent. And the way that we know to expect chat input
102:35 specifically is because going back to the settings for the function here with
102:38 the valves, we are saying right here chat input. So you want to make sure
102:43 that the value that you put in here for input matches exactly with what you are
102:48 expecting from our web hook. And so chat input is the one that I have by default.
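That routing check is simple enough to sketch in a few lines of Python. Treat the exact "### Task" prefix string and the chatInput field name as assumptions to verify against your own Open Web UI version; they follow the pattern described above.

```python
# Three pound symbols, a space, then the word "Task" -- the prefix Open Web UI
# puts on metadata prompts (titles, tags) as described in this walkthrough.
METADATA_PREFIX = "### Task"

def is_metadata_request(chat_input: str) -> bool:
    """True when the prompt is a title/tags request, not a real user message."""
    return chat_input.startswith(METADATA_PREFIX)

def route(payload: dict) -> str:
    """Decide which branch of the workflow a webhook payload should take."""
    chat_input = payload.get("chatInput", "")
    return "metadata_llm" if is_metadata_request(chat_input) else "main_agent"
```

So a payload like {"chatInput": "### Task: Generate a title"} goes to the small metadata LLM, while an ordinary question goes to the main agent.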
102:51 So you can just copy me if you want. Then we go into our agent where we're
102:55 hooked into Ollama and we've got our local Supabase. I already showed you
102:58 how to connect up all this and that looks exactly the same. The only thing
103:01 that is different now is we have a single tool to search the web with
103:07 SearXNG. So it's a web search tool. I have a description here just telling it what
103:11 is going to get back from using this tool. And then for the workflow ID, this
103:16 is, if I go to add a node here and search for workflow tools: the Call n8n
103:23 Workflow Tool. So this is basically taking an N8N workflow and using it as a
103:28 tool for our agent. So this is the node that we have right here. But then I'm
103:32 referencing the ID of this N8N workflow. So this ID because I'm going to just
103:37 call the subworkflow that I have defined below. And again, I don't want to dive
103:40 into all the details of N8N right now and how this all works, but the agent is
103:44 going to decide the query. What should I search the web with? It decides that and
103:49 then it invokes this subworkflow here where we have our call to SearXNG. So the
103:54 name of the container service in our Docker Compose stack is just searxng,
103:59 and it runs on port 8080. And then if you look at the SearXNG documentation, you
104:03 can look at how to invoke their API and things like this. So I'm just doing a
104:07 simple search here and then there are a few different nodes because what I want
104:10 to do is I want to split out the search and actually I can show you this by
104:14 going to an execution history where we're actually using this tool. So take
104:18 a look at this. So in this case the LLM decided to invoke this tool and the
104:23 query that it decided is current price of the 5090 GPU. So this is going along
104:28 with the conversation that we had last in Open Web UI. We get some results from
104:33 SearXNG, which is just going to be a bunch of different websites. And so, we don't
104:37 have the answer quite yet. We just have a bunch of resources that can help us
104:41 get there. And so, I'm going to split out. So, we have a bunch of different
104:45 websites. We're going to now limit to just one. I just want to pull one
104:48 website right now just to keep it really, really simple because now we're
104:52 going to actually visit that website. I'm going to make an HTTP request to
104:57 this website, which yeah, I mean, if it's literally an Nvidia official site
105:01 for the 5090, like this definitely has the information that we need. We're
105:04 going to make a request to it, and then we're also going to use this HTML node
105:08 to make sure that we are only selecting the body of the site. So, we take out
105:12 all the footers and headers and all that junk. So, we just have the key
105:15 information. And then that is what we aggregate and then return back to our AI
105:19 agent. So it now has the content, the core content of this website to get us
105:24 that answer. That is how we invoke our web search tool. And then at the very
105:29 end, we're just going to set this output field. And that's going to be the
105:33 response that we got back either from like generating a title or calling our
105:37 main agent. And this is really important: whatever we call the output field
105:41 here, we have to make sure that it corresponds to this value, which is the
105:46 last thing we have to set in the settings for our Open Web UI
105:50 function. So "output" here has to match "output" here, because that is what
105:55 we're going to return in this respond to web hook. Whatever open web UI gets back
105:59 it's getting back from what we return right here. So that is everything for
106:03 our agent. I could probably dive in quite a bit more into explaining how
106:06 this all works and building out a lot more complex agents, which I definitely
106:10 do with local AI in the Dynamous AI Agent Mastery course. So check that out if you
106:13 are interested. I just wanted to give you a simple example here showing how we
106:17 can talk to our different services like Ollama, Supabase, and SearXNG. And then
106:22 also open web UI as well. So once you have all these settings set, make sure
106:26 of course that you click on save. It's very, very important. These two things
106:29 at the bottom don't really matter, by the way. But yeah, click on save once
106:32 you have all of the settings there. And then you can go ahead and have a
106:36 conversation with your agent just like I did when I was demoing things before we
106:40 dove into the workflow. And by the way, this N8N agent that works with Open Web
106:44 UI, I have as a template for you. You can go ahead and download that in this
106:48 GitHub repository where I'm storing all the agents for this masterclass. So we
106:52 have the JSON for it right here. You can go ahead and download this file. Go into
106:57 your N8N instance. Click on the three dots in the top right once you've
107:01 created a new workflow. Import from file and then you can bring in that JSON
107:03 workflow. You'll just have to set up all your own credentials for things like
107:07 Ollama and Supabase and SearXNG. But then you'll be good to go, and you can just go
107:11 through the same process that I did setting up the function in open web UI
107:15 and within like 15 minutes you'll have everything up and running to talk
107:20 to N8N in open web UI. Next up I want to create now the Python version of our
107:25 local AI agent. And so this is going to be a one-to-one translation. Exactly what
107:30 we built here in N8N, we are now going to do in Python. So I can show you how to
107:34 work with both no-code and code with our local AI package. And so this GitHub
107:39 repo that has the N8N workflow we were just looking at and that OpenAI
107:43 compatible demo we saw earlier, this has pretty much everything for the agent. So
107:46 most of this repository is for this agent that we're about to dive into now
107:50 with Python. And in this readme here, I have very detailed instructions for
107:54 setting up everything. And a lot of what we do with the Python agent, especially
107:57 when we are configuring our environment variables, it's going to look very
108:01 similar to a lot of those values that we set in N8N. Like we have our base URL
108:05 here, which you'd want to set to something like http://ollama:11434.
108:09 We just need to add this /v1 on the end, which I guess is a little bit different,
108:14 but yeah, I've got instructions here for setting up all of our environment
108:18 variables, our API key, which you can actually use OpenAI or Open Router as
108:22 well with this agent, taking advantage of the OpenAI API compatibility. This is
108:27 a live example of this because you can change the base URL, API key, and the
108:31 LLM choice to something from Open Router or OpenAI, and then you're good to go
108:35 immediately. It's really, really easy. We will be using Ollama in this case, of
108:39 course, though. And then you want to set your Supabase URL and service key.
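As a rough sketch of what that configuration boils down to, here's a small settings loader with illustrative defaults. The variable names and default URLs here are assumptions mirroring the values discussed, not the repo's exact .env schema, so check them against the readme.

```python
def load_settings(env: dict) -> dict:
    """Resolve agent settings from an environment mapping, with local-stack defaults.
    All names and defaults are illustrative, not the repo's exact schema."""
    return {
        # OpenAI-compatible base URL; note the /v1 suffix Ollama expects.
        "llm_base_url": env.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        # Ollama ignores the API key, but OpenAI/OpenRouter need a real one.
        "llm_api_key": env.get("LLM_API_KEY", "ollama"),
        "supabase_url": env.get("SUPABASE_URL", "http://localhost:8000"),
        "supabase_service_key": env.get("SUPABASE_SERVICE_KEY", ""),
        "searxng_base_url": env.get("SEARXNG_BASE_URL", "http://localhost:8081"),
        # Just the token: whatever came after "Bearer " in the N8N credential.
        "bearer_token": env.get("BEARER_TOKEN", "test auth"),
    }
```

Swapping to OpenRouter or OpenAI is then just overriding LLM_BASE_URL, LLM_API_KEY, and the model name, which is the OpenAI API compatibility point made above.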
108:42 You can get that from your environment variables. Same thing with SearXNG with
108:47 that base URL. We'll set that just like we did in N8N. We have our bearer token,
108:51 which in our case was "test auth". It's just whatever comes after the "Bearer" and
108:55 the space. And then the OpenAI API key you can ignore. That's just for the
108:58 compatible demo that we saw earlier. This is everything that we need for our
109:02 main agent now. And so we're using a Python library called FastAPI to turn
109:09 our AI agent into an API endpoint just like we did in N8N. And so FastAPI is
109:13 kind of what gives us this web hook, both with the entry point and the exit for
109:17 our agent and then everything in between is going to be the logic where we are
109:21 using our agent. And I'm going to be using Pydantic AI. It's my favorite AI
109:25 agent framework with Python right now. Makes it really easy to set up agents,
109:29 so we'll dive into that here. And I don't want to get into the
109:32 nitty-gritty of the Python code here because this isn't a master class on
109:36 specifically building agents. I really just want to show you how we can be
109:40 connecting to our local AI services. This agent is 100% offline. Like I could
109:46 cut the internet to my machine and still use everything here. So we create our
109:50 Supabase client and the instance of our FastAPI app. I have some models
109:55 here that define the requests coming in and the response going out. So we have
109:59 the chat input and the session ID just like we saw in N8N. And then the output
110:04 is going to be this output field. And so that corresponds to exactly what we're
110:07 expecting with those settings that we set up in the function in open web UI.
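Shape-wise, the request and response this endpoint exchanges with Open Web UI look roughly like the sketch below. It uses stdlib dataclasses rather than the Pydantic models in the actual repo, and the chatInput/sessionId/output field names are assumptions following the convention described here.

```python
from dataclasses import dataclass

@dataclass
class ChatRequest:
    """What the webhook/endpoint receives per message."""
    chatInput: str   # the user's prompt, or the "### Task ..." metadata prompt
    sessionId: str   # keys the conversation history rows in the database

@dataclass
class ChatResponse:
    """What goes back to the caller."""
    output: str      # must match the "output" valve in the Open Web UI function

req = ChatRequest(chatInput="Hi there!", sessionId="session-1")
resp = ChatResponse(output="Hello from the agent.")
```

The one field name that matters most is output: if it doesn't match the output valve in the Open Web UI function, the UI gets back nothing.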
110:11 So this Python agent is also going to work directly with open web UI. And then
110:16 we have some dependencies for our Pydantic AI agent because it needs to have an
110:21 HTTP client and the SearXNG base URL to make those requests for the web search
110:25 tool. And then we're setting up our model here. It's an OpenAI model, but we
110:29 can override the base URL and API key to communicate with Ollama or OpenRouter as
110:34 well, like we will be doing. And then we create our Pydantic AI agent, just getting
110:38 that model based on our environment variables. I've got a very simple system
110:43 prompt and then the dependencies here because we need that HTTP client to talk
110:48 to SearXNG. And then I'm just allowing it to retry twice. So if any kind
110:51 of error comes up, the agent can retry automatically, which is one of the
110:55 really awesome things that we have in Pydantic AI. And then I'm also creating a
111:01 second agent here. This is the agent that is going to be responsible like we
111:05 have in N8N for handling the metadata for Open Web UI, like conversation titles
111:10 and tags for our conversation. And so it's an entirely separate agent because
111:14 it just has another system prompt. In this case, I'm just doing something
111:18 really simple here. We don't have any dependencies for this agent because it's
111:21 not going to be using the web search tool. And then for the model, I'm just
111:25 using the exact same model that we have for our primary agent. But like I shared
111:30 with N8N, you could make it so that this is like a much smaller model, like a one
111:33 or three billion parameter model because the task is just so basic or maybe like
111:38 a 7 billion parameter model. So you can tweak that if you want. Just for
111:41 simplicity's sake, I'm using the same LLM for both of these agents.
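Both agents will lean on SearXNG for search, so here's roughly what the query URL underneath that tool looks like. This is a sketch: the /search path with format=json is SearXNG's JSON API as I understand it, and the searxng:8080 base URL assumes the Docker Compose service name from earlier.

```python
from urllib.parse import urlencode

def build_search_url(base_url: str, query: str) -> str:
    """Build a SearXNG JSON search URL; the caller GETs it and reads 'results'."""
    params = urlencode({"q": query, "format": "json"})
    return f"{base_url}/search?{params}"

# The agent decides the query; inside the stack the base URL is the service name.
url = build_search_url("http://searxng:8080", "current price of the 5090 GPU")
```

The response is a JSON object whose results list holds candidate pages; the tool then fetches one or a few of those pages for the actual content.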
111:46 And then we get to our web search tool. So in Pydantic AI, the way that you give a
111:51 tool to your agent is you do @, then the name of your agent, then
111:55 .tool, and then the function that you define below this is now going to be
112:00 given as a tool to the agent. And then this description that we have in the doc
112:04 string here that is given as a part of the prompt to your agent. So it knows
112:09 when and how to use this tool. And so the exact, you know, details of how
112:13 we're using SearXNG here, I won't dive into, but it is the exact same as what
112:17 we did in N8N, where we make that request to the search endpoint of SearXNG. We
112:22 go through the page results here. We limit to just the top three results or I
112:26 could even change this to make it even simpler and just the top result. So we
112:30 have the smallest prompt possible to the LLM. And then we get the content of
112:34 that page specifically. And then we return that to our AI agent with some
112:40 JSON here. So now once it invokes this tool, it has a full page back with
112:44 information to help answer the question from the user. It has that web search
112:47 complete now. And then we have some security here to make sure that the
112:51 bearer token matches what we get into our API endpoint. So that's that header
112:55 authentication that we set up in N8N. So this part right here where we're
112:59 verifying the header authentication that corresponds to this verify token
113:03 function. And then we have a function to fetch conversation history to store a
113:08 new message in conversation history. So both of these are just making requests
113:12 to our locally hosted Supabase using that Supabase client that we created
113:16 above. And then we have the definition for our actual API endpoint. And so in
113:23 N8N, we were using invoke-n8n-agent for our path to our agent. So this was our
113:29 production URL. In this FastAPI endpoint, our endpoint is
113:34 /invoke-python-agent. And then we're specifically expecting the chat input
113:39 and session ID. So that is our chat request type right here. And
113:43 then sorry I highlighted the wrong thing. We have our response model here
113:46 that has the output field. So we're defining the exact types for the inputs
113:50 and the outputs for this API endpoint. And then we're also using this verify
113:55 token to protect our endpoint at the start. And then the key thing here, if
113:59 the chat input starts with that task, then we're going to call our metadata
114:02 agent. And so it's just going to spit out the title or the tags, whatever that
114:06 might be. Otherwise, we're going to fetch the conversation history, format
114:11 that for Pydantic AI, store the user's message so that we have that
114:15 conversation history stored, create our dependencies so that we can communicate
114:20 with SearXNG, and then we'll just do agent.run. We'll pass in the latest
114:24 message from the user, the past conversation history and the
114:27 dependencies that we created. So it can use those when it invokes the web search
114:31 tool. And then we just get the response back from the agent, and you can
114:34 print that out in the terminal as well, and then we'll just store it in
114:38 Supabase and then return the output field. So I'm going kind of fast here.
114:42 There are definitely a lot more videos on my channel where I break down in more
114:45 detail building out agents with Pydantic AI and turning them into API endpoints
114:49 and things like that. But yeah, just going a little bit faster here.
114:52 And then the last thing is with any kind of exception that we encounter, we're
114:55 just going to return a response to the front end saying that there was an issue
114:59 and then specifying what that is. And then we are using Uvicorn to host our
115:06 API endpoint specifically on port 8055. So that is everything for our Python
115:11 agent exactly the same as what we set up in N8N. And now going to the readme
115:16 here, I'll open up the preview. The way that we can run this agent, we just have
115:20 to open up a terminal just like we did with the OpenAI compatible demo. I've
115:25 got instructions here for setting up the database table, which is using
115:30 the same table as the one in N8N and N8N creates it automatically. So, if you
115:34 have already been using the N8N agent, you don't actually have to run this SQL
115:38 here. And then you want to obviously set up your environment variables like
115:42 we covered. Then activate your virtual environment and install the requirements
115:46 there. And then you can go ahead and run the command python main.py.
115:52 And so this will start the API endpoint. So it'll just be hanging here because
115:56 now it's waiting for requests to come in on port 8055. And so what I can do is
116:03 I can go back to open web UI. I can go to the admin panel functions. Go to the
116:08 settings. I can now change this URL. So everything else is the same. I have my
116:11 bearer token, the input field, and the output field the same as N8N. The only
116:16 thing I have to change now is my URL. And so I know this is an N8N pipe and I
116:20 have N8N in the name everywhere, but this does work with just any API
116:23 endpoint that we have created with this format here. And so I'm going to say for
116:27 my URL, it's actually going to be host.docker.internal, because I have my API endpoint for
116:33 Python running outside on my host machine. So I need my open web UI
116:38 container to go outside to my host machine. And then specifically the port
116:43 is going to be 8055. And then the endpoint here, I'm going to
116:46 delete this web hook here, because it's /invoke-python-agent. Take a look at that. All right.
116:52 Boom. So I'm going to go ahead and save this. And then I can go over to my chat.
116:57 And it says N8N agent connector here still. But this is actually talking to
117:00 my Python agent now. So I'll go ahead and start by asking it the exact same
117:05 question that I asked the N8N agent. And I do have this pipe set up to always say
117:08 that it's calling N8N, but this is indeed calling our Python API endpoint.
117:12 And we can see that now. So there we go. We got all the requests coming in, the
117:16 response from the agent, and then also the metadata for the title and the tags
117:20 for the conversation. Take a look at that. So we got our title here. We have
117:24 our tags, and then we have our answer. It's a starting price of $2,000, though
117:28 it's a lot more right now; the starting price is kind of
117:32 misleading. But yeah, this is a good answer, and it did use SearXNG to do
117:36 that web search for us. This is really, really neat. Now, the last thing that we
117:40 want to do for our Python agent before we can work on deploying things to the
117:45 cloud, to a private server in the cloud, is we want to containerize it. Now, the
117:49 reason that we want to do this, and this is the Docker file that I have set up to
117:53 turn our Python agent into a container, just like our local AI services, is if
117:59 we have our agent containerized, then we can have it communicate within the
118:03 Docker network just like we have our different local AI services
118:07 communicating with each other. Because right now running directly with Python
118:12 to communicate with Ollama, for example, we need our URL to be localhost, not
118:18 ollama. Remember, you can only use the specific name of the container service
118:23 when you are within the docker compose stack. And so we'd have to actually say
118:28 localhost right now. But if we add the container for the agent into the stack
118:32 as well, then we can communicate directly within the private network.
118:36 Like, I can just say ollama, and then for SearXNG I could use this URL instead. Right now,
118:41 we have to actually use localhost port 8081. And so it's really nice for
118:46 security reasons, and just to make your deployment a nice package, to have
118:51 the agents that you're running in the same network as your infrastructure. And
118:55 so that's what we're going to do right now. And so, within the readme that I
118:59 have for instructions on setting up everything, I have the instructions that
119:02 we follow to run things with Python. I also have the instructions to run it
119:07 with Docker. And so all you want to do is run this single command. It's
119:11 actually very easy because I have this Docker file set up to turn our agent
119:15 into a container and I've got security and everything taken care of. We're
119:19 running it on port 8055 just like we did with Python. And then I have this very
119:24 simple Docker Compose file. It's just a single service that we're going to tack
119:28 on to all of the other services that we already have running for the local AI
119:32 package. And I'm calling this one the Python local AI agent. And so we're
119:36 using all of our environment variables from our ENV just like we did with the
119:40 local AI package. And then what I have at the top here is I am including the
119:45 docker compose file for the local AI package. So that just kind of solidifies
119:49 the connection there. Otherwise, you'll get this kind of weird error that says
119:52 there are orphaned containers when you run this, even though they aren't
119:55 actually. And so this is optional. You'll just get an orphan container
119:58 warning that you can ignore. But if you don't want to have that warning, you can
120:01 include this right here. You just have to make sure that this path corresponds
120:06 to the path to your docker compose in the local AI package. So in my case, I
120:10 just had to go up two directories and then go into the local AI package
120:13 folder. So yeah, this is optional, but I wanted to include this here just to
120:18 make things tip-top shape for you. So yeah, this is the Docker Compose. And
120:21 then what we can do now is I'll go back over to my terminal and I will paste in
120:25 this command. And what this will do is it will start or restart my Python local
120:31 AI agent container. And make sure that you specify this here because if you
120:35 don't then it's going to try to rebuild the entire local AI package because we
120:38 have this include. So very important you want to just rebuild or build for the
120:43 first time this agent container. And so I'll go ahead and run this and it's
120:47 going to give me those Flowise warnings. So I don't have my username
120:49 and password set, but remember we can ignore those. But anyway, it's going to
120:52 build the Python local AI agent container here. And there's a couple of
120:56 steps that it has to do. It has to update some internal packages and then
121:00 also install all of the pip packages we have for our Python requirements for
121:04 things like FastAPI and Pydantic AI. So it'll take a little bit to complete,
121:07 usually just a minute or two. And so I'll go ahead and pause and come back
121:11 once this is done. And there we go. Your output should look something like this.
121:14 It goes through all the build steps and then it says that the container is
121:18 started at the bottom. And this is now in our local AI Docker Compose stack. So
121:23 going back over to Docker Desktop. It'll take a little bit to find it here
121:26 because there are so many services that we have here. But if we scroll down,
121:29 okay, there we go. At the bottom here, we have our Python local AI agent
121:34 waiting on port 8055 just like when we ran it directly with Python, but now it
121:38 is within a container that is directly within our stack. And so now, like I was
121:41 saying, this is super important, so I'm hitting on it again. Now when we set up
121:45 our environment variables for our container, we are going to be
121:49 referencing the service names of our different local AI services, like searxng or
121:55 ollama, instead of localhost. And so this whole localhost versus
121:58 host.docker.internal versus service name thing, that's what I see people get tripped up on
122:03 the most when they're configuring different things in a Docker environment
122:06 like the local AI package. That's why I'm spending a good amount of time
122:09 really hammering that in because I want you to get it right. And of course, if
122:12 you have any issues that come up with this, just let me know. I'd love to help
122:16 walk you through what exactly your configuration should look like. And so,
122:20 we have our agent now up and running in a container. And I'm not going to go and
122:23 demo this again right now because the next thing that we're going to move into
122:26 doing and then I'll give a final demo here is deploying everything to a
122:30 private server that we have in the cloud. All right. So, we have really
2:02:31 Introduction to Deploying & Cloud Provider Options
122:34 gotten through all of the hard stuff already. So, if you have made it this
122:38 far, congratulations. You really have what it takes to start building AI
122:43 agents with local AI now and the sky is the limit for what you can accomplish.
122:47 And so the last thing that I want to really focus on here in this master
122:50 class is taking everything that we've been building on our own computer with
122:54 our infrastructure and our agents and deploying it to a private machine in the
122:59 cloud because then we can have our entire infrastructure and agents running
123:02 24/7. We don't have to rely on having our own computer up all the time. It's
123:05 really nice to have it there because then we can also share it with other
123:10 people as well. So local AI is still considered local as long as it's running
123:14 on a machine in the cloud that you control. And so this is not just going
123:18 to OpenAI or Supabase and paying for their API. This is still us running
123:22 everything ourselves on a private server. That's what I'm going to show
123:25 you how to do right now. And this process that I cover with you is going
123:29 to work no matter the cloud provider that you end up choosing. And there are
123:32 some caveats to that that I'll explain in a little bit. But yeah, you can pick
123:36 from a lot of different options. So the cloud platform that we will be deploying
123:41 to today is Digital Ocean. I use Digital Ocean a lot. It's where I deploy most of
123:46 my AI agents. So I highly recommend it. And the best part about Digital Ocean is
123:51 they have both GPU machines if you need to have a lot of power for your local
123:56 LLMs and they have very affordable CPU instances. If you want to deploy
124:00 everything for the local AI package except Olama, you can definitely go a
124:03 more hybrid route if you don't want to pay a lot because these GPU instances in
124:07 the cloud can be pretty expensive like one, two, even $5 per hour. So, what you
124:12 can do with a hybrid setup is deploy everything in the local AI package. So,
124:16 at least you have all that locally and you're not paying for those
124:19 subscriptions. But then you could still use something like OpenAI, Open Router
124:23 or Anthropic for your LLMs. So, Digital Ocean gives us the ability to do both,
124:27 and we'll dive into that when we set things up. Another really good option
124:33 for GPU instances is TensorDock. TensorDock is not as nice looking to me as
124:37 Digital Ocean. I generally feel like I have a better experience with Digital
124:40 Ocean, but I have deployed the local AI package to TensorDock before on a 4090
124:46 GPU that they offer for 37 cents an hour. It's very affordable for GPU
124:49 instances. And so, this is a good platform as well. And then also if
124:55 you're okay with not running Olama on a GPU instance, like you just want a very
124:59 affordable way to host everything in a local AI package except the LLMs, then
125:03 you can use Hostinger. Hostinger is another really, really good option. Super,
125:08 super affordable, like $7 a month for a KVM 2, which I'd recommend getting if you
125:14 want to deploy everything except Olama because the requirement for the local AI
125:17 package except for running the more resource intense local LLMs is you have
125:23 to have 8 GB of RAM. So don't get a cloud machine that has four or 2 GB. You
125:27 want to have 8 GB of RAM then you'll be good to go. So you can literally do it
125:31 for $7 a month through Hostinger and it's going to be something like $28 a
125:35 month through Digital Ocean unless you want a GPU instance. So I just want to
125:39 spend a couple minutes talking about different platform options. The one
125:43 thing I will say is that the local AI package runs as a bunch of Docker
125:47 containers, right? And so what you have to avoid is using a platform like
125:54 RunPod. So RunPod is a platform for running local AI. The problem is when
125:59 you pay for a GPU instance, you don't actually get the underlying machine.
126:05 You're just sshing into a container. So, you're accessing a container. And I'll
126:09 just save you the pain right now. It is so hard and basically impossible to run
126:15 Docker containers within Docker containers. So, you really can't run the
126:20 local AI package on RunPod. There are other platforms as well. Lambda Labs
126:25 is another one that I've used before, not for the local AI package but for other
126:28 things, and this also runs containers. You're accessing a container, so you
126:34 can't do the local AI package. Vast.ai is another option, but with this one too,
126:39 you're renting a GPU while you're accessing a container. So, you again
126:43 can't run the local AI package. And so, based on the platform that you choose,
126:47 you have to make sure that you are accessing the underlying machine when
126:51 you rent a GPU instance like Digital Ocean is the one that I will be using.
126:56 You could use a GPU instance through Google Cloud or Azure or AWS if you want
127:00 to go more enterprise. Those all give you access to the underlying machine.
127:04 It's your own private server just like we have in Digital Ocean. So you can use
127:07 that to deploy the local AI package and the agent that we built with Python. So
2:07:11 Deploying the Local AI Package to the Cloud
127:11 that's what we're going to do right now. So once you are signed into Digital
127:14 Ocean and you have your profile and billing set up and you have a project
127:18 created or you can just use the default one, now we can go ahead and create our
127:22 private server in the cloud to host the local AI package and our agent. And so
127:25 you can click on create in the top right. And there are two options here.
127:29 If you want a CPU instance, so that hybrid approach where you're hosting
127:34 everything except for the LLMs, you can select a droplet. Otherwise, what we're
127:37 going to be doing right now so I can demo the full thing is we will
127:42 create a GPU droplet. Now, these are going to be more expensive. Like I said,
127:48 like running an H100 GPU is $3.40 an hour. It's pretty expensive. But like I
127:52 said at the start of this master class, I know so many businesses that are
127:56 willing to put tens of thousands of dollars per year into running their own
127:59 infrastructure and LLMs. And that biggest cost that contributes to that
128:03 being tens of thousands of dollars is having a GPU droplet that is running 24/7
128:09 in the cloud. So the hybrid approach I definitely recommend if you don't want
128:12 to pay more, you could go as low as $7 a month with Hostinger. So, there's a very
128:16 wide range of options for you depending on what you want to pay hosting the
128:21 package, LLMs, and your agents. And the other thing I will say is that if you
128:25 want to, you could just create this instance for a day and poke around with
128:29 things and then tear it down. So, you only have to pay, you know, like 20
128:31 bucks or something like that. So, there's a lot of different options for
128:35 flexibility here. And so, I'm going to pick the Toronto data center because
128:38 there's more options for GPUs available here and it's relatively close to me.
128:42 And then for the image, I'll select AI/ML ready and it's recommended because
128:47 you get Linux bundled with all the required GPU drivers, and it runs the
128:52 Ubuntu distribution of Linux. And so this process that I'm going to walk you
128:55 through for deploying local AI package to the cloud is going to work for any
129:00 Ubuntu instance that you have running on AWS or Hostinger or TensorDock. It's just
129:05 a very standard distribution of Linux. And then for the GPU, there are a couple
129:09 of different options that we have here with Digital Ocean. H100 is an absolute
129:15 beast: 80 GB of VRAM, so it could easily run Q4-quantized large language models over 100
129:21 billion parameters, and it comes with 240 GB of RAM. So, I'm not going to run this one. I'm
129:24 just kind of pointing out that this is an absolute beast. I think the one that
129:27 I'm going to choose here is going to be the RTX 6000 Ada. So, it's 48 GB of
129:34 VRAM. So, this is enough to run 70 billion parameters or smaller of LLMs at
129:40 a Q4 quantization and it comes with 64 GB of RAM and it's going to be about
129:45 $1.90 per hour. So, I'm going to select this and then I have an SSH key that is
129:50 created already. If you don't have an SSH key, then you can click on this
129:54 button to add one. And then you can follow the instructions on the right
129:57 hand side here. No matter your OS, they got instructions to help you out. You
130:00 just have to paste in your public key and then give it a name. So, I've got
130:03 mine added already. And then the only other thing that I really have to select
130:07 here is a unique name. So, I'll just say local AI package. And I'll just say GPU
130:11 because I already have the regular version just deployed on a CPU instance.
130:16 And then for my project, I will select Dynamous. I can add it along with my
130:19 other instance that I've got up and running. And there we go. So now I can
130:22 go ahead and just click on create GPU droplet. It is that easy to get our
130:27 instance ready for us to access it and start installing everything and getting
130:30 everything up and running just like we did on our computer. And so I'll go
130:34 ahead and pause and come back once our machine is created in just a few
130:37 minutes. And boom, there we go. Just after a minute, we have our GPU droplet
130:42 up and running. And so the one thing I will say is I had to request access to
130:46 create a GPU instance on the Digital Ocean platform. However, they approved
130:50 it in less than 24 hours. So, it's very easy to get that if you do want to
130:53 create a GPU instance. Otherwise, you can just create one of their normal
130:58 droplets, one of their CPU instances. Now, before we connect to this machine,
131:01 the one thing that you want to take note of is the public IPv4 address. We'll use
131:06 this to set up subdomains in a little bit. And so, this is how we get to it
131:12 for a GPU droplet. For a CPU droplet, it looks a little bit different. You'll
131:15 usually see the IPv4 somewhere at the top right here. And so, take note of
131:18 that. Save it for later. We'll be using that in a little bit. And then to
131:22 connect to our droplet, we can either do it through SSH with our IPv4 and the SSH
131:27 key that we set up when we configured this instance or we can access the web
131:31 console. For a CPU instance, usually you go to like an access tab and then you
131:35 can launch the web console. For the GPU instance, I can just click this to
131:38 launch it right here. So we have a separate window that comes up and boom,
131:42 we now have access to our instance. It is that easy to get connected. And now
131:46 we can go through the same process that we did on our computer to install the
131:51 local AI package. Now there are some different steps that we have to take and
131:55 that's why I'm including this at the end of the master class especially. Um so if
131:59 you scroll down in the read me here there are some specific instructions for
132:03 deploying to the cloud. And so you have to make sure that you have a Linux
132:07 machine preferably on the Ubuntu distribution which is that is what we
132:11 are using. And then there are a couple of extra steps. And so the first thing
132:15 that we have to set up is our firewall. We have to open up a couple of ports so
132:20 that we can access our machine from the internet. Set up our subdomains for
132:24 things like n8n and Open WebUI. And so you want to take this command, ufw
132:29 enable. I'll just go ahead and paste it in, and you can
132:33 just type Y to continue here. It warns it may disrupt SSH connections, but we don't
132:36 really care. So ufw enable enables the firewall. And then you want
132:41 to copy this command to allow both ports 80 and 443. And so 80 is HTTP and then
132:48 443 is HTTPS. And then you can just do the last command here, UFW reload. So
132:54 now we have those ports available for us to communicate with all of our services.
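In shell form, the firewall setup just described looks roughly like this (you'll need sudo if you're not logged in as root; DigitalOcean droplets log you in as root by default):

```shell
ufw enable          # turn the firewall on (type y at the prompt)
ufw allow 80/tcp    # HTTP  - the entry point for Caddy
ufw allow 443/tcp   # HTTPS - the entry point for Caddy
ufw reload          # apply the new rules
ufw status          # sanity check: 80 and 443 should be listed as allowed
```

Everything else stays closed, so all outside traffic has to come in through those two ports.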
132:59 And so this is the entry point to Caddy. Caddy is the service in the local AI
133:03 package that is going to allow us to set up subdomains for all of our services.
133:08 And so any kind of communication to our droplet is going to go through Caddy,
133:13 and then Caddy will distribute to our different services based on the port or
133:17 the subdomain that we are using. So this is called a reverse proxy. You've
133:21 probably heard of NGINX or Traefik before. Caddy is something very
133:25 much like that. And it also makes it so that we can get HTTPS so we can have
133:29 secure endpoints set up automatically and it manages that encryption for us.
133:33 It's a very beautiful platform. So, let me scroll back down to the specific
133:36 steps for deploying to the cloud. There is a quick warning here about how Docker
133:40 manages ports and things, but this is as secure as we possibly can make it. So,
133:43 trust me, we've put a lot of effort and I actually had someone from the Dynamis
133:48 community, um, Benny right here. He actually helped me a lot with security
133:53 for the local AI package. So, thank you, Benny, for helping out with that. We're
133:56 really making sure that because local AI like the whole thing is like you want to
133:59 be private and secure and so we're making sure that this package handles
134:03 all the best practices for that. So very much top of mind for us. Um and then we
134:07 can go ahead and go through the usual steps for setting up the local AI
134:11 package. The only other thing we have to do that's unique for cloud is we have to
134:16 set up A records with our DNS provider so we can have our subdomains set up for our
134:20 different services. So we'll get into that in a little bit. But first I just
134:24 want to get the local AI package up and running. And so I'm going to go ahead
134:28 and paste this command here to clone the repository. Git comes automatically
134:33 installed with our GPU droplets. And then I can change my directory into the
134:38 local AI package. And let me zoom in on this a little bit here so it's very easy
134:41 for you to see, because now what I want to do is copy
134:45 the .env.example file to a new file called .env. So then if I do an ls command so we can see all the
134:51 files that are available in our directory... I guess it doesn't show dotfiles, so
134:56 I do ls -a there. Now we can see the .env and .env.example. And so now I can do nano .env.
135:05 This is going to give us a basically a text editor directly in the terminal
135:08 here so that we can set all of our environment variables just like we did
135:12 with the local AI package on our computer. So this time I'm not going to
135:15 go into the nitty-gritty details of setting up all these environment
135:19 variables because it is the exact same process. In fact, you can literally
135:23 reuse all of the secrets that you already set up when you hosted it on
135:27 your own computer as long as those are actually secure. Like I know a lot of
135:29 times you might just do some kind of placeholder stuff when it's just running
135:32 on your computer and then you want it more secure in the cloud. So make sure
135:36 you have real values for everything, but yeah, you can reuse a lot of the same
135:39 things. Um though for best security practice, you probably do want to make
135:42 everything different. But yeah, the one thing that I want to focus on with you
135:46 here that does change is our configuration for Caddy. Now that we are
135:50 deploying to the cloud, we want to have subdomains for our different services.
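For reference, the Caddy section of the .env ends up looking something like this. The exact variable names come from the package's .env.example, so treat these names and hostnames as illustrative:

```shell
# Subdomains served through Caddy (uncomment only what you want exposed)
N8N_HOSTNAME=n8n-yt.yourdomain.com
WEBUI_HOSTNAME=openwebui-yt.yourdomain.com
SUPABASE_HOSTNAME=supabase-yt.yourdomain.com
LETSENCRYPT_EMAIL=you@yourdomain.com
# Leave the Ollama and SearXNG hostnames commented out -
# those services stay internal-only
```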
135:54 And so like n8n, for example, we want to set the hostname for that. You want to
135:59 do the same thing for n8n, Open WebUI, and Supabase, and then your Let's Encrypt
136:04 email. Obviously you want to uncomment that as well, because this is the email
136:08 that you want to set for your SSL encryption. And so I'm just going to do
136:13 cole@dynamous.ai. And you can just set this to whatever email you want to use. And so
136:18 basically what you want to do here is just uncomment the line for each of the
136:23 services that you want to have subdomains created for. So, if you're
136:26 also using Flowwise, which I'm just not in this master class here, but if you
136:30 are, then you want to uncomment this line as well. If you're using Langfuse,
136:34 then you uncomment this line. I'm going to leave them commented right now just
136:38 for simplicity's sake. The two that I would generally recommend never
136:43 uncommenting are Ollama and SearXNG. We don't really want to expose them through
136:45 a subdomain because we're just going to use them as internal tooling for our
136:49 agents and applications that we have running on this server. So, we want to
136:53 keep those nice and private. But for everything that we do want to expose
136:56 that is protected with a username and password, like n8n, Open WebUI, and
137:01 Supabase, we can uncomment those. And so we got that set up now. But we have to
137:05 obviously provide real values for them as well. And so for example, I'm just
137:09 going to say n8n-yt, YT for YouTube, .dynamous.ai. So you want to
137:17 define the exact URL that you want to have for this domain. Obviously, it has
137:21 to be a domain that you control, because we'll go and we'll set up the records in
137:25 the DNS in a little bit. And so, I'll do the exact same thing for Open WebUI.
137:28 So, it's openwebui- and I'm just adding -yt because I already have the local AI
137:33 package hosted on my domain. And so, I can't do just openwebui because it's
137:36 already taken. So, openwebui-yt.dynamous.ai. And then finally, for Supabase, it'll
137:44 be supabase-yt.dynamous.ai. Boom. All right. There we go. So, go
137:47 ahead and take care of this. set up the rest of your environment variables and
137:51 then we can go ahead and move on. And the way that you exit out of this and
137:55 save your changes is you do Ctrl+X (or Command+X on Mac), type Y, and then hit
138:03 Enter. So again, that is Ctrl+X, then type Y, then press Enter. That is how you
138:08 exit out. And so now if I do a cat of the .env, this is how you can print it out
138:11 in the terminal. So we can verify that the changes that we made, like everything
138:16 for Caddy, are indeed made. So do that. Also change all the other environment
138:18 variables as well. I'm going to do that off camera and come back once that is
138:22 taken care of. All right. So my environment variables are all set. Now
138:26 the very last thing we have to do before we start all of our services is we need
138:32 to set up DNS. And so remember, copy your IPv4 address and then head on over
138:38 to your DNS provider. And so this process is going to look very similar no
138:42 matter the DNS provider that you have. Like I'm using Namecheap here. A lot of
138:47 people use Hostinger or Bluehost. You are able to with all these providers go
138:50 to something that is usually called like advanced DNS or manage DNS and then you
138:55 can set up custom A records here which we're going to do to set up a connection
139:01 for all of our subdomains to the IPv4 of our digital ocean droplet or whatever
139:05 cloud provider you are using. And so I'm going to go ahead and click on add new
139:09 record. It's going to be an A record, and for the host, it's going to be the subdomain
139:14 that I want. So, n8n-yt, for example. And then for the IP address, I just paste in
139:19 the IPv4 of my Digital Ocean GPU droplet. And then I'll go ahead and
139:23 click on the check here to save changes. And then I'll just go ahead and do the
139:28 same thing for openwebui-yt. I can't forget the -yt. There we go. And
139:33 then for you, it might be more than just three, but for me, the only other one
139:36 that I have here right now is Supabase because I'm just keeping it very, very
139:41 simple. So, supabase-yt, and then paste in the IP again. Okay, there we go.
139:45 Boom. All right, so we have all of our records set up. And it's very important
139:49 to do this before you run things for the first time because otherwise Caddy gets
139:52 very confused. It tries to use these subdomains that you don't actually have
139:57 set up yet. And so take care of that. Then we can go back into our instance
140:01 here and run the last command. And so going back to the readme here, if I
140:06 scroll all the way down to deploying to the cloud, there is a specific parameter
140:11 that you want to add. This is very important for deploying to the cloud
140:15 because when you select the environment of public, it's going to close off a lot
140:20 more ports to make this very very secure. So any of the services that you
140:25 access from outside of the droplet, it has to go through caddy. So we use the
140:30 reverse proxy as the only entry point into any of our local AI services. This
140:34 is how we can make things as airtight as possible. We have security in mind like
140:39 I said. And so make sure that you run this command with the environment
140:43 specified. We didn't do this locally because when we are running things
140:45 locally, we don't care about security as much because it's not like our machine
140:49 is accessible to the internet like a cloud server is. And so it defaults to
140:54 the environment of private which just doesn't do as much security stuff. And
140:57 so go ahead and run this. And then of course make sure that you're using the
141:01 correct profile. So if you're using just a CPU instance with a regular droplet or
141:05 Hostinger or whatever, you'd want to change this to cpu instead of gpu-nvidia.
141:10 But in our case, because we are paying roughly $2 an hour for a killer GPU
141:16 droplet, I can go ahead and run this command with the profile of gpu-nvidia.
141:20 Now, I left this error in here intentionally because I want to show you
141:24 what it looks like. If you get "unknown shorthand flag: 'p'", that means that you
141:29 don't actually have Docker Compose installed. And this happens for some
141:32 cloud providers. And there's a very easy fix for this that I want to walk you
141:35 through. So you can even test this. Just docker compose. It'll say that compose
141:40 is not a docker command. And so going back to the readme here, I have a couple
141:44 of commands that you just have to run if this happens to you. This is at the
141:47 bottom of the deploying to the cloud section of the readme. So you can just
141:51 copy these one at a time, bring them into your droplet or your machine,
141:55 wherever you're hosting it, and go ahead and run them. And so I'm just going to
141:58 do this off camera really quickly. I'm just going to copy each of these into my
142:02 droplet. It's very easy. You can just run all of these. They're really fast as
142:06 well. So none of them are going to take very long. This is just going to get
142:10 everything ready for you so that Docker Compose is a valid command. So you can
142:14 then run the start services script. So there we go. All right. I went ahead and
142:18 ran all of those. I'll clear my terminal again and then go back to the main
142:23 command here to start our services. Boom. All right. So now we pulled
142:26 everything from Supabase, set up our SearXNG config. Now we are pulling our
142:31 Supabase containers. So again, same process as running on our computer, where
142:35 it'll pull Supabase, it'll run everything for Supabase, then it'll
142:38 pull and run everything for the rest of our services. So I'll pause and come
142:42 back once this is all complete. And boom, there we go. We have all of our
142:46 services up and running. You should see green check marks across the board like
142:51 this. We are good to go. And we don't have Docker Desktop, so it's not as easy
142:55 to dive into the logs for our containers. But one quick sanity check
142:59 that you can do just in the terminal is run the command docker ps -a. This will
143:03 give you a list of all of our containers that are running here. We can make sure
143:06 that all of them are running, that we don't see any that are constantly
143:10 restarting or ones that are down. So we do have two that are exited, but these
143:14 are the n8n import and the Ollama pull containers. These are the two that we know should be
143:17 exited. Just make sure that everything is good to go. Then we can head on over
2:23:18 Testing Our Deployed Local AI Package
143:22 to our browser. And because we have DNS set up already and we configured Caddy, we
143:25 can now navigate to our different services. Like I can go to
143:30 n8n-yt.dynamous.ai. And boom, there we go. It's having us set up our owner account. Or we can
143:39 just go to openwebui-yt.dynamous.ai. Boom. And there is our Open WebUI. All
143:43 right. So I'll go ahead and get started. Uh, we'll have to create our account.
143:46 I'll do this off camera, but yeah, you just create your first-time accounts for
143:49 everything. And then we'll do the same thing for, let's do, supabase-yt.dynamous.ai.
143:55 And boom, there we go. So, all of our services are up and running. And so now
143:59 we can log into these and create our accounts and we can interact with our
144:03 agents and bring them in. We can work with Ollama in the same way. And so,
144:06 let's go ahead and do that. I'll just go ahead and create these accounts off
144:09 camera. So, I've got my accounts created for N8N and then also open WebUI. And
144:13 you can do the same for all the other accounts you might have to create for
144:16 things like Langfuse as well. And then within Open Web UI, we'll go to the
144:21 admin panel, settings, connections. Make sure that your Ollama API is set
144:25 correctly to reference the service ollama. Usually this will default to
144:30 localhost or host.docker.internal, so you have to change that there. You have to set
144:34 the OpenAI API key as well, just to any kind of random value. It's just a little
144:38 bug in open web UI. Then click on save and then you can go back and your models
144:41 will be loaded. Now, one thing I found with Open WebUI: after you change
144:46 the Ollama base URL, you have to do a full refresh of the website. Otherwise,
144:49 you'll get an error when you use the LLM. So, just a really small tangent
144:52 there, a little tidbit there, but yeah, we'll go ahead and select the model
144:55 that's pulled by default. And you can pull other ones into your Ollama
144:57 container as well, like we already covered. You don't have to restart
145:00 things. So, let's run a little test. I'll just say hello, and we'll see if we
145:04 can load the model now. And boom, look at how fast that was, because we have a
145:08 killer GPU instance right now. I could run much larger LLMs if I wanted to. Um,
145:14 so yeah, let's see. What did I just say? All right, we'll do another test here.
145:17 And yeah, look at how fast that is. It's blazing fast because everything is
145:20 running locally on the same infrastructure. There's no network
145:24 delays. And so we have a powerful GPU, no network delays. We get some blazing
145:28 fast responses from these LLMs right now. And so I don't want to go and test
2:25:32 Deploying Our Python AI Agent to the Cloud
145:34 everything with N8N again. But what I do want to show you how to do right now is
145:39 take the Python agent that we have in this repository and deploy this onto the
145:44 cloud as well. Adding it into the local AI Docker Compose stack just like we did
145:49 on our computer, but now hosting it all in the cloud. So that's the very last
145:52 thing that I want to cover with you for our cloud deployment. And so just like
145:56 with the local AI package, we can follow the instructions here in the readme to
145:59 get everything up and running on our machine. And so the first thing we have
146:03 to do is we need to clone our repository. And so I'm going to copy
146:07 this command, go back over into my terminal here for my instance. And I
146:11 step back one directory level by the way. So I'm now at the same place where
146:14 I have the local AI package so we can run them side by side. So I'll paste
146:19 this command to clone ottomator-agents. And then I can cd into it. And then I
146:22 also want to change my directory into the python-local-ai-agent folder specifically.
146:28 And so now, doing an ls -a, we can see the .env.example. So I'm going to, just like
146:33 we did before, copy this and turn it into a .env. And then I can do nano .env.
146:39 And there we can edit all of our environment variables. And so because
146:42 we're running this in a Docker container, attaching it to the local AI
146:46 stack, the way that I reference Ollama is going to be just calling out the service
146:51 name. So http://ollama:11434/v1. And then the API key is just that placeholder there, ollama. For the
146:58 LLM choice, if I want to get the exact ID of one that I already had pulled,
147:01 I'll actually show you how to do this really quick. So, I'm going to do
147:06 Ctrl+X, Y, Enter to save and exit. And then the way that you can exec into a
147:12 container is docker exec -it, then the name of our container, which
147:15 is ollama. We already have this running. And then /bin/bash. And so what this is going to do is now
147:22 instead of being within our machine, we are within our Ollama container. And so
147:28 now I can run the command ollama list, and then I can see the LLMs that I have
147:33 available to me. So I have Qwen 2.5 7B. So I'm going to go ahead and copy this
147:37 ID. I don't have it memorized. So this is my way to go and reference it really
147:41 quickly. And then this is also how you can access each of your containers when
147:44 you don't have Docker Desktop. You just do docker exec -it, the name of your
147:50 container, and then /bin/bash. So kind of like how we had that exec tab in Docker
147:53 desktop. And then once I'm done in here, I can just do exit. And now I'm back
147:58 within my host machine, my GPU droplet. So that's another little tidbit, another
148:02 golden nugget I wanted to give you there. But yeah, we'll go back into our
148:05 environment variables here. And I have that ID for Qwen 2.5 7B copied. So I'll
148:12 paste that in. And boom, there we go. And then for our Supabase URL, it's
148:17 going to be http:// and then it is kong. So I guess that I
148:21 should have been more clear on this when I set things up locally. So I'll be sure
148:24 to update the documentation for this, but it's going to be kong, port 8000,
148:29 because Kong is the service that we have in Supabase specifically for the
148:33 dashboard. And then the service key, well, I'm just going to go ahead and get
148:37 that from my local AI package because I have this set up in environment
148:40 variables. And so I just have to go and reference my environment variables here
148:46 to get my service role key. And boom, there we go. Okay. So I'm going to go
148:49 ahead and paste this in. And um now I'm just going to delete this instance
148:51 after. I don't really care that I'm exposing this right now. And then for
148:57 the SearXNG base URL, it's http://searxng:8080. And then for my bearer token, I
149:01 just have it set to test auth. And then we don't need to set the OpenAI API key
149:05 because that was just for the OpenAI compatible demo earlier. So that is all
149:09 of my configuration for this container. I got to be really clear on this. I'll
149:12 update the docs for this. But uh otherwise we are looking good for our
149:17 environment variables. So Ctrl+X, Y, Enter to save. You can do a cat .env just
149:22 for that sanity check to make sure that everything's saved. We are looking good.
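My resulting .env for the agent looks roughly like this. The variable names are guesses at the repo's conventions, but the values (the service names and ports inside the Docker network) are the ones from this section:

```shell
LLM_BASE_URL=http://ollama:11434/v1   # Ollama service, OpenAI-compatible path
LLM_API_KEY=ollama                    # placeholder - Ollama ignores the key
LLM_CHOICE=qwen2.5:7b                 # or the exact ID from `ollama list`
SUPABASE_URL=http://kong:8000         # Kong fronts the Supabase stack
SUPABASE_SERVICE_KEY=<service-role-key-from-the-package-.env>
SEARXNG_BASE_URL=http://searxng:8080
SEARXNG_BEARER_TOKEN=<your-bearer-token>
```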
149:27 All right. And so now within the readme here, so I'll go back to the
149:30 instructions. We did all this already. We changed our directory. We set up our
149:33 environment variables and configured them. Now we need to run this stuff in
149:38 our SQL editor in Superbase. This is how we can get our table set up because we
149:42 haven't run things with N8N first. So we don't have this table created already.
149:47 And so now I just have to sign into Supabase here. So I've got my username,
149:51 which is supabase. And then I'm just copying and pasting the username and
149:55 password that I have set here in my environment variables. And so
150:00 I'll go to my SQL editor and go back here. I know I'm moving kind of quick
150:03 here, but I got these instructions laid out in the readme. I'm going to paste
150:08 this like that and go ahead and click on run. And then boom, there we go. So now
150:12 if I go into tables and search for n8n, we have n8n_chat_histories, a new
150:16 currently empty table. All right, looking good. And then going back after
150:21 we do that, now we can run the agent. So I just have to take this command right
150:24 here and then I'll go back to my droplet. And the one thing that I
150:28 mentioned earlier, but I want to cover again: if I go into the, uh, hold on, I
150:33 need to change my directory back. So, ottomator-agents and then python-local-ai-
150:38 agent. If I go into my docker-compose file, you have to make sure that the include path
150:42 is correct. And so, I'm going to update this by the time you get your hands on
150:44 it here where it's just going to be going two levels back. That's what we
150:47 need to do. So, make sure that we reference the right path to the local AI
150:52 package on our machine, and then Ctrl+X, Y, Enter to save. That's because we have
150:57 to go back from python-local-ai-agent, then back from the ottomator-agents
151:00 directory, and then within that same directory, we have the local AI package.
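Sketched out, the directory layout the compose include expects looks like this (directory names are placeholders matching how the repos were cloned side by side):

```shell
# ~/ (home directory of the droplet user)
# |-- local-ai-package/              <- the local AI package repo
# `-- ottomator-agents/
#     `-- python-local-ai-agent/
#         `-- docker-compose.yml     <- include path goes two levels up,
#                                       then into local-ai-package/
```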
151:04 So, we're good to go. Now, I can go ahead and paste in the command here to
151:08 build our agent and include it in the local AI stack. And so, it's going to
151:11 have to build everything. Takes a minute like we saw already. So, I'll pause and
151:15 come back once this is done. And all right, there we go. About 30 seconds
151:18 later and we are good to go. So, now I can do the docker ps -a again. And this
151:23 time if I look through this list very carefully, take a little bit here, I
151:28 should be able to see my Python agent. There we go. Local AI, Python, local AI
151:32 agent. And it starts with local AI because it is a part of that docker
151:36 compose stack. And by the way, I can do docker exec -it and then python-local-ai-
151:46 agent /bin/bash. I can run this as well. And then what I can do is if I do
151:51 a printenv command, I can see all the environment variables that are set
151:54 within this container. That's everything that we set up in the .env. So I'm being
151:59 very comprehensive with this master class, showing you how you can tinker
152:02 around with different things like accessing your containers and seeing the
152:05 environment variables, making sure that everything that we specified in the .env is
152:09 actually taking effect here. And sure enough, it is. So we are looking good.
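What printenv shows is exactly what the agent reads at startup. A minimal sketch of that, using hypothetical variable names rather than the package's exact keys:

```python
import os

# Simulate two variables the container would receive from the .env file
# (these names are illustrative examples, not the package's actual keys):
os.environ.setdefault("LLM_BASE_URL", "http://ollama:11434")
os.environ.setdefault("AGENT_PORT", "8055")

# Inside the container, the agent reads them the same way printenv lists them:
base_url = os.environ["LLM_BASE_URL"]
port = int(os.environ.get("AGENT_PORT", "8055"))  # fall back to the default port
print(base_url, port)
```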
2:32:12 Testing Our Full Agent Setup
152:13 So I'll go ahead and exit. We're back in our root machine now. We have our
152:18 container up and running and also it's running on port 8055. And so now we can
152:23 go back to Open WebUI at our subdomain and we can set up our pipe. And so I'm
152:30 going to go to the admin panel functions. We don't have a function
152:33 here. So I have to import it. And so what I can do, I'll actually do this
152:36 here: you can literally Google n8n pipe Open WebUI. And it'll
152:40 bring you to the one that I have here. You just have to sign into open web UI.
152:44 I'll click on get. And then this time for my URL instead of being something on
152:49 local host, I'm going to copy my actual subdomain here. So import to open web UI
152:53 and then boom, we have our pipe. So I'll click on save, confirm, and then within
152:59 the valves here, I can set all of my values. So I'll just click on default
153:02 for all of these so I can get a starting point here. And then, yeah, chat input is
153:07 good, output is good, the bearer token is my test token, and then for my URL it's going
153:14 to be http:// and then the name of my service, python-local-ai-agent, on
153:24 port 8055, and then /invoke-python-agent. I believe I
153:32 have this memorized. I think we are good there. So, going back, if I clear this
153:37 and run a docker ps -a, it is indeed called python-local-ai-agent. That is
153:42 the name of our service, so Open WebUI is able to connect to the agent directly
153:46 with this name because we are deploying it in the same Docker network. And so I
153:51 think we are looking good. All right, so I'm going to go ahead and click on save.
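A small sketch of how that URL is assembled. The service name here is my reconstruction of the compose service from the walkthrough:

```python
# On a shared Docker Compose network, containers resolve each other by
# service name, so no localhost or public domain is needed.
SERVICE = "python-local-ai-agent"  # assumed compose service name
PORT = 8055                        # port the agent listens on
ROUTE = "invoke-python-agent"      # API route from the walkthrough

url = f"http://{SERVICE}:{PORT}/{ROUTE}"
print(url)  # http://python-local-ai-agent:8055/invoke-python-agent
```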
153:56 all right and then go back and start a new chat and then also like I said a lot
154:00 of times it helps just to refresh Open WebUI completely.
154:03 All right, there we go. And then now... oh, I have to
154:07 actually enable it. Let me go back to the admin panel. Functions: you have to make
154:11 sure this is ticked on, so that we have the pipe enabled. Now going back
154:15 here and I'll refresh as well. We've got our pipe selected. And now I can say
154:20 hello. And there we go. Super super fast. We got a response from our Python
154:26 agent. Take a look at that. And then also going into my database here, you
154:29 can see that I have all these messages in the n8n_chat_histories table. We'll
154:33 come back to that. All right. And then we can also ask it to do web search. I
154:38 can say, what is the latest LLM from Anthropic, for example. So it has to
154:44 do a quick SearXNG search, leveraging that. The latest is Claude Opus 4. All
154:49 right. And man that was so fast. We have no network delays now because everything
154:53 is running on the same network and we have an absolute killer GPU. So this is
154:56 so cool. Also, one thing that I want to mention is sometimes depending on your
155:02 cloud provider, SearXNG will not start successfully. There's one thing you have
155:05 to do. It's just a really small tidbit. If you run into this issue where the
155:09 SearXNG container is constantly restarting, what you want to do is go to
155:14 your local AI package and then run the command chmod 755 searxng. That's the
155:22 SearXNG folder. And so the searxng folder is responsible for storing the
155:26 configuration that we have for SearXNG by default. Sometimes you don't have
155:29 permissions to write this file and it needs to do so. So I'm going to update
155:33 the troubleshooting to include this. But yeah, just a small tidbit. And then you
155:37 can just go ahead and run the command to start everything again. Obviously,
155:41 you have to go back one directory, then you can run this and restart
155:45 everything. It's that easy to restart things to make changes take effect for your
155:49 package and then you'll be good to go. So yeah, we have everything working
155:54 here. So this is pretty much it for the master class. Now we have our local AI
155:58 package up and running with an agent and the network as well. We're communicating
156:02 to it within Open Web UI directly. There is so much that we have gotten through
156:07 now. So congratulations for making it this far. All right, I'm going to be totally
2:36:08 Additional Resources
33:08 Downloading Quantized LLMs in Ollama
33:11 practical for you, I'm back here in the model list for Qwen 3. We have all these
33:15 models that don't specify a quantization, but we can see that it
33:20 defaults to Q4 because if I click on any one of them, the quantization right here
33:26 is Q4_K_M. And don't worry about the K_M. That's just a way to group
33:30 parameters. You have K_S, K_M, and K_L. It's kind of outside of the scope of
33:33 what really matters for you. The big thing is the Q4 like the actual number
33:38 here. So Q4 quantization is the default for Qwen 3 32B and really any model in
33:44 Ollama. But if we want to see the other quantized variants and we want to run
33:48 them, you can click on the view all. This is available no matter the LLM that
33:52 you're seeing in Ollama. Now we can scroll through and see all the levels of
33:56 quantization for each of the parameter sizes for Qwen 3. So, if I scroll all
34:01 the way down, the absolute biggest version of Qwen 3 that I can run is the
34:08 full 16-bit of the 235 billion parameter model. And it is a whopping 470 GB just
34:14 to install this. And there is no way that you're ever going to lay hands on
34:17 infrastructure to run this unless you're working for a very large enterprise. But
34:22 I can go down here, let's say, to 14 billion parameters and I can run the Q4
34:27 like this. So, you can click on any one that you want to run. Like let's say I
34:30 want to run Q8. I can click on this and then I have the command to pull and run
34:35 this specific quantization of the 14 billion parameter model. So each of the
34:39 quantized variants has a unique ID within Ollama. So you can very
34:42 specifically choose the one that you want. Again my general recommendation is
34:47 just to go with what Ollama also recommends, which is just defaulting to
34:51 Q4. Like if I go to DeepSeek R1, you can see that also defaults to Q4 no matter
34:56 the size that I pick. But if you do want to explore different quantizations, you
35:00 want to try to run the absolute full model for maybe something smaller like 7
35:04 billion or 14 billion, you can definitely do that through a lama and
35:08 really any other provider of local LLMs. So that is everything for quantization.
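The arithmetic behind those download sizes is simple: parameters times bits per weight, divided by 8 bits per byte. A quick sketch (approximate; real model files add a little overhead):

```python
def model_size_gb(params_billion: float, bits: int) -> float:
    """Approximate model size in GB: parameters x bits per weight / 8 bits per byte."""
    return params_billion * bits / 8

# Full 16-bit, 235 billion parameters: matches the ~470 GB figure above.
print(model_size_gb(235, 16))  # 470.0

# Q4 stores each weight in ~4 bits, so a 14B model shrinks to about 7 GB.
print(model_size_gb(14, 4))    # 7.0
```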
35:12 It's important to know how that works, but yes, generally stick with a Q4 of
35:16 Offloading (Splitting LLMs between GPU + CPU)
35:17 the largest LLM that you can run. The next concept that is very important to
35:21 understand is offloading. Offloading is just splitting the layers of your
35:26 large language model between your GPU and your CPU and RAM. It's kind of
35:30 crazy, but large language models don't have to fit entirely in your GPU. All
35:36 large language models can be split into layers, layers of the different weights,
35:40 and you can have some of it running on your GPU. So, it's stored in your VRAM
35:45 and computed by the GPU. And then some of the large language model stored in
35:50 your RAM, computed by the CPU. Now, this does hurt performance a lot. And so,
35:55 generally, you want to avoid offloading if you can. You want to be able to fit
35:59 everything in your GPU, which by the way, the context, like your prompts for
36:04 your local LLMs, that is also stored in VRAM. And so, sometimes you'll see what
36:08 happens when you have very long conversations for a large language model
36:13 that barely fit in your GPU. That'll actually tip it over the edge. So, it
36:16 starts to offload some of it to the CPU and RAM. So keep that in mind when you
36:19 have longer conversations and all of a sudden things get really slow, you know
36:24 that offloading is happening. Sometimes this is necessary though as context
36:28 grows. And if you're only offloading a little bit of the LLM or a little bit of
36:32 the conversation, whatever to the CPU and RAM, it won't affect performance
36:36 that much. And so sometimes if you're trying to squeeze the biggest size you
36:41 can into your machine for an LLM, you can take advantage of offloading to run
36:45 something bigger or have a much larger conversation. Just know that usually it
36:49 kind of sucks. Like when I have offloading start to happen, my machine
36:53 gets bogged down and the responses are a lot slower. It's really not fun, but it
36:59 is possible. And fun fact, by the way, if your GPU is full and your CPU and RAM
37:04 is full, you can actually offload to storage, like literally using your hard
37:07 drive or SSD. That's when it's like incredibly slow and just terrible. But
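The offloading tradeoff can be sketched as a rough calculation. This is a simplification: it ignores the VRAM your context (the KV cache) also consumes, which is exactly why a long conversation can tip a borderline model into offloading:

```python
def gpu_fraction(model_gb: float, vram_gb: float) -> float:
    """Rough fraction of the model's layers that fit in VRAM; the remainder
    is offloaded to system RAM (or, worst case, to disk)."""
    return min(1.0, vram_gb / model_gb)

# A ~20 GB model on a 24 GB card fits entirely, so no offloading:
print(gpu_fraction(20, 24))  # 1.0

# A ~40 GB model on the same card offloads about 40% to CPU and RAM:
print(round(1 - gpu_fraction(40, 24), 2))  # 0.4
```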
37:11 Critical Ollama Configuration
3:23 Dynamous AI Agent Mastery
3:26 interested in mastering more than just local AI, but building entire AI agents
3:31 in a local AI environment or even with cloud AI, definitely check out
3:35 dynamous.ai. This is my community for early AI adopters just like yourself.
3:40 And a big part of this community is the AI agent mastery course where I dive
3:45 super deep into my full process for building AI agents. I'm talking planning
3:50 and prototyping and coding and using AI coding assistants and building full
3:55 frontends for our AI agents and securing things and deploying things. There's a
3:59 lot more coming soon for this course as well. I'm very actively working on it.
4:03 And a big part of this course is the complete agent that I build throughout
4:07 it. I build both with cloud AI and local AI. And so this master class will help
4:11 you get very very comfortable with local AI. But when it comes to building
4:15 complex agents and really getting deep into building out agents, then you
4:19 definitely want to check out the AI agent mastery course here in Dynamous.ai.
4:23 So with that, let's get back to the master class, diving into what local AI
4:28 is all about. Let's start by laying the foundation. What is local AI in the
4:32 first place? Well, very simply put, local AI is running your own large
4:36 language models and infrastructure like your database and your UI entirely on
4:42 your own machine 100% offline. So when you think about when you typically want
4:46 to build an AI agent, you need a large language model, maybe like GPT-4.1
4:51 or Claude 4, and then you need something like your database, like Supabase, and
4:55 you need a way to create a user interface. You have all these different
4:58 components for your agent and typically you're using APIs to access things that
5:03 are hosted on your behalf. But with local AI, we can take all of these
5:07 things completely in our own control running them ourselves. So this is
5:11 possible through open-source large language models and software. So
5:15 everything is running on your own hardware instead of you paying for APIs.
5:20 So we run the large language model ourselves on our own machine instead of
5:25 paying for the OpenAI API for example. And so for large language models, there
5:29 are thousands of different open source large language models available for us
5:32 to use in a lot of different ways. And some of these you've probably heard of
5:37 before, like DeepSeek R1, Qwen 3, Mistral 3.1, Llama 4. These are just a
5:41 few examples of the most popular ones that you've probably heard of
5:44 before. We'll be tinkering around with using some of these in this master
5:48 class. And then we also have open-source software. So all of our infrastructure
5:53 that goes along with our agents and LLMs, things like Ollama for running our
5:58 LLMs, Supabase for our database, n8n for our no/low-code workflow automations,
6:04 and open web UI to have a nice user interface to talk to our agents and
6:07 LLMs. And we'll dive into using all of these as well. Now, because local AI
6:12 means running large language models on our own computer, it's not as easy as
6:18 just going to claude.ai or chatgpt.com and typing in a prompt. We have to
6:22 actually install something, but it still is very easy to get started. So, let me
6:25 show you right now with a hands-on example. So, here we are within the
6:30 website for Ollama. This is ollama.com. I'll have a link to this in the
6:33 description of the video. This is one of the open-source platforms that allows
6:39 us to very easily download and run local large language models. And so, you just
6:42 have to go to their homepage here and click on this nice big download button.
6:45 You can install it for Windows, Mac or Linux. It really works for any operating
6:49 system. Then once you have it up and running on your machine, you can open up
6:53 any terminal. Like I'm on Windows here, so I'm in a PowerShell session and I can
6:58 run Ollama commands now to do things like view the models that I have available on
7:03 my machine. I can download models and I can run them as well. And the way that I
7:08 know how to pull and run specific models is I can just go to this models tab in
7:12 their navigation and I can browse and filter through all of the open source
7:16 LLMs that are available to me like DeepSeek R1. Almost everyone is familiar
7:20 with DeepSeek. It just totally blew up back in February and March. We have
7:25 Gemma 3, Qwen 3, Llama 4, a few of them that I mentioned earlier when we had the
7:30 presentation up. And so we can click into any one of these like I can go into
7:35 DeepSeek R1 for example and then I have the command right here that I can copy
7:39 to download and run this specific model in my terminal. And there are a lot of
7:44 different model variants of DeepSeek R1. So we'll get into different sizes and
7:47 hardware requirements and what that all means in a little bit, but I'll just
7:50 take one of them and run it as an example. So I'll just do a really small
7:54 one right now. I'll do a 1.5 billion parameter large language model. And
7:58 again, I'll explain what that means in a little bit. I can copy this command.
8:01 It's just ollama run and then the unique ID of this large language model. So I'll
8:06 go back into my terminal. I'll clear it here and then paste in this command. And
8:09 so first it's going to have to pull this large language model. And the total size
8:15 for this is 1.1 GB. And so it'll have to download it. And then because I used the
8:20 run command, it will immediately get me into a chat interface with the model
8:24 once it's downloaded. Also, if you don't want to run it right away, you
8:27 just want to install it, you can do ollama pull instead of ollama run. And
8:32 then again, to view the models that you have available to you installed already,
8:36 you can just do the ollama list command like I did earlier. And so, right now,
8:39 I'll pause and come back once it's installed in about 30 seconds. All
8:43 right, it is now installed. And now I can just send in a message like hello.
8:47 And then boom, we are now talking to a large language model. But instead of it
8:50 being hosted somewhere else and we're just using a website, this is running on
8:54 my own infrastructure, the large language model and all the billions of
8:59 parameters are getting loaded onto my graphics card and running the inference.
9:02 That's what it's called when we're generating a response from the LLM
9:06 directly within this terminal here. And so I can ask another question like:
9:13 what is the best GPU right now? We'll see what it says. So it's thinking
9:16 first. This is actually a thinking model. DeepSeek R1 is a reasoning LLM.
9:20 And then it gives us an answer. Its top GPU models today: RTX 3080, RX 6700.
9:27 Obviously, we have a training cutoff for local large language models just like we
9:30 do with ones in the cloud like GPT. And so the information is a little outdated
9:34 here, but yeah, this is a good answer. So we have a large language model that
9:38 we're talking to directly on our machine. And then to close out of this,
9:42 I can just do control D or command D on Mac. And if I do list, we have all the
9:47 other models that you saw earlier, plus now this one that I just installed. So
9:50 these are all available for me to run again just with that ollama run command.
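As a preview of using Ollama from code rather than the terminal: it serves a local REST API on port 11434 by default. A minimal sketch using only the standard library (assumes Ollama is running and the demo model is already pulled; the helper names are mine):

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(model: str, prompt: str) -> bytes:
    # stream=False asks for one JSON object instead of a token-by-token stream
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(model: str, prompt: str) -> str:
    req = request.Request(OLLAMA_URL, data=build_payload(model, prompt),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:  # requires `ollama serve` to be running
        return json.loads(resp.read())["response"]

# Usage (only works with Ollama running locally):
# print(generate("deepseek-r1:1.5b", "hello"))
```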
9:54 And it won't have to reinstall if you already have it installed. Run just
9:58 installs it if you don't have it yet already. So that is just a quick demo of
10:03 using Ollama. We'll dive a lot more into Ollama later, like how we can actually
10:07 use it within our Python code and within our n8n workflows. This is just our
10:11 quick way to try it out within the terminal. Now, to really get into why we
10:15 should care about local AI now that we know what it is, I want to cover the
10:19 pros and cons of local AI and what I like to call cloud AI. That's just when
10:23 you're paying for things to be hosted for you, like using Claude or Gemini or
10:28 using the cloud version of N8N instead of hosting it yourself. And I also want
10:32 to cover the advantages of each because I don't want to sugarcoat things and
10:35 just hype up this master class by telling you that you should always use
10:39 local AI. That is certainly not the case. There is a time and place for both
10:43 of these categories here, but there are so many use cases where local AI is
10:50 absolutely crucial. You have no idea how many businesses I have talked to that
10:54 are willing to put tens of thousands of dollars into running their own LLMs and
10:58 infrastructure because privacy and security is so crucial for the things
11:02 that they're building with AI. And that actually gets into the first advantage
11:06 here of local AI, which is privacy and security. You can run things 100%
11:11 offline. The data that you're giving to your LLMs as prompts, it now doesn't
11:16 leave your hardware. It stays entirely within your own control. And for a lot
11:20 of businesses, that is 100% crucial, especially when they're in highly
11:24 regulated industries like healthcare, finance, even real estate.
11:28 Like there's so many use cases where you're working with intellectual
11:32 property or just really sensitive information. You don't want to be
11:36 sending your data off to an LLM provider like Google or OpenAI or Anthropic. And
11:41 so as a business owner, you should definitely be paying attention to this
11:45 if you are working with automation use cases where you're dealing with any kind
11:48 of sensitive data. And then also if you're a freelancer, you're starting an
11:52 AI automation agency, anything where you're building for other businesses,
11:55 you are going to have so many opportunities open up to you when you're
11:59 able to work with local AI because you can handle those use cases now where
12:03 they need to work with sensitive data and you can't just go and use the OpenAI
12:08 API. And that is the main advantage of local AI. It is a very big deal. But
12:11 there are a few other things that are worth focusing on as well. Starting with
12:16 model fine-tuning, you can take any open-source large language model and
12:20 add additional training on top with your own data. Basically making it a domain
12:24 expert on your business or the problem that you are solving. It's so so
12:29 powerful. You can make models through fine-tuning more powerful than the best
12:33 of the best in the cloud depending on what you are able to fine-tune with
12:38 depending on the data that you have. And you can do fine-tuning with some cloud
12:42 models like with GPT, but your options are pretty limited and it can be quite
12:46 expensive. And so it definitely is a huge advantage to local AI. And local AI
12:52 in general can be very cost-effective, including the infrastructure as well. So
12:56 your LLMs and your infrastructure. You run it all yourself and you pay for
13:00 nothing besides the electricity bill if it's running on your computer at your
13:04 house or if you have some private server in the cloud. You just have to pay for
13:07 that server and that's it. There's no n8n bill, no Supabase bill, no OpenAI
13:13 bill. You can save a lot of money. It's really, really nice. And on top of that,
13:17 when everything is running on your own infrastructure, the agents that you
13:22 create can run on the same server, the same place as your infrastructure. And
13:26 so it can actually be faster because you don't have network delays calling APIs
13:30 for all your different services for your LLMs and your database and things like
13:34 that. And then with that, we can now get into the advantages of cloud AI.
13:38 Starting with it's a lot easier to set up. There's a reason why I have to have
13:42 this master class for you in the first place. There are some initial hurdles
13:47 that we have to jump over to really have everything fully set up for our local
13:51 LLMs and infrastructure. And you just don't have that with cloud AI because
13:54 you can very simply call into these APIs. You just have to sign up and get
13:58 an API key and that's about it. So, it certainly is easier to get up and
14:02 running and there's less maintenance overall because they are hosting things
14:05 for you. Supabase is hosting the database for you. OpenAI is hosting the
14:09 LLM for you. So, you don't have to manage things on your own hardware. With
14:13 Local AI, you have to apply patches and updates if you have a private server in
14:17 the cloud. You have to manage your own hardware if you're running on your own
14:20 computer, making sure that it's on 24/7, if you want your database on 24/7, that
14:24 kind of thing. It's just less maintenance with cloud AI. And then
14:28 probably the biggest advantage of cloud AI overall is that you have better
14:34 models available to you. Claude 4 Sonnet or Opus, for example, is more powerful
14:40 than any local AI that you could run. So we have this gap here and this gap was a
14:45 lot bigger at one point, even a year ago. The best cloud LLMs absolutely crushed
14:50 the best local LLMs, and that gap is starting to diminish. And so I really
14:55 see a future where that gap is diminished entirely and all the best
14:59 local LLMs are actually on par with the best cloud ones. That's the future I
15:03 see. That's why I think that local AI is such
15:07 a big deal because the advantages of local AI, those are just going to get
15:12 more prevalent over time when businesses realize they really want private and
15:15 secure solutions. And then the advantages of cloud AI, I think those
15:19 are actually going to diminish over time. Think about it: minimal setup,
15:24 less maintenance. Well, those advantages are going to go away as we have
15:28 platforms and better instructions and solutions to make the setup and
15:32 maintenance easier for local AI and we have the gap that's continuing to
15:35 diminish between the power of these LLMs. All these advantages are going to
15:39 actually go away and then it'll just completely make sense to use local AI
15:43 honestly probably for like every single solution in the future. That's really
15:47 what I see us heading towards. And then the last advantage to cloud AI which
15:51 also I think will go away over time is that you have some features out of the
15:55 box like you have memory that's built directly into chat GPT. Gemini has web
15:59 search baked in even when you use it through the API like these kind of
16:02 capabilities that are out of the box that you have to implement yourself with
16:07 local AI maybe as tools for your agent and you can definitely do that but it is
16:10 nice that these things are out of the box for cloud AI. So those are the pros
16:15 and cons between the two. I hope that this makes it very clear for you to pick
16:19 right now for your own use case. Should I implement local AI or cloud AI? A lot
16:24 of it comes down to the security and privacy requirements for your use case.
16:28 Now, the next big thing that we need to talk about for local AI is hardware
16:33 requirements. Cuz here's the thing, large language models are very resource
16:39 intensive. You can't just run any LLM on any computer. And the reason for that is
16:43 large language models are made up of billions or even trillions of numbers
16:47 called parameters. And they're all connected together in a web that looks
16:51 kind of like this. This is a very simplified view with just a few
16:54 parameters here. But each of the parameters are nodes and they're
16:58 connected together. The input layer is where our prompt comes in and our prompt
17:02 is fed through all these hidden layers and then we have the output at the end.
17:05 This is the response we get back from the LLM. But like I said, this is a very
17:10 simplified view. GPT-4, for example, like you can see on the right hand side, is
17:15 estimated to have 1.4 trillion parameters. And so, if you want to fit
17:20 an entire large language model into your graphics card, you have to store all of
17:25 these numbers. And even though we can handle gigabytes at a time in our
17:28 graphics cards through what is called VRAM, storing billions or trillions of
17:34 numbers is absolutely insane. And so that's why large language models, you
17:37 actually have to have a pretty good graphics card if you want to run some of
17:42 the best ones. And so looking at Olama here, when we see these different sizes,
17:47 going back to their model list, like 1.5 billion parameters or 27 billion
17:51 parameters, there are different sizes for the local LLMs. Obviously, the
17:56 larger a local LLM that you are running, the more performance you are going to
17:59 get, but you are going to be limited to what you are capable of running with
18:04 your graphics card or your hardware. So, with that in mind, I now want to dive
18:07 into the nitty-gritty details with you so you know exactly the kind of models
18:11 that you can run, the kind of speeds you can expect depending on your hardware.
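To make "billions of parameters" concrete before the size rundown, here's a toy count for a tiny fully-connected net. Real LLMs use transformer layers, but the counting principle, that every connection is a stored number, is the same:

```python
def count_params(layer_sizes):
    # Each layer contributes (inputs x outputs) weights plus one bias per output.
    return sum(i * o + o for i, o in zip(layer_sizes, layer_sizes[1:]))

# A toy net: 4 inputs -> two hidden layers of 8 -> 2 outputs.
print(count_params([4, 8, 8, 2]))  # 130

# GPT-4 is estimated at ~1.4 trillion parameters, roughly ten billion times
# this toy net, and every one of those numbers has to live in memory to run.
```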
18:15 And if you want to invest in new hardware to run local AI, I've got some
18:19 recommendations as well. So there are generally four primary size ranges for
18:25 large language models based on the speed and the power that you are looking for.
18:29 You have models that are around seven or 8 billion parameters. Those are
18:33 generally the smallest that I'd recommend trying to run. There are a lot
18:37 of smaller LLMs available like 1 billion parameters or three billion parameters,
18:41 but I'm so unimpressed when I use those LLMs that I don't really want to focus
18:45 on them here. 7 billion parameters is still tiny compared to the large cloud
18:51 AI models like Claude or GPT, but you can get pretty good results with them
18:55 for just simple chat use cases. And so for these models, assuming a Q4
18:59 quantization, which I'll get into quantization in a little bit, it's
19:02 basically just a way to make the LLM a lot smaller without hurting performance
19:06 that much, a 7 billion parameter model will need about four to 5 GB of VRAM on
19:11 your graphics card. And so if you have something like a 3060 Ti from Nvidia
19:16 with 8 GB of VRAM, you can very comfortably run a 7 billion parameter
19:20 model and you can expect to get very roughly around 25 to 35 tokens per
19:27 second. A token is roughly equivalent to a word. And so your local large language
19:31 model at 7 billion parameters with this graphics card will get about 25 to 35
19:37 words per second out on the screen to you being streamed out. And then if you
19:42 use much more powerful hardware like a 3090 to run a 7 billion parameter model
19:47 then you'll just jack up the speed a lot more. So that's 7 billion or 8 billion
19:51 parameters. Another very common size is something around 14 billion parameters.
19:57 This will take about 8 to 10 GB of VRAM. And so just a couple of options for
20:01 this. You have the 4070 Ti, which is usually 16 GB of VRAM, or you could go as
20:08 low as 12 GB of VRAM with the 3080 Ti. And you could expect to get about 15 to
20:13 25 words per second. And then this is where you start to get into basic tool
20:17 calling. So I find that when you are building with a 7 billion parameter
20:21 model, they don't do tool calling very well. So you can't really build that
20:25 powerful of agents around a 7 billion parameter model. But once you get to
20:29 something around 14 billion parameters, that's when I see agents being able to
20:33 really accept instructions well around tools and system prompts and leveraging
20:37 tools to do things on our behalf. That's when we can really start to use LLMs to
20:42 make things that are agentic. And then the next big category of LLMs
20:47 is somewhere between 30 and 34 billion parameters. You see a lot of LLMs that
20:51 fall in that size range. This will typically need 16 to 20 GB of VRAM.
20:57 And so a 3090 is a really good example of a graphics card that can run this. It
21:03 has 24 GB of VRAM. I actually have two 3090s myself. And I'll have a link to my
21:08 exact PC that I built for running local AI in the description of this video. So
21:12 I have two 3090s, which we'll need in a second for a 70 billion parameter, but
21:16 one is enough for a 32 billion parameter model. And then also Macs with their new
21:22 M4 chips are very powerful with their unified memory architecture. So if you
21:27 get a Mac M4 Pro with 24 GB of unified memory, you can also run 32 billion
21:32 parameter models. Now the speed isn't going to be the best necessarily, and
21:36 again, this does depend a lot on your computer overall, but you can expect
21:40 something around 10 to 15, maybe up to 20 tokens per second. And 32 billion
21:46 parameters is when you really start to see LLMs that are actually pretty
21:50 impressive. Like 7 billion and 14 billion, they are disappointing quite a
21:54 bit. I'll be totally honest. Especially when you try to use them with more
21:58 complicated agentic tasks. 32 billion when you start to get into this range is
22:02 when I'm actually genuinely impressed. I'm like, "Oh, this is actually pretty
22:05 close to the performance of some of the best cloud AI." And then 70 billion
22:10 parameters. This is going to take about 35 to 40 GB of VRAM. For most consumer
22:19 GPUs like 3090s and 4090s, even 5090s, that's not actually enough VRAM. And so
22:22 this is when you have to start to split a large language model across multiple
22:28 GPUs, which solutions like Ollama will actually help you do right out of
22:31 the box. So it's not this insane setup even though it might feel kind of
22:35 daunting like oh I have to split the layers of my LLM between GPUs. It's not
22:38 actually that complicated. And so two 3090s or two 4090s, that will be necessary.
22:44 Or you could have more of an enterprise-grade GPU like an H100. So
22:49 Nvidia has a lot of these non-consumer grade GPUs that have a lot more VRAM to
22:53 handle things like 70 billion parameter models. And the speed won't be the best
22:58 if you're using something like two 3090s, especially because performance is hurt
23:01 when you have to split an LLM between GPUs. You could expect something like 8
23:06 to 12 tokens per second. And this is obviously for when you have the most complex
23:09 agents and you're really trying to match the performance of cloud AI as
23:13 much as possible; that's when you'd want to use a 70 billion parameter model. And
23:16 then if you're investing in hardware to run local AI, I have a couple of quick
23:20 recommendations here. And a lot of this depends on the size of the model that's
23:24 going to be good enough for your use case. And so I'll dive into some
23:28 alternatives for running local AI directly if you want to do testing
23:32 before you buy infrastructure. I'll get into that in a little bit, but
23:35 recommended builds. If you want to spend around $800 to build a PC, I'd recommend
23:41 getting a 4060Ti graphics card and then 32 GB of RAM. If you want to spend
23:47 $2,000, I'd recommend either getting a PC with a 3090 and 64 GB of RAM or
23:53 getting that Mac M4 Pro with 24 GB of unified memory. And then lastly, if you
23:58 want to spend $4,000, which is about what I spent for my PC, then I'd
24:02 recommend getting two 3090 graphics cards, and I got both of mine used for
24:07 around $700 each. Um, and then also getting 128 GB of RAM, or you can get a
24:15 Mac M4 Max with 64 GB of unified memory. So, I wanted to really get into the
24:19 nitty-gritty details there. So, I know I spent a good amount of time diving into
24:22 super specific numbers, but I hope this is really helpful for you. No matter the
24:26 large language model or your hardware, you now know generally where you're at
24:30 for what you can run. So, to go along with that information overload, I want
24:34 to give you some specifics, individual LLMs that you can try right now based on
24:38 the size range that you know will work for your hardware. So, just a couple of
24:42 recommendations here. The first one that I want to focus on is DeepSeek R1. This
24:47 is the most popular local LLM ever. It completely blew up a few months ago. And
24:52 the best part about DeepSeek R1 is they have an option that fits into each of
24:56 the size ranges that I just covered in that chart. So they have a 7 billion
25:00 parameter, 14, 32, and 70. The exact numbers that I mentioned earlier. And
25:05 then there is also the full real version of R1, which is 671 billion parameters.
25:10 I'm sorry though, you probably don't have the hardware to run that unless
25:13 you're spending tens of thousands on your infrastructure. So, probably stick
25:16 with one of these based on your graphics card or if you have a Mac computer, pick
25:19 the one that'll work for you and just try it out. You can click on any one of
25:23 these sizes here. And then here's your command to download and run it. And this
25:28 is defaulting to a Q4 quantization, which is what I was assuming in the
25:31 chart earlier. And again, I will cover what that actually means in a little bit
25:35 here. The other one that I want to focus on here is Qwen 3. This is a lot newer.
25:41 Qwen 3 is so good. And they don't have a 70 billion parameter option, but they do
25:45 have all the other sizes that fit into those ranges that I mentioned
25:49 earlier. Like they've got 8 billion, 14 billion, and 32 billion parameters. And
25:52 the same kind of deal where you click on the size that you want and you've got
25:55 your command to install it here. And this is a reasoning LLM just like
26:01 DeepSeek R1. And then the other one that I want to mention here is Mistral Small.
26:05 I've had really good results with this as well. There are less options here,
26:08 but you've got 22 or 24 billion parameters, which is going to work well
26:12 with a 3090 graphics card or if you have a Mac M4 Pro with 24 GB of unified
26:18 memory. Really, really good model. And then also, there is a version of it that
26:22 is fine-tuned for coding specifically, called Devstral, which is another
26:26 really cool LLM worth checking out as well if you have the hardware to run it.
26:30 So, that is everything for just general recommendations for local LLMs to try
26:34 right now. This is the part of the master class that is going to become
26:38 outdated the fastest because there are new local LLMs coming out every single
26:42 month. I don't really know how long my recommendations will last. But in
26:45 general, you can just go to the model list in Ollama, search through it,
26:49 find one that has a size that works with your graphics card, and just give it
26:52 a shot. You can install it and run it very easily with Ollama. And the other
26:57 thing that I want to mention here is you don't always have to run open-source
27:01 large language models yourself. You can use a platform like OpenRouter. You can
27:05 just go to openrouter.ai, sign up, add in some API credits, and try these
27:10 open-source LLMs yourself. Maybe if you want to see what's powerful enough for
27:15 your agents before you invest in hardware to actually run them yourself.
27:18 And so within OpenRouter, I can just search for Qwen here. And I can go down
27:23 to Qwen and I can go to 32 billion. They have a free offering as well that
27:26 doesn't have the best rate limits. So I'll just go to this one right here,
27:31 Qwen 3 32B. So I can try the model out through OpenRouter. They actually host
27:35 it for me. So it's an open-source, non-local version, but now I can try it
27:39 in my agents to see if this is good. And then if it's good, it's like, okay, now
27:43 I want to buy a 3090 graphics card so that I can install it directly through
27:47 Ollama instead. And so the 32 billion Qwen 3 is exactly what we're seeing here
27:51 in OpenRouter. And there are other platforms like Groq as well where you
27:55 can run these open-source large language models not on your own infrastructure
27:58 if you just want to do some testing beforehand or whatever that might
28:01 be. So I wanted to call that out as an alternative as well. But yeah, that's
28:04 everything for my general recommendations for LLMs to try and use
28:09 in your agents. All right, it is time to take a quick breather. This is
28:12 everything that we've covered already in our master class: what is local AI, why
28:17 we care about it, why it's the future, and hardware requirements. And I really
28:20 wanted to dive deep into this stuff because it sets the stage for everything
28:24 that we do when we actually build agents and deploy our infrastructure. And so
28:28 the last thing that I want to do with you before we really start to get into
28:33 building agents and setting up our package is I want to talk about some of
28:37 the tricky stuff that is usually pretty daunting for anyone getting into local
28:42 AI. I'm talking things like offloading models, quantization, environment
28:47 variables to handle things like flash attention, all the stuff that is really
28:51 important that I want to break down simply for you so you can feel confident
28:55 that you have everything set up right, that you know what goes into using local
29:00 LLMs. The first big concept to focus on here is quantization. And this is
29:04 crucial. It's how we can make large language models a lot smaller so they
29:10 can fit on our GPUs without hurting performance too much. We are lowering
29:14 the model precision here. And what that basically means is we have
29:18 each of our parameters, all of the numbers for our LLMs, at 16 bits
29:23 at the full size, but we can lower the precision of each of those parameters to
29:28 8, 4, or 2 bits. Don't worry if you don't understand the technicalities of
29:31 that. Basically, it comes down to LLMs are just billions of numbers. That's the
29:35 parameters that we already covered. And we can make these numbers less precise
29:40 or smaller without losing much performance. So, we can fit larger LLMs
29:45 within a GPU that normally wouldn't even be close to running the full-size model.
29:50 Like with 32 billion parameter LLMs, for example, I was assuming a Q4
29:55 quantization, like four bits per parameter, in that diagram earlier. If you had the
30:00 full 16-bit parameters for the 32 billion parameter LLM, there's no way it could
30:06 fit on your Mac or your 3090 GPU, but we can use quantization to make it
30:10 possible. It's like rounding a number that has a long decimal to something
30:15 like 10.44 instead of something that has like 10 decimal places, but we're
30:19 doing it for each of the billions of parameters, those numbers that we have.
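To put rough numbers on that: the weights take roughly parameter count times bits per parameter divided by 8. A back-of-the-envelope sketch (real downloads run a bit larger because of file metadata and layers kept at higher precision):

```python
def approx_weight_size_gb(params_billions: float, bits_per_param: int) -> float:
    """Rough size of an LLM's weights in gigabytes: params * bits / 8.

    Ignores metadata and the layers that stay at higher precision, so
    treat this as an estimate, not an exact download size.
    """
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal GB

# A 32B model at full 16-bit precision vs. a Q4 (4-bit) quantization:
full = approx_weight_size_gb(32, 16)  # ~64 GB, far too big for a 24 GB GPU
q4 = approx_weight_size_gb(32, 4)     # ~16 GB, fits on a 3090/4090 with room for context
print(f"32B @ FP16: ~{full:.0f} GB, @ Q4: ~{q4:.0f} GB")
```

This is why a Q4 of a 32B model fits on a single 24 GB card while the FP16 version never could.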
30:23 And so just to give you a visual representation of this, you can also
30:27 quantize images just like you can quantize LLMs. And so we have our full
30:31 scale image on the left-hand side here comparing it to different levels of
30:35 quantization. We have 16-bit, 8-bit, and 4-bit. And you can see that at first with
30:40 a 16-bit quantization, it almost looks the same. But then once we go down to
30:44 4-bit, you can very much see that we have a huge loss in quality for the image.
30:49 Now with images, it's more extreme than with LLMs. When we do an 8-bit or a 4-bit
30:54 quantization of an LLM, we don't actually lose that much performance, whereas we lose a lot
30:58 of quality with images. And so that's why it's so useful for us. And so I have
31:01 a table just to kind of describe what this looks like. So FP16, that's the
31:07 16-bit precision that all LLMs have as a base. That is the full size. The speed
31:11 is obviously going to be very slow because the model is a lot bigger, but
31:16 your quality is perfect compared to what it could be. I mean, obviously that
31:18 doesn't mean that you're going to get perfect answers all the time. I'm just
31:22 saying it's the 100% results from this LLM. And then going down to a Q8
31:28 precision, so it's half the size. The speed is going to be a lot better. And
31:33 the quality is near-perfect. So it's not like performance is cut in half just
31:37 because size is. You still have the same number of parameters. Each one is just a
31:42 bit less precise. And so you're still going to get almost the same results.
31:47 And then going down to a Q4, 4-bit, it's a fourth the size. It's going to be very
31:52 fast compared to 16-bit. And the quality is still going to be great. Now, these
31:57 numbers are very vague on purpose. There's no great way for me to
32:01 quantify exactly the difference, especially because it changes per LLM
32:05 and your hardware and everything like that. So, I'm just being very general
32:09 here. And then once you get to Q2, the size goes down a lot. It's going to
32:13 be very very fast, but usually your performance starts to go down quite a
32:17 bit once you go down to a Q2. And then like the note that I have in the bottom
32:22 left here, a Q4 quantization is generally the best balance. And so when
32:26 you are thinking to yourself, which large language model should I run? What
32:31 size should I use? My rule of thumb is to pick the largest large language model
32:37 that can work with your hardware with a Q4 quantization. That is why I assumed
32:42 that in the table earlier. And then also, like we saw in Ollama earlier, it always
32:47 defaults to a Q4 quantization because the 16-bit is just so big compared to Q4
32:52 that most of the LLMs you couldn't even run yourself. And a Q4 of a 32 billion
32:59 parameter model is still going to be a lot more powerful than the full 7
33:02 billion parameter or 14 billion parameter because you don't actually
33:07 lose that much performance. So that is quantization. So just to make this very
33:11 practical for you, I'm back here in the model list for Qwen 3. We have all these
33:15 models that don't specify a quantization, but we can see that it
33:20 defaults to Q4 because if I click on any one of them, the quantization right here
33:26 is Q4_K_M. And don't worry about the K_M. That's just a way to group
33:30 parameters. You have K_S, K_M, and K_L. It's kind of outside of the scope of
33:33 what really matters for you. The big thing is the Q4 like the actual number
33:38 here. So Q4 quantization is the default for Qwen 3 32B and really any model in
33:44 Ollama. But if we want to see the other quantized variants and we want to run
33:48 them, you can click on the view all. This is available no matter the LLM that
33:52 you're seeing in Ollama. Now we can scroll through and see all the levels of
33:56 quantization for each of the parameter sizes for Qwen 3. So, if I scroll all
34:01 the way down, the absolute biggest version of Qwen 3 that's available is the
34:08 full 16-bit of the 235 billion parameter Qwen 3. And it is a whopping 470 GB just
34:14 to install this. And there is no way that you're ever going to lay hands on
34:17 infrastructure to run this unless you're working for a very large enterprise. But
34:22 I can go down here, let's say, to 14 billion parameters and I can run the Q4
34:27 like this. So, you can click on any one that you want to run. Like let's say I
34:30 want to run Q8. I can click on this and then I have the command to pull and run
34:35 this specific quantization of the 14 billion parameter model. So each of the
34:39 quantized variants has a unique ID within Ollama. So you can very
34:42 specifically choose the one that you want. Again, my general recommendation is
34:47 just to go with what Ollama recommends, which is just defaulting to
34:51 Q4. Like if I go to DeepSeek R1, you can see that also defaults to Q4 no matter
34:56 the size that I pick. But if you do want to explore different quantizations, you
35:00 want to try to run the absolute full model for maybe something smaller like 7
35:04 billion or 14 billion, you can definitely do that through Ollama and
35:08 really any other provider of local LLMs. So that is everything for quantization.
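That rule of thumb (the largest model whose Q4 weights fit in your VRAM) can be sketched as a tiny helper. The 0.8 headroom factor, reserving VRAM for context and overhead, is my own assumption, not an Ollama number:

```python
# Common parameter sizes from this section, in billions.
SIZES_B = [7, 14, 32, 70]

def largest_q4_model(vram_gb: float, headroom: float = 0.8):
    """Largest model (billions of params) whose Q4 weights fit in VRAM.

    Q4 weights take roughly params * 4 bits / 8 = half a byte per
    parameter; the headroom factor reserves room for the context (KV
    cache). Returns None if not even the smallest size fits.
    """
    budget_gb = vram_gb * headroom
    fitting = [s for s in SIZES_B if s * 0.5 <= budget_gb]  # s/2 GB at Q4
    return max(fitting) if fitting else None

print(largest_q4_model(24))  # a 24 GB 3090/4090: 32B at Q4 fits, 70B does not
print(largest_q4_model(8))   # an 8 GB card: 7B fits, 14B is too tight
```

Plug in your own card's VRAM; if the answer disappoints, that's the signal to consider offloading or a smaller quantization.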
35:12 It's important to know how that works, but yes, generally stick with a Q4 of
35:17 the largest LLM that you can run. The next concept that is very important to
35:21 understand is offloading. All offloading is is splitting the layers of your
35:26 large language model between your GPU and your CPU and RAM. It's kind of
35:30 crazy, but large language models don't have to fit entirely in your GPU. All
35:36 large language models can be split into layers, layers of the different weights,
35:40 and you can have some of it running on your GPU. So, it's stored in your VRAM
35:45 and computed by the GPU. And then some of the large language model is stored in
35:50 your RAM, computed by the CPU. Now, this does hurt performance a lot. And so,
35:55 generally, you want to avoid offloading if you can. You want to be able to fit
35:59 everything in your GPU, which by the way, the context, like your prompts for
36:04 your local LLMs, that is also stored in VRAM. And so, sometimes you'll see what
36:08 happens when you have very long conversations with a large language model
36:13 that barely fits in your GPU. That'll actually tip it over the edge. So, it
36:16 starts to offload some of it to the CPU and RAM. So keep that in mind when you
36:19 have longer conversations and all of a sudden things get really slow, you know
36:24 that offloading is happening. Sometimes this is necessary though as context
36:28 grows. And if you're only offloading a little bit of the LLM or a little bit of
36:32 the conversation, whatever to the CPU and RAM, it won't affect performance
36:36 that much. And so sometimes if you're trying to squeeze the biggest size you
36:41 can into your machine for an LLM, you can take advantage of offloading to run
36:45 something bigger or have a much larger conversation. Just know that usually it
36:49 kind of sucks. Like when I have offloading start to happen, my machine
36:53 gets bogged down and the responses are a lot slower. It's really not fun, but it
36:59 is possible. And fun fact, by the way, if your GPU is full and your CPU and RAM
37:04 is full, you can actually offload to storage, like literally using your hard
37:07 drive or SSD. That's when it's like incredibly slow and just terrible. But
37:11 just fun fact, you can actually do that. Now, the very last thing that I want to
37:15 cover before we dive into some code, setting up the local AI package, and
37:21 building out some agents is a few very crucial parameters, environment
37:25 variables for Ollama. So, these are environment variables that you can set
37:28 on your machine just like any other, based on your operating system. And
37:32 Ollama does have an FAQ for setting up some of these things, which I'll link to
37:36 in the description as well. But yeah, these are a bit more technical, so
37:40 people skip past setting this stuff up a lot, but it's actually really, really
37:44 important to make things very efficient when running local LLMs. So the first
37:49 environment variable is flash attention. You want to set this to one or true.
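As a sketch of what this looks like in practice, here are the four settings this section covers, using the variable names from Ollama's documentation. Note these are assumptions to verify against your install: availability of `OLLAMA_KV_CACHE_TYPE` and `OLLAMA_CONTEXT_LENGTH` depends on your Ollama version, and they must be set in the environment of the `ollama serve` process itself, not just in a client script:

```python
import os

# The four Ollama server settings discussed in this section. Set these in
# the environment that launches `ollama serve` (shell profile, systemd
# unit, etc.); values here follow the recommendations in the text.
OLLAMA_SETTINGS = {
    "OLLAMA_FLASH_ATTENTION": "1",    # more efficient attention calculation
    "OLLAMA_KV_CACHE_TYPE": "q8_0",   # quantize the context (KV cache) to 8-bit
    "OLLAMA_CONTEXT_LENGTH": "8192",  # override the tiny default context limit
    "OLLAMA_MAX_LOADED_MODELS": "1",  # how many models may share your GPU
}

for name, value in OLLAMA_SETTINGS.items():
    os.environ[name] = value
    print(f"{name}={value}")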
37:54 When you have this set to true, it's going to make the attention calculation
37:59 a lot more efficient. It sounds fancy, but basically large language models when
38:04 they are generating a response, they have to calculate which parts of your
38:08 prompt to pay the most attention to. That's the calculation. And you can make
38:12 it a lot more efficient without losing much performance at all by setting up
38:16 the flash attention, setting that to true. And then for another optimization,
38:21 just like we can quantize the LLM itself, you can also quantize or
38:27 compress the context. So your system prompt, the tool descriptions, your
38:31 prompt and conversation history, all that context that's being sent to your
38:36 LLM, you can quantize that as well. So Q4 is my general recommendation for
38:41 quantizing LLMs. Q8 is the general recommendation for quantizing the
38:46 context memory. It's a very simplified explanation, but it's really, really
38:50 useful because a long conversation can also take a lot of VRAM, just like a larger
38:55 LLM. And so it's good to compress that. And then the third environment variable,
38:58 this is actually probably the most crucial one to set up for Ollama. There
39:02 is this crazy thing. I don't know why Ollama does it, but by default, they
39:07 limit every single large language model to 2,000 tokens for the context limit,
39:13 which is just tiny compared to, you know, Gemini being 1 million tokens and
39:17 Claude being 200,000 tokens. Like, they handle very, very large prompts. And a
39:21 lot of local large language models can also handle large prompts. But Ollama
39:25 will limit you by default to 2,000 tokens. And so you have to override that
39:30 yourself with this environment variable. And so generally I recommend starting
39:34 with about 8,000 tokens. You can move this all the way up to
39:38 something like 32,000 tokens if your local large language model supports
39:42 that. And if you view the model page on Ollama, you can see the context length
39:46 that's supported by the LLM. But you definitely want to, you know, jack this
39:50 up more from just 2,000 because a lot of times when you have longer
39:53 conversations, you're going to get past 2,000 tokens very, very quickly. So, do
39:57 not miss this. If your large language model is starting to go completely off
40:02 the rails and ignore your system prompt and forget that it has these tools that
40:06 you gave it, it's probably because you reached the context length. And so, just
40:10 keep that in mind. I see people miss this a lot. And then the very last
40:14 environment variable, probably the least important out of all these four,
40:18 but if you're running a lot of different large language models at once and you're
40:22 trying to shove them all in your GPU, a lot of times you can have issues. And so
40:25 in Ollama, you can limit the number of models that are allowed to be in your
40:29 memory at a single time. With this one, typically you want to set this to either
40:33 one or two. Definitely set this to just one if you are using large language
40:37 models that basically fill your GPU, like it's going to fit exactly into
40:41 your VRAM and you're not going to have room for another large language model.
40:44 But if you are running more smaller ones and maybe you could actually fit two on
40:48 your GPU with the VRAM that you have, you can set this to two. So again, more
40:52 technical overall, but it's very important to have these right. And we'll
40:55 get into the local AI package where I already have these set up in the
40:59 configuration. And then by the way, this is the Ollama FAQ that I referenced a
41:02 minute ago that I'll have linked in the description. And so there's actually a
41:06 lot of good things to read into here, like being able to verify that your GPU
41:10 is compatible with Ollama. How can you tell if the model's actually loaded on
41:13 your GPU? So, a lot of like sanity check things that they walk you through in the
41:17 FAQ as well. Also talking about environment variables, which I just
41:20 covered. And so, they've got some instructions here depending on your OS
41:23 how to get those set up. So, if there's anything that's confusing to you, this
41:26 is a very good resource to start with. So, I'm trying to make it possible for
41:30 you to look into things further if there's anything that doesn't quite make
41:33 sense for what I explained here. And of course, always let me know in the
41:35 comments if you have any questions on this stuff as well, especially the more
41:39 technical stuff that I just got to cover because it's so important even though I
41:43 know we really want to dive into the meat of things, which we are actually
41:45 Ollama's OpenAI API Compatibility
41:47 going to do now. All right, here is everything that we have covered at this
41:50 point. And congratulations if you have made it this far because I covered all
41:55 the tricky stuff with quantization and the hardware requirements and offloading
41:59 and some of our little configuration and parameters. So, if you got all of that,
42:03 the rest of it is going to be a walk in the park as we start to dive into code,
42:07 getting all of our local AI set up and building out some agents. You understand
42:10 the foundation now that we're going to build on top of to make some cool stuff.
42:15 And so, now the next thing that we're going to do is talk about how we can use
42:19 local AI anywhere. We're going to dive into OpenAI compatibility and I'll show
42:23 you an example. We can take something that is using OpenAI right now,
42:27 transform it into something that is using Ollama and a local LLM. So, we'll
42:31 actually dive into some code here. And I've got my fair share of no code stuff
42:35 in this master class as well, but I want to focus on both because I think it's
42:38 really important to use both code and no code whenever applicable. And that
42:42 applies to local AI just like building agents in general. So, I've already
42:45 promised a couple of times that I would dive into OpenAI API compatibility, what
42:50 it is, and why it's so important. And we're going to dive into this now so you
42:54 can really start to see how you can take existing agents and transform them into
42:59 being 100% local with local large language models without really having to
43:03 touch the code or your workflow at all. It is a beautiful thing because OpenAI
43:10 has created a standard for exposing large language models through an API.
43:14 It's called the chat completions API. It's kind of like how the Model Context
43:19 Protocol (MCP) is a standard for connecting agents to tools. The chat
43:23 completions API is a standard for exposing large language models over an
43:28 API. So you have this common endpoint along with a few other ones that all of
43:35 these providers implement. This is the way to access the large language model
43:39 to get a response based on some conversation history that you pass in.
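Concretely, that common endpoint is a POST to `{base_url}/v1/chat/completions` with a JSON list of messages. A standard-library sketch pointed at Ollama's default local address (the model tag is an assumption, and the request is built but not sent, since sending needs a running server):

```python
import json
import urllib.request

# Every OpenAI-compatible provider exposes this same endpoint shape;
# only the base URL (and API key) changes between providers.
BASE_URL = "http://localhost:11434/v1"  # Ollama's default local address

payload = {
    "model": "qwen3:32b",  # any model you've pulled locally (assumption)
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
}

request = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        # Ollama ignores the key, but the header is part of the standard.
        "Authorization": "Bearer ollama",
    },
)

# Sending is left commented out so this sketch runs without a live server:
# with urllib.request.urlopen(request) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
print(request.full_url)
```

The response body is equally standardized: the reply text lives at `choices[0].message.content` no matter which provider answered.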
43:43 So, Ollama has implemented this as of February. We have other providers too:
43:49 Gemini is OpenAI compatible, Groq is, OpenRouter (which we saw earlier) is.
43:53 Almost every single provider is OpenAI API compatible. And so, not only is it
43:57 very easy to swap between large language models within a specific provider, it's
44:02 also very easy to swap between providers entirely. You can go from Gemini to
44:09 OpenAI, or OpenAI to Ollama, or OpenAI to Groq, just by changing basically one
44:13 piece of configuration pointing to a different base URL as it is called. So
44:18 you can access that provider and then the actual API endpoint that you hit
44:22 once you are connected to that specific provider is always the exact same and
44:26 the response that you get back is also always the exact same. And so Ollama has
44:31 this implemented now. And I'll link to this article in the description as well
44:33 if you want to read through this because they have a really neat Python example.
44:37 It shows how we create an OpenAI client, and the only thing we have to do
44:42 to connect to Ollama instead of OpenAI is change this base URL. So now we are
44:47 pointing to Ollama, which is hosted locally, instead of pointing to the URL for
44:51 OpenAI. So we'd reach out to them over the internet and talk to their LLMs. And
44:55 then with Ollama, you don't actually need an API key because everything's running
44:58 locally. So you just need some placeholder value here. But there is no
45:02 authentication that is going on. You can set that up. I'm not going to dive into
45:05 that right now. But by default, because it's all just running locally, you don't
45:09 even need an API key to connect to Ollama. And then once we have our OpenAI
45:13 client set up that is actually talking to Ollama, not OpenAI, we can use it in
45:18 exactly the same way. But now we can specify a model that we have downloaded
45:22 locally already through Ollama. We pass in our conversation history in the same way
45:27 and we access the response like the content the AI produced the token usage
45:31 like all those things that we get back from the response in the same way.
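And that one piece of configuration really is just the base URL (plus the API key). A sketch of the swap, using the providers' documented OpenAI-compatible base URLs; the function name and shape here are illustrative, not from any library:

```python
# Swapping providers = swapping one base URL. The endpoint shape and the
# response shape stay identical across OpenAI-compatible providers.
PROVIDERS = {
    "openai": "https://api.openai.com/v1",
    "ollama": "http://localhost:11434/v1",  # local, no real key needed
    "openrouter": "https://openrouter.ai/api/v1",
}

def client_config(provider: str, api_key: str = "ollama") -> dict:
    """Return the kwargs you'd pass to an OpenAI-style client constructor."""
    return {"base_url": PROVIDERS[provider], "api_key": api_key}

print(client_config("ollama"))
# Same agent code, different provider, one changed line:
print(client_config("openrouter", api_key="sk-or-..."))
```

This is exactly why you can prototype against OpenRouter and later point the same agent at a 3090 running Ollama.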
45:34 They've got a JavaScript example as well. They have a couple of examples
45:38 using different frameworks like the Vercel AI SDK and AutoGen. Really any
45:44 AI agent framework can work with OpenAI API compatibility to make it very easy
45:47 to swap between these different providers. Like Pydantic AI, my favorite
45:52 AI agent framework, also supports OpenAI API compatibility. So you can easily
45:57 within your Pydantic AI agents swap between these different providers. And
45:58 OpenAI Compatible Demo
4:26 What is Local AI?
4:28 is all about. Let's start by laying the foundation. What is local AI in the
4:32 first place? Well, very simply put, local AI is running your own large
4:36 language models and infrastructure like your database and your UI entirely on
4:42 your own machine 100% offline. So when you think about when you typically want
4:46 to build an AI agent, you need a large language model, maybe like GPT-4.1
4:51 or Claude 4, and then you need something like your database, like Supabase, and
4:55 you need a way to create a user interface. You have all these different
4:58 components for your agent and typically you're using APIs to access things that
5:03 are hosted on your behalf. But with local AI, we can take all of these
5:07 things completely in our own control running them ourselves. So this is
5:11 possible through open-source large language models and software. So
5:15 everything is running on your own hardware instead of you paying for APIs.
5:20 So we run the large language model ourselves on our own machine instead of
5:25 paying for the OpenAI API for example. And so for large language models, there
5:29 are thousands of different open source large language models available for us
5:32 to use in a lot of different ways. And some of these you've probably heard of
5:37 before, like DeepSeek R1, Qwen 3, Mistral 3.1, Llama 4. These are just a
5:41 couple of examples of the most popular ones that you've probably heard of
5:44 before. We'll be tinkering around with using some of these in this master
5:48 class. And then we also have open-source software. So all of our infrastructure
5:53 that goes along with our agents and LLMs, things like Ollama for running our
5:58 LLMs, Supabase for our database, N8N for our no-code/low-code workflow automations,
6:04 and open web UI to have a nice user interface to talk to our agents and
6:07 LLMs. And we'll dive into using all of these as well. Now, because local AI
6:12 means running large language models on our own computer, it's not as easy as
6:18 just going to claude.ai or chatgpt.com and typing in a prompt. We have to
6:22 actually install something, but it still is very easy to get started. So, let me
6:25 show you right now with a hands-on example. So, here we are within the
6:30 website for Ollama. This is ollama.com. I'll have a link to this in the
6:33 description of the video. This is one of the open-source platforms that allows
6:39 us to very easily download and run local large language models. And so, you just
6:42 have to go to their homepage here and click on this nice big download button.
6:45 You can install it for Windows, Mac or Linux. It really works for any operating
6:49 system. Then once you have it up and running on your machine, you can open up
6:53 any terminal. Like I'm on Windows here, so I'm in a PowerShell session and I can
6:58 run Ollama commands now to do things like view the models that I have available on
7:03 my machine. I can download models and I can run them as well. And the way that I
7:08 know how to pull and run specific models is I can just go to this models tab in
7:12 their navigation and I can browse and filter through all of the open source
7:16 LLMs that are available to me like DeepSeek R1. Almost everyone is familiar
7:20 with DeepSeek. It just totally blew up back in February and March. We have
7:25 Gemma 3, Qwen 3, Llama 4, a few of them that I mentioned earlier when we had the
7:30 presentation up. And so we can click into any one of these like I can go into
7:35 DeepSeek R1 for example and then I have the command right here that I can copy
7:39 to download and run this specific model in my terminal. And there are a lot of
7:44 different model variants of DeepSeek R1. So we'll get into different sizes and
7:47 hardware requirements and what that all means in a little bit, but I'll just
7:50 take one of them and run it as an example. So I'll just do a really small
7:54 one right now. I'll do a 1.5 billion parameter large language model. And
7:58 again, I'll explain what that means in a little bit. I can copy this command.
8:01 It's just ollama run and then the unique ID of this large language model. So I'll
8:06 go back into my terminal. I'll clear it here and then paste in this command. And
8:09 so first it's going to have to pull this large language model. And the total size
8:15 for this is 1.1 GB. And so it'll have to download it. And then because I used the
8:20 run command, it will immediately get me into a chat interface with the model
8:24 once it's downloaded. Also, if you don't want to run it right away, you
8:27 just want to install it, you can do ollama pull instead of ollama run. And
8:32 then again, to view the models that you have installed already,
8:36 you can just do the ollama list command like I did earlier. And so, right now,
8:39 I'll pause and come back once it's installed in about 30 seconds. All
8:43 right, it is now installed. And now I can just send in a message like hello.
8:47 And then boom, we are now talking to a large language model. But instead of it
8:50 being hosted somewhere else and we're just using a website, this is running on
8:54 my own infrastructure, the large language model and all the billions of
8:59 parameters are getting loaded onto my graphics card and running the inference.
9:02 That's what it's called when we're generating a response from the LLM
9:06 directly within this terminal here. And so I can ask another question like um
9:13 what is the best GPU right now? We'll see what it says. So it's thinking
9:16 first. This is actually a thinking model. DeepSeek R1 is a reasoning LLM.
9:20 And then it gives us an answer: its top GPU models today, the 3080, the RX 6700.
9:27 Obviously, we have a training cutoff for local large language models just like we
9:30 do with ones in the cloud like GPT. And so the information is a little outdated
9:34 here, but yeah, this is a good answer. So we have a large language model that
9:38 we're talking to directly on our machine. And then to close out of this,
9:42 I can just do Ctrl+D, or Cmd+D on Mac. And if I do list, we have all the
9:47 other models that you saw earlier, plus now this one that I just installed. So
9:50 these are all available for me to run again just with that ollama run command.
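As a quick reference, the three commands from this demo look like this (the deepseek-r1:1.5b tag is the 1.5 billion parameter variant pulled above; check the Ollama model library for current tags):

```shell
# Download a model without starting a chat session
ollama pull deepseek-r1:1.5b

# Download (if needed) and open an interactive chat in the terminal
ollama run deepseek-r1:1.5b

# List every model installed locally
ollama list
```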
9:54 And it won't have to reinstall if you already have it installed. Run just
9:58 downloads it if you don't have it yet. So that is just a quick demo of
10:03 using Ollama. We'll dive a lot more into Ollama later, like how we can actually
10:07 use it within our Python code and within our N8N workflows. This is just our
10:11 quick way to try it out within the terminal. Now, to really get into why we
10:15 should care about local AI now that we know what it is, I want to cover the
10:19 pros and cons of local AI and what I like to call cloud AI. That's just when
10:23 you're paying for things to be hosted for you, like using Claude or Gemini or
10:28 using the cloud version of N8N instead of hosting it yourself. And I also want
10:32 to cover the advantages of each because I don't want to sugarcoat things and
10:35 just hype up this master class by telling you that you should always use
10:39 local AI. That is certainly not the case. There is a time and place for both
10:43 of these categories here, but there are so many use cases where local AI is
10:50 absolutely crucial. You have no idea how many businesses I have talked to that
10:54 are willing to put tens of thousands of dollars into running their own LLMs and
10:58 infrastructure because privacy and security is so crucial for the things
11:02 that they're building with AI. And that actually gets into the first advantage
11:06 here of local AI, which is privacy and security. You can run things 100%
11:11 offline. The data that you're giving to your LLMs as prompts, it now doesn't
11:16 leave your hardware. It stays entirely within your own control. And for a lot
11:20 of businesses, that is 100% crucial, especially when they're in highly
11:24 regulated industries like healthcare, finance, even real estate.
11:28 Like there's so many use cases where you're working with intellectual
11:32 property or just really sensitive information. You don't want to be
11:36 sending your data off to an LLM provider like Google or OpenAI or Anthropic. And
11:41 so as a business owner, you should definitely be paying attention to this
11:45 if you are working with automation use cases where you're dealing with any kind
11:48 of sensitive data. And then also if you're a freelancer, you're starting an
11:52 AI automation agency, anything where you're building for other businesses,
11:55 you are going to have so many opportunities open up to you when you're
11:59 able to work with local AI because you can handle those use cases now where
12:03 they need to work with sensitive data and you can't just go and use the OpenAI
12:08 API. And that is the main advantage of local AI. It is a very big deal. But
12:11 there are a few other things that are worth focusing on as well. Starting with
12:16 model fine-tuning, you can take any open-source large language model and
12:20 add additional training on top with your own data. Basically making it a domain
12:24 expert on your business or the problem that you are solving. It's so so
12:29 powerful. You can make models through fine-tuning more powerful than the best
12:33 of the best in the cloud depending on what you are able to fine-tune with
12:38 depending on the data that you have. And you can do fine-tuning with some cloud
12:42 models like with GPT, but your options are pretty limited and it can be quite
12:46 expensive. And so it definitely is a huge advantage to local AI. And local AI
12:52 in general can be very cost-effective, including the infrastructure as well. So
12:56 your LLMs and your infrastructure. You run it all yourself and you pay for
13:00 nothing besides the electricity bill if it's running on your computer at your
13:04 house or if you have some private server in the cloud. You just have to pay for
13:07 that server and that's it. There's no N8N bill, no Supabase bill, no OpenAI
13:13 bill. You can save a lot of money. It's really, really nice. And on top of that,
13:17 when everything is running on your own infrastructure, the agents that you
13:22 create can run on the same server, the same place as your infrastructure. And
13:26 so it can actually be faster because you don't have network delays calling APIs
13:30 for all your different services for your LLMs and your database and things like
13:34 that. And then with that, we can now get into the advantages of cloud AI.
13:38 Starting with it's a lot easier to set up. There's a reason why I have to have
13:42 this master class for you in the first place. There are some initial hurdles
13:47 that we have to jump over to really have everything fully set up for our local
13:51 LLMs and infrastructure. And you just don't have that with cloud AI because
13:54 you can very simply call into these APIs. You just have to sign up and get
13:58 an API key and that's about it. So, it certainly is easier to get up and
14:02 running and there's less maintenance overall because they are hosting things
14:05 for you. Supabase is hosting the database for you. OpenAI is hosting the
14:09 LLM for you. So, you don't have to manage things on your own hardware. With
14:13 Local AI, you have to apply patches and updates if you have a private server in
14:17 the cloud. You have to manage your own hardware if you're running on your own
14:20 computer, making sure that it's on 24/7, if you want your database on 24/7, that
14:24 kind of thing. It's just less maintenance with cloud AI. And then
14:28 probably the biggest advantage of cloud AI overall is that you have better
14:34 models available to you. Claude 4 Sonnet or Opus, for example, is more powerful
14:40 than any local AI that you could run. So we have this gap here, and this gap was a
14:45 lot bigger at one point even a year ago. The best cloud LLMs absolutely crushed
14:50 the best local LLMs, and that gap is starting to diminish. And so I really
14:55 see a future where that gap is diminished entirely and all the best
14:59 local LLMs are actually on par with the best cloud ones. That's the future I
15:03 see. That's why I think that local AI is such
15:07 a big deal because the advantages of local AI, those are just going to get
15:12 more prevalent over time when businesses realize they really want private and
15:15 secure solutions. And then the advantages of cloud AI, I think those
15:19 are actually going to diminish over time. That's the key: minimal setup,
15:24 less maintenance. Well, those advantages are going to go away as we have
15:28 platforms and better instructions and solutions to make the setup and
15:32 maintenance easier for local AI and we have the gap that's continuing to
15:35 diminish between the power of these LLMs. All these advantages are going to
15:39 actually go away and then it'll just completely make sense to use local AI
15:43 honestly probably for like every single solution in the future. That's really
15:47 what I see us heading towards. And then the last advantage to cloud AI which
15:51 also I think will go away over time is that you have some features out of the
15:55 box like you have memory that's built directly into ChatGPT. Gemini has web
15:59 search baked in even when you use it through the API like these kind of
16:02 capabilities that are out of the box that you have to implement yourself with
16:07 local AI maybe as tools for your agent and you can definitely do that but it is
16:10 nice that these things are out of the box for cloud AI. So those are the pros
16:15 and cons between the two. I hope that this makes it very clear for you to pick
16:19 right now for your own use case. Should I implement local AI or cloud AI? A lot
16:24 of it comes down to the security and privacy requirements for your use case.
16:28 Now, the next big thing that we need to talk about for local AI is hardware
16:33 requirements. Cuz here's the thing, large language models are very resource-
16:39 intensive. You can't just run any LLM on any computer. And the reason for that is
16:43 large language models are made up of billions or even trillions of numbers
16:47 called parameters. And they're all connected together in a web that looks
16:51 kind of like this. This is a very simplified view with just a few
16:54 parameters here. But each of the parameters are nodes and they're
16:58 connected together. The input layer is where our prompt comes in and our prompt
17:02 is fed through all these hidden layers and then we have the output at the end.
17:05 This is the response we get back from the LLM. But like I said, this is a very
17:10 simplified view. GPT-4, for example, like you can see on the right-hand side, is
17:15 estimated to have 1.4 trillion parameters. And so, if you want to fit
17:20 an entire large language model into your graphics card, you have to store all of
17:25 these numbers. And even though we can handle gigabytes at a time in our
17:28 graphics cards through what is called VRAM, storing billions or trillions of
17:34 numbers is absolutely insane. And so that's why large language models, you
17:37 actually have to have a pretty good graphics card if you want to run some of
17:42 the best ones. And so looking at Ollama here, when we see these different sizes,
17:47 going back to their model list, like 1.5 billion parameters or 27 billion
17:51 parameters, there are different sizes for the local LLMs. Obviously, the
17:56 larger a local LLM that you are running, the more performance you are going to
17:59 get, but you are going to be limited to what you are capable of running with
18:04 your graphics card or your hardware. So, with that in mind, I now want to dive
18:07 into the nitty-gritty details with you so you know exactly the kind of models
18:11 that you can run, the kind of speeds you can expect depending on your hardware.
18:15 And if you want to invest in new hardware to run local AI, I've got some
18:19 recommendations as well. So there are generally four primary size ranges for
18:25 large language models based on the speed and the power that you are looking for.
18:29 You have models that are around seven or 8 billion parameters. Those are
18:33 generally the smallest that I'd recommend trying to run. There are a lot
18:37 of smaller LLMs available like 1 billion parameters or three billion parameters,
18:41 but I'm so unimpressed when I use those LLMs that I don't really want to focus
18:45 on them here. 7 billion parameters is still tiny compared to the large cloud
18:51 AI models like Claude or GPT, but you can get pretty good results with them
18:55 for just simple chat use cases. And so for these models, assuming a Q4
18:59 quantization, which I'll get into quantization in a little bit, it's
19:02 basically just a way to make the LLM a lot smaller without hurting performance
19:06 that much, a 7 billion parameter model will need about 4 to 5 GB of VRAM on
19:11 your graphics card. And so if you have something like a 3060 Ti from Nvidia
19:16 with 8 GB of VRAM, you can very comfortably run a 7 billion parameter
19:20 model and you can expect to get very roughly around 25 to 35 tokens per
19:27 second. A token is roughly equivalent to a word. And so your local large language
19:31 model at 7 billion parameters with this graphics card will get about 25 to 35
19:37 words per second out on the screen to you being streamed out. And then if you
19:42 use much more powerful hardware like a 3090 to run a 7 billion parameter model
19:47 then you'll just jack up the speed a lot more. So that's 7 billion or 8 billion
19:51 parameters. Another very common size is something around 14 billion parameters.
19:57 This will take about 8 to 10 GB of VRAM. And so just a couple of options for
20:01 this. You have the 4070 Ti Super, which has 16 GB of VRAM, or you could go as
20:08 low as 12 GB of VRAM with the 3080 Ti. And you could expect to get about 15 to
20:13 25 words per second. And then this is where you start to get into basic tool
20:17 calling. So I find that when you are building with a 7 billion parameter
20:21 model, they don't do tool calling very well. So you can't really build that
20:25 powerful of agents around a 7 billion parameter model. But once you get to
20:29 something around 14 billion parameters, that's when I see agents being able to
20:33 really accept instructions well around tools and system prompts and leveraging
20:37 tools to do things on our behalf. That's when we can really start to use LLMs to
20:42 make things that are agentic. And then the next big category of LLMs
20:47 is somewhere between 30 and 34 billion parameters. You see a lot of LLMs that
20:51 fall in that size range. This will typically need 16 to 20 GB of VRAM.
20:57 And so a 3090 is a really good example of a graphics card that can run this. It
21:03 has 24 GB of VRAM. I actually have two 3090s myself. And I'll have a link to my
21:08 exact PC that I built for running local AI in the description of this video. So
21:12 I have two 3090s, which we'll need in a second for a 70 billion parameter model, but
21:16 one is enough for a 32 billion parameter model. And then also Macs with their new
21:22 M4 chips are very powerful with their unified memory architecture. So if you
21:27 get a Mac M4 Pro with 24 GB of unified memory, you can also run 32 billion
21:32 parameter models. Now the speed isn't going to be the best necessarily, and
21:36 again, this does depend a lot on your computer overall, but you can expect
21:40 something around 10 to 15, maybe up to 20 tokens per second. And 32 billion
21:46 parameters is when you really start to see LLMs that are actually pretty
21:50 impressive. Like 7 billion and 14 billion, they are disappointing quite a
21:54 bit. I'll be totally honest. Especially when you try to use them with more
21:58 complicated agentic tasks. 32 billion when you start to get into this range is
22:02 when I'm actually genuinely impressed. I'm like, "Oh, this is actually pretty
22:05 close to the performance of some of the best cloud AI." And then 70 billion
22:10 parameters. This is going to take about 35 to 40 GB of VRAM. For most consumer
22:19 GPUs like 3090s and 4090s, even 5090s, that's not actually enough VRAM. And so
22:22 this is when you have to start to split a large language model across multiple
22:28 GPUs, which solutions like Ollama will actually help you do right out of
22:31 the box. So it's not this insane setup even though it might feel kind of
22:35 daunting like oh I have to split the layers of my LLM between GPUs. It's not
22:38 actually that complicated. And so two 3090s or two 4090s, that will be necessary.
22:44 Or you could have more of an enterprise-grade GPU like an H100. So
22:49 Nvidia has a lot of these non-consumer-grade GPUs that have a lot more VRAM to
22:53 handle things like 70 billion parameter models. And the speed won't be the best
22:58 if you're using something like two 3090s, especially because performance is hurt
23:01 when you have to split an LLM between GPUs. You could expect something like 8
23:06 to 12 tokens per second. And obviously, if you have the most complex
23:09 agents where you're really trying to match the performance of cloud AI as
23:13 much as possible, that's when you'd want to use a 70 billion parameter model. And
23:16 then if you're investing in hardware to run local AI, I have a couple of quick
23:20 recommendations here. And a lot of this depends on the size of the model that's
23:24 going to be good enough for your use case. And so I'll dive into some
23:28 alternatives for running local AI directly if you want to do testing
23:32 before you buy infrastructure. I'll get into that in a little bit, but
23:35 recommended builds. If you want to spend around $800 to build a PC, I'd recommend
23:41 getting a 4060 Ti graphics card and then 32 GB of RAM. If you want to spend
23:47 $2,000, I'd recommend either getting a PC with a 3090 and 64 GB of RAM or
23:53 getting that Mac M4 Pro with 24 GB of unified memory. And then lastly, if you
23:58 want to spend $4,000, which is about what I spent for my PC, then I'd
24:02 recommend getting two 3090 graphics cards, and I got both of mine used for
24:07 around $700 each. And then also getting 128 GB of RAM, or you can get a
24:15 Mac M4 Max with 64 GB of unified memory. So, I wanted to really get into the
24:19 nitty-gritty details there. So, I know I spent a good amount of time diving into
24:22 super specific numbers, but I hope this is really helpful for you. No matter the
24:26 large language model or your hardware, you now know generally where you're at
24:30 for what you can run. So, to go along with that information overload, I want
24:34 to give you some specifics, individual LLMs that you can try right now based on
24:38 the size range that you know will work for your hardware. So, just a couple of
24:42 recommendations here. The first one that I want to focus on is Deepseek R1. This
24:47 is the most popular local LLM ever. It completely blew up a few months ago. And
24:52 the best part about DeepSeek R1 is they have an option that fits into each of
24:56 the size ranges that I just covered in that chart. So they have a 7 billion
25:00 parameter, 14, 32, and 70. The exact numbers that I mentioned earlier. And
25:05 then there is also the full real version of R1, which is 671 billion parameters.
25:10 I'm sorry though, you probably don't have the hardware to run that unless
25:13 you're spending tens of thousands on your infrastructure. So, probably stick
25:16 with one of these based on your graphics card or if you have a Mac computer, pick
25:19 the one that'll work for you and just try it out. You can click on any one of
25:23 these sizes here. And then here's your command to download and run it. And this
25:28 is defaulting to a Q4 quantization, which is what I was assuming in the
25:31 chart earlier. And again, I will cover what that actually means in a little bit
25:35 here. The other one that I want to focus on here is Qwen 3. This is a lot newer.
25:41 Qwen 3 is so good. And they don't have a 70 billion parameter option, but they do
25:45 have all the other sizes that fit into those ranges that I mentioned
25:49 earlier. Like they've got 8 billion, 14 billion, and 32 billion parameters. And
25:52 the same kind of deal where you click on the size that you want and you've got
25:55 your command to install it here. And this is a reasoning LLM just like
26:01 DeepSeek R1. And then the other one that I want to mention here is Mistral Small.
26:05 I've had really good results with this as well. There are fewer options here,
26:08 but you've got 22 or 24 billion parameters, which is going to work well
26:12 with a 3090 graphics card or if you have a Mac M4 Pro with 24 GB of unified
26:18 memory. Really, really good model. And then also, there is a version of it that
26:22 is fine-tuned for coding specifically called Devstral, which is another
26:26 really cool LLM worth checking out as well if you have the hardware to run it.
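As a hedged sketch, the Ollama pulls for these recommendations look roughly like this. The exact tags are assumptions based on Ollama's usual naming and can change as new releases land, so confirm each tag on the model's page in the Ollama library first:

```shell
# DeepSeek R1 distills at the sizes discussed above
ollama pull deepseek-r1:14b
ollama pull deepseek-r1:32b

# Qwen 3 at comparable sizes
ollama pull qwen3:14b
ollama pull qwen3:32b

# Mistral Small, and its coding-focused sibling Devstral
ollama pull mistral-small
ollama pull devstral
```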
26:30 So, that is everything for just general recommendations for local LLMs to try
26:34 right now. This is the part of the master class that is going to become
26:38 outdated the fastest because there are new local LLMs coming out every single
26:42 month. I don't really know how long my recommendations will last for. But in
26:45 general, you can just go to the model list in Ollama, find one
26:49 that has a size that works with your graphics card and just give it
26:52 a shot. You can install it and run it very easily with Ollama. And the other
27:01 thing that I want to mention here is you don't always have to run open-source
27:05 large language models yourself. You can use a platform like OpenRouter. You can
27:05 just go to openrouter.ai, sign up, add in some API credits. You can try these
27:10 open-source LLMs yourself. Maybe if you want to see what's powerful enough for
27:15 your agents before you invest in hardware to actually run them yourself.
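OpenRouter exposes an OpenAI-compatible chat completions endpoint, so trying a hosted open-source model from code can look roughly like this sketch. The endpoint URL and the `qwen/qwen3-32b` model slug are assumptions based on OpenRouter's conventions; check the model's page for the exact slug. The sketch only builds the request so the shape is visible; actually sending it is one HTTP POST away.

```python
import json

# Hypothetical OpenRouter chat request (OpenAI-compatible schema).
API_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_chat_request(prompt: str,
                       model: str = "qwen/qwen3-32b",
                       api_key: str = "YOUR_OPENROUTER_KEY"):
    """Return (headers, body) for a chat completion request."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    return headers, body

headers, body = build_chat_request("What is the best GPU right now?")
```

If the hosted model holds up in your agent, the same prompts can then move to the locally installed version with no change to the messages you send.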
27:18 And so within OpenRouter, I can just search for Qwen here. And I can go down
27:23 to Qwen and I can go to 32 billion. They have a free offering as well that
27:26 doesn't have the best rate limits. So I'll just go to this one right here,
27:31 Qwen 3 32B. So I can try the model out through OpenRouter. They actually host
27:35 it for me. So it's an open-source, non-local version, but now I can try it
27:39 in my agents to see if this is good. And then if it's good, it's like, okay, now
27:43 I want to buy a 3090 graphics card so that I can install it directly through
27:47 Ollama instead. And so the 32 billion Qwen 3 is exactly what we're seeing here
27:51 in OpenRouter. And there are other platforms like Groq as well where you
27:55 can run these open-source large language models not on your own infrastructure
27:58 if you just want to do some testing beforehand or whatever that might
28:01 be. So I wanted to call that out as an alternative as well. But yeah, that's
28:04 everything for my general recommendations for LLMs to try and use
28:09 in your agents. All right, it is time to take a quick breather. This is
28:12 everything that we've covered already in our master class. What is local AI? Why
28:17 we care about it? Why it's the future and hardware requirements. And I really
28:20 wanted to dive deep into this stuff because it sets the stage for everything
28:24 that we do when we actually build agents and deploy our infrastructure. And so
28:28 the last thing that I want to do with you before we really start to get into
28:33 building agents and setting up our package is I want to talk about some of
28:37 the tricky stuff that is usually pretty daunting for anyone getting into local
28:42 AI. I'm talking things like offloading models, quantization, environment
28:47 variables to handle things like flash attention, all the stuff that is really
28:51 important that I want to break down simply for you so you can feel confident
28:55 that you have everything set up right, that you know what goes into using local
29:00 LLMs. The first big concept to focus on here is quantization. And this is
29:04 crucial. It's how we can make large language models a lot smaller so they
29:10 can fit on our GPUs without hurting performance too much. We are lowering
29:14 the model precision here. And so basically what that means is we have
29:18 each of our parameters, all of our numbers for our LLMs that are 16 bits
29:23 with the full size, but we can lower the precision of each of those parameters to
29:28 8, 4, or 2 bits. Don't worry if you don't understand the technicalities of
29:31 that. Basically, it comes down to LLMs are just billions of numbers. That's the
29:35 parameters that we already covered. And we can make these numbers less precise
29:40 or smaller without losing much performance. So, we can fit larger LLMs
29:45 within a GPU that normally wouldn't even be close to running the full-size model.
29:50 Like with 32 billion parameter LLMs, for example, I was assuming a Q4
29:55 quantization, like 4 bits per parameter, in that chart earlier. If you had the
30:00 full 16-bit parameters for the 32 billion parameter LLM, there's no way it could
30:06 fit on your Mac or your 3090 GPU, but we can use quantization to make it
30:10 possible. It's like rounding a number that has a long decimal to something
30:15 like 10.44 instead of this thing that has like 10 decimal places, but we're
30:19 doing it for each of the billions of parameters, those numbers that we have.
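That rounding analogy can be made concrete with a toy sketch. This is a simplified symmetric scheme for illustration, not the exact math behind Ollama's Q4 variants:

```python
# Toy 4-bit symmetric quantization: squeeze each weight into one of
# the integer levels -7..7, storing only the level plus one shared scale.
def quantize_q4(weights):
    scale = max(abs(w) for w in weights) / 7  # map the largest weight to level 7
    levels = [round(w / scale) for w in weights]    # the 4-bit integers we'd store
    dequantized = [lvl * scale for lvl in levels]   # reconstructed weights
    return levels, dequantized, scale

weights = [0.1234, -0.9876, 0.5555, -0.3141]
levels, approx, scale = quantize_q4(weights)
# Every reconstructed weight lands within half a quantization step of the
# original, so behavior barely changes while each number shrinks to 4 bits.
```

Real quantizers do this per block of weights with per-block scales (that's what the K_S/K_M/K_L grouping refers to), but the core idea is exactly this rounding.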
30:23 And so just to give you a visual representation of this, you can also
30:27 quantize images just like you can quantize LLMs. And so we have our full
30:31 scale image on the left-hand side here, comparing it to different levels of
30:35 quantization. We have 16-bit, 8-bit, and 4-bit. And you can see that at first with
30:40 a 16-bit quantization, it almost looks the same. But then once we go down to
30:44 4-bit, you can very much see that we have a huge loss in quality for the image.
30:49 Now with images, it's more extreme than LLMs. When we do an 8-bit or a 4-bit
30:54 quantization of an LLM, we don't actually lose that much performance, whereas we lose a lot
30:58 of quality with images. And so that's why it's so useful for us. And so I have
31:01 a table just to kind of describe what this looks like. So FP16, that's the
31:07 16-bit precision that all LLMs have as a base. That is the full size. The speed
31:11 is obviously going to be very slow because the model is a lot bigger, but
31:16 your quality is perfect compared to what it could be. I mean, obviously that
31:18 doesn't mean that you're going to get perfect answers all the time. I'm just
31:22 saying it's the 100% results from this LLM. And then going down to a Q8
31:28 precision, so it's half the size. The speed is going to be a lot better. And
31:33 the quality is near-perfect. So it's not like performance is cut in half just
31:37 because size is. You still have the same number of parameters. Each one is just a
31:42 bit less precise. And so you're still going to get almost the same results.
31:47 And then going down to a Q4, 4-bit, it's a fourth the size. It's going to be very
31:52 fast compared to 16 bit. And the quality is still going to be great. Now, these
31:57 numbers are very vague on purpose. There's not really a way for me to quantify
32:01 exactly the difference, especially because it changes per LLM
32:05 and your hardware and everything like that. So, I'm just being very general
32:09 here. And then once you get to Q2, the size goes down a lot. It's going to
32:13 be very very fast, but usually your performance starts to go down quite a
32:17 bit once you go down to a Q2. And then like the note that I have in the bottom
32:22 left here, a Q4 quantization is generally the best balance. And so when
32:26 you are thinking to yourself, which large language model should I run? What
32:31 size should I use? My rule of thumb is to pick the largest large language model
32:37 that can work with your hardware with a Q4 quantization. That is why I assumed
32:42 that in the table earlier. And then also, like we saw in Ollama earlier, it always
32:47 defaults to a Q4 quantization because the 16-bit is just so big compared to Q4
32:52 that most of the LLMs you couldn't even run yourself. And a Q4 of a 32 billion
32:59 parameter model is still going to be a lot more powerful than the full 7
33:02 billion or 14 billion parameter model because you don't actually
33:07 lose that much performance. So that is quantization. So just to make this very
33:11 practical for you, I'm back here in the model list for Qwen 3. We have all these
33:15 models that don't specify a quantization, but we can see that it
33:20 defaults to Q4 because if I click on any one of them, the quantization right here
33:26 is a Q4_K_M. And don't worry about the K_M. That's just a way to group
33:30 parameters. You have K_S, K_M, and K_L. It's kind of outside of the scope of
33:33 what really matters for you. The big thing is the Q4 like the actual number
33:38 here. So Q4 quantization is the default for Qwen 3 32B and really any model in
33:44 Ollama. But if we want to see the other quantized variants and we want to run
33:48 them, you can click on the view all. This is available no matter the LLM that
33:52 you're seeing in Ollama. Now we can scroll through and see all the levels of
33:56 quantization for each of the parameter sizes for Qwen 3. So, if I scroll all
34:01 the way down, the absolute biggest version of Qwen 3 that I can get is the
34:08 full 16-bit of the 235 billion parameter Qwen 3. And it is a whopping 470 GB just
34:14 to install this. And there is no way that you're ever going to lay hands on
34:17 infrastructure to run this unless you're working for a very large enterprise. But
34:22 I can go down here, let's say, to 14 billion parameters and I can run the Q4
34:27 like this. So, you can click on any one that you want to run. Like let's say I
34:30 want to run Q8. I can click on this and then I have the command to pull and run
34:35 this specific quantization of the 14 billion parameter model. So each of the
34:39 quantized variants has a unique ID within Ollama. So you can very
34:42 specifically choose the one that you want. Again my general recommendation is
34:47 just to go with what Ollama recommends, which is defaulting to
34:51 Q4. Like if I go to DeepSeek R1, you can see that also defaults to Q4 no matter
34:56 the size that I pick. But if you do want to explore different quantizations, you
35:00 want to try to run the absolute full model for maybe something smaller like 7
35:04 billion or 14 billion, you can definitely do that through Ollama and
35:08 really any other provider of local LLMs. So that is everything for quantization.
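A back-of-the-envelope way to turn these quantization levels into sizes is parameters times bits per parameter. This counts weights only; real downloads and VRAM use run somewhat higher because of K_M-style mixed precision, the KV cache, and runtime overhead, so treat it as a floor:

```python
# Rough weight size: parameters (billions) x bits-per-parameter / 8 bits-per-byte.
def weight_size_gb(params_billion: float, bits: int) -> float:
    return params_billion * bits / 8

# The full 16-bit 32B model vs. its Q4 version:
fp16 = weight_size_gb(32, 16)  # 64.0 GB - no consumer GPU holds this
q4 = weight_size_gb(32, 4)     # 16.0 GB - fits a 24 GB 3090 with room for context
q4_7b = weight_size_gb(7, 4)   # 3.5 GB - consistent with the ~4-5 GB VRAM figure
                               #          from earlier once overhead is added
```

Going from 16-bit to Q4 divides the footprint by four, which is exactly why Ollama defaults its tags to Q4.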
35:12 It's important to know how that works, but yes, generally stick with a Q4 of
35:17 the largest LLM that you can run. The next concept that is very important to
35:21 understand is offloading. All offloading is, is splitting the layers of your
35:26 large language model between your GPU and your CPU and RAM. It's kind of
35:30 crazy, but large language models don't have to fit entirely in your GPU. All
35:36 large language models can be split into layers, layers of the different weights,
35:40 and you can have some of it running on your GPU. So, it's stored in your VRAM
35:45 and computed by the GPU. And then some of the large language model is stored in
35:50 your RAM, computed by the CPU. Now, this does hurt performance a lot. And so,
35:55 generally, you want to avoid offloading if you can. You want to be able to fit
35:59 everything in your GPU, which by the way, the context, like your prompts for
36:04 your local LLMs, that is also stored in VRAM. And so, sometimes you'll see what
36:08 happens when you have very long conversations for a large language model
36:13 that barely fits in your GPU. That'll actually tip it over the edge. So, it
36:16 starts to offload some of it to the CPU and RAM. So keep that in mind when you
36:19 have longer conversations and all of a sudden things get really slow, you know
36:24 that offloading is happening. Sometimes this is necessary though as context
36:28 grows. And if you're only offloading a little bit of the LLM or a little bit of
36:32 the conversation, whatever to the CPU and RAM, it won't affect performance
36:36 that much. And so sometimes if you're trying to squeeze the biggest size you
36:41 can into your machine for an LLM, you can take advantage of offloading to run
36:45 something bigger or have a much larger conversation. Just know that usually it
36:49 kind of sucks. Like when I have offloading start to happen, my machine
36:53 gets bogged down and the responses are a lot slower. It's really not fun, but it
36:59 is possible. And fun fact, by the way, if your GPU is full and your CPU and RAM
37:04 is full, you can actually offload to storage, like literally using your hard
37:07 drive or SSD. That's when it's like incredibly slow and just terrible. But
37:11 just fun fact, you can actually do that. Now, the very last thing that I want to
37:15 cover before we dive into some code, setting up the local AI package, and
37:21 building out some agents is a few very crucial parameters, environment
37:25 variables for Ollama. So, these are environment variables that you can set
37:28 on your machine just like any other based on your operating system. And
37:32 Ollama does have an FAQ for setting up some of these things, which I'll link to
37:36 in the description as well. But yeah, these are a bit more technical, so
37:40 people skip past setting this stuff up a lot, but it's actually really, really
37:44 important to make things very efficient when running local LLMs. So the first
37:49 environment variable is flash attention. You want to set this to one or true.
37:54 When you have this set to true, it's going to make the attention calculation
37:59 a lot more efficient. It sounds fancy, but basically large language models when
38:04 they are generating a response, they have to calculate which parts of your
38:08 prompt to pay the most attention to. That's the calculation. And you can make
38:12 it a lot more efficient without losing much performance at all by setting up
38:16 the flash attention, setting that to true. And then for another optimization,
38:21 just like we can quantize the LLM itself, you can also quantize or
38:27 compress the context. So your system prompt, the tool descriptions, your
38:31 prompt and conversation history, all that context that's being sent to your
38:36 LLM, you can quantize that as well. So Q4 is my general recommendation for
38:41 quantizing LLMs. Q8 is the general recommendation for quantizing the
38:46 context memory. It's a very simplified explanation, but it's really, really
38:50 useful because a long conversation can also take a lot of VRAM just like a larger
38:55 LLM. And so it's good to compress that. And then the third environment variable,
38:58 this is actually probably the most crucial one to set up for Ollama. There
39:02 is this crazy thing. I don't know why Ollama does it, but by default, they
39:07 limit every single large language model to 2,000 tokens for the context limit,
39:13 which is just tiny compared to, you know, Gemini being 1 million tokens and
39:17 Claude being 200,000 tokens. Like, they handle very, very large prompts. And a
39:21 lot of local large language models can also handle large prompts. But Ollama
39:25 will limit you by default to 2,000 tokens. And so you have to override that
39:30 yourself with this environment variable. And so generally I recommend starting
39:34 with about 8,000 tokens. You can move this all the way up to
39:38 something like 32,000 tokens if your local large language model supports
39:42 that. And if you view the model page on Ollama, you can see the context length
39:46 that's supported by the LLM. But you definitely want to, you know, jack this
39:50 up more from just 2,000 because a lot of times when you have longer
39:53 conversations, you're going to get past 2,000 tokens very, very quickly. So, do
39:57 not miss this. If your large language model is starting to go completely off
40:02 the rails and ignore your system prompt and forget that it has these tools that
40:06 you gave it, it's probably because you reached the context length. And so, just
40:10 keep that in mind. I see people miss this a lot. And then the very last
40:14 environment variable, probably the least important out of all these four,
40:18 but if you're running a lot of different large language models at once and you're
40:22 trying to shove them all in your GPU, a lot of times you can have issues. And so
40:25 in Ollama, you can limit the number of models that are allowed to be in your
40:29 memory at a single time. With this one, typically you want to set this to either
40:33 one or two. Definitely set this to just one if you are using large language
40:37 models that are basically a perfect fit for your GPU, like it's going to fit exactly into
40:41 your VRAM and you're not going to have room for another large language model.
40:44 But if you are running more smaller ones and maybe you could actually fit two on
40:48 your GPU with the VRAM that you have, you can set this to two. So again, more
40:52 technical overall, but it's very important to have these right. And we'll
40:55 get into the local AI package where I already have these set up in the
40:59 configuration. And then by the way, this is the Ollama FAQ that I referenced a
41:02 minute ago that I'll have linked in the description. And so there's actually a
41:06 lot of good things to read into here, like being able to verify that your GPU
41:10 is compatible with Ollama. How can you tell if the model's actually loaded on
41:13 your GPU? So, a lot of like sanity check things that they walk you through in the
41:17 FAQ as well. Also talking about environment variables, which I just
41:20 covered. And so, they've got some instructions here depending on your OS
41:23 how to get those set up. So, if there's anything that's confusing to you, this
41:26 is a very good resource to start with. So, I'm trying to make it possible for
41:30 you to look into things further if there's anything that doesn't quite make
41:33 sense for what I explained here. And of course, always let me know in the
41:35 comments if you have any questions on this stuff as well, especially the more
41:39 technical stuff that I just got to cover because it's so important even though I
41:43 know we really want to dive into the meat of things, which we are actually
41:47 going to do now. All right, here is everything that we have covered at this
41:50 point. And congratulations if you have made it this far because I covered all
41:55 the tricky stuff with quantization and the hardware requirements and offloading
41:59 and some of our little configuration and parameters. So, if you got all of that,
42:03 the rest of it is going to be a walk in the park as we start to dive into code,
42:07 getting all of our local AI set up and building out some agents. You understand
42:10 the foundation now that we're going to build on top of to make some cool stuff.
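As a concrete reference for the four Ollama settings covered a moment ago, here is how they could be set on Linux or macOS. This is a sketch based on the variable names in the Ollama FAQ; names and defaults can vary by Ollama version and OS, so check the FAQ linked in the description for your setup:

```shell
# Make the attention calculation more efficient
export OLLAMA_FLASH_ATTENTION=1

# Quantize the KV cache (context memory) to 8-bit, per the Q8 recommendation
export OLLAMA_KV_CACHE_TYPE=q8_0

# Raise the default context window from ~2,000 tokens to 8,192
export OLLAMA_CONTEXT_LENGTH=8192

# Keep only one model loaded in VRAM at a time
export OLLAMA_MAX_LOADED_MODELS=1
```

On Windows these would go in as system environment variables instead, and Ollama has to be restarted to pick them up.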
42:15 And so, now the next thing that we're going to do is talk about how we can use
42:19 local AI anywhere. We're going to dive into OpenAI compatibility and I'll show
42:23 you an example. We can take something that is using OpenAI right now, and
42:27 transform it into something that is using Ollama and a local LLM. So, we'll
42:31 actually dive into some code here. And I've got my fair share of no code stuff
42:35 in this master class as well, but I want to focus on both because I think it's
42:38 really important to use both code and no code whenever applicable. And that
42:42 applies to local AI just like building agents in general. So, I've already
42:45 promised a couple of times that I would dive into OpenAI API compatibility, what
42:50 it is, and why it's so important. And we're going to dive into this now so you
42:54 can really start to see how you can take existing agents and transform them into
42:59 being 100% local with local large language models without really having to
43:03 touch the code or your workflow at all. It is a beautiful thing because OpenAI
43:10 has created a standard for exposing large language models through an API.
43:14 It's called the chat completions API. It's kind of like how model context
43:19 protocol MCP is a standard for connecting agents to tools. The chat
43:23 completions API is a standard for exposing large language models over an
43:28 API. So you have this common endpoint along with a few other ones that all of
43:35 these providers implement. This is the way to access the large language model
43:39 to get a response based on some conversation history that you pass in.
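To make that "common endpoint" concrete, here's a minimal stdlib-only sketch of what a chat completions request boils down to: the same URL path and JSON body no matter the provider, with only the base URL changing. The model tag and localhost port are just the Ollama defaults used later in this video; nothing here actually goes over the network.

```python
import json

def build_chat_request(base_url: str, model: str, messages: list) -> tuple:
    """Build the standard chat-completions URL and JSON request body."""
    url = base_url.rstrip("/") + "/chat/completions"
    body = json.dumps({"model": model, "messages": messages})
    return url, body

# The only provider-specific piece is the base URL:
url, body = build_chat_request(
    "http://localhost:11434/v1",  # Ollama locally; swap for any compatible provider
    "qwen3:14b",
    [{"role": "user", "content": "Hello!"}],
)
print(url)  # http://localhost:11434/v1/chat/completions
```

Point the same builder at `https://api.openai.com/v1` and the URL path and body shape stay identical — that is the whole compatibility story.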
43:43 So, Ollama is implementing this as of February. We have other providers like
43:49 Gemini that are OpenAI compatible, and Groq, and OpenRouter, which we saw earlier.
43:53 Almost every single provider is OpenAI API compatible. And so, not only is it
43:57 very easy to swap between large language models within a specific provider, it's
44:02 also very easy to swap between providers entirely. You can go from Gemini to
44:09 OpenAI or OpenAI to Ollama or OpenAI to Groq just by changing basically one
44:13 piece of configuration pointing to a different base URL as it is called. So
44:18 you can access that provider and then the actual API endpoint that you hit
44:22 once you are connected to that specific provider is always the exact same and
44:26 the response that you get back is also always the exact same. And so Ollama has
44:31 this implemented now. And I'll link to this article in the description as well
44:33 if you want to read through this because they have a really neat Python example.
44:37 It shows where we create an OpenAI client and the only thing we have to do
44:42 to connect to Ollama instead of OpenAI is change this base URL. So now we are
44:47 pointing to Ollama that is hosted locally instead of pointing to the URL for
44:51 OpenAI. So we'd reach out to them over the internet and talk to their LLMs. And
44:55 then with Ollama, you don't actually need an API key because everything's running
44:58 locally. So you just need some placeholder value here. But there is no
45:02 authentication that is going on. You can set that up. I'm not going to dive into
45:05 that right now. But by default, because it's all just running locally, you don't
45:09 even need an API key to connect to Ollama. And then once we have our OpenAI
45:13 client set up that is actually talking to Ollama, not OpenAI, we can use it in
45:18 exactly the same way. But now we can specify a model that we have downloaded
45:22 locally already through Ollama. We pass in our conversation history in the same way
45:27 and we access the response, like the content the AI produced and the token usage,
45:31 all those things that we get back from the response in the same way.
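Because the response shape is standardized too, accessing the content and token usage is provider-agnostic. Here's a sketch using a hand-written sample response dict — the field names follow the chat completions standard, but the values are made up for illustration:

```python
# Illustrative chat-completions response body; the same shape comes back
# from OpenAI, Ollama, or any other compatible provider.
sample_response = {
    "model": "qwen3:14b",
    "choices": [
        {"message": {"role": "assistant", "content": "Hello there!"},
         "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 9, "completion_tokens": 4, "total_tokens": 13},
}

# The content the AI produced and the token usage, accessed the same way
# regardless of which provider served the request.
content = sample_response["choices"][0]["message"]["content"]
total_tokens = sample_response["usage"]["total_tokens"]
print(content, total_tokens)  # Hello there! 13
```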
45:34 They've got a JavaScript example as well. They have a couple of examples
45:38 using different frameworks like the Vercel AI SDK and AutoGen. Really any
45:44 AI agent framework can work with OpenAI API compatibility to make it very easy
45:47 to swap between these different providers. Like Pydantic AI, my favorite
45:52 AI agent framework, also supports OpenAI API compatibility. So you can easily
45:57 within your Pydantic AI agents swap between these different providers. And
46:02 so what I have for you now is two code bases that I want to cover. The first
46:07 one is the local AI package, which we'll dive into in a little bit. But right
46:13 now, we have all of the agents that we are going to be creating in this master
46:17 class. So I have a couple for N8N that are also available in this repository.
46:21 And then a couple of scripts that I want to share with you as well. And so the
46:25 very first thing that I want to show you is this simple script that I have called
46:31 OpenAI compatible demo. And so you can download this repository. I'll have this
46:34 linked in the description as well. There's instructions for downloading and
46:38 setting up everything in here. And this is all 100% local AI. And so with that,
46:43 I'm going to go over into my windsurf here where I have this OpenAI compatible
46:47 demo set up. So I've got a comment at the top reminding us what the OpenAI API
46:52 compatibility looks like. We set our base URL to point to Ollama hosted
46:59 locally and it's hosted on port 11434 by default. So I can actually show you
47:02 this. I have Ollama running in a Docker container, which we're going to dive
47:05 into this when we set up the local AI package, but you can see that it is
47:11 being exposed on port 11434. And by the way, you can see the
47:14 127.0.0.1 in that URL that I have highlighted here, that is synonymous with localhost.
47:21 And so this right here, you could also replace with 127.0.0.1.
47:25 Just a little tidbit there. It's not super important. I just typically leave
47:28 it as localhost. And then you can change the port as well. I'm just sticking to
47:32 what the default is. And then again, we don't need to set our API key. We can
47:36 just set it to any value that we want here. We just need some placeholder even
47:39 though there is no authentication with Ollama for real unless you configure
47:43 that. So that's OpenAI compatibility. And the important thing with this script
47:47 here is I have two different configurations here. I have one for
47:51 talking to OpenAI and then one for Ollama. So with OpenAI, we set our base
47:57 URL to point to api.openai.com. We have our OpenAI API key set in our
48:01 environment variables. So you can just set all your environment variables here
48:05 and then rename this to .env. I've got instructions for that in the readme of
48:08 course. And then going back to the script, we are using GPT-4.1 nano for our
48:13 large language model. It's something super fast and cheap. And then for our
48:17 Ollama configuration, we are setting the base URL here, localhost:11434,
48:22 or just whatever we have set in our environment variables. Same thing for
48:26 the API key. And then same thing for our large language model. And what I'm going
48:31 to be using in this case is Qwen 3 14B. That is one of the large language models
48:34 that I showed you within the Ollama website. Definitely a smaller one
48:38 compared to what I could run, but I just want to run something fast. And very
48:41 small large language models are great for simple tasks like summarization or
48:45 just basic chat. And that's what I'm going to be using here just for a simple
48:49 demo. And so whether it's enabled or not, this configuration is just based on
48:53 what we have set for our environment variables. And the important thing here
48:59 is the code that runs for each of these configurations just as we go through
49:03 this demo is exactly the same. We are parameterizing the configuration for the
49:08 base URL and API key. So we are setting up the exact same OpenAI client just
49:13 like we saw in the Ollama article, but just changing the base URL and API key.
49:17 And so then, for example, when we use it right here, it's
49:22 client.chat.completions.create, calling the exact same function no matter if we're
49:26 using OpenAI or Ollama. And then we're handling the response in the same way as
49:31 well. And so I'll go back to my terminal now. And so I went through all the steps
49:34 already to set up my virtual environment, install all of my dependencies. And so now I can run the
49:40 command OpenAI compatible demo. And now it's going to present the two
49:43 configuration options for me. And so I can run through OpenAI. So we'll go
49:46 ahead and do that first. And these two demos are going to look exactly the
49:50 same, but that is the point. And so we have our base URL here for OpenAI. We
49:54 have a basic example of a completion with GPT-4.1 nano. There we go. So this
49:59 is the model that was used. Here are the number of tokens. And this is our
50:03 response. And then I can press enter to see a streaming response now as well. So
50:07 we saw it type out our answer in real time. And then I can press enter one
50:10 more time. This is the last part of the demo. Just a multi-turn conversation.
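The multi-turn part of the demo is really just a growing `messages` list that gets resent on every request. A minimal stdlib sketch — the reply strings here are stand-ins for real model output:

```python
def add_turn(messages: list, user_text: str, assistant_text: str) -> list:
    """Append one user/assistant exchange to the conversation history."""
    messages.append({"role": "user", "content": user_text})
    messages.append({"role": "assistant", "content": assistant_text})
    return messages

history = [{"role": "system", "content": "You are a helpful assistant."}]
add_turn(history, "Hi!", "Hello! How can I help?")   # turn 1
add_turn(history, "What's 2+2?", "2 + 2 = 4.")       # turn 2

# The full history is what gets sent as `messages` on the next request,
# which is how the LLM "remembers" earlier turns.
print([m["role"] for m in history])
# ['system', 'user', 'assistant', 'user', 'assistant']
```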
50:14 So we got a couple of messages here in our conversation history. So very nice
50:19 and simple. The point here is to now show you that I can run this and select
50:23 Ollama now instead and everything is going to look exactly the same and all
50:27 of the code is the same as well. It is only our configuration that is
50:31 different. And so it will take a little bit when you first run this because
50:36 Ollama has to load the large language model into your GPU. And so going to the
50:42 logs for Ollama, I can show you what this looks like here. And so when we first
50:47 make a request when Qwen 3 14B is not loaded into our GPU yet, you're going to
50:52 see a lot of logs come in here. And you'll have this container up and
50:54 running when you have the local AI package which we'll cover in a little
50:57 bit. So it shows all the metadata about our model, like it's Qwen 3 14B. We can
51:05 see here that we have a Q4_K_M quantization like we saw on the Ollama
51:09 website. What other information do we have here? There's just so much to
51:14 digest here. Yeah, another really important thing is we have the
51:19 context length. I have that set to 8,192 just like I recommended in the
51:22 environment variables. And then we can see that we offloaded all of the layers
51:26 to the GPU. So I don't have to do any offloading to the CPU or the RAM. I can
51:30 keep everything in the GPU, which is certainly ideal, like I said, to make
51:34 sure this is actually fast. And then when we get a response from Qwen 3 14B,
51:41 we are calling the v1/chat/completions endpoint because it is OpenAI API
51:46 compatible. So that exact endpoint that we hit for OpenAI is the one that we are
51:50 hitting here with a large language model that is running entirely on our computer
51:54 in Ollama. And so the response I get back, it's actually a reasoning LLM as
51:58 well. So we even have the thinking tokens here, which is super cool. And so
52:02 we got our response. It's just printing out the first part of it here just to
52:04 keep it short. And then I can press enter. And we can see a streaming demo
52:08 as well. And it's going to be a lot faster this time because we do already
52:11 have the model loaded into our GPU. And so that first request when it first has
52:15 to load a model is always the slower one. And then it's faster going forward
52:19 once that model is already loaded in our GPU. And then as long as we don't swap
52:24 to another large language model and use that one, then it will remain in our GPU
52:28 for some time. And so then all of our responses after are faster. And then we
52:33 just have the last part of our demo here with a multi-turn conversation. So we
52:37 can see conversation history in action as well, just not with streaming here.
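The streaming we saw a moment ago arrives as a sequence of chunks whose `delta.content` pieces you concatenate in order. Sketched here with hard-coded chunks standing in for the real server-sent events — the shape follows the chat completions streaming format, the text is invented:

```python
# Illustrative stream chunks; in reality these arrive one at a time
# over HTTP as the model generates tokens.
chunks = [
    {"choices": [{"delta": {"content": "Local "}}]},
    {"choices": [{"delta": {"content": "AI "}}]},
    {"choices": [{"delta": {"content": "rocks!"}}]},
    {"choices": [{"delta": {}}]},  # final chunk carries no content
]

# Assemble the full response by joining each chunk's content delta.
text = "".join(c["choices"][0]["delta"].get("content", "") for c in chunks)
print(text)  # Local AI rocks!
```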
52:40 And everything's a bit slower with this large language model because
52:43 it is a reasoning one. And so certainly, if you want faster
52:47 inference, you can always use a non-reasoning local LLM like Mistral or
52:52 Gemma for example. So that is our very simple demo showing how this works. I
52:55 hope that you can see with this and again this works with other AI agent
52:59 frameworks like Agno or Pydantic AI or CrewAI as well. They all work in
53:03 this way where you can use OpenAI API compatibility to swap between providers
53:08 so easily, so you don't have to recreate things to use local AI, and that's
53:11 something so important that I want to communicate with you because if I'm the
53:15 one introducing you to local AI I also want to show you how it can very easily
53:19 fit into your existing systems and automations. All right. Now, we have
53:20 Introducing the Local AI Package
53:23 gotten to the part of the local AI master class that I'm actually the most
53:27 excited for because over the past months, I have very much been pouring my
53:31 heart and soul into building up something to make it infinitely easier
53:35 for you to get everything up and running for local AI. And that is the local AI
53:40 package. And so, right now, we're going to walk through installing it step by
53:44 step. I don't want you to miss anything here because it's so important to get
53:47 this up and running, get it all working well. Because if you have the local AI
53:51 package running on your machine and everything is working, you don't need
53:55 anything else to start building AI agents running 100% offline and
53:59 completely private. And so here's the thing. At this point, we've been
54:04 focusing mostly on Ollama and running our local large language models. But there's
54:08 the whole other component to local AI that I introduced at the start of the
54:13 master class for our infrastructure: things like our database and local and
54:18 private web search, our user interface, agent monitoring. We have all these
54:23 other open-source platforms that we also want to run along with our large
54:27 language models and the local AI package is the solution to bring all of that
54:32 together curated for you to install in just a few steps. So, here is the GitHub
54:37 repository for the local AI package. I'll have this linked in the description
54:41 below. Just to be very clear, there are two GitHub repos for this master class.
54:45 We have this one that we covered earlier. This has our N8N and Python
54:49 agents that we'll cover in a bit, as well as the OpenAI compatible demo that
54:53 we saw earlier. So, you want to have this cloned and the local AI package as
54:57 well. Very easy to get both up and running. And if you scroll down in the
55:02 local AI package, I have very comprehensive instructions for setting
55:06 up everything, including how to deploy it to a private server in the cloud,
55:10 which we'll get into at the end of this master class, and a troubleshooting
55:13 section at the bottom. So, everything that I'm about to walk you through here,
55:17 there's instructions in the readme as well if you just want to circle back to
55:21 clarify anything. Also, I dive into all of the platforms that are included in
55:26 the local AI package. And this is very important because like I said, when you
55:30 want to build a 100% offline and private AI agent, it's a lot more than just the
55:35 large language model. You have all of the accompanying infrastructure like
55:39 your database and your UI. And so I have all that included. First of all, I have
55:44 N8N that is our low/no-code workflow automation platform. We'll be building
55:48 an agent with N8N in the local AI package in a little bit once we have it
55:52 set up. We have Supabase for our open-source database. We have Ollama. Of
55:56 course, we want to have this in the package as well for our LLMs. Open Web
56:01 UI, which gives us a ChatGPT-like interface for us to talk to our LLMs and
56:06 have things like conversation history. Very, very nice. So, we're looking at
56:09 this right here. This is included in the package. Then we have Flowise. It's
56:13 similar to N8N, another really good tool to build AI agents with no/low
56:18 code. Qdrant, which is an open-source vector database. Neo4j, which is a
56:25 knowledge graph engine. And then SearXNG for open-source, completely free and
56:31 private web search. Caddy, which is going to be very important for us once
56:35 we deploy the local AI package to the cloud and we actually want to have
56:38 domains for our different services like N8N and Open WebUI. And then the last
56:42 thing is Langfuse. This is an open-source LLM engineering platform that helps
56:47 us with agent observability. Now, some of these services are outside of the scope
56:52 for this local AI master class. I don't want to spend a half hour on every
56:56 single one of these services and make this a 10-hour video. I will be focusing
57:02 in this video on N8N, Supabase, Ollama, Open WebUI, SearXNG, and then Caddy once
57:08 we deploy everything to the cloud. So, I do cover like half of these services.
57:12 And the other thing that I want to touch on here is that there are quite a few
57:16 things included here. And so you do need about 8 GB of RAM on your machine or
57:21 your cloud server to run everything. It is pretty big overall. And so you can
57:26 remove certain things like if you don't want Quadrant and Langfuse for example,
57:30 you can take those out of the package. More on that later. It doesn't have to
57:34 be super bloated, you can whittle this down to what you need. But yeah, there's
57:37 a lot of different things that go into building AI agents. And so I have all of
57:40 these services here so that no matter what you need, I've got you covered. And
57:42 Instructions for Installing the Local AI Package
6:08 Run Your 1st Local LLM in 5 Minutes w/ Ollama
6:12 means running large language models on our own computer, it's not as easy as
6:18 just going to claude.ai or chatgpt.com and typing in a prompt. We have to
6:22 actually install something, but it still is very easy to get started. So, let me
6:25 show you right now with a hands-on example. So, here we are within the
6:30 website for Ollama. This is just ollama.com. I'll have a link to this in the
6:33 description of the video. This is one of the open-source platforms that allows
6:39 us to very easily download and run local large language models. And so, you just
6:42 have to go to their homepage here and click on this nice big download button.
6:45 You can install it for Windows, Mac or Linux. It really works for any operating
6:49 system. Then once you have it up and running on your machine, you can open up
6:53 any terminal. Like I'm on Windows here, so I'm in a PowerShell session and I can
6:58 run Ollama commands now to do things like view the models that I have available on
7:03 my machine. I can download models and I can run them as well. And the way that I
7:08 know how to pull and run specific models is I can just go to this models tab in
7:12 their navigation and I can browse and filter through all of the open source
7:16 LLMs that are available to me like DeepSeek R1. Almost everyone is familiar
7:20 with DeepSeek. It just totally blew up back in February and March. We have
7:25 Gemma 3, Qwen 3, Llama 4, a few of them that I mentioned earlier when we had the
7:30 presentation up. And so we can click into any one of these like I can go into
7:35 DeepSeek R1 for example and then I have the command right here that I can copy
7:39 to download and run this specific model in my terminal. And there are a lot of
7:44 different model variants of DeepSeek R1. So we'll get into different sizes and
7:47 hardware requirements and what that all means in a little bit, but I'll just
7:50 take one of them and run it as an example. So I'll just do a really small
7:54 one right now. I'll do a 1.5 billion parameter large language model. And
7:58 again, I'll explain what that means in a little bit. I can copy this command.
8:01 It's just ollama run and then the unique ID of this large language model. So I'll
8:06 go back into my terminal. I'll clear it here and then paste in this command. And
8:09 so first it's going to have to pull this large language model. And the total size
8:15 for this is 1.1 GB. And so it'll have to download it. And then because I used the
8:20 run command, it will immediately get me into a chat interface with the model
8:24 once it's downloaded. Also, if you don't want to run it right away, you
8:27 just want to install it, you can do ollama pull instead of ollama run. And
8:32 then again, to view the models that you have available to you installed already,
8:36 you can just do the ollama list command like I did earlier. And so, right now,
8:39 I'll pause and come back once it's installed in about 30 seconds. All
8:43 right, it is now installed. And now I can just send in a message like hello.
8:47 And then boom, we are now talking to a large language model. But instead of it
8:50 being hosted somewhere else and we're just using a website, this is running on
8:54 my own infrastructure, the large language model and all the billions of
8:59 parameters are getting loaded onto my graphics card and running the inference.
9:02 That's what it's called when we're generating a response from the LLM
9:06 directly within this terminal here. And so I can ask another question like um
9:13 what is the best GPU right now? We'll see what it says. So it's thinking
9:16 first. This is actually a thinking model. Deepseek R1 is a reasoning LLM.
9:20 And then it gives us an answer: its top GPU models today, the 3080 and RX 6700.
9:27 Obviously, we have a training cutoff for local large language models just like we
9:30 do with ones in the cloud like GPT. And so the information is a little outdated
9:34 here, but yeah, this is a good answer. So we have a large language model that
9:38 we're talking to directly on our machine. And then to close out of this,
9:42 I can just do control D or command D on Mac. And if I do list, we have all the
9:47 other models that you saw earlier, plus now this one that I just installed. So
9:50 these are all available for me to run again just with that ollama run command.
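As a rough sanity check on that 1.1 GB download from earlier: a quantized model's on-disk size is roughly its parameter count times bits per weight. The ~6 effective bits per weight below is a ballpark figure for Q4_K_M on small models (it keeps some tensors at higher precision, so it lands above a flat 4 bits), not an exact spec:

```python
def approx_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough on-disk size of a quantized model in gigabytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# ~6 effective bits/weight is a ballpark for Q4_K_M on a small model
size = approx_size_gb(1.5, 6)
print(size)  # 1.125 -> close to the 1.1 GB download seen above
```

The same arithmetic explains why an unquantized FP16 copy of the same model would be about 3 GB.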
9:54 And it won't have to reinstall if you already have it installed. Run just
9:58 installs it if you don't have it yet already. So that is just a quick demo of
10:03 using Ollama. We'll dive a lot more into Ollama later, like how we can actually
10:07 use it within our Python code and within our N8N workflows. This is just our
10:11 quick way to try it out within the terminal. Now, to really get into why we
10:15 should care about local AI now that we know what it is, I want to cover the
10:19 pros and cons of local AI and what I like to call cloud AI. That's just when
10:23 you're paying for things to be hosted for you, like using Claude or Gemini or
10:28 using the cloud version of N8N instead of hosting it yourself. And I also want
10:32 to cover the advantages of each because I don't want to sugarcoat things and
10:35 just hype up this master class by telling you that you should always use
10:39 local AI. That is certainly not the case. There is a time and place for both
10:43 of these categories here, but there are so many use cases where local AI is
10:50 absolutely crucial. You have no idea how many businesses I have talked to that
10:54 are willing to put tens of thousands of dollars into running their own LLMs and
10:58 infrastructure because privacy and security is so crucial for the things
11:02 that they're building with AI. And that actually gets into the first advantage
11:06 here of local AI, which is privacy and security. You can run things 100%
11:11 offline. The data that you're giving to your LLMs as prompts, it now doesn't
11:16 leave your hardware. It stays entirely within your own control. And for a lot
11:20 of businesses, that is 100% crucial, especially when they're in highly
11:24 regulated industries like the health industry, finance, even real estate.
11:28 Like there's so many use cases where you're working with intellectual
11:32 property or just really sensitive information. You don't want to be
11:36 sending your data off to an LLM provider like Google or OpenAI or Anthropic. And
11:41 so as a business owner, you should definitely be paying attention to this
11:45 if you are working with automation use cases where you're dealing with any kind
11:48 of sensitive data. And then also if you're a freelancer, you're starting an
11:52 AI automation agency, anything where you're building for other businesses,
11:55 you are going to have so many opportunities open up to you when you're
11:59 able to work with local AI because you can handle those use cases now where
12:03 they need to work with sensitive data and you can't just go and use the OpenAI
12:08 API. And that is the main advantage of local AI. It is a very big deal. But
12:11 there are a few other things that are worth focusing on as well. Starting with
12:16 model fine-tuning, you can take any open-source large language model and
12:20 add additional training on top with your own data. Basically making it a domain
12:24 expert on your business or the problem that you are solving. It's so so
12:29 powerful. You can make models through fine-tuning more powerful than the best
12:33 of the best in the cloud depending on what you are able to fine-tune with
12:38 depending on the data that you have. And you can do fine-tuning with some cloud
12:42 models like with GPT, but your options are pretty limited and it can be quite
12:46 expensive. And so it definitely is a huge advantage to local AI. And local AI
12:52 in general can be very cost-effective, including the infrastructure as well. So
12:56 your LLMs and your infrastructure. You run it all yourself and you pay for
13:00 nothing besides the electricity bill if it's running on your computer at your
13:04 house or if you have some private server in the cloud. You just have to pay for
13:07 that server and that's it. There's no n8n bill, no Supabase bill, no OpenAI
13:13 bill. You can save a lot of money. It's really, really nice. And on top of that,
13:17 when everything is running on your own infrastructure, the agents that you
13:22 create can run on the same server, the same place as your infrastructure. And
13:26 so it can actually be faster because you don't have network delays calling APIs
13:30 for all your different services for your LLMs and your database and things like
13:34 that. And then with that, we can now get into the advantages of cloud AI.
13:38 Starting with it's a lot easier to set up. There's a reason why I have to have
13:42 this master class for you in the first place. There are some initial hurdles
13:47 that we have to jump over to really have everything fully set up for our local
13:51 LLMs and infrastructure. And you just don't have that with cloud AI because
13:54 you can very simply call into these APIs. You just have to sign up and get
13:58 an API key and that's about it. So, it certainly is easier to get up and
14:02 running and there's less maintenance overall because they are hosting things
14:05 for you. Supabase is hosting the database for you. OpenAI is hosting the
14:09 LLM for you. So, you don't have to manage things on your own hardware. With
14:13 Local AI, you have to apply patches and updates if you have a private server in
14:17 the cloud. You have to manage your own hardware if you're running on your own
14:20 computer, making sure that it's on 24/7, if you want your database on 24/7, that
14:24 kind of thing. It's just less maintenance with cloud AI. And then
14:28 probably the biggest advantage of cloud AI overall is that you have better
14:34 models available to you. Claude 4 Sonnet or Opus, for example, is more powerful
14:40 than any local AI that you could run. So we have this gap here, and this gap was a
14:45 lot bigger at one point. Even a year ago, the best cloud LLMs absolutely crushed
14:50 the best local LLMs, and that gap is starting to diminish. And so I really
14:55 see a future where that gap is diminished entirely and all the best
14:59 local LLMs are actually on par with the best cloud ones. That's the future I
15:03 see. That's why I think that local AI is such
15:07 a big deal because the advantages of local AI, those are just going to get
15:12 more prevalent over time when businesses realize they really want private and
15:15 secure solutions. And then the advantages of cloud AI, I think those
15:19 are actually going to diminish over time. That's the key: minimal setup,
15:24 less maintenance. Well, those advantages are going to go away as we have
15:28 platforms and better instructions and solutions to make the setup and
15:32 maintenance easier for local AI and we have the gap that's continuing to
15:35 diminish between the power of these LLMs. All these advantages are going to
15:39 actually go away and then it'll just completely make sense to use local AI
15:43 honestly probably for like every single solution in the future. That's really
15:47 what I see us heading towards. And then the last advantage to cloud AI which
15:51 also I think will go away over time is that you have some features out of the
15:55 box, like you have memory that's built directly into ChatGPT. Gemini has web
15:59 search baked in even when you use it through the API like these kind of
16:02 capabilities that are out of the box that you have to implement yourself with
16:07 local AI maybe as tools for your agent and you can definitely do that but it is
16:10 nice that these things are out of the box for cloud AI. So those are the pros
16:15 and cons between the two. I hope that this makes it very clear for you to pick
16:19 right now for your own use case. Should I implement local AI or cloud AI? A lot
16:24 of it comes down to the security and privacy requirements for your use case.
16:28 Now, the next big thing that we need to talk about for local AI is hardware
16:33 requirements. Because here's the thing, large language models are very
16:39 resource-intensive. You can't just run any LLM on any computer. And the reason for that is
16:43 large language models are made up of billions or even trillions of numbers
16:47 called parameters. And they're all connected together in a web that looks
16:51 kind of like this. This is a very simplified view with just a few
16:54 parameters here. But each of the parameters are nodes and they're
16:58 connected together. The input layer is where our prompt comes in and our prompt
17:02 is fed through all these hidden layers and then we have the output at the end.
17:05 This is the response we get back from the LLM. But like I said, this is a very
17:10 simplified view. GPT-4, for example, like you can see on the right-hand side, is
17:15 estimated to have 1.4 trillion parameters. And so, if you want to fit
17:20 an entire large language model into your graphics card, you have to store all of
17:25 these numbers. And even though we can handle gigabytes at a time in our
17:28 graphics cards through what is called VRAM, storing billions or trillions of
17:34 numbers is absolutely insane. And so that's why large language models, you
17:37 actually have to have a pretty good graphics card if you want to run some of
17:42 the best ones. And so looking at Ollama here, when we see these different sizes,
17:47 going back to their model list, like 1.5 billion parameters or 27 billion
17:51 parameters, there are different sizes for the local LLMs. Obviously, the
17:56 larger a local LLM that you are running, the more performance you are going to
17:59 get, but you are going to be limited to what you are capable of running with
18:04 your graphics card or your hardware. So, with that in mind, I now want to dive
18:07 into the nitty-gritty details with you so you know exactly the kind of models
18:11 that you can run, the kind of speeds you can expect depending on your hardware.
18:15 And if you want to invest in new hardware to run local AI, I've got some
18:19 recommendations as well. So there are generally four primary size ranges for
18:25 large language models based on the speed and the power that you are looking for.
18:29 You have models that are around 7 or 8 billion parameters. Those are
18:33 generally the smallest that I'd recommend trying to run. There are a lot
18:37 of smaller LLMs available, like 1 billion or 3 billion parameters,
18:41 but I'm so unimpressed when I use those LLMs that I don't really want to focus
18:45 on them here. 7 billion parameters is still tiny compared to the large cloud
18:51 AI models like Claude or GPT, but you can get pretty good results with them
18:55 for just simple chat use cases. And so for these models, assuming a Q4
18:59 quantization, which I'll get into quantization in a little bit, it's
19:02 basically just a way to make the LLM a lot smaller without hurting performance
19:06 that much, a 7 billion parameter model will need about 4 to 5 GB of VRAM on
19:11 your graphics card. And so if you have something like a 3060 Ti from Nvidia
19:16 with 8 GB of VRAM, you can very comfortably run a 7 billion parameter
19:20 model and you can expect to get very roughly around 25 to 35 tokens per
19:27 second. A token is roughly equivalent to a word. And so your local large language
19:31 model at 7 billion parameters with this graphics card will get about 25 to 35
19:37 words per second out on the screen to you being streamed out. And then if you
19:42 use much more powerful hardware like a 3090 to run a 7 billion parameter model
19:47 then you'll just jack up the speed a lot more. So that's 7 billion or 8 billion
19:51 parameters. Another very common size is something around 14 billion parameters.
19:57 This will take about 8 to 10 GB of VRAM. And so just a couple of options for
20:01 this. You have the 4070 Ti Super, which has 16 GB of VRAM, or you could go as
20:08 low as 12 GB of VRAM with the 3080 Ti. And you could expect to get about 15 to
20:13 25 words per second. And then this is where you start to get into basic tool
20:17 calling. So I find that when you are building with a 7 billion parameter
20:21 model, they don't do tool calling very well. So you can't really build that
20:25 powerful of agents around a 7 billion parameter model. But once you get to
20:29 something around 14 billion parameters, that's when I see agents being able to
20:33 really accept instructions well around tools and system prompts and leveraging
20:37 tools to do things on our behalf. That's when we can really start to use LLMs to
20:42 make things that are agentic. And then the next big category of LLMs
20:47 is somewhere between 30 and 34 billion parameters. You see a lot of LLMs that
20:51 fall in that size range. This will typically need 16 to 20 gigabytes of VRAM.
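The VRAM figures in these size tiers follow from simple arithmetic: parameters times bytes per parameter, plus some runtime overhead. A small sketch of that estimate; the 20% overhead factor is my own ballpark assumption, not an official formula:

```python
# Rough VRAM estimate for a quantized model: parameters * bytes-per-parameter,
# plus ~20% overhead for the context (KV cache) and runtime buffers.
# The 20% overhead factor is an assumption -- real usage varies with the
# runtime, the context length, and the model architecture.
def estimate_vram_gb(params_billions, bits_per_param=4, overhead=1.2):
    bytes_per_param = bits_per_param / 8
    return params_billions * bytes_per_param * overhead

# The four size tiers from this section, assuming Q4 (4 bits per parameter):
for size in (7, 14, 32, 70):
    print(f"{size}B @ Q4 = about {estimate_vram_gb(size):.1f} GB VRAM")
```

The numbers this prints line up with the ranges above: roughly 4 GB for 7B, 8 GB for 14B, 19 GB for 32B, and 42 GB for 70B at a Q4 quantization.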
20:57 And so a 3090 is a really good example of a graphics card that can run this. It
21:03 has 24 GB of VRAM. I actually have two 3090s myself. And I'll have a link to my
21:08 exact PC that I built for running local AI in the description of this video. So
21:12 I have two 3090s, which we'll need in a second for a 70 billion parameter model, but
21:16 one is enough for a 32 billion parameter model. And then also Macs with their new
21:22 M4 chips are very powerful with their unified memory architecture. So if you
21:27 get a Mac M4 Pro with 24 GB of unified memory, you can also run 32 billion
21:32 parameter models. Now the speed isn't going to be the best necessarily, and
21:36 again, this does depend a lot on your computer overall, but you can expect
21:40 something around 10 to 15, maybe up to 20 tokens per second. And 32 billion
21:46 parameters is when you really start to see LLMs that are actually pretty
21:50 impressive. Like 7 billion and 14 billion, they are disappointing quite a
21:54 bit. I'll be totally honest. Especially when you try to use them with more
21:58 complicated agentic tasks. 32 billion, when you start to get into this range, is
22:02 when I'm actually genuinely impressed. I'm like, "Oh, this is actually pretty
22:05 close to the performance of some of the best cloud AI." And then 70 billion
22:10 parameters. This is going to take about 35 to 40 GB of VRAM. For most consumer
22:19 GPUs like 3090s and 4090s, even 5090s, that's not enough VRAM on one card. And so
22:22 this is when you have to start to split a large language model across multiple
22:28 GPUs, which solutions like Ollama will actually help you do right out of
22:31 the box. So it's not this insane setup even though it might feel kind of
22:35 daunting like oh I have to split the layers of my LLM between GPUs. It's not
22:38 actually that complicated. And so two 3090s or two 4090s, that will be necessary.
22:44 Or you could have more of an enterprise-grade GPU like an H100. So
22:49 Nvidia has a lot of these non-consumer grade GPUs that have a lot more VRAM to
22:53 handle things like 70 billion parameter models. And the speed won't be the best
22:58 if you're using something like two 3090s, especially because performance is hurt
23:01 when you have to split an LLM between GPUs. You could expect something like 8
23:06 to 12 tokens per second. And this is obviously if you have the most complex
23:09 agents and you're really trying to match the performance of cloud AI as
23:13 much as possible, that's when you'd want to use a 70 billion parameter model. And
23:16 then if you're investing in hardware to run local AI, I have a couple of quick
23:20 recommendations here. And a lot of this depends on the size of the model that's
23:24 going to be good enough for your use case. And so I'll dive into some
23:28 alternatives for running local AI directly if you want to do testing
23:32 before you buy infrastructure. I'll get into that in a little bit. But for
23:35 recommended builds: if you want to spend around $800 to build a PC, I'd recommend
23:41 getting a 4060 Ti graphics card and then 32 GB of RAM. If you want to spend
23:47 $2,000, I'd recommend either getting a PC with a 3090 and 64 GB of RAM or
23:53 getting that Mac M4 Pro with 24 GB of unified memory. And then lastly, if you
23:58 want to spend $4,000, which is about what I spent for my PC, then I'd
24:02 recommend getting two 3090 graphics cards, and I got both of mine used for
24:07 around $700 each. And then also getting 128 GB of RAM, or you can get a
24:15 Mac M4 Max with 64 GB of unified memory. So, I wanted to really get into the
24:19 nitty-gritty details there. So, I know I spent a good amount of time diving into
24:22 super specific numbers, but I hope this is really helpful for you. No matter the
24:26 large language model or your hardware, you now know generally where you're at
24:30 for what you can run. So, to go along with that information overload, I want
24:34 to give you some specifics, individual LLMs that you can try right now based on
24:38 the size range that you know will work for your hardware. So, just a couple of
24:42 recommendations here. The first one that I want to focus on is DeepSeek R1. This
24:47 is the most popular local LLM ever. It completely blew up a few months ago. And
24:52 the best part about DeepSeek R1 is they have an option that fits into each of
24:56 the size ranges that I just covered in that chart. So they have a 7 billion
25:00 parameter, 14, 32, and 70. The exact numbers that I mentioned earlier. And
25:05 then there is also the full real version of R1, which is 671 billion parameters.
25:10 I'm sorry though, you probably don't have the hardware to run that unless
25:13 you're spending tens of thousands on your infrastructure. So, probably stick
25:16 with one of these based on your graphics card or if you have a Mac computer, pick
25:19 the one that'll work for you and just try it out. You can click on any one of
25:23 these sizes here. And then here's your command to download and run it. And this
25:28 is defaulting to a Q4 quantization, which is what I was assuming in the
25:31 chart earlier. And again, I will cover what that actually means in a little bit
25:35 here. The other one that I want to focus on here is Qwen 3. This is a lot newer.
25:41 Qwen 3 is so good. And they don't have a 70 billion parameter option, but they do
25:45 have all the other sizes that fit into those ranges that I mentioned
25:49 earlier. Like they've got 8 billion, 14 billion, and 32 billion parameters. And
25:52 the same kind of deal where you click on the size that you want and you've got
25:55 your command to install it here. And this is a reasoning LLM just like
26:01 DeepSeek R1. And then the other one that I want to mention here is Mistral Small.
26:05 I've had really good results with this as well. There are fewer options here,
26:08 but you've got 22 or 24 billion parameters, which is going to work well
26:12 with a 3090 graphics card or if you have a Mac M4 Pro with 24 GB of unified
26:18 memory. Really, really good model. And then also, there is a version of it that
26:22 is fine-tuned for coding specifically called Devstral, which is another
26:26 really cool LLM worth checking out as well if you have the hardware to run it.
26:30 So, that is everything for just general recommendations for local LLMs to try
26:34 right now. This is the part of the master class that is going to become
26:38 outdated the fastest because there are new local LLMs coming out every single
26:42 month. I don't really know how long my recommendations will last. But in
26:45 general, you can just go to the model list in Ollama, search around,
26:49 find one that has a size that works with your graphics card, and just give it
26:52 a shot. You can install it and run it very easily with Ollama. And the other
26:57 thing that I want to mention here is you don't always have to run open-source
27:01 large language models yourself. You can use a platform like OpenRouter. You can
27:05 just go to openrouter.ai, sign up, and add in some API credits. You can try these
27:10 open-source LLMs yourself, maybe if you want to see what's powerful enough for
27:15 your agents before you invest in hardware to actually run them yourself.
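Under the hood, OpenRouter exposes an OpenAI-compatible chat completions endpoint, so trying a hosted open-source model is just an HTTP request. A minimal sketch; the model slug `qwen/qwen3-32b` and the placeholder key are assumptions to verify on openrouter.ai:

```python
# Sketch: build a chat completion request for OpenRouter's OpenAI-compatible
# API. The model slug and API key below are placeholders -- check openrouter.ai
# for the exact slug and use your own key.
import json
import urllib.request

def build_request(model, prompt, api_key):
    """Build (but do not send) a chat completion request for OpenRouter."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_request("qwen/qwen3-32b", "Say hello in five words.", "YOUR_API_KEY")
print(req.full_url)  # https://openrouter.ai/api/v1/chat/completions
# With a real key, urllib.request.urlopen(req) would send it and return JSON.
```

Because the request shape is the OpenAI one, swapping between a model hosted here and a model you run yourself later is mostly a matter of changing the base URL and the model name.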
27:18 And so within OpenRouter, I can just search for Qwen here. And I can go down
27:23 to Qwen and I can go to 32 billion. They have a free offering as well that
27:26 doesn't have the best rate limits. So I'll just go to this one right here,
27:31 Qwen 3 32B. So I can try the model out through OpenRouter. They actually host
27:35 it for me. So it's an open-source, non-local version, but now I can try it
27:39 in my agents to see if this is good. And then if it's good, it's like, okay, now
27:43 I want to buy a 3090 graphics card so that I can install it directly through
27:47 Ollama instead. And so the 32 billion Qwen 3 is exactly what we're seeing here
27:51 in OpenRouter. And there are other platforms like Groq as well where you
27:55 can run these open-source large language models not on your own infrastructure
27:58 if you just want to do some testing beforehand or whatever that might
28:01 be. So I wanted to call that out as an alternative as well. But yeah, that's
28:04 everything for my general recommendations for LLMs to try and use
28:09 in your agents. All right, it is time to take a quick breather. This is
28:12 everything that we've covered already in our master class. What is local AI? Why
28:17 we care about it? Why it's the future and hardware requirements. And I really
28:20 wanted to dive deep into this stuff because it sets the stage for everything
28:24 that we do when we actually build agents and deploy our infrastructure. And so
28:28 the last thing that I want to do with you before we really start to get into
28:33 building agents and setting up our package is I want to talk about some of
28:37 the tricky stuff that is usually pretty daunting for anyone getting into local
28:42 AI. I'm talking things like offloading models, quantization, environment
28:47 variables to handle things like flash attention, all the stuff that is really
28:51 important that I want to break down simply for you so you can feel confident
28:55 that you have everything set up right, that you know what goes into using local
29:00 LLMs. The first big concept to focus on here is quantization. And this is
29:04 crucial. It's how we can make large language models a lot smaller so they
29:10 can fit on our GPUs without hurting performance too much. We are lowering
29:14 the model precision here. And so basically what that means is we have
29:18 each of our parameters, all of our numbers for our LLMs, at 16 bits
29:23 for the full size, but we can lower the precision of each of those parameters to
29:28 8, 4, or 2 bits. Don't worry if you don't understand the technicalities of
29:31 that. Basically, it comes down to LLMs are just billions of numbers. That's the
29:35 parameters that we already covered. And we can make these numbers less precise
29:40 or smaller without losing much performance. So, we can fit larger LLMs
29:45 within a GPU that normally wouldn't even be close to running the full-size model.
29:50 Like with 32 billion parameter LLMs, for example, I was assuming a Q4
29:55 quantization, like 4 bits per parameter, in that chart earlier. If you had the
30:00 full 16-bit parameters for the 32 billion parameter LLM, there's no way it could
30:06 fit on your Mac or your 3090 GPU, but we can use quantization to make it
30:10 possible. It's like rounding a number that has a long decimal to something
30:15 like 10.44 instead of this thing that has like 10 decimal places, but we're
30:19 doing it for each of the billions of parameters, those numbers that we have.
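That rounding analogy can be made concrete with a toy sketch. This uses plain uniform quantization over a fixed range; real schemes (like the grouped K-quants Ollama uses) also store per-group scale factors, so this is only illustrative:

```python
# A toy version of quantization: snap each "weight" to the nearest of
# 2**bits - 1 + 1 evenly spaced levels over a fixed range. Fewer bits means
# fewer levels, which means a bigger rounding error per number.
def quantize(weights, bits, lo=-1.0, hi=1.0):
    levels = 2 ** bits - 1               # number of steps between lo and hi
    step = (hi - lo) / levels
    return [lo + round((w - lo) / step) * step for w in weights]

weights = [0.3172, -0.8541, 0.0009, 0.6666]
for bits in (8, 4, 2):
    q = quantize(weights, bits)
    worst = max(abs(w - x) for w, x in zip(weights, q))
    print(f"{bits}-bit worst rounding error: {worst:.4f}")
```

At 8 bits the worst error is under 0.004; at 2 bits it balloons to about a third of the whole range, which is exactly the quality cliff described below for Q2.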
30:23 And so just to give you a visual representation of this, you can also
30:27 quantize images just like you can quantize LLMs. And so we have our full
30:31 scale image on the left-hand side here, comparing it to different levels of
30:35 quantization. We have 16-bit, 8-bit, and 4-bit. And you can see that at first with
30:40 a 16-bit quantization, it almost looks the same. But then once we go down to
30:44 4-bit, you can very much see that we have a huge loss in quality for the image.
30:49 Now, with images it's more extreme than with LLMs. When we do an 8-bit or a 4-bit
30:54 quantization of an LLM, we don't actually lose that much performance, whereas we
30:58 lose a lot of quality with images. And so that's why it's so useful for us. And so I have
31:01 a table just to kind of describe what this looks like. So FP16, that's the
31:07 16-bit precision that all LLMs have as a base. That is the full size. The speed
31:11 is obviously going to be very slow because the model is a lot bigger, but
31:16 your quality is perfect compared to what it could be. I mean, obviously that
31:18 doesn't mean that you're going to get perfect answers all the time. I'm just
31:22 saying it's the 100% results from this LLM. And then going down to a Q8
31:28 precision, that's half the size. The speed is going to be a lot better. And
31:33 the quality is near-perfect. So it's not like performance is cut in half just
31:37 because size is. You still have the same number of parameters. Each one is just a
31:42 bit less precise. And so you're still going to get almost the same results.
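The size column of that table is simple arithmetic: file size scales linearly with bits per parameter. A quick sketch for a 32-billion-parameter model (real downloads run slightly larger because of metadata and per-group scale factors):

```python
# File size scales linearly with bits per parameter: Q8 is half of FP16,
# Q4 a fourth, Q2 an eighth. Computed here for a 32B-parameter model.
PARAMS = 32e9
sizes = {name: PARAMS * bits / 8 / 1e9 for name, bits in
         [("FP16", 16), ("Q8", 8), ("Q4", 4), ("Q2", 2)]}
for name, gb in sizes.items():
    print(f"{name}: ~{gb:.0f} GB")
# prints FP16: ~64 GB, Q8: ~32 GB, Q4: ~16 GB, Q2: ~8 GB
```

This is also why a 24 GB 3090 can hold the Q4 of a 32 billion parameter model but not the Q8.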
31:47 And then going down to a Q4, 4-bit, it's a fourth the size. It's going to be very
31:52 fast compared to 16 bit. And the quality is still going to be great. Now, these
31:57 numbers are very vague on purpose. There's not really a way for me to
32:01 quantify exactly the difference, especially because it changes per LLM
32:05 and your hardware and everything like that. So, I'm just being very general
32:09 here. And then once you get to Q2, the size goes down a lot. It's going to
32:13 be very very fast, but usually your performance starts to go down quite a
32:17 bit once you go down to a Q2. And then like the note that I have in the bottom
32:22 left here, a Q4 quantization is generally the best balance. And so when
32:26 you are thinking to yourself, which large language model should I run? What
32:31 size should I use? My rule of thumb is to pick the largest large language model
32:37 that can work with your hardware with a Q4 quantization. That is why I assumed
32:42 that in the table earlier. And then also, like we saw in Ollama earlier, it always
32:47 defaults to a Q4 quantization, because the 16-bit is just so big compared to Q4
32:52 that most of the LLMs you couldn't even run yourself. And a Q4 of a 32 billion
32:59 parameter model is still going to be a lot more powerful than the full 7
33:02 billion or 14 billion parameter model because you don't actually
33:07 lose that much performance. So that is quantization. So just to make this very
33:11 practical for you, I'm back here in the model list for Qwen 3. We have all these
33:15 models that don't specify a quantization, but we can see that it
33:20 defaults to Q4 because if I click on any one of them, the quantization right here
33:26 is Q4_K_M. And don't worry about the K_M. That's just a way to group
33:30 parameters. You have K_S, K_M, and K_L. It's kind of outside of the scope of
33:33 what really matters for you. The big thing is the Q4 like the actual number
33:38 here. So Q4 quantization is the default for Qwen 3 32B and really any model in
33:44 Ollama. But if we want to see the other quantized variants and we want to run
33:48 them, you can click on the view all. This is available no matter the LLM that
33:52 you're seeing in Ollama. Now we can scroll through and see all the levels of
33:56 quantization for each of the parameter sizes for Qwen 3. So, if I scroll all
34:01 the way down, the absolute biggest version of Qwen 3 available is the
34:08 full 16-bit of the 235 billion parameter Qwen 3. And it is a whopping 470 GB just
34:14 to install this. And there is no way that you're ever going to lay hands on
34:17 infrastructure to run this unless you're working for a very large enterprise. But
34:22 I can go down here, let's say, to 14 billion parameters and I can run the Q4
34:27 like this. So, you can click on any one that you want to run. Like let's say I
34:30 want to run Q8. I can click on this and then I have the command to pull and run
34:35 this specific quantization of the 14 billion parameter model. So each of the
34:39 quantized variants has a unique ID within Ollama. So you can very
34:42 specifically choose the one that you want. Again, my general recommendation is
34:47 just to go with what Ollama recommends, which is just defaulting to
34:51 Q4. Like if I go to DeepSeek R1, you can see that also defaults to Q4 no matter
34:56 the size that I pick. But if you do want to explore different quantizations, you
35:00 want to try to run the absolute full model for maybe something smaller like 7
35:04 billion or 14 billion, you can definitely do that through Ollama and
35:08 really any other provider of local LLMs. So that is everything for quantization.
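That rule of thumb from earlier, pick the largest model your hardware can handle at a Q4 quantization, can be written as a tiny helper. It assumes roughly 0.5 bytes per parameter at Q4 plus 20% overhead, which is my own ballpark rather than an exact formula:

```python
# Rule of thumb as code: given a VRAM budget in GB, pick the largest model
# (in billions of parameters) that still fits at a Q4 quantization.
# 0.5 bytes/param at Q4 plus ~20% overhead is an assumption, not exact.
def largest_q4_model(vram_gb, sizes_b=(7, 14, 32, 70)):
    fits = [s for s in sizes_b if s * 0.5 * 1.2 <= vram_gb]
    return max(fits) if fits else None

print(largest_q4_model(24))  # a 3090's 24 GB -> 32 (billion parameters)
print(largest_q4_model(8))   # a 3060 Ti's 8 GB -> 7
```

Plug in your own card's VRAM and the answer matches the tiers from the hardware section.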
35:12 It's important to know how that works, but yes, generally stick with a Q4 of
35:17 the largest LLM that you can run. The next concept that is very important to
35:21 understand is offloading. Offloading is just splitting the layers of your
35:26 large language model between your GPU and your CPU and RAM. It's kind of
35:30 crazy, but large language models don't have to fit entirely in your GPU. All
35:36 large language models can be split into layers, layers of the different weights,
35:40 and you can have some of it running on your GPU. So, it's stored in your VRAM
35:45 and computed by the GPU. And then some of the large language model is stored in
35:50 your RAM, computed by the CPU. Now, this does hurt performance a lot. And so,
35:55 generally, you want to avoid offloading if you can. You want to be able to fit
35:59 everything in your GPU, which by the way, the context, like your prompts for
36:04 your local LLMs, that is also stored in VRAM. And so, sometimes you'll see what
36:08 happens when you have very long conversations with a large language model
36:13 that barely fits in your GPU. That'll actually tip it over the edge. So, it
36:16 starts to offload some of it to the CPU and RAM. So keep that in mind: when you
36:19 have longer conversations and all of a sudden things get really slow, you know
36:24 that offloading is happening. Sometimes this is necessary though as context
36:28 grows. And if you're only offloading a little bit of the LLM or a little bit of
36:32 the conversation, whatever to the CPU and RAM, it won't affect performance
36:36 that much. And so sometimes if you're trying to squeeze the biggest size you
36:41 can into your machine for an LLM, you can take advantage of offloading to run
36:45 something bigger or have a much larger conversation. Just know that usually it
36:49 kind of sucks. Like when I have offloading start to happen, my machine
36:53 gets bogged down and the responses are a lot slower. It's really not fun, but it
36:59 is possible. And fun fact, by the way, if your GPU is full and your CPU and RAM
37:04 are full, you can actually offload to storage, like literally using your hard
37:07 drive or SSD. That's when it's like incredibly slow and just terrible. But
37:11 just fun fact, you can actually do that. Now, the very last thing that I want to
37:15 cover before we dive into some code, setting up the local AI package, and
37:21 building out some agents is a few very crucial parameters, environment
37:25 variables for Ollama. So, these are environment variables that you can set
37:28 on your machine just like any other based on your operating system. And
37:32 Ollama does have an FAQ for setting up some of these things, which I'll link to
37:36 in the description as well. But yeah, these are a bit more technical, so
37:40 people skip past setting this stuff up a lot, but it's actually really, really
37:44 important to make things very efficient when running local LLMs. So the first
37:49 environment variable is flash attention. You want to set this to one or true.
37:54 When you have this set to true, it's going to make the attention calculation
37:59 a lot more efficient. It sounds fancy, but basically large language models when
38:04 they are generating a response, they have to calculate which parts of your
38:08 prompt to pay the most attention to. That's the calculation. And you can make
38:12 it a lot more efficient without losing much performance at all by setting up
38:16 the flash attention, setting that to true. And then for another optimization,
38:21 just like we can quantize the LLM itself, you can also quantize or
38:27 compress the context. So your system prompt, the tool descriptions, your
38:31 prompt and conversation history, all that context that's being sent to your
38:36 LLM, you can quantize that as well. So Q4 is my general recommendation for
38:41 quantizing LLMs. Q8 is the general recommendation for quantizing the
38:46 context memory. It's a very simplified explanation, but it's really, really
38:50 useful because a long conversation can also take a lot of VRAM just like larger
38:55 LLM. And so it's good to compress that. And then the third environment variable,
38:58 this is actually probably the most crucial one to set up for Ollama. There
39:02 is this crazy thing. I don't know why Ollama does it, but by default, they
39:07 limit every single large language model to 2,000 tokens for the context limit,
39:13 which is just tiny compared to, you know, Gemini being 1 million tokens and
39:17 Claude being 200,000 tokens. Like, they handle very, very large prompts. And a
39:21 lot of local large language models can also handle large prompts. But Ollama
39:25 will limit you by default to 2,000 tokens. And so you have to override that
39:30 yourself with this environment variable. And so generally I recommend starting
39:34 with about 8,000 tokens. You can move this all the way up to
39:38 something like 32,000 tokens if your local large language model supports
39:42 that. And if you view the model page on Ollama, you can see the context length
39:46 that's supported by the LLM. But you definitely want to, you know, jack this
39:50 up more from just 2,000 because a lot of times when you have longer
39:53 conversations, you're going to get past 2,000 tokens very, very quickly. So, do
39:57 not miss this. If your large language model is starting to go completely off
40:02 the rails and ignore your system prompt and forget that it has these tools that
40:06 you gave it, it's probably because you reached the context length. And so, just
40:10 keep that in mind. I see people miss this a lot. And then the very last
40:14 environment variable, probably the least important out of all these four,
40:18 but if you're running a lot of different large language models at once and you're
40:22 trying to shove them all in your GPU, a lot of times you can have issues. And so
40:25 in Ollama, you can limit the number of models that are allowed to be in your
40:29 memory at a single time. With this one, typically you want to set this to either
40:33 one or two. Definitely set this to just one if you are using large language
40:37 models that basically fill your GPU, like it's going to fit exactly into
40:41 your VRAM and you're not going to have room for another large language model.
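For reference, here is a sketch of how those four settings might be exported before starting the Ollama server. The variable names are the ones I believe current Ollama versions use, but they can change between releases, so verify them against the Ollama FAQ:

```shell
# Sketch: Ollama tuning via environment variables. Verify the exact names and
# accepted values against the Ollama FAQ for your version.
export OLLAMA_FLASH_ATTENTION=1       # more efficient attention calculation
export OLLAMA_KV_CACHE_TYPE=q8_0      # quantize the context (KV cache) to 8-bit
export OLLAMA_CONTEXT_LENGTH=8192     # raise the small default context limit
export OLLAMA_MAX_LOADED_MODELS=1     # how many models may sit in memory at once
```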
40:44 But if you are running smaller ones and maybe you could actually fit two on
40:48 your GPU with the VRAM that you have, you can set this to two. So again, more
40:52 technical overall, but it's very important to have these right. And we'll
40:55 get into the local AI package where I already have these set up in the
40:59 configuration. And then by the way, this is the Ollama FAQ that I referenced a
41:02 minute ago that I'll have linked in the description. And so there's actually a
41:06 lot of good things to read into here, like being able to verify that your GPU
41:10 is compatible with Ollama. How can you tell if the model's actually loaded on
41:13 your GPU? So, a lot of like sanity check things that they walk you through in the
41:17 FAQ as well. Also talking about environment variables, which I just
41:20 covered. And so, they've got some instructions here depending on your OS
41:23 how to get those set up. So, if there's anything that's confusing to you, this
41:26 is a very good resource to start with. So, I'm trying to make it possible for
41:30 you to look into things further if there's anything that doesn't quite make
41:33 sense for what I explained here. And of course, always let me know in the
41:35 comments if you have any questions on this stuff as well, especially the more
41:39 technical stuff that I just got to cover because it's so important even though I
41:43 know we really want to dive into the meat of things, which we are actually
41:47 going to do now. All right, here is everything that we have covered at this
41:50 point. And congratulations if you have made it this far because I covered all
41:55 the tricky stuff with quantization and the hardware requirements and offloading
41:59 and some of the configuration parameters. So, if you got all of that,
42:03 the rest of it is going to be a walk in the park as we start to dive into code,
42:07 getting all of our local AI set up and building out some agents. You understand
42:10 the foundation now that we're going to build on top of to make some cool stuff.
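Before moving on, the two Ollama environment variables covered above can be sketched in code. A minimal Python sketch, assuming you launch `ollama serve` yourself; the variable names come from the Ollama FAQ, and the values are the recommendations from this section:

```python
import os

# Ollama reads these at startup (names from the Ollama FAQ):
#   OLLAMA_CONTEXT_LENGTH    - default context window (roughly 2,000 tokens out of the box)
#   OLLAMA_MAX_LOADED_MODELS - how many models may sit in memory at once
ollama_env = {
    **os.environ,
    "OLLAMA_CONTEXT_LENGTH": "8192",   # ~8k tokens, the starting point recommended here
    "OLLAMA_MAX_LOADED_MODELS": "1",   # keep at 1 if one model nearly fills your VRAM
}

# You would then launch the server with this environment, e.g.:
#   subprocess.Popen(["ollama", "serve"], env=ollama_env)
print(ollama_env["OLLAMA_CONTEXT_LENGTH"])
```

In a plain shell, the equivalent is exporting the variables (for example `export OLLAMA_CONTEXT_LENGTH=8192`) before running `ollama serve`.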
42:15 And so, now the next thing that we're going to do is talk about how we can use
42:19 local AI anywhere. We're going to dive into OpenAI compatibility and I'll show
42:23 you an example. We can take something that is using OpenAI right now,
43:27 transform it into something that is using Ollama and a local LLM. So, we'll
42:31 actually dive into some code here. And I've got my fair share of no code stuff
42:35 in this master class as well, but I want to focus on both because I think it's
42:38 really important to use both code and no code whenever applicable. And that
42:42 applies to local AI just like building agents in general. So, I've already
42:45 promised a couple of times that I would dive into OpenAI API compatibility, what
42:50 it is, and why it's so important. And we're going to dive into this now so you
42:54 can really start to see how you can take existing agents and transform them into
42:59 being 100% local with local large language models without really having to
43:03 touch the code or your workflow at all. It is a beautiful thing because OpenAI
43:10 has created a standard for exposing large language models through an API.
43:14 It's called the chat completions API. It's kind of like how the Model Context
43:19 Protocol (MCP) is a standard for connecting agents to tools. The chat
43:23 completions API is a standard for exposing large language models over an
43:28 API. So you have this common endpoint along with a few other ones that all of
43:35 these providers implement. This is the way to access the large language model
43:39 to get a response based on some conversation history that you pass in.
43:43 So, Ollama has implemented this as of February. We have other providers too:
43:49 Gemini is OpenAI compatible, so is Groq, and OpenRouter, which we saw earlier.
43:53 Almost every single provider is OpenAI API compatible. And so, not only is it
43:57 very easy to swap between large language models within a specific provider, it's
44:02 also very easy to swap between providers entirely. You can go from Gemini to
44:09 OpenAI, or OpenAI to Ollama, or OpenAI to Groq, just by changing basically one
44:13 piece of configuration pointing to a different base URL as it is called. So
44:18 you can access that provider and then the actual API endpoint that you hit
44:22 once you are connected to that specific provider is always the exact same and
44:26 the response that you get back is also always the exact same. And so Ollama has
44:31 this implemented now. And I'll link to this article in the description as well
44:33 if you want to read through this because they have a really neat Python example.
44:37 It shows where we create an OpenAI client and the only thing we have to do
44:42 to connect to Ollama instead of OpenAI is change this base URL. So now we are
44:47 pointing to Ollama that is hosted locally instead of pointing to the URL for
44:51 OpenAI. So we'd reach out to them over the internet and talk to their LLMs. And
44:55 then with Ollama, you don't actually need an API key because everything's running
44:58 locally. So you just need some placeholder value here. But there is no
45:02 authentication that is going on. You can set that up. I'm not going to dive into
45:05 that right now. But by default, because it's all just running locally, you don't
45:09 even need an API key to connect to Ollama. And then once we have our OpenAI
45:13 client set up that is actually talking to Ollama, not OpenAI, we can use it in
45:18 exactly the same way. But now we can specify a model that we have downloaded
45:22 locally already through Ollama. We pass in our conversation history in the same way
45:27 and we access the response, like the content the AI produced, the token usage,
45:31 like all those things that we get back from the response in the same way.
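That base-URL swap is the whole trick. As a sketch of what the article's example boils down to, here is the same call made with only the Python standard library; the model name and placeholder key are illustrative, and the request shape is the standard chat completions payload:

```python
import json
import urllib.request

def chat(base_url: str, api_key: str, model: str, messages: list) -> dict:
    """POST to the /chat/completions endpoint of any OpenAI-compatible server."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps({"model": model, "messages": messages}).encode(),
        headers={
            "Content-Type": "application/json",
            # Ollama ignores the key, but the header keeps the request
            # identical to what you would send to api.openai.com.
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

messages = [{"role": "user", "content": "Say hello in one sentence."}]

# With Ollama running locally (uncomment to try):
# reply = chat("http://localhost:11434/v1", "ollama", "qwen3:14b", messages)
# print(reply["choices"][0]["message"]["content"])
```

Swapping the base URL to `https://api.openai.com/v1` and passing a real key is the only change needed to talk to OpenAI instead.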
45:34 They've got a JavaScript example as well. They have a couple of examples
45:38 using different frameworks like the Vercel AI SDK and AutoGen. Really any
45:44 AI agent framework can work with OpenAI API compatibility to make it very easy
45:47 to swap between these different providers. Pydantic AI, my favorite
45:52 AI agent framework, also supports OpenAI API compatibility. So you can easily
45:57 within your Pydantic AI agents swap between these different providers. And
46:02 so what I have for you now is two code bases that I want to cover. The first
46:07 one is the local AI package, which we'll dive into in a little bit. But right
46:13 now, we have all of the agents that we are going to be creating in this master
46:17 class. So I have a couple for N8N that are also available in this repository.
46:21 And then a couple of scripts that I want to share with you as well. And so the
46:25 very first thing that I want to show you is this simple script that I have called
46:31 OpenAI compatible demo. And so you can download this repository. I'll have this
46:34 linked in the description as well. There's instructions for downloading and
46:38 setting up everything in here. And this is all 100% local AI. And so with that,
46:43 I'm going to go over into my Windsurf here where I have this OpenAI compatible
46:47 demo set up. So I've got a comment at the top reminding us what the OpenAI API
46:52 compatibility looks like. We set our base URL to point to Ollama hosted
46:59 locally and it's hosted on port 11434 by default. So I can actually show you
47:02 this. I have Ollama running in a Docker container, which we're going to dive
47:05 into when we set up the local AI package, but you can see that it is
47:11 being exposed on port 11434. And by the way, you can see the
47:14 127.0.0.1 in that URL that I have highlighted here, that is synonymous with localhost.
47:21 And so this right here, you could also replace with 127.0.0.1.
47:25 Just a little tidbit there. It's not super important. I just typically leave
47:28 it as localhost. And then you can change the port as well. I'm just sticking to
47:32 what the default is. And then again, we don't need to set our API key. We can
47:36 just set it to any value that we want here. We just need some placeholder even
47:39 though there is no authentication with Ollama for real unless you configure
47:43 that. So that's OpenAI compatibility. And the important thing with this script
47:47 here is I have two different configurations here. I have one for
47:51 talking to OpenAI and then one for Ollama. So with OpenAI, we set our base
47:57 URL to point to api.openai.com. We have our OpenAI API key set in our
48:01 environment variables. So you can just set all your environment variables here
48:05 and then rename this to .env. I've got instructions for that in the readme of
48:08 course. And then going back to the script, we are using GPT-4.1 Nano for our
48:13 large language model. It's something super fast and cheap. And then for our
48:17 Ollama configuration, we are setting the base URL here, localhost:11434,
48:22 or just whatever we have set in our environment variables. Same thing for
48:26 the API key. And then same thing for our large language model. And what I'm going
48:31 to be using in this case is Qwen3 14B. That is one of the large language models
48:34 that I showed you within the Ollama website. Definitely a smaller one
48:38 compared to what I could run, but I just want to run something fast. And very
48:41 small large language models are great for simple tasks like summarization or
48:45 just basic chat. And that's what I'm going to be using here just for a simple
48:49 demo. And so whether it's enabled or not, this configuration is just based on
48:53 what we have set for our environment variables. And the important thing here
48:59 is the code that runs for each of these configurations just as we go through
49:03 this demo is exactly the same. We are parameterizing the configuration for the
49:08 base URL and API key. So we are setting up the exact same OpenAI client just
49:13 like we saw in the Ollama article, but just changing the base URL and API key.
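The demo script's pattern of parameterizing only the configuration can be sketched like this; the dictionary keys and model names here are illustrative stand-ins, not the exact variables from the repository:

```python
import os

# Two provider configurations, one code path. The agent code never changes;
# only the values below decide whether you talk to OpenAI or Ollama.
PROVIDERS = {
    "openai": {
        "base_url": "https://api.openai.com/v1",
        "api_key": os.environ.get("OPENAI_API_KEY", ""),
        "model": "gpt-4.1-nano",
    },
    "ollama": {
        "base_url": "http://localhost:11434/v1",
        "api_key": "ollama",  # placeholder; Ollama has no auth by default
        "model": "qwen3:14b",
    },
}

def client_config(provider: str) -> dict:
    """Return the kwargs you would hand to OpenAI(base_url=..., api_key=...)."""
    cfg = PROVIDERS[provider]
    return {"base_url": cfg["base_url"], "api_key": cfg["api_key"]}

print(client_config("ollama")["base_url"])
```

Everything downstream (the client creation, the chat completion call, the response handling) stays byte-for-byte identical between the two providers.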
49:17 And so then for example when we use it right here, it's
49:22 client.chat.completions.create, calling the exact same function no matter if we're
49:26 using OpenAI or Ollama. And then we're handling the response in the same way as
49:31 well. And so I'll go back to my terminal now. And so I went through all the steps
49:34 already to set up my virtual environment, install all of my dependencies. And so now I can run the
49:40 command OpenAI compatible demo. And now it's going to present the two
49:43 configuration options for me. And so I can run through OpenAI. So we'll go
49:46 ahead and do that first. And these two demos are going to look exactly the
49:50 same, but that is the point. And so we have our base URL here for OpenAI. We
49:54 have a basic example of a completion with GPT-4.1 Nano. There we go. So this
49:59 is the model that was used. Here are the number of tokens. And this is our
50:03 response. And then I can press enter to see a streaming response now as well. So
50:07 we saw it type out our answer in real time. And then I can press enter one
50:10 more time. This is the last part of the demo. Just say multi-turn conversation.
50:14 So we got a couple of messages here in our conversation history. So very nice
50:19 and simple. The point here is to now show you that I can run this and select
50:23 Ollama now instead and everything is going to look exactly the same and all
50:27 of the code is the same as well. It is only our configuration that is
50:31 different. And so it will take a little bit when you first run this because
50:36 Ollama has to load the large language model into your GPU. And so going to the
50:42 logs for Ollama, I can show you what this looks like here. And so when we first
50:47 make a request when Qwen3 14B is not loaded into our GPU yet, you're going to
50:52 see a lot of logs come in here, and you'll have this container up and
50:54 running when you have the local AI package which we'll cover in a little
50:57 bit. So it shows all the metadata about our model, like it's Qwen3 14B. We can
51:05 see here that we have a Q4_K_M quantization like we saw on the Ollama
51:09 website. What other information do we have here? There's just so much to
51:14 digest here. Another really important thing is we have the
51:19 context length. I have that set to 8,192 just like I recommended in the
51:22 environment variables. And then we can see that we offloaded all of the layers
51:26 to the GPU. So I don't have to do any offloading to the CPU or the RAM. I can
51:30 keep everything in the GPU, which is certainly ideal, like I said, to make
51:34 sure this is actually fast. And then when we get a response from Qwen3 14B,
51:41 we are calling the v1/chat/completions endpoint because it is OpenAI API
51:46 compatible. So that exact endpoint that we hit for OpenAI is the one that we are
51:50 hitting here with a large language model that is running entirely on our computer
51:54 in Ollama. And so the response I get back, it's actually a reasoning LLM as
51:58 well. So we even have the thinking tokens here, which is super cool. And so
52:02 we got our response. It's just printing out the first part of it here just to
52:04 keep it short. And then I can press enter. And we can see a streaming demo
52:08 as well. And it's going to be a lot faster this time because we do already
52:11 have the model loaded into our GPU. And so that first request when it first has
52:15 to load a model is always the slower one. And then it's faster going forward
52:19 once that model is already loaded in our GPU. And then as long as we don't swap
52:24 to another large language model and use that one, then it will remain in our GPU
52:28 for some time. And so then all of our responses after are faster. And then we
52:33 just have the last part of our demo here with a multi-turn conversation. So we
52:37 can see conversation history in action as well, just not with streaming here.
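Under the hood, multi-turn conversation is nothing more than resending a growing message list, since the model itself is stateless. A minimal sketch (the message contents are made up for illustration):

```python
# The model is stateless: multi-turn chat just means resending a growing
# list of messages on every /v1/chat/completions call.
history = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Name a planet."},
]

# Append the model's reply (shown here as a stand-in value) and the next
# user turn before making the following request.
history.append({"role": "assistant", "content": "Mars."})
history.append({"role": "user", "content": "How many moons does it have?"})

# The full list goes back to the server on the next request, which is why
# long conversations blow past a 2,000-token context window so quickly.
print(len(history))
```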
52:40 And everything's a bit slower with this large language model because
52:43 it is a reasoning one. So if you want faster
52:47 inference, you can always use a non-reasoning local LLM like Mistral or
52:52 Gemma, for example. So that is our very simple demo showing how this works. I
52:55 hope that you can see with this and again this works with other AI agent
52:59 frameworks like Agno or Pydantic AI or CrewAI as well. They all work in
53:03 this way, where you can use OpenAI API compatibility to swap between providers
53:08 so easily, so you don't have to recreate things to use local AI. And that's
53:11 something so important that I want to communicate with you because if I'm the
53:15 one introducing you to local AI I also want to show you how it can very easily
53:19 fit into your existing systems and automations. All right. Now, we have
53:23 gotten to the part of the local AI master class that I'm actually the most
53:27 excited for because over the past months, I have very much been pouring my
53:31 heart and soul into building up something to make it infinitely easier
53:35 for you to get everything up and running for local AI. And that is the local AI
53:40 package. And so, right now, we're going to walk through installing it step by
53:44 step. I don't want you to miss anything here because it's so important to get
53:47 this up and running, get it all working well. Because if you have the local AI
53:51 package running on your machine and everything is working, you don't need
53:55 anything else to start building AI agents running 100% offline and
53:59 completely private. And so here's the thing. At this point, we've been
54:04 focusing mostly on Ollama and running our local large language models. But there's
54:08 the whole other component to local AI that I introduced at the start of the
54:13 master class for our infrastructure: things like our database, local and
54:18 private web search, our user interface, agent monitoring. We have all these
54:23 other open-source platforms that we also want to run along with our large
54:27 language models and the local AI package is the solution to bring all of that
54:32 together curated for you to install in just a few steps. So, here is the GitHub
54:37 repository for the local AI package. I'll have this linked in the description
54:41 below. Just to be very clear, there are two GitHub repos for this master class.
54:45 We have this one that we covered earlier. This has our N8N and Python
54:49 agents that we'll cover in a bit, as well as the OpenAI compatible demo that
54:53 we saw earlier. So, you want to have this cloned and the local AI package as
54:57 well. Very easy to get both up and running. And if you scroll down in the
55:02 local AI package, I have very comprehensive instructions for setting
55:06 up everything, including how to deploy it to a private server in the cloud,
55:10 which we'll get into at the end of this master class, and a troubleshooting
55:13 section at the bottom. So, everything that I'm about to walk you through here,
55:17 there's instructions in the readme as well if you just want to circle back to
55:21 clarify anything. Also, I dive into all of the platforms that are included in
55:26 the local AI package. And this is very important because like I said, when you
55:30 want to build a 100% offline and private AI agent, it's a lot more than just the
55:35 large language model. You have all of the accompanying infrastructure like
55:39 your database and your UI. And so I have all that included. First of all, I have
55:44 N8N, that is our low-code/no-code workflow automation platform. We'll be building
55:48 an agent with N8N in the local AI package in a little bit once we have it
55:52 set up. We have Supabase for our open-source database. We have Ollama. Of
55:56 course, we want to have this in the package as well for our LLMs. Open Web
56:01 UI, which gives us a ChatGPT-like interface for us to talk to our LLMs and
56:06 have things like conversation history. Very, very nice. So, we're looking at
56:09 this right here. This is included in the package. Then we have Flowise. It's
56:13 similar to N8N. It's another really good tool to build AI agents with no/low
56:18 code. Qdrant, which is an open-source vector database. Neo4j, which is a
56:25 knowledge graph engine. And then SearXNG for open-source, completely free and
56:31 private web search. Caddy, which is going to be very important for us once
56:35 we deploy the local AI package to the cloud and we actually want to have
56:38 domains for our different services like N8N and Open WebUI. And then the last
56:42 thing is Langfuse. This is an open-source LLM engineering platform. It helps
56:47 us with agent observability. Now, some of these services are outside of the scope
56:52 for this local AI master class. I don't want to spend a half hour on every
56:56 single one of these services and make this a 10-hour video. I will be focusing
57:02 in this video on N8N, Supabase, Ollama, Open WebUI, SearXNG, and then Caddy once
57:08 we deploy everything to the cloud. So, I do cover like half of these services.
57:12 And the other thing that I want to touch on here is that there are quite a few
57:16 things included here. And so you do need about 8 GB of RAM on your machine or
57:21 your cloud server to run everything. It is pretty big overall. And so you can
57:26 remove certain things, like if you don't want Qdrant and Langfuse for example,
57:30 you can take those out of the package. More on that later. It doesn't have to
57:34 be super bloated, you can whittle this down to what you need. But yeah, there's
57:37 a lot of different things that go into building AI agents. And so I have all of
57:40 these services here so that no matter what you need, I've got you covered. And
57:44 so with that, we can now move on to installing the local AI package. And
57:48 these instructions will work for you on any operating system, any computer. Even
57:52 if you don't have a really good GPU to run local large language models, you
57:56 still could always use OpenAI or Anthropic, something like that, and then
58:00 run everything else locally to save on costs or just to have everything running
58:04 on your computer. And so there are a couple of prerequisites that you have to
58:08 have before you can do the instructions below. You need Python so you can run
58:12 the start script that boots everything up. Git or GitHub desktop so you can
58:16 clone this GitHub repository, bring it all onto your own machine. And then you
58:21 want Docker or Docker Desktop. And so I've got links for all of these. Docker
58:25 and Docker Desktop we need because all of these local AI services that I've
58:29 curated for you, they all run as individual Docker containers that are
58:34 all combined together in a stack. And so I'll actually show you, this is the end
58:36 result once we have everything up and running within your Docker Desktop. You
58:41 have this local AI Docker Compose stack that has all of the services running in
58:45 tandem, like Supabase and Redis and N8N and Flowise, Caddy, Neo4j. All of
58:50 these are running within this stack. That is what we're working towards right
58:54 now. And so make sure you have all these things installed. I've got links that'll
58:57 take you to installing no matter your operating system. Very easy to get all
59:01 of this up and running on your machine. Then we can move on to our first command
59:05 here, which is to clone this GitHub repository, bringing all of this code on
59:10 your machine so you can get everything running. And so you want to open up a
59:14 new terminal. So I've got a new PowerShell session open here. Going to
59:18 paste in this command. And I'm going to be doing this completely from scratch
59:22 with you. So you clone the repo and then I'm just going to change my directory
59:26 into local AI package, which was just created from this git clone command. So
59:31 those are the first two steps. The next thing is we have to configure all of our
59:36 environment variables. And believe it or not, this is actually the longest part
59:41 of the process. And once we have this taken care of, it's a breeze getting the
59:44 rest of this up and running. But there's a lot of configuration that we have to
59:49 set up for our different services like credentials for logging into our
59:54 Supabase dashboard or Neo4j. Things like our Supabase anonymous key and
59:59 private key. All these things we have to configure. And so within our terminal
60:04 here, you can do `code .` to open this within VS Code or Windsurf. I'll open this in
60:09 Windsurf. You just want to open up this folder within your IDE, and the specific
60:14 IDE that you use really doesn't matter. You just want to get to this .env.example
60:20 here. I'm going to copy it and then I'm going to paste it. And then I'm going to
60:24 rename this to .env. So we're taking the example, turning it into a .env file. So,
60:31 you want to make sure that you copy it and rename it like this. Then we can go
60:36 ahead and start setting all of our configuration. And I'll even zoom in on
60:39 this just so that it's very easy for you to see everything that we are setting up
60:44 here. So, first up, we have a couple of credentials for N8N. We have our
60:49 encryption key and our JWT secret. And it's very easy to generate these. In
60:53 fact, we'll be doing this a couple of times, but we'll use this open SSL
60:58 command to generate a random 32-character alphanumeric string that
61:02 we're going to use for things like our encryption key and JWT secret. And so,
61:08 OpenSSL is a command that is available for you by default on Linux and macOS.
61:12 You can just open up any terminal and run this command and it'll spit out a
61:16 long string that you can then just paste in for this value. For Windows, you
61:20 can't just open up any terminal and use OpenSSL, but you can use Git Bash, which
61:26 is going to come with GitHub Desktop when you install it. And so, I'll go
61:29 ahead and just search for that. If you just go to your search bar on your
61:32 bottom left on Windows and search for Git Bash, it's going to open up this
61:37 terminal like this. And so, I can go ahead and copy this command, go in here,
61:42 and paste it in. And then I can run it. And then, boom, there we go. I
61:45 know it's really small for you to see right now. I'm going to go ahead and
61:48 copy this because this is now the value that I can use for my encryption key.
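For reference, the Python route mentioned as an alternative can be as simple as the standard library's `secrets` module; the exact one-liner in the repo's readme may differ, but the idea is the same:

```python
import secrets

# Python equivalent of the OpenSSL approach: a random 32-character hex
# string for values like the N8N encryption key or JWT secret.
def generate_secret(n_bytes: int = 16) -> str:
    return secrets.token_hex(n_bytes)  # 16 bytes -> 32 hex characters

print(len(generate_secret()))
```

Run it once per secret so every key gets its own independent random value.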
61:52 And then you want to do the exact same thing to generate a JWT secret. And then
61:57 the other way that you can do this if you don't want to install git bash or
62:01 it's not working for whatever reason, you can use Python to generate this as
62:05 well. So I can just copy this command and then I can go into the terminal here
62:10 and I can just paste this in. And so it's going to just like with OpenSSL
62:15 generate this random 32 character string that I can copy and then use for my JWT
62:20 secret. There we go. And so I am going to get in the weeds a little bit here
62:24 with each of these different parameters, but I really want to make sure that I'm
62:27 clear on how to set up everything for you so you can really walk through this
62:31 step by step with me. And like I said, setting up the environment variables is
62:35 the longest part by far for getting the local AI package set up. So if you bear
62:39 with me on this, you get through this configuration, you will have everything
62:43 running that you need for local AI for the LLMs and your infrastructure. So
62:47 that's everything for N8N. Now we have some secrets for Supabase. And there
62:53 are some instructions in the Supabase documentation for how to get some of
62:57 these values. So it's this link right here, which I have open up on my
63:01 browser. So we'll reference this in a little bit here. But first, we can
63:05 set up a couple of other things. The first thing we need to define is our
63:11 Postgres password. So, Supabase uses Postgres under the hood for the
63:14 database. And so, we want to set a password here that we'll use to connect
63:19 to Postgres within N8N or a connection string that we have for our Python code,
63:23 whatever that might be. And this value can be really anything that you want.
63:26 Just note that you have to be very careful about using special characters like
63:31 percent symbols. So if you ever have any issues with Postgres, it's probably
63:36 because you have special characters that are throwing it off. That's something
63:39 that I've seen happen quite a few times. And so like I said, I want to mention
63:42 troubleshooting steps and things to make sure that it is very clear for you. So
63:47 for this Postgres password here, I'm just going to say test Postgres pass.
63:51 I'm just going to give some kind of random value here. Just end with a
63:54 couple of numbers. I don't care that I'm exposing this information to you because
63:58 this is a local AI package. These passwords are for services that never
64:02 leave my computer. So, it's not like you could hack me by connecting to anything
64:08 here. And then we have a JWT secret. And this is where we get into this link
64:13 right here in the Supabase docs. And so they walk you through generating a JWT
64:18 secret and then using that to create both your anonymous and your service
64:22 role keys. If you're familiar with Supabase at all, we need both of these
64:27 pieces of information. The anonymous key is what we share to our front end. This
64:30 is our public key. And then the service role key has all permissions for
64:33 Supabase. We'll use this in our backends for things like our agents. And
64:39 so you can just go ahead and copy this JWT secret.
64:43 And then you can paste this in right here. This is 32 characters long just
64:47 like the things that we generated with OpenSSL. I'm just going to be using
64:52 exactly what Supabase tells me to. And then what you can do with this is you
64:56 can select the anonymous key. Click on generate JWT and then I can copy this
65:02 value and then I will paste this for my anonymous token. And so I'm just
65:06 replacing the default value there for the anonymous key. And then going back
65:10 and selecting the service key, I'm going to generate that one as well. So it
65:13 looks very similar. They'll always start with ey, but these values are different
65:18 if you go towards the end. And so I'll go ahead and paste this for my service
65:22 role key. Boom. There we go. All right. And then for the Supabase dashboard
65:27 that we'll log into to see our tables and our SQL editor and authentication
65:31 and everything like that, we have our username here, which I'm just going to
65:35 keep as supabase. And then for the password, I can just say test supabase
65:39 pass. I'll just kind of use that as my common nomenclature here for my
65:42 passwords because I don't really care what that is right now. And then the last
65:45 thing that we have to set up is our pooler tenant ID. And it's not really
65:49 important to dive into what exactly this means. Just know that you can set this
65:52 to really anything that you want. Like I typically will just choose four digits
65:57 here, like 1000, for my pooler tenant ID. So that is everything that we need for
66:01 Supabase. And actually most of the configuration is for Supabase. Then we
66:06 have Neo4j. This is really simple. You can leave Neo4j for the username and
66:11 then I'll just say test Neo4j pass for my password here. So you just set the
66:15 password for the knowledge graph, and even if you're not using Neo4j you still have
66:19 to set this, but it just takes two seconds. Then we have Langfuse. This is
66:23 for agent observability. We have a few secrets that we need here. And for these
66:28 values they can really just be whatever you want. It doesn't matter because
66:31 these are just passwords, just like we had passwords for things like Neo4j.
66:35 So I can just say test clickhouse pass. And then I can do test minio pass. And
66:43 I mean, it really doesn't matter here. A random Langfuse salt. I'm just doing
66:47 completely whack values here. You probably want something more secure in
66:51 this case, but I'm just doing something as a placeholder for now.
66:56 There we go. Okay, good. And then the last thing that we need
66:59 for Langfuse is an encryption key. And this is also generated with OpenSSL like
67:04 we did for the N8N credentials. And so I'll go back to my git bash terminal.
67:08 And again, you can do this with Python as well. I'll just run the exact same
67:12 command. I'll get a different value this time. And so I'll go ahead and copy
67:16 that. You could technically use the same value over and over if you wanted to,
67:20 but obviously it's way more secure to use a different value for each of the
67:24 encryption keys that you generate with OpenSSL. So there we go. That is our
67:28 encryption key. And that is actually everything that we have to set up for
67:32 our environment variables when we are just running the local AI package on our
67:37 computer. Once we deploy it to the cloud and we actually want domains for our
67:41 different services like Open WebUI and N8N, then we'll have to set up Caddy. So
67:45 this is where we'll dive into domains and we'll get into this at the end of
67:49 the master class here. But everything past this point for environment
67:54 variables is completely optional. You can leave all of this exactly as it is
67:59 and everything will work. Most of this is just extra configuration for
68:08 Supabase. So, Supabase is definitely the biggest service that's included in
68:08 this list of, you know, curated services for you. And so, there's a lot of
68:11 different configuration things you can play around with if you want to dive
68:15 more into this. You can definitely look at the same documentation page that we
68:22 were using for the Supabase secrets. And so, you can scroll through this if
68:27 you want to learn more, like setting up email authentication or Google
68:32 authentication, diving more into all of those different configuration things
68:36 for Supabase if you want to. I'm not going to get into all
68:40 of this right now because the core of getting Supabase up and running we
68:44 already have taken care of with the credentials that we set up at the top
68:47 right here. And so these are just the base things, and that's
68:47 what we'll stick to right now. So that is everything for our environment
68:51 variables. So then going back to our README, which I have open directly in
68:55 Windsurf now instead of my browser, we have finished our configuration and I do
68:59 have a note here that you want to set things up for Caddy if you're deploying
69:03 to production. Obviously we're doing that later not right now like I said and
69:07 so with that we are good to start everything. Now before we spin up the
69:12 entire local AI package there is one thing that I want to cover. It's
69:14 important to cover this before we run things. If you don't want to run
69:19 everything in the package, cuz it is a lot, maybe you only want to use half
69:22 of these services and you don't want Neo4j and Langfuse and Flowise right
69:28 now. There are two options that you have. The easiest one right now is to go
69:33 into the docker compose file. This is the main file where all of the services
69:38 are curated together and you can just remove the services that you don't want
69:42 to include. So, for example, if you don't want Qdrant right now, cuz it is
69:46 actually one of the larger services, it's like 600 megabytes of RAM just
69:50 having this running, you can search for Qdrant, and you can just go ahead and
69:54 delete this service from the stack like that. Boom. Now I don't have Qdrant.
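For illustration, a service entry in a Compose file looks roughly like this (the names and values here are a hypothetical sketch, not the package's exact definition); deleting the whole block removes the service, and the matching entry under the top-level volumes key can go too:

```yaml
services:
  qdrant:                        # delete this whole block to drop the service
    image: qdrant/qdrant
    ports:
      - "6333:6333"
    volumes:
      - qdrant_storage:/qdrant/storage

volumes:
  qdrant_storage: {}             # and remove the matching volume entry here
```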
69:59 It won't spin up as a part of the stack anymore. And then I also have a volume
70:03 for Qdrant. So, you can remove that as well. Volumes, by the way, are how we are
70:08 able to persist data for these containers. So if we tear down
70:12 everything and then we spin it back up, we still are going to have our open web
70:17 UI conversations and our N8N workflows, everything in Supabase, like all that
70:21 is still going to be saved because we're storing it all in volumes. So we can do
70:25 whatever the heck we want with these containers. We can tear them down. We
70:28 can update them, which I'll show you how to do later. We can spin it back up. And
70:32 all of our data will always be persisted. So you don't have to worry
70:35 about losing information. And you can always back things up if you want to be
70:39 really secure, but I've never done that before and I've been updating this
70:42 package for months and months and months and all of my workflows from 6 months
70:46 ago are still there. I haven't lost anything. And so that's just a quick
70:50 caveat there for how you can remove services if you want. And then another
70:54 thing that we don't have available yet, but I'm very excited to
70:58 talk about right now. It's in beta. We are creating it, me and
71:03 one other guy who's actually on my Dynamous team, Thomas. He's got a
71:07 YouTube channel as well. He's a great guy. We're working together on this.
71:09 He's actually been putting in most of the work creating a front-end
71:13 application for us to manage our local AI package. And one of the big things
71:17 with this is that we're going to make it possible for you to toggle on and off
71:22 the services that you want to have within your local AI package. So you can
71:27 very much customize the package to the services that you want to run. So you
71:31 can keep it lightweight just to the things you care about. Also, we'll be
71:34 able to manage environment variables and monitor the containers. Not all of this
71:38 is up and running at this point, but this is in beta. We're working on it.
71:41 I'm really excited for this. So, not available yet, but once this is
71:44 available, this will be a really good way for you to customize the package to
71:47 your needs. So, you don't have to go and edit the docker compose file directly.
71:51 So, that's something that I just wanted to get out of the way now. But, we can
71:57 start and actually execute our package now. Get all these containers up and
72:02 running. So the command that you run to start the local AI package is different
72:07 depending on your operating system and the hardware that you have. So for
72:13 example, if you are an Nvidia GPU user, you want to run this start_services.py
72:18 script. This boots up all of the containers, and you want to specifically
72:23 pass in the profile gpu-nvidia. This is going to start Ollama in a way where the
72:29 Ollama container is able to leverage your GPU automatically. And then if you are
72:34 using an AMD GPU and you're on Linux, then you can run it this way. Which by
72:38 the way, unfortunately, if you have an AMD GPU on Windows, you aren't able to
72:46 run Ollama in a container. And it's the same thing with Mac computers.
72:49 Unfortunately, like you see right here, you cannot expose your GPU to the Docker
72:55 instance. And so if you have an AMD GPU on Windows or are running on a Mac, you cannot
73:01 run Ollama in the local AI package. You just have to install it on your own
73:04 machine like I already showed you in this master class, and then you'll just
73:08 run everything else through the local AI package, and it can actually go out to
73:12 your machine and communicate with Ollama directly. So just a small limitation for
73:17 Mac and AMD on Windows. But if you're running on Linux or an Nvidia GPU on
73:21 Windows like I'm using, then you can go ahead and run this command right here.
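As a sketch, the start commands by platform look like this (the profile names here follow the convention just described; treat the exact flags as an assumption and check the README in your copy of the package):

```shell
# Nvidia GPU (Windows or Linux):
python start_services.py --profile gpu-nvidia

# AMD GPU on Linux:
python start_services.py --profile gpu-amd

# CPU only:
python start_services.py --profile cpu

# Don't start Ollama in the stack at all (use a host install instead):
python start_services.py --profile none
```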
73:27 So if you can't use a GPU in the Ollama container, then you can always just
73:32 start in CPU mode, or you can run with a profile of none. This will actually make
73:36 it so that Ollama never starts in the local AI package. So you can just
73:40 leverage the Ollama that you have already running on your computer like I showed
73:43 you how to install already. So, just a couple of small caveats that I really
73:46 want to hit on there. I need to make sure that you're using the right
73:51 command. And so, in my case, I'm Nvidia on Windows. So, I'm going to copy this
73:55 command. Go back over into my terminal. I'll just clear it here. So, we have a
73:59 blank slate. And I'll paste in this command. And so, it's going to do quite
74:02 a few things initially. First, it's going to clone the Supabase repository
74:07 because Supabase actually manages its stack in a separate place. And so, we
74:10 have to pull that in. Then there's some configuration for SearXNG for our local
74:17 and private web search. And then I have a couple of warnings here saying that
74:20 the Flowise username and password are not set. By the way, if
74:24 you want to set the Flowise username and password, it's optional, but you can
74:29 do that if I scroll down right here. So you can set these values, those will
74:32 actually make those warnings go away, but you can also ignore them, too. So
74:35 anyway, I just wanted to mention that really quickly. But now what's happening
74:39 here is it starts by running all of the Supabase containers. And so there's
74:44 quite a bit that goes into Supabase, like I said. So we're running all of
74:47 that. It's getting all that spun up. And then once we run all of these, it's
74:51 going to move on to deploying the rest of our stack. And if you're running this
74:55 for the very first time, it will take a while to download all of these images.
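While you wait, you can watch the containers come up from a second terminal; `docker ps` is standard Docker CLI and shows each container's status as it starts:

```shell
# List running containers with their status and published ports.
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
```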
74:59 They're not super small. There's a lot of infrastructure that we're starting up
75:03 here. And so it'll take a bit. You just have to be patient. Maybe go grab your
75:06 coffee or make your next meal, whatever that is. And then everything will be up
75:09 and running once you are back. And so yeah, now you can see that we are
75:13 running the rest of the containers here. Um, and so we'll just wait for that to
75:16 be done. And then I'll show you what that looks like in Docker Desktop as
75:19 well. And so I'll give it a second here just to finish. Uh, looks like my
75:24 terminal glitched a little bit. Like I was scrolling and so it kind of broke it
75:27 a bit. But anyway, everything is up and running now. It'll look like this where
75:30 it'll say all of the containers are healthy or running or started. And then
75:34 if I go into Docker Desktop and I expand the local AI Compose stack, you want to
75:39 make sure that you have a green dot for everything except for the Ollama pull and
75:45 N8N import. These just run once initially and then they go down because
75:49 they're responsible for pulling some things for our local AI package. And so
75:53 yeah, I've got green dots for everything except for two right here. Now I'm
75:57 leaving this in here intentionally actually because there is a bug with
76:02 Supabase specifically if you are on Windows. So you'll see this issue where
76:08 the Supabase pooler is constantly restarting, and that also affects N8N
76:12 because N8N relies on the Supabase pooler. So it's constantly restarting as
76:17 well. If you see this problem, I actually talk about this in the
76:21 troubleshooting section of the readme. If you scroll all the way down, if the
76:24 Supabase pooler is restarting, you can check out this GitHub issue. And so I
76:29 linked to this right here, and he tells you exactly which file you want to
76:33 change. It's this one right here: docker/volumes/pooler/pooler.exs.
76:39 And you need to change the file's line endings to LF. And so I'll show you what I mean
76:43 by that. I'll show you exactly how to do this. It's like a super tiny random
76:47 thing, but this has tripped up so many people. So I want to include this
76:51 explicitly in the master class here. So you want to go within the supabase
76:56 folder, within docker, then volumes, then pooler, and then we have
77:00 pooler.exs. And basically, no matter your IDE, you can see the CRLF in the bottom right here.
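If you'd rather fix it from the command line than from the editor's status bar, stripping the carriage returns works too; a sketch using sed (the file path is the one mentioned above):

```shell
# Convert CRLF line endings to LF by deleting the trailing \r on each line.
sed -i 's/\r$//' supabase/docker/volumes/pooler/pooler.exs
```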
77:08 You want to click on this and then change it to lf and then make sure that
77:13 you save this file. Very easy to fix that. And then what you can do is you
77:19 can run the exact same command to spin everything up again. And so I'm going to
77:22 do this now. It's going to go through all the same steps. It'll be faster this
77:25 time because you already have everything pulled. And this, by the way, is how you
77:28 can just restart everything really quickly if you want to enforce new
77:31 environment variables or anything like that. So I want to include that
77:35 explicitly for that reason as well. And I'll go ahead and close out of this.
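For reference, restarting and upgrading both reuse the same profile flag. A sketch assuming the gpu-nvidia profile and a Compose project named localai (both are assumptions; check the README for the exact commands in your copy):

```shell
# Restart / apply new environment variables (same command as the first start):
python start_services.py --profile gpu-nvidia

# Upgrade flow: stop the containers, pull newer images, then start again.
docker compose -p localai --profile gpu-nvidia down
docker compose -p localai --profile gpu-nvidia pull
python start_services.py --profile gpu-nvidia
```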
77:39 And while this is all restarting, the other thing that I want to show you
77:42 in the readme is I also have instructions for upgrading the containers in the local AI package. So
77:49 when N8N has an update or Supabase has an update, it is your responsibility
77:53 because you're managing the infrastructure to update things yourself. And so you very simply just
77:58 have to run these three commands to update everything. You want to tear down
78:03 all of the containers and make sure you specify your profile, like gpu-nvidia, and
78:09 then you want to pull all of the latest images, again specifying your
78:13 profile. And then once you do those two things, you'll have the most up-to-date
78:17 versions of the containers downloaded. So you can go ahead and run
78:21 start_services.py with your profile just like we just did to restart things. Very easy to
78:25 update everything. And even though we are completely tearing down our
78:29 containers here before we upgrade them, we aren't losing any information because
78:33 we are persisting things in the volumes that we have set up at the top of our
78:37 Docker Compose stack. And so this is where we store all of our data, our
78:41 database and workflows. All these things are persisted. So we don't have
78:45 to worry about losing them. Very easy to upgrade things and you still get to keep
78:48 everything. You don't have to make backups and things like that unless you
78:54 just want to be ultra ultra safe. So now we can go back to our Docker desktop and
79:00 we've got green dots for everything now since we fixed that pooler.exs issue.
79:04 The only thing that we don't have green dots for is the N8N import and then we
79:08 have our Ollama pull as well because, like I said, those are the two things that
79:11 just have to run at the beginning and then they aren't ongoing processes like
79:16 the rest of our services. So, we have everything up and running. And if there
79:22 is anything that is a white dot besides the Ollama pull or N8N import, or if there's
79:27 anything that is constantly restarting, just feel free to post a comment and
79:31 I'll definitely be sure to help you out. And then also check out the
79:34 troubleshooting section as well. One thing that I'll mention really quick is
79:37 sometimes your N8N will constantly restart and it'll say something like the
79:42 N8N encryption key doesn't match what you have in the config. And the big
79:46 thing to keep in mind for that is you want to make sure that you set this
79:51 value for the encryption key before you ever run it for the first time.
79:53 Otherwise, it's going to generate some random default value and then if you
79:56 change this later, it won't match with what it expects. And so, yeah, my big
79:59 recommendation is like make sure you have everything set up in your
80:03 environment variables before you ever run the start services for the first
80:08 time. This should be run once you have your environment variables set up.
80:11 Otherwise, you risk any of these services creating default values that
80:14 then wouldn't match with the keys and things that you set up later. And so
80:18 with that, we can now go into our browser and actually explore all of
80:22 these local AI services that we have running on our computer now. Now over in
80:26 our browser, we can start visiting the different services that we have spun up.
80:30 Like here is N8N. You just have to go to localhost port 5678. It'll have you
80:35 create a local account when you first visit it. And then you'll have this
80:38 workflow view that should look very familiar to you if you have used N8N
80:41 in the past. And then we have Open Web UI, localhost port 8080. This is our
80:47 ChatGPT-like interface where we can directly talk to all of the models that we have
80:52 pulled in our Ollama container. Really, really neat. And then we have localhost
80:57 port 8000 for our Supabase dashboard. The sign-in definitely isn't pretty
81:00 compared to the managed version of Supabase. But once you enter in your
81:04 username and password that you have set for the environment variables for the
81:07 dashboard, then you have the very typical view where we have our tables
81:11 and we've got our SQL editor. Everything that you're familiar with with
81:14 Supabase. And that's the key thing with all these different services. They all
81:18 will look the exact same for you, pretty much. Like another one, for example:
81:25 if I go to localhost port 3000, we have Langfuse. This is for agent
81:28 observability and monitoring. And this is something I'm not going to dive into
81:31 in this master class. Like I said, I'm not covering all the services. But yeah,
81:34 I just want to show that like every single one of these pretty much you can
81:39 access in your browser. And by the way, the way that we know the specific port
81:43 to access for each of these services is by taking a look at either what it tells
81:48 us in Docker Desktop. So, like, we can see that Neo4j is, let's see, port
81:53 7474. For SearXNG, it's port 8081. For Flowise, it's port 3001. What's one
82:01 that we've seen already? Yeah, like Open Web UI is port 8080. So,
82:06 the port on the left is the one that we access in our browser. And then the port
82:11 on the right is what's mapped on the container. So, when we visit port 8080
82:16 on our computer, that goes into port 8080 on the container. And that's what
82:21 we have exposed. The other way that you can see the port that you need to use is
82:24 just by taking a look at this docker compose file. And you don't need to have
82:29 like a super good understanding of this docker compose file. But if you want to
82:33 customize your stack or even help me by making contributions to the local AI
82:36 package, this is the main place to make changes. And so for example, I can go
82:41 down to Flowise and I can see that the port is 3001. Or if I go down to, let's say, N8N, we can
82:49 see that the port is 5678. And so the port is always going to be
82:52 there somewhere in the service that you have set up. Like for the Langfuse
82:56 worker, it's 3030. That's more of a behind-the-scenes kind of service. But
82:59 let me just find one more example for you here. Yeah, like Redis, for
83:04 example is 6379. So you can see the ports in the Docker Compose as well. I
83:07 just want to call it out just to at least get you a little bit comfortable
83:11 and familiar with the Docker Compose file in case you want to customize
83:14 things. But the main thing is just leveraging what you see here in Docker
83:18 Desktop. Last thing in Docker Desktop really quickly, if you want to bring
83:21 more local large language models into the mix, you can do it without having to
83:25 restart anything. You just have to find the Ollama container in the Docker
83:29 Compose stack. Head on over to the exec tab. And now here we can run any
83:33 commands that we'd want. We're directly within the container here. And we can
83:36 use Ollama commands just like we did earlier on our host machine. And so, for
83:40 example, I ran ollama list already. So I can see the large language models that
83:43 have already been pulled in my Ollama container. If I want to pull more, I can
83:48 just do ollama pull and then find the ID for the model I want to use on the Ollama
83:52 website. And like I said, you don't have to restart anything. If I pull it here,
83:56 it's now in the container and I can immediately start using it in Open Web
84:00 UI or N8N. We'll see that in a little bit. And so that's just really important
84:03 because a lot of times you're going to want to start to use different large
84:06 language models and you don't want to have to restart anything. The ones that
84:11 are brought into the machine by default is determined by this line right
84:15 here. So if you want to change the ones that are pulled by default: I just have
84:20 Qwen 2.5 7B Instruct, a really small, lightweight model, that I have
84:23 brought into your Ollama container by default. If you want to add in
84:27 different ones, you can just update this line right here to include multiple
84:32 Ollama pulls. And so that way you can bring in Qwen 3 or Mistral Small 3.1,
84:36 whatever you want. This is just the one I have by default. And then all the
84:40 other ones that you saw in my list here, I've pulled myself. All right. Now that
84:45 we have the local AI package up and running, it is time to build some
84:50 agents. Now, we get to use our local AI package to actually build out an
84:53 application. And so, I'm going to start by introducing you to Open Web UI, and
84:58 we'll use it to talk to our Ollama LLM. So, we have an application kind of right
85:02 out of the box for us. Then I'll dive into building a local AI agent with N8N,
85:08 even connecting it to Open Web UI. So we have this custom agent that we built in
85:12 N8N and then we immediately have a really nice UI to chat with it. And then
85:16 we'll transition to Python building the exact same agent in Python as well. Like
85:21 I said, I want to focus on both no code and code to really make this a complete
85:24 master class so that whether you want to build with N8N or Python, you can see
85:28 how to connect to our different services that we have running locally like
85:32 Supabase and SearXNG and Open Web UI. So, we'll cover all of that and then I'll
85:36 get into deployments after this. But yeah, let's go ahead right now focus on
85:40 open web UI and building out some agents. So, back over in Open Web UI,
85:45 remember this is localhost port 8080. You want to set up your connection to
85:49 Ollama so we can start talking with our local LLMs in this nice interface. And
85:54 so bottom left, go to the admin panel, then go to settings and then the
85:58 connections tab. Here we can set up our connections both to OpenAI with our API
86:02 key, which we're not going to do right now, but then also the Ollama API. This
86:07 is what we want to set up. Now, usually by default, this value is just
86:12 localhost. And this is actually wrong. This is something that is so important
86:16 to understand. And this will apply when we set up credentials in N8N and Python
86:20 as well. When you are within a container, localhost means that you are
86:26 still referencing within the container. Open Web UI needs to reach out to the
86:32 Ollama container, not itself. So localhost is not correct here. This is
86:36 generally the default just because Open Web UI assumes that you're running on
86:39 your machine, and so then you would also have Ollama running on your machine. So
86:42 localhost usually works when you're outside of containers. But here we have
86:47 to change this. This is super important to get right. And so there are two
86:51 options we have. If you are running on a Mac or AMD on Windows and you want to
86:56 use Ollama running on your machine, not within a container, then you want to use
87:00 host.docker.internal. This is the way in Docker to tell the container to look outside to the host
87:07 machine, where you're running the containers and where Ollama is running
87:11 separately. Very important to know that. And then if you are running Ollama in the
87:16 container like I am doing, I have Ollama running in my Docker Desktop, you want
87:21 to change this to ollama. You're specifically calling out the service
87:26 that is running the Ollama container in your Docker Compose stack. And the way
87:30 that we know that this is the name specifically is because we just go back
87:35 to our all-important Docker Compose file. Ollama. So whenever there's an x and a
87:40 dash, you just ignore that. It's just the thing after it. So, ollama is the name
87:45 of our service running the container. And then if we wanted to connect to
87:49 something else like Flowise, flowise is the name of the service. Open Web UI,
87:55 it's open-webui. All of these top-level keys, these are the names when we
87:59 want our containers to be talking to each other. And all of this is possible
88:03 because they are within the same Docker network. And so I'll just show you that
88:06 so you know what I'm talking about here. If I go back to Docker Desktop, we have
88:11 this local AI Compose stack. All of these containers can now communicate
88:14 internally with each other by referencing the names, like Redis or
88:19 SearXNG. So, we'll be seeing that a lot when we're building out our agents as
88:22 well. So, I wanted to spend a couple minutes to focus on that. And so, you
88:25 can go ahead and click on save in the very bottom right. I know my face is
88:28 covering this right now, but you have a save button here. Make sure you actually
88:32 do that. And for this API key, I don't know why it's asking me to fill it
88:34 out. I don't really care about connecting to OpenAI. So I'll just put
88:37 some random value there and click save. And then boom, there we go. We are good.
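To recap the connection rule from this section, the two possible Ollama base URLs look like this (the service name ollama and the default port 11434 are the values used in this walkthrough):

```
# Ollama running as a service inside the Compose stack:
http://ollama:11434

# Ollama installed on the host machine (Mac, or AMD GPU on Windows):
http://host.docker.internal:11434
```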
88:40 And then a lot of times with Open Web UI, it also helps to refresh, otherwise
88:43 it doesn't load the models for some reason. So I just did a refresh of the
88:49 site here. Control F5. And then now we can select all of the local LLMs that we
88:53 have pulled in our Ollama container. And so, for example, I can use Qwen 2.5 7B.
88:57 That's the one that I just have by default. I can say hello. And it's going
89:02 to take a little bit cuz it has to load this model onto my GPU just like we saw
89:06 with Qwen 3 earlier. But then in a second here we'll get a response. And
89:10 there are actually multiple calls that are being done here. We have one to get
89:15 our response, one to get a title for our conversation on the left-hand side. And
89:19 then also if you click on the three dots here, you can see that it created a
89:23 couple of tags for this conversation. So a couple of things are fired off all
89:26 at once there. And I can test conversation history. What did I just
89:30 say? So yeah, I mean everything's working really well here. We have chat
89:34 history, conversation history on the left-hand side. There's so much that we
89:37 get out of the box. And so I wanted to show you this really quickly. Now we can
89:41 move on to building an agent in N8N. And I'll even show you how to connect it to
89:46 Open Web UI as well through this N8N agent connector. Really exciting stuff.
89:49 So let's get right into it. So I'm going to start really simple here by building
89:53 a basic agent. The main thing that I want to focus on is just connecting to
89:57 our different local AI services. So I am going to assume that you have a basic
90:01 knowledge of N8N here because this is not an N8N master class. And so I'm
90:04 starting with a chat trigger so we can talk to our agent directly in the UI.
90:08 We'll connect this to open web UI in a bit as well. And then I want to connect
90:14 an AI agent node. And so what we want to do is connect Ollama for the chat model and
90:18 then local Supabase for our conversation history, our agent memory.
90:22 And so for the chat model, I'm going to pick the Ollama chat model. I'm going to create
90:26 brand new credentials. You can see me do this from scratch. The URL that you want
90:32 for the base URL is exactly the same as what we just entered into open web UI.
90:35 And so if you are running Olama on your host machine like an AMD on Windows or
90:40 you are running on a Mac or you just don't want to run the Olama container,
90:46 then it is host.doccker.in. And then if you are referencing the
90:49 Olama container, we just reference Olama. That's the name of the service
90:53 running the Olama container in our stack. And then the port is 11434 by
90:57 default. And you can test this connection. So it'll do a quick ping to
91:01 the container to make sure that we are good to go. And I'll even show you what
91:04 that looks like. So right here in my Ollama container, I have the logs up. And
91:09 the last two requests were just simple GET requests to the root endpoint. We
91:13 have two of those right here. And if I click on retry and I go back to the
91:18 logs, boom, we are at three now. So it made three requests. So it's just making
91:22 that simple ping each time to make sure the container is available. And so I'm
91:26 going to go ahead and click on save and then close out. So now we have our
91:29 credentials and then we can automatically select the model that we
91:33 have loaded now in our container. And so just to keep things really lightweight,
91:36 I'm going to go with the 7 billion parameter model right now from Qwen 2.5.
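If that test ping ever fails, you can hit the same service yourself; `/api/tags` is the Ollama API route that lists pulled models (use the host-side URL here, since you're calling from your own terminal rather than from another container):

```shell
# From the host, list the models the Ollama container has pulled.
curl http://localhost:11434/api/tags
```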
91:40 Cool. All right. So that is everything that we need to connect Ollama. It is
91:44 that easy. And then we could even test it right now. So, I'm going to go ahead
91:47 and save this workflow. And I'm going to just say hello. And uh we don't need the
91:52 conversation history or tools or anything at this point. We're already
91:55 getting a response here from the LLM. It's working on loading the model into
91:59 my GPU as we speak. And so there we go. We got our answer looking really good.
92:04 Cool. So now we can add memory as well. So I'm going to add Postgres because,
92:08 remember, Supabase uses Postgres under the hood. And then I'm going to create
92:12 brand new credentials here. And this is actually probably the hardest one to set
92:16 up out of all of the credentials for connecting to our local AI service. And
92:20 so I'm going to show you what the Docker Compose file looks like, just so it's
92:24 clear how I'm getting these different values. And so I'll point out all of
92:28 them. So the first one, for our host, it is db, because this is the name of the
92:35 specific Supabase service that is the underlying Postgres
92:38 database. And I can show you how I got that really quick. If you go to the
92:42 supabase folder that we pull when we run that start_services script, I go to
92:47 docker and then docker compose. If I search for db and there's quite a few
92:53 dependencies on db here. So let me find the actual reference to it. Where is db?
92:58 Here we go. So yeah, it's really short. db is the name of our service that
93:03 actually is the Supabase DB. So this is the container name, this is what
93:07 you'll see in docker desktop. But then this is the underlying service that we
93:10 want to reference when we have our containers communicating with each
93:14 other. Like in this case, we have our N8N container talking to our Supabase
93:18 database container. And then the database and username are both going to
93:22 be postgres. Those are the values that we have by default. If you scroll down a
93:26 bit in the .env, you can see these right here. The Postgres database is
93:29 postgres and the user is also postgres. And you can customize these
93:33 things but these are some of the optional parameters that I didn't touch
93:36 in the setup with you. And so you can just leave those as is. Now the
93:41 Postgres password, this is one of the values that we set. That was the first
93:44 Supabase value that we set there. Make sure it matches what you have in
93:48 the .env. And then everything else you can kind of leave as the defaults here. The port is
93:53 going to be 5432. So that is everything for setting up our connection to
93:57 Postgres. You can test this connection as well. And then we can move on to
94:01 adding in some tools and things like that as well. But yeah, this is like the
94:06 very first basic version of the agent that I wanted to show you. And hopefully
94:10 with this you can see how, no matter the service that you have running in the
94:13 local AI package, it's very easy to figure out how to connect to it, both
94:17 with the help of N8N because N8N always makes it really easy to connect to
94:20 things. Then also just knowing that like you just have to reference that service
94:25 name that we have for the container in the Docker Compose stack. That's how we
94:28 can talk to it. So you could add in Qdrant or you could add in Langfuse.
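As one more recap of connecting a service, the Postgres credentials assembled above look like this (db, postgres, and 5432 are the walkthrough's defaults; the password is whatever you set in your .env):

```
Host:     db          (the Supabase Postgres service name in the stack)
Database: postgres
User:     postgres
Password: <your Postgres password from the .env>
Port:     5432
```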
94:31 Like you can connect anything that you want into our agent here. And so now we
94:36 have conversation history. Next up, I want to show you how to build a bit more
94:40 of a complicated agent with N8N using some tools. And then also I'm going to
94:44 show you how to connect it to Open Web UI. And so right now this is a live
94:47 demo. Instead of connecting to one of the Olama LLMs, I'm going straight to
94:53 N8N. I have this custom N8N agent connector. And so we are talking to this
94:57 agent that I'll show you how to build in a little bit. This one has a tool to use
95:02 SearXNG for local and private web search. This is one of the platforms that we
95:06 have included in the local AI package. And so this response is going to take a
95:10 little bit here because it has to search the web. And the response that it
95:13 generates with this question is pretty long. Like there we go. Okay. So we got
95:16 the answer. It's pretty long. But yeah, we are able to search the internet now
95:21 with a local agent, N8N connected to Open Web UI. We're getting pretty fancy
95:25 here. And we also have the title that was generated on the left. And then we
95:30 have the tags here as well. And so the way that this all works, I'm going to
95:33 start by explaining how we can connect N8N to Open Web UI. And this is just
95:37 crucial. Makes it so easy for us to test agents locally as we are developing
95:42 them. And so if you go to the settings and the admin panel in the bottom left
95:47 and go to functions, open web UI has this thing called functions which gives
95:51 us the ability to add in custom functionality kind of as like custom
95:56 models that we can then use like you saw with the N8N agent connector. And so
96:02 what I have here is this thing that I call the N8N pipe. And I'll have a link
96:05 to this in the description as well. I created this myself and I uploaded it to
96:10 the open web UI directory of functions. And so you can go to this link right
96:14 here. You can even just Google the N8N pipe for open web UI. And then you click
96:19 on this get button. It'll just have you enter in the URL for your open web UI.
96:23 So I can just like paste in this right here. Click on import to open web UI and
96:28 it'll automatically redirect you to your open web UI instance. So you'll have
96:33 this function now. And we don't have to dive into the code for how all
96:36 of this works. I worked pretty hard to create this for you. Actually, quite a
96:40 while ago I made this. And the thing that we need to care about is
96:44 configuring this to talk to our N8N agent. And so if you click on the
96:49 valves, the setting icon in the top right, there are a few values that we
96:54 have to set. And so now I'm going to go over to showing you how to build things
96:57 in N8N. Then all of this will click and it'll make sense. Right now, looking at
97:00 these values, you're probably like, how the heck do I get all of these? But
97:02 don't worry, we'll dive into all of that. But first, let's go into our N8N
97:07 agent. I'll explain how all of this works. So, first of all, we have our
97:12 chat trigger that gives us the ability to communicate with our agent very
97:16 easily in the workflow. We have a new trigger now for the web hook. And so,
97:22 this is turning our agent into an API endpoint. So, we're able to talk to it
97:27 with other services like open web UI. And so to configure the web hook here,
97:31 you want to make sure that it is a post request type. And then you can define a
97:35 custom path here. Whatever you set here is going to determine what our URL is.
97:40 So we have our test URL. And then also if you toggle the workflow to active,
97:44 this is really important. The workflow in n8n does have to be active. Then you
97:49 have access to this production URL. And this is actually the first value that we
97:54 need to set within the valves for this open web UI function. We have our N8N
97:59 URL. And because this is a container talking to another container, we don't
98:03 actually want to use this localhost value that it has here for us. We want
98:08 to specify N8N because N8N again is the name of the service running the N8N
98:13 container in our Docker Compose stack. So N8N port 5678. And then this is the
98:18 custom URL that we can determine based on this. And then the other thing that
98:23 we want to do is set up header authentication. We don't want to expose
98:27 this endpoint without any kind of security. And so we want to set up some
98:31 authentication. And so you can select header off from the authentication
98:34 dropdown. And then for the credentials here, I'll just create brand new ones to
98:38 show you what this looks like. The name needs to be authorization with a capital
98:43 A. This has to be very specific. The name in the top left and the name of
98:46 your credentials. This can be whatever you want, but this has to be
98:51 authorization. And then the value here, the way that we want to format this is:
98:55 it's going to be Bearer, then a space, and then whatever you want
99:00 your bearer token to be. So this is what you get to define, but it needs to start
99:05 with Bearer, capital B, and a space. And then whatever you type after the Bearer
99:09 and the space goes in as the n8n bearer token. So you don't include the Bearer and
99:13 space there, because it's just assumed that it's going to be prefixed with that.
99:16 So you just type in something like test auth, which is what I
99:21 have. So my full value is Bearer test auth, like that. And then this is what I
99:24 enter in for this field. Now I already have mine set up. So I'm just going to
99:27 go ahead and close out of this. And then the last thing that we have to set up
99:30 for the web hook. And don't worry, this is the node that we spend the most time
99:33 with. You want to go to the drop down here and change this to respond using
99:38 the respond to web hook node. very important because then at the end of our
99:40 workflow and we get the response from our agent, we're going to send that back
99:45 to whatever requested our API which is going to be open web UI in this case.
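To make that webhook configuration concrete, here is a hedged sketch of the POST request Open WebUI's pipe would send to the n8n webhook. The field names `chatInput`/`sessionId`, the path, and the `test auth` token are illustrative assumptions, not pulled from the repo:

```python
import json
import urllib.request

# Hypothetical values: inside the Compose stack, the n8n service is
# reached by its service name rather than localhost.
N8N_URL = "http://n8n:5678/webhook/invoke-n8n-agent"
BEARER_TOKEN = "test auth"  # whatever you typed after "Bearer " in the credential

def build_webhook_request(chat_input: str, session_id: str) -> urllib.request.Request:
    """Build the POST request Open WebUI's pipe would send to the n8n webhook."""
    payload = json.dumps({"chatInput": chat_input, "sessionId": session_id}).encode()
    return urllib.request.Request(
        N8N_URL,
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Header auth: name "Authorization", value "Bearer <your token>"
            "Authorization": f"Bearer {BEARER_TOKEN}",
        },
    )

req = build_webhook_request("What is the price of the 5090?", "session-1")
print(req.get_header("Authorization"))  # Bearer test auth
```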
99:48 And so that's everything for our configuration for the web hook. Now the
99:53 next thing that we have to do is we have to determine is open web UI sending in a
99:58 request to get a response for our main agent or is it just looking to generate
100:03 that conversation title or the tags for our conversation? Because like we were
100:06 looking at earlier, I'm going to close out of this for now and go back to a
100:10 conversation, our last conversation here. We get our main response, but then
100:15 also there is a request to an LLM to create a very simple title for our
100:19 conversation and the tags that we can see in the top right. And so our n8n
100:24 workflow actually gets invoked three separate times for just the first
100:30 message in a new conversation. And so we need to determine, are we getting a main
100:34 response? Like should we go to our main agent or should we just go to a simple
100:39 LLM that I have set up here to help generate the tags or title? And so the
100:44 way that we can determine that is whenever Open Web UI is requesting
100:47 something like a title for a conversation, it always prefixes the
100:53 prompt with three pound symbols, a space, and then the word task. And so we
100:58 can key off of this. If the prompt starts with this, and that prompt just
101:02 is coming in from our web hook here. If it does start with it, then we're just
101:06 going to go to this simple LLM, we're just going to be using Qwen 2.5 14B
101:11 Instruct. We have no tools, no memory or anything like our main agent because
101:14 we're just very simply going to generate that title or the tags. And I can even
101:19 show you in the execution history what that looks like. So in this case, we
101:22 have our web hook that comes in. The chat input starts with the triple pound
101:28 and task. And so sure enough, we are deeming it to be a metadata request, as
101:32 I'm calling it. And so then it goes down to this LLM that is just
101:36 generating some text here. We just have this JSON response with the tags for the
101:41 conversation, technology, hardware, and gaming. So we're asking about the price
101:45 of the 5090 GPU. And then we do the exact same thing to also generate the
101:51 title GPU specs. And so exactly what we see here is the title of this last
101:55 conversation. So I hope that makes sense. And then if it doesn't start with
101:59 the triple pound and task, so it's actually a real request, then we go to our
102:03 main agent. We don't want our main agent to have to handle those super simple
102:06 tasks. You can also just use a really tiny LLM. Like this would be the perfect
102:11 case to actually use a super tiny LLM, even something like DeepSeek R1 1.5B,
102:15 because it's just such a simple task. Otherwise though we are going to
102:20 go to our main agent. And so I'm not going to dive into like all these nodes
102:24 in a ton of detail, but basically we are expecting the chat input to contain
102:30 the prompt for our agent. And the way that we know to expect chat input
102:35 specifically is because going back to the settings for the function here with
102:38 the valves, we are saying right here chat input. So you want to make sure
102:43 that the value that you put in here for input matches exactly with what you are
102:48 expecting from our web hook. And so chat input is the one that I have by default.
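The routing decision described here fits in a few lines of Python. The exact `### Task` prefix casing is an assumption based on the description of three pound symbols, a space, and the word task:

```python
TASK_PREFIX = "### Task"  # three pound symbols, a space, then "Task" (assumed casing)

def is_metadata_request(chat_input: str) -> bool:
    """Open WebUI prefixes its title/tag generation prompts with this marker,
    so we can route those to a small, tool-free LLM instead of the main agent."""
    return chat_input.startswith(TASK_PREFIX)

print(is_metadata_request("### Task: Generate a concise title"))  # True
print(is_metadata_request("What is the price of the 5090?"))      # False
```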
102:51 So you can just copy me if you want. Then we go into our agent where we're
102:55 hooked into Ollama and we've got our local Supabase. I already showed you
102:58 how to connect up all this and that looks exactly the same. The only thing
103:01 that is different now is we have a single tool to search the web with
103:07 SearXNG. So it's a web search tool. I have a description here just telling it what
103:11 it is going to get back from using this tool. And then for the workflow ID, this
103:16 is if I go to add a node here and I go for workflow tools, call n8n
103:23 workflow tool. So this is basically taking an N8N workflow and using it as a
103:28 tool for our agent. So this is the node that we have right here. But then I'm
103:32 referencing the ID of this N8N workflow. So this ID because I'm going to just
103:37 call the subworkflow that I have defined below. And again, I don't want to dive
103:40 into all the details of n8n right now and how this all works, but the agent is
103:44 going to decide the query. What should I search the web with? It decides that and
103:49 then it invokes this sub workflow here where we have our call to SearXNG. So the
103:54 name of the container service in our Docker Compose stack is just searxng
103:59 and it runs on port 8080. And then if you look at the SearXNG documentation, you
104:03 can look at how to invoke their API and things like this. So I'm just doing a
104:07 simple search here and then there are a few different nodes because what I want
104:10 to do is I want to split out the search and actually I can show you this by
104:14 going to an execution history where we're actually using this tool. So take
104:18 a look at this. So in this case the LLM decided to invoke this tool and the
104:23 query that it decided is current price of the 5090 GPU. So this is going along
104:28 with the conversation that we had last in open web UI. We get some results from
104:33 SearXNG, which is just going to be a bunch of different websites. And so, we don't
104:37 have the answer quite yet. We just have a bunch of resources that can help us
104:41 get there. And so, I'm going to split out. So, we have a bunch of different
104:45 websites. We're going to now limit to just one. I just want to pull one
104:48 website right now just to keep it really, really simple because now we're
104:52 going to actually visit that website. I'm going to make an HTTP request to
104:57 this website, which yeah, I mean, if it's literally an Nvidia official site
105:01 for the 5090, like this definitely has the information that we need. We're
105:04 going to make a request to it, and then we're also going to use this HTML node
105:08 to make sure that we are only selecting the body of the site. So, we take out
105:12 all the footers and headers and all that junk. So, we just have the key
105:15 information. And then that is what we aggregate and then return back to our AI
105:19 agent. So it now has the content, the core content of this website to get us
105:24 that answer. That is how we invoke our web search tool. And then at the very
105:29 end, we're just going to set this output field. And that's going to be the
105:33 response that we got back either from like generating a title or calling our
105:37 main agent. And this is really important. the output field specifically
105:41 whatever we call it here we have to make sure that that is corresponding to this
105:46 value as the last thing we have to set for the settings for our open web UI
105:50 function. So output here has to match with output here because that is what
105:55 we're going to return in this respond to web hook. Whatever open web UI gets back
105:59 it's getting back from what we return right here. So that is everything for
106:03 our agent. I could probably dive in quite a bit more into explaining how
106:06 this all works and building out a lot more complex agents, which I definitely
106:10 do with local AI in the Dynamis AI agent mastery course. So check that out if you
106:13 are interested. I just wanted to give you a simple example here showing how we
106:17 can talk to our different services like Ollama, Supabase, and SearXNG. And then
106:22 also open web UI as well. So once you have all these settings set, make sure
106:26 of course that you click on save. It's very, very important. These two things
106:29 at the bottom don't really matter, by the way. But yeah, click on save once
106:32 you have all of the settings there. And then you can go ahead and have a
106:36 conversation with your agent just like I did when I was demoing things before we
106:40 dove into the workflow. And by the way, this n8n agent that works with Open Web
106:44 UI, I have as a template for you. You can go ahead and download that in this
106:48 GitHub repository where I'm storing all the agents for this masterclass. So we
106:52 have the JSON for it right here. You can go ahead and download this file. Go into
106:57 your N8N instance. Click on the three dots in the top right once you've
107:01 created a new workflow. Import from file and then you can bring in that JSON
107:03 workflow. You'll just have to set up all your own credentials for things like
107:07 Ollama and Supabase and SearXNG. But then you'll be good to go and you can just go
107:11 through the same process that I did setting up the function in open web UI
107:15 and within like 15 minutes you'll have everything up and running to talk
107:20 to N8N in open web UI. Next up, I want to create the Python version of our
107:25 local AI agent. And so this is going to be a one-to-one translation. Exactly what
107:30 we built here in N8N, we are now going to do in Python. So I can show you how to
107:34 work with both no-code and code with our local AI package. And so this GitHub
107:39 repo that has the n8n workflow we were just looking at and that OpenAI
107:43 compatible demo we saw earlier, this has pretty much everything for the agent. So
107:46 most of this repository is for this agent that we're about to dive into now
107:50 with Python. And in this readme here, I have very detailed instructions for
107:54 setting up everything. And a lot of what we do with the Python agent, especially
107:57 when we are configuring our environment variables, it's going to look very
108:01 similar to a lot of those values that we set in N8N. Like we have our base URL
108:05 here, which you'd want to set to something like http://ollama:11434.
108:09 We just need to add this /v1 on the end, which I guess is a little bit different,
108:14 but yeah, I've got instructions here for setting up all of our environment
108:18 variables, our API key, which you can actually use OpenAI or Open Router as
108:22 well with this agent, taking advantage of the OpenAI API compatibility. This is
108:27 a live example of this because you can change the base URL, API key, and the
108:31 LLM choice to something from Open Router or OpenAI, and then you're good to go
108:35 immediately. It's really, really easy. We will be using Olama in this case, of
108:39 course, though. And then you want to set your Supabase URL and service key.
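A hedged sketch of that provider-swap idea — change the base URL, key, and model and you point at Ollama, OpenRouter, or OpenAI. The environment variable names here are illustrative, not necessarily the ones in the repo:

```python
def llm_config(env: dict) -> dict:
    """Assemble OpenAI-compatible client settings for the agent.

    Pointing base_url at Ollama's /v1 endpoint (or at OpenRouter/OpenAI)
    is all it takes to swap providers. Variable names are assumptions.
    """
    return {
        "base_url": env.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        "api_key": env.get("LLM_API_KEY", "ollama"),  # Ollama ignores the key
        "model": env.get("LLM_CHOICE", "qwen2.5:14b-instruct"),
    }

# Inside the Compose stack, the service name replaces localhost:
cfg = llm_config({"LLM_BASE_URL": "http://ollama:11434/v1"})
print(cfg["base_url"])  # http://ollama:11434/v1
```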
108:42 You can get that from your environment variables. Same thing with SearXNG with
108:47 that base URL. We'll set that just like we did in N8N. We have our bearer token,
108:51 which in our case was test auth. It's just whatever comes after the Bearer and the
108:55 space. And then the OpenAI API key you can ignore. That's just for the
108:58 compatible demo that we saw earlier. This is everything that we need for our
109:02 main agent now. And so we're using a Python library called FastAPI to turn
109:09 our AI agent into an API endpoint just like we did in N8N. And so FastAPI is
109:13 kind of what gives us this web hook, both the entry point and the exit for
109:17 our agent, and then everything in between is going to be the logic where we are
109:21 using our agent. And I'm going to be using Pydantic AI. It's my favorite AI
109:25 agent framework with Python right now. Makes it really easy to set up agents,
109:29 so we'll dive into that here. And I don't want to get into the
109:32 nitty-gritty of the Python code here because this isn't a master class on
109:36 specifically building agents. I really just want to show you how we can be
109:40 connecting to our local AI services. This agent is 100% offline. Like I could
109:46 cut the internet to my machine and still use everything here. So we create our
109:50 Supabase client and the instance of our FastAPI endpoint. I have some models
109:55 here that define the requests coming in and the response going out. So we have
109:59 the chat input and the session ID just like we saw in N8N. And then the output
110:04 is going to be this output field. And so that corresponds to exactly what we're
110:07 expecting with those settings that we set up in the function in open web UI.
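The request and response shapes can be sketched like this; the actual repo uses Pydantic models, so stdlib dataclasses are used here only to stay dependency-free, and the exact field names are assumptions based on the valves described above:

```python
from dataclasses import dataclass

# Sketch only: the real code uses Pydantic models, but the field
# shapes are the same idea.
@dataclass
class ChatRequest:
    chatInput: str   # must match the "input field" valve in Open WebUI
    sessionId: str   # used as the conversation-history key

@dataclass
class ChatResponse:
    output: str      # must match the "output field" valve in Open WebUI

resp = ChatResponse(output="The RTX 5090 starts at $1,999.")
print(resp.output)
```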
110:11 So this Python agent is also going to work directly with open web UI. And then
110:16 we have some dependencies for our Pydantic AI agent because it needs to have an
110:21 HTTP client and the SearXNG base URL to make those requests for the web search
110:25 tool. And then we're setting up our model here. It's an OpenAI model, but we
110:29 can override the base URL and API key to communicate with Ollama or Open Router as
110:34 well like we will be doing. And then we create our Pydantic AI agent just getting
110:38 that model based on our environment variables. I've got a very simple system
110:43 prompt and then the dependencies here because we need that HTTP client to talk
110:48 to SearXNG. And then I'm just allowing it to retry twice. So if there's any kind
110:51 of error that comes up, the agent can retry automatically, which is one of the
110:55 really awesome things that we have in Pydantic AI. And then I'm also creating a
111:01 second agent here. This is the agent that is going to be responsible like we
111:05 have in N8N for handling the metadata for open web UI like conversation titles
111:10 and tags for our conversation. And so it's an entirely separate agent because
111:14 we just have another system prompt. In this case, I'm just doing something
111:18 really simple here. We don't have any dependencies for this agent because it's
111:21 not going to be using the web search tool. And then for the model, I'm just
111:25 using the exact same model that we have for our primary agent. But like I shared
111:30 with N8N, you could make it so that this is like a much smaller model, like a one
111:33 or three billion parameter model because the task is just so basic or maybe like
111:38 a 7 billion parameter model. So you can tweak that if you want. Just for
111:41 simplicity's sake, I'm using the same LLM for both of these agents.
111:46 And then we get to our web search tool. So in Pydantic AI, the way that you give a
111:51 tool to your agent is you do @, then the name of your agent, then
111:55 .tool, and then the function that you define below this is now going to be
112:00 given as a tool to the agent. And then this description that we have in the doc
112:04 string here that is given as a part of the prompt to your agent. So it knows
112:09 when and how to use this tool. And so the exact, you know, details of how
112:13 we're using SearXNG here, I won't dive into, but it is the exact same as what
112:17 we did in N8N where we make that request to the search endpoint of SearXNG. We
112:22 go through the page results here. We limit to just the top three results or I
112:26 could even change this to make it even simpler and just the top result. So we
112:30 have the smallest prompt possible to the LLM. And then we get the content of
112:34 that page specifically. And then we return that to our AI agent with some
112:40 JSON here. So now once it invokes this tool, it has a full page back with
112:44 information to help answer the question from the user. It has that web search
112:47 complete now. And then we have some security here to make sure that the
112:51 bearer token matches what comes into our API endpoint. So that's that header
112:55 authentication that we set up in N8N. So this part right here where we're
112:59 verifying the header authentication that corresponds to this verify token
113:03 function. And then we have functions to fetch conversation history and store a
113:08 new message in conversation history. So both of these are just making requests
113:12 to our locally hosted Supabase using that Supabase client that we created
113:16 above. And then we have the definition for our actual API endpoint. And so in
113:23 N8N we were using invoke N8N agent for our path to our agent. So this was our
113:29 production URL. In this FastAPI endpoint, our endpoint is
113:34 /invoke-python-agent. And then we're specifically expecting the chat input
113:39 and session ID. So that is our chat request type right here. And
113:43 then, sorry, I highlighted the wrong thing. We have our response model here
113:46 that has the output field. So we're defining the exact types for the inputs
113:50 and the outputs for this API endpoint. And then we're also using this verify
113:55 token to protect our endpoint at the start. And then the key thing here, if
113:59 the chat input starts with that task, then we're going to call our metadata
114:02 agent. And so it's just going to spit out the title or the tags, whatever that
114:06 might be. Otherwise, we're going to fetch the conversation history, format
114:11 that for Pydantic AI, store the user's message so that we have that
114:15 conversation history stored, create our dependencies so that we can communicate
114:20 with SearXNG, and then we'll just do agent.run. We'll pass in the latest
114:24 message from the user, the past conversation history and the
114:27 dependencies that we created. So it can use those when it invokes the web search
114:31 tool. And then we just get the response back from the agent, and you can
114:34 print that out in the terminal as well, and then we'll just store it in
114:38 Supabase and then return the output field. So I'm going kind of fast here.
114:42 There are definitely a lot more videos on my channel where I break down in more
114:45 detail building out agents with Pydantic AI and turning them into API endpoints
114:49 and things like that. But yeah, just going a little bit faster here.
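The SearXNG request at the heart of that web search tool can be sketched as a simple URL builder. SearXNG's `/search` endpoint can return JSON when `format=json` is enabled in its settings; the service name and port here follow the video, the helper itself is illustrative:

```python
from urllib.parse import urlencode

def searxng_search_url(base_url: str, query: str) -> str:
    """Build the SearXNG search request the web-search tool makes.

    The agent decides the query; we then fetch the result list, visit
    the top page(s), and hand the page body back to the agent.
    """
    params = urlencode({"q": query, "format": "json"})
    return f"{base_url}/search?{params}"

# Inside the Compose stack, SearXNG is reached by service name:
print(searxng_search_url("http://searxng:8080", "current price of the 5090 GPU"))
```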
114:52 And then the last thing is with any kind of exception that we encounter, we're
114:55 just going to return a response to the front end saying that there was an issue
114:59 and then specifying what that is. And then we are using Uvicorn to host our
115:06 API endpoint specifically on port 8055. So that is everything for our Python
115:11 agent exactly the same as what we set up in N8N. And now going to the readme
115:16 here, I'll open up the preview. The way that we can run this agent, we just have
115:20 to open up a terminal just like we did with the OpenAI compatible demo. I've
115:25 got instructions here for setting up the database table, which is using
115:30 the same table as the one in N8N, and N8N creates it automatically. So, if you
115:34 have already been using the N8N agent, you don't actually have to run this SQL
115:38 here. And then you want to obviously set up your environment variables like
115:42 we covered. Open your virtual environment and install the requirements
115:46 there. And then you can go ahead and run the command python main.py.
115:52 And so this will start the API endpoint. So it'll just be hanging here because
115:56 now it's waiting for requests to come in on port 8055. And so what I can do is
116:03 I can go back to open web UI. I can go to the admin panel functions. Go to the
116:08 settings. I can now change this URL. So everything else is the same. I have my
116:11 bearer token, the input field, and the output field the same as N8N. The only
116:16 thing I have to change now is my URL. And so I know this is an N8N pipe and I
116:20 have N8N in the name everywhere, but this does work with just any API
116:23 endpoint that we have created with this format here. And so I'm going to say for
116:27 my URL, it's actually going to be host.docker.internal because I have my API endpoint for
116:33 Python running outside on my host machine. So I need my open web UI
116:38 container to go outside to my host machine. And then specifically the port
116:43 is going to be 8055. And then for the endpoint here, I'm going to
116:46 swap out this web hook path because ours is invoke-python-agent. Take a look at that. All right.
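A small sketch of that routing decision — from inside the Open WebUI container, an agent running on the host needs `host.docker.internal`, while a containerized agent in the same stack is reached by service name. The service name here is a placeholder for illustration:

```python
def agent_base_url(agent_in_compose_stack: bool) -> str:
    """Where Open WebUI (a container) should look for the Python agent.

    If the agent runs on the host (plain `python main.py`), the container
    has to go *out* to the host via host.docker.internal; once the agent
    is containerized in the same stack, its service name works instead.
    """
    if agent_in_compose_stack:
        return "http://python-local-ai-agent:8055"  # assumed service name
    return "http://host.docker.internal:8055"

print(agent_base_url(False))  # http://host.docker.internal:8055
```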
116:52 Boom. So I'm going to go ahead and save this. And then I can go over to my chat.
116:57 And it says N8N agent connector here still. But this is actually talking to
117:00 my Python agent now. So I'll go ahead and start by asking it the exact same
117:05 question that I asked the N8N agent. And I do have this pipe set up to always say
117:08 that it's calling n8n, but this is indeed calling our Python API endpoint.
117:12 And we can see that now. So there we go. We got all the requests coming in, the
117:16 response from the agent, and then also the metadata for the title and the tags
117:20 for the conversation. Take a look at that. So we got our title here. We have
117:24 our tags, and then we have our answer. It's a starting price of $2,000, though
117:28 it's a lot more right now. The starting price is kind of
117:32 misleading, but yeah, this is a good answer. And it did use SearXNG to do
117:36 that web search for us. This is really, really neat. Now, the last thing that we
117:40 want to do for our Python agent before we can work on deploying things to the
117:45 cloud, to a private server in the cloud, is we want to containerize it. Now, the
117:49 reason that we want to do this, and this is the Docker file that I have set up to
117:53 turn our Python agent into a container, just like our local AI services, is if
117:59 we have our agent containerized, then we can have it communicate within the
118:03 Docker network just like we have our different local AI services
118:07 communicating with each other. Because right now running directly with Python
118:12 to communicate with Ollama, for example, we need our URL to be localhost, not
118:18 ollama. Remember, you can only use the specific name of the container service
118:23 when you are within the Docker Compose stack. And so we'd have to actually say
118:28 localhost right now. But if we add the container for the agent into the stack
118:32 as well, then we can communicate directly within the private network.
118:36 Like I can say ollama, and then for SearXNG I could use this URL instead. Right now
118:41 we have to actually use localhost port 8081. And so it's really nice for
118:46 security reasons and just to make your deployment a nice package to have
118:51 the agents that you're running in the same network as your infrastructure. And
118:55 so that's what we're going to do right now. And so within the read me that I
118:59 have for instructions on setting up everything. I have the instructions that
119:02 we follow to run things with Python. I also have the instructions to run it
119:07 with Docker. And so all you want to do is run this single command. It's
119:11 actually very easy because I have this Docker file set up to turn our agent
119:15 into a container and I've got security and everything taken care of. We're
119:19 running it on port 8055 just like we did with Python. And then I have this very
119:24 simple Docker Compose file. It's just a single service that we're going to tack
119:28 on to all of the other services that we already have running for the local AI
119:32 package. And I'm calling this one the Python local AI agent. And so we're
119:36 using all of our environment variables from our ENV just like we did with the
119:40 local AI package. And then what I have at the top here is I am including the
119:45 docker compose file for the local AI package. So that just kind of solidifies
119:49 the connection there. Otherwise, you'll get this kind of weird error that says
119:52 there are orphaned containers when you run this, even though they aren't
119:55 actually. And so this is optional. You'll just get an orphan container
119:58 warning that you can ignore. But if you don't want to have that warning, you can
120:01 include this right here. You just have to make sure that this path corresponds
120:06 to the path to your docker compose in the local AI package. So in my case, I
120:10 just had to go up two directories and then go into the local AI package
120:13 folder. So yeah, this is optional, but I wanted to include this here just to
120:18 make things in tiptop shape for you. So yeah, this is the docker compose. And
120:21 then what we can do now is I'll go back over to my terminal and I will paste in
120:25 this command. And what this will do is it will start or restart my Python local
120:31 AI agent container. And make sure that you specify this here because if you
120:35 don't then it's going to try to rebuild the entire local AI package because we
120:38 have this include. So very important you want to just rebuild or build for the
120:43 first time this agent container. And so I'll go ahead and run this and it's
120:47 going to give me those Flowise warnings, since I don't have my username
120:49 and password set, but remember we can ignore those. But anyway, it's going to
120:52 build the Python local AI agent container here. And there's a couple of
120:56 steps that it has to do. It has to update some internal packages and then
121:00 also install all of the pip packages we have for our Python requirements for
121:04 things like FastAPI and Pydantic AI. So it'll take a little bit to complete,
121:07 usually just a minute or two. And so I'll go ahead and pause and come back
121:11 once this is done. And there we go. Your output should look something like this.
121:14 It goes through all the build steps and then it says that the container is
121:18 started at the bottom. And this is now in our local AI Docker Compose stack. So
121:23 going back over to Docker Desktop. It'll take a little bit to find it here
121:26 because there are so many services that we have here. But if we scroll down,
121:29 okay, there we go. At the bottom here, we have our Python local AI agent
121:34 waiting on port 8055 just like when we ran it directly with Python, but now it
121:38 is within a container that is directly within our stack. And so now, like I was
121:41 saying, this is super important, so I'm hitting on it again. Now when we set up
121:45 our environment variables for our container, we are going to be
121:49 referencing the service names of our different local AI services like searxng or
121:55 ollama instead of localhost. And so this whole localhost versus
121:58 host.docker.internal versus using the service name thing, that's what I see people get tripped up on
122:03 the most when they're configuring different things in a Docker environment
122:06 like the local AI package. That's why I'm spending a good amount of time
122:09 really hammering that in because I want you to get it right. And of course, if
122:12 you have any issues that come up with this, just let me know. I'd love to help
122:16 walk you through what exactly your configuration should look like. And so,
122:20 we have our agent now up and running in a container. And I'm not going to go and
122:23 demo this again right now because the next thing that we're going to move into
122:26 doing and then I'll give a final demo here is deploying everything to a
122:30 private server that we have in the cloud. All right. So, we have really
122:34 gotten through all of the hard stuff already. So, if you have made it this
122:38 far, congratulations. You really have what it takes to start building AI
122:43 agents with local AI now and the sky is the limit for what you can accomplish.
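The localhost-versus-service-name distinction from a moment ago can be sketched as three variants of the same setting (the variable name and ports here are illustrative defaults, not necessarily the exact ones from the repo):

```bash
# Agent running directly on your machine (no container):
OLLAMA_BASE_URL=http://localhost:11434

# Agent in a container, Ollama running on the host machine:
OLLAMA_BASE_URL=http://host.docker.internal:11434

# Agent container inside the same Docker Compose stack as Ollama:
OLLAMA_BASE_URL=http://ollama:11434
```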
122:47 And so the last thing that I want to really focus on here in this master
122:50 class is taking everything that we've been building on our own computer with
122:54 our infrastructure and our agents and deploying it to a private machine in the
122:59 cloud because then we can have our entire infrastructure and agents running
123:02 24/7. We don't have to rely on having our own computer up all the time. It's
123:05 really nice to have it there because then we can also share it with other
123:10 people as well. So local AI is still considered local as long as it's running
123:14 on a machine in the cloud that you control. And so this is not just going
123:18 to OpenAI or Supabase and paying for their APIs. This is still us running
123:22 everything ourselves on a private server. That's what I'm going to show
123:25 you how to do right now. And this process that I cover with you is going
123:29 to work no matter the cloud provider that you end up choosing. And there are
123:32 some caveats to that that I'll explain in a little bit. But yeah, you can pick
123:36 from a lot of different options. So the cloud platform that we will be deploying
123:41 to today is Digital Ocean. I use Digital Ocean a lot. It's where I deploy most of
123:46 my AI agents. So I highly recommend it. And the best part about Digital Ocean is
123:51 they have both GPU machines if you need to have a lot of power for your local
123:56 LLMs and they have very affordable CPU instances. If you want to deploy
124:00 everything for the local AI package except Ollama, you can definitely go a
124:03 more hybrid route if you don't want to pay a lot because these GPU instances in
124:07 the cloud can be pretty expensive like one, two, even $5 per hour. So, what you
124:12 can do with a hybrid setup is deploy everything in the local AI package. So,
124:16 at least you have all that locally and you're not paying for those
124:19 subscriptions. But then you could still use something like OpenAI, Open Router
124:23 or Anthropic for your LLMs. So, Digital Ocean gives us the ability to do both,
124:27 and we'll dive into that when we set things up. Another really good option
124:33 for GPU instances is TensorDock. TensorDock is not as nice looking to me as
124:37 Digital Ocean. I generally feel like I have a better experience with Digital
124:40 Ocean, but I have deployed the local AI package to TensorDock before on a 4090
124:46 GPU that they offer for 37 cents an hour. It's very affordable for GPU
124:49 instances. And so, this is a good platform as well. And then also if
124:55 you're okay with not running Ollama on a GPU instance, like you just want a very
124:59 affordable way to host everything in the local AI package except the LLMs, then
125:03 you can use Hostinger. Hostinger is another really, really good option. Super,
125:08 super affordable, like $7 a month for a KVM 2, which I'd recommend getting if you
125:14 want to deploy everything except Ollama, because the requirement for the local AI
125:17 package, aside from running the more resource-intense local LLMs, is you have
125:23 to have 8 GB of RAM. So don't get a cloud machine that has 4 or 2 GB. You
125:27 want to have 8 GB of RAM, then you'll be good to go. So you can literally do it
125:31 for $7 a month through Hostinger and it's going to be something like $28 a
125:35 month through Digital Ocean unless you want a GPU instance. So I just want to
125:39 spend a couple minutes talking about different platform options. The one
125:43 thing I will say is that the local AI package runs as a bunch of Docker
125:47 containers, right? And so what you have to avoid is using a platform like
125:54 RunPod. So RunPod is a platform for running local AI. The problem is when
125:59 you pay for a GPU instance, you don't actually get the underlying machine.
126:05 You're just sshing into a container. So, you're accessing a container. And I'll
126:09 just save you the pain right now. It is so hard and basically impossible to run
126:15 Docker containers within Docker containers. So, you really can't run the
126:20 local AI package on RunPod. There are other platforms as well, like Lambda Labs,
126:25 another one that I've used before, not for the local AI package but for other
126:28 things. This also runs containers, like you're accessing a container, so you
126:34 can't do the local AI package there. Vast.ai is another option where you're
126:39 renting a GPU, but again you're accessing a container. So, you again
126:43 can't run the local AI package. And so, based on the platform that you choose,
126:47 you have to make sure that you are accessing the underlying machine when
126:51 you rent a GPU instance like Digital Ocean is the one that I will be using.
126:56 You could use a GPU instance through the Google cloud or Azure or AWS if you want
127:00 to go more enterprise. Those all give you access to the underlying machine.
127:04 It's your own private server just like we have in Digital Ocean. So you can use
127:07 that to deploy the local AI package and the agent that we built with Python. So
127:11 that's what we're going to do right now. So once you are signed into Digital
127:14 Ocean and you have your profile and billing set up and you have a project
127:18 created or you can just use the default one, now we can go ahead and create our
127:22 private server in the cloud to host the local AI package and our agent. And so
127:25 you can click on create in the top right. And there are two options here.
127:29 If you want a CPU instance, so that hybrid approach where you're hosting
127:34 everything except for the LLMs, you can select a droplet. Otherwise, what we're
127:37 going to be doing right now so I can demo the full thing is we will
127:42 create a GPU droplet. Now, these are going to be more expensive. Like I said,
127:48 like running an H100 GPU is $3.40 an hour. It's pretty expensive. But like I
127:52 said at the start of this master class, I know so many businesses that are
127:56 willing to put tens of thousands of dollars per year into running their own
127:59 infrastructure and LLMs. And that biggest cost that contributes to that
128:03 being tens of thousands of dollars is having a GPU droplet that is running 24/7
128:09 in the cloud. So the hybrid approach I definitely recommend if you don't want
128:12 to pay more, you could go as low as $7 a month with Hostinger. So, there's a very
128:16 wide range of options for you depending on what you want to pay hosting the
128:21 package, LLMs, and your agents. And the other thing I will say is that if you
128:25 want to, you could just create this instance for a day and poke around with
128:29 things and then tear it down. So, you only have to pay, you know, like 20
128:31 bucks or something like that. So, there's a lot of different options for
128:35 flexibility here. And so, I'm going to pick the Toronto data center because
128:38 there's more options for GPUs available here and it's relatively close to me.
128:42 And then for the image, I'll select AI/ML ready, and it's recommended because
128:47 you get Linux bundled with all the required GPU drivers, and it does run the
128:52 Ubuntu distribution of Linux. And so this process that I'm going to walk you
128:55 through for deploying local AI package to the cloud is going to work for any
129:00 Ubuntu instance that you have running on AWS or Hostinger or TensorDock. It's just
129:05 a very standard distribution of Linux. And then for the GPU, there are a couple
129:09 of different options that we have here with Digital Ocean. H100 is an absolute
129:15 beast: 80 GB of VRAM, so it could easily run Q4-quantized large language models over 100
129:21 billion parameters, and it even comes with 240 GB of RAM. So, I'm not going to run this one. I'm
129:24 just kind of pointing out that this is an absolute beast. I think the one that
129:27 I'm going to choose here is going to be the RTX 6000 Ada. So, it's 48 GB of
129:34 VRAM. So, this is enough to run 70 billion parameters or smaller of LLMs at
129:40 a Q4 quantization and it comes with 64 GB of RAM and it's going to be about
129:45 $1.90 per hour. So, I'm going to select this and then I have an SSH key that is
129:50 created already. If you don't have an SSH key, then you can click on this
129:54 button to add one. And then you can follow the instructions on the right
129:57 hand side here. No matter your OS, they got instructions to help you out. You
130:00 just have to paste in your public key and then give it a name. So, I've got
130:03 mine added already. And then the only other thing that I really have to select
130:07 here is a unique name. So, I'll just say local AI package. And I'll just say GPU
130:11 because I already have the regular version just deployed on a CPU instance.
130:16 And then for my project, I will select Dynamous. I can add it along with my
130:19 other instance that I've got up and running. And there we go. So now I can
130:22 go ahead and just click on create GPU droplet. It is that easy to get our
130:27 instance ready for us to access it and start installing everything and getting
130:30 everything up and running just like we did on our computer. And so I'll go
130:34 ahead and pause and come back once our machine is created in just a few
130:37 minutes. And boom, there we go. Just after a minute, we have our GPU droplet
130:42 up and running. And so the one thing I will say is I had to request access to
130:46 create a GPU instance on the Digital Ocean platform. However, they approved
130:50 it in less than 24 hours. So, it's very easy to get that if you do want to
130:53 create a GPU instance. Otherwise, you can just create one of their normal
130:58 droplets, one of their CPU instances. Now, before we connect to this machine,
131:01 the one thing that you want to take note of is the public IPv4 address. We'll use
131:06 this to set up subdomains in a little bit. And so, this is how we get to it
131:12 for a GPU droplet. For a CPU droplet, it looks a little bit different. You'll
131:15 usually see the IPv4 somewhere at the top right here. And so, take note of
131:18 that. Save it for later. We'll be using that in a little bit. And then to
131:22 connect to our droplet, we can either do it through SSH with our IPv4 and the SSH
131:27 key that we set up when we configured this instance or we can access the web
131:31 console. For a CPU instance, usually you go to like an access tab and then you
131:35 can launch the web console. For the GPU instance, I can just click this to
131:38 launch it right here. So we have a separate window that comes up and boom,
131:42 we now have access to our instance. It is that easy to get connected. And now
131:46 we can go through the same process that we did on our computer to install the
131:51 local AI package. Now there are some different steps that we have to take and
131:55 that's why I'm including this at the end of the master class especially. Um so if
131:59 you scroll down in the read me here there are some specific instructions for
132:03 deploying to the cloud. And so you have to make sure that you have a Linux
132:07 machine preferably on the Ubuntu distribution which is that is what we
132:11 are using. And then there are a couple of extra steps. And so the first thing
132:15 that we have to set up is our firewall. We have to open up a couple of ports so
132:20 that we can access our machine from the internet. Set up our subdomains for
132:24 things like N8N and Open Web UI. And so you want to take this command UFW
132:29 enable. I'll just go ahead and paste it in, and you can
132:33 just type Y to continue here. It warns that it may disrupt SSH connections, but we don't
132:36 really care. So ufw enable, so we're enabling the firewall. And then you want
132:41 to copy this command to allow both ports 80 and 443. And so 80 is HTTP and then
132:48 443 is HTTPS. And then you can just do the last command here, UFW reload. So
132:54 now we have those ports available for us to communicate with all of our services.
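Collected in one place, the firewall commands just described look like this (run as root on the Ubuntu droplet; check the README for the exact wording it uses):

```shell
ufw enable       # turn the firewall on (type y to confirm)
ufw allow 80     # HTTP, used by Caddy for certificate challenges
ufw allow 443    # HTTPS, where all service traffic will arrive
ufw reload       # apply the new rules
```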
132:59 And so this is the entry point to Caddy. Caddy is the service in the local AI
133:03 package that is going to allow us to set up subdomains for all of our services.
133:08 And so any kind of communication to our droplet is going to go through Caddy,
133:13 and then Caddy will distribute to our different services based on the port or
133:17 the subdomain that we are using. So this is called a reverse proxy. You've
133:21 probably heard of Nginx or Traefik before. Caddy is something very
133:25 much like that. And it also makes it so that we can get HTTPS so we can have
133:29 secure endpoints set up automatically and it manages that encryption for us.
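Conceptually, the subdomain routing described here boils down to Caddyfile entries like the following (a simplified, hypothetical sketch with placeholder domains; the package generates the real config from your environment variables):

```
n8nyt.example.com {
    reverse_proxy n8n:5678
}

openwebuiyt.example.com {
    reverse_proxy open-webui:8080
}
```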
133:33 It's a very beautiful platform. So, let me scroll back down to the specific
133:36 steps for deploying to the cloud. There is a quick warning here about how Docker
133:40 manages ports and things, but this is as secure as we possibly can make it. So,
133:43 trust me, we've put a lot of effort in, and I actually had someone from the Dynamous
133:48 community, um, Benny right here. He actually helped me a lot with security
133:53 for the local AI package. So, thank you, Benny, for helping out with that. We're
133:56 really making sure that, because with local AI the whole point is that you want to
133:59 be private and secure, this package handles
134:03 all the best practices for that. So very much top of mind for us. Um and then we
134:07 can go ahead and go through the usual steps for setting up the local AI
134:11 package. The only other thing we have to do that's unique for cloud is we have to
134:16 set up A records with our DNS provider so we can have our subdomains set up for our
134:20 different services. So we'll get into that in a little bit. But first I just
134:24 want to get the local AI package up and running. And so I'm going to go ahead
134:28 and paste this command here to clone the repository. Git comes automatically
134:33 installed with our GPU droplets. And then I can change my directory into the
134:38 local AI package. And let me zoom in on this a little bit here so it's very easy
134:41 for you to see, because now what I want to do is copy
134:45 the .env.example file to a new file called .env. So then if I do an ls command we can see all the
134:51 files that are available in our directory. I guess it doesn't show dotfiles, so
134:56 I do ls -a there. Now we can see the .env and .env.example. And so now I can do nano .env.
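The commands in this step, in order (the repository URL shown is the local AI package's public repo as an assumption — verify it against the README you're following):

```shell
git clone https://github.com/coleam00/local-ai-packaged.git
cd local-ai-packaged
cp .env.example .env   # dotfiles are hidden by plain ls; use ls -a
nano .env              # edit environment variables in the terminal
```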
135:05 This is going to give us a basically a text editor directly in the terminal
135:08 here so that we can set all of our environment variables just like we did
135:12 with the local AI package on our computer. So this time I'm not going to
135:15 go into the nitty-gritty details of setting up all these environment
135:19 variables because it is the exact same process. In fact, you can literally
135:23 reuse all of the secrets that you already set up when you hosted it on
135:27 your own computer as long as those are actually secure. Like I know a lot of
135:29 times you might just do some kind of placeholder stuff when it's just running
135:32 on your computer and then you want it more secure in the cloud. So make sure
135:36 you have real values for everything, but yeah, you can reuse a lot of the same
135:39 things. Um though for best security practice, you probably do want to make
135:42 everything different. But yeah, the one thing that I want to focus on with you
135:46 here that does change is our configuration for Caddy. Now that we are
135:50 deploying to the cloud, we want to have subdomains for our different services.
135:54 And so like N8N for example, we want to set the hostname for that. You want to
135:59 do the same thing for N8N, Open WebUI, and Supabase, and then your Let's Encrypt
136:04 email. Obviously you want to uncomment that as well because this is the email
136:08 that you want to set for your SSL encryption. And so I'm just going to do
136:13 cole@dynamous.ai. And you can just set this to whatever email that you want to use. And so
136:18 basically what you want to do here is just uncomment the line for each of the
136:23 services that you want to have subdomains created for. So, if you're
136:26 also using Flowwise, which I'm just not in this master class here, but if you
136:30 are, then you want to uncomment this line as well. If you're using Langfuse,
136:34 then you uncomment this line. I'm going to leave them commented right now just
136:38 for simplicity's sake. The two that I would generally recommend not
136:43 uncommenting ever are Ollama and SearXNG. We don't really want to expose them through
136:45 a subdomain because we're just going to use them as internal tooling for our
136:49 agents and applications that we have running on this server. So, we want to
136:53 keep those nice and private. But for everything that we do want to expose
136:56 that is protected with a username and password, like N8N, Open WebUI, and
137:01 Supabase, we can uncomment those. And so we got that set up now. But we have to
137:05 obviously provide real values for them as well. And so for example I'm just
137:09 going to say n8nyt, with the YT for YouTube, so I'll do n8nyt.dynamous.ai. So you want to
137:17 define the exact URL that you want to have for this domain. Obviously, it has
137:21 to be a domain that you control because we'll go and we'll set up the records in
137:25 the DNS in a little bit. And so, I'll do the exact same thing for Open WebUI.
137:28 So, it's openwebui and I'm just adding YT because I already have the local AI
137:33 package hosted on my domain. And so, I can't do just openwebui because it's
137:36 already taken. So, openwebuiyt.dynamous.ai. And then finally, for Supabase, it'll
137:44 be supabaseyt.dynamous.ai. Boom. All right. There we go. So, go
137:47 ahead and take care of this. set up the rest of your environment variables and
137:51 then we can go ahead and move on. And the way that you exit out of this and
137:55 save your changes is you do Ctrl+X, type Y, and then hit
138:03 Enter. So again, that is Ctrl+X, then type Y, then press Enter. That is how you
138:08 exit out. And so now if I do a cat of the .env, this is how you can print it out
138:11 in the terminal. So we can verify that the changes that we made, like everything
138:16 for Caddy, are indeed made. So do that. Also change all the other environment
138:18 variables as well. I'm going to do that off camera and come back once that is
138:22 taken care of. All right. So my environment variables are all set. Now
138:26 the very last thing we have to do before we start all of our services is we need
138:32 to set up DNS. And so remember, copy your IPv4 address and then head on over
138:38 to your DNS provider. And so this process is going to look very similar no
138:42 matter the DNS provider that you have. Like I'm using Namecheap here. A lot of
138:47 people use Hostinger or Bluehost. You are able to with all these providers go
138:50 to something that is usually called like advanced DNS or manage DNS and then you
138:55 can set up custom A records here which we're going to do to set up a connection
139:01 for all of our subdomains to the IPv4 of our digital ocean droplet or whatever
139:05 cloud provider you are using. And so I'm going to go ahead and click on add new
139:09 record. It's going to be an A record for the host. It's going to be the subdomain
139:14 that I want. So, N8NYT for example. And then for the IP address, I just paste in
139:19 the IPv4 of my Digital Ocean GPU droplet. And then I'll go ahead and
139:23 click on the check here to save changes. And then I'll just go ahead and do the
139:28 same thing for openwebuiyt. I can't forget the YT. There we go. And
139:33 then for you, it might be more than just three, but for me, the only other one
139:36 that I have here right now is Supabase because I'm just keeping it very, very
139:41 simple. So, supabaseyt, and then paste in the IP again. Okay, there we go.
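The end result in the DNS panel is three A records, all pointing at the droplet (203.0.113.10 is a placeholder; use your own IPv4 address):

```
Type  Host         Value          TTL
A     n8nyt        203.0.113.10   Automatic
A     openwebuiyt  203.0.113.10   Automatic
A     supabaseyt   203.0.113.10   Automatic
```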
139:45 Boom. All right, so we have all of our records set up. And it's very important
139:49 to do this before you run things for the first time because otherwise Caddy gets
139:52 very confused. It tries to use these subdomains that you don't actually have
139:57 set up yet. And so take care of that. Then we can go back into our instance
140:01 here and run the last command. And so going back to the readme here, if I
140:06 scroll all the way down to deploying to the cloud, there is a specific parameter
140:11 that you want to add. This is very important for deploying to the cloud
140:15 because when you select the environment of public, it's going to close off a lot
140:20 more ports to make this very very secure. So any of the services that you
140:25 access from outside of the droplet, it has to go through caddy. So we use the
140:30 reverse proxy as the only entry point into any of our local AI services. This
140:34 is how we can make things as airtight as possible. We have security in mind like
140:39 I said. And so make sure that you run this command with the environment
140:43 specified. We didn't do this locally because when we are running things
140:45 locally, we don't care about security as much because it's not like our machine
140:49 is accessible to the internet like a cloud server is. And so it defaults to
140:54 the environment of private which just doesn't do as much security stuff. And
140:57 so go ahead and run this. And then of course make sure that you're using the
141:01 correct profile. So if you're using just a CPU instance with a regular droplet or
141:05 Hostinger or whatever, you'd want to change this to cpu instead of
141:10 gpu-nvidia. But in our case, because we are paying the $2 an hour for a killer GPU
141:16 droplet, I can go ahead and run this command with the profile of gpu-nvidia.
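Assuming the start script's flags match what is shown on screen, the command being run is roughly the following (the script and flag names are taken from the package's README at the time of recording — verify against your copy):

```shell
python3 start_services.py --profile gpu-nvidia --environment public

# Hybrid/CPU-only droplet (everything except the GPU-hosted LLMs):
# python3 start_services.py --profile cpu --environment public
```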
141:20 Now, I left this error in here intentionally because I want to show you
141:24 what it looks like. If you get unknown shorthand flag: -p, that means that you
141:29 don't actually have docker compose installed. And this happens for some
141:32 cloud providers. And there's a very easy fix for this that I want to walk you
141:35 through. So you can even test this. Just docker compose. It'll say that compose
141:40 is not a docker command. And so going back to the readme here, I have a couple
141:44 of commands that you just have to run if this happens to you. This is at the
141:47 bottom of the deploying to the cloud section of the readme. So you can just
141:51 copy these one at a time, bring them into your droplet or your machine,
141:55 wherever you're hosting it, and go ahead and run them. And so I'm just going to
141:58 do this off camera really quickly. I'm just going to copy each of these into my
142:02 droplet. It's very easy. You can just run all of these. They're really fast as
142:06 well. So none of them are going to take very long. This is just going to get
142:10 everything ready for you so that Docker Compose is a valid command. So you can
142:14 then run the start services script. So there we go. All right. I went ahead and
142:18 ran all of those. I'll clear my terminal again and then go back to the main
142:23 command here to start our services. Boom. All right. So now we pulled
142:26 everything from Supabase, set up our SearXNG config. Now we are pulling our
142:31 Supabase containers. So again, same process as running on our computer, where
142:35 it'll pull Supabase, it'll run everything for Supabase, then it'll
142:38 pull and run everything for the rest of our services. So I'll pause and come
142:42 back once this is all complete. And boom, there we go. We have all of our
142:46 services up and running. You should see green check marks across the board like
142:51 this. We are good to go. And we don't have Docker Desktop, so it's not as easy
142:55 to dive into the logs for our containers. But one quick sanity check
142:59 that you can do just in the terminal is run the command docker ps -a. This will
143:03 give you a list of all of our containers that are running here. We can make sure
143:06 that all of them are running, that we don't see any that are constantly
143:10 restarting or ones that are down. So we do have two that are exited, but these
143:14 are the n8n import and the Ollama pull. These are the two that we know should be
143:17 exited. Just make sure that everything is good to go. Then we can head on over
143:22 to our browser. And because we have DNS set up already and we configured Caddy, we
143:25 can now navigate to our different services. Like I can go to
143:30 n8nyt.dynamous.ai. And boom, there we go. It's having us set up our owner account. Or um we can
143:39 just go to openwebuiyt.dynamous.ai. Boom. And there is our Open WebUI. All
143:43 right. So I'll go ahead and get started. Uh, we'll have to create our account.
143:46 I'll do this off camera, but yeah, you just create your first-time accounts for
143:49 everything. And then we'll do the same thing for, let's do, supabaseyt.dynamous.ai.
143:55 And boom, there we go. So, all of our services are up and running. And so now
143:59 we can log into these and create our accounts and we can interact with our
144:03 agents and bring them in. We can work with Ollama in the same way. And so,
144:06 let's go ahead and do that. I'll just go ahead and create these accounts off
144:09 camera. So, I've got my accounts created for N8N and then also open WebUI. And
144:13 you can do the same for all the other accounts you might have to create for
144:16 things like Langfuse as well. And then within Open Web UI, we'll go to the
144:21 admin panel, settings, connections. Make sure that your Ollama API is set
144:25 correctly to reference the service ollama. Usually this will default to
144:30 localhost or host.docker.internal. So you can change that there. You have to set
144:34 the OpenAI API key as well, just to any kind of random value. It's just a little
144:38 bug in open web UI. Then click on save and then you can go back and your models
144:41 will be loaded. Now, one thing I found with Open WebUI: after you change
144:46 the Ollama base URL, you have to do a full refresh of the website. Otherwise,
144:49 you'll get an error when you use the LLM. So, just a really small tangent
144:52 there, a little tidbit there, but yeah, we'll go ahead and select the model
144:55 that's pulled by default. And you can pull other ones into your Ollama
144:57 container as well, like we already covered. You don't have to restart
145:00 things. So, let's run a little test. I'll just say hello, and we'll see if we
145:04 can load the model now. And boom, look at how fast that was, because we have a
145:08 killer GPU instance right now. I could run much larger LLMs if I wanted to. Um,
145:14 so yeah, let's see. What did I just say? All right, we'll do another test here.
145:17 And yeah, look at how fast that is. It's blazing fast because everything is
145:20 running locally on the same infrastructure. There's no network
145:24 delays. And so we have a powerful GPU, no network delays. We get some blazing
145:28 fast responses from these LLMs right now. And so I don't want to go and test
145:34 everything with N8N again. But what I do want to show you how to do right now is
145:39 take the Python agent that we have in this repository and deploy this onto the
145:44 cloud as well. Adding it into the local AI Docker Compose stack just like we did
145:49 on our computer, but now hosting it all in the cloud. So that's the very last
145:52 thing that I want to cover with you for our cloud deployment. And so just like
145:56 with the local AI package, we can follow the instructions here in the readme to
145:59 get everything up and running on our machine. And so the first thing we have
146:03 to do is we need to clone our repository. And so I'm going to copy
146:07 this command, go back over into my terminal here for my instance. And I
146:11 step back one directory level by the way. So I'm now at the same place where
146:14 I have the local AI package so we can run them side by side. So I'll paste
146:19 this command to clone ottomator-agents. And then I can cd into it. And then I
146:22 also want to change my directory within the Python local AI agent specifically.
146:28 And so now doing an ls -a we can see the .env.example. So I'm going to, just like
146:33 we did before, copy this and turn it into a .env. And then I can do nano .env.
146:39 And there we can edit all of our environment variables. And so because
146:42 we're running this in the Docker container, attaching it to the local AI
146:46 stack, the way that I reference Ollama is going to be just calling out the service
146:51 name. So http://ollama, port 11434, /v1. And then the API key is just that placeholder there for Ollama. For the
146:58 LLM choice, if I want to get the exact ID of one that I already had pulled,
147:01 I'll actually show you how to do this really quick. So, I'm going to do
147:06 Ctrl+X, Y, Enter to save and exit. And then the way that you can exec into a
147:12 container is docker exec -it and then the name of our container, which
147:15 is ollama. We already have this running. And then /bin/bash. And so what this is going to do is now,
147:22 instead of being within our machine, we are within our Ollama container. And so
147:28 now I can run the command ollama list and then I can see the LLMs that I have
147:33 available to me. So I have Qwen 2.5 7B. So I'm going to go ahead and copy this
147:37 ID. I don't have it memorized. So this is my way to go and reference it really
147:41 quickly. And then this is also how you can access each of your containers when
147:44 you don't have Docker Desktop. You just do docker exec -it, the name of your
147:50 container, and then /bin/bash. So kind of like how we had that exec tab in Docker
147:53 desktop. And then once I'm done in here, I can just do exit. And now I'm back
147:58 within my host machine, my GPU droplet. So that's another little tidbit, another
148:02 golden nugget I wanted to give you there. But yeah, we'll go back into our
148:05 environment variables here. And I have that ID for Qwen 2.5 7B copied. So I'll
148:12 paste that in. And boom, there we go. And then for our Supabase URL, it's
148:17 going to be http:// and then it is kong. So I guess that I
148:21 should have been more clear on this when I set things up locally. So I'll be sure
148:24 to update the documentation for this, but it's going to be kong, port 8000,
148:29 because Kong is the service that we have in Supabase specifically for the
148:33 dashboard. And then the service key, well, I'm just going to go ahead and get
148:37 that from my local AI package because I have this set up in environment
148:40 variables. And so I just have to go and reference my environment variables here
148:46 to get my service role key. And boom, there we go. Okay. So I'm going to go
148:49 ahead and paste this in. And um now I'm just going to delete this instance
148:51 after. I don't really care that I'm exposing this right now. And then for
148:57 the SearXNG base URL, it's http://searxng, port 8080. And then for my bearer token, I
149:01 just have it set to test off. And then we don't need to set the OpenAI API key
149:05 because that was just for the OpenAI compatible demo earlier. So that is all
149:09 of my configuration for this container. I got to be really clear on this. I'll
149:12 update the docs for this. But uh otherwise we are looking good for our
149:17 environment variables. So Ctrl+X, Y, Enter to save. You can do a cat .env just
149:22 for that sanity check to make sure that everything's saved. We are looking good.
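Put together, the values configured above look roughly like this as a .env sketch. The variable names here are illustrative assumptions — use whatever names the agent's .env template actually defines, and substitute your own model ID and keys:

```shell
# Illustrative .env sketch — variable names are assumptions; values follow the walkthrough
LLM_MODEL_ID=<id copied from ollama list>   # the Qwen 2.5 7B model ID
SUPABASE_URL=http://kong:8000               # Kong, Supabase's gateway, by service name in the Docker network
SUPABASE_SERVICE_KEY=<your service role key>
SEARXNG_BASE_URL=http://searxng:8080        # SearXNG by its service name
BEARER_TOKEN=<your bearer token>
```

Note that the hostnames are Docker Compose service names, not localhost — that only works because the agent runs on the same Docker network as Supabase and SearXNG.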
149:27 All right. And so now within the readme here, so I'll go back to the
149:30 instructions. We did all this already. We changed our directory. We set up our
149:33 environment variables and configured them. Now we need to run this stuff in
149:38 our SQL editor in Supabase. This is how we can get our table set up, because we
149:42 haven't run things with n8n first, so we don't have this table created already.
149:47 And so now I just have to sign into Supabase here. So I've got my username,
149:51 which is supabase. And then I'm just copy and pasting the username and
149:55 password that I have set here in my environment variables. And so
150:00 I'll go to my SQL editor and go back here. I know I'm moving kind of quick
150:03 here, but I got these instructions laid out in the readme. I'm going to paste
150:08 this like that and go ahead and click on run. And then boom, there we go. So now
150:12 if I go into tables and search for n8n, we have n8n_chat_histories, a new,
150:16 currently empty table. All right, looking good. And then going back, after
150:21 we do that, now we can run the agent. So I just have to take this command right
150:24 here and then I'll go back to my droplet. And the one thing that I
150:28 mentioned earlier, but I want to cover again. Hold on, I
150:33 need to change my directory back. So, automator agents and then python local
150:38 AI. If I go into my docker compose, you have to make sure that the include path
150:42 is correct. And so, I'm going to update this by the time you get your hands on
150:44 it here where it's just going to be going two levels back. That's what we
150:47 need to do. So, make sure that we reference the right path to the local AI
150:52 package on our machine, and then Ctrl+X, Y, Enter to save. That's because we have
150:57 to go back from Python local AI agent, then back from the automator agents
151:00 directory, and then within that same directory, we have the local AI package.
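To see why two levels back is right, here is a quick sketch using throwaway directories — the folder names are illustrative stand-ins for the real ones on your droplet:

```shell
# Recreate the layout: the agent sits two levels below the directory
# that also contains the local AI package.
base=$(mktemp -d)
mkdir -p "$base/automator-agents/python-local-ai-agent" "$base/local-ai-packaged"

# From the agent directory, ../../ climbs back to $base, where the package lives,
# so a compose include path like ../../local-ai-packaged/... resolves correctly.
cd "$base/automator-agents/python-local-ai-agent"
ls -d ../../local-ai-packaged
```

If `ls` resolves the path, the include path in the docker compose file will resolve too, since both are evaluated relative to the agent's directory.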
151:04 So, we're good to go. Now, I can go ahead and paste in the command here to
151:08 build our agent and include it in the local AI stack. And so, it's going to
151:11 have to build everything. Takes a minute like we saw already. So, I'll pause and
151:15 come back once this is done. And all right, there we go. About 30 seconds
151:18 later and we are good to go. So, now I can do the docker ps -a again. And this
151:23 time if I look through this list very carefully, take a little bit here, I
151:28 should be able to see my Python agent. There we go. Local AI, Python, local AI
151:32 agent. And it starts with local AI because it is a part of that docker
151:36 compose stack. And by the way, I can do docker exec -it with the name of the
151:46 Python agent container and then /bin/bash. I can run this as well. And then what I can do is if I do
151:51 a printenv command, I can see all the environment variables that are set
151:54 within this container. That's everything that we set up in the .env. So I'm being
151:59 very comprehensive with this master class, showing you how you can tinker
152:02 around with different things like accessing your containers and seeing the
152:05 environment variables, making sure that everything that we specified in the .env is
152:09 actually taking effect here. And sure enough, it is. So we are looking good.
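That check, sketched as commands — the docker lines are shown as comments since they need the running stack, and the container name is an assumption based on the service name used here; printenv itself behaves the same on the host, as the runnable lines show:

```shell
# Inside the stack (needs Docker; container name is an assumption):
#   docker exec -it python-local-ai-agent /bin/bash
#   printenv        # lists every variable loaded from the .env file
#   exit            # back to the host

# printenv works identically on the host — quick demo with a hypothetical variable:
export DEMO_VAR=from-dot-env
printenv DEMO_VAR   # prints: from-dot-env
```

This is a handy sanity check whenever a container seems to ignore your .env changes — if the value doesn't show up in printenv inside the container, the container was not rebuilt or the variable name doesn't match.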
152:13 So I'll go ahead and exit. We're back on our host machine now. We have our
152:18 container up and running, and it's running on port 8055. And so now we can
152:23 go back to Open WebUI at our subdomain, and we can set up our pipe. And so I'm
152:30 going to go to the admin panel functions. We don't have a function
152:33 here. So I have to import it. I'll actually do this
152:36 here: you can literally Google n8n pipe Open WebUI, and it'll
152:40 bring you to the one that I have here. You just have to sign into Open WebUI.
152:44 I'll click on get. And then this time for my URL instead of being something on
152:49 localhost, I'm going to copy my actual subdomain here. So, import to Open WebUI,
152:53 and then boom, we have our pipe. So I'll click on save, confirm, and then within
152:59 the valves here, I can set all of my values. So I'll just click on default
153:02 for all of these so I can get a starting point here. And then, yeah, chat input is
153:07 good, output is good, the bearer token is test off, and then for my URL, it's going
153:14 to be http:// and then the name of my service, python-local-ai-agent,
153:24 port 8055. Let me get that right: 8055, and then /invoke-python-agent. I believe I
153:32 have this memorized. I think we are good there. So, going back, if I clear this
153:37 and run a docker ps -a, it is indeed called python-local-ai-agent. That is
153:42 the name of our service, so Open WebUI is able to connect to the agent directly
153:46 with this name, because we are deploying it in the same Docker network. And so I
153:51 think we are looking good. All right, so I'm going to go ahead and click on save.
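For reference, the endpoint the pipe targets can be assembled from the pieces above: the Compose service name, the port, and the invoke path. The curl line is only a sketch — the headers and request body depend on the agent's API, so treat those as assumptions:

```shell
# Build the in-network URL from its parts
service=python-local-ai-agent     # Docker Compose service name doubles as the hostname
port=8055                         # the port the agent container listens on
path=invoke-python-agent          # the agent's invoke endpoint
url="http://${service}:${port}/${path}"
echo "$url"                       # http://python-local-ai-agent:8055/invoke-python-agent

# From another container on the same Docker network you could then call it, e.g.:
#   curl -H "Authorization: Bearer $BEARER_TOKEN" "$url"   # auth/body shape are assumptions
```

The key point is that this hostname only resolves inside the shared Docker network — from outside the stack you would go through whatever port or reverse proxy you have exposed instead.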
153:56 All right, and then go back and start a new chat. And also, like I said, a lot
154:00 of times it helps just to refresh Open WebUI
154:03 completely. All right, there we go. And then now, instead of... Oh, I have to
154:07 actually enable. Let me go back to the admin panel. Functions, you have to make
154:11 sure this is ticked on so that we have the pipe enabled. Now, going back
154:15 here and I'll refresh as well. We've got our pipe selected. And now I can say
154:20 hello. And there we go. Super super fast. We got a response from our Python
154:26 agent. Take a look at that. And then also going into my database here, you
154:29 can see that I have all these messages in the n8n_chat_histories table. We'll
154:33 take a look at that. All right. And then we can also ask it to do web search. I
154:38 can say, like, what is the latest LLM from Anthropic, for example. So it has to
154:44 do a quick SearXNG search, leveraging that. The latest is Claude Opus 4. All
154:49 right. And man that was so fast. We have no network delays now because everything
154:53 is running on the same network and we have an absolute killer GPU. So this is
154:56 so cool. Also, one thing that I want to mention is sometimes depending on your
155:02 cloud provider, SearXNG will not start successfully. There's one thing you have
155:05 to do. It's just a really small tidbit. If you run into this issue where the
155:09 SearXNG container is constantly restarting, what you want to do is go to
155:14 your local AI package and then run the command chmod 755 searxng. That's the
155:22 SearXNG folder, and the SearXNG folder is responsible for storing the
155:26 configuration that we have for SearXNG by default. Sometimes it doesn't have
155:29 permission to write this file, and it needs to do so. So I'm going to update
155:33 the troubleshooting to include this. But yeah, just a small tidbit. And then you
155:37 can just go ahead and run the command to start everything again. Obviously,
155:41 you have to go back one directory, then you can run this and restart
155:45 everything. It's that easy to restart things and make changes take effect for your
155:49 package and then you'll be good to go. So yeah, we have everything working
155:54 here. So this is pretty much it for the master class. Now we have our local AI
155:58 package up and running with an agent and the network as well. We're communicating
156:02 to it within Open Web UI directly. There is so much that we have gotten through
156:07 now. So congratulations for making it this far. All right, I'm going to be totally
156:12 honest. This master class was very hard to make, but it was so worth it.
156:16 And I hope that you got a lot out of this. We really covered it all. All the
156:21 way from starting with what is local AI and why we should care about it to
156:25 deploying it on our machine, building agents, deploying it to the cloud, and
156:28 configuring everything with DNS. Like man, we basically did everything you
156:32 could possibly need to get the foundation laid out to build anything
156:36 that you want with local AI and local AI agents. And so the very last thing that
156:40 I want to cover here is just a couple of additional resources that I have for you
156:45 now that you know how local AI works and how to get it set up. You want to dive
156:48 into building more complex agents with it now. And so there's a few things that
156:52 I want to call out for you. So starting with my YouTube channel, I have a lot of
156:56 videos on my channel diving more specifically into building more complex
157:00 AI agents with local AI. And the main resource that I want to point you to
157:03 right now if you really want to go deeper into building agents with local
157:08 AI is the Ultimate n8n RAG AI Agent Template, local AI edition. And so this
157:13 is using the local AI package, and I dive really deep into RAG and local AI, which
157:17 was outside of the scope of this master class because that's more about building
157:21 agents versus setting up local AI. But this is a great video to dive into.
157:25 And then also, I've got to call out the Dynamus community again, because man, I
157:29 put so much effort into building local AI into a core part of this course here.
157:34 And so, like I said at the start of this master class, when I build the full
157:38 agent out throughout the AI agent mastery course, local AI is an option
157:42 the entire time and I show exactly how to set up everything for local AI using
157:47 the local AI package. Like, I really have this ingrained into everything in
157:51 Dynamus and in my YouTube channel. This local AI package is the core of
157:56 everything that I do with local AI. So great resources for you. With that, that
158:01 is everything that I have for this master class. So I know this is my third
158:05 time saying it, but congratulations if you made it this far. You now have what
158:08 it takes to really build anything that you want with local AI and you can use
158:12 these additional resources to go much further as well. So I hope to see you in
158:15 the Dynamus community. Let me know in the comments if you have any questions
158:19 on anything that I dove into here, because I know that it is a lot to
158:23 digest, but I'm trying my best to make it as digestible as I possibly can. So,
158:27 with that, if you appreciated this master class and you're looking forward
158:32 to more things local AI or AI agents, I'd really appreciate a like and a
158:35 subscribe. And with that, I will see you