you2idea@video:~$ watch mNcXue7X8H0 [2:38:37]
// transcript — 7060 segments
0:02 Welcome to the video that I have put the most effort into creating by far on my
0:08 channel to date. This is the local AI master class and we're going to dive
0:12 into everything you need to know about local AI. What it is, why it's so
0:16 important for you no matter what you are building with AI. How you can run your
0:20 own large language models and self-host your own infrastructure. How you can
0:26 build 100% private and offline AI agents and deploy them to the cloud in a secure
0:31 way. I have everything that you need to get started here even if you haven't
0:35 done anything with local AI before and I take things pretty far. There is a lot
0:38 of value packed into this for you. So, buckle up, enjoy the ride and follow
0:43 along as well. So, first things first, let's start with an agenda for the
0:47 master class. There are so many things that I cannot wait to share with you.
0:51 And I have very detailed chapters for this YouTube video so you can easily
0:55 navigate between everything that I'm going to show you. I just want to make
0:59 it super easy for you to get exactly what you want out of this master class.
1:03 Nothing more, nothing less. We'll start by diving into what is local AI and I
1:07 have a quick demo to make this very, very hands-on. And then with that, we'll
1:12 get into the why. Why local AI? Why should you care about it? Why do I
1:16 believe so firmly that it is the future of AI? I'll dive into all my reasoning
1:20 there. And then we'll get into hardware requirements because these local LLMs
1:24 are beasts and you have to have specific hardware to be able to run them. So I'll
1:28 dive into all that based on different large language models and some
1:32 alternatives as well. Then we'll get into all of the tricky stuff. There are
1:35 a few things that are usually pretty daunting for people. So I want to break
1:39 down those barriers just to make you super confident running your own local
1:43 LLMs and infrastructure. And then with that, we'll get into how you can use
1:48 local AI anywhere. Because Ollama and other solutions for running your own
1:52 large language models, they are OpenAI API compatible. I'll get into what that
1:56 means when we get to this point. But basically, any agents that you already
2:00 have running with Python or n8n, whatever. If you're using OpenAI or
2:04 Gemini or Anthropic, you can very easily swap them to use local AI instead. So
2:09 you can turn your existing agents into ones that are 100% offline, free, and
2:15 private. And then with that, we will get into the local AI package. This is a set
2:20 of services that I've curated for you to run your entire local AI infrastructure
2:25 like your UI, your database, your large language models, and a lot more. This is
2:28 where we really start to build out our full infrastructure. I'll walk you
2:32 through setting up the local AI package, getting into the nitty-gritty details to
2:35 make sure that you have everything set up at this point. And then once we have
2:40 that set up, we can dive into building a fully local AI agent with N8N. And then
2:44 we'll transition that same agent into Python as well. So that you'll see once
2:49 we have the local AI package set up, how you can build a 100% offline and private
2:54 agent both with no code and with code. And then we'll take those agents and
2:58 deploy them to the cloud, specifically on the Digital Ocean platform. But I'll
3:02 walk you through a process that you can use no matter the cloud provider that
3:05 you are using. And we'll deploy things in a very secure way, both for the
3:09 package for our infrastructure and the AI agent itself. And then last, I want
3:14 to end with some additional resources just to make sure you have everything
3:17 that you need to take this master class forward and really use this to build any
3:21 AI agent that you could possibly want 100% local. And also, if you are
3:26 interested in mastering more than just local AI, but building entire AI agents
3:31 in a local AI environment or even with cloud AI, definitely check out
3:35 dynamis.ai. This is my community for early AI adopters just like yourself.
3:40 And a big part of this community is the AI agent mastery course where I dive
3:45 super deep into my full process for building AI agents. I'm talking planning
3:50 and prototyping and coding and using AI coding assistants and building full
3:55 frontends for our AI agents and securing things and deploying things. There's a
3:59 lot more coming soon for this course as well. I'm very actively working on it.
4:03 And a big part of this course is the complete agent that I build throughout
4:07 it. I build both with cloud AI and local AI. And so this master class will help
4:11 you get very very comfortable with local AI. But when it comes to building
4:15 complex agents and really getting deep into building out agents, then you
4:19 definitely want to check out the AI agent mastery course here in Dynamis.AI.
4:23 So with that, let's get back to the master class, diving into what local AI
4:28 is all about. Let's start by laying the foundation. What is local AI in the
4:32 first place? Well, very simply put, local AI is running your own large
4:36 language models and infrastructure like your database and your UI entirely on
4:42 your own machine 100% offline. So when you think about when you typically want
4:46 to build an AI agent, you need a large language model, maybe like GPT-4.1
4:51 or Claude 4, and then you need something like your database, like Supabase, and
4:55 you need a way to create a user interface. You have all these different
4:58 components for your agent and typically you're using APIs to access things that
5:03 are hosted on your behalf. But with local AI, we can take all of these
5:07 things completely in our own control running them ourselves. So this is
5:11 possible through open-source large language models and software. So
5:15 everything is running on your own hardware instead of you paying for APIs.
5:20 So we run the large language model ourself on our own machine instead of
5:25 paying for the OpenAI API for example. And so for large language models, there
5:29 are thousands of different open source large language models available for us
5:32 to use in a lot of different ways. And some of these you've probably heard of
5:37 before, like DeepSeek R1, Qwen 3, Mistral 3.1, Llama 4. These are just a
5:41 couple of examples of the most popular ones that you've probably heard of
5:44 before. We'll be tinkering around with using some of these in this master
5:48 class. And then we also have open-source software. So all of our infrastructure
5:53 that goes along with our agents and LLMs, things like Ollama for running our
5:58 LLMs, Supabase for our database, n8n for our no-code/low-code workflow automations,
6:04 and Open WebUI to have a nice user interface to talk to our agents and
6:07 LLMs. And we'll dive into using all of these as well. Now, because local AI
6:12 means running large language models on our own computer, it's not as easy as
6:18 just going to claude.ai or chatgpt.com and typing in a prompt. We have to
6:22 actually install something, but it still is very easy to get started. So, let me
6:25 show you right now with a hands-on example. So, here we are within the
6:30 website for Ollama. This is ollama.com. I'll have a link to this in the
6:33 description of the video. This is one of the open-source platforms that allows
6:39 us to very easily download and run local large language models. And so, you just
6:42 have to go to their homepage here and click on this nice big download button.
6:45 You can install it for Windows, Mac or Linux. It really works for any operating
6:49 system. Then once you have it up and running on your machine, you can open up
6:53 any terminal. Like I'm on Windows here, so I'm in a PowerShell session and I can
6:58 run Ollama commands now to do things like view the models that I have available on
7:03 my machine. I can download models and I can run them as well. And the way that I
7:08 know how to pull and run specific models is I can just go to this models tab in
7:12 their navigation and I can browse and filter through all of the open source
7:16 LLMs that are available to me like DeepSeek R1. Almost everyone is familiar
7:20 with DeepSeek. It just totally blew up back in February and March. We have
7:25 Gemma 3, Qwen 3, Llama 4, a few of them that I mentioned earlier when we had the
7:30 presentation up. And so we can click into any one of these like I can go into
7:35 DeepSeek R1 for example and then I have the command right here that I can copy
7:39 to download and run this specific model in my terminal. And there are a lot of
7:44 different model variants of DeepSeek R1. So we'll get into different sizes and
7:47 hardware requirements and what that all means in a little bit, but I'll just
7:50 take one of them and run it as an example. So I'll just do a really small
7:54 one right now. I'll do a 1.5 billion parameter large language model. And
7:58 again, I'll explain what that means in a little bit. I can copy this command.
8:01 It's just ollama run and then the unique ID of this large language model. So I'll
8:06 go back into my terminal. I'll clear it here and then paste in this command. And
8:09 so first it's going to have to pull this large language model. And the total size
8:15 for this is 1.1 GB. And so it'll have to download it. And then because I used the
8:20 run command, it will immediately get me into a chat interface with the model
8:24 once it's downloaded. Also, if you don't want to run it right away, you
8:27 just want to install it, you can do ollama pull instead of ollama run. And
8:32 then again, to view the models that you have available to you installed already,
8:36 you can just do the ollama list command like I did earlier. And so, right now,
8:39 I'll pause and come back once it's installed in about 30 seconds. All
8:43 right, it is now installed. And now I can just send in a message like hello.
8:47 And then boom, we are now talking to a large language model. But instead of it
8:50 being hosted somewhere else and we're just using a website, this is running on
8:54 my own infrastructure, the large language model and all the billions of
8:59 parameters are getting loaded onto my graphics card and running the inference.
9:02 That's what it's called when we're generating a response from the LLM
9:06 directly within this terminal here. And so I can ask another question like um
9:13 what is the best GPU right now? We'll see what it says. So it's thinking
9:16 first. This is actually a thinking model. Deepseek R1 is a reasoning LLM.
9:20 And then it gives us an answer. Its top GPU models today: 3080, RX 6700.
9:27 Obviously, we have a training cutoff for local large language models just like we
9:30 do with ones in the cloud like GPT. And so the information is a little outdated
9:34 here, but yeah, this is a good answer. So we have a large language model that
9:38 we're talking to directly on our machine. And then to close out of this,
9:42 I can just do control D or command D on Mac. And if I do list, we have all the
9:47 other models that you saw earlier, plus now this one that I just installed. So
9:50 these are all available for me to run again just with that ollama run command.
9:54 And it won't have to reinstall if you already have it installed. Run just
9:58 installs it if you don't have it yet already. So that is just a quick demo of
10:03 using Ollama. We'll dive a lot more into Ollama later, like how we can actually
10:07 use it within our Python code and within our n8n workflows. This is just our
10:11 quick way to try it out within the terminal.
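For reference, here are the three Ollama commands from this demo gathered in one place. The deepseek-r1:1.5b tag is just the example used here; copy the exact tag for whatever model and size you pick from its page on ollama.com.

```
ollama run deepseek-r1:1.5b    # download the model if needed, then open a chat with it
ollama pull deepseek-r1:1.5b   # download only, without starting a chat
ollama list                    # show the models already installed on this machine
```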
10:15 Now, to really get into why we should care about local AI now that we know what it is, I want to cover the
10:19 pros and cons of local AI and what I like to call cloud AI. That's just when
10:23 you're paying for things to be hosted for you, like using Claude or Gemini or
10:28 using the cloud version of N8N instead of hosting it yourself. And I also want
10:32 to cover the advantages of each because I don't want to sugarcoat things and
10:35 just hype up this master class by telling you that you should always use
10:39 local AI. That is certainly not the case. There is a time and place for both
10:43 of these categories here, but there are so many use cases where local AI is
10:50 absolutely crucial. You have no idea how many businesses I have talked to that
10:54 are willing to put tens of thousands of dollars into running their own LLMs and
10:58 infrastructure because privacy and security is so crucial for the things
11:02 that they're building with AI. And that actually gets into the first advantage
11:06 here of local AI, which is privacy and security. You can run things 100%
11:11 offline. The data that you're giving to your LLMs as prompts, it now doesn't
11:16 leave your hardware. It stays entirely within your own control. And for a lot
11:20 of businesses, that is 100% crucial, especially when they're in highly
11:24 regulated industries like the health industry, finance, even real estate.
11:28 Like there's so many use cases where you're working with intellectual
11:32 property or just really sensitive information. You don't want to be
11:36 sending your data off to an LLM provider like Google or OpenAI or Anthropic. And
11:41 so as a business owner, you should definitely be paying attention to this
11:45 if you are working with automation use cases where you're dealing with any kind
11:48 of sensitive data. And then also if you're a freelancer, you're starting an
11:52 AI automation agency, anything where you're building for other businesses,
11:55 you are going to have so many opportunities open up to you when you're
11:59 able to work with local AI because you can handle those use cases now where
12:03 they need to work with sensitive data and you can't just go and use the OpenAI
12:08 API. And that is the main advantage of local AI. It is a very big deal. But
12:11 there are a few other things that are worth focusing on as well. Starting with
12:16 model fine-tuning, you can take any open-source large language model and
12:20 add additional training on top with your own data. Basically making it a domain
12:24 expert on your business or the problem that you are solving. It's so so
12:29 powerful. You can make models through fine-tuning more powerful than the best
12:33 of the best in the cloud depending on what you are able to fine-tune with
12:38 depending on the data that you have. And you can do fine-tuning with some cloud
12:42 models like with GPT, but your options are pretty limited and it can be quite
12:46 expensive. And so it definitely is a huge advantage to local AI. And local AI
12:52 in general can be very cost-effective, including the infrastructure as well. So
12:56 your LLMs and your infrastructure. You run it all yourself and you pay for
13:00 nothing besides the electricity bill if it's running on your computer at your
13:04 house or if you have some private server in the cloud. You just have to pay for
13:07 that server and that's it. There's no n8n bill, no Supabase bill, no OpenAI
13:13 bill. You can save a lot of money. It's really, really nice. And on top of that,
13:17 when everything is running on your own infrastructure, the agents that you
13:22 create can run on the same server, the same place as your infrastructure. And
13:26 so it can actually be faster because you don't have network delays calling APIs
13:30 for all your different services for your LLMs and your database and things like
13:34 that. And then with that, we can now get into the advantages of cloud AI.
13:38 Starting with it's a lot easier to set up. There's a reason why I have to have
13:42 this master class for you in the first place. There are some initial hurdles
13:47 that we have to jump over to really have everything fully set up for our local
13:51 LLMs and infrastructure. And you just don't have that with cloud AI because
13:54 you can very simply call into these APIs. You just have to sign up and get
13:58 an API key and that's about it. So, it certainly is easier to get up and
14:02 running and there's less maintenance overall because they are hosting things
14:05 for you. Supabase is hosting the database for you. OpenAI is hosting the
14:09 LLM for you. So, you don't have to manage things on your own hardware. With
14:13 Local AI, you have to apply patches and updates if you have a private server in
14:17 the cloud. You have to manage your own hardware if you're running on your own
14:20 computer, making sure that it's on 24/7, if you want your database on 24/7, that
14:24 kind of thing. It's just less maintenance with cloud AI. And then
14:28 probably the biggest advantage of cloud AI overall is that you have better
14:34 models available to you. Claude 4 Sonnet or Opus, for example, is more powerful
14:40 than any local AI that you could run. So we have this gap here and this gap was a
14:45 lot bigger at one point, even a year ago. The best cloud LLMs absolutely crushed
14:50 the best local LLMs, and that gap is starting to diminish. And so I really
14:55 see a future where that gap is diminished entirely and all the best
14:59 local LLMs are actually on par with the best cloud ones. That's the future I
15:03 see. That's why I think that local AI is such
15:07 a big deal because the advantages of local AI, those are just going to get
15:12 more prevalent over time when businesses realize they really want private and
15:15 secure solutions. And then the advantages of cloud AI, I think those
15:19 are actually going to diminish over time. That's the key: minimal setup,
15:24 less maintenance. Well, those advantages are going to go away as we have
15:28 platforms and better instructions and solutions to make the setup and
15:32 maintenance easier for local AI and we have the gap that's continuing to
15:35 diminish between the power of these LLMs. All these advantages are going to
15:39 actually go away and then it'll just completely make sense to use local AI
15:43 honestly probably for like every single solution in the future. That's really
15:47 what I see us heading towards. And then the last advantage to cloud AI which
15:51 also I think will go away over time is that you have some features out of the
15:55 box like you have memory that's built directly into chat GPT. Gemini has web
15:59 search baked in even when you use it through the API like these kind of
16:02 capabilities that are out of the box that you have to implement yourself with
16:07 local AI maybe as tools for your agent and you can definitely do that but it is
16:10 nice that these things are out of the box for cloud AI. So those are the pros
16:15 and cons between the two. I hope that this makes it very clear for you to pick
16:19 right now for your own use case. Should I implement local AI or cloud AI? A lot
16:24 of it comes down to the security and privacy requirements for your use case.
16:28 Now, the next big thing that we need to talk about for local AI is hardware
16:33 requirements. Cuz here's the thing, large language models are very resource
16:39 inensive. You can't just run any LLM on any computer. And the reason for that is
16:43 large language models are made up of billions or even trillions of numbers
16:47 called parameters. And they're all connected together in a web that looks
16:51 kind of like this. This is a very simplified view with just a few
16:54 parameters here. But each of the parameters are nodes and they're
16:58 connected together. The input layer is where our prompt comes in and our prompt
17:02 is fed through all these hidden layers and then we have the output at the end.
17:05 This is the response we get back from the LLM. But like I said, this is a very
17:10 simplified view. GPT4, for example, like you can see on the right hand side, is
17:15 estimated to have 1.4 trillion parameters. And so, if you want to fit
17:20 an entire large language model into your graphics card, you have to store all of
17:25 these numbers. And even though we can handle gigabytes at a time in our
17:28 graphics cards through what is called VRAM, storing billions or trillions of
17:34 numbers is absolutely insane. And so that's why for large language models, you
17:37 actually have to have a pretty good graphics card if you want to run some of
17:42 the best ones. And so looking at Ollama here, when we see these different sizes,
17:47 going back to their model list, like 1.5 billion parameters or 27 billion
17:51 parameters, there are different sizes for the local LLMs. Obviously, the
17:56 larger a local LLM that you are running, the more performance you are going to
17:59 get, but you are going to be limited to what you are capable of running with
18:04 your graphics card or your hardware. So, with that in mind, I now want to dive
18:07 into the nitty-gritty details with you so you know exactly the kind of models
18:11 that you can run, the kind of speeds you can expect depending on your hardware.
18:15 And if you want to invest in new hardware to run local AI, I've got some
18:19 recommendations as well. So there are generally four primary size ranges for
18:25 large language models based on the speed and the power that you are looking for.
18:29 You have models that are around seven or 8 billion parameters. Those are
18:33 generally the smallest that I'd recommend trying to run. There are a lot
18:37 of smaller LLMs available like 1 billion parameters or three billion parameters,
18:41 but I'm so unimpressed when I use those LLMs that I don't really want to focus
18:45 on them here. 7 billion parameters is still tiny compared to the large cloud
18:51 AI models like Claude or GPT, but you can get pretty good results with them
18:55 for just simple chat use cases. And so for these models, assuming a Q4
18:59 quantization, which I'll get into quantization in a little bit, it's
19:02 basically just a way to make the LLM a lot smaller without hurting performance
19:06 that much, a 7 billion parameter model will need about four to 5 GB of VRAM on
19:11 your graphics card. And so if you have something like a 3060 Ti from Nvidia
19:16 with 8 GB of VRAM, you can very comfortably run a 7 billion parameter
19:20 model and you can expect to get very roughly around 25 to 35 tokens per
19:27 second. A token is roughly equivalent to a word. And so your local large language
19:31 model at 7 billion parameters with this graphics card will get about 25 to 35
19:37 words per second out on the screen to you being streamed out. And then if you
19:42 use much more powerful hardware like a 3090 to run a 7 billion parameter model
19:47 then you'll just jack up the speed a lot more. So that's 7 billion or 8 billion
19:51 parameters. Another very common size is something around 14 billion parameters.
19:57 This will take about 8 to 10 GB of VRAM. And so just a couple of options for
20:01 this. You have the 4070Ti which is usually 16 GB of VRAM or you could go as
20:08 low as 12 GB of VRAM with the 3080 Ti. And you could expect to get about 15 to
20:13 25 words per second. And then this is where you start to get into basic tool
20:17 calling. So I find that when you are building with a 7 billion parameter
20:21 model, they don't do tool calling very well. So you can't really build that
20:25 powerful of agents around a 7 billion parameter model. But once you get to
20:29 something around 14 billion parameters, that's when I see agents being able to
20:33 really accept instructions well around tools and system prompts and leveraging
20:37 tools to do things on our behalf. That's when we can really start to use LLMs to
20:42 make things that are agentic. And then the next big category of LLMs
20:47 is somewhere between 30 and 34 billion parameters. You see a lot of LLMs that
20:51 fall in that size range. This will typically need 16 to 20 GB of VRAM.
20:57 And so a 3090 is a really good example of a graphics card that can run this. It
21:03 has 24 GB of VRAM. I actually have two 3090s myself. And I'll have a link to my
21:08 exact PC that I built for running local AI in the description of this video. So
21:12 I have two 3090s, which we'll need in a second for a 70 billion parameter model, but
21:16 one is enough for a 32 billion parameter model. And then also Macs with their new
21:22 M4 chips are very powerful with their unified memory architecture. So if you
21:27 get a Mac M4 Pro with 24 GB of unified memory, you can also run 32 billion
21:32 parameter models. Now the speed isn't going to be the best necessarily, and
21:36 again, this does depend a lot on your computer overall, but you can expect
21:40 something around 10 to 15, maybe up to 20 tokens per second. and 32 billion
21:46 parameters is when you really start to see LLMs that are actually pretty
21:50 impressive. Like 7 billion and 14 billion, they are disappointing quite a
21:54 bit. I'll be totally honest. Especially when you try to use them with more
21:58 complicated agentic tasks. 32 billion when you start to get into this range is
22:02 when I I'm actually genuinely impressed. I'm like, "Oh, this is actually pretty
22:05 close to the performance of some of the best cloud AI." And then 70 billion
22:10 parameters. This is going to take about 35 to 40 GB of VRAM. For most consumer
22:19 GPUs like 3090s and 4090s, even 5090s, that's not actually enough VRAM. And so
22:22 this is when you have to start to split a large language model across multiple
22:28 GPUs which solutions like Olama will actually help you do this right out of
22:31 the box. So it's not this insane setup even though it might feel kind of
22:35 daunting like oh I have to split the layers of my LLM between GPUs. It's not
22:38 actually that complicated. And so two 3090s or two 4090s, that will be necessary.
22:44 Or you could have more of like an enterprise-grade GPU like an H100. So
22:49 Nvidia has a lot of these non-consumer-grade GPUs that have a lot more VRAM to
22:53 handle things like 70 billion parameter models. And the speed won't be the best
22:58 if you're using something like two 3090s, especially because performance is hurt
23:01 when you have to split an LLM between GPUs. You could expect something like 8
23:06 to 12 tokens per second. And obviously, if you have the most complex
23:09 agents where you're really trying to match the performance of cloud AI as
23:13 much as possible, that's when you'd want to use a 70 billion parameter model. And
23:16 then if you're investing in hardware to run local AI, I have a couple of quick
23:20 recommendations here. And a lot of this depends on the size of the model that's
23:24 going to be good enough for your use case. And so I'll dive into some
23:28 alternatives for running local AI directly if you want to do testing
23:32 before you buy infrastructure. I'll get into that in a little bit, but
23:35 recommended builds. If you want to spend around $800 to build a PC, I'd recommend
23:41 getting a 4060Ti graphics card and then 32 GB of RAM. If you want to spend
23:47 $2,000, I'd recommend either getting a PC with a 3090 and 64 GB of RAM or
23:53 getting that Mac M4 Pro with 24 GB of unified memory. And then lastly, if you
23:58 want to spend $4,000, which is about what I spent for my PC, then I'd
24:02 recommend getting two 3090 graphics cards, and I got both of mine used for
24:07 around $700 each. Um, and then also getting 128 GB of RAM, or you can get a
24:15 Mac M4 Max with 64 GB of unified memory. So, I wanted to really get into the
24:19 nitty-gritty details there. So, I know I spent a good amount of time diving into
24:22 super specific numbers, but I hope this is really helpful for you. No matter the
24:26 large language model or your hardware, you now know generally where you're at
24:30 for what you can run. So, to go along with that information overload, I want
24:34 to give you some specifics, individual LLMs that you can try right now based on
24:38 the size range that you know will work for your hardware. So, just a couple of
24:42 recommendations here. The first one that I want to focus on is Deepseek R1. This
24:47 is the most popular local LLM ever. It completely blew up a few months ago. And
24:52 the best part about DeepSeek R1 is they have an option that fits into each of
24:56 the size ranges that I just covered in that chart. So they have a 7 billion
25:00 parameter, 14, 32, and 70. The exact numbers that I mentioned earlier. And
25:05 then there is also the full real version of R1, which is 671 billion parameters.
25:10 I'm sorry though, you probably don't have the hardware to run that unless
25:13 you're spending tens of thousands on your infrastructure. So, probably stick
25:16 with one of these based on your graphics card or if you have a Mac computer, pick
25:19 the one that'll work for you and just try it out. You can click on any one of
25:23 these sizes here. And then here's your command to download and run it. And this
25:28 is defaulting to a Q4 quantization, which is what I was assuming in the
25:31 chart earlier. And again, I will cover what that actually means in a little bit
25:35 here. The other one that I want to focus on here is Qwen 3. This is a lot newer.
25:41 Qwen 3 is so good. And they don't have a 70 billion parameter option, but they do
25:45 have all the other sizes that fit into those ranges that I mentioned
25:49 earlier. Like they've got 8 billion, 14 billion, and 32 billion parameters. And
25:52 the same kind of deal where you click on the size that you want and you've got
25:55 your command to install it here. And this is a reasoning LLM just like
26:01 DeepSeek R1. And then the other one that I want to mention here is Mistral Small.
26:05 I've had really good results with this as well. There are less options here,
26:08 but you've got 22 or 24 billion parameters, which is going to work well
26:12 with a 3090 graphics card or if you have a Mac M4 Pro with 24 GB of unified
26:18 memory. Really, really good model. And then also, there is a version of it that
26:22 is fine-tuned for coding specifically called Devstral, which is another
26:26 really cool LLM worth checking out as well if you have the hardware to run it.
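As a quick reference, these are roughly the commands for the recommendations above. The exact tags are assumptions on my part and change as new versions ship, so copy the command straight from each model's page on ollama.com rather than trusting these verbatim.

```
ollama run deepseek-r1:14b   # DeepSeek R1 -- also comes in 7b, 32b, and 70b variants (plus the full 671b)
ollama run qwen3:14b         # Qwen 3 -- also comes in 8b and 32b variants
ollama run mistral-small     # Mistral Small -- check the model page for the current size tags
```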
26:30 So, that is everything for just general recommendations for local LLMs to try
26:34 right now. This is the part of the master class that is going to become
26:38 outdated the fastest because there are new local LLMs coming out every single
26:42 month. I don't really know how long my recommendations will last. But in
26:45 general, you can just go to the model list in Ollama, search through them,
26:49 find one that has the size that works with your graphics card, and just give it
26:52 a shot. You can install it and run it very easily with Ollama. And the other
26:57 thing that I want to mention here is you don't always have to run open- source
27:01 large language models yourself. You can use a platform like OpenRouter. You can
27:05 just go to openrouter.ai, sign up, add in some API credits. You can try these
27:10 open-source LLMs there, maybe if you want to see what's powerful enough for
27:15 your agents before you invest in hardware to actually run them yourself.
27:18 And so within OpenRouter, I can just search for Qwen here. And I can go down
27:23 to Qwen and I can go to 32 billion. They have a free offering as well that
27:26 doesn't have the best rate limits. So I'll just go to this one right here,
27:31 Qwen 3 32B. So I can try the model out through OpenRouter. They actually host
27:35 it for me. So it's an open-source, non-local version, but now I can try it
27:39 in my agents to see if this is good. And then if it's good, it's like, okay, now
27:43 I want to buy a 3090 graphics card so that I can install it directly through
27:47 Ollama instead. And so the 32 billion Qwen 3 is exactly what we're seeing here
27:51 in OpenRouter.
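For anyone who wants to script this kind of test, here is a minimal sketch using the standard OpenAI Python client pointed at OpenRouter. The base URL is OpenRouter's documented OpenAI-compatible endpoint; the model ID is an assumption, so copy the exact ID from the model's page on openrouter.ai.

```python
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible API, so the standard client works unchanged.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",  # from your OpenRouter account
)

response = client.chat.completions.create(
    model="qwen/qwen3-32b",  # hypothetical ID -- copy the real one from the model page
    messages=[{"role": "user", "content": "Give me a one-sentence test response."}],
)
print(response.choices[0].message.content)
```

If the hosted model handles your agent's tasks well, that's a decent signal the same open-source model will be worth running locally.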
27:55 And there are other platforms like Groq as well where you can run these
27:58 open-source large language models not on your own infrastructure,
28:01 if you just want to do some testing beforehand or whatever that might be. So I wanted to call that out as an alternative as well. But yeah, that's
28:04 everything for my general recommendations for LLMs to try and use
28:09 in your agents. All right, it is time to take a quick breather. This is
28:12 everything that we've covered already in our master class. What is local AI? Why
28:17 we care about it? Why it's the future and hardware requirements. And I really
28:20 wanted to dive deep into this stuff because it sets the stage for everything
28:24 that we do when we actually build agents and deploy our infrastructure. And so
28:28 the last thing that I want to do with you before we really start to get into
28:33 building agents and setting up our package is I want to talk about some of
28:37 the tricky stuff that is usually pretty daunting for anyone getting into local
28:42 AI. I'm talking things like offloading models, quantization, environment
28:47 variables to handle things like flash attention, all the stuff that is really
28:51 important that I want to break down simply for you so you can feel confident
28:55 that you have everything set up right, that you know what goes into using local
29:00 LLMs. The first big concept to focus on here is quantization. And this is
29:04 crucial. It's how we can make large language models a lot smaller so they
29:10 can fit on our GPUs without hurting performance too much. We are lowering
29:14 the model precision here. And so what basically what that means is we have
29:18 each of our parameters, all of our numbers for our LLMs that are 16 bits
29:23 with the full size, but we can lower the precision of each of those parameters to
29:28 8, 4, or 2 bits. Don't worry if you don't understand the technicalities of
29:31 that. Basically, it comes down to LLMs are just billions of numbers. That's the
29:35 parameters that we already covered. And we can make these numbers less precise
29:40 or smaller without losing much performance. So, we can fit larger LLMs
29:45 within a GPU that normally wouldn't even be close to running the full-size model.
29:50 Like with 32 billion parameter LLMs, for example, I was assuming a Q4
29:55 quantization like four bit per parameter in that diagram earlier. If you had the
30:00 full 16 bit parameter for the 32 billion parameter LLM, there's no way it could
30:06 fit on your Mac or your 3090 GPU, but we can use quantization to make it
30:10 possible. It's like rounding a number that has a long decimal to something
30:15 like 10.44 instead of something with ten decimal places, but we're
30:19 doing it for each of the billions of parameters, those numbers that we have.
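To put rough numbers on that rounding, here is the back-of-the-envelope math for the model weights alone (real VRAM usage is higher once you add the context/KV cache and other overhead):

```python
# Approximate size of just the weights: parameters x bits per parameter.
def weight_size_gb(parameters_billion: float, bits_per_parameter: int) -> float:
    total_bytes = parameters_billion * 1e9 * bits_per_parameter / 8
    return total_bytes / 1e9

for bits in (16, 8, 4, 2):
    print(f"32B model at {bits}-bit: ~{weight_size_gb(32, bits):.0f} GB")

# Roughly 64 GB at FP16, 32 GB at Q8, 16 GB at Q4, 8 GB at Q2 -- which is why a Q4
# of a 32 billion parameter model can squeeze onto a 24 GB card like a 3090, while
# the full 16-bit version has no chance.
```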
30:23 And so just to give you a visual representation of this, you can also
30:27 quantize images just like you can quantize LLMs. And so we have our full
30:31 scale image on the left-hand side here, comparing it to different levels of
30:35 quantization. We have 16-bit, 8-bit, and 4-bit. And you can see that at first with
30:40 a 16-bit quantization, it almost looks the same. But then once we go down to
30:44 4-bit, you can very much see that we have a huge loss in quality for the image.
30:49 Now with images, it's more extreme than with LLMs. When we do an 8-bit or a 4-bit
30:54 quantization of an LLM, we don't actually lose that much performance the way we lose a lot
30:58 of quality with images. And so that's why it's so useful for us. And so I have
31:01 a table just to kind of describe what this looks like. So FP16, that's the
31:07 16-bit precision that all LLMs have as a base. That is the full size. The speed
31:11 is obviously going to be very slow because the model is a lot bigger, but
31:16 your quality is perfect compared to what it could be. I mean, obviously that
31:18 doesn't mean that you're going to get perfect answers all the time. I'm just
31:22 saying it's the 100% results from this LLM. And then going down to a Q8
31:28 precision, so it's half the size. The speed is going to be a lot better. And
31:33 the quality is nearperfect. So it's not like performance is cut in half just
31:37 because size is. You still have the same number of parameters. Each one is just a
31:42 bit less precise. And so you're still going to get almost the same results.
31:47 And then going down to a Q4, 4-bit, it's a fourth of the size. It's going to be very
31:52 fast compared to 16 bit. And the quality is still going to be great. Now, these
31:57 numbers are very vague on purpose. There's not really a way for me to
32:01 quantify exactly the difference, especially because it changes per LLM
32:05 and your hardware and everything like that. So, I'm just being very general
32:09 here. And then once you get to Q2, the size goes down a lot. It's going to
32:13 be very very fast, but usually your performance starts to go down quite a
32:17 bit once you go down to a Q2. And then like the note that I have in the bottom
32:22 left here, a Q4 quantization is generally the best balance. And so when
32:26 you are thinking to yourself, which large language model should I run? What
32:31 size should I use? My rule of thumb is to pick the largest large language model
32:37 that can work with your hardware with a Q4 quantization. That is why I assumed
32:42 that in the table earlier. And then also, like we saw in Ollama earlier, it always
32:47 defaults to a Q4 quantization because the 16-bit is just so big compared to Q4
32:52 that most of the LLMs you couldn't even run yourself. And a Q4 of a 32 billion
32:59 parameter model is still going to be a lot more powerful than the full 7
33:02 billion parameter or 14 billion parameter because you don't actually
33:07 lose that much performance. So that is quantization. So just to make this very
33:11 practical for you, I'm back here in the model list for Qwen 3. We have all these
33:15 models that don't specify a quantization, but we can see that it
33:20 defaults to Q4 because if I click on any one of them, the quantization right here
33:26 is a Q4_K_M. And don't worry about the K_M. That's just a way to group
33:30 parameters. You have K_S, K_M, and K_L. It's kind of outside of the scope of
33:33 what really matters for you. The big thing is the Q4, the actual number
33:38 here. So Q4 quantization is the default for Qwen 3 32B and really any model in
33:44 Ollama. But if we want to see the other quantized variants and we want to run
33:48 them, you can click on the view all. This is available no matter the LLM that
33:52 you're seeing in Ollama. Now we can scroll through and see all the levels of
33:56 quantization for each of the parameter sizes for Qwen 3. So, if I scroll all
34:01 the way down, the absolute biggest version of Qwen 3 available is the
34:08 full 16-bit of the 235 billion parameter Qwen 3. And it is a whopping 470 GB just
34:14 to install this. And there is no way that you're ever going to lay hands on
34:17 infrastructure to run this unless you're working for a very large enterprise. But
34:22 I can go down here, let's say, to 14 billion parameters and I can run the Q4
34:27 like this. So, you can click on any one that you want to run. Like let's say I
34:30 want to run Q8. I can click on this and then I have the command to pull and run
34:35 this specific quantization of the 14 billion parameter model. So each of the
34:42 quantized variants have a unique ID within Ollama. So you can very
34:42 specifically choose the one that you want. Again my general recommendation is
34:47 just to go with what Ollama recommends, which is just defaulting to
34:51 Q4. Like if I go to DeepSeek R1, you can see that it also defaults to Q4 no matter
34:56 the size that I pick. But if you do want to explore different quantizations, you
35:00 want to try to run the absolute full model for maybe something smaller like 7
35:04 billion or 14 billion, you can definitely do that through Ollama and
35:08 really any other provider of local LLMs. So that is everything for quantization.
35:12 It's important to know how that works, but yes, generally stick with a Q4 of
35:17 the largest LLM that you can run. The next concept that is very important to
35:21 understand is offloading. All offloading is, is splitting the layers of your
35:26 large language model between your GPU and your CPU and RAM. It's kind of
35:30 crazy, but large language models don't have to fit entirely in your GPU. All
35:36 large language models can be split into layers, layers of the different weights,
35:40 and you can have some of it running on your GPU. So, it's stored in your VRAM
35:45 and computed by the GPU. And then some of the large language model is stored in
35:50 your RAM, computed by the CPU. Now, this does hurt performance a lot. And so,
35:55 generally, you want to avoid offloading if you can. You want to be able to fit
35:59 everything in your GPU, which by the way, the context, like your prompts for
36:04 your local LLMs, that is also stored in VRAM. And so, sometimes you'll see what
36:08 happens when you have very long conversations for a large language model
36:13 that barely fit in your GPU. That'll actually tip it over the edge. So, it
36:16 starts to offload some of it to the CPU and RAM. So keep that in mind when you
36:19 have longer conversations and all of a sudden things get really slow, you know
36:24 that offloading is happening. Sometimes this is necessary though as context
36:28 grows. And if you're only offloading a little bit of the LLM or a little bit of
36:32 the conversation, whatever to the CPU and RAM, it won't affect performance
36:36 that much. And so sometimes if you're trying to squeeze the biggest size you
36:41 can into your machine for an LLM, you can take advantage of offloading to run
36:45 something bigger or have a much larger conversation. Just know that usually it
36:49 kind of sucks. Like when I have offloading start to happen, my machine
36:53 gets bogged down and the responses are a lot slower. It's really not fun, but it
36:59 is possible. And fun fact, by the way, if your GPU is full and your CPU and RAM
37:04 is full, you can actually offload to storage, like literally using your hard
37:07 drive or SSD. That's when it's like incredibly slow and just terrible. But
37:11 just fun fact, you can actually do that.
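One practical tip here: if things suddenly feel sluggish and you want to confirm whether offloading is the culprit, recent versions of Ollama include a command that lists the loaded models and how each one is split between GPU and CPU:

```
ollama ps    # shows loaded models, their size, and the GPU/CPU split for each
```

If that split shows anything other than 100% GPU, part of the model (or its context) has spilled over into RAM.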
37:15 Now, the very last thing that I want to cover before we dive into some code, setting up the local AI package, and
37:21 building out some agents is a few very crucial parameters, environment
37:25 variables for Olama. So, these are environment variables that you can set
37:28 on your machine just like any other based on your operating system. And
37:32 Ollama does have an FAQ for setting up some of these things, which I'll link to
37:36 in the description as well. But yeah, these are a bit more technical, so
37:40 people skip past setting this stuff up a lot, but it's actually really, really
37:44 important to make things very efficient when running local LLMs. So the first
37:49 environment variable is flash attention. You want to set this to one or true.
37:54 When you have this set to true, it's going to make the attention calculation
37:59 a lot more efficient. It sounds fancy, but basically large language models when
38:04 they are generating a response, they have to calculate which parts of your
38:08 prompt to pay the most attention to. That's the calculation. And you can make
38:12 it a lot more efficient without losing much performance at all by setting up
38:16 the flash attention, setting that to true. And then for another optimization,
38:21 just like we can quantize the LLM itself, you can also quantize or
38:27 compress the context. So your system prompt, the tool descriptions, your
38:31 prompt and conversation history, all that context that's being sent to your
38:36 LLM, you can quantize that as well. So Q4 is my general recommendation for
38:41 quantizing LLMs. Q8 is the general recommendation for quantizing the
38:46 context memory. It's a very simplified explanation, but it's really, really
38:50 useful because a long conversation can also take a lot of VRAM just like larger
38:55 LLM. And so it's good to compress that. And then the third environment variable,
38:58 this is actually probably the most crucial one to set up for Ollama. There
39:02 is this crazy thing. I don't know why Ollama does it, but by default, they
39:07 limit every single large language model to 2,000 tokens for the context limit,
39:13 which is just tiny compared to, you know, Gemini being 1 million tokens and
39:17 Claude being 200,000 tokens. Like, they handle very, very large prompts. And a
39:21 lot of local large language models can also handle large prompts. But Ollama
39:25 will limit you by default to 2,000 tokens. And so you have to override that
39:30 yourself with this environment variable. And so generally I recommend starting
39:34 with about 8,000 tokens to start. You can move this all the way up to
39:38 something like 32,000 tokens if your local large language model supports
39:42 that. And if you view the model page on Ollama, you can see the context length
39:46 that's supported by the LLM. But you definitely want to, you know, jack this
39:50 up more from just 2,000 because a lot of times when you have longer
39:53 conversations, you're going to get past 2,000 tokens very, very quickly. So, do
39:57 not miss this. If your large language model is starting to go completely off
40:02 the rails and ignore your system prompt and forget that it has these tools that
40:06 you gave it, it's probably because you reached the context length. And so, just
40:10 keep that in mind. I see people miss this a lot. And then the very last
40:14 environment variable, uh, probably the least important out of all these four,
40:18 but if you're running a lot of different large language models at once and you're
40:22 trying to shove them all in your GPU, a lot of times you can have issues. And so
40:25 in Ollama, you can limit the number of models that are allowed to be in your
40:29 memory at a single time. With this one, typically you want to set this to either
40:33 one or two. Definitely set this to just one if you are using large language
40:37 models that basically fill up your GPU. Like it's going to fit exactly into
40:41 your VRAM and you're not going to have room for another large language model.
40:44 But if you are running more smaller ones and maybe you could actually fit two on
40:48 your GPU with the VRAM that you have, you can set this to two. So again, more
40:52 technical overall, but it's very important to have these right. And we'll
40:55 get into the local AI package where I already have these set up in the
40:59 configuration.
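As a reference for what that configuration can look like outside of the package, here is roughly how those four settings map to environment variables. These names come from the Ollama FAQ/docs at the time of writing and may differ between Ollama versions (the context length variable in particular is a newer addition), so double-check them against the FAQ before relying on them:

```
OLLAMA_FLASH_ATTENTION=1      # enable the more efficient attention calculation
OLLAMA_KV_CACHE_TYPE=q8_0     # quantize the context (KV cache) to 8-bit
OLLAMA_CONTEXT_LENGTH=8192    # raise the default context limit from ~2,000 tokens
OLLAMA_MAX_LOADED_MODELS=1    # how many models are allowed in memory at once
```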
41:02 And then by the way, this is the Ollama FAQ that I referenced a minute ago that I'll have linked in the description. And so there's actually a
41:06 lot of good things to read into here, like being able to verify that your GPU
41:10 is compatible with Ollama. How can you tell if the model's actually loaded on
41:13 your GPU? So, a lot of like sanity check things that they walk you through in the
41:17 FAQ as well. Also talking about environment variables, which I just
41:20 covered. And so, they've got some instructions here depending on your OS
41:23 how to get those set up. So, if there's anything that's confusing to you, this
41:26 is a very good resource to start with. So, I'm trying to make it possible for
41:30 you to look into things further if there's anything that doesn't quite make
41:33 sense for what I explained here. And of course, always let me know in the
41:35 comments if you have any questions on this stuff as well, especially the more
41:39 technical stuff that I just got to cover because it's so important even though I
41:43 know we really want to dive into the meat of things, which we are actually
41:47 going to do now. All right, here is everything that we have covered at this
41:50 point. And congratulations if you have made it this far because I covered all
41:55 the tricky stuff with quantization and the hardware requirements and offloading
41:59 and some of our little configuration and parameters. So, if you got all of that,
42:03 the rest of it is going to be a walk in the park as we start to dive into code,
42:07 getting all of our local AI set up and building out some agents. You understand
42:10 the foundation now that we're going to build on top of to make some cool stuff.
42:15 And so, now the next thing that we're going to do is talk about how we can use
42:19 local AI anywhere. We're going to dive into OpenAI compatibility and I'll show
42:23 you an example. We can take something that is using OpenAI right now,
42:27 transform it into something that is using Ollama and a local LLM. So, we'll
42:31 actually dive into some code here. And I've got my fair share of no code stuff
42:35 in this master class as well, but I want to focus on both because I think it's
42:38 really important to use both code and no code whenever applicable. And that
42:42 applies to local AI just like building agents in general. So, I've already
42:45 promised a couple of times that I would dive into OpenAI API compatibility, what
42:50 it is, and why it's so important. And we're going to dive into this now so you
42:54 can really start to see how you can take existing agents and transform them into
42:59 being 100% local with local large language models without really having to
43:03 touch the code or your workflow at all. It is a beautiful thing because OpenAI
43:10 has created a standard for exposing large language models through an API.
43:14 It's called the chat completions API. It's kind of like how model context
43:19 protocol MCP is a standard for connecting agents to tools. The chat
43:23 completions API is a standard for exposing large language models over an
43:28 API. So you have this common endpoint along with a few other ones that all of
43:35 these providers implement. This is the way to access the large language model
43:39 to get a response based on some conversation history that you pass in.
43:43 So, Ollama has implemented this as of February. We have other providers too:
43:49 Gemini is OpenAI compatible, Groq is, OpenRouter is, which we saw earlier.
43:53 Almost every single provider is OpenAI API compatible. And so, not only is it
43:57 very easy to swap between large language models within a specific provider, it's
44:02 also very easy to swap between providers entirely. You can go from Gemini to
44:09 OpenAI, or OpenAI to Ollama, or OpenAI to Groq, just with changing basically one
44:13 piece of configuration pointing to a different base URL as it is called. So
44:18 you can access that provider and then the actual API endpoint that you hit
44:22 once you are connected to that specific provider is always the exact same and
44:26 the response that you get back is also always the exact same. And so Ollama has
44:31 this implemented now. And I'll link to this article in the description as well
44:33 if you want to read through this because they have a really neat Python example.
44:37 It shows where we create an OpenAI client and the only thing we have to do
44:42 to connect to Ollama instead of OpenAI is change this base URL. So now we are
44:47 pointing to Ollama that is hosted locally instead of pointing to the URL for
44:51 OpenAI. So we'd reach out to them over the internet and talk to their LLMs. And
44:55 then with Ollama, you don't actually need an API key because everything's running
44:58 locally. So you just need some placeholder value here. But there is no
45:02 authentication that is going on. You can set that up. I'm not going to dive into
45:05 that right now. But by default, because it's all just running locally, you don't
45:09 even need an API key to connect to Ollama. And then once we have our OpenAI
45:13 client set up that is actually talking to Ollama, not OpenAI, we can use it in
45:18 exactly the same way. But now we can specify a model that we have downloaded
45:22 locally already through Ollama. We pass in our conversation history in the same way
45:27 and we access the response, like the content the AI produced, the token usage,
45:31 all those things that we get back from the response, in the same way.
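Putting that together, here is a minimal sketch of the same idea in Python, along the lines of the example in that article. The model tag is just an example; use any model you have already pulled with Ollama.

```python
from openai import OpenAI

# Point the standard OpenAI client at the local Ollama server instead of api.openai.com.
# Ollama listens on port 11434 by default, and the API key is only a placeholder
# because there is no authentication out of the box.
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

response = client.chat.completions.create(
    model="qwen3:14b",  # example tag -- any locally pulled model works here
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello! What can you do when running fully offline?"},
    ],
)

print(response.choices[0].message.content)
print(response.usage)  # token usage comes back in the same shape as OpenAI's API
```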
45:34 They've got a JavaScript example as well. They have a couple of examples
45:38 using different frameworks like the Vercel AI SDK and AutoGen. Really any
45:44 AI agent framework can work with OpenAI API compatibility to make it very easy
45:47 to swap between these different providers. Like Pydantic AI, my favorite
45:52 AI agent framework, also supports OpenAI API compatibility. So you can easily
45:57 within your Pydantic AI agents swap between these different providers. And
46:02 so what I have for you now is two code bases that I want to cover. The first
46:07 one is the local AI package, which we'll dive into in a little bit. But right
46:13 now, we have all of the agents that we are going to be creating in this master
46:17 class. So I have a couple for N8N that are also available in this repository.
46:21 And then a couple of scripts that I want to share with you as well. And so the
46:25 very first thing that I want to show you is this simple script that I have called
46:31 OpenAI compatible demo. And so you can download this repository. I'll have this
46:34 linked in the description as well. There's instructions for downloading and
46:38 setting up everything in here. And this is all 100% local AI. And so with that,
46:43 I'm going to go over into my windsurf here where I have this OpenAI compatible
46:47 demo set up. So I've got a comment at the top reminding us what the OpenAI API
46:52 compatibility looks like. We set our base URL to point to Ollama hosted
46:59 locally and it's hosted on port 11434 by default. So I can actually show you
47:02 this. I have Ollama running in a Docker container, which we're going to dive
47:05 into this when we set up the local AI package, but you can see that it is
47:11 being exposed on port 11434. And by the way, you can see the
47:14 127.0.0.1 in that URL that I have highlighted here, that is synonymous with localhost.
47:21 And so this right here, you could also replace with 127.0.0.1.
47:25 Just a little tidbit there. It's not super important. I just typically leave
47:28 it as localhost. And then you can change the port as well. I'm just sticking to
47:32 what the default is. And then again, we don't need to set our API key. We can
47:36 just set it to any value that we want here. We just need some placeholder even
47:39 though there is no authentication with Ollama for real unless you configure
47:43 that. So that's OpenAI compatibility. And the important thing with this script
47:47 here is I have two different configurations here. I have one for
47:51 talking to OpenAI and then one for Ollama. So with OpenAI, we set our base
47:57 URL to point to api.openai.com. We have our OpenAI API key set in our
48:01 environment variables. So you can just set all your environment variables here
48:05 and then rename this to env. I've got instructions for that in the readme of
48:08 course. And then going back to the script, we are using GPT4.1 nano for our
48:13 large language model. There's something super fast and cheap. And then for our
48:17 Lama configuration, we are setting the base URL here, localhost1434
48:22 or just whatever we have set in our environment variables. Same thing for
48:26 the API key. And then same thing for our large language model. And what I'm going
48:31 to be using in this case is Qwen 3 14B. That is one of the large language models
48:34 that I showed you within the Ollama website. Definitely a smaller one
48:38 compared to what I could run, but I just want to run something fast. And very
48:41 small large language models are great for simple tasks like summarization or
48:45 just basic chat. And that's what I'm going to be using here just for a simple
48:49 demo. And so whether it's enabled or not, this configuration is just based on
48:53 what we have set for our environment variables. And the important thing here
48:59 is the code that runs for each of these configurations just as we go through
49:03 this demo is exactly the same. We are parameterizing the configuration for the
49:08 base URL and API key. So we are setting up the exact same OpenAI client just
49:13 like we saw in the Ollama article but just changing the base URL and API key.
49:17 And so then, for example, when we use it right here, it's
49:22 client.chat.completions.create, calling the exact same function no matter if we're
49:26 using OpenAI or Ollama. And then we're handling the response in the same way as
49:31 well.
49:34 already to set up my virtual environment, install all of my dependencies. And so now I can run the
49:40 command OpenAI compatible demo. And now it's going to present the two
49:43 configuration options for me. And so I can run through OpenAI. So we'll go
49:46 ahead and do that first. And these two demos are going to look exactly the
49:50 same, but that is the point. And so we have our base URL here for OpenAI. We
49:54 have a basic example of a completion with GPT-4.1 nano. There we go. So this
49:59 is the model that was used. Here are the number of tokens. And this is our
50:03 response. And then I can press enter to see a streaming response now as well. So
50:07 we saw it type out our answer in real time. And then I can press enter one
50:10 more time. This is the last part of the demo. Just say multi-turn conversation.
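Before looking at the multi-turn output, here is roughly what the streaming and multi-turn parts of a demo like this look like with the same client; a minimal sketch, assuming the client and model variables from the previous snippet.

    # Streaming: print the answer token by token as it arrives.
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Explain local AI in two sentences."}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()

    # Multi-turn: keep appending to the messages list to carry conversation history.
    messages = [{"role": "user", "content": "My name is Alex."}]
    first = client.chat.completions.create(model=model, messages=messages)
    messages.append({"role": "assistant", "content": first.choices[0].message.content})
    messages.append({"role": "user", "content": "What is my name?"})
    second = client.chat.completions.create(model=model, messages=messages)
    print(second.choices[0].message.content)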
50:14 So we got a couple of messages here in our conversation history. So very nice
50:19 and simple. The point here is to now show you that I can run this and select
50:23 Ollama now instead and everything is going to look exactly the same, and all
50:27 of the code is the same as well. It is only our configuration that is
50:31 different. And so it will take a little bit when you first run this because
50:36 Ollama has to load the large language model into your GPU. And so, going to the
50:42 logs for Ollama, I can show you what this looks like here. And so when we first
50:47 make a request, when Qwen3 14B is not loaded into our GPU yet, you're going to
50:52 see a lot of logs come in here, and you'll have this container up and
50:54 running when you have the local AI package, which we'll cover in a little
50:57 bit. So it shows all the metadata about our model, like it's Qwen3 14B. We can
51:05 see here that we have a Q4_K_M quantization like we saw on the Ollama
51:09 website. There's just so much to digest here.
51:14 Another really important thing is we have the
51:19 context length. I have that set to 8,192 just like I recommended in the
51:22 environment variables. And then we can see that we offloaded all of the layers
51:26 to the GPU. So I don't have to do any offloading to the CPU or the RAM. I can
51:30 keep everything in the GPU, which is certainly ideal, like I said, to make
51:34 sure this is actually fast. And then when we get a response from Qwen3 14B,
51:41 we are calling the v1/chat/completions endpoint because it is OpenAI API
51:46 compatible. So that exact endpoint that we hit for OpenAI is the one that we are
51:50 hitting here with a large language model that is running entirely on our computer
51:54 in Ollama. And so the response I get back, it's actually a reasoning LLM as
51:58 well. So we even have the thinking tokens here, which is super cool. And so
52:02 we got our response. It's just printing out the first part of it here just to
52:04 keep it short. And then I can press enter. And we can see a streaming demo
52:08 as well. And it's going to be a lot faster this time because we do already
52:11 have the model loaded into our GPU. And so that first request when it first has
52:15 to load a model is always the slower one. And then it's faster going forward
52:19 once that model is already loaded in our GPU. And then as long as we don't swap
52:24 to another large language model and use that one, then it will remain in our GPU
52:28 for some time. And so then all of our responses after are faster. And then we
52:33 just have the last part of our demo here with a multi-turn conversation. So we
52:37 can see conversation history in action as well, just not with streaming here.
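If you ever want to confirm which models Ollama has available and which one is currently loaded into the GPU (relevant to the warm-up behaviour described above), you can hit Ollama's native endpoints directly. A minimal sketch, assuming the default port; /api/tags lists the pulled models and /api/ps lists the ones currently loaded.

    import requests

    base = "http://localhost:11434"

    # Models that have been pulled and are available locally.
    available = requests.get(f"{base}/api/tags", timeout=10).json()
    print([m["name"] for m in available.get("models", [])])

    # Models currently loaded into memory/GPU (empty until the first request warms one up).
    loaded = requests.get(f"{base}/api/ps", timeout=10).json()
    print([m["name"] for m in loaded.get("models", [])])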
52:40 And everything's a bit slower with this large language model because
52:43 it is a reasoning one. So if you want faster
52:47 inference, you can always use a non-reasoning local LLM like Mistral or
52:52 Gemma, for example. So that is our very simple demo showing how this works. I
52:55 hope that you can see with this, and again, this works with other AI agent
52:59 frameworks like Agno or Pydantic AI or CrewAI as well. They all work in
53:03 this way where you can use OpenAI API compatibility to swap between providers
53:08 so easily, so you don't have to recreate things to use local AI. And that's
53:11 something so important that I want to communicate with you, because if I'm the
53:15 one introducing you to local AI, I also want to show you how it can very easily
53:19 fit into your existing systems and automations.
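As an illustration of that provider swap inside an agent framework, here is roughly what it looks like in Pydantic AI. This is a sketch, not code from the video's repositories, and the exact import paths and result attribute can differ between framework versions, so treat it as a shape to adapt rather than a definitive implementation.

    from pydantic_ai import Agent
    from pydantic_ai.models.openai import OpenAIModel
    from pydantic_ai.providers.openai import OpenAIProvider

    # Point the "OpenAI" model class at Ollama's OpenAI-compatible endpoint instead.
    model = OpenAIModel(
        "qwen3:14b",  # any model you have pulled in Ollama
        provider=OpenAIProvider(base_url="http://localhost:11434/v1", api_key="ollama"),
    )

    agent = Agent(model, system_prompt="You are a helpful assistant.")
    result = agent.run_sync("Say hello in one sentence.")
    print(result.output)  # on older pydantic-ai versions this attribute is result.data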
53:23 All right. Now, we have gotten to the part of the local AI master class that I'm actually the most
53:27 excited for because over the past months, I have very much been pouring my
53:31 heart and soul into building up something to make it infinitely easier
53:35 for you to get everything up and running for local AI. And that is the local AI
53:40 package. And so, right now, we're going to walk through installing it step by
53:44 step. I don't want you to miss anything here because it's so important to get
53:47 this up and running, get it all working well. Because if you have the local AI
53:51 package running on your machine and everything is working, you don't need
53:55 anything else to start building AI agents running 100% offline and
53:59 completely private. And so here's the thing. At this point, we've been
54:04 focusing mostly on Ollama and running our local large language models. But there's
54:08 the whole other component to local AI that I introduced at the start of the
54:13 master class for our infrastructure: things like our database, local and
54:18 private web search, our user interface, and agent monitoring. We have all these
54:23 other open-source platforms that we also want to run along with our large
54:27 language models and the local AI package is the solution to bring all of that
54:32 together curated for you to install in just a few steps. So, here is the GitHub
54:37 repository for the local AI package. I'll have this linked in the description
54:41 below. Just to be very clear, there are two GitHub repos for this master class.
54:45 We have this one that we covered earlier. This has our N8N and Python
54:49 agents that we'll cover in a bit, as well as the OpenAI compatible demo that
54:53 we saw earlier. So, you want to have this cloned and the local AI package as
54:57 well. Very easy to get both up and running. And if you scroll down in the
55:02 local AI package, I have very comprehensive instructions for setting
55:06 up everything, including how to deploy it to a private server in the cloud,
55:10 which we'll get into at the end of this master class, and a troubleshooting
55:13 section at the bottom. So, everything that I'm about to walk you through here,
55:17 there's instructions in the readme as well if you just want to circle back to
55:21 clarify anything. Also, I dive into all of the platforms that are included in
55:26 the local AI package. And this is very important because like I said, when you
55:30 want to build a 100% offline and private AI agent, it's a lot more than just the
55:35 large language model. You have all of the accompanying infrastructure like
55:39 your database and your UI. And so I have all that included. First of all, I have
55:44 N8N, that is our low-code/no-code workflow automation platform. We'll be building
55:48 an agent with N8N in the local AI package in a little bit once we have it
55:52 set up. We have Supabase for our open-source database. We have Ollama. Of
55:56 course, we want to have this in the package as well for our LLMs. Open WebUI,
56:01 which gives us a ChatGPT-like interface for us to talk to our LLMs and
56:06 have things like conversation history. Very, very nice. So, we're looking at
56:09 this right here. This is included in the package. Then we have Flowise. It's
56:13 similar to N8N. It's another really good tool to build AI agents with no/low
56:18 code. Qdrant, which is an open-source vector database. Neo4j, which is a
56:25 knowledge graph engine. And then SearXNG for open-source, completely free and
56:31 private web search. Caddy, which is going to be very important for us once
56:35 we deploy the local AI package to the cloud and we actually want to have
56:38 domains for our different services like N8N and Open WebUI. And then the last
56:42 thing is Langfuse. This is an open-source LLM engineering platform; it helps
56:47 us with agent observability. Now, some of these services are outside of the scope
56:52 for this local AI master class. I don't want to spend a half hour on every
56:56 single one of these services and make this a 10-hour video. I will be focusing
57:02 in this video on N8N, Supabase, Ollama, Open WebUI, SearXNG, and then Caddy once
57:08 we deploy everything to the cloud. So, I do cover like half of these services.
57:12 And the other thing that I want to touch on here is that there are quite a few
57:16 things included here. And so you do need about 8 GB of RAM on your machine or
57:21 your cloud server to run everything. It is pretty big overall. And so you can
57:26 remove certain things, like if you don't want Qdrant and Langfuse for example,
57:30 you can take those out of the package. More on that later. It doesn't have to
57:34 be super bloated, you can whittle this down to what you need. But yeah, there's
57:37 a lot of different things that go into building AI agents. And so I have all of
57:40 these services here so that no matter what you need, I've got you covered. And
57:44 so with that, we can now move on to installing the local AI package. And
57:48 these instructions will work for you on any operating system, any computer. Even
57:52 if you don't have a really good GPU to run local large language models, you
57:56 still could always use OpenAI or Anthropic, something like that, and then
58:00 run everything else locally to save on costs or just to have everything running
58:04 on your computer. And so there are a couple of prerequisites that you have to
58:08 have before you can do the instructions below. You need Python so you can run
58:12 the start script that boots everything up. Git or GitHub desktop so you can
58:16 clone this GitHub repository, bring it all onto your own machine. And then you
58:21 want Docker or Docker Desktop. And so I've got links for all of these. Docker
58:25 and Docker Desktop we need because all of these local AI services that I've
58:29 curated for you, they all run as individual Docker containers that are
58:34 all combined together in a stack. And so I'll actually show you this is the end
58:36 result once we have everything up and running within your Docker Desktop. You
58:41 have this local AI Docker Compose stack that has all of the services running in
58:45 tandem, like Supabase and Redis and N8N and Flowise, Caddy, Neo4j. All of
58:50 these are running within this stack. That is what we're working towards right
58:54 now. And so make sure you have all these things installed. I've got links that'll
58:57 take you to installing no matter your operating system. Very easy to get all
59:01 of this up and running on your machine. Then we can move on to our first command
59:05 here, which is to clone this GitHub repository, bringing all of this code on
59:10 your machine so you can get everything running. And so you want to open up a
59:14 new terminal. So I've got a new PowerShell session open here. Going to
59:18 paste in this command. And I'm going to be doing this completely from scratch
59:22 with you. So you clone the repo and then I'm just going to change my directory
59:26 into the local AI package, which was just created from this git clone command. So
59:31 those are the first two steps. The next thing is we have to configure all of our
59:36 environment variables. And believe it or not, this is actually the longest part
59:41 of the process. And once we have this taken care of, it's a breeze getting the
59:44 rest of this up and running. But there's a lot of configuration that we have to
59:49 set up for our different services like credentials for logging into our
59:54 Supabase dashboard or Neo4j, things like our Supabase anonymous key and
59:59 private key. All these things we have to configure. And so within our terminal
60:04 here, you can do code dot to open this within VS code or windsurf. Open this in
60:09 windsurf. You just want to open up this folder within your IDE and the specific
60:14 IDE that you use really doesn't matter. You just want to get to this .env.example
60:20 here. I'm going to copy it and then I'm going to paste it. And then I'm going to
60:24 rename this to .env. So we're taking the example and turning it into a .env file. So,
60:31 you want to make sure that you copy it and rename it like this. Then we can go
60:36 ahead and start setting all of our configuration. And I'll even zoom in on
60:39 this just so that it's very easy for you to see everything that we are setting up
60:44 here. So, first up, we have a couple of credentials for N8N. We have our
60:49 encryption key and our JWT secret. And it's very easy to generate these. In
60:53 fact, we'll be doing this a couple of times, but we'll use this open SSL
60:58 command to generate a random 32 character alpha numeric string that
61:02 we're going to use for things like our encryption key and JWT secret. And so,
61:08 OpenSSL is a command that is available for you by default on Linux and Macs.
61:12 You can just open up any terminal and run this command and it'll spit out a
61:16 long string that you can then just paste in for this value. For Windows, you
61:20 can't just open up any terminal and use OpenSSL, but you can use Git Bash, which
61:26 is going to come with GitHub Desktop when you install it. And so, I'll go
61:29 ahead and just search for that. If you just go to your search bar on your
61:32 bottom left on Windows and search for Git Bash, it's going to open up this
61:37 terminal like this. And so, I can go ahead and copy this command, go in here,
61:42 and paste it in. And then I can run it. And then, boom, there we go. This is I
61:45 know it's really small for you to see right now. I'm going to go ahead and
61:48 copy this because this is now the value that I can use for my encryption key.
61:52 And then you want to do the exact same thing to generate a JWT secret. And then
61:57 the other way that you can do this, if you don't want to install Git Bash or
62:01 it's not working for whatever reason, is you can use Python to generate this as
62:05 well. So I can just copy this command and then I can go into the terminal here
62:10 and I can just paste this in. And so, just like with OpenSSL, it's going to
62:15 generate this random 32 character string that I can copy and then use for my JWT
62:20 secret. There we go.
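For reference, the Python route can be as simple as the following; a minimal sketch using the standard-library secrets module, which may not be the exact one-liner from the readme.

    import secrets

    # 16 random bytes rendered as hex gives a 32-character alphanumeric string,
    # suitable for values like the N8N encryption key or a JWT secret.
    print(secrets.token_hex(16))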
62:24 And so I am going to get in the weeds a little bit here with each of these different parameters, but I really want to make sure that I'm
62:27 clear on how to set up everything for you so you can really walk through this
62:31 step by step with me. And like I said, setting up the environment variables is
62:35 the longest part by far for getting the local AI package set up. So if you bear
62:39 with me on this, you get through this configuration, you will have everything
62:43 running that you need for local AI for the LLMs and your infrastructure. So
62:47 that's everything for N8N. Now we have some secrets for Supabase. And there
62:53 are some instructions in the Supabase documentation for how to get some of
62:57 these values. So it's this link right here, which I have open up on my
63:01 browser. We'll reference this in a little bit here. But first, we can
63:05 set up a couple of other things. The first thing we need to define is our
63:11 Postgres password. So, Supabase uses Postgres under the hood for the
63:14 database. And so, we want to set a password here that we'll use to connect
63:19 to Postgres within N8N or a connection string that we have for our Python code,
63:23 whatever that might be. And this value can be really anything that you want.
63:26 Just note that you have to be very careful about using special characters like
63:31 percent symbols. So if you ever have any issues with Postgres, it's probably
63:36 because you have special characters that are throwing it off. That's something
63:39 that I've seen happen quite a few times. And so, like I said, I want to mention
63:42 troubleshooting steps and things to make sure that it is very clear for you. So
63:47 for this Postgres password here, I'm just going to say test Postgres pass.
63:51 I'm just going to give some kind of random value here. Just end with a
63:54 couple of numbers. I don't care that I'm exposing this information to you because
63:58 this is a local AI package. These passwords are for services that never
64:02 leave my computer. So, it's not like you could hack me by connecting to anything
64:08 here. And then we have a JWT secret. And this is where we get into this link
64:13 right here in the Supabase docs. And so they walk you through generating a JWT
64:18 secret and then using that to create both your anonymous and your service
64:22 role keys. If you're familiar with Supabase at all, we need both of these
64:27 pieces of information. The anonymous key is what we share with our front end. This
64:30 is our public key. And then the service role key has all permissions for
64:33 Supabase. We'll use this in our backends for things like our agents. And
64:39 so you can go ahead and copy this JWT secret.
64:43 And then you can paste this in right here. This is 32 characters long, just
64:47 like the things that we generated with OpenSSL. I'm just going to be using
64:52 exactly what Supabase tells me to. And then what you can do with this is you
64:56 can select the anonymous key, click on generate JWT, and then I can copy this
65:02 value, and then I will paste this for my anonymous token. And so I'm just
65:06 replacing the default value there for the anonymous key. And then going back
65:10 and selecting the service key, I'm going to generate that one as well. It
65:13 looks very similar. They'll always start with "ey", but these values are different
65:18 if you go towards the end. And so I'll go ahead and paste this for my service
65:22 role key. Boom. There we go. All right.
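If you prefer to generate the anon and service_role keys in code instead of the web generator, the standard self-hosted Supabase approach is to sign two JWTs with your JWT secret. A hedged sketch using the PyJWT library; the claim names ("role", "iss") follow the self-hosting docs, but double-check the details against the page linked above.

    import time
    import jwt  # PyJWT

    jwt_secret = "your-long-jwt-secret-from-the-env-file"  # the JWT secret you just set

    def supabase_key(role: str) -> str:
        now = int(time.time())
        payload = {
            "role": role,            # "anon" for the public key, "service_role" for the backend key
            "iss": "supabase",
            "iat": now,
            "exp": now + 5 * 365 * 24 * 60 * 60,  # long-lived expiry, adjust as you see fit
        }
        return jwt.encode(payload, jwt_secret, algorithm="HS256")

    print("ANON_KEY =", supabase_key("anon"))
    print("SERVICE_ROLE_KEY =", supabase_key("service_role"))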
65:27 And then for the Supabase dashboard that we'll log into to see our tables and our SQL editor and authentication
65:31 and everything like that, we have our username here, which I'm just going to
65:35 keep as supabase. And then for the password, I can just say test supabase
65:39 pass. I'll just kind of use that as my common nomenclature here for my
65:42 passwords cuz I don't really care what that is right now. And then the last
65:45 thing that we have to set up is our pooler tenant ID. And it's not really
65:49 important to dive into what exactly this means. Just know that you can set this
65:52 to really anything that you want. Like I typically will just choose four digits
65:57 here, like 1000, for my pooler tenant ID. So that is everything that we need for
66:01 Supabase. And actually, most of the configuration is for Supabase. Then we
66:06 have Neo4j. This is really simple. You can leave Neo4j for the username, and
66:11 then I'll just say test Neo4j pass for my password here. So you just set the
66:15 password for the knowledge graph, and even if you're not using Neo4j, you still have
66:19 to set this but yeah it just takes two seconds. Then we have langfuse. This is
66:23 for agent observability. We have a few secrets that we need here. And for these
66:28 values they can really just be whatever you want. It doesn't matter because
66:35 these are just passwords, just like we had passwords for things like Neo4j.
66:39 So I can just say test clickhouse pass. And then I can do test minio pass. And
66:43 I mean, it really doesn't matter here. Random Langfuse salt. I'm just doing
66:47 completely arbitrary values here. You probably want something more secure in
66:51 this case, but I'm just doing something as a placeholder for now.
66:56 There we go. Okay, good. And then the last thing that we need
66:59 for Langfuse is an encryption key. And this is also generated with OpenSSL like
67:04 we did for the N8N credentials. And so I'll go back to my git bash terminal.
67:08 And again, you can do this with Python as well. I'll just run the exact same
67:12 command. I'll get a different value this time. And so I'll go ahead and copy
67:16 that. You could technically use the same value over and over if you wanted to,
67:20 but obviously it's way more secure to use a different value for each of the
67:24 encryption keys that you generate with OpenSSL. So there we go. That is our
67:28 encryption key. And that is actually everything that we have to set up for
67:32 our environment variables when we are just running the local AI package on our
67:37 computer. Once we deploy it to the cloud and we actually want domains for our
67:41 different services like Open WebUI and N8N, then we'll have to set up Caddy. So
67:45 this is where we'll dive into domains and we'll get into this at the end of
67:49 the master class here. But everything past this point for environment
67:54 variables is completely optional. You can leave all of this exactly as it is
67:59 and everything will work. Most of this is just extra configuration for
68:03 Supabase. So, Supabase is definitely the biggest service that's included in
68:08 this list of curated services for you. And so, there's a lot of
68:11 different configuration things you can play around with if you want to dive
68:15 more into this. You can definitely look at the same documentation page that we
68:19 were using for the Supabase secrets. And so, you can scroll through this if
68:22 you want to learn more, like setting up email authentication or Google
68:27 authentication, diving more into all of those different configuration things
68:32 for Supabase if you want. I'm not going to get into all
68:36 of this right now because the core of getting Supabase up and running we
68:40 already have taken care of with the credentials that we set up at the top
68:44 right here. These are just the base things, and so that's
68:47 what we'll stick to right now. So that is everything for our environment
68:51 variables. So then going back to our readme now which I have open directly in
68:55 windsurf now instead of my browser we have finished our configuration and I do
68:59 have a note here that you want to set things up for caddy if you're deploying
69:03 to production. Obviously we're doing that later not right now like I said and
69:07 so with that we are good to start everything. Now before we spin up the
69:12 entire local AI package there is one thing that I want to cover. It's
69:14 important to cover this before we run things. If you don't want to run
69:19 everything in the package, because it is a lot, like maybe you only want to use half
69:22 of these services and you don't want Neo4j and Langfuse and Flowise right
69:28 now, there are two options that you have. The easiest one right now is to go
69:33 into the docker-compose file. This is the main file where all of the services
69:38 are curated together, and you can just remove the services that you don't want
69:42 to include. So, for example, if you don't want Qdrant right now, because it is
69:46 actually one of the larger services, it's like 600 megabytes of RAM just
69:50 having this running, you can search for Qdrant, and you can just go ahead and
69:54 delete this service from the stack like that. Boom. Now I don't have Qdrant.
69:59 It won't spin up as a part of the stack anymore. And then also I have a volume
70:03 for Qdrant. So, you can remove that as well. Volumes, by the way, are how we are
70:08 able to persist data for these containers. So if we tear down
70:12 everything and then we spin it back up, we still are going to have our open web
70:17 UI conversations and our N8N workflows, everything in Supabase, like all that
70:21 is still going to be saved because we're storing it all in volumes. So we can do
70:25 whatever the heck we want with these containers. We can tear them down. We
70:28 can update them, which I'll show you how to do later. We can spin it back up. And
70:32 all of our data will always be persisted. So you don't have to worry
70:35 about losing information. And you can always back things up if you want to be
70:39 really secure, but I've never done that before and I've been updating this
70:42 package for months and months and months and all of my workflows from 6 months
70:46 ago are still there. I haven't lost anything. And so that's just a quick
70:50 caveat there for how you can remove services if you want. And then another
70:54 thing that we don't have available yet, but I'm very excited to
70:58 talk about right now. It's in beta. Me and
71:03 one other guy that's actually on my Dynamous team, Thomas, who's got a
71:07 YouTube channel as well (he's a great guy), are working together on this.
71:09 He's actually been putting in most of the work, creating a front-end
71:13 application for us to manage our local AI package. And one of the big things
71:17 with this is that we're going to make it possible for you to toggle on and off
71:22 the services that you want to have within your local AI package. So you can
71:27 very much customize the package to the services that you want to run. So you
71:31 can keep it lightweight just to the things you care about. Also, we'll be
71:34 able to manage environment variables and monitor the containers. Not all of this
71:38 is up and running at this point, but this is in beta. We're working on it.
71:41 I'm really excited for this. So, not available yet, but at once this is
71:44 available, this will be a really good way for you to customize the package to
71:47 your needs. So, you don't have to go and edit the docker compose file directly.
71:51 So, that's something that I just wanted to get out of the way now. But, we can
71:57 start and actually execute our package now. Get all these containers up and
72:02 running. So the command that you run to start the local AI package is different
72:07 depending on your operating system and the hardware that you have. So for
72:13 example, if you are an Nvidia GPU user, you want to run this start_services.py
72:18 script. This boots up all of the containers, and you want to specifically
72:23 pass in the profile of GPU Nvidia. This is going to start Ollama in a way where the
72:29 Ollama container is able to leverage your GPU automatically. And then if you are
72:34 using an AMD GPU and you're on Linux, then you can run it this way. Which, by
72:38 the way, unfortunately, if you have an AMD GPU on Windows, you aren't able to
72:46 run Ollama in a container. And it's the same thing with Mac computers.
72:49 Unfortunately, like you see right here, you cannot expose your GPU to the Docker
72:55 instance. And so if you have an AMD GPU on Windows or you're running on a Mac, you cannot
73:01 run Ollama in the local AI package. You just have to install it on your own
73:04 machine like I already showed you in this master class and then you'll just
73:08 run everything else through the local AI package and they can actually go out to
73:12 your machine and communicate to Olama directly. So just a small limitation for
73:17 Mac and AMD on Windows. But if you're running on Linux or an Nvidia GPU on
73:21 Windows like I'm using, then you can go ahead and run this command right here.
73:27 So if you can't run a GPU in the Ollama container, then you can always just
73:32 start in CPU mode, or you can run with a profile of none. This will actually make
73:36 it so that Ollama never starts in the local AI package. So you can just
73:40 leverage the Ollama that you have already running on your computer, like I showed
73:43 you how to install already. So, just a couple of small caveats that I really
73:46 want to hit on there. I need to make sure that you're using the right
73:51 command. And so, in my case, I'm Nvidia on Windows. So, I'm going to copy this
73:55 command. Go back over into my terminal. I'll just clear it here. So, we have a
73:59 blank slate. And I'll paste in this command. And so, it's going to do quite
74:02 a few things initially. First, it's going to clone the Supabase repository
74:07 because Supabase actually manages its stack in a separate place. And so, we
74:10 have to pull that in. Then there's some configuration for SearXNG for our local
74:17 and private web search. And then I have a couple of warnings here saying that
74:20 the Flowise username and password are not set. By the way, if
74:24 you want to set the Flowise username and password, it's optional, but you can
74:29 do that if I scroll down right here. So you can set these values; those will
74:32 actually make those warnings go away, but you can also ignore them, too. So
74:35 anyway, I just wanted to mention that really quickly. But now what's happening
74:39 here is it starts by running all of the Supabase containers. And so there's
74:44 quite a bit that goes into Supabase, like I said. So we're running all of
74:47 that. It's getting all that spun up. And then once we run all of these, it's
74:51 going to move on to deploying the rest of our stack. And if you're running this
74:55 for the very first time, it will take a while to download all of these images.
74:59 They're not super small. There's a lot of infrastructure that we're starting up
75:03 here. And so it'll take a bit. You just have to be patient. maybe go grab your
75:06 coffee or make your next meal, whatever that is. And then everything will be up
75:09 and running once you are back. And so yeah, now you can see that we are
75:13 running the rest of the containers here. Um, and so we'll just wait for that to
75:16 be done. And then I'll show you what that looks like in Docker Desktop as
75:19 well. And so I'll give it a second here just to finish. Uh, looks like my
75:24 terminal glitched a little bit. Like I was scrolling and so it kind of broke it
75:27 a bit. But anyway, everything is up and running now. It'll look like this where
75:30 it'll say all of the containers are healthy or running or started. And then
75:34 if I go into Docker Desktop and I expand the local AI Compose stack, you want to
75:39 make sure that you have a green dot for everything except for the Ollama pull and
75:45 N8N import. These just run once initially and then they go down because
75:49 they're responsible for pulling some things for our local AI package. And so
75:53 yeah, I've got green dots for everything except for two right here. Now I'm
75:57 leaving this in here intentionally actually because there is a bug with
76:02 Supabase specifically if you are on Windows. So you'll see this issue where
76:08 the Supabase pooler is constantly restarting, and that also affects N8N
76:12 because N8N relies on the Supabase pooler. So it's constantly restarting as
76:17 well. If you see this problem, I actually talk about this in the
76:21 troubleshooting section of the readme. If you scroll all the way down, if the
76:24 Supabase pooler is restarting, you can check out this GitHub issue. And so I
76:29 linked to this right here, and he tells you exactly which file you want to
76:33 change. It's this one right here: docker/volumes/pooler/pooler.exs.
76:39 And you need to change the file to use LF line endings. And so I'll show you what I mean
76:43 by that. I'll show you exactly how to do this. It's like a super tiny random
76:47 thing, but this has tripped up so many people. So I want to include this
76:51 explicitly in the master class here. So you want to go within the supabase
76:56 folder, within docker/volumes, and then it's within pooler, and then we have
77:00 pooler.exs, and basically no matter your IDE, you can see the CRLF in the bottom right here.
77:08 You want to click on this and then change it to LF, and then make sure that
77:13 you save this file. Very easy to fix that. And then what you can do is you
77:19 can run the exact same command to spin everything up again. And so I'm going to
77:22 do this now. It's going to go through all the same steps. It'll be faster this
77:25 time because you already have everything pulled. And this, by the way, is how you
77:28 can just restart everything really quickly if you want to enforce new
77:31 environment variables or anything like that. So I want to include that
77:35 explicitly um for that reason as well. And I'll go ahead and close out of this.
77:39 And and while this is all restarting, the other thing that I want to show you
77:42 in the readme is I also have instructions for upgrading the containers in the local AI package. So
77:49 when N8N has an update or Superbase has an update, it is your responsibility
77:53 because you're managing the infrastructure to update things yourself. And so you very simply just
77:58 have to run these three commands to update everything. You want to tear down
78:03 all of the containers and make sure you specify your profile like GPU Nvidia and
78:09 then you want to pull all of the latest containers and again specifying your
78:13 profile. And then once you do those two things, you'll have the most up-to-date
78:17 versions of the containers downloaded. So you can go ahead and run the start
78:21 services with your profile just like we just did to restart things. Very easy to
78:25 update everything. And even though we are completely tearing down our
78:29 containers here before we upgrade them, we aren't losing any information because
78:33 we are persisting things in the volumes that we have set up at the top of our
78:37 Docker Compose stack. And so this is where we store all of our data in our
78:41 database and and workflows. All these things are persisted. So we don't have
78:45 to worry about losing them. Very easy to upgrade things and you still get to keep
78:48 everything. You don't have to make backups and things like that unless you
78:54 just want to be ultra ultra safe. So now we can go back to our Docker desktop and
79:00 we've got green dots for everything now since we fixed that pooler.exs issue.
79:04 The only thing that we don't have green dots for is the N8N import and then we
79:08 have our Ollama pull as well because, like I said, those are the two things that
79:11 just have to run at the beginning, and then they aren't ongoing processes like
79:16 the rest of our services. So, we have everything up and running. And if there
79:22 is anything that is a white dot besides Ollama pull or N8N import, or if there's
79:27 anything that is constantly restarting, just feel free to post a comment and
79:31 I'll definitely be sure to help you out. And then also check out the
79:34 troubleshooting section as well. One thing that I'll mention really quick is
79:37 sometimes your N8N will constantly restart and it'll say something like the
79:42 N8N encryption key doesn't match what you have in the config. And the big
79:46 thing to keep in mind for that is you want to make sure that you set this
79:51 value for the encryption key before you ever run it for the first time.
79:53 Otherwise, it's going to generate some random default value and then if you
79:56 change this later, it won't match with what it expects. And so, yeah, my big
79:59 recommendation is like make sure you have everything set up in your
80:03 environment variables before you ever run the start services for the first
80:08 time. This should be run once you have your environment variables set up.
80:11 Otherwise, you risk any of these services creating default values that
80:14 then wouldn't match with the keys and things that you set up later. And so
80:18 with that, we can now go into our browser and actually explore all of
80:22 these local AI services that we have running on our computer now. Now over in
80:26 our browser, we can start visiting the different services that we have spun up.
80:30 Like here is N8N. You just have to go to localhost port 5678. It'll have you
80:35 create a local account when you first visit it. And then you'll have this
80:38 workflow view that should look very familiar to you if you have used N8N
80:41 in the past. And then we have Open WebUI at localhost port 8080. This is our
80:47 ChatGPT-like interface where we can directly talk to all of the models that we have
80:52 pulled in our Ollama container. Really, really neat. And then we have localhost
80:57 port 8000 for our Supabase dashboard. The sign-in definitely isn't pretty
81:00 compared to the managed version of Supabase. But once you enter in your
81:04 username and password that you have set for the environment variables for the
81:07 dashboard, then you have the very typical view where we have our tables
81:11 and we've got our SQL editor. Everything that you're familiar with with
81:14 Supabase. And that's the key thing with all these different services. They all
81:18 will look the exact same for you pretty much. Um like another one for example,
81:25 if I go to localhost port 3000, we have Langfuse. This is for agent
81:28 observability and monitoring. And this is something I'm not going to dive into
81:31 in this master class. Like I said, I'm not covering all the services. But yeah,
81:34 I just want to show that like every single one of these pretty much you can
81:39 access in your browser.
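If you want a quick way to confirm that the services came up, you can ping the ports mentioned in this section from a small script. A rough sketch; these are the default host ports described here and may differ if you've customized the Docker Compose file or removed services.

    import requests

    # Default host ports for some of the services in the local AI package.
    services = {
        "N8N": "http://localhost:5678",
        "Open WebUI": "http://localhost:8080",
        "Supabase dashboard": "http://localhost:8000",
        "Langfuse": "http://localhost:3000",
        "Flowise": "http://localhost:3001",
        "Neo4j browser": "http://localhost:7474",
        "SearXNG": "http://localhost:8081",
        "Ollama": "http://localhost:11434",
    }

    for name, url in services.items():
        try:
            status = requests.get(url, timeout=5).status_code
            print(f"{name}: HTTP {status}")
        except requests.RequestException as exc:
            print(f"{name}: not reachable ({exc})")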
81:43 And by the way, the way that we know the specific port to access for each of these services is by taking a look at either what it tells
81:48 us in Docker Desktop. So, like, we can see that Neo4j is, let's see, port
81:53 7474. For SearXNG, it's port 8081. For Flowise, it's port 3001. What's one
82:01 that we've seen already? Yeah, like Open WebUI is port 8080. So,
82:06 the port on the left is the one that we access in our browser. And then the port
82:11 on the right is what's mapped on the container. So, when we visit port 8080
82:16 on our computer, that goes into port 8080 on the container. And that's what
82:21 we have exposed. The other way that you can see the port that you need to use is
82:24 just by taking a look at this docker compose file. And you don't need to have
82:29 like a super good understanding of this docker compose file. But if you want to
82:33 customize your stack or even help me by making contributions to local AI
82:36 package, this is the main place to make changes. And so for example, I can go
82:41 down to Flowise and I can see that the port is 3001. Or if I go down to, let's say, N8N, we can
82:49 see that the port is 5678. And so the port is always going to be
82:52 there somewhere in the service that you have set up. Like for the Langfuse
82:56 worker, it's 3030. That's more of a behind-the-scenes kind of service. But
82:59 let me just find one more example for you here. Yeah, like Redis, for
83:04 example is 6379. So you can see the ports in the Docker Compose as well. I
83:07 just want to call it out just to at least get you a little bit comfortable
83:11 and familiar with the Docker Compose file in case you want to customize
83:14 things. But the main thing is just leveraging what you see here in Docker
83:18 Desktop. Last thing in Docker Desktop really quickly, if you want to bring
83:21 more local large language models into the mix, you can do it without having to
83:25 restart anything. You just have to find the Ollama container in the Docker
83:29 Compose stack. Head on over to the exec tab. And now here we can run any
83:33 commands that we'd want. We're directly within the container here. And we can
83:36 use Ollama commands just like we did earlier on our host machine. And so, for
83:40 example, I ran ollama list already. So I can see the large language models that
83:43 have already been pulled in my Ollama container. If I want to pull more, I can
83:48 just do ollama pull and then find that ID for the model I want to use on the Ollama
83:52 website. And like I said, you don't have to restart anything. If I pull it here,
83:56 it's now in the container and I can immediately start using it in Open Web
84:00 UI or N8N. We'll see that in a little bit. And so that's just really important
84:03 because a lot of times you're going to want to start to use different large
84:06 language models and you don't want to have to restart anything. The ones that
84:11 are brought into the machine by default are determined by this line right
84:15 here. So if you want to change the ones that are pulled by default, I just have
84:20 Qwen 2.5 7B Instruct, like a really small lightweight one that I have
84:23 brought into your Ollama container by default. If you want to add in
84:27 different ones, you can just update this line right here to include multiple
84:32 ollama pulls. And so that way you can bring in Qwen 3 or Mistral Small 3.1,
84:36 whatever you want. This is just the one I have by default. And then all the
84:40 other ones that you saw in my list here, I've pulled myself. All right. Now that
84:45 we have the local AI package up and running, it is time to build some
84:50 agents. Now, we get to use our local AI package to actually build out an
84:53 application. And so, I'm going to start by introducing you to Open Web UI, and
84:58 we'll use it to talk to our Ollama LLM. So, we have an application kind of right
85:02 out of the box for us. Then I'll dive into building a local AI agent with N8N,
85:08 even connecting it to Open Web UI. So we have this custom agent that we built in
85:12 N8N and then we immediately have a really nice UI to chat with it. And then
85:16 we'll transition to Python building the exact same agent in Python as well. Like
85:21 I said, I want to focus on both no code and code to really make this a complete
85:24 master class so that whether you want to build with N8N or Python, you can see
85:28 how to connect to our different services that we have running locally like
85:32 Supabase and SearXNG and Open WebUI. So, we'll cover all of that and then I'll
85:36 get into deployments after this. But yeah, let's go ahead right now focus on
85:40 open web UI and building out some agents. So, back over in Open Web UI,
85:45 remember this is localhost port 8080. You want to set up your connection to
85:49 Ollama so we can start talking with our local LLMs in this nice interface. And
85:54 so bottom left, go to the admin panel, then go to settings and then the
85:58 connections tab. Here we can set up our connections both to OpenAI with our API
86:02 key, which we're not going to do right now, but then also the Ollama API. This
86:07 is what we want to set up. Now, usually by default, this value is just
86:12 localhost. And this is actually wrong. This is something that is so important
86:16 to understand. And this will apply when we set up credentials in N8N and Python
86:20 as well. When you are within a container, localhost means that you are
86:26 referencing still within the container. Open WebUI needs to reach out to the
86:32 Ollama container, not itself. So localhost is not correct here. This is
86:36 generally the default just because Open WebUI assumes that you're running it on
86:39 your machine, and so then you would also have Ollama running on your machine. So
86:42 localhost usually works when you're outside of containers. But here we have
86:47 to change this. This is super important to get right. And so there are two
86:51 options we have. If you are running on a Mac or AMD on Windows and you want to
86:56 use Ollama running on your machine, not within a container, then you want to use
87:00 host.docker.internal. This is the way in Docker to tell the container to look outside to the host
87:07 machine where you're running the containers and where Ollama is running
87:11 separately. Very important to know that. And then if you are running Ollama in the
87:16 container like I am doing (I have Ollama running in my Docker Desktop), you want
87:21 to change this to ollama; you're specifically calling out the service
87:26 that is running the Ollama container in your Docker Compose stack. And the way
87:30 that we know that this is the name specifically is because we just go back
87:35 to our all-important Docker Compose file: ollama. So whenever there's an "x-" prefix (an x and a
87:40 dash), you just ignore that; it's just the thing after it. So, ollama is the name
87:45 of our service running the container. And then if we wanted to connect to
87:49 something else like Flowise, flowise is the name of the service. Open WebUI,
87:55 it's open-webui. All of these top-level keywords, these are the names when we
87:59 want our containers to be talking to each other.
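To summarize the addressing rules in code form, here is a hedged sketch of the three situations, using the OpenAI client as the example; which base URL applies depends on where your code runs and where Ollama runs, as described above.

    from openai import OpenAI

    # 1. Code on your host machine, Ollama on your host machine:
    #    localhost works as usual.
    host_to_host = "http://localhost:11434/v1"

    # 2. Code inside a container, Ollama installed directly on the host
    #    (e.g. Mac, or AMD GPU on Windows): host.docker.internal points back at the host.
    container_to_host = "http://host.docker.internal:11434/v1"

    # 3. Code inside a container, Ollama running as another container in the
    #    same Compose stack: use the service name from the Docker Compose file.
    container_to_container = "http://ollama:11434/v1"

    client = OpenAI(base_url=container_to_container, api_key="ollama")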
88:03 And all of this is possible because they are within the same Docker network. And so I'll just show you that
88:06 so you know what I'm talking about here. If I go back to Docker Desktop, we have
88:11 this local AI Compose stack. All of these containers can now communicate
88:14 internally with each other by referencing names like Redis or
88:19 SearXNG. So, we'll be seeing that a lot when we're building out our agents as
88:22 well. So, I wanted to spend a couple minutes to focus on that. And so, you
88:25 can go ahead and click on save in the very bottom right. I know my face is
88:28 covering this right now, but you have a save button here. Make sure you actually
88:32 do that. Um, and for this API key, I don't know why it's asking me to fill it
88:34 out. I don't really care about connecting to open AI. So I'll just put
88:37 some random value there and click save. And then boom, there we go. We are good.
88:40 And then a lot of times with open web UI, it also helps to refresh otherwise
88:43 it doesn't load the models for some reason. So I just did a refresh of the
88:49 site here. Control F5. And then now we can select all of the local LLMs that we
88:53 have pulled in our Ollama container. And so, for example, I can do Qwen 2.5 7B.
88:57 That's the one that I just have by default. I can say hello. And it's going
89:02 to take a little bit cuz it has to load this model onto my GPU just like we saw
89:06 with Qwen 3 earlier. But then in a second here we'll get a response. And
89:10 there are actually multiple calls that are being done here. We have one to get
89:15 our response, one to get a title for our conversation on the left-hand side. And
89:19 then also if you click on the three dots here, you can see that it created a
89:23 couple of tags for this conversation. So couple of things that are fired off all
89:26 at once there. And I can test conversation history. What did I just
89:30 say? So yeah, I mean everything's working really well here. We have chat
89:34 history, conversation history on the left-hand side. There's so much that we
89:37 get out of the box. And so I wanted to show you this really quickly. Now we can
89:41 move on to building an agent in N8N. And I'll even show you how to connect it to
89:46 Open Web UI as well through this N8N agent connector. Really exciting stuff.
89:49 So let's get right into it. So I'm going to start really simple here by building
89:53 a basic agent. The main thing that I want to focus on is just connecting to
89:57 our different local AI services. So I am going to assume that you have a basic
90:01 knowledge of N8N here because this is not an N8N master class. And so I'm
90:04 starting with a chat trigger so we can talk to our agent directly in the UI.
90:08 We'll connect this to open web UI in a bit as well. And then I want to connect
90:14 an AI agent node. And so what we want to do is connect Ollama for the chat model and
90:18 then local Supabase for our conversation history, our agent memory.
90:22 And so for the chat model I'm going to do the Ollama chat model. I'm going to create
90:26 brand new credentials. You can see me do this from scratch. The URL that you want
90:32 for the base URL is exactly the same as what we just entered into open web UI.
90:35 And so if you are running Ollama on your host machine, like AMD on Windows, or
90:40 you are running on a Mac, or you just don't want to run the Ollama container,
90:46 then it is host.docker.internal. And then if you are referencing the
90:49 Ollama container, we just reference ollama. That's the name of the service
90:53 running the Ollama container in our stack. And then the port is 11434 by
90:57 default. And you can test this connection. So it'll do a quick ping to
91:01 the container to make sure that we are good to go. And I'll even show you what
91:04 that looks like. So right here in my Ollama container, I have the logs up. And
91:09 the last two requests were just a simple get request to the root endpoint. We
91:13 have two of those right here. And if I click on retry and I go back to the
91:18 logs, boom, we are at three now. So it made three requests. So it's just making
91:22 that simple ping each time to make sure the container is available. And so I'm
91:26 going to go ahead and click on save and then close out. So now we have our
91:29 credentials and then we can automatically select the model that we
91:33 have loaded now in our container. And so just to keep things really lightweight,
91:36 I'm going to go with the 7 billion parameter model right now from Qwen 2.5.
91:40 Cool. All right. So that is everything that we need to connect Ollama. It is
91:44 that easy. And then we could even test it right now. So, I'm going to go ahead
91:47 and save this workflow. And I'm going to just say hello. And uh we don't need the
91:52 conversation history or tools or anything at this point. We're already
91:55 getting a response here from the LLM. It's working on loading the model into
91:59 my GPU as we speak. And so there we go. We got our answer looking really good.
92:04 Cool. So now we can add memory as well. So I'm going to add Postgres because,
92:08 remember, Supabase uses Postgres under the hood. And then I'm going to create
92:12 brand new credentials here. And this is actually probably the hardest one to set
92:16 up out of all of the credentials for connecting to our local AI service. And
92:20 so I'm going to show you what the Docker Compose file looks like just that it's
92:24 clear how I'm getting these different values. And so I'll point out all of
92:28 them. So the first one, for our host, it is db because this is the name of the
92:35 specific Supabase service that is the underlying Postgres
92:38 database. And I can show you how I got that really quick. If you go to the
92:42 supabase folder that we pull when we run that start services script, I go to
92:47 docker and then docker-compose. If I search for db, there are quite a few
92:53 dependencies on db here. So let me find the actual reference to it. Where is db?
92:58 Here we go. So yeah, it's really short. db is the name of our service that
93:03 actually is the Supabase DB. So this is the container name, this is what
93:07 you'll see in Docker Desktop. But then this is the underlying service that we
93:10 want to reference when we have our containers communicating with each
93:14 other. Like in this case we have our N8N container talking to our Supabase
93:18 database container. And then the database and username are both going to
93:22 be postgres. Those are the values that we have by default. If you scroll down a
93:26 bit in the .env you can see these right here. The Postgres database is
93:29 postgres and the user is also postgres. And you can customize these
93:33 things but these are some of the optional parameters that I didn't touch
93:36 in the setup with you. And so you can just leave those as is. Now the
93:41 Postgres password, this is one of them that we set. That was the first
93:44 Supabase value that we set there. Make sure you have that from what you have in
93:48 the .env. And then everything else you can kind of leave as the defaults here. The port is
93:53 going to be 5432. So that is everything for setting up our connection to
93:57 Postgres. You can test this connection as well.
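For the Python side of things later on, the equivalent connection looks roughly like this. A sketch, assuming the psycopg2 driver and the default values described above; from another container in the stack the host is the db service, while from your host machine you would typically use localhost and whatever port the stack exposes.

    import os
    import psycopg2

    conn = psycopg2.connect(
        host="db",          # Supabase's Postgres service name inside the Compose network
        port=5432,
        dbname="postgres",
        user="postgres",
        password=os.environ["POSTGRES_PASSWORD"],  # the value you set in your .env
    )

    with conn, conn.cursor() as cur:
        cur.execute("SELECT version();")
        print(cur.fetchone()[0])

    conn.close()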
94:01 And then we can move on to adding in some tools and things like that as well. But yeah, this is like the
94:06 very first basic version of the agent that I wanted to show you. And hopefully
94:10 with this you can see how no matter the service that you have running in the
94:13 local AI package. It's very easy to figure out how to connect to it both
94:17 with the help of N8N because N8N always makes it really easy to connect to
94:20 things. Then also just knowing that like you just have to reference that service
94:25 name that we have for the container in the Docker Compose stack. That's how we
94:28 can talk to it. So you could add in Qdrant or you could add in Langfuse.
94:31 Like you can connect anything that you want into our agent here. And so now we
94:36 have conversation history. Next up, I want to show you how to build a bit more
94:40 of a complicated agent with N8N using some tools. And then also I'm going to
94:44 show you how to connect it to Open Web UI. And so right now this is a live
94:47 demo. Instead of connecting to one of the Ollama LLMs, I'm going straight to
94:53 N8N. I have this custom N8N agent connector. And so we are talking to this
94:57 agent that I'll show you how to build in a little bit. This one has a tool to use
95:02 CRXNG for local and private web search. This is one of the platforms that we
95:06 have included in the local AI package. And so this response is going to take a
95:10 little bit here because it has to search the web. And the response that it
95:13 generates with this question is pretty long. Like there we go. Okay. So we got
95:16 the answer. It's pretty long. But yeah, we are able to search the internet now
95:21 with a local agent. N8N connected to open web UI. We're getting pretty fancy
95:25 here. And we also have the title that was generated on the left. And then we
95:30 have the tags here as well. And so the way that this all works, I'm going to
95:33 start by explaining how we can connect N8N to Open Web UI. And this is just
95:37 crucial. Makes it so easy for us to test agents locally as we are developing
95:42 them. And so if you go to the settings and the admin panel in the bottom left
95:47 and go to functions, open web UI has this thing called functions which gives
95:51 us the ability to add in custom functionality kind of as like custom
95:56 models that we can then use like you saw with the N8N agent connector. And so
96:02 what I have here is this thing that I call the N8N pipe. And I'll have a link
96:05 to this in the description as well. I created this myself and I uploaded it to
96:10 the open web UI directory of functions. And so you can go to this link right
96:14 here. You can even just Google the N8N pipe for open web UI. And then you click
96:19 on this get button. It'll just have you enter in the URL for your open web UI.
96:23 So I can just like paste in this right here. Click on import to open web UI and
96:28 it'll automatically redirect you to your open web UI instance. So you'll have
96:33 this function now. And we don't have to dive into the code for all how how all
96:36 of this works. I worked pretty hard to create this for you. Uh actually quite a
96:40 while ago I made this. And the thing that we need to care about is
96:44 configuring this to talk to our N8N agent. And so if you click on the
96:49 valves, the setting icon in the top right, there are a few values that we
96:54 have to set. And so now I'm going to go over to showing you how to build things
96:57 in N8N. Then all of this will click and it'll make sense. I right now looking at
97:00 these values, you're probably like, how the heck do I get all of these? But
97:02 don't worry, we'll dive into all of that. But first, let's go into our N8N
97:07 agent. I'll explain how all of this works. So, first of all, we have our
97:12 chat trigger that gives us the ability to communicate with our agent very
97:16 easily in the workflow. We have a new trigger now for the web hook. And so,
97:22 this is turning our agent into an API endpoint. So, we're able to talk to it
97:27 with other services like open web UI. And so to configure the web hook here,
97:31 you want to make sure that it is a post request type. And then you can define a
97:35 custom path here. Whatever you set here is going to determine what our URL is.
97:40 So we have our test URL. And then also if you toggle the workflow to active,
97:44 this is really important. The workflow in N8N does have to be active. Then you
97:49 have access to this production URL. And this is actually the first value that we
97:54 need to set within the valves for this open web UI function. We have our N8N
97:59 URL. And because this is a container talking to another container, we don't
98:03 actually want to use this localhost value that it has here for us. We want
98:08 to specify N8N because N8N again is the name of the service running the N8N
98:13 container in our Docker Compose stack. So N8N port 5678. And then this is the
98:18 custom URL that we can determine based on this. And then the other thing that
98:23 we want to do is set up header authentication. We don't want to expose
98:27 this endpoint without any kind of security. And so we want to set up some
98:31 authentication. And so you can select header auth from the authentication
98:34 dropdown. And then for the credentials here, I'll just create brand new ones to
98:38 show you what this looks like. The name needs to be authorization with a capital
98:43 A. This has to be very specific. The name in the top left and the name of
98:46 your credentials. This can be whatever you want, but this has to be
98:51 authorization. And then the value here, the way that we want to format this is
98:55 it's going to be Bearer, then a space, and then whatever you want
99:00 your bearer token to be. So this is what you get to define, but it needs to start
99:05 with Bearer, capital B, and a space. And then whatever you type after Bearer and the
99:09 space, that goes in as the N8N bearer token. So you don't include Bearer and a
99:13 space there because it's just assumed that it's going to be like that.
99:16 It's going to be prefixed with that. So you just type in, like, test auth is what I
99:21 have. So my bearer token is Bearer test auth, like that. And then this is what I
99:24 enter in for this field. Now I already have mine set up. So I'm just going to
99:27 go ahead and close out of this. And then the last thing that we have to set up
99:30 for the web hook. And don't worry, this is the node that we spend the most time
99:33 with. You want to go to the drop down here and change this to respond using
99:38 the respond to web hook node. very important because then at the end of our
99:40 workflow and we get the response from our agent, we're going to send that back
99:45 to whatever requested our API which is going to be open web UI in this case.
99:48 And so that's everything for our configuration for the web hook. Now the
99:53 next thing that we have to do is we have to determine is open web UI sending in a
99:58 request to get a response for our main agent or is it just looking to generate
100:03 that conversation title or the tags for our conversation? Because like we were
100:06 looking at earlier, I'm going to close out of this for now and go back to a
100:10 conversation, our last conversation here. We get our main response, but then
100:15 also there is a request to an LLM to create a very simple title for our
100:19 conversation and the tags that we can see in the top right. And so our N8N
100:24 workflow actually gets invoked three separate times for just the first
100:30 message in a new conversation. And so we need to determine, are we getting a main
100:34 response? Like should we go to our main agent or should we just go to a simple
100:39 LLM that I have set up here to help generate the tags or title? And so the
100:44 way that we can determine that is whenever Open Web UI is requesting
100:47 something like a title for a conversation, it always prefixes the
100:53 prompt with three pound symbols, a space, and then the word task. And so we
100:58 can key off of this. If the prompt starts with this, and that prompt just
101:02 is coming in from our web hook here. If it does start with it, then we're just
101:06 going to go to this simple LLM, we're just going to be using Qwen 2.5 14B
101:11 Instruct. We have no tools, no memory or anything like our main agent because
101:14 we're just very simply going to generate that title or the tags. And I can even
101:19 show you in the execution history what that looks like. So in this case, we
101:22 have our web hook that comes in. The chat input starts with the triple pound
101:28 and task. And so sure enough, we are deeming it to be a metadata request is
101:32 what I'm calling it. And so then it then goes down to this LLM that is just
101:36 generating some text here. We just have this JSON response with the tags for the
101:41 conversation, technology, hardware, and gaming. So we're asking about the price
101:45 of the 5090 GPU. And then we do the exact same thing to also generate the
101:51 title GPU specs. And so exactly what we see here is the title of this last
101:55 conversation. So I hope that makes sense. And then if it doesn't start with
101:59 task and the triple pound, so it's actually a real user request, then we go to our
102:03 main agent. We don't want our main agent to have to handle those super simple
102:06 tasks. You can also just use a really tiny LLM. Like this would be the perfect
102:11 case to actually use a super tiny LLM, even like DeepSeek R1 1.5B. You
102:15 could because it's just such a simple task. Otherwise though we are going to
102:20 go to our main agent. And so I'm not going to dive into like all these nodes
102:24 in a ton of detail, but basically we are expecting the chat input to contain
102:30 the prompt for our agent. And the way that we know to expect chat input
102:35 specifically is because going back to the settings for the function here with
102:38 the valves, we are saying right here chat input. So you want to make sure
102:43 that the value that you put in here for input matches exactly with what you are
102:48 expecting from our web hook. And so chat input is the one that I have by default.
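Just to make that wiring concrete, here is a rough sketch of the request that Open Web UI (or anything else) ends up sending to the webhook. The path and field names below are illustrative, so substitute your own production webhook path, bearer token, and whatever input and output field names you set in the valves.

    import requests

    # Illustrative values only - use your own production webhook path and token.
    N8N_WEBHOOK_URL = "http://n8n:5678/webhook/invoke-n8n-agent"  # "n8n" resolves container-to-container; from the host you'd typically use localhost:5678
    BEARER_TOKEN = "test auth"  # whatever you typed after "Bearer " in the header auth credential

    payload = {
        "chatInput": "What is the current price of the 5090 GPU?",  # must match the input field valve
        "sessionId": "demo-session-1",                              # used for conversation history
    }

    response = requests.post(
        N8N_WEBHOOK_URL,
        json=payload,
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
        timeout=120,  # web search responses can take a while
    )
    print(response.json()["output"])  # must match the output field valve
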
102:51 So you can just copy me if you want. Then we go into our agent where we're
102:55 hooked into Olama and we've got our local Supabase. I already showed you
102:58 how to connect up all this and that looks exactly the same. The only thing
103:01 that is different now is we have a single tool to search the web with
103:07 SearXNG. So it's a web search tool. I have a description here just telling it what
103:11 is going to get back from using this tool. And then for the workflow ID, this
103:16 is if I go to add a node here and I just go for uh workflow tools, call N8N
103:23 workflow tool. So this is basically taking an N8N workflow and using it as a
103:28 tool for our agent. So this is the node that we have right here. But then I'm
103:32 referencing the ID of this N8N workflow. So this ID because I'm going to just
103:37 call the subworkflow that I have defined below. And again, I don't want to dive
103:40 into all the details of N8N right now and how this all works, but the agent is
103:44 going to decide the query. What should I search the web with? It decides that and
103:49 then it invokes this sub workflow here where we have our call to SearXNG. So the
103:54 name of the container service in our Docker Compose stack is just searxng
103:59 and it runs on port 8080. And then if you look at the SearXNG documentation, you
104:03 can look at how to invoke their API and things like this.
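So the request to SearXNG here ends up being something like http://searxng:8080/search?q=your+query&format=json. The q and format=json parameters are the important ones for getting results back as JSON, but check the SearXNG docs for the full list of options rather than taking my exact parameter names on faith.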
104:07 So I'm just doing a simple search here, and then there are a few different nodes because what I want
104:10 to do is I want to split out the search and actually I can show you this by
104:14 going to an execution history where we're actually using this tool. So take
104:18 a look at this. So in this case the LLM decided to invoke this tool and the
104:23 query that it decided is current price of the 5090 GPU. So this is going along
104:28 with the conversation that we had last in open web UI. We get some results from
104:33 SearXNG, which is just going to be a bunch of different websites. And so, we don't
104:37 have the answer quite yet. We just have a bunch of resources that can help us
104:41 get there. And so, I'm going to split out. So, we have a bunch of different
104:45 websites. We're going to now limit to just one. I just want to pull one
104:48 website right now just to keep it really, really simple because now we're
104:52 going to actually visit that website. I'm going to make an HTTP request to
104:57 this website, which yeah, I mean, if it's literally an Nvidia official site
105:01 for the 5090, like this definitely has the information that we need. We're
105:04 going to make a request to it, and then we're also going to use this HTML node
105:08 to make sure that we are only selecting the body of the site. So, we take out
105:12 all the footers and headers and all that junk. So, we just have the key
105:15 information. And then that is what we aggregate and then return back to our AI
105:19 agent. So it now has the content, the core content of this website to get us
105:24 that answer. That is how we invoke our web search tool. And then at the very
105:29 end, we're just going to set this output field. And that's going to be the
105:33 response that we got back either from like generating a title or calling our
105:37 main agent. And this is really important. the output field specifically
105:41 whatever we call it here we have to make sure that that is corresponding to this
105:46 value as the last thing we have to set for the settings for our open web UI
105:50 function. So output here has to match with output here because that is what
105:55 we're going to return in this respond to web hook. Whatever open web UI gets back
105:59 it's getting back from what we return right here. So that is everything for
106:03 our agent. I could probably dive in quite a bit more into explaining how
106:06 this all works and building out a lot more complex agents, which I definitely
106:10 do with local AI in the Dynamis AI agent mastery course. So check that out if you
106:13 are interested. I just wanted to give you a simple example here showing how we
106:17 can talk to our different services like Olama, Supabase, and SearXNG. And then
106:22 also open web UI as well. So once you have all these settings set, make sure
106:26 of course that you click on save. It's very, very important. These two things
106:29 at the bottom don't really matter, by the way. But yeah, click on save once
106:32 you have all of the settings there. And then you can go ahead and have a
106:36 conversation with your agent just like I did when I was demoing things before we
106:40 dove into the workflow. And by the way, this N8N agent that works with Open Web
106:44 UI, I have as a template for you. You can go ahead and download that in this
106:48 GitHub repository where I'm storing all the agents for this master class. So we
106:52 have the JSON for it right here. You can go ahead and download this file. Go into
106:57 your N8N instance. Click on the three dots in the top right once you've
107:01 created a new workflow. Import from file and then you can bring in that JSON
107:03 workflow. You'll just have to set up all your own credentials for things like
107:07 Olama and Supabase and SearXNG. But then you'll be good to go and you can just go
107:11 through the same process that I did setting up the function in open web UI
107:15 and within like 15 minutes you'll have everything up and running to talk
107:20 to N8N in open web UI. Next up I want to create now the Python version of our
107:25 local AI agent. And so this is going to be a one-to-one translation. Exactly what
107:30 we built here in N8N, we are now going to do in Python. So I can show you how to
107:34 work with both no code and code with our local AI package. And so this GitHub
107:39 repo that has the N8N workflow we were just looking at and that OpenAI
107:43 compatible demo we saw earlier, this has pretty much everything for the agent. So
107:46 most of this repository is for this agent that we're about to dive into now
107:50 with Python. And in this readme here, I have very detailed instructions for
107:54 setting up everything. And a lot of what we do with the Python agent, especially
107:57 when we are configuring our environment variables, it's going to look very
108:01 similar to a lot of those values that we set in N8N. Like we have our base URL
108:05 here, which you'd want to set to something like your Olama URL on port
108:09 11434. We just need to add this /v1 on the end, which I guess is a little bit different,
108:14 but yeah, I've got instructions here for setting up all of our environment
108:18 variables, our API key, which you can actually use OpenAI or Open Router as
108:22 well with this agent, taking advantage of the OpenAI API compatibility. This is
108:27 a live example of this because you can change the base URL, API key, and the
108:31 LLM choice to something from Open Router or OpenAI, and then you're good to go
108:35 immediately. It's really, really easy. We will be using Olama in this case, of
108:39 course, though. And then you want to set your Supabase URL and service key.
108:42 You can get that from your environment variables. Same thing with SearXNG with
108:47 that base URL. We'll set that just like we did in N8N. We have our bearer token,
108:51 which in our case was test auth. It's just whatever comes after the Bearer and the
108:55 space. And then the OpenAI API key you can ignore. That's just for the
108:58 compatible demo that we saw earlier. This is everything that we need for our
109:02 main agent now. And so we're using a Python library called FastAPI to turn
109:09 our AI agent into an API endpoint just like we did in N8N. And so FastAPI is
109:13 kind of what gives us this web hook, both with the entry point and the exit for
109:17 our agent, and then everything in between is going to be the logic where we are
109:21 using our agent. And I'm going to be using Pydantic AI. It's my favorite AI
109:25 agent framework with Python right now. Makes it really easy to set up agents
109:29 and so we'll dive into that here. And I don't want to get into the
109:32 nitty-gritty of the Python code here because this isn't a master class on
109:36 specifically building agents. I really just want to show you how we can be
109:40 connecting to our local AI services. This agent is 100% offline. Like I could
109:46 cut the internet to my machine and still use everything here. So we create our
109:50 Supabase client and the instance of our FastAPI app. I have some models
109:55 here that define the requests coming in and the response going out. So we have
109:59 the chat input and the session ID just like we saw in N8N. And then the output
110:04 is going to be this output field. And so that corresponds to exactly what we're
110:07 expecting with those settings that we set up in the function in open web UI.
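As a rough sketch, those request and response shapes are just small Pydantic models, something like this. The field names here mirror what I just described, but check the repo for the exact definitions.

    from pydantic import BaseModel

    class ChatRequest(BaseModel):
        chatInput: str   # the prompt, the same field name Open Web UI sends from the function
        sessionId: str   # used to look up conversation history in Supabase

    class ChatResponse(BaseModel):
        output: str      # has to line up with the output field set in the Open Web UI valves
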
110:11 So this Python agent is also going to work directly with open web UI. And then
110:16 we have some dependencies for our Pydantic AI agent because it needs to have an
110:21 HTTP client and the SearXNG base URL to make those requests for the web search
110:25 tool. And then we're setting up our model here. It's an OpenAI model, but we
110:29 can override the base URL and API key to communicate with Olama or Open Router as
110:34 well, like we will be doing. And then we create our Pydantic AI agent, just getting
110:38 that model based on our environment variables. I've got a very simple system
110:43 prompt and then the dependencies here because we need that HTTP client to talk
110:48 to SearXNG. And then I'm just allowing it to retry twice. So if there's any kind
110:51 of error that comes up, the agent can retry automatically, which is one of the
110:55 really awesome things that we have in Pydantic AI. And then I'm also creating a
111:01 second agent here. This is the agent that is going to be responsible like we
111:05 have in N8N for handling the metadata for open web UI, like conversation titles
111:10 and tags for our conversation. And so it's an entirely separate agent because
111:14 we just have another system prompt. In this case, I'm just doing something
111:18 really simple here. Uh we don't have any dependencies for this agent because it's
111:21 not going to be using the web search tool. And then for the model, I'm just
111:25 using the exact same model that we have for our primary agent. But like I shared
111:30 with N8N, you could make it so that this is like a much smaller model, like a one
111:33 or three billion parameter model because the task is just so basic or maybe like
111:38 a 7 billion parameter model. So you can tweak that if you want. Just for
111:41 simplicity sake, I'm using the same LLM for both of these agents.
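If you want to see roughly what that setup looks like in code, here is a stripped-down sketch. Pydantic AI's import paths and model constructor have shifted a bit between versions, so treat this as illustrative and follow the repo and the Pydantic AI docs for your version.

    from dataclasses import dataclass
    import httpx
    from pydantic_ai import Agent
    from pydantic_ai.models.openai import OpenAIModel
    from pydantic_ai.providers.openai import OpenAIProvider

    @dataclass
    class AgentDeps:
        http_client: httpx.AsyncClient
        searxng_base_url: str  # e.g. http://localhost:8080 when running on the host

    # An "OpenAI" model that actually points at Olama's OpenAI-compatible endpoint.
    model = OpenAIModel(
        "qwen3:14b",
        provider=OpenAIProvider(base_url="http://localhost:11434/v1", api_key="ollama"),
    )

    # Primary agent: gets the web search tool registered on it and can retry on errors.
    primary_agent = Agent(
        model,
        system_prompt="You are a helpful assistant with a web search tool.",
        deps_type=AgentDeps,
        retries=2,
    )

    # Metadata agent: no tools, just generates conversation titles and tags.
    metadata_agent = Agent(
        model,
        system_prompt="Generate a short title or a few tags for the conversation, as requested.",
    )
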
111:46 And then we get to our web search tool. So in Pydantic AI, the way that you give a
111:51 tool to your agent is you do @, then the name of your agent, then
111:55 .tool, and then the function that you define below this is now going to be
112:00 given as a tool to the agent. And then this description that we have in the doc
112:04 string here, that is given as a part of the prompt to your agent. So it knows
112:09 when and how to use this tool. And so the exact, you know, details of how
112:13 we're using SearXNG here, I won't dive into, but it is the exact same as what
112:17 we did in N8N, where we make that request to the search endpoint of SearXNG. We
112:22 go through the page results here. We limit to just the top three results, or I
112:26 could even change this to make it even simpler and use just the top result, so we
112:30 have the smallest prompt possible for the LLM. And then we get the content of
112:34 that page specifically. And then we return that to our AI agent as some
112:40 JSON. So now once it invokes this tool, it has a full page back with
112:44 information to help answer the question from the user. It has that web search
112:47 complete now.
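Building on the sketch above, the tool itself would look roughly like this, hitting SearXNG's /search endpoint with format=json. The real version in the repo does a bit more, like fetching the page content and trimming it down before handing it back to the agent.

    from pydantic_ai import RunContext

    @primary_agent.tool
    async def web_search(ctx: RunContext[AgentDeps], query: str) -> str:
        """Search the web with SearXNG and return the top result's title, URL, and snippet."""
        resp = await ctx.deps.http_client.get(
            f"{ctx.deps.searxng_base_url}/search",
            params={"q": query, "format": "json"},
        )
        results = resp.json().get("results", [])[:1]  # keep the prompt small: top result only
        return "\n".join(
            f"{r.get('title', '')}: {r.get('url', '')}\n{r.get('content', '')}" for r in results
        )
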
112:51 And then we have some security here to make sure that the bearer token matches what comes into our API endpoint. So that's that header
112:55 authentication that we set up in N8N. So this part right here where we're
112:59 verifying the header authentication that corresponds to this verify token
113:03 function. And then we have a function to fetch conversation history to store a
113:08 new message in conversation history. So both of these are just making requests
113:12 to our locally hosted Supabase using that Supabase client that we created
113:16 above. And then we have the definition for our actual API endpoint. And so in
113:23 N8N we were using invoke N8N agent for our path to our agent. So this was our
113:29 production URL. In this FastAPI endpoint, our endpoint is /invoke-
113:34 python-agent. And then we're specifically expecting the chat input
113:39 and session ID. So that is our chat request type right here. And
113:43 then sorry I highlighted the wrong thing. We have our response model here
113:46 that has the output field. So we're defining the exact types for the inputs
113:50 and the outputs for this API endpoint. And then we're also using this verify
113:55 token to protect our endpoint at the start. And then the key thing here, if
113:59 the chat input starts with that task, then we're going to call our metadata
114:02 agent. And so it's just going to spit out the title or the tags, whatever that
114:06 might be. Otherwise, we're going to fetch the conversation history, format
114:11 that for Pydantic AI, store the user's message so that we have that
114:15 conversation history stored, create our dependencies so that we can communicate
114:20 with SearXNG, and then we'll just do agent.run. We'll pass in the latest
114:24 message from the user, the past conversation history and the
114:27 dependencies that we created. So it can use those when it invokes the web search
114:31 tool, and then we just get the response back from the agent. You can
114:34 print that out in the terminal as well, and then we'll just store it in
114:38 Supabase and then return the output field. So I'm going kind of fast here. There
114:42 are definitely a lot more videos on my channel where I break down in more
114:45 detail building out agents with Pydantic AI and turning them into API endpoints
114:49 and things like that. But yeah, just going a little bit faster here.
114:52 And then the last thing is with any kind of exception that we encounter, we're
114:55 just going to return a response to the front end saying that there was an issue
114:59 and then specifying what that is. And then we are using Uvicorn to host our
115:06 API endpoint specifically on port 8055. So that is everything for our Python
115:11 agent exactly the same as what we set up in N8N. And now going to the readme
115:16 here, I'll open up the preview. The way that we can run this agent, we just have
115:20 to open up a terminal just like we did with the OpenAI compatible demo. I've
115:25 got instructions here for um setting up the database table, which this is using
115:30 the same table as the one in N8N and N8N creates it automatically. So, if you
115:34 have already been using the N8N agent, you don't actually have to run this SQL
115:38 here. And then you want to obviously set up your environment variables like
115:42 we covered. Open your virtual environment and install the requirements
115:46 there. And then you can go ahead and run the command python main.py.
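For reference, the core of main.py boils down to something like this sketch, reusing the models and agents from the sketches above. It's simplified, with a hypothetical BEARER_TOKEN environment variable name, and the real file also does the Supabase history fetching and storing that I just described.

    import os
    from fastapi import FastAPI, Depends, HTTPException
    from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
    import uvicorn

    app = FastAPI()
    security = HTTPBearer()

    def verify_token(creds: HTTPAuthorizationCredentials = Depends(security)) -> None:
        # Same header auth idea as the N8N webhook: the token after "Bearer " must match.
        if creds.credentials != os.getenv("BEARER_TOKEN", "test auth"):
            raise HTTPException(status_code=401, detail="Invalid bearer token")

    @app.post("/invoke-python-agent", response_model=ChatResponse)
    async def invoke_python_agent(req: ChatRequest, _: None = Depends(verify_token)):
        if req.chatInput.startswith("### Task"):
            # Open Web UI asking for a title or tags, not a real user message.
            result = await metadata_agent.run(req.chatInput)
        else:
            result = await primary_agent.run(req.chatInput)  # real version also passes history and deps
        return ChatResponse(output=result.output)  # older Pydantic AI versions use result.data

    if __name__ == "__main__":
        uvicorn.run(app, host="0.0.0.0", port=8055)
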
115:52 And so this will start the API endpoint. So it'll just be hanging here because
115:56 now it's waiting for requests to come in on a port 8055. And so what I can do is
116:03 I can go back to open web UI. I can go to the admin panel functions. Go to the
116:08 settings. I can now change this URL. So everything else is the same. I have my
116:11 bearer token, the input field, and the output field the same as N8N. The only
116:16 thing I have to change now is my URL. And so I know this is an N8N pipe and I
116:20 have N8N in the name everywhere, but this does work with just any API
116:23 endpoint that we have created with this format here. And so I'm going to say for
116:27 my URL, it's actually going to be host.docker.internal because I have my API endpoint for
116:33 Python running outside on my host machine. So I need my open web UI
116:38 container to go outside to my host machine. And then specifically the port
116:43 is going to be 8055. And then for the endpoint here, I'm going to
116:46 delete this webhook path because now it's invoke-python-agent. Take a look at that. All right.
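So with those values, the full URL in the valves ends up looking something like http://host.docker.internal:8055/invoke-python-agent, assuming you kept the default port from main.py.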
116:52 Boom. So I'm going to go ahead and save this. And then I can go over to my chat.
116:57 And it says N8N agent connector here still. But this is actually talking to
117:00 my Python agent now. So I'll go ahead and start by asking it the exact same
117:05 question that I asked the N8N agent. And I do have this pipe set up to always say
117:08 that it's calling N8N, but this is indeed calling our Python API endpoint.
117:12 And we can see that now. So there we go. We got all the requests coming in, the
117:16 response from the agent, and then also the metadata for the title and the tags
117:20 for the conversation. Take a look at that. So we got our title here. We have
117:24 our tags, and then we have our answer. It's a starting price of $2,000, though
117:28 it's a lot more right now. The starting price is kind of
117:32 misleading, but yeah, this is a good answer. And it did use SearXNG to do
117:36 that web search for us. This is really, really neat. Now, the last thing that we
24:30 for what you can run. So, to go along with that information overload, I want
24:34 to give you some specifics, individual LLMs that you can try right now based on
24:38 the size range that you know will work for your hardware. So, just a couple of
24:42 recommendations here. The first one that I want to focus on is Deepseek R1. This
24:47 is the most popular local LLM ever. It completely blew up a few months ago. And
24:52 the best part about DeepSeek R1 is they have an option that fits into each of
24:56 the size ranges that I just covered in that chart. So they have a 7 billion
25:00 parameter, 14, 32, and 70. The exact numbers that I mentioned earlier. And
25:05 then there is also the full real version of R1, which is 671 billion parameters.
25:10 I'm sorry though, you probably don't have the hardware to run that unless
25:13 you're spending tens of thousands on your infrastructure. So, probably stick
25:16 with one of these based on your graphics card or if you have a Mac computer, pick
25:19 the one that'll work for you and just try it out. You can click on any one of
25:23 these sizes here. And then here's your command to download and run it. And this
25:28 is defaulting to a Q4 quantization, which is what I was assuming in the
25:31 chart earlier. And again, I will cover what that actually means in a little bit
25:35 here. The other one that I want to focus on here is Qwen 3. This is a lot newer.
25:41 Qwen 3 is so good. And they don't have a 70 billion parameter option, but they do
25:45 have all the other sizes that fit into those ranges that I mentioned
25:49 earlier. Like they got 8 billion, 14 billion, and 32 billion parameters. And
25:52 the same kind of deal where you click on the size that you want and you've got
25:55 your command to install it here. And this is a reasoning LLM just like
26:01 DeepSeek R1. And then the other one that I want to mention here is Mistral Small.
26:05 I've had really good results with this as well. There are less options here,
26:08 but you've got 22 or 24 billion parameters, which is going to work well
26:12 with a 3090 graphics card or if you have a Mac M4 Pro with 24 GB of unified
26:18 memory. Really, really good model. And then also, there is a version of it that
26:22 is fine-tuned for coding specifically called Devstral, which is another
26:26 really cool LLM worth checking out as well if you have the hardware to run it.
26:30 So, that is everything for just general recommendations for local LLMs to try
26:34 right now. This is the part of the master class that is going to become
26:38 outdated the fastest because there are new local LLMs coming out every single
26:42 month. I don't really know how long my recommendations will last for. But in
26:45 general, you can just go to the model list in Olama, search around,
26:49 find one that has a size that works with your graphics card, and just give it
26:52 a shot. You can install it and run it very easily with Olama. And the other
26:57 thing that I want to mention here is you don't always have to run open- source
27:01 large language models yourself. You can use a platform like Open Router. You can
27:05 just go to open router.ai, sign up, add in some API credits. You can try these
27:10 open-source LLMs yourself. Maybe if you want to see what's powerful enough for
27:15 your agents before you invest in hardware to actually run them yourself.
27:18 And so within Open Router, I can just search for Qwen here. And I can go down
27:23 to Qwen and I can go to 32 billion. They have a free offering as well that
27:26 doesn't have the best rate limits. So I'll just go to this one right here,
27:31 Qwen 3 32B. So I can try the model out through Open Router. They actually host
27:35 it for me. So it's an open-source, non-local version, but now I can try it
27:39 in my agents to see if this is good. And then if it's good, it's like, okay, now
27:43 I want to buy a 3090 graphics card so that I can install it directly through
27:47 Olama instead. And so the 32 billion Qwen 3 is exactly what we're seeing here
27:51 in Open Router. And there are other platforms like Groq as well where you
27:55 can run these open-source large language models not on your own infrastructure
27:58 if you just want to do some testing beforehand or whatever that might
28:01 be. So I wanted to call that out as an alternative as well. But yeah, that's
28:04 everything for my general recommendations for LLMs to try and use
28:09 in your agents. All right, it is time to take a quick breather. This is
28:12 everything that we've covered already in our master class. What is local AI? Why
28:17 we care about it? Why it's the future and hardware requirements. And I really
28:20 wanted to dive deep into this stuff because it sets the stage for everything
28:24 that we do when we actually build agents and deploy our infrastructure. And so
28:28 the last thing that I want to do with you before we really start to get into
28:33 building agents and setting up our package is I want to talk about some of
28:37 the tricky stuff that is usually pretty daunting for anyone getting into local
28:42 AI. I'm talking things like offloading models, quantization, environment
28:47 variables to handle things like uh flash attention, all the stuff that is really
28:51 important that I want to break down simply for you so you can feel confident
28:55 that you have everything set up right, that you know what goes into using local
29:00 LLMs. The first big concept to focus on here is quantization. And this is
29:04 crucial. It's how we can make large language models a lot smaller so they
29:10 can fit on our GPUs without hurting performance too much. We are lowering
29:14 the model precision here. And so what basically what that means is we have
29:18 each of our parameters, all of our numbers for our LLMs that are 16 bits
29:23 with the full size, but we can lower the precision of each of those parameters to
29:28 8, four, or two bits. Don't worry if you don't understand the technicalities of
29:31 that. Basically, it comes down to LLMs are just billions of numbers. That's the
29:35 parameters that we already covered. And we can make these numbers less precise
29:40 or smaller without losing much performance. So, we can fit larger LLMs
29:45 within a GPU that normally wouldn't even be close to running the full-size model.
29:50 Like with 32 billion parameter LLMs, for example, I was assuming a Q4
29:55 quantization like four bit per parameter in that diagram earlier. If you had the
30:00 full 16 bit parameter for the 32 billion parameter LLM, there's no way it could
30:06 fit on your Mac or your 3090 GPU, but we can use quantization to make it
30:10 possible. It's like rounding a number that has a long decimal to something
30:15 like 10.44 instead of this thing that has like 10 decimal points, but we're
30:19 doing it for each of the billions of parameters, those numbers that we have.
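To put rough numbers on it: a 32 billion parameter model at 16 bits is about 32 billion times 2 bytes, so roughly 64 GB of weights, while the same model at Q4 is about 32 billion times 0.5 bytes, so roughly 16 to 20 GB once you include some overhead. That is the difference between impossible and comfortable on a 24 GB card.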
30:23 And so just to give you a visual representation of this, you can also
30:27 quantize images just like you can quantize LLMs. And so we have our full
30:31 scale image on the left-hand side here, comparing it to different levels of
30:35 quantization. We have 16-bit, 8-bit, and 4-bit. And you can see that at first with
30:40 a 16-bit quantization, it almost looks the same. But then once we go down to
30:44 4-bit, you can very much see that we have a huge loss in quality for the image.
30:49 Now with images, it's more extreme than LLMs. When we do an 8-bit or a 4-bit
30:54 quantization, we don't actually lose that much performance the way we lose a lot
30:58 of quality with images. And so that's why it's so useful for us. And so I have
31:01 a table just to kind of describe what this looks like. So FP16, that's the
31:07 16-bit precision that all LLMs have as a base. That is the full size. The speed
31:11 is obviously going to be very slow because the model is a lot bigger, but
31:16 your quality is perfect compared to what it could be. I mean, obviously that
31:18 doesn't mean that you're going to get perfect answers all the time. I'm just
31:22 saying it's the 100% results from this LLM. And then going down to a Q8
31:28 precision, so it's half the size. The speed is going to be a lot better. And
31:33 the quality is near-perfect. So it's not like performance is cut in half just
31:37 because size is. You still have the same number of parameters. Each one is just a
31:42 bit less precise. And so you're still going to get almost the same results.
31:47 And then going down to a Q4, 4-bit, it's a fourth the size. It's going to be very
31:52 fast compared to 16 bit. And the quality is still going to be great. Now, these
31:57 numbers are very vague on purpose. There's not really a way for me to
32:01 qualify exactly the difference, especially because it changes per LLM
32:05 and your hardware and everything like that. So, I'm just being very general
32:09 here. And then once you get to Q2, um the size goes down a lot. It's going to
32:13 be very very fast, but usually your performance starts to go down quite a
32:17 bit once you go down to a Q2. And then like the note that I have in the bottom
32:22 left here, a Q4 quantization is generally the best balance. And so when
32:26 you are thinking to yourself, which large language model should I run? What
32:31 size should I use? My rule of thumb is to pick the largest large language model
32:37 that can work with your hardware with a Q4 quantization. That is why I assumed
32:42 that in the table earlier. And then also like we saw in Olama earlier, it always
32:47 defaults to a Q4 quantization because the 16 bit is just so big compared to Q4
32:52 that most of the LLMs you couldn't even run yourself. And a Q4 of a 32 billion
32:59 parameter model is still going to be a lot more powerful than the full 7
33:02 billion parameter or 14 billion parameter because you don't actually
33:07 lose that much performance. So that is quantization. So just to make this very
33:11 practical for you, I'm back here in the model list for Qwen 3. We have all these
33:15 models that don't specify a quantization, but we can see that it
33:20 defaults to Q4 because if I click on any one of them, the quantization right here
33:26 is a Q4_K_M. And don't worry about the K_M. That's just a way to group
33:30 parameters. You have K_S, K_M, and K_L. It's kind of outside of the scope of
33:33 what really matters for you. The big thing is the Q4, like the actual number
33:38 here. So Q4 quantization is the default for Qwen 3 32B and really any model in
33:44 Olama. But if we want to see the other quantized variants and we want to run
33:48 them, you can click on the view all. This is available no matter the LLM that
33:52 you're seeing in Olama. Now we can scroll through and see all the levels of
33:56 quantization for each of the parameter sizes for Qwen 3. So, if I scroll all
34:01 the way down, the absolute biggest version of Qwen 3 that I can pull is the
34:08 full 16-bit of the 235 billion parameter Qwen 3. And it is a whopping 470 GB just
34:14 to install this. And there is no way that you're ever going to lay hands on
34:17 infrastructure to run this unless you're working for a very large enterprise. But
34:22 I can go down here, let's say, to 14 billion parameters and I can run the Q4
34:27 like this. So, you can click on any one that you want to run. Like let's say I
34:30 want to run Q8. I can click on this and then I have the command to pull and run
34:35 this specific quantization of the 14 billion parameter model.
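For example, that command ends up looking something like: ollama run qwen3:14b-q8_0. But copy the exact tag from the model page rather than trusting my guess here, since the tag names are specific to each model and quantization.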
34:39 So each of the quantized variants has a unique ID within Olama. So you can very
34:42 specifically choose the one that you want. Again my general recommendation is
34:47 just to go with also what Olama recommends which is just defaulting to
34:51 Q4. Like if I go to DeepSeek R1, you can see that also defaults to Q4 no matter
34:56 the size that I pick. But if you do want to explore different quantizations, you
35:00 want to try to run the absolute full model for maybe something smaller like 7
35:04 billion or 14 billion, you can definitely do that through Olama and
35:08 really any other provider of local LLMs. So that is everything for quantization.
35:12 It's important to know how that works, but yes, generally stick with a Q4 of
35:17 the largest LLM that you can run. The next concept that is very important to
35:21 understand is offloading. Offloading is just splitting the layers for your
35:26 large language model between your GPU and your CPU and RAM. It's kind of
35:30 crazy, but large language models don't have to fit entirely in your GPU. All
35:36 large language models can be split into layers, layers of the different weights,
35:40 and you can have some of it running on your GPU. So, it's stored in your VRAM
35:45 and computed by the GPU. And then some of the large language models stored in
35:50 your RAM, computed by the CPU. Now, this does hurt performance a lot. And so,
35:55 generally, you want to avoid offloading if you can. You want to be able to fit
35:59 everything in your GPU, which by the way, the context, like your prompts for
36:04 your local LLMs, that is also stored in VRAM. And so, sometimes you'll see what
36:08 happens when you have very long conversations for a large language model
36:13 that barely fit in your GPU. That'll actually tip it over the edge. So, it
36:16 starts to offload some of it to the CPU and RAM. So keep that in mind when you
36:19 have longer conversations and all of a sudden things get really slow, you know
36:24 that offloading is happening. Sometimes this is necessary though as context
36:28 grows. And if you're only offloading a little bit of the LLM or a little bit of
36:32 the conversation, whatever to the CPU and RAM, it won't affect performance
36:36 that much. And so sometimes if you're trying to squeeze the biggest size you
36:41 can into your machine for an LLM, you can take advantage of offloading to run
36:45 something bigger or have a much larger conversation. Just know that usually it
36:49 kind of sucks. Like when I have offloading start to happen, my machine
36:53 gets bogged down and the responses are a lot slower. It's really not fun, but it
36:59 is possible. And fun fact, by the way, if your GPU is full and your CPU and RAM
37:04 is full, you can actually offload to storage, like literally using your hard
37:07 drive or SSD. That's when it's like incredibly slow and just terrible. But
37:11 just fun fact, you can actually do that. Now, the very last thing that I want to
37:15 cover before we dive into some code, setting up the local AI package, and
37:21 building out some agents is a few very crucial parameters, environment
37:25 variables for Olama. So, these are environment variables that you can set
37:28 on your machine just like any other based on your operating system. And
37:32 Olama does have an FAQ for setting up some of these things, which I'll link to
37:36 in the description as well. But yeah, these are a bit more technical, so
37:40 people skip past setting this stuff up a lot, but it's actually really, really
37:44 important to make things very efficient when running local LLMs. So the first
37:49 environment variable is flash attention. You want to set this to one or true.
37:54 When you have this set to true, it's going to make the attention calculation
37:59 a lot more efficient. It sounds fancy, but basically large language models when
38:04 they are generating a response, they have to calculate which parts of your
38:08 prompt to pay the most attention to. That's the calculation. And you can make
38:12 it a lot more efficient without losing much performance at all by setting up
38:16 the flash attention, setting that to true. And then for another optimization,
38:21 just like we can quantize the LLM itself, you can also quantize or
38:27 compress the context. So your system prompt, the tool descriptions, your
38:31 prompt and conversation history, all that context that's being sent to your
38:36 LLM, you can quantize that as well. So Q4 is my general recommendation for
38:41 quantizing LLMs. Q8 is the general recommendation for quantizing the
38:46 context memory. It's a very simplified explanation, but it's really, really
38:50 useful because a long conversation can also take a lot of VRAM just like larger
38:55 LLM. And so it's good to compress that. And then the third environment variable,
38:58 this is actually probably the most crucial one to set up for Olama. There
39:02 is this crazy thing. I don't know why Olama does it, but by default, they
39:07 limit every single large language model to 2,000 tokens for the context limit,
39:13 which is just tiny compared to, you know, Gemini being 1 million tokens and
39:17 Claude being 200,000 tokens. Like, they handle very, very large prompts. And a
39:21 lot of local large language models can also handle large prompts. But Olama
39:25 will limit you by default to 2,000 tokens. And so you have to override that
39:30 yourself with this environment variable. And so generally I recommend starting
39:34 with about 8,000 tokens to start. You can move this all the way up to
39:38 something like 32,000 tokens if your local large language model supports
39:42 that. And if you view the model page on Olama, you can see the context length
39:46 that's supported by the LLM. But you definitely want to, you know, jack this
39:50 up more from just 2,000 because a lot of times when you have longer
39:53 conversations, you're going to get past 2,000 tokens very, very quickly. So, do
39:57 not miss this. If your large language model is starting to go completely off
40:02 the rails and ignore your system prompt and forget that it has these tools that
40:06 you gave it, it's probably because you reached the context length. And so, just
40:10 keep that in mind. I see people miss this a lot. And then the very last
40:14 environment variable, uh, probably the least important out of all these four,
40:18 but if you're running a lot of different large language models at once and you're
40:22 trying to shove them all in your GPU, a lot of times you can have issues. And so
40:25 in Olama, you can limit the number of models that are allowed to be in your
40:29 memory at a single time. With this one, typically you want to set this to either
40:33 one or two. Definitely set this to just one if you are using large language
40:37 models that basically fill your GPU, like it's going to fit exactly into
40:41 your VRAM and you're not going to have room for another large language model.
40:44 But if you are running smaller ones and maybe you could actually fit two on
40:48 your GPU with the VRAM that you have, you can set this to two. So again, this is more
40:52 technical overall, but it's very important to have these right. And we'll
40:55 get into the local AI package where I already have these set up in the
40:59 configuration.
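To make those four settings concrete, they end up being roughly OLLAMA_FLASH_ATTENTION=1, OLLAMA_KV_CACHE_TYPE=q8_0, OLLAMA_CONTEXT_LENGTH=8192, and OLLAMA_MAX_LOADED_MODELS=1. I'm going from recent Olama versions here, so double-check the exact variable names against the FAQ before you rely on them.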
41:02 And then by the way, this is the Olama FAQ that I referenced a minute ago that I'll have linked in the description. And so there's actually a
41:06 lot of good things to read into here. um like being able to verify that your GPU
41:10 is compatible with Olama. How can you tell if the model's actually loaded on
41:13 your GPU? So, a lot of like sanity check things that they walk you through in the
41:17 FAQ as well. Also talking about environment variables, which I just
41:20 covered. And so, they've got some instructions here depending on your OS
41:23 how to get those set up. So, if there's anything that's confusing to you, this
41:26 is a very good resource to start with. So, I'm trying to make it possible for
41:30 you to look into things further if there's anything that doesn't quite make
41:33 sense for what I explained here. And of course, always let me know in the
41:35 comments if you have any questions on this stuff as well, especially the more
41:39 technical stuff that I just got to cover because it's so important even though I
41:43 know we really want to dive into the meat of things, which we are actually
41:47 going to do now. All right, here is everything that we have covered at this
41:50 point. And congratulations if you have made it this far because I covered all
41:55 the tricky stuff with quantization and the hardware requirements and offloading
41:59 and some of our little configuration and parameters. So, if you got all of that,
42:03 the rest of it is going to be a walk in the park as we start to dive into code,
42:07 getting all of our local AI set up and building out some agents. You understand
42:10 the foundation now that we're going to build on top of to make some cool stuff.
42:15 And so, now the next thing that we're going to do is talk about how we can use
42:19 local AI anywhere. We're going to dive into OpenAI compatibility and I'll show
42:23 you an example. We can take something that is using OpenAI right now,
42:27 transform it into something that is using Olama and a local LLM. So, we'll
42:31 actually dive into some code here. And I've got my fair share of no code stuff
42:35 in this master class as well, but I want to focus on both because I think it's
42:38 really important to use both code and no code whenever applicable. And that
42:42 applies to local AI just like building agents in general. So, I've already
42:45 promised a couple of times that I would dive into OpenAI API compatibility, what
42:50 it is, and why it's so important. And we're going to dive into this now so you
42:54 can really start to see how you can take existing agents and transform them into
42:59 being 100% local with local large language models without really having to
43:03 touch the code or your workflow at all. It is a beautiful thing because OpenAI
43:10 has created a standard for exposing large language models through an API.
43:14 It's called the chat completions API. It's kind of like how model context
43:19 protocol MCP is a standard for connecting agents to tools. The chat
43:23 completions API is a standard for exposing large language models over an
43:28 API. So you have this common endpoint along with a few other ones that all of
43:35 these providers implement. This is the way to access the large language model
43:39 to get a response based on some conversation history that you pass in.
43:43 So, Olama has implemented this as of February. We have other providers like
43:49 Gemini that are OpenAI compatible. Groq is, and Open Router, which we saw earlier, is too.
43:53 Almost every single provider is OpenAI API compatible. And so, not only is it
43:57 very easy to swap between large language models within a specific provider, it's
44:02 also very easy to swap between providers entirely. You can go from Gemini to
44:09 OpenAI, or OpenAI to Olama, or OpenAI to Groq, just by changing basically one
44:13 piece of configuration pointing to a different base URL as it is called. So
44:18 you can access that provider and then the actual API endpoint that you hit
44:22 once you are connected to that specific provider is always the exact same and
44:26 the response that you get back is also always the exact same. And so Olama has
44:31 this implemented now. And I'll link to this article in the description as well
44:33 if you want to read through this because they have a really neat Python example.
44:37 It shows where we create an OpenAI client and the only thing we have to do
44:42 to connect to Olama instead of OpenAI is change this base URL. So now we are
44:47 pointing to Olama that is hosted locally instead of pointing to the URL for
44:51 OpenAI. So we'd reach out to them over the internet and talk to their LLMs. And
44:55 then with Olama, you don't actually need an API key because everything's running
44:58 locally. So you just need some placeholder value here. But there is no
45:02 authentication that is going on. You can set that up. I'm not going to dive into
45:05 that right now. But by default, because it's all just running locally, you don't
45:09 even need an API key to connect to Olama. And then once we have our OpenAI
45:13 client set up that is actually talking to Olama, not OpenAI, we can use it in
45:18 exactly the same way. But now we can specify a model that we have downloaded
45:22 locally already through Olama. We pass in our conversation history in the same way
45:27 and we access the response like the content the AI produced the token usage
45:31 like all those things that we get back from the response in the same way.
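In code, that whole idea is just a couple of lines. A minimal sketch, assuming Olama is running locally with qwen3:14b already pulled:

    from openai import OpenAI

    # Point the standard OpenAI client at Olama's OpenAI-compatible endpoint.
    client = OpenAI(
        base_url="http://localhost:11434/v1",
        api_key="ollama",  # required by the client but ignored by Olama
    )

    response = client.chat.completions.create(
        model="qwen3:14b",  # any model you've pulled locally
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantization in one sentence."},
        ],
    )
    print(response.choices[0].message.content)

    # Streaming works the same way: pass stream=True and iterate over the returned chunks.
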
45:34 They've got a JavaScript example as well. They have a couple of examples
45:38 using different frameworks like the Vercel AI SDK and AutoGen. Really any
45:44 AI agent framework can work with OpenAI API compatibility to make it very easy
45:47 to swap between these different providers. Like Pydantic AI, my favorite
45:52 AI agent framework, also supports OpenAI API compatibility. So you can easily
45:57 within your Pydantic AI agents swap between these different providers. And
46:02 so what I have for you now is two code bases that I want to cover. The first
46:07 one is the local AI package, which we'll dive into in a little bit. But right
46:13 now, we have all of the agents that we are going to be creating in this master
46:17 class. So I have a couple for N8N that are also available in this repository.
46:21 And then a couple of scripts that I want to share with you as well. And so the
46:25 very first thing that I want to show you is this simple script that I have called
46:31 OpenAI compatible demo. And so you can download this repository. I'll have this
46:34 linked in the description as well. There's instructions for downloading and
46:38 setting up everything in here. And this is all 100% local AI. And so with that,
46:43 I'm going to go over into my windsurf here where I have this OpenAI compatible
46:47 demo set up. So I've got a comment at the top reminding us what the OpenAI API
46:52 compatibility looks like. We set our base URL to point to Olama hosted
46:59 locally and it's hosted on port 11434 by default. So I can actually show you
47:02 this. I have Olama running in a Docker container, which we're going to dive
47:05 into this when we set up the local AI package, but you can see that it is
47:11 being exposed on port 11434. And by the way, you can see the
47:14 127.0.0.1 in that URL that I have highlighted here, that is synonymous with localhost.
47:21 And so this right here, you could also replace with 127.0.0.1.
47:25 Just a little tidbit there. It's not super important. I just typically leave
47:28 it as localhost. And then you can change the port as well. I'm just sticking to
47:32 what the default is. And then again, we don't need to set our API key. We can
47:36 just set it to any value that we want here. We just need some placeholder even
47:39 though there is no authentication with a llama for real unless you configure
47:43 that. So that's OpenAI compatibility. And the important thing with this script
47:47 here is I have two different configurations here. I have one for
47:51 talking to OpenAI and then one for Olama. So with OpenAI, we set our base
47:57 URL to point to api.openai.com. We have our OpenAI API key set in our
48:01 environment variables. So you can just set all your environment variables here
48:05 and then rename this to env. I've got instructions for that in the readme of
48:08 course. And then going back to the script, we are using GPT4.1 nano for our
48:13 large language model. There's something super fast and cheap. And then for our
48:17 Olama configuration, we are setting the base URL here, localhost:11434,
48:22 or just whatever we have set in our environment variables. Same thing for
48:26 the API key. And then same thing for our large language model. And what I'm going
48:31 to be using in this case is Qwen 3 14B. That is one of the large language models
48:34 that I showed you within the Olama website. Definitely a smaller one
48:38 compared to what I could run, but I just want to run something fast. And very
48:41 small large language models are great for simple tasks like summarization or
48:45 just basic chat. And that's what I'm going to be using here just for a simple
48:49 demo. And so whether it's enabled or not, this configuration is just based on
48:53 what we have set for our environment variables. And the important thing here
48:59 is the code that runs for each of these configurations just as we go through
49:03 this demo is exactly the same. We are parameterizing the configuration for the
49:08 base URL and API key. So we are setting up the exact same OpenAI client just
49:13 like we saw in the Olama article but just changing the base URL and API key.
49:17 And so then for example when we use it right here, it's client.chat.
49:22 completions.create, calling the exact same function no matter if we're
49:26 using OpenAI or Olama. And then we're handling the response in the same way as
49:31 well. And so I'll go back to my terminal now. And so I went through all the steps
49:34 already to set up my virtual environment, install all of my dependencies. And so now I can run the
49:40 command OpenAI compatible demo. And now it's going to present the two
49:43 configuration options for me. And so I can run through OpenAI. So we'll go
49:46 ahead and do that first. And these two demos are going to look exactly the
49:50 same, but that is the point. And so we have our base URL here for OpenAI. We
49:54 have a basic example of a completion with GPT4.1 Nano. There we go. So this
49:59 is the model that was used. Here are the number of tokens. And this is our
50:03 response. And then I can press enter to see a streaming response now as well. So
50:07 we saw it type out our answer in real time. And then I can press enter one
50:10 more time. This is the last part of the demo. Just say multi-turn conversation.
50:14 So we got a couple of messages here in our conversation history. So very nice
50:19 and simple. The point here is to now show you that I can run this and select
50:23 Olama now instead and everything is going to look exactly the same and all
50:27 of the code is the same as well. It is only our configuration that is
50:31 different. And so it will take a little bit when you first run this because
50:36 Olama has to load the large language model into your GPU. And so going to the
50:42 logs for Olama, I can show you what this looks like here. And so when we first
50:47 make a request when Qwen 3 14B is not loaded into our GPU yet, you're going to
50:52 see a lot of logs come in here, and you'll have this container up and
50:54 running when you have the local AI package which we'll cover in a little
50:57 bit. So it shows all the metadata about our model, like it's Qwen 3 14B. We can
51:05 see here that we have a Q4_K_M quantization like we saw on the Olama
51:09 website. What other information do we have here? There's just so much to
51:14 digest here. Yeah, another really important thing is we have the
51:19 context length. I have that set to 8,192 just like I recommended in the
51:22 environment variables. And then we can see that we offloaded all of the layers
51:26 to the GPU. So I don't have to do any offloading to the CPU or the RAM. I can
51:30 keep everything in the GPU, which is certainly ideal, like I said, to make
51:34 sure this is actually fast. And then when we get a response from Qwen 3 14B,
51:41 we are calling the v1/chat/completions endpoint because it is OpenAI API
51:46 compatible. So that exact endpoint that we hit for OpenAI is the one that we are
51:50 hitting here with a large language model that is running entirely on our computer
51:54 in Olama. And so the response I get back, it's actually a reasoning LLM as
51:58 well. So we even have the thinking tokens here, which is super cool. And so
52:02 we got our response. It's just printing out the first part of it here just to
52:04 keep it short. And then I can press enter. And we can see a streaming demo
52:08 as well. And it's going to be a lot faster this time because we do already
52:11 have the model loaded into our GPU. And so that first request when it first has
52:15 to load a model is always the slower one. And then it's faster going forward
52:19 once that model is already loaded in our GPU. And then as long as we don't swap
52:24 to another large language model and use that one, then it will remain in our GPU
52:28 for some time. And so then all of our responses after are faster. And then we
52:33 just have the last part of our demo here with a multi-turn conversation. So we
52:37 can see conversation history in action as well, just not with streaming here.
52:40 And everything's a bit slower with this large language model because
52:43 it is a reasoning one. And so, certainly, if you want faster
52:47 inference, you can always use a non-reasoning local LLM like Mistral or
52:52 Gemma for example. So that is our very simple demo showing how this works. I
52:55 hope that you can see with this and again this works with other AI agent
52:59 frameworks like Agno or Pydantic AI or CrewAI as well; they all work in
53:03 this way, where you can use OpenAI API compatibility to swap between providers
53:08 so easily so you don't have to recreate things to use local AI and that's
53:11 something so important that I want to communicate with you because if I'm the
53:15 one introducing you to local AI I also want to show you how it can very easily
53:19 fit into your existing systems and automations. All right. Now, we have
53:23 gotten to the part of the local AI master class that I'm actually the most
53:27 excited for because over the past months, I have very much been pouring my
53:31 heart and soul into building up something to make it infinitely easier
53:35 for you to get everything up and running for local AI. And that is the local AI
53:40 package. And so, right now, we're going to walk through installing it step by
53:44 step. I don't want you to miss anything here because it's so important to get
53:47 this up and running, get it all working well. Because if you have the local AI
53:51 package running on your machine and everything is working, you don't need
53:55 anything else to start building AI agents running 100% offline and
53:59 completely private. And so here's the thing. At this point, we've been
54:04 focusing mostly on Olama and running our local large language models. But there's
54:08 the whole other component to local AI that I introduced at the start of the
54:13 master class for our infrastructure: things like our database and local and
54:18 private web search, our user interface, agent monitoring. We have all these
54:23 other open-source platforms that we also want to run along with our large
54:27 language models and the local AI package is the solution to bring all of that
54:32 together curated for you to install in just a few steps. So, here is the GitHub
54:37 repository for the local AI package. I'll have this linked in the description
54:41 below. Just to be very clear, there are two GitHub repos for this master class.
54:45 We have this one that we covered earlier. This has our N8N and Python
54:49 agents that we'll cover in a bit, as well as the OpenAI compatible demo that
54:53 we saw earlier. So, you want to have this cloned and the local AI package as
54:57 well. Very easy to get both up and running. And if you scroll down in the
55:02 local AI package, I have very comprehensive instructions for setting
55:06 up everything, including how to deploy it to a private server in the cloud,
55:10 which we'll get into at the end of this master class, and a troubleshooting
55:13 section at the bottom. So, everything that I'm about to walk you through here,
55:17 there's instructions in the readme as well if you just want to circle back to
55:21 clarify anything. Also, I dive into all of the platforms that are included in
55:26 the local AI package. And this is very important because like I said, when you
55:30 want to build a 100% offline and private AI agent, it's a lot more than just the
55:35 large language model. You have all of the accompanying infrastructure like
55:39 your database and your UI. And so I have all that included. First of all, I have
55:44 N8N, that is our low/no-code workflow automation platform. We'll be building
55:48 an agent with N8N in the local AI package in a little bit once we have it
55:52 set up. We have Superbase for our open-source database. We have Olama. Of
55:56 course, we want to have this in the package as well for our LLMs. Open Web
56:01 UI, which gives us a ChatGPT-like interface for us to talk to our LLMs and
56:06 have things like conversation history. Very, very nice. So, we're looking at
56:09 this right here. This is included in the package. Then we have Flowwise. It's
56:13 similar to N8N. It's another really good tool to build AI agents with no/low
56:18 code. Quadrant, which is an open-source vector database. Neo4j, which is a
56:25 knowledge graph engine. And then SearXNG for open-source, completely free and
56:31 private web search. Caddy, which is going to be very important for us once
56:35 we deploy the local AI package to the cloud and we actually want to have
56:38 domains for our different services like N8N and Open Web UI. And then the last
56:42 thing is Langfuse. This is an open-source LLM engineering platform; it helps
56:47 us with agent observability. Now, some of these services are outside of the scope
56:52 for this local AI master class. I don't want to spend a half hour on every
56:56 single one of these services and make this a 10-hour video. I will be focusing
57:02 in this video on N8N, Superbase, Olama, Open WebUI, SearXNG, and then Caddy once
57:08 we deploy everything to the cloud. So, I do cover like half of these services.
57:12 And the other thing that I want to touch on here is that there are quite a few
57:16 things included here. And so you do need about 8 GB of RAM on your machine or
57:21 your cloud server to run everything. It is pretty big overall. And so you can
57:26 remove certain things like if you don't want Quadrant and Langfuse for example,
57:30 you can take those out of the package. More on that later. It doesn't have to
57:34 be super bloated, you can whittle this down to what you need. But yeah, there's
57:37 a lot of different things that go into building AI agents. And so I have all of
57:40 these services here so that no matter what you need, I've got you covered. And
57:44 so with that, we can now move on to installing the local AI package. And
57:48 these instructions will work for you on any operating system, any computer. Even
57:52 if you don't have a really good GPU to run local large language models, you
57:56 still could always use OpenAI or Anthropic, something like that, and then
58:00 run everything else locally to save on costs or just to have everything running
58:04 on your computer. And so there are a couple of prerequisites that you have to
58:08 have before you can do the instructions below. You need Python so you can run
58:12 the start script that boots everything up. Git or GitHub desktop so you can
58:16 clone this GitHub repository, bring it all onto your own machine. And then you
58:21 want Docker or Docker Desktop. And so I've got links for all of these. Docker
58:25 and Docker Desktop we need because all of these local AI services that I've
58:29 curated for you, they all run as individual Docker containers that are
58:34 all combined together in a stack. And so I'll actually show you this is the end
58:36 result once we have everything up and running within your docker desktop. You
58:41 have this local AI Docker Compose stack that has all of the services running in
58:45 tandem, like Superbase and Redis and N8N and Flowwise, Caddy, Neo4j. All of
58:50 these are running within this stack. That is what we're working towards right
58:54 now. And so make sure you have all these things installed. I've got links that'll
58:57 take you to installing no matter your operating system. Very easy to get all
59:01 of this up and running on your machine. Then we can move on to our first command
59:05 here, which is to clone this GitHub repository, bringing all of this code on
59:10 your machine so you can get everything running. And so you want to open up a
59:14 new terminal. So I've got a new PowerShell session open here. Going to
59:18 paste in this command. And I'm going to be doing this completely from scratch
59:22 with you. So you clone the repo and then I'm just going to change my directory
59:26 into local AI package, which was just created from this git clone command. So
59:31 those are the first two steps. The next thing is we have to configure all of our
59:36 environment variables. And believe it or not, this is actually the longest part
59:41 of the process. And once we have this taken care of, it's a breeze getting the
59:44 rest of this up and running. But there's a lot of configuration that we have to
59:49 set up for our different services like credentials for logging into our
59:54 Superbase dashboard or Neo4j. Uh things like our Superbase um anonymous key and
59:59 private key. All these things we have to configure. And so within our terminal
60:04 here, you can do code dot to open this within VS code or windsurf. Open this in
60:09 windsurf. You just want to open up this folder within your IDE and the specific
60:14 IDE that you use really doesn't matter. You just want to get to this .env.example
60:20 here. I'm going to copy it and then I'm going to paste it. And then I'm going to
60:24 rename this to .env. So we're taking the .env.example and turning it into a .env file. So,
60:31 you want to make sure that you copy it and rename it like this. Then we can go
60:36 ahead and start setting all of our configuration. And I'll even zoom in on
60:39 this just so that it's very easy for you to see everything that we are setting up
60:44 here. So, first up, we have a couple of credentials for N8N. We have our
60:49 encryption key and our JWT secret. And it's very easy to generate these. In
60:53 fact, we'll be doing this a couple of times, but we'll use this open SSL
60:58 command to generate a random 32-character alphanumeric string that
61:02 we're going to use for things like our encryption key and JWT secret. And so,
61:08 OpenSSL is a command that is available for you by default on Linux and Macs.
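As an aside, this also applies to the Python alternative mentioned in a moment: a hedged Python equivalent of that generator (the README's exact command and the variable names may differ) looks something like this:

```python
# Generate a random hex string to use for values like the N8N encryption key
# or JWT secret. Adjust the byte length to whatever the README asks for.
import secrets

print(secrets.token_hex(32))
```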
61:12 You can just open up any terminal and run this command and it'll spit out a
61:16 long string that you can then just paste in for this value. For Windows, you
61:20 can't just open up any terminal and use OpenSSL, but you can use Git Bash, which
61:26 is going to come with GitHub Desktop when you install it. And so, I'll go
61:29 ahead and just search for that. If you just go to your search bar on your
61:32 bottom left on Windows and search for Git Bash, it's going to open up this
61:37 terminal like this. And so, I can go ahead and copy this command, go in here,
61:42 and paste it in. And then I can run it. And then, boom, there we go. This is I
61:45 know it's really small for you to see right now. I'm going to go ahead and
61:48 copy this because this is now the value that I can use for my encryption key.
61:52 And then you want to do the exact same thing to generate a JWT secret. And then
61:57 the other way that you can do this if you don't want to install git bash or
62:01 it's not working for whatever reason, you can use Python to generate this as
62:05 well. So I can just copy this command and then I can go into the terminal here
62:10 and I can just paste this in. And so it's going to just like with OpenSSL
62:15 generate this random 32 character string that I can copy and then use for my JWT
62:20 secret. There we go. And so I am going to get in the weeds a little bit here
62:24 with each of these different parameters, but I really want to make sure that I'm
62:27 clear on how to set up everything for you so you can really walk through this
62:31 step by step with me. And like I said, setting up the environment variables is
62:35 the longest part by far for getting the local AI package set up. So if you bear
62:39 with me on this, you get through this configuration, you will have everything
62:43 running that you need for local AI for the LLMs and your infrastructure. So
62:47 that's everything for N8N. Now we have some secrets for Superbase. And there
62:53 are some instructions in the Superbase documentation for how to get some of
62:57 these values. So it's this link right here, which I have open up on my
63:01 browser. So we'll we'll reference this in a little bit here. But first, we can
63:05 set up a couple of other things. The first thing we need to define is our
63:11 Postgress password. So, Superbase uses Postgress under the hood for the
63:14 database. And so, we want to set a password here that we'll use to connect
63:19 to Postgress within N8N or a connection string that we have for our Python code,
63:23 whatever that might be. And this value can be really anything that you want.
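Before the caveat about special characters that comes next, here is a hedged sketch of where this password typically ends up: inside a Postgres connection URL. The "db" host, the postgres user and database names, and port 5432 come up again later when we wire up N8N; the password value below is purely illustrative. Characters like % or @ have meaning inside a URL, which is exactly why they need to be percent-encoded (or avoided) here:

```python
# Build a connection string for the Supabase Postgres database (a sketch).
from urllib.parse import quote_plus

password = "testpostgrespass123"   # illustrative value, not a real credential
host = "db"                        # Supabase's Postgres service name inside the stack
conn_str = f"postgresql://postgres:{quote_plus(password)}@{host}:5432/postgres"
print(conn_str)
```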
63:26 Just note that you have to be very careful about using special characters like
63:31 percent symbols. So if you ever have any issues with Postgress, it's probably
63:36 because you have special characters that are throwing it off. U that's something
63:39 that I've seen happen quite a few times. And so like I said, I want to mention
63:42 troubleshooting steps and things to make sure that it is very clear for you. So
63:47 for this Postgress password here, I'm just going to say test Postgress pass.
63:51 I'm just going to give some kind of random value here. Just end with a
63:54 couple of numbers. I don't care that I'm exposing this information to you because
63:58 this is a local AI package. These passwords are for services that never
64:02 leave my computer. So, it's not like you could hack me by connecting to anything
64:08 here. And then we have a JWT secret. And this is where we get into this link
64:13 right here in the Superbase docs. And so they walk you through generating a JWT
64:18 secret and then using that to create both your anonymous and your service
64:22 role keys. If you're familiar with Superbase at all, we need both of these
64:27 pieces of information. The anonymous key is what we share to our front end. This
64:30 is our public key. And then the service role key has all permissions for
64:33 Superbase. We'll use this in our backends for things like our agents. And
64:39 so you can just go ahead and copy. You can go ahead and copy this JWT secret.
64:43 And then you can paste this in right here. This is 32 characters long just
64:47 like the things that we generated with OpenSSL. I'm just going to be using
64:52 exactly what Superbase tells me to. And then what you can do with this is you
64:56 can select the anonymous key. Click on generate JWT and then I can copy this
65:02 value and then I will paste this for my anonymous token. And so I'm just
65:06 replacing the default value there for the anonymous key. And then going back
65:10 and selecting the service key, I'm going to generate that one as well. So it
65:13 looks very similar. They'll always start with ey, but these values are different
65:18 if you go towards the end. And so I'll go ahead and paste this for my service
65:22 role key. Boom. There we go. All right. And then for the Superbase dashboard
65:27 that we'll log into to see our tables and our SQL editor and authentication
65:31 and everything like that, we have our username here, which I'm just going to
65:35 keep as superbase. And then for the password, I can just say test superbase
65:39 pass. I'll just kind of use that as my common nomenclature here for my
65:42 passwords cuz I don't really care what that is right now. And then the last
65:45 thing that we have to set up is our pooler tenant ID. And it's not really
65:49 important to dive into what exactly this means. Just know that you can set this
65:52 to really anything that you want. Like I typically will just choose four digits
65:57 here, like 1000, for my pooler tenant ID. So that is everything that we need for
66:01 superbase. And actually most of the configuration is for superbase. Then we
66:06 have Neo4j. This is really simple. You can leave Neo4j for the username and
66:11 then I'll just say test Neo4j pass for my password here. So you just set the
66:15 password for the knowledge graph, and even if you're not using Neo4j you still have
66:19 to set this but yeah it just takes two seconds. Then we have langfuse. This is
66:23 for agent observability. We have a few secrets that we need here. And for these
66:28 values they can really just be whatever you want. It doesn't matter because
66:31 these are just passwords just like we had passwords for things like Neo4j.
66:35 So I can just say test ClickHouse pass. Um, and then I can do test Minio pass. And
66:43 um I mean it really doesn't matter here. Random Langfuse salt. I'm just doing
66:47 completely whack values here. You probably want something more secure in
66:51 this case, but um I'm just doing something as a placeholder for now. Um
66:56 yeah, there we go. Okay, good. And then the last thing that we need
66:59 for Langfuse is an encryption key. And this is also generated with OpenSSL like
67:04 we did for the N8N credentials. And so I'll go back to my git bash terminal.
67:08 And again, you can do this with Python as well. I'll just run the exact same
67:12 command. I'll get a different value this time. And so I'll go ahead and copy
67:16 that. You could technically use the same value over and over if you wanted to,
67:20 but obviously it's way more secure to use a different value for each of the
67:24 encryption keys that you generate with OpenSSL. So there we go. That is our
67:28 encryption key. And that is actually everything that we have to set up for
67:32 our environment variables when we are just running the local AI package on our
67:37 computer. Once we deploy it to the cloud and we actually want domains for our
67:41 different services like open web UI and N8N then we'll have to set up caddy. So
67:45 this is where we'll dive into domains and we'll get into this at the end of
67:49 the master class here. But everything past this point for environment
67:54 variables is completely optional. You can leave all of this exactly as it is
67:59 and everything will work. Most of this is just extra configuration for
68:03 superbase. So, Superbase is definitely the biggest service that's included in
68:08 this list of, you know, curated services for you. And so, there's a lot of
68:11 different configuration things you can play around with if you want to dive
68:15 more into this. You can definitely look at the same documentation page that we
68:19 were using for the Superbase Secrets. And so, you can scroll through this if
68:22 you want to learn more um like setting up email authentication or Google
68:27 authentication. um diving more into all of those different configuration things
68:32 for Superbase if you want to dive more into that. I'm not going to get into all
68:36 of this right now because the core of getting Superbase up and running we
68:40 already have taken care of with the credentials that we set up at the top um
68:44 right here. And so these are just the base things and so that's
68:47 what we'll stick to right now. So that is everything for our environment
68:51 variables. So then going back to our readme now which I have open directly in
68:55 windsurf now instead of my browser we have finished our configuration and I do
68:59 have a note here that you want to set things up for caddy if you're deploying
69:03 to production. Obviously we're doing that later, not right now, like I said. And
69:07 so with that, we are good to start everything.
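Purely optional, and not part of the package's own scripts: before the first run it can be worth sanity-checking that the values you just set actually made it into .env. A small sketch; the variable names below follow the usual n8n and Supabase self-hosting conventions, so check .env.example in the package for the authoritative list:

```python
# Quick .env sanity check before running the start script for the first time.
from dotenv import dotenv_values   # pip install python-dotenv

required = [
    "N8N_ENCRYPTION_KEY", "N8N_USER_MANAGEMENT_JWT_SECRET",   # assumed names
    "POSTGRES_PASSWORD", "JWT_SECRET", "ANON_KEY", "SERVICE_ROLE_KEY",
    "DASHBOARD_USERNAME", "DASHBOARD_PASSWORD", "POOLER_TENANT_ID",
]
env = dotenv_values(".env")
missing = [key for key in required if not env.get(key)]
print("Missing values:" if missing else "All set.", missing or "")
```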
69:12 Now, before we spin up the entire local AI package, there is one thing that I want to cover. It's
69:14 important to cover this before we run things. If you don't want to run
69:19 everything in the package cuz it is a lot like maybe you only want to use half
69:22 of these services and you don't want Neo4j and Langfuse and Flowwise right
69:28 now. There are two options that you have. The easiest one right now is to go
69:33 into the docker compose file. This is the main file where all of the services
69:38 are curated together and you can just remove the services that you don't want
69:42 to include. So, for example, if you don't want Quadrant right now, cuz it is
69:46 actually one of the larger services. It's like 600 megabytes of RAM just
69:50 having this running, you can search for Quadrant, and you can just go ahead and
69:54 delete this service from the stack like that. Boom. Now I don't have Quadrant.
69:59 It won't spin up as a part of the stack anymore. And then also I have a volume
70:03 for Quadrant. So, you can remove that as well. Volumes, by the way, is how we are
70:08 able to persist data for these containers. So if we tear down
70:12 everything and then we spin it back up, we still are going to have our open web
70:17 UI conversations and our N8N workflows, everything in Superbase, like all that
70:21 is still going to be saved because we're storing it all in volumes. So we can do
70:25 whatever the heck we want with these containers. We can tear them down. We
70:28 can update them, which I'll show you how to do later. We can spin it back up. And
70:32 all of our data will always be persisted. So you don't have to worry
70:35 about losing information. And you can always back things up if you want to be
70:39 really secure, but I've never done that before and I've been updating this
70:42 package for months and months and months and all of my workflows from 6 months
70:46 ago are still there. I haven't lost anything. And so that's just a quick
70:50 caveat there for how you can remove services if you want. And then another
70:54 thing that we don't have available yet, but I'm very excited to, you know, kind
70:58 of talk about this right now. It's in beta right now. We are creating me and
71:03 one other guy uh that's actually on my Dynamist team. Um Thomas, he's got a
71:07 YouTube channel as well. He's a great guy. We're working together on this.
71:09 He's actually been putting in most of the work creating a front-end
71:13 application for us to manage our local AI package. And one of the big things
71:17 with this is that we're going to make it possible for you to toggle on and off
71:22 the services that you want to have within your local AI package. So you can
71:27 very much customize the package to the services that you want to run. So you
71:31 can keep it lightweight just to the things you care about. Also, we'll be
71:34 able to manage environment variables and monitor the containers. Not all of this
71:38 is up and running at this point, but this is in beta. We're working on it.
71:41 I'm really excited for this. So, not available yet, but at once this is
71:44 available, this will be a really good way for you to customize the package to
71:47 your needs. So, you don't have to go and edit the docker compose file directly.
71:51 So, that's something that I just wanted to get out of the way now. But, we can
71:57 start and actually execute our package now. Get all these containers up and
72:02 running. So the command that you run to start the local AI package is different
72:07 depending on your operating system and the hardware that you have. So for
72:13 example, if you are an Nvidia GPU user, you want to run this start services.py
72:18 script. This boots up all of the containers and you want to specifically
72:23 pass in the profile of GPU NVIDIA. This is going to start Olama in a way where the
72:29 Olama container is able to leverage your GPU automatically. And then if you are
72:34 using an AMD GPU and you're on Linux, then you can run it this way. Which by
72:38 the way, unfortunately, if you have an AMD GPU on Windows, you aren't able to
72:46 run Olama in a container. And it's the same thing with Mac computers.
72:49 Unfortunately, like you see right here, you cannot expose your GPU to the Docker
72:55 instance. And so if you are an AMD GPU on Windows or running on Mac, you cannot
73:01 run Olama in the local AI package. You just have to install it on your own
73:04 machine like I already showed you in this master class and then you'll just
73:08 run everything else through the local AI package and they can actually go out to
73:12 your machine and communicate to Olama directly. So just a small limitation for
73:17 Mac and AMD on Windows. But if you're running on Linux or an Nvidia GPU on
73:21 Windows like I'm using, then you can go ahead and run this command right here.
73:27 So if you can't run a GPU in the Olama container, then you can always just
73:32 start in CPU mode or you can run with a profile of none. This will actually make
73:36 it so that Olama never starts in the local AI package. So you can just
73:40 leverage the Olama that you have already running on your computer like I showed
73:43 you how to install already. So, just a couple of small caveats that I really
73:46 want to hit on there. I need to make sure that you're using the right
73:51 command. And so, in my case, I'm Nvidia on Windows. So, I'm going to copy this
73:55 command. Go back over into my terminal. I'll just clear it here. So, we have a
73:59 blank slate. And I'll paste in this command. And so, it's going to do quite
74:02 a few things initially. First, it's going to clone the Superbase repository
74:07 because Superbase actually manages the stack in a separate place. And so, we
74:10 have to pull that in. Then there's some configuration for SearXNG for our uh local
74:17 and private web search. And then I have a couple of warnings here saying that
74:20 the Flowwise username and password are not set, which by the way for that if
74:24 you want to set the Flowwise username and password, it's optional, but you can
74:29 do that if I scroll down right here. So you can set these values, those will
74:32 actually make those warnings go away, but you can also ignore them, too. So
74:35 anyway, I just wanted to mention that really quickly. But now what's happening
74:39 here is it starts by running all of the Superbase containers. And so there's
74:44 quite a bit that goes into Superbase, like I said. So we're running all of
74:47 that. It's getting all that spun up. And then once we run all of these, it's
74:51 going to move on to deploying the rest of our stack. And if you're running this
74:55 for the very first time, it will take a while to download all of these images.
74:59 They're not super small. There's a lot of infrastructure that we're starting up
75:03 here. And so it'll take a bit. You just have to be patient. maybe go grab your
75:06 coffee or make your next meal, whatever that is. And then everything will be up
75:09 and running once you are back. And so yeah, now you can see that we are
75:13 running the rest of the containers here. Um, and so we'll just wait for that to
75:16 be done. And then I'll show you what that looks like in Docker Desktop as
75:19 well. And so I'll give it a second here just to finish. Uh, looks like my
75:24 terminal glitched a little bit. Like I was scrolling and so it kind of broke it
75:27 a bit. But anyway, everything is up and running now. It'll look like this where
75:30 it'll say all of the containers are healthy or running or started. And then
75:34 if I go into Docker Desktop and I expand the local AI Compose stack, you want to
75:39 make sure that you have a green dot for everything except for the Olama pull and
75:45 N8N import. These just run once initially and then they go down because
75:49 they're responsible for pulling some things for our local AI package. And so
75:53 yeah, I've got green dots for everything except for two right here. Now I'm
75:57 leaving this in here intentionally actually because there is a bug with
76:02 Superbase specifically if you are on Windows. So you'll see this issue where
76:08 the Superbase pooler is constantly restarting and that also affects N8N
76:12 because N8N relies on the Superbase pooler. So it's constantly restarting as
76:17 well. If you see this problem, I actually talk about this in the
76:21 troubleshooting section of the readme. If you scroll all the way down, if the
76:24 Superbase pooler is restarting, you can check out this GitHub issue. And so I
76:29 linked to this right here, and he tells you exactly which file you want to
76:33 change. It's this one right here. So it's docker/volumes/pooler/pooler.exs.
76:39 And you need to change the file's line endings to LF. And so I'll show you what I mean
76:43 by that. I'll show you exactly how to do this. It's like a super tiny random
76:47 thing, but this has tripped up so many people. So I want to include this
76:51 explicitly in the master class here. So you want to go within the superbase
76:56 folder within docker volumes and then it's within pooler and then we have
77:00 pooler.exs and basically no matter your IDE you can see the crlf in the bottom right here.
77:08 You want to click on this and then change it to LF, and then make sure that
77:13 you save this file. Very easy to fix that.
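If you'd rather do the conversion from a script than in your IDE, a small sketch follows; the path matches what's described here, but double-check it against your own clone:

```python
# Convert Windows (CRLF) line endings to Unix (LF) in the Supabase pooler config.
from pathlib import Path

pooler = Path("supabase/docker/volumes/pooler/pooler.exs")  # path assumed from the video
pooler.write_bytes(pooler.read_bytes().replace(b"\r\n", b"\n"))
print("Converted", pooler, "to LF line endings")
```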
77:19 And then what you can do is you can run the exact same command to spin everything up again. And so I'm going to
77:22 do this now. It's going to go through all the same steps. It'll be faster this
77:25 time because you already have everything pulled. And this, by the way, is how you
77:28 can just restart everything really quickly if you want to enforce new
77:31 environment variables or anything like that. So I want to include that
77:35 explicitly um for that reason as well. And I'll go ahead and close out of this.
77:39 And while this is all restarting, the other thing that I want to show you
77:42 in the readme is I also have instructions for upgrading the containers in the local AI package. So
77:49 when N8N has an update or Superbase has an update, it is your responsibility
77:53 because you're managing the infrastructure to update things yourself. And so you very simply just
77:58 have to run these three commands to update everything. You want to tear down
78:03 all of the containers and make sure you specify your profile like GPU Nvidia and
78:09 then you want to pull all of the latest containers and again specifying your
78:13 profile. And then once you do those two things, you'll have the most up-to-date
78:17 versions of the containers downloaded. So you can go ahead and run the start
78:21 services with your profile just like we just did to restart things. Very easy to
78:25 update everything. And even though we are completely tearing down our
78:29 containers here before we upgrade them, we aren't losing any information because
78:33 we are persisting things in the volumes that we have set up at the top of our
78:37 Docker Compose stack. And so this is where we store all of our data in our
78:41 database and workflows. All these things are persisted. So we don't have
78:45 to worry about losing them. Very easy to upgrade things and you still get to keep
78:48 everything. You don't have to make backups and things like that unless you
78:54 just want to be ultra ultra safe. So now we can go back to our Docker desktop and
79:00 we've got green dots for everything now since we fixed that pooler.exs issue.
79:04 The only thing that we don't have green dots for is the N8N import and then we
79:08 have our Olama pull as well because like I said those are the two things that
79:11 just have to run at the beginning and then they aren't ongoing processes like
79:16 the rest of our services. So, we have everything up and running. And if there
79:22 is anything that is a white dot besides Olama pull or N8N import or if there's
79:27 anything that is constantly restarting, just feel free to post a comment and
79:31 I'll definitely be sure to help you out. And then also check out the
79:34 troubleshooting section as well. One thing that I'll mention really quick is
79:37 sometimes your N8N will constantly restart and it'll say something like the
79:42 N8N encryption key doesn't match what you have in the config. And the big
79:46 thing to keep in mind for that is you want to make sure that you set this
79:51 value for the encryption key before you ever run it for the first time.
79:53 Otherwise, it's going to generate some random default value and then if you
79:56 change this later, it won't match with what it expects. And so, yeah, my big
79:59 recommendation is like make sure you have everything set up in your
80:03 environment variables before you ever run the start services for the first
80:08 time. This should be run once you have your environment variables set up.
80:11 Otherwise, you risk any of these services creating default values that
80:14 then wouldn't match with the keys and things that you set up later. And so
80:18 with that, we can now go into our browser and actually explore all of
80:22 these local AI services that we have running on our computer now. Now over in
80:26 our browser, we can start visiting the different services that we have spun up.
80:30 Like here is N8N. You just have to go to localhost port 5678. It'll have you
80:35 create a local account when you first visit it. And then you'll have this
80:38 workflow view that should look very familiar to you if you have used N8N
80:41 in the past. And then we have open web UI localhost port 8080. This is our chat
80:47 GPT like interface where we can directly talk to all of the models that we have
80:52 pulled in our Olama container. Really, really neat. And then we have local host
80:57 port 8000 for our Superbase dashboard. The signin definitely isn't pretty
81:00 compared to the managed version of Superbase. But once you enter in your
81:04 username and password that you have set for the environment variables for the
81:07 dashboard, then you have the very typical view where we have our tables
81:11 and we've got our SQL editor. Everything that you're familiar with with
81:14 Superbase. And that's the key thing with all these different services. They all
81:18 will look the exact same for you pretty much. Um like another one for example,
81:25 if I go to localhost um port 3000, we have Langfuse. This is for agent
81:28 observability and monitoring. And this is something I'm not going to dive into
81:31 in this master class. Like I said, I'm not covering all the services. But yeah,
81:34 I just want to show that like every single one of these pretty much you can
81:39 access in your browser. And by the way, the way that we know the specific port
81:43 to access for each of these services is by taking a look at either what it tells
81:48 us in Docker Desktop. So like we can see that Neo4j is um let's see, we have port
81:53 7474. For uh SearXNG, it's port 8081. For Flowwise, it's port 3001. What's one
82:01 that we've seen already? Um, let me Yeah, like Open Web UI is port 8080. So,
82:06 the port on the left is the one that we access in our browser. And then the port
82:11 on the right is what's mapped on the container. So, when we visit port 8080
82:16 on our computer, that goes into port 8080 on the container. And that's what
82:21 we have exposed. The other way that you can see the port that you need to use is
82:24 just by taking a look at this docker compose file. And you don't need to have
82:29 like a super good understanding of this docker compose file. But if you want to
82:33 customize your stack or even help me by making contributions to local AI
82:36 package, this is the main place to make changes. And so for example, I can go
82:41 down to flowwise and I can see that the port is 3001. Or if I go down to let's say N8N, we can
82:49 see that the port is 5678. And so the port is always going to be
82:52 there somewhere in the service that you have set up. Like for the Langfuse
82:56 worker, it's 3030. That's more of a behind-the-scenes kind of service. But
82:59 let me just find one more example for you here. Um yeah, like Redis for
83:04 example is 6379. So you can see the ports in the Docker Compose as well. I
83:07 just want to call it out just to at least get you a little bit comfortable
83:11 and familiar with the Docker Compose file in case you want to customize
83:14 things. But the main thing is just leveraging what you see here in Docker
83:18 Desktop. Last thing in Docker Desktop really quickly, if you want to bring
83:21 more local large language models into the mix, you can do it without having to
83:25 restart anything. You just have to find the Olama container in the Docker
83:29 Compose stack. Head on over to the exec tab. And now here we can run any
83:33 commands that we'd want. We're directly within the container here. And we can
83:36 use Olama commands just like we did earlier on our host machine. And so for
83:40 example, I ran Olama list already. So I can see the large language models that
83:43 have already been pulled in my Olama container. If I want to pull more, I can
83:48 just do Olama pull and then find that ID for the model I want to use on the Olama
83:52 website. And like I said, you don't have to restart anything. If I pull it here,
83:56 it's now in the container, and I can immediately start using it in Open Web
84:00 UI or N8N. We'll see that in a little bit.
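A hedged alternative to exec-ing into the container: Ollama also exposes a REST API on port 11434, so you can list and pull models from the host machine, assuming that port is published by the stack. The model tag here is just an example, and older Ollama versions expect the field to be called "name" rather than "model":

```python
# Manage models over Ollama's REST API instead of `ollama list` / `ollama pull`.
import json
import requests

# Equivalent to `ollama list`.
print(requests.get("http://localhost:11434/api/tags").json())

# Equivalent to `ollama pull qwen3:14b`; progress streams back as JSON lines.
with requests.post(
    "http://localhost:11434/api/pull",
    json={"model": "qwen3:14b"},   # older versions: {"name": "qwen3:14b"}
    stream=True,
) as response:
    for line in response.iter_lines():
        if line:
            print(json.loads(line).get("status"))
```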
84:03 And so that's just really important because a lot of times you're going to want to start to use different large
84:06 language models and you don't want to have to restart anything. The ones that
84:11 are brought in by default are determined by this line right
84:15 here. By default, I just have Qwen 2.5 7B Instruct, a really small, lightweight one,
84:23 brought into your Olama container. So if you want to add in
84:27 different ones, you can just update this line right here to include multiple
84:32 Olama pulls. And so that way you can bring in Qwen 3 or Mistral Small 3.1,
84:36 whatever you want. This is just the one I have by default. And then all the
84:40 other ones that you saw in my list here, I've pulled myself. All right. Now that
84:45 we have the local AI package up and running, it is time to build some
84:50 agents. Now, we get to use our local AI package to actually build out an
84:53 application. And so, I'm going to start by introducing you to Open Web UI, and
84:58 we'll use it to talk to our Olama LLM. So, we have an application kind of right
85:02 out of the box for us. Then I'll dive into building a local AI agent with N8N,
85:08 even connecting it to Open Web UI. So we have this custom agent that we built in
85:12 N8N and then we immediately have a really nice UI to chat with it. And then
85:16 we'll transition to Python building the exact same agent in Python as well. Like
85:21 I said, I want to focus on both no code and code to really make this a complete
85:24 master class so that whether you want to build with N8N or Python, you can see
85:28 how to connect to our different services that we have running locally like
85:32 Superbase and SearXNG and Open Web UI. So, we'll cover all of that and then I'll
85:36 get into deployments after this. But yeah, let's go ahead right now focus on
85:40 open web UI and building out some agents. So, back over in Open Web UI,
85:45 remember this is localhost port 8080. You want to set up your connection to
85:49 Olama so we can start talking with our local LLMs in this nice interface. And
85:54 so bottom left, go to the admin panel, then go to settings and then the
85:58 connections tab. Here we can set up our connections both to OpenAI with our API
86:02 key, which we're not going to do right now, but then also the Olama API. This
86:07 is what we want to set up. Now, usually by default, this value is just
86:12 localhost. And this is actually wrong. This is something that is so important
86:16 to understand. And this will apply when we set up credentials in N8N and Python
86:20 as well. When you are within a container, localhost means that you are
86:26 referencing still within the container. Open web UI needs to reach out to the
86:32 Olama container, not itself. So localhost is not correct here. This is
86:36 generally the default just because open web UI assumes that you're running on
86:39 your machine, and so then you would also have Olama running on your machine. So
86:42 local host usually works when you're outside of containers. But here we have
86:47 to change this. This is super important to get right. And so there are two
86:51 options we have. If you are running on a Mac or AMD on Windows and you want to
86:56 use Olama running on your machine, not within a container, then you want to do
87:00 host.docker.internal. This is the way in Docker to tell the container to look outside to the host
87:07 machine, where you're running the containers, and where Olama is running
87:11 separately. Very important to know that. And then if you are running Olama in the
87:16 container like I am doing. I have Olama running in my Docker Desktop. You want
87:21 to change this to Olama, you're specifically calling out the service
87:26 that is running the Olama container in your Docker Compose stack. And the way
87:30 that we know that this is the name specifically is because we just go back
87:35 to our all-important Docker Compose file. Olama. So whenever there's an x and a
87:40 dash, you just ignore that. It's just the thing after it. So, Olama is the name
87:45 of our service running the container. And then if we wanted to connect to
87:49 something else like Flowwise or Open Web UI, it's the same thing: the service
87:55 name there is the name we use. All of these top-level keywords, these are the names when we
87:59 want our containers to be talking to each other. And all of this is possible
88:03 because they are within the same Docker network. And so I'll just show you that
88:06 so you know what I'm talking about here. If I go back to Docker Desktop, we have
88:11 this local AI Compose stack. All of these containers can now communicate
88:14 internally with each other by referencing the names like Redis or
88:19 SearXNG. So, we'll be seeing that a lot when we're building out our agents as
88:22 well. So, I wanted to spend a couple minutes to focus on that. And so, you
88:25 can go ahead and click on save in the very bottom right. I know my face is
88:28 covering this right now, but you have a save button here. Make sure you actually
88:32 do that. Um, and for this API key, I don't know why it's asking me to fill it
88:34 out. I don't really care about connecting to open AI. So I'll just put
88:37 some random value there and click save. And then boom, there we go. We are good.
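To sum up the addressing rule just walked through, here is a short sketch; the service name comes from the package's Docker Compose file, so double-check yours if it differs:

```python
# Which base URL reaches Ollama depends on where the caller is running.
OLLAMA_FROM_YOUR_MACHINE = "http://localhost:11434"              # scripts running directly on the host
OLLAMA_FROM_A_CONTAINER = "http://ollama:11434"                  # e.g. Open WebUI or N8N inside the stack
HOST_OLLAMA_FROM_A_CONTAINER = "http://host.docker.internal:11434"  # container reaching Ollama installed on the host (Mac / AMD on Windows)
```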
88:40 And then a lot of times with open web UI, it also helps to refresh otherwise
88:43 it doesn't load the models for some reason. So I just did a refresh of the
88:49 site here. Control F5. And then now we can select all of the local LLMs that we
88:53 have pulled in our Olama container. And so for example, I can do Qwen 2.5 7B.
88:57 That's the one that I just have by default. I can say hello. And it's going
89:02 to take a little bit cuz it has to load this model onto my GPU just like we saw
89:06 with Qwen 3 earlier. But then in a second here we'll get a response. And
89:10 there are actually multiple calls that are being done here. We have one to get
89:15 our response, one to get a title for our conversation on the left-hand side. And
89:19 then also if you click on the three dots here, you can see that it created a
89:23 couple of tags for this conversation. So couple of things that are fired off all
89:26 at once there. And I can test conversation history. What did I just
89:30 say? So yeah, I mean everything's working really well here. We have chat
89:34 history, conversation history on the left-hand side. There's so much that we
89:37 get out of the box. And so I wanted to show you this really quickly. Now we can
89:41 move on to building an agent in N8N. And I'll even show you how to connect it to
89:46 Open Web UI as well through this N8N agent connector. Really exciting stuff.
89:49 So let's get right into it. So I'm going to start really simple here by building
89:53 a basic agent. The main thing that I want to focus on is just connecting to
89:57 our different local AI services. So I am going to assume that you have a basic
90:01 knowledge of N8N here because this is not an N8N master class. And so I'm
90:04 starting with a chat trigger so we can talk to our agent directly in the UI.
90:08 We'll connect this to open web UI in a bit as well. And then I want to connect
90:14 an AI agent node. And so what we want to do is connect for the chat model and
90:18 then local superbase for our conversation history, our agent memory.
90:22 And so for the chat model I'm going to do lama chat model. I'm going to create
90:26 brand new credentials. You can see me do this from scratch. The URL that you want
90:32 for the base URL is exactly the same as what we just entered into open web UI.
90:35 And so if you are running Olama on your host machine like an AMD on Windows or
90:40 you are running on a Mac or you just don't want to run the Olama container,
90:46 then it is host.docker.internal. And then if you are referencing the
90:49 Olama container, we just reference Olama. That's the name of the service
90:53 running the Olama container in our stack. And then the port is 11434 by
90:57 default. And you can test this connection. So it'll do a quick ping to
91:01 the container to make sure that we are good to go. And I'll even show you what
91:04 that looks like. So right here in my Olama container, I have the logs up. And
91:09 the last two requests were just a simple get request to the root endpoint. We
91:13 have two of those right here. And if I click on retry and I go back to the
91:18 logs, boom, we are at three now. So it made three requests. So it's just making
91:22 that simple ping each time to make sure the container is available. And so I'm
91:26 going to go ahead and click on save and then close out. So now we have our
91:29 credentials and then we can automatically select the model that we
91:33 have loaded now in our container. And so just to keep things really lightweight,
91:36 I'm going to go with the 7 billion parameter model right now from Qwen 2.5.
91:40 Cool. All right. So that is everything that we need to connect Olama. It is
91:44 that easy. And then we could even test it right now. So, I'm going to go ahead
91:47 and save this workflow. And I'm going to just say hello. And uh we don't need the
91:52 conversation history or tools or anything at this point. We're already
91:55 getting a response here from the LLM. It's working on loading the model into
91:59 my GPU as we speak. And so there we go. We got our answer looking really good.
92:04 Cool. So now we can add memory as well. So I'm going to add Postgress because
92:08 remember Superbase uses Postgress under the hood. And then I'm going to create
92:12 brand new credentials here. And this is actually probably the hardest one to set
92:16 up out of all of the credentials for connecting to our local AI service. And
92:20 so I'm going to show you what the Docker Compose file looks like just that it's
92:24 clear how I'm getting these different values. And so I'll point out all of
92:28 them. So the first one for our host it is DB because this is the name of the
92:35 specific Superbase service that we have that is the underlying Postgress
92:38 database. And I can show you how I got that really quick. If you go to the
92:42 superbase folder that we pull when we run that start services script, I go to
92:47 docker and then docker compose. If I search for db and there's quite a few
92:53 dependencies on db here. So let me find the actual reference to it. Where is db?
92:58 Here we go. So yeah, it's really short. Uh db is the name of our service that
93:03 actually is the superbase DB. So this is the container name that this is what
93:07 you'll see in docker desktop. But then this is the underlying service that we
93:10 want to reference when we have our containers communicating with each
93:14 other. Like in this case we have our N8N container talking to our superbase
93:18 database container. And then the database and username are both going to
93:22 be Postgress. Those are the values that we have by default. If you scroll down a
93:26 bit in the .env you can see these right here. The Postgress database is
93:29 Postgress and the user is also Postgress. And you can customize these
93:33 things but these are some of the optional parameters that I didn't touch
93:36 in the setup with you. And so you can just leave those as is. Now the
93:41 Postgress password, this is one of them that we set. That was the first
93:44 superbase value that we set there. Make sure you have that from what you have in
93:48 the .env. And then everything else you can kind of leave as the defaults here. The port is
93:53 going to be 5432. So that is everything for setting up our connection to
93:57 Postgress. You can test this connection as well. And then we can move on to
94:01 adding in some tools and things like that as well. But yeah, this is like the
94:06 very first basic version of the agent that I wanted to show you. And hopefully
94:10 with this you can see how no matter the service that you have running in the
94:13 local AI package. It's very easy to figure out how to connect to it both
94:17 with the help of N8N because N8N always makes it really easy to connect to
94:20 things. Then also just knowing that like you just have to reference that service
94:25 name that we have for the container in the Docker Compose stack. That's how we
94:28 can talk to it. So you could add in Quadrant or you could add in Langfuse.
94:31 Like you can connect anything that you want into our agent here. And so now we
94:36 have conversation history. Next up, I want to show you how to build a bit more
94:40 of a complicated agent with N8N using some tools. And then also I'm going to
94:44 show you how to connect it to Open Web UI. And so right now this is a live
94:47 demo. Instead of connecting to one of the Olama LLMs, I'm going straight to
94:53 N8N. I have this custom N8N agent connector. And so we are talking to this
94:57 agent that I'll show you how to build in a little bit. This one has a tool to use
95:02 CRXNG for local and private web search. This is one of the platforms that we
95:06 have included in the local AI package. And so this response is going to take a
95:10 little bit here because it has to search the web. And the response that it
95:13 generates with this question is pretty long. Like there we go. Okay. So we got
95:16 the answer. It's pretty long. But yeah, we are able to search the internet now
95:21 with a local agent. N8N connected to open web UI. We're getting pretty fancy
95:25 here. And we also have the title that was generated on the left. And then we
95:30 have the tags here as well. And so the way that this all works, I'm going to
95:33 start by explaining how we can connect N8N to Open Web UI. And this is just
95:37 crucial. Makes it so easy for us to test agents locally as we are developing
95:42 them. And so if you go to the settings and the admin panel in the bottom left
95:47 and go to functions, open web UI has this thing called functions which gives
95:51 us the ability to add in custom functionality kind of as like custom
95:56 models that we can then use like you saw with the N8N agent connector. And so
96:02 what I have here is this thing that I call the N8N pipe. And I'll have a link
96:05 to this in the description as well. I created this myself and I uploaded it to
96:10 the open web UI directory of functions. And so you can go to this link right
96:14 here. You can even just Google the N8N pipe for open web UI. And then you click
96:19 on this get button. It'll just have you enter in the URL for your open web UI.
96:23 So I can just like paste in this right here. Click on import to open web UI and
96:28 it'll automatically redirect you to your open web UI instance. So you'll have
96:33 this function now. And we don't have to dive into the code for all how how all
96:36 of this works. I worked pretty hard to create this for you. Uh actually quite a
96:40 while ago I made this. And the thing that we need to care about is
96:44 configuring this to talk to our N8N agent. And so if you click on the
96:49 valves, the setting icon in the top right, there are a few values that we
96:54 have to set. And so now I'm going to go over to showing you how to build things
96:57 in N8N. Then all of this will click and it'll make sense. Right now, looking at
97:00 these values, you're probably like, how the heck do I get all of these? But
97:02 don't worry, we'll dive into all of that. But first, let's go into our N8N
97:07 agent. I'll explain how all of this works. So, first of all, we have our
97:12 chat trigger that gives us the ability to communicate with our agent very
97:16 easily in the workflow. We have a new trigger now for the web hook. And so,
97:22 this is turning our agent into an API endpoint. So, we're able to talk to it
97:27 with other services like open web UI. And so to configure the web hook here,
97:31 you want to make sure that it is a post request type. And then you can define a
97:35 custom path here. Whatever you set here is going to determine what our URL is.
97:40 So we have our test URL. And then also if you toggle the workflow to active,
97:44 this is really important. The workflow in N8N does have to be active. Then you
97:49 have access to this production URL. And this is actually the first value that we
97:54 need to set within the valves for this open web UI function. We have our N8N
97:59 URL. And because this is a container talking to another container, we don't
98:03 actually want to use this localhost value that it has here for us. We want
98:08 to specify N8N because N8N again is the name of the service running the N8N
98:13 container in our Docker Compose stack. So N8N port 5678. And then this is the
98:18 custom URL that we can determine based on this. And then the other thing that
98:23 we want to do is set up header authentication. We don't want to expose
98:27 this endpoint without any kind of security. And so we want to set up some
98:31 authentication. And so you can select header off from the authentication
98:34 dropdown. And then for the credentials here, I'll just create brand new ones to
98:38 show you what this looks like. The name needs to be authorization with a capital
98:43 A. This has to be very specific. The name in the top left and the name of
98:46 your credentials. This can be whatever you want, but this has to be
98:51 authorization. And then the value here, the way that we want to format this is:
98:55 it's going to be Bearer, then a space, and then whatever you want
99:00 your bearer token to be. So this is what you get to define, but it needs to start
99:05 with Bearer, capital B, and a space. And then whatever you type after Bearer and
99:09 the space, this goes in as the N8N bearer token in the valves. So you don't include Bearer
99:13 and a space there, because it's just assumed that it's going to be prefixed
99:16 like that. So you just type in like test auth, which is what I
99:21 have. So my bearer token is Bearer test auth, like that. And then this is what I
99:24 enter in for this field. Now I already have mine set up. So I'm just going to
99:27 go ahead and close out of this. And then the last thing that we have to set up
99:30 for the web hook. And don't worry, this is the node that we spend the most time
99:33 with. You want to go to the drop down here and change this to respond using
99:38 the respond to web hook node. Very important, because then at the end of our
99:40 workflow, when we get the response from our agent, we're going to send that back
99:45 to whatever requested our API, which is going to be open web UI in this case.
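Put together, here is a hedged sketch of what the Open WebUI function effectively does against this webhook: a POST with the Bearer token and the prompt. The path, the token value, and the "chatInput" field name are illustrative; use whatever you configured in the Webhook node and in the valves:

```python
# Call the N8N production webhook the same way the Open WebUI pipe does (a sketch).
import requests

resp = requests.post(
    # From another container use the service name; from your own machine use localhost:5678.
    "http://n8n:5678/webhook/invoke-n8n-agent",
    headers={"Authorization": "Bearer test-auth"},
    json={"chatInput": "What is the current price of the 5090 GPU?"},
    timeout=120,
)
print(resp.status_code, resp.text)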
99:48 And so that's everything for our configuration for the web hook. Now the
99:53 next thing that we have to do is we have to determine is open web UI sending in a
99:58 request to get a response for our main agent or is it just looking to generate
100:03 that conversation title or the tags for our conversation? Because like we were
100:06 looking at earlier, I'm going to close out of this for now and go back to a
100:10 conversation, our last conversation here. We get our main response, but then
100:15 also there is a request to an LLM to create a very simple title for our
100:19 conversation and the tags that we can see in the top right. And so our N8N
100:24 workflow actually gets invoked three separate times for just the first
100:30 message in a new conversation. And so we need to determine, are we getting a main
100:34 response? Like should we go to our main agent or should we just go to a simple
100:39 LLM that I have set up here to help generate the tags or title? And so the
100:44 way that we can determine that is whenever Open Web UI is requesting
100:47 something like a title for a conversation, it always prefixes the
100:53 prompt with three pound symbols, a space, and then the word task. And so we
100:58 can key off of this. If the prompt starts with this, and that prompt just
101:02 is coming in from our web hook here. If it does start with it, then we're just
101:06 going to go to this simple LLM, we're just going to be using Qwen 2.5 14B
101:11 instruct. We have no tools, no memory or anything like our main agent because
101:14 we're just very simply going to generate that title or the tags. And I can even
101:19 show you in the execution history what that looks like. So in this case, we
101:22 have our web hook that comes in. The chat input starts with the triple pound
101:28 and task. And so sure enough, we are deeming it to be a metadata request is
101:32 what I'm calling it. And so it then goes down to this LLM that is just
101:36 generating some text here. We just have this JSON response with the tags for the
101:41 conversation, technology, hardware, and gaming. So we're asking about the price
101:45 of the 5090 GPU. And then we do the exact same thing to also generate the
101:51 title GPU specs. And so exactly what we see here is the title of this last
101:55 conversation. So I hope that makes sense. And then if it doesn't start with
101:59 "### Task", then it's actually our real request, and we go to our
102:03 main agent. We don't want our main agent to have to handle those super simple
102:06 tasks. For the metadata route you could also use a really tiny LLM; this would be the perfect
102:11 case to use something super tiny, like DeepSeek R1 1.5B,
102:15 because it's just such a simple task. Otherwise, though, we are going to
102:20 go to our main agent. And so I'm not going to dive into like all these nodes
102:24 in a ton of detail, but basically we are expecting the chat input to contain
102:30 the prompt for our agent. And the way that we know to expect chat input
102:35 specifically is because going back to the settings for the function here with
102:38 the valves, we are saying right here chat input. So you want to make sure
102:43 that the value that you put in here for input matches exactly with what you are
102:48 expecting from our web hook. And so chat input is the one that I have by default.
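A minimal sketch of that routing decision in Python (the chatInput key matches the valve above; the "### Task" prefix is what Open Web UI prepends for title and tag generation):

```python
def is_metadata_request(chat_input: str) -> bool:
    # Open Web UI prefixes title/tag generation prompts with "### Task".
    return chat_input.startswith("### Task")

def route(body: dict) -> str:
    chat_input = body["chatInput"]  # must match the "input" valve in the pipe function
    if is_metadata_request(chat_input):
        return "simple metadata LLM"   # just generate the conversation title or tags
    return "main agent"                # tools, memory, web search, etc.
```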
102:51 So you can just copy me if you want. Then we go into our agent where we're
102:55 hooked into Ollama and we've got our local Supabase. I already showed you
102:58 how to connect up all this, and that looks exactly the same. The only thing
103:01 that is different now is we have a single tool to search the web with
103:07 SearXNG. So it's a web search tool. I have a description here just telling it what
103:11 it is going to get back from using this tool. And then for the workflow ID, this
103:16 is, if I go to add a node here and go to workflow tools, the "Call n8n
103:23 Workflow Tool" node. So this is basically taking an N8N workflow and using it as a
103:28 tool for our agent. So this is the node that we have right here. But then I'm
103:32 referencing the ID of this N8N workflow, this ID, because I'm going to just
103:37 call the subworkflow that I have defined below. And again, I don't want to dive
103:40 into all the details of N8N right now and how this all works, but the agent is
103:44 going to decide the query: what should I search the web with? It decides that and
103:49 then it invokes this subworkflow here where we have our call to SearXNG. So the
103:54 name of the container service in our Docker Compose stack is just searxng,
103:59 and it runs on port 8080. And then if you look at the SearXNG documentation, you
104:03 can see how to invoke their API and things like this. So I'm just doing a
104:07 simple search here and then there are a few different nodes because what I want
104:10 to do is I want to split out the search and actually I can show you this by
104:14 going to an execution history where we're actually using this tool. So take
104:18 a look at this. So in this case the LLM decided to invoke this tool and the
104:23 query that it decided is current price of the 5090 GPU. So this is going along
104:28 with the conversation that we had last in Open Web UI. We get some results from
104:33 SearXNG, which is just going to be a bunch of different websites. And so, we don't
104:37 have the answer quite yet. We just have a bunch of resources that can help us
104:41 get there. And so, I'm going to split out. So, we have a bunch of different
104:45 websites. We're going to now limit to just one. I just want to pull one
104:48 website right now just to keep it really, really simple because now we're
104:52 going to actually visit that website. I'm going to make an HTTP request to
104:57 this website, which yeah, I mean, if it's literally an Nvidia official site
105:01 for the 5090, like this definitely has the information that we need. We're
105:04 going to make a request to it, and then we're also going to use this HTML node
105:08 to make sure that we are only selecting the body of the site. So, we take out
105:12 all the footers and headers and all that junk. So, we just have the key
105:15 information. And then that is what we aggregate and return back to our AI
105:19 agent. So it now has the core content of this website to get us
105:24 that answer. That is how we invoke our web search tool.
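For reference, a rough Python equivalent of what this subworkflow does, assuming SearXNG's JSON API (format=json) is enabled on the searxng service at port 8080; the helper name is made up here, and BeautifulSoup stands in for the HTML-extract node:

```python
import requests
from bs4 import BeautifulSoup  # stand-in for n8n's HTML extract node

SEARXNG_URL = "http://searxng:8080"  # service name inside the Docker Compose stack

def web_search(query: str) -> str:
    # Ask SearXNG for results as JSON and keep only the top hit.
    results = requests.get(f"{SEARXNG_URL}/search",
                           params={"q": query, "format": "json"},
                           timeout=30).json()["results"]
    top_url = results[0]["url"]

    # Fetch that page and keep roughly just the body text (an approximation of
    # dropping the headers, footers, and other junk).
    page = requests.get(top_url, timeout=30)
    body = BeautifulSoup(page.text, "html.parser").body
    return body.get_text(separator=" ", strip=True) if body else ""
```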
105:29 And then at the very end, we're just going to set this output field. And that's going to be the
105:33 response that we got back either from like generating a title or calling our
105:37 main agent. And this is really important: the output field, specifically
105:41 whatever we call it here, has to correspond to this
105:46 value, which is the last thing we have to set in the settings for our Open Web UI
105:50 function. So "output" here has to match "output" here, because that is what
105:55 we're going to return in this Respond to Webhook node. Whatever Open Web UI gets back,
105:59 it's getting back from what we return right here. So that is everything for
106:03 our agent. I could probably dive in quite a bit more into explaining how
106:06 this all works and building out a lot more complex agents, which I definitely
106:10 do with local AI in the Dynamis AI agent mastery course. So check that out if you
106:13 are interested. I just wanted to give you a simple example here showing how we
106:17 can talk to our different services like Ollama, Supabase, and SearXNG, and then
106:22 also open web UI as well. So once you have all these settings set, make sure
106:26 of course that you click on save. It's very, very important. These two things
106:29 at the bottom don't really matter, by the way. But yeah, click on save once
106:32 you have all of the settings there. And then you can go ahead and have a
106:36 conversation with your agent just like I did when I was demoing things before we
106:40 dove into the workflow. And by the way, this N8N agent that works with Open Web
106:44 UI, I have as a template for you. You can go ahead and download that in this
106:48 GitHub repository where I'm storing all the agents for this master class. So we
106:52 have the JSON for it right here. You can go ahead and download this file. Go into
106:57 your N8N instance. Click on the three dots in the top right once you've
107:01 created a new workflow. Import from file and then you can bring in that JSON
107:03 workflow. You'll just have to set up all your own credentials for things like
107:07 Ollama and Supabase and SearXNG. But then you'll be good to go, and you can just go
107:11 through the same process that I did setting up the function in Open Web UI,
107:15 and within like 15 minutes you'll have everything up and running to talk
107:20 to N8N in open web UI. Next up I want to create now the Python version of our
107:25 local AI agent. And so this is going to be a one-to-one translation: exactly what
107:30 we built here in N8N, we are now going to do in Python, so I can show you how to
107:34 work with both no-code and code with our local AI package. And so this GitHub
107:39 repo that has the N8N workflow we were just looking at, and that OpenAI
107:43 compatible demo we saw earlier, this has pretty much everything for the agent. So
107:46 most of this repository is for this agent that we're about to dive into now
107:50 with Python. And in this readme here, I have very detailed instructions for
107:54 setting up everything. And a lot of what we do with the Python agent, especially
107:57 when we are configuring our environment variables, it's going to look very
108:01 similar to a lot of those values that we set in N8N. Like we have our base URL
108:05 here, which you'd want to set to something like your Ollama URL with port
108:09 11434. We just need to add /v1 on the end, which I guess is a little bit different,
108:14 but yeah, I've got instructions here for setting up all of our environment
108:18 variables, our API key, which you can actually use OpenAI or Open Router as
108:22 well with this agent, taking advantage of the OpenAI API compatibility. This is
108:27 a live example of this because you can change the base URL, API key, and the
108:31 LLM choice to something from Open Router or OpenAI, and then you're good to go
108:35 immediately. It's really, really easy. We will be using Ollama in this case, of
108:39 course, though. And then you want to set your Supabase URL and service key.
108:42 You can get that from your environment variables. Same thing with SearXNG and
108:47 its base URL. We'll set that just like we did in N8N. We have our bearer token,
108:51 which in our case was "testauth". It's just whatever comes after "Bearer" and the
108:55 space. And then the OpenAI API key you can ignore. That's just for the
108:58 compatible demo that we saw earlier. This is everything that we need for our
109:02 main agent now. And so we're using a Python library called FastAPI to turn
109:09 our AI agent into an API endpoint, just like we did in N8N. And so FastAPI is
109:13 kind of what gives us this webhook, both the entry point and the exit for
109:17 our agent, and then everything in between is going to be the logic where we are
109:21 using our agent. And I'm going to be using Pydantic AI. It's my favorite AI
109:25 agent framework with Python right now. It makes it really easy to set up agents,
109:29 and we'll dive into that here. And I don't want to get into the
109:32 nitty-gritty of the Python code here because this isn't a master class on
109:36 specifically building agents. I really just want to show you how we can be
109:40 connecting to our local AI services. This agent is 100% offline. Like I could
109:46 cut the internet to my machine and still use everything here. So we create our
109:50 Supabase client and the instance of our FastAPI app. I have some models
109:55 here that define the requests coming in and the response going out. So we have
109:59 the chat input and the session ID, just like we saw in N8N. And then the output
110:04 is going to be this output field. And so that corresponds to exactly what we're
110:07 expecting with those settings that we set up in the function in Open Web UI.
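A minimal sketch of what those request and response models look like (class names here are illustrative, not necessarily the exact ones in the repo):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    # Same fields the n8n webhook expected from Open Web UI.
    chatInput: str
    sessionId: str

class ChatResponse(BaseModel):
    # Must line up with the "output" field configured in the Open Web UI function.
    output: str
```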
110:11 So this Python agent is also going to work directly with open web UI. And then
110:16 we have some dependencies for our Pydantic AI agent, because it needs to have an
110:21 HTTP client and the SearXNG base URL to make those requests for the web search
110:25 tool. And then we're setting up our model here. It's an OpenAI model, but we
110:29 can override the base URL and API key to communicate with Ollama or OpenRouter as
110:34 well, like we will be doing. And then we create our Pydantic AI agent, just giving it
110:38 that model based on our environment variables. I've got a very simple system
110:43 prompt, and then the dependencies here, because we need that HTTP client to talk
110:48 to SearXNG. And then I'm just allowing it to retry twice. So if there's any kind
110:51 of error that comes up, the agent can retry automatically, which is one of the
110:55 really awesome things that we have in Pydantic AI. And then I'm also creating a
111:01 second agent here. This is the agent that is going to be responsible, like we
111:05 have in N8N, for handling the metadata for Open Web UI, like conversation titles
111:10 and tags for our conversation. And so it's an entirely separate agent, because
111:14 we just have another system prompt. In this case, I'm just doing something
111:18 really simple here. We don't have any dependencies for this agent, because it's
111:21 not going to be using the web search tool. And then for the model, I'm just
111:25 using the exact same model that we have for our primary agent. But like I shared
111:30 with N8N, you could make it so that this is a much smaller model, like a one
111:33 or three billion parameter model, because the task is just so basic, or maybe
111:38 a 7 billion parameter model. So you can tweak that if you want. Just for
111:41 simplicity's sake, I'm using the same LLM for both of these agents.
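A rough sketch of that setup; the env var names are illustrative, and the exact constructor arguments can differ between Pydantic AI versions (this follows the current provider-based API):

```python
import os
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider

# Point an OpenAI-compatible client at Ollama (or OpenRouter/OpenAI) via env vars.
provider = OpenAIProvider(
    base_url=os.getenv("LLM_BASE_URL", "http://localhost:11434/v1"),
    api_key=os.getenv("LLM_API_KEY", "ollama"),  # Ollama ignores the key, but one must be set
)
model = OpenAIModel(os.getenv("LLM_CHOICE", "qwen2.5:14b-instruct"), provider=provider)

# Main agent: simple system prompt plus automatic retries on errors.
primary_agent = Agent(model, system_prompt="You are a helpful assistant.", retries=2)

# Separate, simpler agent for Open Web UI metadata (titles and tags); no tools needed.
metadata_agent = Agent(model, system_prompt="Generate a short title or tags as requested.")
```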
111:46 And then we get to our web search tool. So in Pydantic AI, the way that you give a
111:51 tool to your agent is you use the decorator @, then the name of your agent, then
111:55 .tool, and then the function that you define below this is now going to be
112:00 given as a tool to the agent. And then the description that we have in the
112:04 docstring here is given as a part of the prompt to your agent, so it knows
112:09 when and how to use this tool. And so the exact, you know, details of how
112:13 we're using SearXNG here, I won't dive into, but it is the exact same as what
112:17 we did in N8N, where we make that request to the search endpoint of SearXNG. We
112:22 go through the page results here. We limit to just the top three results, or I
112:26 could even change this to make it even simpler and use just the top result, so we
112:30 have the smallest prompt possible to the LLM. And then we get the content of
112:34 that page specifically. And then we return that to our AI agent with some
112:40 JSON here. So now once it invokes this tool, it has a full page back with
112:44 information to help answer the question from the user. It has that web search
112:47 complete now. And then we have some security here to make sure that the
112:51 bearer token matches what we get into our API endpoint. So that's that header
112:55 authentication that we set up in N8N. So this part right here where we're
112:59 verifying the header authentication that corresponds to this verify token
113:03 function. And then we have a function to fetch conversation history and a function to store a
113:08 new message in conversation history. So both of these are just making requests
113:12 to our locally hosted Supabase, using that Supabase client that we created
113:16 above.
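A minimal sketch of what those two helpers could look like with the supabase-py client; the n8n_chat_histories table name comes up later in the video, and the column names are assumptions:

```python
import os
from supabase import create_client

# Locally hosted Supabase: URL points at Kong, key is the service role key.
supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"])

def fetch_conversation_history(session_id: str, limit: int = 10) -> list[dict]:
    # Pull the most recent messages for this session from the shared n8n table.
    result = (supabase.table("n8n_chat_histories")
              .select("*")
              .eq("session_id", session_id)
              .order("id", desc=True)
              .limit(limit)
              .execute())
    return list(reversed(result.data))

def store_message(session_id: str, message: dict) -> None:
    # Append a new user or assistant message to the conversation history.
    supabase.table("n8n_chat_histories").insert(
        {"session_id": session_id, "message": message}
    ).execute()
```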
113:23 And then we have the definition for our actual API endpoint. In
113:29 N8N we were using /invoke-n8n-agent as the path to our agent; that was our
113:34 production URL. In this FastAPI endpoint, our path is /invoke-python-agent.
113:39 And then we're specifically expecting the chat input
113:43 and session ID. So that is our chat request type right here. And
113:46 then, sorry, I highlighted the wrong thing. We have our response model here
113:50 that has the output field. So we're defining the exact types for the inputs
113:55 token to protect our endpoint at the start. And then the key thing here, if
113:59 the chat input starts with that "### Task" prefix, then we're going to call our metadata
114:02 agent. And so it's just going to spit out the title or the tags, whatever that
114:06 might be. Otherwise, we're going to fetch the conversation history, format
114:11 that for Pydantic AI, store the user's message so that we have that
114:15 conversation history stored, create our dependencies so that we can communicate
114:20 with SearXNG, and then we'll just do agent.run. We'll pass in the latest
114:24 message from the user, the past conversation history, and the
114:27 dependencies that we created, so it can use those when it invokes the web search
114:31 tool. And then we just get the response back from the agent, and you can
114:34 print that out in the terminal as well and then we'll just store it in
114:38 Supabase, and then return the output field. So I'm going kind of fast here.
114:42 There are definitely a lot more videos on my channel where I break down in more
114:45 detail building out agents with Pydantic AI and turning them into API endpoints
114:49 and things like that. But yeah, I'm just going a little bit faster here.
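Pulling the pieces together, here is a compressed sketch of that endpoint flow, reusing the models, agents, and helpers from the earlier sketches; the wiring is simplified compared to the actual repo:

```python
import os
import uvicorn
from fastapi import Depends, Header, HTTPException

def verify_token(authorization: str = Header(...)) -> None:
    # The header must be exactly "Bearer <token>" with our configured bearer token.
    if authorization != f"Bearer {os.environ['BEARER_TOKEN']}":
        raise HTTPException(status_code=401, detail="Invalid bearer token")

@app.post("/invoke-python-agent", response_model=ChatResponse)
async def invoke_python_agent(req: ChatRequest, _=Depends(verify_token)) -> ChatResponse:
    # Title/tag requests from Open Web UI go to the lightweight metadata agent.
    if req.chatInput.startswith("### Task"):
        result = await metadata_agent.run(req.chatInput)
        return ChatResponse(output=result.output)  # older Pydantic AI versions use .data here

    # Otherwise: load history, store the new message, then run the main agent.
    history = fetch_conversation_history(req.sessionId)  # would be formatted into message history
    store_message(req.sessionId, {"type": "human", "content": req.chatInput})
    result = await primary_agent.run(req.chatInput)  # deps and message history omitted for brevity
    store_message(req.sessionId, {"type": "ai", "content": result.output})
    return ChatResponse(output=result.output)

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8055)  # same port used in the video
```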
114:52 And then the last thing is with any kind of exception that we encounter, we're
114:55 just going to return a response to the front end saying that there was an issue
114:59 and then specifying what that is. And then we are using Uvicorn to host our
115:06 API endpoint, specifically on port 8055. So that is everything for our Python
115:11 agent exactly the same as what we set up in N8N. And now going to the readme
115:16 here, I'll open up the preview. The way that we can run this agent, we just have
115:20 to open up a terminal just like we did with the OpenAI compatible demo. I've
115:25 got instructions here for um setting up the database table, which this is using
115:30 the same table as the one in N8N and N8N creates it automatically. So, if you
115:34 have already been using the N8N agent, you don't actually have to run this SQL
115:38 here. And then you want to obviously set up your environment variables like
115:42 we covered, create and activate your virtual environment, and install the requirements
115:46 there. And then you can go ahead and run the command python main.py.
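Once it is running, you can sanity-check the endpoint from another terminal with a small request like this (the port and path come from the video; the token is whatever you set as your bearer token):

```python
import requests

resp = requests.post(
    "http://localhost:8055/invoke-python-agent",
    headers={"Authorization": "Bearer testauth"},
    json={"chatInput": "Hello!", "sessionId": "test-session"},
    timeout=120,
)
print(resp.status_code, resp.json())  # expect {"output": "..."} on success
```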
115:52 And so this will start the API endpoint. So it'll just be hanging here because
115:56 now it's waiting for requests to come in on port 8055. And so what I can do is
116:03 I can go back to open web UI. I can go to the admin panel functions. Go to the
116:08 settings. I can now change this URL. So everything else is the same. I have my
116:11 bearer token, the input field, and the output field the same as N8N. The only
116:16 thing I have to change now is my URL. And so I know this is an N8N pipe and I
116:20 have N8N in the name everywhere, but this does work with just any API
116:23 endpoint that we have created with this format here. And so I'm going to say, for
116:27 my URL, it's actually going to be host.docker.internal, because I have my API endpoint for
116:33 Python running outside on my host machine. So I need my Open Web UI
116:38 container to go outside to my host machine. And then specifically the port
116:43 is going to be 8055. And then for the endpoint path, I'm going to
116:46 delete this webhook path here, because it's now /invoke-python-agent. Take a look at that. All right.
116:52 Boom. So I'm going to go ahead and save this. And then I can go over to my chat.
116:57 And it says N8N agent connector here still. But this is actually talking to
117:00 my Python agent now. So I'll go ahead and start by asking it the exact same
117:05 question that I asked the N8N agent. And I do have this pipe set up to always say
117:08 that it's calling N8N, but this is indeed calling our Python API endpoint.
117:12 And we can see that now. So there we go. We got all the requests coming in, the
117:16 response from the agent, and then also the metadata for the title and the tags
117:20 for the conversation. Take a look at that. So we got our title here. We have
117:24 our tags, and then we have our answer. It's a starting price of $2,000, which
117:28 is a lot more right now; the starting price is kind of
117:32 misleading, but yeah, this is a good answer. And it did use SearXNG to do
117:36 that web search for us. This is really, really neat. Now, the last thing that we
117:40 want to do for our Python agent before we can work on deploying things to the
117:45 cloud, to a private server in the cloud, is we want to containerize it. Now, the
117:49 reason that we want to do this, and this is the Docker file that I have set up to
117:53 turn our Python agent into a container, just like our local AI services, is if
117:59 we have our agent containerized, then we can have it communicate within the
118:03 Docker network just like we have our different local AI services
118:07 communicating with each other. Because right now running directly with Python
118:12 to communicate with Ollama, for example, we need our URL to be localhost, not
118:18 ollama. Remember, you can only use the specific name of the container service
118:23 when you are within the Docker Compose stack. And so we'd have to actually say
118:28 localhost right now. But if we add the container for the agent into the stack
118:32 as well, then we can communicate directly within the private network.
118:36 Like I can say ollama, and then for SearXNG I could use this URL instead. Right now
118:41 we have to actually use localhost and port 8081.
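In other words, the same services are addressed differently depending on where the agent runs; a small sketch (the ports are the ones mentioned in the video and may differ in your setup):

```python
# Agent running directly on the host (outside the Compose network):
HOST_RUN = {
    "ollama":  "http://localhost:11434/v1",
    "searxng": "http://localhost:8081",
}

# Agent running as a container inside the local AI Compose stack:
IN_STACK = {
    "ollama":  "http://ollama:11434/v1",
    "searxng": "http://searxng:8080",
}
```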
118:46 And so it's really nice, for security reasons and to keep your deployment in a nice package, to have
118:51 the agents that you're running in the same network as your infrastructure. And
118:55 so that's what we're going to do right now. And so within the read me that I
118:59 have for instructions on setting up everything. I have the instructions that
119:02 we follow to run things with Python. I also have the instructions to run it
119:07 with Docker. And so all you want to do is run this single command. It's
119:11 actually very easy because I have this Docker file set up to turn our agent
119:15 into a container and I've got security and everything taken care of. We're
119:19 running it on port 8055 just like we did with Python. And then I have this very
119:24 simple Docker Compose file. It's just a single service that we're going to tack
119:28 on to all of the other services that we already have running for the local AI
119:32 package. And I'm calling this one the Python local AI agent. And so we're
119:36 using all of our environment variables from our .env, just like we did with the
119:40 local AI package. And then what I have at the top here is I am including the
119:45 docker compose file for the local AI package. So that just kind of solidifies
119:49 the connection there. Otherwise, you'll get this kind of weird error that says
119:52 there are orphaned containers when you run this, even though they aren't
119:55 actually. And so this is optional. You'll just get an orphan container
119:58 warning that you can ignore. But if you don't want to have that warning, you can
120:01 include this right here. You just have to make sure that this path corresponds
120:06 to the path to your docker compose in the local AI package. So in my case, I
120:10 just had to go up two directories and then go into the local AI package
120:13 folder. So yeah, you this is optional, but I want to include this here just to
120:18 make things in tiptop shape for you. So yeah, this is the docker compose. And
120:21 then what we can do now is I'll go back over to my terminal and I will paste in
120:25 this command. And what this will do is it will start or restart my Python local
120:31 AI agent container. And make sure that you specify this here because if you
120:35 don't then it's going to try to rebuild the entire local AI package because we
120:38 have this include. So very important you want to just rebuild or build for the
120:43 first time this agent container. And so I'll go ahead and run this and it's
120:47 going to give me those Flowise warnings. So I don't have my username
120:49 and password set, but remember we can ignore those. But anyway, it's going to
120:52 build the Python local AI agent container here. And there's a couple of
120:56 steps that it has to do. It has to update some internal packages and then
121:00 also install all of the pip packages we have in our Python requirements, for
121:04 things like FastAPI and Pydantic AI. So it'll take a little bit to complete,
121:07 usually just a minute or two. And so I'll go ahead and pause and come back
121:11 once this is done. And there we go. Your output should look something like this.
121:14 It goes through all the build steps and then it says that the container is
121:18 started at the bottom. And this is now in our local AI Docker Compose stack. So
121:23 going back over to Docker Desktop. It'll take a little bit to find it here
121:26 because there are so many services that we have here. But if we scroll down,
121:29 okay, there we go. At the bottom here, we have our Python local AI agent
121:34 waiting on port 8055 just like when we ran it directly with Python, but now it
121:38 is within a container that is directly within our stack. And so now, like I was
121:41 saying, this is super important, so I'm hitting on it again. Now when we set up
121:45 our environment variables for our container, we are going to be
121:49 referencing the service names of our different local AI services, like searxng or
121:55 ollama, instead of localhost. And so this whole localhost versus
121:58 host.docker.internal versus service-name thing is what I see people get tripped up on
122:03 the most when they're configuring different things in a Docker environment
122:06 like the local AI package. That's why I'm spending a good amount of time
122:09 really hammering that in because I want you to get it right. And of course, if
122:12 you have any issues that come up with this, just let me know. I'd love to help
122:16 walk you through what exactly your configuration should look like. And so,
122:20 we have our agent now up and running in a container. And I'm not going to go and
122:23 demo this again right now because the next thing that we're going to move into
122:26 doing and then I'll give a final demo here is deploying everything to a
122:30 private server that we have in the cloud. All right. So, we have really
122:34 gotten through all of the hard stuff already. So, if you have made it this
122:38 far, congratulations. You really have what it takes to start building AI
122:43 agents with local AI now and the sky is the limit for what you can accomplish.
122:47 And so the last thing that I want to really focus on here in this master
122:50 class is taking everything that we've been building on our own computer with
122:54 our infrastructure and our agents and deploying it to a private machine in the
122:59 cloud because then we can have our entire infrastructure and agents running
123:02 24/7. We don't have to rely on having our own computer up all the time. It's
123:05 really nice to have it there because then we can also share it with other
123:10 people as well. So local AI is still considered local as long as it's running
123:14 on a machine in the cloud that you control. And so this is not just going
123:18 to OpenAI or Supabase and paying for their API. This is still us running
123:22 everything ourselves on a private server. That's what I'm going to show
123:25 you how to do right now. And this process that I cover with you is going
123:29 to work no matter the cloud provider that you end up choosing. And there are
123:32 some caveats to that that I'll explain in a little bit. But yeah, you can pick
123:36 from a lot of different options. So the cloud platform that we will be deploying
123:41 to today is Digital Ocean. I use Digital Ocean a lot. It's where I deploy most of
123:46 my AI agents. So I highly recommend it. And the best part about Digital Ocean is
123:51 they have both GPU machines if you need to have a lot of power for your local
123:56 LLMs and they have very affordable CPU instances. If you want to deploy
124:00 everything for the local AI package except Olama, you can definitely go a
124:03 more hybrid route if you don't want to pay a lot because these GPU instances in
124:07 the cloud can be pretty expensive like one, two, even $5 per hour. So, what you
124:12 can do with a hybrid setup is deploy everything in the local AI package. So,
124:16 at least you have all that locally and you're not paying for those
124:19 subscriptions. But then you could still use something like OpenAI, Open Router
124:23 or Anthropic for your LLMs. So, Digital Ocean gives us the ability to do both,
124:27 and we'll dive into that when we set things up. Another really good option
124:33 for GPU instances is TensorDock. TensorDock is not as nice looking to me as
124:37 Digital Ocean. I generally feel like I have a better experience with Digital
124:40 Ocean, but I have deployed the local AI package to TensorDock before on a 4090
124:46 GPU that they offer for 37 cents an hour. It's very affordable for GPU
124:49 instances. And so, this is a good platform as well. And then also if
124:55 you're okay with not running Ollama on a GPU instance, like you just want a very
124:59 affordable way to host everything in the local AI package except the LLMs, then
125:03 you can use Hostinger. Hostinger is another really, really good option. Super,
125:08 super affordable, like $7 a month for a KVM 2 plan, which I'd recommend getting if you
125:14 want to deploy everything except Ollama, because the requirement for the local AI
125:17 package, aside from running the more resource-intensive local LLMs, is that you have
125:23 to have 8 GB of RAM. So don't get a cloud machine that has 4 or 2 GB. You
125:27 want to have 8 GB of RAM, then you'll be good to go. So you can literally do it
125:31 for $7 a month through Hostinger and it's going to be something like $28 a
125:35 month through Digital Ocean unless you want a GPU instance. So I just want to
125:39 spend a couple minutes talking about different platform options. The one
125:43 thing I will say is that the local AI package runs as a bunch of Docker
125:47 containers, right? And so what you have to avoid is using a platform like
125:54 RunPod. So RunPod is a platform for running local AI. The problem is when
125:59 you pay for a GPU instance, you don't actually get the underlying machine.
126:05 You're just sshing into a container. So, you're accessing a container. And I'll
126:09 just save you the pain right now. It is so hard and basically impossible to run
126:15 Docker containers within Docker containers. So, you really can't run the
126:20 local AI package on RunPod. There are other platforms as well like Lambda Labs
126:25 is another one that I've used before, not for the local AI package but for other
126:28 things, but this also runs containers; you're accessing a container, so you
126:34 can't do the local AI package. Vast.ai is another option, but there as well
126:39 you're renting a GPU while accessing a container. So, you again
126:43 can't run the local AI package. And so, based on the platform that you choose,
126:47 you have to make sure that you are accessing the underlying machine when
126:51 you rent a GPU instance like Digital Ocean is the one that I will be using.
126:56 You could use a GPU instance through the Google cloud or Azure or AWS if you want
127:00 to go more enterprise. Those all give you access to the underlying machine.
127:04 It's your own private server just like we have in Digital Ocean. So you can use
127:07 that to deploy the local AI package and the agent that we built with Python. So
127:11 that's what we're going to do right now. So once you are signed into Digital
127:14 Ocean and you have your profile and billing set up and you have a project
127:18 created or you can just use the default one, now we can go ahead and create our
127:22 private server in the cloud to host the local AI package and our agent. And so
127:25 you can click on create in the top right. And there are two options here.
127:29 If you want a CPU instance, so that hybrid approach where you're hosting
127:34 everything except for the LLMs, you can select a droplet. Otherwise, what we're
127:37 going to be doing right now, so I can demo the full thing, is we will
127:42 create a GPU droplet. Now, these are going to be more expensive. Like I said,
127:48 like running an H100 GPU is $3.40 an hour. It's pretty expensive. But like I
127:52 said at the start of this master class, I know so many businesses that are
127:56 willing to put tens of thousands of dollars per year into running their own
127:59 infrastructure and LLMs. And that biggest cost that contributes to that
128:03 being tens of thousands of dollars is having a GPU droplet that is running 24/7
128:09 in the cloud. So the hybrid approach I definitely recommend if you don't want
128:12 to pay more, you could go as low as $7 a month with Hostinger. So, there's a very
128:16 wide range of options for you depending on what you want to pay hosting the
128:21 package, LLMs, and your agents. And the other thing I will say is that if you
128:25 want to, you could just create this instance for a day and poke around with
128:29 things and then tear it down. So, you only have to pay, you know, like 20
128:31 bucks or something like that. So, there's a lot of different options for
128:35 flexibility here. And so, I'm going to pick the Toronto data center because
128:38 there's more options for GPUs available here and it's relatively close to me.
128:42 And then for the image, I'll select AI/ML ready and it's recommended because
128:47 you get the Linux bundled with all the required GPU drivers and it does run the
128:52 Ubuntu distribution of Linux. And so this process that I'm going to walk you
128:55 through for deploying local AI package to the cloud is going to work for any
129:00 Ubuntu instance that you have running on AWS or Hostinger or Tensor Do. It's just
129:05 a very standard distribution of Linux. And then for the GPU, there are a couple
129:09 of different options that we have here with Digital Ocean. H100 is an absolute
129:15 beast: 80 GB of VRAM, so it could easily run Q4 large language models over 100
129:21 billion parameters, and it even comes with 240 GB of RAM. So, I'm not going to run this one. I'm
129:24 just kind of pointing out that this is an absolute beast. I think the one that
129:27 I'm going to choose here is going to be the RTX 6000 Ada. So, it's 48 GB of
129:34 VRAM. So, this is enough to run 70 billion parameters or smaller of LLMs at
129:40 a Q4 quantization and it comes with 64 GB of RAM and it's going to be about
129:45 $1.90 per hour. So, I'm going to select this and then I have an SSH key that is
129:50 created already. If you don't have an SSH key, then you can click on this
129:54 button to add one. And then you can follow the instructions on the right
129:57 hand side here. No matter your OS, they got instructions to help you out. You
130:00 just have to paste in your public key and then give it a name. So, I've got
130:03 mine added already. And then the only other thing that I really have to select
130:07 here is a unique name. So, I'll just say local AI package. And I'll just say GPU
130:11 because I already have the regular version just deployed on a CPU instance.
130:16 And then for my project, I will select Dynamis. I can add it along with my
130:19 other instance that I've got up and running. And there we go. So now I can
130:22 go ahead and just click on create GPU droplet. It is that easy to get our
130:27 instance ready for us to access it and start installing everything and getting
130:30 everything up and running just like we did on our computer. And so I'll go
130:34 ahead and pause and come back once our machine is created in just a few
130:37 minutes. And boom, there we go. Just after a minute, we have our GPU droplet
130:42 up and running. And so the one thing I will say is I had to request access to
130:46 create a GPU instance on the Digital Ocean platform. However, they approved
130:50 it in less than 24 hours. So, it's very easy to get that if you do want to
130:53 create a GPU instance. Otherwise, you can just create one of their normal
130:58 droplets, one of their CPU instances. Now, before we connect to this machine,
131:01 the one thing that you want to take note of is the public IPv4 address. We'll use
131:06 this to set up subdomains in a little bit. And so, this is how we get to it
131:12 for a GPU droplet. For a CPU droplet, it looks a little bit different. You'll
131:15 usually see the IPv4 somewhere at the top right here. And so, take note of
131:18 that. Save it for later. We'll be using that in a little bit. And then to
131:22 connect to our droplet, we can either do it through SSH with our IPv4 and the SSH
131:27 key that we set up when we configured this instance or we can access the web
131:31 console. For a CPU instance, usually you go to like an access tab and then you
131:35 can launch the web console. For the GPU instance, I can just click this to
131:38 launch it right here. So we have a separate window that comes up and boom,
131:42 we now have access to our instance. It is that easy to get connected. And now
131:46 we can go through the same process that we did on our computer to install the
131:51 local AI package. Now there are some different steps that we have to take and
131:55 that's why I'm including this at the end of the master class especially. Um so if
131:59 you scroll down in the read me here there are some specific instructions for
132:03 deploying to the cloud. And so you have to make sure that you have a Linux
132:07 machine, preferably on the Ubuntu distribution, which is what we
132:11 are using. And then there are a couple of extra steps. And so the first thing
132:15 that we have to set up is our firewall. We have to open up a couple of ports so
132:20 that we can access our machine from the internet and set up our subdomains for
132:24 things like N8N and Open Web UI. And so you want to take this command, ufw
132:29 enable, and I'll just go ahead and paste it in. You can
132:33 just type Y to continue here. It warns that it may disrupt SSH connections, but we don't
132:36 really care. So ufw enable turns on the firewall. And then you want
132:41 to copy this command to allow both ports 80 and 443. And so 80 is HTTP and then
132:48 443 is HTTPS. And then you can just do the last command here, UFW reload. So
132:54 now we have those ports available for us to communicate with all of our services.
132:59 And so this is the entry point to Caddy. Caddy is the service in the local AI
133:03 package that is going to allow us to set up subdomains for all of our services.
133:08 And so any kind of communication to our droplet is going to go through Caddy,
133:13 and then Caddy will distribute it to our different services based on the port or
133:17 the subdomain that we are using. So this is called a reverse proxy. You've
133:21 probably heard of Nginx or Traefik before. Caddy is something very
133:25 much like that. And it also makes it so that we can get HTTPS so we can have
133:29 secure endpoints set up automatically and it manages that encryption for us.
133:33 It's a very beautiful platform. So, let me scroll back down to the specific
133:36 steps for deploying to the cloud. There is a quick warning here about how Docker
133:40 manages ports and things, but this is as secure as we possibly can make it. So,
133:43 trust me, we've put a lot of effort and I actually had someone from the Dynamis
133:48 community, um, Benny right here. He actually helped me a lot with security
133:53 for the local AI package. So, thank you, Benny, for helping out with that. We're
133:56 really making sure, because the whole point of local AI is that you want to
133:59 be private and secure, that this package handles
134:03 all the best practices for that. So it's very much top of mind for us. And then we
134:07 can go ahead and go through the usual steps for setting up the local AI
134:11 package. The only other thing we have to do that's unique for cloud is we have to
134:16 set up a records for our DNS provider so we can have our subdomain set up for our
134:20 different services. So we'll get into that in a little bit. But first I just
134:24 want to get the local AI package up and running. And so I'm going to go ahead
134:28 and paste this command here to clone the repository. Git comes automatically
134:33 installed with our GPU droplets. And then I can change my directory into the
134:38 local AI package. And let me zoom in on this a little bit here so it's very easy
134:41 for you to see, because what I want to do now is copy
134:45 the .env.example file to a new file called .env. Then if I do an ls command, we can see all the
134:51 files that are available in our directory. I guess it doesn't show the dotfiles, so
134:56 I do ls -a there. Now we can see the .env and the .env.example. And so now I can do nano .env.
135:05 This is going to give us a basically a text editor directly in the terminal
135:08 here so that we can set all of our environment variables just like we did
135:12 with the local AI package on our computer. So this time I'm not going to
135:15 go into the nitty-gritty details of setting up all these environment
135:19 variables because it is the exact same process. In fact, you can literally
135:23 reuse all of the secrets that you already set up when you hosted it on
135:27 your own computer as long as those are actually secure. Like I know a lot of
135:29 times you might just do some kind of placeholder stuff when it's just running
135:32 on your computer and then you want it more secure in the cloud. So make sure
135:36 you have real values for everything, but yeah, you can reuse a lot of the same
135:39 things. Um though for best security practice, you probably do want to make
135:42 everything different. But yeah, the one thing that I want to focus on with you
135:46 here that does change is our configuration for Caddy. Now that we are
135:50 deploying to the cloud, we want to have subdomains for our different services.
135:54 And so like N8N, for example, we want to set the hostname for that. You want to
135:59 do the same thing for Open Web UI and Supabase, and then your Let's Encrypt
136:04 email. Obviously you want to uncomment that as well, because this is the email
136:08 that you want to set for your SSL encryption. And so I'm just going to do
136:13 cole@dynamis.ai. And you can just set this to whatever email you want to use. And so
136:18 basically what you want to do here is just uncomment the line for each of the
136:23 services that you want to have subdomains created for. So, if you're
136:26 also using Flowise, which I'm just not in this master class here, but if you
136:30 are, then you want to uncomment this line as well. If you're using Langfuse,
136:34 then you uncomment this line. I'm going to leave them commented right now just
136:38 for simplicity's sake. The two that I would generally recommend never
136:43 uncommenting are Ollama and SearXNG. We don't really want to expose them through
136:45 a subdomain because we're just going to use them as internal tooling for our
136:49 agents and applications that we have running on this server. So, we want to
136:53 keep those nice and private. But for everything that we do want to expose
136:56 that is protected with a username and password, like N8N, Open Web UI, and
137:01 Supabase, we can uncomment those. And so we've got that set up now. But we have to
137:05 obviously provide real values for them as well. And so, for example, I'm just
137:09 going to say n8nyt, the yt for YouTube, on dynamis.ai. So you want to
137:17 define your exact URL that you want to have for this domain. Obviously, it has
137:21 to be a domain that you control because we'll go and we'll set up the records in
137:25 the DNS in a little bit. And so, I'll do the exact same thing for open web UI.
137:28 So, it's open web UI and I'm just doing YT because I already have the local AI
137:33 package hosted on my domain. And so, I can't do just open web UI because it's
137:36 already taken. So, openwebuiyt.dynamis.ai. And then finally, for Supabase, it'll
137:44 be supabaseyt.dynamis.ai. Boom. All right. There we go. So, go
137:47 ahead and take care of this. set up the rest of your environment variables and
137:51 then we can go ahead and move on. And the way that you exit out of this and
137:55 save your changes is you do Ctrl+X (or Command+X on Mac), type Y, and then hit
138:03 Enter. So again, that is Ctrl+X, then type Y, then press Enter. That is how you
138:08 exit out. And so now if I do a cat of the .env, this is how you can print it out
138:11 in the terminal, so we can verify that the changes that we made, like everything
138:16 for Caddy, are indeed there. So do that, and also change all the other environment
138:18 variables as well. I'm going to do that off camera and come back once that is
138:22 taken care of. All right. So my environment variables are all set. Now
138:26 the very last thing we have to do before we start all of our services is we need
138:32 to set up DNS. And so remember, copy your IPv4 address and then head on over
138:38 to your DNS provider. And so this process is going to look very similar no
138:42 matter the DNS provider that you have. Like I'm using Namecheap here. A lot of
138:47 people use Hostinger or Bluehost. You are able to with all these providers go
138:50 to something that is usually called like advanced DNS or manage DNS and then you
138:55 can set up custom A records here which we're going to do to set up a connection
139:01 for all of our subdomains to the IPv4 of our digital ocean droplet or whatever
139:05 cloud provider you are using. And so I'm going to go ahead and click on add new
139:09 record. It's going to be an A record for the host. It's going to be the subdomain
139:14 that I want. So, N8NYT for example. And then for the IP address, I just paste in
139:19 the IPv4 of my Digital Ocean GPU droplet. And then I'll go ahead and
139:23 click on the check here to save changes. And then I'll just go ahead and do the
139:28 same thing for openwebuiyt; I can't forget the yt. There we go. And
139:33 then for you, it might be more than just three, but for me, the only other one
139:36 that I have here right now is Supabase, because I'm just keeping it very, very
139:41 simple. So, supabaseyt, and then paste in the IP again. Okay, there we go.
139:45 Boom. All right, so we have all of our records set up. And it's very important
139:49 to do this before you run things for the first time because otherwise Caddy gets
139:52 very confused. It tries to use these subdomains that you don't actually have
139:57 set up yet. And so take care of that. Then we can go back into our instance
140:01 here and run the last command. And so going back to the readme here, if I
140:06 scroll all the way down to deploying to the cloud, there is a specific parameter
140:11 that you want to add. This is very important for deploying to the cloud
140:15 because when you select the environment of public, it's going to close off a lot
140:20 more ports to make this very very secure. So any of the services that you
140:25 access from outside of the droplet, it has to go through caddy. So we use the
140:30 reverse proxy as the only entry point into any of our local AI services. This
140:34 is how we can make things as airtight as possible. We have security in mind like
140:39 I said. And so make sure that you run this command with the environment
140:43 specified. We didn't do this locally because when we are running things
140:45 locally, we don't care about security as much because it's not like our machine
140:49 is accessible to the internet like a cloud server is. And so it defaults to
140:54 the environment of private which just doesn't do as much security stuff. And
140:57 so go ahead and run this. And then of course make sure that you're using the
141:01 correct profile. So if you're using just a CPU instance with a regular droplet or
141:05 hostinger or whatever, you'd want to change this to CPU instead of GPU
141:10 Nvidia. But in our case, because we are paying about $2 an hour for a killer GPU
141:16 droplet, I can go ahead and run this command with the profile of GPU Nvidia.
141:20 Now, I left this error in here intentionally because I want to show you
141:24 what it looks like. If you get "unknown shorthand flag: 'p'", that means that you
141:29 don't actually have docker compose installed. And this happens for some
141:32 cloud providers. And there's a very easy fix for this that I want to walk you
141:35 through. So you can even test this. Just docker compose. It'll say that compose
141:40 is not a docker command. And so going back to the readme here, I have a couple
141:44 of commands that you just have to run if this happens to you. This is at the
141:47 bottom of the deploying to the cloud section of the readme. So you can just
141:51 copy these one at a time, bring them into your droplet or your machine,
141:55 wherever you're hosting it, and go ahead and run them. And so I'm just going to
141:58 do this off camera really quickly. I'm just going to copy each of these into my
142:02 droplet. It's very easy. You can just run all of these. They're really fast as
142:06 well. So none of them are going to take very long. This is just going to get
142:10 everything ready for you so that Docker Compose is a valid command. So you can
142:14 then run the start services script. So there we go. All right. I went ahead and
142:18 ran all of those. I'll clear my terminal again and then go back to the main
142:23 command here to start our services. Boom. All right. So now we pulled
142:26 everything from Supabase, set up our SearXNG config. Now we are pulling our
142:31 Supabase containers. So again, it's the same process as running on our computer, where
142:35 it'll pull Supabase, it'll run everything for Supabase, then it'll
142:38 pull and run everything for the rest of our services. So I'll pause and come
142:42 back once this is all complete. And boom, there we go. We have all of our
142:46 services up and running. You should see green check marks across the board like
142:51 this. We are good to go. And we don't have Docker Desktop, so it's not as easy
142:55 to dive into the logs for our containers. But one quick sanity check
142:59 that you can do just in the terminal is run the command docker ps -a. This will
143:03 give you a list of all of our containers that are running here. We can make sure
143:06 that all of them are running, that we don't see any that are constantly
143:10 restarting or ones that are down. So we do have two that are exited, but these
143:14 are the n8n import and the Ollama pull containers. These are the two that we know should be
143:17 exited. Just make sure that everything is good to go. Then we can head on over
143:22 to our browser. And because we have DNS set up already and we configured Caddy, we
143:25 can now navigate to our different services. Like I can go to
143:30 n8nyt.dynamus.ai. And boom, there we go. It's having us set up our owner account. Or we can
143:39 just go to openwebuiyt.dynamis.ai. Boom. And there is our Open Web UI. All
143:43 right. So I'll go ahead and get started. We'll have to create our account.
143:46 I'll do this off camera, but yeah, you just create your first-time accounts for
143:49 everything. And then we'll do the same thing for Supabase: supabaseyt.dynamus.ai.
143:55 And boom, there we go. So, all of our services are up and running. And so now
143:59 we can log into these and create our accounts and we can interact with our
144:03 agents and bring them in. We can work with Ollama in the same way. And so,
144:06 let's go ahead and do that. I'll just go ahead and create these accounts off
144:09 camera. So, I've got my accounts created for N8N and then also open WebUI. And
144:13 you can do the same for all the other accounts you might have to create for
144:16 things like Langfuse as well. And then within Open Web UI, we'll go to the
144:21 admin panel, settings, connections. Make sure that your Ollama API is set
144:25 correctly to reference the service name ollama. Usually this will default to
144:30 localhost or host.docker.internal, so you'll want to change that there. You have to set
144:34 the OpenAI API key as well, just to any kind of random value. It's just a little
144:38 bug in open web UI. Then click on save and then you can go back and your models
144:41 will be loaded. Now, one thing I found with Open Web UI: after you change
144:46 the Ollama base URL, you have to do a full refresh of the website. Otherwise,
144:49 you'll get an error when you use the LLM. So, just a really small tangent
144:52 there, a little tidbit, but yeah, we'll go ahead and select the model
144:55 that's pulled by default. And you can pull other ones into your Ollama
144:57 container as well, like we already covered. You don't have to restart
145:00 things. So, let's run a little test. I'll just say hello, and we'll see if we
145:04 can load the model now. And boom, look at how fast that was, because we have a
145:08 killer GPU instance right now. I could run much larger LLMs if I wanted to. Um,
145:14 so yeah, let's see. What did I just say? All right, we'll do another test here.
145:17 And yeah, look at how fast that is. It's blazing fast because everything is
145:20 running locally on the same infrastructure. There's no network
145:24 delays. And so we have a powerful GPU, no network delays. We get some blazing
145:28 fast responses from these LLMs right now. And so I don't want to go and test
145:34 everything with N8N again. But what I do want to show you how to do right now is
145:39 take the Python agent that we have in this repository and deploy this onto the
145:44 cloud as well. Adding it into the local AI Docker Compose stack just like we did
145:49 on our computer, but now hosting it all in the cloud. So that's the very last
145:52 thing that I want to cover with you for our cloud deployment. And so just like
145:56 with the local AI package, we can follow the instructions here in the readme to
145:59 get everything up and running on our machine. And so the first thing we have
146:03 to do is we need to clone our repository. And so I'm going to copy
146:07 this command, go back over into my terminal here for my instance. And I
146:11 step back one directory level by the way. So I'm now at the same place where
146:14 I have the local AI package, so we can run them side by side. So I'll paste
146:19 this command to clone ottomator-agents. And then I can cd into it. And then I
146:22 also want to change my directory into the Python local AI agent folder specifically.
146:28 And so now, doing an ls -a, we can see the .env.example. So I'm going to, just like
146:33 we did before, copy this and turn it into a .env. And then I can do nano .env.
146:39 And there we can edit all of our environment variables. And so because
146:42 we're running this in the Docker container, attaching it to the local AI
146:46 stack, the way that I reference Ollama is going to be just calling out the service
146:51 name. So it's http://ollama, port 11434, /v1. And then the API key is just that placeholder there for Ollama. For the
146:58 LLM choice. If I want to get the exact ID of one that I already had pulled,
147:01 I'll actually show you how to do this really quick. So, I'm going to do
147:06 Ctrl+X, Y, Enter to save and exit. And then the way that you can exec into a
147:12 container is docker exec -it, then the name of our container, which
147:15 is ollama (we already have this running), and then /bin/bash. And so what this is going to do is now,
147:22 instead of being within our machine, we are within our Ollama container. And so
147:28 now I can run the command ollama list, and then I can see the LLMs that I have
147:33 available to me. So I have Qwen 2.5 7B. So I'm going to go ahead and copy this
147:37 ID. I don't have it memorized. So this is my way to go and reference it really
147:41 quickly. And then this is also how you can access each of your containers when
147:44 you don't have Docker Desktop. You just do docker exec -it, the name of your
147:50 container, and then /bin/bash. So it's kind of like how we had that exec tab in Docker
147:53 desktop. And then once I'm done in here, I can just do exit. And now I'm back
147:58 within my host machine, my GPU droplet. So that's another little tidbit, another
148:02 golden nugget I wanted to give you there. But yeah, we'll go back into our
148:05 environment variables here. And I have that ID for Qwen 2.5 7B copied. So I'll
148:12 paste that in. And boom, there we go. And then for our Supabase URL, it's
148:17 going to be http:// and then kong. So I guess I
148:21 should have been more clear on this when I set things up locally, so I'll be sure
148:24 to update the documentation for this, but it's going to be kong, port 8000,
148:29 because Kong is the API gateway service in Supabase, which also fronts the
148:33 dashboard. And then the service key, well, I'm just going to go ahead and get
148:37 that from my local AI package, because I have this set up in its environment
148:40 variables. And so I just have to go and reference my environment variables there
148:46 to get my service role key. And boom, there we go. Okay. So I'm going to go
148:49 ahead and paste this in. And um now I'm just going to delete this instance
148:51 after. I don't really care that I'm exposing this right now. And then for
148:57 the SearXNG base URL, it's http://searxng, port 8080. And then for my bearer token, I
149:01 just have it set to testauth. And then we don't need to set the OpenAI API key
149:05 because that was just for the OpenAI compatible demo earlier. So that is all
149:09 of my configuration for this container. I know I need to be clearer on this, so I'll update the docs, but otherwise we are looking good for our environment variables. So Ctrl+X, Y, Enter to save. You can do a cat .env just for that sanity check to make sure that everything's saved. We are looking good.
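Putting those values together, the .env ends up looking roughly like the sketch below. The variable names are illustrative (use whatever names the .env.example actually defines), and the service-name URLs only resolve because the agent joins the same Docker network as the local AI package.

```bash
# Hypothetical .env sketch -- variable names are illustrative, values follow the walkthrough.
# Services are reached by their compose service names because the agent shares
# the local AI package's Docker network.
LLM_BASE_URL=http://ollama:11434/v1      # Olama's OpenAI-compatible endpoint
LLM_API_KEY=ollama                       # placeholder; Olama needs no real key
LLM_CHOICE=qwen2.5:7b-instruct           # exact tag copied from `ollama list`
SUPABASE_URL=http://kong:8000            # Kong fronts the Superbase dashboard/API
SUPABASE_SERVICE_KEY=<service-role-key-from-the-package-.env>
SEARXNG_BASE_URL=http://searxng:8080
BEARER_TOKEN=<your-agent-api-token>
```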
149:27 All right. And so now within the readme here, so I'll go back to the
149:30 instructions. We did all this already. We changed our directory. We set up our
149:33 environment variables and configured them. Now we need to run this stuff in
149:38 our SQL editor in Superbase. This is how we can get our table set up because we
149:42 haven't run things with N8N first. So we don't have this table created already.
149:47 And so now I just have to sign into Superbase here. So I've got my username
149:51 which is Superbase. And then I'm just copying and pasting the username and password that I have set here in my environment variables. And so
150:00 I'll go to my SQL editor and go back here. I know I'm moving kind of quick
150:03 here, but I got these instructions laid out in the readme. I'm going to paste
150:08 this like that and go ahead and click on run. And then boom, there we go. So now
150:12 if I go into tables and search for n8n, we have n8n chat histories, a new, currently empty table. All right, looking good. And then going back after
150:21 we do that, now we can run the agent. So I just have to take this command right
150:24 here and then I'll go back to my droplet. And the one thing that I
150:28 mentioned earlier, but I want to cover again. Hold on, I need to change my directory back first: automator agents and then python local AI. If I go into my docker compose file, you have to make sure that the include path is correct. I'm going to update this by the time you get your hands on it, where it's just going to be going two levels back. That's what we need to do. So, make sure that we reference the right path to the local AI package on our machine, and then Ctrl+X, Y, Enter to save. That's because we have to go back from the Python local AI agent, then back from the automator agents directory, and then within that same directory, we have the local AI package.
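To make that relative path concrete, here's the assumed layout and a quick check that the include path will actually resolve. Folder names are illustrative; match them to wherever the two repos live on your machine.

```bash
# Assumed directory layout (names are illustrative):
#   ~/local-ai-packaged/                 <- the local AI package
#   ~/automator-agents/
#       python-local-ai-agent/           <- this agent's docker compose lives here
#
# From inside python-local-ai-agent/, "two levels back" is ../.. --
# sanity check that the package's compose file is really there:
ls ../../local-ai-packaged/docker-compose.yml
```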
151:04 So, we're good to go. Now, I can go ahead and paste in the command here to
151:08 build our agent and include it in the local AI stack. And so, it's going to
151:11 have to build everything. Takes a minute like we saw already. So, I'll pause and
151:15 come back once this is done. And all right, there we go. About 30 seconds
151:18 later and we are good to go. So, now I can do the docker ps -a again. And this
151:23 time if I look through this list very carefully, take a little bit here, I
151:28 should be able to see my Python agent. There we go. Local AI, Python, local AI
151:32 agent. And it starts with local AI because it is a part of that docker
151:36 compose stack. And by the way, I can do docker exec -it, then the Python local AI agent container, and then /bin/bash, and I can run this one as well. And then if I do a printenv command, I can see all the environment variables that are set within this container. That's everything that we set up in the .env. So I'm being very comprehensive with this master class, showing you how you can tinker around with different things like accessing your containers and seeing the environment variables, making sure that everything we specified in the .env is actually taking effect here. And sure enough, it is. So we are looking good.
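The same check from the host, in command form. The container name is a placeholder; grab the real one from docker ps -a first.

```bash
# Find the agent container's actual name, then dump its environment
docker ps -a --format '{{.Names}}'                    # look for the python local AI agent entry
docker exec -it <python-agent-container> /bin/bash    # open a shell inside it
printenv | sort                                       # inside: list every environment variable
exit                                                  # back to the host when done

# Or without an interactive shell:
docker exec <python-agent-container> printenv
```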
152:13 So I'll go ahead and exit. We're back in our root machine now. We have our
152:18 container up and running, and it's running on port 8055. And so now we can go back to Open Web UI on our subdomain and we can set up our pipe. And so I'm
152:30 going to go to the admin panel functions. We don't have a function
152:33 here. So I have to import it. And what I can do, and I'll actually do this here, is literally Google "n8n pipe open web UI", and it'll bring you to the one that I have published. You just have to sign into Open Web UI. I'll click on get. And then this time, for my URL, instead of being something on localhost, I'm going to copy my actual subdomain here. So import to Open Web UI,
152:53 and then boom, we have our pipe. So I'll click on save, confirm, and then within
152:59 the valves here, I can set all of my values. So I'll just click on default
153:02 for all of these so I can get a starting point here. And then, yeah, chat input is good, output is good, the bearer token is test off, and then for my URL it's going to be http:// plus the name of my service, python-local-ai-agent, port 8055, and then /invoke-python-agent. I believe I have this memorized; I think we are good there. So, going back, if I clear this and run a docker ps -a, it is indeed called python-local-ai-agent, which is the name of our service, so Open Web UI is able to connect to the agent directly with this name because we are deploying it in the same Docker network. And so I think we are looking good. All right, so I'm going to go ahead and click on save, then go back and start a new chat. And also, like I said, a lot of times it helps just to refresh Open Web UI completely. All right, there we go. And then now, oh, I have to
154:07 actually enable. Let me go back to the admin panel. Functions, you have to make
154:11 sure this is ticked on. Um, so that we have the pipe enabled. Now going back
154:15 here and I'll refresh as well. We've got our pipe selected. And now I can say
154:20 hello. And there we go. Super super fast. We got a response from our Python
154:26 agent. Take a look at that. And then also, going into my database here, you can see that I have all these messages in the n8n chat histories table. All right. And then we can also ask it to do web search. I can say something like: what is the latest LLM from Anthropic, for example? So it has to do a quick SearXNG search, leveraging that. The latest is Claude Opus 4. All
154:49 right. And man that was so fast. We have no network delays now because everything
154:53 is running on the same network and we have an absolute killer GPU. So this is
154:56 so cool. Also, one thing that I want to mention is sometimes depending on your
155:02 cloud provider, SearXNG will not start successfully. There's one thing you have to do; it's just a really small tidbit. If you run into this issue where the SearXNG container is constantly restarting, what you want to do is go to your local AI package and then run the command chmod 755 searxng. That's the SearXNG folder, and it's responsible for storing the configuration that SearXNG generates by default. Sometimes the container doesn't have permission to write that config, and it needs to. So I'm going to update the troubleshooting section to include this, but yeah, just a small tidbit.
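In command form, the fix and restart look roughly like this. The paths and the start command are assumptions; use whatever your package's README actually says.

```bash
# Run from the local AI package folder (path is illustrative)
cd ../local-ai-packaged
chmod 755 searxng        # let the SearXNG container write its generated config here

# Then restart the stack so the change takes effect -- use the start command from
# the package's README, or restart just the one service with Docker Compose:
docker compose restart searxng
```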
155:37 And then you can just go ahead and run the command to start everything again (obviously you have to go back one directory first) and restart everything. It's that easy to restart things to make changes take effect for your package, and then you'll be good to go. So yeah, we have everything working
155:54 here. So this is pretty much it for the master class. Now we have our local AI
155:58 package up and running with an agent and the network as well. We're communicating
156:02 to it within Open Web UI directly. There is so much that we have gotten through
156:07 now. So congratulations for making it this far.
3:26 And if you're interested in mastering more than just local AI, but building entire AI agents
3:31 in a local AI environment or even with cloud AI, definitely check out
3:35 dynamis.ai. This is my community for early AI adopters just like yourself.
3:40 And a big part of this community is the AI agent mastery course where I dive
3:45 super deep into my full process for building AI agents. I'm talking planning
3:50 and prototyping and coding and using AI coding assistants and building full
3:55 frontends for our AI agents and securing things and deploying things. There's a
3:59 lot more coming soon for this course as well. I'm very actively working on it.
4:03 And a big part of this course is the complete agent that I build throughout
4:07 it. I build both with cloud AI and local AI. And so this master class will help
4:11 you get very very comfortable with local AI. But when it comes to building
4:15 complex agents and really getting deep into building out agents, then you
4:19 definitely want to check out the AI agent mastery course here in Dynamis.AI.
4:23 So with that, let's get back to the master class, diving into what local AI
4:28 is all about. Let's start by laying the foundation. What is local AI in the
4:32 first place? Well, very simply put, local AI is running your own large
4:36 language models and infrastructure like your database and your UI entirely on
4:42 your own machine 100% offline. So when you think about when you typically want
4:46 to build an AI agent, you need a large language model, maybe like GPT-4.1 or Claude 4, and then you need something like your database, like Superbase, and you need a way to create a user interface. You have all these different
4:58 components for your agent and typically you're using APIs to access things that
5:03 are hosted on your behalf. But with local AI, we can take all of these
5:07 things completely in our own control running them ourselves. So this is
5:11 possible through open-source large language models and software. So
5:15 everything is running on your own hardware instead of you paying for APIs.
5:20 So we run the large language model ourself on our own machine instead of
5:25 paying for the OpenAI API for example. And so for large language models, there
5:29 are thousands of different open source large language models available for us
5:32 to use in a lot of different ways. And some of these you've probably heard of
5:37 before, like Deepseek R1, Quen 3, Mistral 3.1, Llama 4. These are just a
5:41 couple of examples of the most popular ones that you've probably heard of
5:44 before. We'll be tinkering around with using some of these in this master
5:48 class. And then we also have open-source software. So all of our infrastructure
5:53 that goes along with our agents and LLMs, things like Olama for running our
5:58 LLMs, Superbase for our database, N8N for our no-code/low-code workflow automations,
6:04 and open web UI to have a nice user interface to talk to our agents and
6:07 LLMs. And we'll dive into using all of these as well. Now, because local AI
6:12 means running large language models on our own computer, it's not as easy as just going to claude.ai or chatgpt.com and typing in a prompt. We have to actually install something, but it still is very easy to get started. So, let me
6:25 show you right now with a hands-on example. So, here we are within the
6:30 website for Olama. This is ollama.com. I'll have a link to this in the description of the video. This is one of the open-source platforms that allows
6:39 us to very easily download and run local large language models. And so, you just
6:42 have to go to their homepage here and click on this nice big download button.
6:45 You can install it for Windows, Mac or Linux. It really works for any operating
6:49 system. Then once you have it up and running on your machine, you can open up
6:53 any terminal. Like I'm on Windows here, so I'm in a PowerShell session and I can
6:58 run Olama commands now to do things like view the models that I have available on
7:03 my machine. I can download models and I can run them as well. And the way that I
7:08 know how to pull and run specific models is I can just go to this models tab in
7:12 their navigation and I can browse and filter through all of the open source
7:16 LLMs that are available to me like DeepSeek R1. Almost everyone is familiar
7:20 with DeepSeek. It just totally blew up back in February and March. We have
7:25 Gemma 3, Quen 3, Llama 4, a few of them that I mentioned earlier when we had the
7:30 presentation up. And so we can click into any one of these like I can go into
7:35 DeepSeek R1 for example and then I have the command right here that I can copy
7:39 to download and run this specific model in my terminal. And there are a lot of
7:44 different model variants of DeepSeek R1. So we'll get into different sizes and
7:47 hardware requirements and what that all means in a little bit, but I'll just
7:50 take one of them and run it as an example. So I'll just do a really small
7:54 one right now. I'll do a 1.5 billion parameter large language model. And
7:58 again, I'll explain what that means in a little bit. I can copy this command.
8:01 It's just ollama run and then the unique ID of this large language model. So I'll
8:06 go back into my terminal. I'll clear it here and then paste in this command. And
8:09 so first it's going to have to pull this large language model. And the total size
8:15 for this is 1.1 GB. And so it'll have to download it. And then because I used the
8:20 run command, it will immediately get me into a chat interface with the model
8:24 once it's downloaded. Also, if you don't want to run it right away and just want to install it, you can do Olama pull instead of Olama run. And then again, to view the models that you already have installed, you can just do the Olama list command like I did earlier.
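For reference, the handful of commands from this demo. The model tag is the small DeepSeek R1 variant shown on the model page; swap in whichever size fits your GPU.

```bash
ollama pull deepseek-r1:1.5b    # download the model without starting a chat
ollama run deepseek-r1:1.5b     # download if needed, then open a chat in the terminal
ollama list                     # show every model installed on this machine
# Inside the chat, Ctrl+D (Cmd+D on Mac) exits back to your shell.
```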
8:39 I'll pause and come back once it's installed in about 30 seconds. All
8:43 right, it is now installed. And now I can just send in a message like hello.
8:47 And then boom, we are now talking to a large language model. But instead of it
8:50 being hosted somewhere else and we're just using a website, this is running on
8:54 my own infrastructure, the large language model and all the billions of
8:59 parameters are getting loaded onto my graphics card and running the inference.
9:02 That's what it's called when we're generating a response from the LLM
9:06 directly within this terminal here. And so I can ask another question like um
9:13 what is the best GPU right now? We'll see what it says. So it's thinking
9:16 first. This is actually a thinking model; Deepseek R1 is a reasoning LLM. And then it gives us an answer: its top GPU models today are the 3080 and the RX 6700.
9:27 Obviously, we have a training cutoff for local large language models just like we
9:30 do with ones in the cloud like GPT. And so the information is a little outdated
9:34 here, but yeah, this is a good answer. So we have a large language model that
9:38 we're talking to directly on our machine. And then to close out of this, I can just do Ctrl+D (or Cmd+D on Mac). And if I do list, we have all the
9:47 other models that you saw earlier, plus now this one that I just installed. So
9:50 these are all available for me to run again just with that Olama run command.
9:54 And it won't have to reinstall if you already have it installed. Run just
9:58 installs it if you don't have it yet already. So that is just a quick demo of
10:03 using Olama. We'll dive a lot more into Olama later, like how we can actually
10:07 use it within our Python code and within our N8N workflows. This is just our
10:11 quick way to try it out within the terminal. Now, to really get into why we
10:15 should care about local AI now that we know what it is, I want to cover the
10:19 pros and cons of local AI and what I like to call cloud AI. That's just when
10:23 you're paying for things to be hosted for you, like using Claude or Gemini or
10:28 using the cloud version of N8N instead of hosting it yourself. And I also want
10:32 to cover the advantages of each because I don't want to sugarcoat things and
10:35 just hype up this master class by telling you that you should always use
10:39 local AI. That is certainly not the case. There is a time and place for both
10:43 of these categories here, but there are so many use cases where local AI is
10:50 absolutely crucial. You have no idea how many businesses I have talked to that
10:54 are willing to put tens of thousands of dollars into running their own LLMs and
10:58 infrastructure because privacy and security is so crucial for the things
11:02 that they're building with AI. And that actually gets into the first advantage
11:06 here of local AI, which is privacy and security. You can run things 100%
11:11 offline. The data that you're giving to your LLMs as prompts, it now doesn't
11:16 leave your hardware. It stays entirely within your own control. And for a lot
11:20 of businesses, that is 100% crucial, especially when they're in highly
11:24 regulated industries like the health industry, finance, uh even real estate.
11:28 Like there's so many use cases where you're working with intellectual
11:32 property or just really sensitive information. You don't want to be
11:36 sending your data off to an LLM provider like Google or OpenAI or Enthropic. And
11:41 so as a business owner, you should definitely be paying attention to this
11:45 if you are working with automation use cases where you're dealing with any kind
11:48 of sensitive data. And then also if you're a freelancer, you're starting an
11:52 AI automation agency, anything where you're building for other businesses,
11:55 you are going to have so many opportunities open up to you when you're
11:59 able to work with local AI because you can handle those use cases now where
12:03 they need to work with sensitive data and you can't just go and use the OpenAI
12:08 API. And that is the main advantage of local AI. It is a very big deal. But
12:11 there are a few other things that are worth focusing on as well. Starting with
12:16 model fine-tuning, you can take any open- source large language model and
12:20 add additional training on top with your own data. Basically making it a domain
12:24 expert on your business or the problem that you are solving. It's so so
12:29 powerful. You can make models through fine-tuning more powerful than the best
12:33 of the best in the cloud depending on what you are able to fine-tune with
12:38 depending on the data that you have. And you can do fine-tuning with some cloud
12:42 models like with GPT, but your options are pretty limited and it can be quite
12:46 expensive. And so it definitely is a huge advantage of local AI. And local AI in general can be very cost-effective, including the infrastructure as well. So
12:56 your LLMs and your infrastructure. You run it all yourself and you pay for
13:00 nothing besides the electricity bill if it's running on your computer at your
13:04 house or if you have some private server in the cloud. You just have to pay for
13:07 that server and that's it. There's no N8N bill, no Superbase bill, no OpenAI
13:13 bill. You can save a lot of money. It's really, really nice. And on top of that,
13:17 when everything is running on your own infrastructure, the agents that you
13:22 create can run on the same server, the same place as your infrastructure. And
13:26 so it can actually be faster because you don't have network delays calling APIs
13:30 for all your different services for your LLMs and your database and things like
13:34 that. And then with that, we can now get into the advantages of cloud AI.
13:38 Starting with it's a lot easier to set up. There's a reason why I have to have
13:42 this master class for you in the first place. There are some initial hurdles
13:47 that we have to jump over to really have everything fully set up for our local
13:51 LLMs and infrastructure. And you just don't have that with cloud AI because
13:54 you can very simply call into these APIs. You just have to sign up and get
13:58 an API key and that's about it. So, it certainly is easier to get up and
14:02 running and there's less maintenance overall because they are hosting things
14:05 for you. Superbase is hosting the database for you. OpenAI is hosting the
14:09 LLM for you. So, you don't have to manage things on your own hardware. With
14:13 Local AI, you have to apply patches and updates if you have a private server in
14:17 the cloud. You have to manage your own hardware if you're running on your own
14:20 computer, making sure that it's on 24/7, if you want your database on 24/7, that
14:24 kind of thing. It's just less maintenance with cloud AI. And then
14:28 probably the biggest advantage of cloud AI overall is that you have better
14:34 models available to you. Claude 4 sonnet or opus for example is more powerful
14:40 than any local AI that you could run. So we have this gap here, and this gap was a lot bigger at one point. Even a year ago, the best cloud LLMs absolutely crushed the best local LLMs, but that gap is starting to diminish. And so I really
14:55 see a future where that gap is diminished entirely and all the best
14:59 local LLMs are actually on par with the best cloud ones. That's the future I see. That's why I think that local AI is such a big deal: because the advantages of local AI, those are just going to get
15:12 more prevalent over time when businesses realize they really want private and
15:15 secure solutions. And then the advantages of cloud AI, I think those
15:19 are actually going to diminish over time. That's the key: minimal setup, less maintenance. Well, those advantages are going to go away as we have
15:28 platforms and better instructions and solutions to make the setup and
15:32 maintenance easier for local AI and we have the gap that's continuing to
15:35 diminish between the power of these LLMs. All these advantages are going to
15:39 actually go away and then it'll just completely make sense to use local AI
15:43 honestly probably for like every single solution in the future. That's really
15:47 what I see us heading towards. And then the last advantage to cloud AI which
15:51 also I think will go away over time is that you have some features out of the
15:55 box like you have memory that's built directly into chat GPT. Gemini has web
15:59 search baked in even when you use it through the API like these kind of
16:02 capabilities that are out of the box that you have to implement yourself with
16:07 local AI maybe as tools for your agent and you can definitely do that but it is
16:10 nice that these things are out of the box for cloud AI. So those are the pros
16:15 and cons between the two. I hope that this makes it very clear for you to pick
16:19 right now for your own use case. Should I implement local AI or cloud AI? A lot
16:24 of it comes down to the security and privacy requirements for your use case.
16:28 Now, the next big thing that we need to talk about for local AI is hardware
16:33 requirements. Because here's the thing: large language models are very resource-intensive. You can't just run any LLM on any computer. And the reason for that is
16:43 large language models are made up of billions or even trillions of numbers
16:47 called parameters. And they're all connected together in a web that looks
16:51 kind of like this. This is a very simplified view with just a few
16:54 parameters here. But each of the parameters are nodes and they're
16:58 connected together. The input layer is where our prompt comes in and our prompt
17:02 is fed through all these hidden layers and then we have the output at the end.
17:05 This is the response we get back from the LLM. But like I said, this is a very
17:10 simplified view. GPT4, for example, like you can see on the right hand side, is
17:15 estimated to have 1.4 trillion parameters. And so, if you want to fit
17:20 an entire large language model into your graphics card, you have to store all of
17:25 these numbers. And even though we can handle gigabytes at a time in our graphics cards through what is called VRAM, storing billions or trillions of numbers is absolutely insane. And so that's why, for large language models, you
17:37 actually have to have a pretty good graphics card if you want to run some of
17:42 the best ones. And so looking at Olama here, when we see these different sizes,
17:47 going back to their model list, like 1.5 billion parameters or 27 billion
17:51 parameters, there are different sizes for the local LLMs. Obviously, the larger a local LLM you are running, the more performance you are going to get, but you are going to be limited by what you are capable of running with
18:04 your graphics card or your hardware. So, with that in mind, I now want to dive
18:07 into the nitty-gritty details with you so you know exactly the kind of models
18:11 that you can run, the kind of speeds you can expect depending on your hardware.
18:15 And if you want to invest in new hardware to run local AI, I've got some
18:19 recommendations as well. So there are generally four primary size ranges for
18:25 large language models based on the speed and the power that you are looking for.
18:29 You have models that are around seven or 8 billion parameters. Those are
18:33 generally the smallest that I'd recommend trying to run. There are a lot
18:37 of smaller LLMs available like 1 billion parameters or three billion parameters,
18:41 but I'm so unimpressed when I use those LLMs that I don't really want to focus
18:45 on them here. 7 billion parameters is still tiny compared to the large cloud AI models like Claude or GPT, but you can get pretty good results with them
18:55 for just simple chat use cases. And so for these models, assuming a Q4
18:59 quantization, which I'll get into quantization in a little bit, it's
19:02 basically just a way to make the LLM a lot smaller without hurting performance
19:06 that much, a 7 billion parameter model will need about four to 5 GB of VRAM on
19:11 your graphics card. And so if you have something like a 3060 Ti from Nvidia
19:16 with 8 GB of VRAM, you can very comfortably run a 7 billion parameter
19:20 model and you can expect to get very roughly around 25 to 35 tokens per
19:27 second. A token is roughly equivalent to a word. And so your local large language
19:31 model at 7 billion parameters with this graphics card will get about 25 to 35
19:37 words per second out on the screen to you being streamed out. And then if you
19:42 use much more powerful hardware like a 3090 to run a 7 billion parameter model
19:47 then you'll just jack up the speed a lot more. So that's 7 billion or 8 billion
19:51 parameters. Another very common size is something around 14 billion parameters.
19:57 This will take about 8 to 10 GB of VRAM. And so just a couple of options for
20:01 this. You have the 4070Ti which is usually 16 GB of VRAM or you could go as
20:08 low as 12 GB of VRAM with the 3080 Ti. And you could expect to get about 15 to
20:13 25 words per second. And then this is where you start to get into basic tool
20:17 calling. So I find that when you are building with a 7 billion parameter
20:21 model, they don't do tool calling very well. So you can't really build that
20:25 powerful of agents around a 7 billion parameter model. But once you get to
20:29 something around 14 billion parameters, that's when I see agents being able to
20:33 really accept instructions well around tools and system prompts and leveraging
20:37 tools to do things on our behalf. That's when we can really start to use LLMs to
20:42 make things that are agentic. And then the next big category of LLMs
20:47 is somewhere between 30 and 34 billion parameters. You see a lot of LLMs that fall in that size range. This will typically need 16 to 20 GB of VRAM.
20:57 And so a 3090 is a really good example of a graphics card that can run this. It
21:03 has 24 GB of VRAM. I actually have two 3090s myself. And I'll have a link to my
21:08 exact PC that I built for running local AI in the description of this video. So
21:12 I have two 3090s, which we'll need in a second for a 70 billion parameter, but
21:16 one is enough for a 32 billion parameter model. And then also Macs with their new
21:22 M4 chips are very powerful with their unified memory architecture. So if you
21:27 get a Mac M4 Pro with 24 GB of unified memory, you can also run 32 billion
21:32 parameter models. Now the speed isn't going to be the best necessarily, and
21:36 again, this does depend a lot on your computer overall, but you can expect
21:40 something around 10 to 15, maybe up to 20 tokens per second. and 32 billion
21:46 parameters is when you really start to see LLMs that are actually pretty
21:50 impressive. Like 7 billion and 14 billion, they are disappointing quite a
21:54 bit. I'll be totally honest. Especially when you try to use them with more
21:58 complicated agentic tasks. 32 billion, when you start to get into this range, is when I'm actually genuinely impressed. I'm like, "Oh, this is actually pretty
22:05 close to the performance of some of the best cloud AI." And then 70 billion
22:10 parameters. This is going to take about 35 to 40 GB of VRAM. For most consumer GPUs like 3090s and 4090s, even 5090s, that's not actually enough VRAM. And so this is when you have to start to split a large language model across multiple GPUs, which solutions like Olama will actually help you do right out of the box. So it's not this insane setup, even though it might feel kind of daunting (like, oh, I have to split the layers of my LLM between GPUs), it's not actually that complicated. And so two 3090s or two 4090s will be necessary. Or you could get an enterprise-grade GPU like an H100; Nvidia has a lot of these non-consumer GPUs that have a lot more VRAM to handle things like 70 billion parameter models. And the speed won't be the best if you're using something like two 3090s, especially because performance is hurt when you have to split an LLM between GPUs. You could expect something like 8 to 12 tokens per second. And this is obviously for the most complex agents: if you're really trying to match the performance of cloud AI as much as possible, that's when you'd want to use a 70 billion parameter model. And
23:16 then if you're investing in hardware to run local AI, I have a couple of quick
23:20 recommendations here. And a lot of this depends on the size of the model that's
23:24 going to be good enough for your use case. And so I'll dive into some
23:28 alternatives for running local AI directly if you want to do testing
23:32 before you buy infrastructure. I'll get into that in a little bit, but
23:35 recommended builds. If you want to spend around $800 to build a PC, I'd recommend
23:41 getting a 4060Ti graphics card and then 32 GB of RAM. If you want to spend
23:47 $2,000, I'd recommend either getting a PC with a 3090 and 64 GB of RAM or
23:53 getting that Mac M4 Pro with 24 GB of unified memory. And then lastly, if you
23:58 want to spend $4,000, which is about what I spent for my PC, then I'd
24:02 recommend getting two 3090 graphics cards, and I got both of mine used for
24:07 around $700 each. Um, and then also getting 128 GB of RAM, or you can get a
24:15 Mac M4 Max with 64 GB of unified memory. So, I wanted to really get into the
24:19 nitty-gritty details there. So, I know I spent a good amount of time diving into
24:22 super specific numbers, but I hope this is really helpful for you. No matter the
24:26 large language model or your hardware, you now know generally where you're at
24:30 for what you can run. So, to go along with that information overload, I want
24:34 to give you some specifics, individual LLMs that you can try right now based on
24:38 the size range that you know will work for your hardware. So, just a couple of
24:42 recommendations here. The first one that I want to focus on is Deepseek R1. This
24:47 is the most popular local LLM ever. It completely blew up a few months ago. And
24:52 the best part about DeepSeek R1 is they have an option that fits into each of
24:56 the size ranges that I just covered in that chart. So they have a 7 billion
25:00 parameter, 14, 32, and 70. The exact numbers that I mentioned earlier. And
25:05 then there is also the full real version of R1, which is 671 billion parameters.
25:10 I'm sorry though, you probably don't have the hardware to run that unless
25:13 you're spending tens of thousands on your infrastructure. So, probably stick
25:16 with one of these based on your graphics card or if you have a Mac computer, pick
25:19 the one that'll work for you and just try it out. You can click on any one of
25:23 these sizes here. And then here's your command to download and run it. And this
25:28 is defaulting to a Q4 quantization, which is what I was assuming in the
25:31 chart earlier. And again, I will cover what that actually means in a little bit
25:35 here. The other one that I want to focus on here is Quen 3. This is a lot newer.
25:41 Quen 3 is so good. And they don't have a 70 billion parameter option, but they do
25:45 have all the other um sizes that fit into those ranges that I mentioned
25:49 earlier. Like they got 8 billion, 14 billion, and 32 billion parameters. And
25:52 the same kind of deal where you click on the size that you want and you've got
25:55 your command to install it here. And this is a reasoning LLM just like
26:01 DeepSeek R1. And then the other one that I want to mention here is Mistral Small.
26:05 I've had really good results with this as well. There are less options here,
26:08 but you've got 22 or 24 billion parameters, which is going to work well
26:12 with a 3090 graphics card or if you have a Mac M4 Pro with 24 GB of unified
26:18 memory. Really, really good model. And then also, there is a version of it that
26:22 is fine-tuned specifically for coding, called Devstral, which is another really cool LLM worth checking out as well if you have the hardware to run it.
26:30 So, that is everything for general recommendations for local LLMs to try right now. This is the part of the master class that is going to become outdated the fastest, because there are new local LLMs coming out every single month, and I don't really know how long my recommendations will last. But in general, you can just go to the model list in Olama, find one that has a size that works with your graphics card, and just give it
26:52 a shot. You can install it and run it very easily with Olama. And the other
26:57 thing that I want to mention here is you don't always have to run open- source
27:01 large language models yourself. You can use a platform like Open Router. You can
27:05 just go to openrouter.ai, sign up, and add in some API credits. You can try these open-source LLMs yourself, maybe if you want to see what's powerful enough for
27:15 your agents before you invest in hardware to actually run them yourself.
27:18 And so within Open Router, I can just search for Quen here. And I can go down
27:23 to Quen and I can go to 32 billion. They have a free offering as well that
27:26 doesn't have the best rate limits. So I'll just go to this one right here,
27:31 Quen 3 32B. So I can try the model out through open router. They actually host
27:35 it for me. So it's an open- source non-local version, but now I can try it
27:39 in my agents to see if this is good. And then if it's good, it's like, okay, now
27:43 I want to buy a 3090 graphics card so that I can install it directly through
27:47 Olama instead. And so the 32 billion Quen 3 is exactly what we're seeing here in Open Router. And there are other platforms like Groq as well where you can run these open-source large language models not on your own infrastructure, if you just want to do some testing beforehand or whatever that might be. So I wanted to call that out as an alternative as well. But yeah, that's
28:04 everything for my general recommendations for LLMs to try and use
28:09 in your agents. All right, it is time to take a quick breather. This is
28:12 everything that we've covered already in our master class. What is local AI? Why
28:17 we care about it? Why it's the future and hardware requirements. And I really
28:20 wanted to dive deep into this stuff because it sets the stage for everything
28:24 that we do when we actually build agents and deploy our infrastructure. And so
28:28 the last thing that I want to do with you before we really start to get into
28:33 building agents and setting up our package is I want to talk about some of
28:37 the tricky stuff that is usually pretty daunting for anyone getting into local
28:42 AI. I'm talking things like offloading models, quantization, environment
28:47 variables to handle things like uh flash attention, all the stuff that is really
28:51 important that I want to break down simply for you so you can feel confident
28:55 that you have everything set up right, that you know what goes into using local
29:00 LLMs. The first big concept to focus on here is quantization. And this is
29:04 crucial. It's how we can make large language models a lot smaller so they
29:10 can fit on our GPUs without hurting performance too much. We are lowering the model precision here. And basically what that means is: each of our parameters, all of the numbers in our LLMs, is 16 bits at full size, but we can lower the precision of each of those parameters to 8, 4, or 2 bits. Don't worry if you don't understand the technicalities of
29:31 that. Basically, it comes down to LLMs are just billions of numbers. That's the
29:35 parameters that we already covered. And we can make these numbers less precise
29:40 or smaller without losing much performance. So, we can fit larger LLMs
29:45 within a GPU that normally wouldn't even be close to running the full-size model.
29:50 Like with 32 billion parameter LLMs, for example, I was assuming a Q4
29:55 quantization like four bit per parameter in that diagram earlier. If you had the
30:00 full 16 bit parameter for the 32 billion parameter LLM, there's no way it could
30:06 fit on your Mac or your 3090 GPU, but we can use quantization to make it
30:10 possible. It's like rounding a number with a long decimal to something like 10.44 instead of a value with ten decimal places, but we're doing it for each of the billions of parameters, those numbers that we have.
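As a back-of-the-envelope check on those sizes, you can estimate a model's weight footprint as parameters × bits-per-parameter ÷ 8, then add a gigabyte or two of headroom for context and overhead. These one-liners reproduce the numbers used throughout this section.

```bash
# Approximate weight size in GB = parameters (in billions) * bits per parameter / 8
echo "scale=1; 7   * 4  / 8" | bc   # 7B  at Q4    -> ~3.5 GB (the 4-5 GB figure, with overhead)
echo "scale=1; 32  * 4  / 8" | bc   # 32B at Q4    -> ~16 GB  (fits a 24 GB 3090)
echo "scale=1; 32  * 16 / 8" | bc   # 32B at FP16  -> ~64 GB  (why the full-precision model won't fit)
echo "scale=1; 235 * 16 / 8" | bc   # 235B at FP16 -> ~470 GB (the Quen 3 number coming up)
```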
30:23 And so just to give you a visual representation of this, you can also
30:27 quantize images just like you can quantize LLMs. And so we have our full-scale image on the left-hand side here, comparing it to different levels of quantization: 16-bit, 8-bit, and 4-bit. And you can see that at first, with a 16-bit quantization, it almost looks the same. But then once we go down to 4-bit, you can very much see that we have a huge loss in quality for the image. Now, with images it's more extreme than with LLMs: when we do an 8-bit or 4-bit
30:54 quantization, we don't actually lose that much performance like we lose a lot
30:58 of quality with images. And so that's why it's so useful for us. And so I have
31:01 a table just to describe what this looks like. So FP16, that's the 16-bit precision that all LLMs have as a base. That is the full size. The speed is obviously going to be very slow because the model is a lot bigger, but your quality is perfect compared to what it could be. I mean, obviously that doesn't mean you're going to get perfect answers all the time; I'm just saying it's the 100% result from this LLM. And then going down to a Q8
31:28 precision, so it's half the size. The speed is going to be a lot better. And
31:33 the quality is nearperfect. So it's not like performance is cut in half just
31:37 because size is. You still have the same number of parameters. Each one is just a
31:42 bit less precise. And so you're still going to get almost the same results.
31:47 And then going down to a Q4, 4-bit, it's a fourth of the size. It's going to be very fast compared to 16-bit, and the quality is still going to be great. Now, these numbers are vague on purpose; there's not really a way for me to quantify the exact difference, especially because it changes per LLM and your hardware and everything like that. So, I'm just being very general here. And then once you get to Q2, the size goes down a lot. It's going to be very, very fast, but usually your performance starts to go down quite a bit once you go down to a Q2. And then, like the note that I have in the bottom
32:22 left here, a Q4 quantization is generally the best balance. And so when
32:26 you are thinking to yourself, which large language model should I run? What
32:31 size should I use? My rule of thumb is to pick the largest large language model
32:37 that can work with your hardware with a Q4 quantization. That is why I assumed
32:42 that in the table earlier. And then also like we saw in Olama earlier, it always
32:47 defaults to a Q4 quantization because the 16 bit is just so big compared to Q4
32:52 that most of the LLMs you couldn't even run yourself. And a Q4 of a 32 billion
32:59 parameter model is still going to be a lot more powerful than the full 7
33:02 billion parameter or 14 billion parameter because you don't actually
33:07 lose that much performance. So that is quantization. So just to make this very
33:11 practical for you, I'm back here in the model list for Quen 3. We have all these
33:15 models that don't specify a quantization, but we can see that it
33:20 defaults to Q4 because if I click on any one of them, the quantization right here
33:26 is a Q4 KM. And don't worry about the KM. That's just a way to group
33:30 parameters. You have KS, KM, and KL. It's kind of outside of the scope of
33:33 what really matters for you. The big thing is the Q4 like the actual number
33:38 here. So Q4 quantization is the default for Quen 3 32B and really any model in
33:44 Olama. But if we want to see the other quantized variants and we want to run
33:48 them, you can click on the view all. This is available no matter the LLM that
33:52 you're seeing in Olama. Now we can scroll through and see all the levels of
33:56 quantization for each of the parameter sizes for Quen 3. So, if I scroll all the way down, the absolute biggest version of Quen 3 that I can run is the full 16-bit of the 235 billion parameter Quen 3. And it is a whopping 470 GB just to install this. And there is no way that you're ever going to lay hands on
34:17 infrastructure to run this unless you're working for a very large enterprise. But
34:22 I can go down here, let's say, to 14 billion parameters and I can run the Q4
34:27 like this. So, you can click on any one that you want to run. Like let's say I
34:30 want to run Q8. I can click on this and then I have the command to pull and run
34:35 this specific quantization of the 14 billion parameter model. So each of the
34:39 quantized variants they have a unique ID within Olama. So you can very
34:42 specifically choose the one that you want. Again my general recommendation is
34:47 just to go with also what Olama recommends which is just defaulting to
34:51 Q4. Like if I go to DeepSeek R1, you can see that also defaults to Q4 no matter
34:56 the size that I pick. But if you do want to explore different quantizations, you
35:00 want to try to run the absolute full model for maybe something smaller like 7 billion or 14 billion, you can definitely do that through Olama and
35:08 really any other provider of local LLMs. So that is everything for quantization.
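To make that concrete at the command line: the default tag gives you the Q4 build, while the explicit quantization tags let you pick a different one. The exact tag strings below are illustrative; copy the real ones from the model page's "view all" list.

```bash
ollama pull qwen3:14b          # default tag: the Q4 quantization of the 14B model
ollama pull qwen3:14b-q8_0     # explicit quantization tag (illustrative -- copy it from the model page)
ollama list                    # compare the installed sizes side by side
```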
35:12 It's important to know how that works, but yes, generally stick with a Q4 of
35:17 the largest LLM that you can run. The next concept that is very important to understand is offloading. All offloading is, is splitting the layers of your
35:26 large language model between your GPU and your CPU and RAM. It's kind of
35:30 crazy, but large language models don't have to fit entirely in your GPU. All
35:36 large language models can be split into layers, layers of the different weights,
35:40 and you can have some of it running on your GPU, stored in your VRAM and computed by the GPU, and some of the large language model stored in your RAM and computed by the CPU. Now, this does hurt performance a lot. And so,
35:55 generally, you want to avoid offloading if you can. You want to be able to fit
35:59 everything in your GPU, which by the way, the context, like your prompts for
36:04 your local LLMs, that is also stored in VRAM. And so, sometimes you'll see what happens when you have very long conversations with a large language model that barely fits in your GPU: that'll actually tip it over the edge. So, it
36:16 starts to offload some of it to the CPU and RAM. So keep that in mind when you
36:19 have longer conversations and all of a sudden things get really slow, you know
36:24 that offloading is happening. Sometimes this is necessary though as context
36:28 grows. And if you're only offloading a little bit of the LLM or a little bit of
36:32 the conversation, whatever to the CPU and RAM, it won't affect performance
36:36 that much. And so sometimes if you're trying to squeeze the biggest size you
36:41 can into your machine for an LLM, you can take advantage of offloading to run
36:45 something bigger or have a much larger conversation. Just know that usually it
36:49 kind of sucks. Like when I have offloading start to happen, my machine
36:53 gets bogged down and the responses are a lot slower. It's really not fun, but it
36:59 is possible. And fun fact, by the way, if your GPU is full and your CPU and RAM
37:04 is full, you can actually offload to storage, like literally using your hard
37:07 drive or SSD. That's when it's incredibly slow and just terrible. But just as a fun fact, you can actually do that.
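A practical way to check whether offloading is actually happening, rather than guessing from the slowdown: Olama can report how a loaded model is split between GPU and CPU.

```bash
# While a model is loaded (e.g. mid-conversation), ask Olama where it's running:
ollama ps
# The PROCESSOR column reads "100% GPU" when everything fits in VRAM, or a split
# like "30%/70% CPU/GPU" once layers have been offloaded to system RAM.
```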
37:15 Now, the very last thing that I want to cover before we dive into some code, setting up the local AI package, and building out some agents, is a few very crucial parameters: environment variables for Olama. So, these are environment variables that you can set
37:28 on your machine just like any other based on your operating system. And
37:32 Olama does have an FAQ for setting up some of these things, which I'll link to
37:36 in the description as well. But yeah, these are a bit more technical, so
37:40 people skip past setting this stuff up a lot, but it's actually really, really
37:44 important to make things very efficient when running local LLMs. So the first
37:49 environment variable is flash attention. You want to set this to one or true.
37:54 When you have this set to true, it's going to make the attention calculation
37:59 a lot more efficient. It sounds fancy, but basically large language models when
38:04 they are generating a response, they have to calculate which parts of your
38:08 prompt to pay the most attention to. That's the calculation. And you can make
38:12 it a lot more efficient without losing much performance at all by setting up
38:16 the flash attention, setting that to true. And then for another optimization,
38:21 just like we can quantize the LLM itself, you can also quantize or
38:27 compress the context. So your system prompt, the tool descriptions, your
38:31 prompt and conversation history, all that context that's being sent to your
38:36 LLM, you can quantize that as well. So Q4 is my general recommendation for
38:41 quantizing LLMs. Q8 is the general recommendation for quantizing the
38:46 context memory. It's a very simplified explanation, but it's really, really
38:50 useful because a long conversation can also take a lot of VRAM just like larger
38:55 LLM. And so it's good to compress that. And then the third environment variable,
38:58 this is actually probably the most crucial one to set up for Olama. There
39:02 is this crazy thing. I don't know why Olama does it, but by default, they
39:07 limit every single large language model to 2,000 tokens for the context limit,
39:13 which is just tiny compared to, you know, Gemini being 1 million tokens and
39:17 Claude being 200,000 tokens. Like, they handle very, very large prompts. And a
39:21 lot of local large language models can also handle large prompts. But Olama will limit you by default to 2,000 tokens. And so you have to override that
39:30 yourself with this environment variable. And so generally I recommend starting
39:34 with about 8,000 tokens to start. You can move this all the way up to
39:38 something like 32,000 tokens if your local large language model supports
39:42 that. And if you view the model page on Olama, you can see the context length that's supported by the LLM. But you definitely want to, you know, jack this
39:50 up more from just 2,000 because a lot of times when you have longer
39:53 conversations, you're going to get past 2,000 tokens very, very quickly. So, do
39:57 not miss this. If your large language model is starting to go completely off
40:02 the rails and ignore your system prompt and forget that it has these tools that
40:06 you gave it, it's probably because you reached the context length. And so, just
40:10 keep that in mind. I see people miss this a lot. And then the very last
40:14 environment variable, uh, probably the least important out of all these four,
40:18 but if you're running a lot of different large language models at once and you're
40:22 trying to shove them all in your GPU, a lot of times you can have issues. And so
40:25 in Olama, you can limit the number of models that are allowed to be in your
40:29 memory at a single time. With this one, typically you want to set this to either
40:33 one or two. Definitely set this to just one if you are using large language models that basically fill your GPU, like it's going to fit exactly into your VRAM and you're not going to have room for another large language model.
40:44 But if you are running more smaller ones and maybe you could actually fit two on
40:48 your GPU with the VRAM that you have, you can set this to two. So again, this is more technical overall, but it's very important to have these right. And we'll get into the local AI package, where I already have these set up in the configuration.
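Here are the four settings from this section sketched as Olama environment variables. The variable names reflect recent Olama releases and are worth double-checking against the Olama FAQ for your version; set them however your OS expects (systemd override, Docker environment section, and so on).

```bash
# Hedged sketch -- confirm the exact variable names in the Olama FAQ for your version
export OLLAMA_FLASH_ATTENTION=1      # cheaper attention calculation with minimal quality loss
export OLLAMA_KV_CACHE_TYPE=q8_0     # quantize the context (KV cache) to 8-bit
export OLLAMA_CONTEXT_LENGTH=8192    # raise the ~2,000-token default context limit (8k to start)
export OLLAMA_MAX_LOADED_MODELS=1    # how many models may sit in memory at once (1 or 2)
```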
41:02 And then, by the way, this is the Olama FAQ that I referenced a minute ago, which I'll have linked in the description. And so there's actually a lot of good things to read into here, like being able to verify that your GPU
41:10 is compatible with Olama. How can you tell if the model's actually loaded on
41:13 your GPU? So, a lot of like sanity check things that they walk you through in the
41:17 FAQ as well. Also talking about environment variables, which I just
41:20 covered. And so, they've got some instructions here depending on your OS
41:23 how to get those set up. So, if there's anything that's confusing to you, this
41:26 is a very good resource to start with. So, I'm trying to make it possible for
41:30 you to look into things further if there's anything that doesn't quite make
41:33 sense for what I explained here. And of course, always let me know in the
41:35 comments if you have any questions on this stuff as well, especially the more
41:39 technical stuff that I just got to cover because it's so important even though I
41:43 know we really want to dive into the meat of things, which we are actually
41:47 going to do now. All right, here is everything that we have covered at this
41:50 point. And congratulations if you have made it this far because I covered all
41:55 the tricky stuff with quantization and the hardware requirements and offloading
41:59 and some of our little configuration and parameters. So, if you got all of that,
42:03 the rest of it is going to be a walk in the park as we start to dive into code,
42:07 getting all of our local AI set up and building out some agents. You understand
42:10 the foundation now that we're going to build on top of to make some cool stuff.
42:15 And so, now the next thing that we're going to do is talk about how we can use
42:19 local AI anywhere. We're going to dive into OpenAI compatibility and I'll show
42:23 you an example. We can take something that is using OpenAI right now,
42:27 transform it into something that is using OAMA and local LLM. So, we'll
42:31 actually dive into some code here. And I've got my fair share of no code stuff
42:35 in this master class as well, but I want to focus on both because I think it's
42:38 really important to use both code and no code whenever applicable. And that
42:42 applies to local AI just like building agents in general. So, I've already
42:45 promised a couple of times that I would dive into OpenAI API compatibility, what
42:50 it is, and why it's so important. And we're going to dive into this now so you
42:54 can really start to see how you can take existing agents and transform them into
42:59 being 100% local with local large language models without really having to
43:03 touch the code or your workflow at all. It is a beautiful thing because OpenAI
43:10 has created a standard for exposing large language models through an API.
43:14 It's called the chat completions API. It's kind of like how model context
43:19 protocol MCP is a standard for connecting agents to tools. The chat
43:23 completions API is a standard for exposing large language models over an
43:28 API. So you have this common endpoint along with a few other ones that all of
43:35 these providers implement. This is the way to access the large language model
43:39 to get a response based on some conversation history that you pass in.
43:43 So, Olama is implementing this as of February. We have other providers like
43:49 Gemini is OpenAI compatible. Uh, Grock is Open Router, which we saw earlier.
43:53 Almost every single provider is OpenAI API compatible. And so, not only is it
43:57 very easy to swap between large language models within a specific provider, it's
44:02 also very easy to swap between providers entirely. You can go from Gemini to
44:09 OpenAI or OpenAI to O Lama or OpenAI to Grock just with changing basically one
44:13 piece of configuration pointing to a different base URL as it is called. So
44:18 you can access that provider and then the actual API endpoint that you hit
44:22 once you are connected to that specific provider is always the exact same and
44:26 the response that you get back is also always the exact same. And so Olama has
44:31 this implemented now. And I'll link to this article in the description as well
44:33 if you want to read through this because they have a really neat Python example.
44:37 It shows where we create an OpenAI client and the only thing we have to do
44:42 to connect to Olama instead of OpenAI is change this base URL. So now we are
44:47 pointing to Olama that is hosted locally instead of pointing to the URL for
44:51 OpenAI. So we'd reach out to them over the internet and talk to their LLMs. And
44:55 then with Olama, you don't actually need an API key because everything's running
44:58 locally. So you just need some placeholder value here. But there is no
45:02 authentication that is going on. You can set that up. I'm not going to dive into
45:05 that right now. But by default, because it's all just running locally, you don't
45:09 even need an API key to connect to Olama. And then once we have our OpenAI
45:13 client set up that is actually talking to Olama, not OpenAI, we can use it in
45:18 exactly the same way. But now we can specify a model that we have downloaded
45:22 locally already through Lama. We pass in our conversation history in the same way
45:27 and we access the response like the content the AI produced the token usage
45:31 like all those things that we get back from the response in the same way.
45:34 They've got a JavaScript example as well. They have a couple of examples
45:38 using different frameworks like the Versell AI SDK and Autogen. Really any
45:44 AI agent framework can work with OpenAI API compatibility to make it very easy
45:47 to swap between these different providers. like Pyantic AI, my favorite
45:52 AI agent framework, also supports OpenAI API compatibility. So you can easily
45:57 within your Pantic AI agents swap between these different providers. And
4:28 is all about. Let's start by laying the foundation. What is local AI in the
4:32 first place? Well, very simply put, local AI is running your own large
4:36 language models and infrastructure like your database and your UI entirely on
4:42 your own machine 100% offline. So when you think about when you typically want
4:46 to build an AI agent, you need a large language model, maybe like GPT-4.1
4:51 or Claude 4, and then you need something like your database, like Supabase, and
4:55 you need a way to create a user interface. You have all these different
4:58 components for your agent and typically you're using APIs to access things that
5:03 are hosted on your behalf. But with local AI, we can take all of these
5:07 things completely in our own control running them ourselves. So this is
5:11 possible through open-source large language models and software. So
5:15 everything is running on your own hardware instead of you paying for APIs.
5:20 So we run the large language model ourself on our own machine instead of
5:25 paying for the OpenAI API for example. And so for large language models, there
5:29 are thousands of different open source large language models available for us
5:32 to use in a lot of different ways. And some of these you've probably heard of
5:37 before, like DeepSeek R1, Qwen 3, Mistral Small 3.1, and Llama 4. These are just a
5:41 couple of examples of the most popular ones that you've probably heard of
5:44 before. We'll be tinkering around with using some of these in this master
5:48 class. And then we also have open-source software. So all of our infrastructure
5:53 that goes along with our agents and LLMs, things like Ollama for running our
5:58 LLMs, Supabase for our database, N8N for our no-code/low-code workflow automations,
6:04 and Open WebUI to have a nice user interface to talk to our agents and
6:07 LLMs. And we'll dive into using all of these as well. Now, because local AI
6:12 means running large language models on our own computer, it's not as easy as
6:18 just going to claude.ai or chatgpt.com and typing in a prompt. We have to
6:22 actually install something, but it still is very easy to get started. So, let me
6:25 show you right now with a hands-on example. So, here we are within the
6:30 website for Ollama. This is just ollama.com. I'll have a link to this in the
6:33 description of the video. This is one of the open-source platforms that allows
6:39 us to very easily download and run local large language models. And so, you just
6:42 have to go to their homepage here and click on this nice big download button.
6:45 You can install it for Windows, Mac or Linux. It really works for any operating
6:49 system. Then once you have it up and running on your machine, you can open up
6:53 any terminal. Like I'm on Windows here, so I'm in a PowerShell session and I can
6:58 run Ollama commands now to do things like view the models that I have available on
7:03 my machine. I can download models and I can run them as well. And the way that I
7:08 know how to pull and run specific models is I can just go to this models tab in
7:12 their navigation and I can browse and filter through all of the open source
7:16 LLMs that are available to me like DeepSeek R1. Almost everyone is familiar
7:20 with DeepSeek. It just totally blew up back in February and March. We have
7:25 Gemma 3, Qwen 3, Llama 4, a few of them that I mentioned earlier when we had the
7:30 presentation up. And so we can click into any one of these like I can go into
7:35 DeepSeek R1 for example and then I have the command right here that I can copy
7:39 to download and run this specific model in my terminal. And there are a lot of
7:44 different model variants of DeepSeek R1. So we'll get into different sizes and
7:47 hardware requirements and what that all means in a little bit, but I'll just
7:50 take one of them and run it as an example. So I'll just do a really small
7:54 one right now. I'll do a 1.5 billion parameter large language model. And
7:58 again, I'll explain what that means in a little bit. I can copy this command.
8:01 It's just ollama run and then the unique ID of this large language model. So I'll
8:06 go back into my terminal. I'll clear it here and then paste in this command. And
8:09 so first it's going to have to pull this large language model. And the total size
8:15 for this is 1.1 GB. And so it'll have to download it. And then because I used the
8:20 run command, it will immediately get me into a chat interface with the model
8:24 once it's downloaded. Also, if you don't want to run it right away, you
8:27 just want to install it, you can do ollama pull instead of ollama run. And
8:32 then again, to view the models that you already have installed,
8:36 you can just do the ollama list command like I did earlier. And so, right now,
8:39 I'll pause and come back once it's installed in about 30 seconds. All
8:43 right, it is now installed. And now I can just send in a message like hello.
8:47 And then boom, we are now talking to a large language model. But instead of it
8:50 being hosted somewhere else and we're just using a website, this is running on
8:54 my own infrastructure, the large language model and all the billions of
8:59 parameters are getting loaded onto my graphics card and running the inference.
9:02 That's what it's called when we're generating a response from the LLM
9:06 directly within this terminal here. And so I can ask another question like um
9:13 what is the best GPU right now? We'll see what it says. So it's thinking
9:16 first. This is actually a thinking model. Deepseek R1 is a reasoning LLM.
9:20 And then it gives us an answer: its top GPU models today, the 3080, the RX 6700.
9:27 Obviously, we have a training cutoff for local large language models just like we
9:30 do with ones in the cloud like GPT. And so the information is a little outdated
9:34 here, but yeah, this is a good answer. So we have a large language model that
9:38 we're talking to directly on our machine. And then to close out of this,
9:42 I can just do control D or command D on Mac. And if I do list, we have all the
9:47 other models that you saw earlier, plus now this one that I just installed. So
9:50 these are all available for me to run again just with that ollama run command.
9:54 And it won't have to reinstall if you already have it installed. Run just
9:58 installs it if you don't have it yet already. So that is just a quick demo of
10:03 using Ollama. We'll dive a lot more into Ollama later, like how we can actually
10:07 use it within our Python code and within our N8N workflows. This is just our
10:11 quick way to try it out within the terminal. Now, to really get into why we
10:15 should care about local AI now that we know what it is, I want to cover the
10:19 pros and cons of local AI and what I like to call cloud AI. That's just when
10:23 you're paying for things to be hosted for you, like using Claude or Gemini or
10:28 using the cloud version of N8N instead of hosting it yourself. And I also want
10:32 to cover the advantages of each because I don't want to sugarcoat things and
10:35 just hype up this master class by telling you that you should always use
10:39 local AI. That is certainly not the case. There is a time and place for both
10:43 of these categories here, but there are so many use cases where local AI is
10:50 absolutely crucial. You have no idea how many businesses I have talked to that
10:54 are willing to put tens of thousands of dollars into running their own LLMs and
10:58 infrastructure because privacy and security is so crucial for the things
11:02 that they're building with AI. And that actually gets into the first advantage
11:06 here of local AI, which is privacy and security. You can run things 100%
11:11 offline. The data that you're giving to your LLMs as prompts, it now doesn't
11:16 leave your hardware. It stays entirely within your own control. And for a lot
11:20 of businesses, that is 100% crucial, especially when they're in highly
11:24 regulated industries like the health industry, finance, uh even real estate.
11:28 Like there's so many use cases where you're working with intellectual
11:32 property or just really sensitive information. You don't want to be
11:36 sending your data off to an LLM provider like Google or OpenAI or Anthropic. And
11:41 so as a business owner, you should definitely be paying attention to this
11:45 if you are working with automation use cases where you're dealing with any kind
11:48 of sensitive data. And then also if you're a freelancer, you're starting an
11:52 AI automation agency, anything where you're building for other businesses,
11:55 you are going to have so many opportunities open up to you when you're
11:59 able to work with local AI because you can handle those use cases now where
12:03 they need to work with sensitive data and you can't just go and use the OpenAI
12:08 API. And that is the main advantage of local AI. It is a very big deal. But
12:11 there are a few other things that are worth focusing on as well. Starting with
12:16 model fine-tuning, you can take any open- source large language model and
12:20 add additional training on top with your own data. Basically making it a domain
12:24 expert on your business or the problem that you are solving. It's so so
12:29 powerful. You can make models through fine-tuning more powerful than the best
12:33 of the best in the cloud depending on what you are able to fine-tune with
12:38 depending on the data that you have. And you can do fine-tuning with some cloud
12:42 models like with GPT, but your options are pretty limited and it can be quite
12:46 expensive. And so it definitely is a huge advantage to local AI. And local AI
12:52 in general can be very cost-effective, including the infrastructure as well. So
12:56 your LLMs and your infrastructure. You run it all yourself and you pay for
13:00 nothing besides the electricity bill if it's running on your computer at your
13:04 house or if you have some private server in the cloud. You just have to pay for
13:07 that server and that's it. There's no N8N bill, no Supabase bill, no OpenAI
13:13 bill. You can save a lot of money. It's really, really nice. And on top of that,
13:17 when everything is running on your own infrastructure, the agents that you
13:22 create can run on the same server, the same place as your infrastructure. And
13:26 so it can actually be faster because you don't have network delays calling APIs
13:30 for all your different services for your LLMs and your database and things like
13:34 that. And then with that, we can now get into the advantages of cloud AI.
13:38 Starting with it's a lot easier to set up. There's a reason why I have to have
13:42 this master class for you in the first place. There are some initial hurdles
13:47 that we have to jump over to really have everything fully set up for our local
13:51 LLMs and infrastructure. And you just don't have that with cloud AI because
13:54 you can very simply call into these APIs. You just have to sign up and get
13:58 an API key and that's about it. So, it certainly is easier to get up and
14:02 running and there's less maintenance overall because they are hosting things
14:05 for you. Supabase is hosting the database for you. OpenAI is hosting the
14:09 LLM for you. So, you don't have to manage things on your own hardware. With
14:13 Local AI, you have to apply patches and updates if you have a private server in
14:17 the cloud. You have to manage your own hardware if you're running on your own
14:20 computer, making sure that it's on 24/7, if you want your database on 24/7, that
14:24 kind of thing. It's just less maintenance with cloud AI. And then
14:28 probably the biggest advantage of cloud AI overall is that you have better
14:34 models available to you. Claude 4 Sonnet or Opus, for example, is more powerful
14:40 than any local AI that you could run. So we have this gap here and this gap was a
14:45 lot bigger at one point even a year ago. The best cloud LLMs absolutely crushed
14:50 the best local LLMs, and that gap is starting to diminish. And so I really
14:55 see a future where that gap is diminished entirely and all the best
14:59 local LLMs are actually on par with the best cloud ones. That's the future I
15:03 see. That's why I think that local AI is such
15:07 a big deal because the advantages of local AI, those are just going to get
15:12 more prevalent over time when businesses realize they really want private and
15:15 secure solutions. And then the advantages of cloud AI, I think those
15:19 are actually going to diminish over time. That's the key: minimal setup,
15:24 less maintenance. Well, those advantages are going to go away as we have
15:28 platforms and better instructions and solutions to make the setup and
15:32 maintenance easier for local AI and we have the gap that's continuing to
15:35 diminish between the power of these LLMs. All these advantages are going to
15:39 actually go away and then it'll just completely make sense to use local AI
15:43 honestly probably for like every single solution in the future. That's really
15:47 what I see us heading towards. And then the last advantage to cloud AI which
15:51 also I think will go away over time is that you have some features out of the
15:55 box, like you have memory that's built directly into ChatGPT. Gemini has web
15:59 search baked in even when you use it through the API like these kind of
16:02 capabilities that are out of the box that you have to implement yourself with
16:07 local AI maybe as tools for your agent and you can definitely do that but it is
16:10 nice that these things are out of the box for cloud AI. So those are the pros
16:15 and cons between the two. I hope that this makes it very clear for you to pick
16:19 right now for your own use case. Should I implement local AI or cloud AI? A lot
16:24 of it comes down to the security and privacy requirements for your use case.
16:28 Now, the next big thing that we need to talk about for local AI is hardware
16:33 requirements. Because here's the thing, large language models are very resource
16:39 intensive. You can't just run any LLM on any computer. And the reason for that is
16:43 large language models are made up of billions or even trillions of numbers
16:47 called parameters. And they're all connected together in a web that looks
16:51 kind of like this. This is a very simplified view with just a few
16:54 parameters here. But each of the parameters are nodes and they're
16:58 connected together. The input layer is where our prompt comes in and our prompt
17:02 is fed through all these hidden layers and then we have the output at the end.
17:05 This is the response we get back from the LLM. But like I said, this is a very
17:10 simplified view. GPT4, for example, like you can see on the right hand side, is
17:15 estimated to have 1.4 trillion parameters. And so, if you want to fit
17:20 an entire large language model into your graphics card, you have to store all of
17:25 these numbers. And even though we can handle gigabytes at a time in our
17:28 graphics cards through what is called VRAM, storing billions or trillions of
17:34 numbers is absolutely insane. And so that's why large language models, you
17:37 actually have to have a pretty good graphics card if you want to run some of
17:42 the best ones. And so looking at Ollama here, when we see these different sizes,
17:47 going back to their model list, like 1.5 billion parameters or 27 billion
17:51 parameters, there are different sizes for the local LLMs. Obviously, the
17:56 larger the local LLM that you are running, the more performance you are going to
17:59 get, but you are going to be limited to what you are capable of running with
18:04 your graphics card or your hardware. So, with that in mind, I now want to dive
18:07 into the nitty-gritty details with you so you know exactly the kind of models
18:11 that you can run, the kind of speeds you can expect depending on your hardware.
18:15 And if you want to invest in new hardware to run local AI, I've got some
18:19 recommendations as well. So there are generally four primary size ranges for
18:25 large language models based on the speed and the power that you are looking for.
18:29 You have models that are around seven or 8 billion parameters. Those are
18:33 generally the smallest that I'd recommend trying to run. There are a lot
18:37 of smaller LLMs available like 1 billion parameters or three billion parameters,
18:41 but I'm so unimpressed when I use those LLMs that I don't really want to focus
18:45 on them here. 7 billion parameters is still tiny compared to the large cloud
18:51 AI models like Claude or GPT, but you can get pretty good results with them
18:55 for just simple chat use cases. And so for these models, assuming a Q4
18:59 quantization, which I'll get into quantization in a little bit, it's
19:02 basically just a way to make the LLM a lot smaller without hurting performance
19:06 that much, a 7 billion parameter model will need about 4 to 5 GB of VRAM on
19:11 your graphics card. And so if you have something like a 3060 Ti from Nvidia
19:16 with 8 GB of VRAM, you can very comfortably run a 7 billion parameter
19:20 model and you can expect to get very roughly around 25 to 35 tokens per
19:27 second. A token is roughly equivalent to a word. And so your local large language
19:31 model at 7 billion parameters with this graphics card will get about 25 to 35
19:37 words per second out on the screen to you being streamed out. And then if you
19:42 use much more powerful hardware like a 3090 to run a 7 billion parameter model
19:47 then you'll just jack up the speed a lot more. So that's 7 billion or 8 billion
19:51 parameters. Another very common size is something around 14 billion parameters.
19:57 This will take about 8 to 10 GB of VRAM. And so just a couple of options for
20:01 this. You have the 4070 Ti, which usually has 16 GB of VRAM, or you could go as
20:08 low as 12 GB of VRAM with the 3080 Ti. And you could expect to get about 15 to
20:13 25 words per second. And then this is where you start to get into basic tool
20:17 calling. So I find that when you are building with a 7 billion parameter
20:21 model, they don't do tool calling very well. So you can't really build that
20:25 powerful of agents around a 7 billion parameter model. But once you get to
20:29 something around 14 billion parameters, that's when I see agents being able to
20:33 really accept instructions well around tools and system prompts and leveraging
20:37 tools to do things on our behalf. That's when we can really start to use LLMs to
20:42 make things that are agentic. And then the next big category of LLMs
20:47 is somewhere between 30 and 34 billion parameters. You see a lot of LLMs that
20:51 fall in that size range. This will typically need 16 to 20 GB of VRAM.
20:57 And so a 3090 is a really good example of a graphics card that can run this. It
21:03 has 24 GB of VRAM. I actually have two 3090s myself. And I'll have a link to my
21:08 exact PC that I built for running local AI in the description of this video. So
21:12 I have two 3090s, which we'll need in a second for a 70 billion parameter model, but
21:16 one is enough for a 32 billion parameter model. And then also Macs with their new
21:22 M4 chips are very powerful with their unified memory architecture. So if you
21:27 get a Mac M4 Pro with 24 GB of unified memory, you can also run 32 billion
21:32 parameter models. Now the speed isn't going to be the best necessarily, and
21:36 again, this does depend a lot on your computer overall, but you can expect
21:40 something around 10 to 15, maybe up to 20 tokens per second. And 32 billion
21:46 parameters is when you really start to see LLMs that are actually pretty
21:50 impressive. Like 7 billion and 14 billion, they are disappointing quite a
21:54 bit. I'll be totally honest. Especially when you try to use them with more
21:58 complicated agentic tasks. 32 billion, when you start to get into this range, is
22:02 when I'm actually genuinely impressed. I'm like, "Oh, this is actually pretty
22:05 close to the performance of some of the best cloud AI." And then 70 billion
22:10 parameters. This is going to take about 35 to 40 GB of VRAM. For most consumer
22:19 GPUs like 3090s and 4090s, even 5090s, that's not actually enough VRAM. And so
22:22 this is when you have to start to split a large language model across multiple
22:28 GPUs which solutions like Olama will actually help you do this right out of
22:31 the box. So it's not this insane setup even though it might feel kind of
22:35 daunting like oh I have to split the layers of my LLM between GPUs. It's not
22:38 actually that complicated. And so two 3090s or two 4090s, that will be necessary.
22:44 Or you could have more of an enterprise-grade GPU like an H100. So
22:49 Nvidia has a lot of these non-consumer-grade GPUs that have a lot more VRAM to
22:53 handle things like 70 billion parameter models. And the speed won't be the best
22:58 if you're using something like two 3090s, especially because performance is hurt
23:01 when you have to split an LLM between GPUs. You could expect something like 8
23:06 to 12 tokens per second. And this is obviously if you have the most complex
23:09 agents that you're really trying to match the performance of cloud AI as
23:13 much as possible, that's when you'd want to use a 70 billion parameter model. And
23:16 then if you're investing in hardware to run local AI, I have a couple of quick
23:20 recommendations here. And a lot of this depends on the size of the model that's
23:24 going to be good enough for your use case. And so I'll dive into some
23:28 alternatives for running local AI directly if you want to do testing
23:32 before you buy infrastructure. I'll get into that in a little bit, but
23:35 recommended builds. If you want to spend around $800 to build a PC, I'd recommend
23:41 getting a 4060Ti graphics card and then 32 GB of RAM. If you want to spend
23:47 $2,000, I'd recommend either getting a PC with a 3090 and 64 GB of RAM or
23:53 getting that Mac M4 Pro with 24 GB of unified memory. And then lastly, if you
23:58 want to spend $4,000, which is about what I spent for my PC, then I'd
24:02 recommend getting two 3090 graphics cards, and I got both of mine used for
24:07 around $700 each. Um, and then also getting 128 GB of RAM, or you can get a
24:15 Mac M4 Max with 64 GB of unified memory. So, I wanted to really get into the
24:19 nitty-gritty details there. So, I know I spent a good amount of time diving into
24:22 super specific numbers, but I hope this is really helpful for you. No matter the
24:26 large language model or your hardware, you now know generally where you're at
24:30 for what you can run. So, to go along with that information overload, I want
24:34 to give you some specifics, individual LLMs that you can try right now based on
24:38 the size range that you know will work for your hardware. So, just a couple of
24:42 recommendations here. The first one that I want to focus on is Deepseek R1. This
24:47 is the most popular local LLM ever. It completely blew up a few months ago. And
24:52 the best part about DeepSeek R1 is they have an option that fits into each of
24:56 the size ranges that I just covered in that chart. So they have a 7 billion
25:00 parameter, 14, 32, and 70. The exact numbers that I mentioned earlier. And
25:05 then there is also the full real version of R1, which is 671 billion parameters.
25:10 I'm sorry though, you probably don't have the hardware to run that unless
25:13 you're spending tens of thousands on your infrastructure. So, probably stick
25:16 with one of these based on your graphics card or if you have a Mac computer, pick
25:19 the one that'll work for you and just try it out. You can click on any one of
25:23 these sizes here. And then here's your command to download and run it. And this
25:28 is defaulting to a Q4 quantization, which is what I was assuming in the
25:31 chart earlier. And again, I will cover what that actually means in a little bit
25:35 here. The other one that I want to focus on here is Qwen 3. This is a lot newer.
25:41 Qwen 3 is so good. And they don't have a 70 billion parameter option, but they do
25:45 have all the other sizes that fit into those ranges that I mentioned
25:49 earlier. Like they got 8 billion, 14 billion, and 32 billion parameters. And
25:52 the same kind of deal where you click on the size that you want and you've got
25:55 your command to install it here. And this is a reasoning LLM just like
26:01 DeepSeek R1. And then the other one that I want to mention here is Mistral Small.
26:05 I've had really good results with this as well. There are less options here,
26:08 but you've got 22 or 24 billion parameters, which is going to work well
26:12 with a 3090 graphics card or if you have a Mac M4 Pro with 24 GB of unified
26:18 memory. Really, really good model. And then also, there is a version of it that
26:22 is fine-tuned for coding specifically, called Devstral, which is another
26:26 really cool LLM worth checking out as well if you have the hardware to run it.
26:30 So, that is everything for just general recommendations for local LLMs to try
26:34 right now. This is the part of the master class that is going to become
26:38 outdated the fastest because there are new local LLMs coming out every single
26:42 month. I don't really know how long my recommendations will last for. But in
26:45 general, you can just go to the model list in Ollama, search through the ones there,
26:49 find one that has a size that works with your graphics card, and just give it
26:52 a shot. You can install it and run it very easily with Ollama. And the other
26:57 thing that I want to mention here is you don't always have to run open-source
27:01 large language models yourself. You can use a platform like OpenRouter. You can
27:05 just go to openrouter.ai, sign up, add in some API credits. You can try these
27:10 open-source LLMs yourself, maybe if you want to see what's powerful enough for
27:15 your agents before you invest in hardware to actually run them yourself.
27:18 And so within OpenRouter, I can just search for Qwen here. And I can go down
27:23 to Qwen and I can go to 32 billion. They have a free offering as well that
27:26 doesn't have the best rate limits. So I'll just go to this one right here,
27:31 Qwen 3 32B. So I can try the model out through OpenRouter. They actually host
27:35 it for me. So it's an open-source, non-local version, but now I can try it
27:39 in my agents to see if this is good. And then if it's good, it's like, okay, now
27:43 I want to buy a 3090 graphics card so that I can install it directly through
27:47 Ollama instead. And so the 32 billion Qwen 3 is exactly what we're seeing here
27:51 in OpenRouter. And there are other platforms like Groq as well where you
27:55 can run these open-source large language models not on your own infrastructure
27:58 if you just want to do some testing beforehand or whatever that might
28:01 be. So I wanted to call that out as an alternative as well. But yeah, that's
28:04 everything for my general recommendations for LLMs to try and use
28:09 in your agents. All right, it is time to take a quick breather. This is
28:12 everything that we've covered already in our master class. What is local AI? Why
28:17 we care about it? Why it's the future and hardware requirements. And I really
28:20 wanted to dive deep into this stuff because it sets the stage for everything
28:24 that we do when we actually build agents and deploy our infrastructure. And so
28:28 the last thing that I want to do with you before we really start to get into
28:33 building agents and setting up our package is I want to talk about some of
28:37 the tricky stuff that is usually pretty daunting for anyone getting into local
28:42 AI. I'm talking things like offloading models, quantization, environment
28:47 variables to handle things like uh flash attention, all the stuff that is really
28:51 important that I want to break down simply for you so you can feel confident
28:55 that you have everything set up right, that you know what goes into using local
29:00 LLMs. The first big concept to focus on here is quantization. And this is
29:04 crucial. It's how we can make large language models a lot smaller so they
29:10 can fit on our GPUs without hurting performance too much. We are lowering
29:14 the model precision here. And so what basically what that means is we have
29:18 each of our parameters, all of our numbers for our LLMs that are 16 bits
29:23 with the full size, but we can lower the precision of each of those parameters to
29:28 8, four, or two bits. Don't worry if you don't understand the technicalities of
29:31 that. Basically, it comes down to LLMs are just billions of numbers. That's the
29:35 parameters that we already covered. And we can make these numbers less precise
29:40 or smaller without losing much performance. So, we can fit larger LLMs
29:45 within a GPU that normally wouldn't even be close to running the full-size model.
29:50 Like with 32 billion parameter LLMs, for example, I was assuming a Q4
29:55 quantization, like 4 bits per parameter, in that diagram earlier. If you had the
30:00 full 16-bit parameters for the 32 billion parameter LLM, there's no way it could
30:06 fit on your Mac or your 3090 GPU, but we can use quantization to make it
30:10 possible. It's like rounding a number that has a long decimal to something
30:15 like 10.44 instead of this thing that has like 10 decimal places, but we're
30:19 doing it for each of the billions of parameters, those numbers that we have.
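To put rough numbers on that idea, here is a minimal back-of-the-envelope sketch of how parameter count and bits-per-parameter translate into memory. These figures are approximations only: real usage is higher because of the KV cache for your context and other overhead, and exact sizes vary by model and quantization scheme.

```python
def approx_model_size_gb(params_billion: float, bits_per_param: int) -> float:
    """Rough memory footprint of the weights alone: parameters x bits, converted to GB."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal GB, to keep the arithmetic simple

for bits, label in [(16, "FP16"), (8, "Q8"), (4, "Q4"), (2, "Q2")]:
    size = approx_model_size_gb(32, bits)  # a 32 billion parameter model
    print(f"{label:>4}: ~{size:.0f} GB of VRAM just for the weights")

# Approximate output:
#   FP16: ~64 GB  -> far too big for a single 24 GB card like a 3090
#     Q8: ~32 GB
#     Q4: ~16 GB  -> roughly why a Q4 32B model fits on a 3090 or a 24 GB Mac
#     Q2: ~8 GB
```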
30:23 And so just to give you a visual representation of this, you can also
30:27 quantize images just like you can quantize LLMs. And so we have our full
30:31 scale image on the left-hand side here comparing it to different levels of
30:35 quantization. We have 16-bit, 8-bit, and 4-bit. And you can see that at first with
30:40 a 16-bit quantization, it almost looks the same. But then once we go down to
30:44 4-bit, you can very much see that we have a huge loss in quality for the image.
30:49 Now with images, it's more extreme than with LLMs. When we do an 8-bit or a 4-bit
30:54 quantization of an LLM, we don't actually lose that much performance the way we lose a lot
30:58 of quality with images. And so that's why it's so useful for us. And so I have
31:01 a table just to kind of describe what this looks like. So FP16, that's the
31:07 16-bit precision that all LLMs have as a base. That is the full size. The speed
31:11 is obviously going to be very slow because the model is a lot bigger, but
31:16 your quality is perfect compared to what it could be. I mean, obviously that
31:18 doesn't mean that you're going to get perfect answers all the time. I'm just
31:22 saying it's the 100% results from this LLM. And then going down to a Q8
31:28 precision, so it's half the size. The speed is going to be a lot better. And
31:33 the quality is near-perfect. So it's not like performance is cut in half just
31:37 because size is. You still have the same number of parameters. Each one is just a
31:42 bit less precise. And so you're still going to get almost the same results.
31:47 And then going down to a Q4, 4-bit, it's a fourth of the size. It's going to be very
31:52 fast compared to 16 bit. And the quality is still going to be great. Now, these
31:57 numbers are very vague on purpose. There's not really a way for me to
32:01 quantify exactly the difference, especially because it changes per LLM
32:05 and your hardware and everything like that. So, I'm just being very general
32:09 here. And then once you get to Q2, the size goes down a lot. It's going to
32:13 be very very fast, but usually your performance starts to go down quite a
32:17 bit once you go down to a Q2. And then like the note that I have in the bottom
32:22 left here, a Q4 quantization is generally the best balance. And so when
32:26 you are thinking to yourself, which large language model should I run? What
32:31 size should I use? My rule of thumb is to pick the largest large language model
32:37 that can work with your hardware with a Q4 quantization. That is why I assumed
32:42 that in the table earlier. And then also, like we saw in Ollama earlier, it always
32:47 defaults to a Q4 quantization because the 16-bit is just so big compared to Q4
32:52 that most of the LLMs you couldn't even run yourself. And a Q4 of a 32 billion
32:59 parameter model is still going to be a lot more powerful than a full-precision 7
33:02 billion or 14 billion parameter model, because you don't actually
33:07 lose that much performance. So that is quantization. So just to make this very
33:11 practical for you, I'm back here in the model list for Qwen 3. We have all these
33:15 models that don't specify a quantization, but we can see that it
33:20 defaults to Q4 because if I click on any one of them, the quantization right here
33:26 is a Q4_K_M. And don't worry about the K_M. That's just a way to group
33:30 parameters. You have K_S, K_M, and K_L. It's kind of outside of the scope of
33:33 what really matters for you. The big thing is the Q4 like the actual number
33:38 here. So Q4 quantization is the default for Qwen 3 32B and really any model in
33:44 Ollama. But if we want to see the other quantized variants and we want to run
33:48 them, you can click on the view all. This is available no matter the LLM that
33:52 you're seeing in Ollama. Now we can scroll through and see all the levels of
33:56 quantization for each of the parameter sizes for Qwen 3. So, if I scroll all
34:01 the way down, the absolute biggest version of Qwen 3 available is the
34:08 full 16-bit of the 235 billion parameter Qwen 3. And it is a whopping 470 GB just
34:14 to install this. And there is no way that you're ever going to lay hands on
34:17 infrastructure to run this unless you're working for a very large enterprise. But
34:22 I can go down here, let's say, to 14 billion parameters and I can run the Q4
34:27 like this. So, you can click on any one that you want to run. Like let's say I
34:30 want to run Q8. I can click on this and then I have the command to pull and run
34:35 this specific quantization of the 14 billion parameter model. So each of the
34:39 quantized variants, they have a unique ID within Ollama. So you can very
34:42 specifically choose the one that you want. Again, my general recommendation is
34:47 just to go with what Ollama recommends, which is just defaulting to
34:51 Q4. Like if I go to DeepSeek R1, you can see that also defaults to Q4 no matter
34:56 the size that I pick. But if you do want to explore different quantizations, you
35:00 want to try to run the absolute full model for maybe something smaller like 7
35:04 billion or 14 billion, you can definitely do that through Ollama and
35:08 really any other provider of local LLMs. So that is everything for quantization.
35:12 It's important to know how that works, but yes, generally stick with a Q4 of
35:17 the largest LLM that you can run. The next concept that is very important to
35:21 understand is offloading. All offloading is, is splitting the layers for your
35:26 large language model between your GPU and your CPU and RAM. It's kind of
35:30 crazy, but large language models don't have to fit entirely in your GPU. All
35:36 large language models can be split into layers, layers of the different weights,
35:40 and you can have some of it running on your GPU. So, it's stored in your VRAM
35:45 and computed by the GPU. And then some of the large language models stored in
35:50 your RAM, computed by the CPU. Now, this does hurt performance a lot. And so,
35:55 generally, you want to avoid offloading if you can. You want to be able to fit
35:59 everything in your GPU, which by the way, the context, like your prompts for
36:04 your local LLMs, that is also stored in VRAM. And so, sometimes you'll see what
36:08 happens when you have very long conversations for a large language model
36:13 that barely fit in your GPU. That'll actually tip it over the edge. So, it
36:16 starts to offload some of it to the CPU and RAM. So keep that in mind when you
36:19 have longer conversations and all of a sudden things get really slow, you know
36:24 that offloading is happening. Sometimes this is necessary though as context
36:28 grows. And if you're only offloading a little bit of the LLM or a little bit of
36:32 the conversation, whatever to the CPU and RAM, it won't affect performance
36:36 that much. And so sometimes if you're trying to squeeze the biggest size you
36:41 can into your machine for an LLM, you can take advantage of offloading to run
36:45 something bigger or have a much larger conversation. Just know that usually it
36:49 kind of sucks. Like when I have offloading start to happen, my machine
36:53 gets bogged down and the responses are a lot slower. It's really not fun, but it
36:59 is possible. And fun fact, by the way, if your GPU is full and your CPU and RAM
37:04 is full, you can actually offload to storage, like literally using your hard
37:07 drive or SSD. That's when it's like incredibly slow and just terrible. But
37:11 just fun fact, you can actually do that. Now, the very last thing that I want to
37:15 cover before we dive into some code, setting up the local AI package, and
37:21 building out some agents is a few very crucial parameters, environment
37:25 variables for Olama. So, these are environment variables that you can set
37:28 on your machine just like any other based on your operating system. And
37:32 Ollama does have an FAQ for setting up some of these things, which I'll link to
37:36 in the description as well. But yeah, these are a bit more technical, so
37:40 people skip past setting this stuff up a lot, but it's actually really, really
37:44 important to make things very efficient when running local LLMs. So the first
37:49 environment variable is flash attention. You want to set this to one or true.
37:54 When you have this set to true, it's going to make the attention calculation
37:59 a lot more efficient. It sounds fancy, but basically large language models when
38:04 they are generating a response, they have to calculate which parts of your
38:08 prompt to pay the most attention to. That's the calculation. And you can make
38:12 it a lot more efficient without losing much performance at all by setting up
38:16 the flash attention, setting that to true. And then for another optimization,
38:21 just like we can quantize the LLM itself, you can also quantize or
38:27 compress the context. So your system prompt, the tool descriptions, your
38:31 prompt and conversation history, all that context that's being sent to your
38:36 LLM, you can quantize that as well. So Q4 is my general recommendation for
38:41 quantizing LLMs. Q8 is the general recommendation for quantizing the
38:46 context memory. It's a very simplified explanation, but it's really, really
38:50 useful because a long conversation can also take a lot of VRAM just like larger
38:55 LLM. And so it's good to compress that. And then the third environment variable,
38:58 this is actually probably the most crucial one to set up for Ollama. There
39:02 is this crazy thing. I don't know why Ollama does it, but by default, they
39:07 limit every single large language model to 2,000 tokens for the context limit,
39:13 which is just tiny compared to, you know, Gemini being 1 million tokens and
39:17 Claude being 200,000 tokens. Like, they handle very, very large prompts. And a
39:21 lot of local large language models can also handle large prompts. But Ollama
39:25 will limit you to default to 2,000 tokens. And so you have to override that
39:30 yourself with this environment variable. And so generally I recommend starting
39:34 with about 8,000 tokens to start. You can move this all the way up to
39:38 something like 32,000 tokens if your local large language model supports
39:42 that. And if you view the model page on Ollama, you can see the context length
39:46 that's supported by the LLM. But you definitely want to, you know, jack this
39:50 up more from just 2,000 because a lot of times when you have longer
39:53 conversations, you're going to get past 2,000 tokens very, very quickly. So, do
39:57 not miss this. If your large language model is starting to go completely off
40:02 the rails and ignore your system prompt and forget that it has these tools that
40:06 you gave it, it's probably because you reached the context length. And so, just
40:10 keep that in mind. I see people miss this a lot. And then the very last
40:14 environment variable, uh, probably the least important out of all these four,
40:18 but if you're running a lot of different large language models at once and you're
40:22 trying to shove them all in your GPU, a lot of times you can have issues. And so
40:25 in Olama, you can limit the number of models that are allowed to be in your
40:29 memory at a single time. With this one, typically you want to set this to either
40:33 one or two. Definitely set this to just one if you are using large language
40:37 models that basically fill up your GPU, like it's going to fit exactly into
40:41 your VRAM and you're not going to have room for another large language model.
40:44 But if you are running more smaller ones and maybe you could actually fit two on
40:48 your GPU with the VRAM that you have, you can set this to two. So again, more
40:52 technical overall, but it's very important to have these right. And we'll
40:55 get into the local AI package where I already have these set up in the
40:59 configuration. And then by the way, this is the Ollama FAQ that I referenced a
41:02 minute ago that I'll have linked in the description. And so there's actually a
41:06 lot of good things to read into here, like being able to verify that your GPU
41:10 is compatible with Ollama. How can you tell if the model's actually loaded on
41:13 your GPU? So, a lot of like sanity check things that they walk you through in the
41:17 FAQ as well. Also talking about environment variables, which I just
41:20 covered. And so, they've got some instructions here depending on your OS
41:23 how to get those set up. So, if there's anything that's confusing to you, this
41:26 is a very good resource to start with. So, I'm trying to make it possible for
41:30 you to look into things further if there's anything that doesn't quite make
41:33 sense for what I explained here. And of course, always let me know in the
41:35 comments if you have any questions on this stuff as well, especially the more
41:39 technical stuff that I just got to cover because it's so important even though I
41:43 know we really want to dive into the meat of things, which we are actually
41:47 going to do now. All right, here is everything that we have covered at this
41:50 point. And congratulations if you have made it this far because I covered all
41:55 the tricky stuff with quantization and the hardware requirements and offloading
41:59 and some of our little configuration and parameters. So, if you got all of that,
42:03 the rest of it is going to be a walk in the park as we start to dive into code,
42:07 getting all of our local AI set up and building out some agents. You understand
42:10 the foundation now that we're going to build on top of to make some cool stuff.
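To tie the configuration piece together, here is a minimal sketch of the four Ollama environment variables just discussed. The exact variable names and defaults depend on your Ollama version (check the FAQ linked in the description), and you would normally set these in your OS or Docker configuration rather than in a script; launching `ollama serve` from Python here is just one way to illustrate them side by side.

```python
import os
import subprocess

env = os.environ.copy()
env["OLLAMA_FLASH_ATTENTION"] = "1"      # make the attention calculation more efficient
env["OLLAMA_KV_CACHE_TYPE"] = "q8_0"     # quantize the context (KV cache) to 8-bit
env["OLLAMA_CONTEXT_LENGTH"] = "8192"    # raise the tiny default (~2,000 token) context window
env["OLLAMA_MAX_LOADED_MODELS"] = "1"    # keep only one model loaded in VRAM at a time

# Start the Ollama server with these settings applied (Ctrl+C to stop it).
subprocess.run(["ollama", "serve"], env=env)
```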
42:15 And so, now the next thing that we're going to do is talk about how we can use
42:19 local AI anywhere. We're going to dive into OpenAI compatibility and I'll show
42:23 you an example. We can take something that is using OpenAI right now,
42:27 transform it into something that is using Ollama and a local LLM. So, we'll
42:31 actually dive into some code here. And I've got my fair share of no code stuff
42:35 in this master class as well, but I want to focus on both because I think it's
42:38 really important to use both code and no code whenever applicable. And that
42:42 applies to local AI just like building agents in general. So, I've already
42:45 promised a couple of times that I would dive into OpenAI API compatibility, what
42:50 it is, and why it's so important. And we're going to dive into this now so you
42:54 can really start to see how you can take existing agents and transform them into
42:59 being 100% local with local large language models without really having to
43:03 touch the code or your workflow at all. It is a beautiful thing because OpenAI
43:10 has created a standard for exposing large language models through an API.
43:14 It's called the chat completions API. It's kind of like how the Model Context
43:19 Protocol (MCP) is a standard for connecting agents to tools. The chat
43:23 completions API is a standard for exposing large language models over an
43:28 API. So you have this common endpoint along with a few other ones that all of
43:35 these providers implement. This is the way to access the large language model
43:39 to get a response based on some conversation history that you pass in.
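Just to make that "common endpoint" concrete, here is a minimal sketch of the request shape, assuming a local Ollama server on its default port and a model you have already pulled (the model tag is just an example). The same JSON body and response structure work against any OpenAI-compatible provider; only the host and the API key change.

```python
import requests

resp = requests.post(
    "http://localhost:11434/v1/chat/completions",  # Ollama's OpenAI-compatible endpoint
    json={
        "model": "qwen3:14b",  # any model already pulled with `ollama pull`
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello from a fully local LLM!"},
        ],
    },
)

data = resp.json()
print(data["choices"][0]["message"]["content"])  # same response structure as OpenAI
```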
43:43 So, Ollama has implemented this as of February. We have other providers:
43:49 Gemini is OpenAI compatible, so is Groq, and so is OpenRouter, which we saw earlier.
43:53 Almost every single provider is OpenAI API compatible. And so, not only is it
43:57 very easy to swap between large language models within a specific provider, it's
44:02 also very easy to swap between providers entirely. You can go from Gemini to
44:09 OpenAI, or OpenAI to Ollama, or OpenAI to Groq, just with changing basically one
44:13 piece of configuration pointing to a different base URL as it is called. So
44:18 you can access that provider and then the actual API endpoint that you hit
44:22 once you are connected to that specific provider is always the exact same and
44:26 the response that you get back is also always the exact same. And so Ollama has
44:31 this implemented now. And I'll link to this article in the description as well
44:33 if you want to read through this because they have a really neat Python example.
44:37 It shows where we create an OpenAI client and the only thing we have to do
44:42 to connect to Ollama instead of OpenAI is change this base URL. So now we are
44:47 pointing to Ollama that is hosted locally instead of pointing to the URL for
44:51 OpenAI. So we'd reach out to them over the internet and talk to their LLMs. And
44:55 then with Ollama, you don't actually need an API key because everything's running
44:58 locally. So you just need some placeholder value here. But there is no
45:02 authentication that is going on. You can set that up. I'm not going to dive into
45:05 that right now. But by default, because it's all just running locally, you don't
45:09 even need an API key to connect to Ollama. And then once we have our OpenAI
45:13 client set up that is actually talking to Ollama, not OpenAI, we can use it in
45:18 exactly the same way. But now we can specify a model that we have downloaded
45:22 locally already through Ollama. We pass in our conversation history in the same way,
45:27 and we access the response, like the content the AI produced, the token usage,
45:31 all those things that we get back from the response, in the same way.
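As a rough sketch of what that swap looks like with the official openai Python package (the model tag and placeholder key are just examples; point base_url at wherever your Ollama server is running):

```python
from openai import OpenAI

# The only real change from talking to OpenAI: the base URL now points at local Ollama.
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # placeholder; a default local Ollama install has no authentication
)

response = client.chat.completions.create(
    model="qwen3:14b",  # a model you've already downloaded through Ollama
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize why OpenAI API compatibility matters."},
    ],
)

print(response.choices[0].message.content)  # the content the model produced
print(response.usage)                       # token usage comes back in the same shape too
```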
45:34 They've got a JavaScript example as well. They have a couple of examples
45:38 using different frameworks like the Vercel AI SDK and AutoGen. Really any
45:44 AI agent framework can work with OpenAI API compatibility to make it very easy
45:47 to swap between these different providers. Like Pydantic AI, my favorite
45:52 AI agent framework, also supports OpenAI API compatibility. So you can easily,
45:57 within your Pydantic AI agents, swap between these different providers. And
46:02 so what I have for you now is two code bases that I want to cover. The first
46:07 one is the local AI package, which we'll dive into in a little bit. But right
46:13 now, we have all of the agents that we are going to be creating in this master
46:17 class. So I have a couple for N8N that are also available in this repository.
46:21 And then a couple of scripts that I want to share with you as well. And so the
46:25 very first thing that I want to show you is this simple script that I have called
46:31 OpenAI compatible demo. And so you can download this repository. I'll have this
46:34 linked in the description as well. There's instructions for downloading and
46:38 setting up everything in here. And this is all 100% local AI. And so with that,
46:43 I'm going to go over into my Windsurf here where I have this OpenAI compatible
46:47 demo set up. So I've got a comment at the top reminding us what the OpenAI API
46:52 compatibility looks like. We set our base URL to point to Ollama hosted
46:59 locally and it's hosted on port 11434 by default. So I can actually show you
47:02 this. I have Ollama running in a Docker container, which we're going to dive
47:05 into this when we set up the local AI package, but you can see that it is
47:11 being exposed on port 11434. And by the way, you can see the
47:14 127.0.0.1 in that URL that I have highlighted here, that is synonymous with localhost.
47:21 And so this right here, you could also replace with 127.0.0.1.
47:25 Just a little tidbit there. It's not super important. I just typically leave
47:28 it as localhost. And then you can change the port as well. I'm just sticking to
47:32 what the default is. And then again, we don't need to set our API key. We can
47:36 just set it to any value that we want here. We just need some placeholder even
47:39 though there is no authentication with Ollama for real unless you configure
47:43 that. So that's OpenAI compatibility. And the important thing with this script
47:47 here is I have two different configurations here. I have one for
47:51 talking to OpenAI and then one for Ollama. So with OpenAI, we set our base
47:57 URL to point to api.openai.com. We have our OpenAI API key set in our
48:01 environment variables. So you can just set all your environment variables here
48:05 and then rename this to env. I've got instructions for that in the readme of
48:08 course. And then going back to the script, we are using GPT-4.1 Nano for our
48:13 large language model. It's something super fast and cheap. And then for our
48:17 Ollama configuration, we are setting the base URL here, localhost:11434,
48:22 or just whatever we have set in our environment variables. Same thing for
48:26 the API key. And then same thing for our large language model. And what I'm going
48:31 to be using in this case is Qwen 3 14B. That is one of the large language models
48:34 that I showed you within the Ollama website. Definitely a smaller one
48:38 compared to what I could run, but I just want to run something fast. And very
48:41 small large language models are great for simple tasks like summarization or
48:45 just basic chat. And that's what I'm going to be using here just for a simple
48:49 demo. And so whether it's enabled or not, this configuration is just based on
48:53 what we have set for our environment variables. And the important thing here
48:59 is the code that runs for each of these configurations just as we go through
49:03 this demo is exactly the same. We are parameterizing the configuration for the
49:08 base URL and API key. So we are setting up the exact same OpenAI client just
49:13 like we saw in the Ollama article, but just changing the base URL and API key.
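A minimal sketch of that parameterization (the environment variable names here are illustrative, not necessarily the exact ones the script uses): the provider-specific details live in a small config, and the client construction and every later call stay identical.

```python
import os
from openai import OpenAI

# Two interchangeable configurations; which one you pick is the only difference.
PROVIDERS = {
    "openai": {
        "base_url": os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        "api_key": os.getenv("OPENAI_API_KEY", ""),
        "model": os.getenv("OPENAI_MODEL", "gpt-4.1-nano"),
    },
    "ollama": {
        "base_url": os.getenv("OLLAMA_BASE_URL", "http://localhost:11434/v1"),
        "api_key": os.getenv("OLLAMA_API_KEY", "ollama"),  # placeholder, no real auth locally
        "model": os.getenv("OLLAMA_MODEL", "qwen3:14b"),
    },
}

def make_client(provider: str) -> tuple[OpenAI, str]:
    """Build a client for the chosen provider and return it with its default model."""
    cfg = PROVIDERS[provider]
    return OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"]), cfg["model"]

# From here on, client.chat.completions.create(...) is called identically for either provider.
client, model = make_client("ollama")
```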
49:17 And so then, for example, when we use it right here, it's
49:22 client.chat.completions.create, calling the exact same function no matter if we're
49:26 using OpenAI or Ollama. And then we're handling the response in the same way as
49:31 well. And so I'll go back to my terminal now. And so I went through all the steps
49:34 already to set up my virtual environment, install all of my dependencies. And so now I can run the
49:40 command OpenAI compatible demo. And now it's going to present the two
49:43 configuration options for me. And so I can run through OpenAI. So we'll go
49:46 ahead and do that first. And these two demos are going to look exactly the
49:50 same, but that is the point. And so we have our base URL here for OpenAI. We
49:54 have a basic example of a completion with GPT-4.1 Nano. There we go. So this
49:59 is the model that was used. Here are the number of tokens. And this is our
50:03 response. And then I can press enter to see a streaming response now as well. So
50:07 we saw it type out our answer in real time. And then I can press enter one
50:10 more time. This is the last part of the demo: just a simple multi-turn conversation.
50:14 So we got a couple of messages here in our conversation history. So very nice
50:19 and simple. The point here is to now show you that I can run this and select
50:23 Olama now instead and everything is going to look exactly the same and all
50:27 of the code is the same as well. It is only our configuration that is
50:31 different. And so it will take a little bit when you first run this because
50:36 Olama has to load the large language model into your GPU. And so going to the
50:42 logs for Olama, I can show you what this looks like here. And so when we first
50:47 make a request when Qwen 3 14B is not loaded into our GPU yet, you're going to
50:52 see a lot of logs come in here, and you'll have this container up and
50:54 running when you have the local AI package, which we'll cover in a little
50:57 bit. So it shows all the metadata about our model, like that it's Qwen 3 14B. We can
51:05 see here that we have a Q4 KM quantization like we saw on the Olama
51:09 website. What other information do we have here? There's just so much to
51:14 digest here. Another really important thing is we have the
51:19 context length. I have that set to 8,192 just like I recommended in the
51:22 environment variables. And then we can see that we offloaded all of the layers
51:26 to the GPU. So I don't have to do any offloading to the CPU or the RAM. I can
51:30 keep everything in the GPU, which is certainly ideal, like I said, to make sure this is actually fast.
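By the way, if you ever want to sanity check this yourself, a quick way (assuming the ollama CLI is on your PATH) is the ps command, whose PROCESSOR column shows how much of the model is sitting on the GPU:

    import subprocess

    # Lists the models currently loaded and whether they are on GPU or CPU,
    # e.g. a PROCESSOR value like "100% GPU" means nothing was offloaded.
    subprocess.run(["ollama", "ps"], check=True)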
51:34 And then when we get a response from Qwen 3 14B,
51:41 we are calling the v1/chat/completions endpoint because it is OpenAI API
51:46 compatible. So that exact endpoint that we hit for OpenAI is the one that we are
51:50 hitting here with a large language model that is running entirely on our computer
51:54 in Olama. And so the response I get back, it's actually a reasoning LLM as
51:58 well. So we even have the thinking tokens here, which is super cool. And so
52:02 we got our response. It's just printing out the first part of it here just to
52:04 keep it short. And then I can press enter. And we can see a streaming demo
52:08 as well. And it's going to be a lot faster this time because we do already
52:11 have the model loaded into our GPU. And so that first request when it first has
52:15 to load a model is always the slower one. And then it's faster going forward
52:19 once that model is already loaded in our GPU. And then as long as we don't swap
52:24 to another large language model and use that one, then it will remain in our GPU
52:28 for some time. And so then all of our responses after are faster. And then we
52:33 just have the last part of our demo here with a multi-turn conversation. So we
52:37 can see conversation history in action as well, just not with streaming here.
52:40 And everything's a bit slower with this large language model because
52:43 it is a reasoning one. And so, if you want faster
52:47 inference, you can certainly always use a non-reasoning local LLM like Mistral or
52:52 Gemma, for example. So that is our very simple demo showing how this works. I
52:55 hope that you can see this. And again, this works with other AI agent
52:59 frameworks like Agno or Pydantic AI or CrewAI as well. They all work in
53:03 this way, where you can use OpenAI API compatibility to swap between providers
53:08 so easily that you don't have to recreate things to use local AI. And that's
53:11 something so important that I want to communicate with you, because if I'm the
53:15 one introducing you to local AI, I also want to show you how it can very easily
53:19 fit into your existing systems and automations. All right. Now, we have
53:23 gotten to the part of the local AI master class that I'm actually the most
53:27 excited for because over the past months, I have very much been pouring my
53:31 heart and soul into building up something to make it infinitely easier
53:35 for you to get everything up and running for local AI. And that is the local AI
53:40 package. And so, right now, we're going to walk through installing it step by
53:44 step. I don't want you to miss anything here because it's so important to get
53:47 this up and running, get it all working well. Because if you have the local AI
53:51 package running on your machine and everything is working, you don't need
53:55 anything else to start building AI agents running 100% offline and
53:59 completely private. And so here's the thing. At this point, we've been
54:04 focusing mostly on Olama and running our local large language models. But there's
54:08 the whole other component to local AI that I introduced at the start of the
54:13 master class for our infrastructure: things like our database and local and
54:18 private web search, our user interface, agent monitoring. We have all these
54:23 other open-source platforms that we also want to run along with our large
54:27 language models and the local AI package is the solution to bring all of that
54:32 together curated for you to install in just a few steps. So, here is the GitHub
54:37 repository for the local AI package. I'll have this linked in the description
54:41 below. Just to be very clear, there are two GitHub repos for this master class.
54:45 We have this one that we covered earlier. This has our N8N and Python
54:49 agents that we'll cover in a bit, as well as the OpenAI compatible demo that
54:53 we saw earlier. So, you want to have this cloned and the local AI package as
54:57 well. Very easy to get both up and running. And if you scroll down in the
55:02 local AI package, I have very comprehensive instructions for setting
55:06 up everything, including how to deploy it to a private server in the cloud,
55:10 which we'll get into at the end of this master class, and a troubleshooting
55:13 section at the bottom. So, everything that I'm about to walk you through here,
55:17 there's instructions in the readme as well if you just want to circle back to
55:21 clarify anything. Also, I dive into all of the platforms that are included in
55:26 the local AI package. And this is very important because like I said, when you
55:30 want to build a 100% offline and private AI agent, it's a lot more than just the
55:35 large language model. You have all of the accompanying infrastructure like
55:39 your database and your UI. And so I have all that included. First of all, I have
55:44 N8N, that is our low/no-code workflow automation platform. We'll be building
55:48 an agent with N8N in the local AI package in a little bit once we have it
55:52 set up. We have Supabase for our open-source database. We have Olama. Of
55:56 course, we want to have this in the package as well for our LLMs. Open WebUI,
56:01 which gives us a ChatGPT-like interface for us to talk to our LLMs and
56:06 have things like conversation history. Very, very nice. So, we're looking at
56:09 this right here. This is included in the package. Then we have Flowise. It's
56:13 similar to N8N; it's another really good tool to build AI agents with no/low
56:18 code. Qdrant, which is an open-source vector database. Neo4j, which is a
56:25 knowledge graph engine. And then SearXNG for open-source, completely free and
56:31 private web search. Caddy, which is going to be very important for us once
56:35 we deploy the local AI package to the cloud and we actually want to have
56:38 domains for our different services like N8N and Open WebUI. And then the last
56:42 thing is Langfuse. This is an open-source LLM engineering platform; it helps
56:47 us with agent observability. Now, some of these services are outside of the scope
56:52 for this local AI master class. I don't want to spend a half hour on every
56:56 single one of these services and make this a 10-hour video. I will be focusing
57:02 in this video on N8N, Supabase, Olama, Open WebUI, SearXNG, and then Caddy once
57:08 we deploy everything to the cloud. So, I do cover like half of these services.
57:12 And the other thing that I want to touch on here is that there are quite a few
57:16 things included here. And so you do need about 8 GB of RAM on your machine or
57:21 your cloud server to run everything. It is pretty big overall. And so you can
57:26 remove certain things like if you don't want Qdrant and Langfuse for example,
57:30 you can take those out of the package. More on that later. It doesn't have to
57:34 be super bloated, you can whittle this down to what you need. But yeah, there's
57:37 a lot of different things that go into building AI agents. And so I have all of
57:40 these services here so that no matter what you need, I've got you covered.
6:12 Since local AI means running large language models on our own computer, it's not as easy as
6:18 just going to claude.ai or chatgpt.com and typing in a prompt. We have to
6:22 actually install something, but it still is very easy to get started. So, let me
6:25 show you right now with a hands-on example. So, here we are within the
6:30 website for Olama. This is ollama.com. I'll have a link to this in the
6:33 description of the video. This is one of the open-source platforms that allows
6:39 us to very easily download and run local large language models. And so, you just
6:42 have to go to their homepage here and click on this nice big download button.
6:45 You can install it for Windows, Mac or Linux. It really works for any operating
6:49 system. Then once you have it up and running on your machine, you can open up
6:53 any terminal. Like I'm on Windows here, so I'm in a PowerShell session and I can
6:58 run Olama commands now to do things like view the models that I have available on
7:03 my machine. I can download models and I can run them as well. And the way that I
7:08 know how to pull and run specific models is I can just go to this models tab in
7:12 their navigation and I can browse and filter through all of the open source
7:16 LLMs that are available to me like DeepSeek R1. Almost everyone is familiar
7:20 with DeepSeek. It just totally blew up back in February and March. We have
7:25 Gemma 3, Qwen 3, Llama 4, a few of them that I mentioned earlier when we had the
7:30 presentation up. And so we can click into any one of these like I can go into
7:35 DeepSeek R1 for example and then I have the command right here that I can copy
7:39 to download and run this specific model in my terminal. And there are a lot of
7:44 different model variants of DeepSeek R1. So we'll get into different sizes and
7:47 hardware requirements and what that all means in a little bit, but I'll just
7:50 take one of them and run it as an example. So I'll just do a really small
7:54 one right now. I'll do a 1.5 billion parameter large language model. And
7:58 again, I'll explain what that means in a little bit. I can copy this command.
8:01 It's just ollama run and then the unique ID of this large language model. So I'll
8:06 go back into my terminal. I'll clear it here and then paste in this command. And
8:09 so first it's going to have to pull this large language model. And the total size
8:15 for this is 1.1 GB. And so it'll have to download it. And then because I used the
8:20 run command, it will immediately get me into a chat interface with the model
8:24 once it's downloaded. Also, if you don't want to run it right away, you
8:27 just want to install it, you can do ollama pull instead of ollama run. And
8:32 then again, to view the models that you have available to you installed already,
8:36 you can just do the ollama list command like I did earlier. And so, right now,
8:39 I'll pause and come back once it's installed in about 30 seconds. All
8:43 right, it is now installed. And now I can just send in a message like hello.
8:47 And then boom, we are now talking to a large language model. But instead of it
8:50 being hosted somewhere else and we're just using a website, this is running on
8:54 my own infrastructure, the large language model and all the billions of
8:59 parameters are getting loaded onto my graphics card and running the inference.
9:02 That's what it's called when we're generating a response from the LLM
9:06 directly within this terminal here. And so I can ask another question like,
9:13 what is the best GPU right now? We'll see what it says. So it's thinking
9:16 first. This is actually a thinking model. Deepseek R1 is a reasoning LLM.
9:20 And then it gives us an answer. Its top GPU models today: 3080, RX 6700.
9:27 Obviously, we have a training cutoff for local large language models just like we
9:30 do with ones in the cloud like GPT. And so the information is a little outdated
9:34 here, but yeah, this is a good answer. So we have a large language model that
9:38 we're talking to directly on our machine. And then to close out of this,
9:42 I can just do control D or command D on Mac. And if I do list, we have all the
9:47 other models that you saw earlier, plus now this one that I just installed. So
9:50 these are all available for me to run again just with that Olama run command.
9:54 And it won't have to reinstall if you already have it installed. Run just
9:58 installs it first if you don't have it already. So that is just a quick demo of
10:03 using Olama. We'll dive a lot more into Olama later, like how we can actually
10:07 use it within our Python code and within our N8N workflows. This is just our
10:11 quick way to try it out within the terminal.
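And as a quick taste of what's coming later with Python, if you install the official ollama Python package (pip install ollama), the same kind of chat looks roughly like this; the exact response shape can vary a little between package versions:

    # pip install ollama
    import ollama

    # Talk to the locally installed DeepSeek R1 1.5B model we just pulled.
    response = ollama.chat(
        model="deepseek-r1:1.5b",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    # Older versions return a plain dict; newer ones also allow
    # response.message.content.
    print(response["message"]["content"])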
10:15 Now, to really get into why we should care about local AI now that we know what it is, I want to cover the
10:19 pros and cons of local AI and what I like to call cloud AI. That's just when
10:23 you're paying for things to be hosted for you, like using Claude or Gemini or
10:28 using the cloud version of N8N instead of hosting it yourself. And I also want
10:32 to cover the advantages of each because I don't want to sugarcoat things and
10:35 just hype up this master class by telling you that you should always use
10:39 local AI. That is certainly not the case. There is a time and place for both
10:43 of these categories here, but there are so many use cases where local AI is
10:50 absolutely crucial. You have no idea how many businesses I have talked to that
10:54 are willing to put tens of thousands of dollars into running their own LLMs and
10:58 infrastructure because privacy and security is so crucial for the things
11:02 that they're building with AI. And that actually gets into the first advantage
11:06 here of local AI, which is privacy and security. You can run things 100%
11:11 offline. The data that you're giving to your LLMs as prompts, it now doesn't
11:16 leave your hardware. It stays entirely within your own control. And for a lot
11:20 of businesses, that is 100% crucial, especially when they're in highly
11:24 regulated industries like the health industry, finance, even real estate.
11:28 Like there's so many use cases where you're working with intellectual
11:32 property or just really sensitive information. You don't want to be
11:36 sending your data off to an LLM provider like Google or OpenAI or Anthropic. And
11:41 so as a business owner, you should definitely be paying attention to this
11:45 if you are working with automation use cases where you're dealing with any kind
11:48 of sensitive data. And then also if you're a freelancer, you're starting an
11:52 AI automation agency, anything where you're building for other businesses,
11:55 you are going to have so many opportunities open up to you when you're
11:59 able to work with local AI because you can handle those use cases now where
12:03 they need to work with sensitive data and you can't just go and use the OpenAI
12:08 API. And that is the main advantage of local AI. It is a very big deal. But
12:11 there are a few other things that are worth focusing on as well. Starting with
12:16 model fine-tuning, you can take any open- source large language model and
12:20 add additional training on top with your own data. Basically making it a domain
12:24 expert on your business or the problem that you are solving. It's so so
12:29 powerful. Through fine-tuning, you can make models more powerful than the best
12:33 of the best in the cloud, depending on what you are able to fine-tune with,
12:38 meaning the data that you have. And you can do fine-tuning with some cloud
12:42 models like with GPT, but your options are pretty limited and it can be quite
12:46 expensive. And so it definitely is a huge advantage to local AI. And local AI
12:52 in general can be very cost-effective, including the infrastructure as well. So
12:56 your LLMs and your infrastructure. You run it all yourself and you pay for
13:00 nothing besides the electricity bill if it's running on your computer at your
13:04 house or if you have some private server in the cloud. You just have to pay for
13:07 that server and that's it. There's no N8N bill, no Supabase bill, no OpenAI
13:13 bill. You can save a lot of money. It's really, really nice. And on top of that,
13:17 when everything is running on your own infrastructure, the agents that you
13:22 create can run on the same server, the same place as your infrastructure. And
13:26 so it can actually be faster because you don't have network delays calling APIs
13:30 for all your different services for your LLMs and your database and things like
13:34 that. And then with that, we can now get into the advantages of cloud AI.
13:38 Starting with it's a lot easier to set up. There's a reason why I have to have
13:42 this master class for you in the first place. There are some initial hurdles
13:47 that we have to jump over to really have everything fully set up for our local
13:51 LLMs and infrastructure. And you just don't have that with cloud AI because
13:54 you can very simply call into these APIs. You just have to sign up and get
13:58 an API key and that's about it. So, it certainly is easier to get up and
14:02 running and there's less maintenance overall because they are hosting things
14:05 for you. Supabase is hosting the database for you. OpenAI is hosting the
14:09 LLM for you. So, you don't have to manage things on your own hardware. With
14:13 Local AI, you have to apply patches and updates if you have a private server in
14:17 the cloud. You have to manage your own hardware if you're running on your own
14:20 computer, making sure that it's on 24/7, if you want your database on 24/7, that
14:24 kind of thing. It's just less maintenance with cloud AI. And then
14:28 probably the biggest advantage of cloud AI overall is that you have better
13:34 models available to you. Claude 4 Sonnet or Opus, for example, is more powerful
13:40 than any local AI that you could run. So we have this gap here, and this gap was a
13:45 lot bigger at one point, even a year ago. The best cloud LLMs absolutely crushed
13:50 the best local LLMs, and that gap is starting to diminish. And so I really
14:55 see a future where that gap is diminished entirely and all the best
14:59 local LLMs are actually on par with the best cloud ones. That's the future I
15:03 see. That's why I think that local AI is such
15:07 a big deal because the advantages of local AI, those are just going to get
15:12 more prevalent over time when businesses realize they really want private and
15:15 secure solutions. And then the advantages of cloud AI, I think those
15:19 are actually going to diminish over time. That's the key: minimal setup,
15:24 less maintenance. Well, those advantages are going to go away as we have
15:28 platforms and better instructions and solutions to make the setup and
15:32 maintenance easier for local AI and we have the gap that's continuing to
15:35 diminish between the power of these LLMs. All these advantages are going to
15:39 actually go away and then it'll just completely make sense to use local AI
15:43 honestly probably for like every single solution in the future. That's really
15:47 what I see us heading towards. And then the last advantage to cloud AI which
15:51 also I think will go away over time is that you have some features out of the
15:55 box, like you have memory that's built directly into ChatGPT. Gemini has web
15:59 search baked in even when you use it through the API like these kind of
16:02 capabilities that are out of the box that you have to implement yourself with
16:07 local AI maybe as tools for your agent and you can definitely do that but it is
16:10 nice that these things are out of the box for cloud AI. So those are the pros
16:15 and cons between the two. I hope that this makes it very clear for you to pick
16:19 right now for your own use case. Should I implement local AI or cloud AI? A lot
16:24 of it comes down to the security and privacy requirements for your use case.
16:28 Now, the next big thing that we need to talk about for local AI is hardware
16:33 requirements. Because here's the thing: large language models are very resource
16:39 intensive. You can't just run any LLM on any computer. And the reason for that is
16:43 large language models are made up of billions or even trillions of numbers
16:47 called parameters. And they're all connected together in a web that looks
16:51 kind of like this. This is a very simplified view with just a few
16:54 parameters here. But each of the parameters are nodes and they're
16:58 connected together. The input layer is where our prompt comes in and our prompt
17:02 is fed through all these hidden layers and then we have the output at the end.
17:05 This is the response we get back from the LLM. But like I said, this is a very
17:10 simplified view. GPT-4, for example, like you can see on the right-hand side, is
17:15 estimated to have 1.4 trillion parameters. And so, if you want to fit
17:20 an entire large language model into your graphics card, you have to store all of
17:25 these numbers. And even though we can handle gigabytes at a time in our
17:28 graphics cards through what is called VRAM, storing billions or trillions of
17:34 numbers is absolutely insane. And so that's why, with large language models, you
17:37 actually have to have a pretty good graphics card if you want to run some of the best ones.
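Just to put rough numbers on that: at full 16-bit precision every parameter takes 2 bytes, so the storage adds up very quickly.

    # Back-of-the-envelope weight storage at FP16 (2 bytes per parameter).
    params_7b = 7e9
    params_gpt4_estimate = 1.4e12  # the ~1.4 trillion estimate mentioned above
    print(f"7B model at FP16:  ~{params_7b * 2 / 1e9:.0f} GB of weights")
    print(f"GPT-4-sized model: ~{params_gpt4_estimate * 2 / 1e12:.1f} TB of weights")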
17:42 And so looking at Olama here, when we see these different sizes,
17:47 going back to their model list, like 1.5 billion parameters or 27 billion
17:51 parameters, there are different sizes for the local LLMs. Obviously, the
17:56 larger the local LLM that you are running, the more performance you are going to
17:59 get, but you are going to be limited to what you are capable of running with
18:04 your graphics card or your hardware. So, with that in mind, I now want to dive
18:07 into the nitty-gritty details with you so you know exactly the kind of models
18:11 that you can run, the kind of speeds you can expect depending on your hardware.
18:15 And if you want to invest in new hardware to run local AI, I've got some
18:19 recommendations as well. So there are generally four primary size ranges for
18:25 large language models based on the speed and the power that you are looking for.
18:29 You have models that are around 7 or 8 billion parameters. Those are
18:33 generally the smallest that I'd recommend trying to run. There are a lot
18:37 of smaller LLMs available like 1 billion parameters or three billion parameters,
18:41 but I'm so unimpressed when I use those LLMs that I don't really want to focus
18:45 on them here. 7 billion parameters is still tiny compared to the large cloud
18:51 AI models like Claude or GPT, but you can get pretty good results with them
18:55 for just simple chat use cases. And so for these models, assuming a Q4
18:59 quantization, which I'll get into quantization in a little bit, it's
19:02 basically just a way to make the LLM a lot smaller without hurting performance
19:06 that much, a 7 billion parameter model will need about 4 to 5 GB of VRAM on
19:11 your graphics card. And so if you have something like a 3060 Ti from Nvidia
19:16 with 8 GB of VRAM, you can very comfortably run a 7 billion parameter
19:20 model and you can expect to get very roughly around 25 to 35 tokens per
19:27 second. A token is roughly equivalent to a word. And so your local large language
19:31 model at 7 billion parameters with this graphics card will get about 25 to 35
19:37 words per second streamed out on the screen to you. And then if you
19:42 use much more powerful hardware like a 3090 to run a 7 billion parameter model
19:47 then you'll just jack up the speed a lot more. So that's 7 billion or 8 billion
19:51 parameters. Another very common size is something around 14 billion parameters.
19:57 This will take about 8 to 10 GB of VRAM. And so just a couple of options for
20:01 this. You have the 4070Ti which is usually 16 GB of VRAM or you could go as
20:08 low as 12 GB of VRAM with the 3080 Ti. And you could expect to get about 15 to
20:13 25 words per second. And then this is where you start to get into basic tool
20:17 calling. So I find that when you are building with a 7 billion parameter
20:21 model, they don't do tool calling very well. So you can't really build that
20:25 powerful of agents around a 7 billion parameter model. But once you get to
20:29 something around 14 billion parameters, that's when I see agents being able to
20:33 really accept instructions well around tools and system prompts and leveraging
20:37 tools to do things on our behalf. That's when we can really start to use LLMs to
20:42 make things that are agentic. And then the next big category of LLMs
20:47 is somewhere between 30 and 34 billion parameters. You see a lot of LLMs that
20:51 fall in that size range. This will typically need 16 to 20 gigabytes of VRAM.
20:57 And so a 3090 is a really good example of a graphics card that can run this. It
21:03 has 24 GB of VRAM. I actually have two 3090s myself. And I'll have a link to my
21:08 exact PC that I built for running local AI in the description of this video. So
21:12 I have two 3090s, which we'll need in a second for a 70 billion parameter model, but
21:16 one is enough for a 32 billion parameter model. And then also Macs with their new
21:22 M4 chips are very powerful with their unified memory architecture. So if you
21:27 get a Mac M4 Pro with 24 GB of unified memory, you can also run 32 billion
21:32 parameter models. Now the speed isn't going to be the best necessarily, and
21:36 again, this does depend a lot on your computer overall, but you can expect
21:40 something around 10 to 15, maybe up to 20 tokens per second. And 32 billion
21:46 parameters is when you really start to see LLMs that are actually pretty
21:50 impressive. Like 7 billion and 14 billion, they are disappointing quite a
21:54 bit. I'll be totally honest. Especially when you try to use them with more
21:58 complicated agentic tasks. 32 billion when you start to get into this range is
22:02 when I'm actually genuinely impressed. I'm like, "Oh, this is actually pretty
22:05 close to the performance of some of the best cloud AI." And then 70 billion
22:10 parameters. This is going to take about 35 to 40 GB of VRAM, and for most consumer
22:19 GPUs like 3090s and 4090s, even 5090s, that's not actually enough VRAM. And so
22:22 this is when you have to start to split a large language model across multiple
22:28 GPUs which solutions like Olama will actually help you do this right out of
22:31 the box. So it's not this insane setup even though it might feel kind of
22:35 daunting like oh I have to split the layers of my LLM between GPUs. It's not
22:38 actually that complicated. And so two 3090s or two 4090s, that will be necessary.
22:44 Or you could have more of an enterprise-grade GPU like an H100. So
22:49 Nvidia has a lot of these non-consumer-grade GPUs that have a lot more VRAM to
22:53 handle things like 70 billion parameter models. And the speed won't be the best
22:58 if you're using something like two 3090s, especially because performance is hurt
23:01 when you have to split an LLM between GPUs. You could expect something like 8
23:06 to 12 tokens per second. And this is obviously if you have the most complex
23:09 agents and you're really trying to match the performance of cloud AI as
23:13 much as possible; that's when you'd want to use a 70 billion parameter model.
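To recap those ranges in one place, here are the same rough numbers I just walked through, assuming a Q4 quantization:

    # Rough guide only; real numbers depend on the exact model, quantization,
    # context length, and the rest of your system.
    size_ranges = {
        "7-8B":   {"vram_gb": "4-5",   "example_hardware": "RTX 3060 Ti (8 GB)",            "tokens_per_sec": "25-35"},
        "14B":    {"vram_gb": "8-10",  "example_hardware": "RTX 3080 Ti / 4070 Ti",          "tokens_per_sec": "15-25"},
        "30-34B": {"vram_gb": "16-20", "example_hardware": "RTX 3090 (24 GB) or Mac M4 Pro", "tokens_per_sec": "10-20"},
        "70B":    {"vram_gb": "35-40", "example_hardware": "2x RTX 3090/4090 or an H100",    "tokens_per_sec": "8-12"},
    }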
23:16 And then if you're investing in hardware to run local AI, I have a couple of quick
23:20 recommendations here. And a lot of this depends on the size of the model that's
23:24 going to be good enough for your use case. And so I'll dive into some
23:28 alternatives for running local AI directly if you want to do testing
23:32 before you buy infrastructure. I'll get into that in a little bit, but
23:35 recommended builds. If you want to spend around $800 to build a PC, I'd recommend
23:41 getting a 4060Ti graphics card and then 32 GB of RAM. If you want to spend
23:47 $2,000, I'd recommend either getting a PC with a 3090 and 64 GB of RAM or
23:53 getting that Mac M4 Pro with 24 GB of unified memory. And then lastly, if you
23:58 want to spend $4,000, which is about what I spent for my PC, then I'd
24:02 recommend getting two 3090 graphics cards, and I got both of mine used for
24:07 around $700 each. And then also getting 128 GB of RAM, or you can get a
24:15 Mac M4 Max with 64 GB of unified memory. So, I wanted to really get into the
24:19 nitty-gritty details there. So, I know I spent a good amount of time diving into
24:22 super specific numbers, but I hope this is really helpful for you. No matter the
24:26 large language model or your hardware, you now know generally where you're at
24:30 for what you can run. So, to go along with that information overload, I want
24:34 to give you some specifics, individual LLMs that you can try right now based on
24:38 the size range that you know will work for your hardware. So, just a couple of
24:42 recommendations here. The first one that I want to focus on is Deepseek R1. This
24:47 is the most popular local LLM ever. It completely blew up a few months ago. And
24:52 the best part about DeepSeek R1 is they have an option that fits into each of
24:56 the size ranges that I just covered in that chart. So they have a 7 billion
25:00 parameter, 14, 32, and 70. The exact numbers that I mentioned earlier. And
25:05 then there is also the full real version of R1, which is 671 billion parameters.
25:10 I'm sorry though, you probably don't have the hardware to run that unless
25:13 you're spending tens of thousands on your infrastructure. So, probably stick
25:16 with one of these based on your graphics card or if you have a Mac computer, pick
25:19 the one that'll work for you and just try it out. You can click on any one of
25:23 these sizes here. And then here's your command to download and run it. And this
25:28 is defaulting to a Q4 quantization, which is what I was assuming in the
25:31 chart earlier. And again, I will cover what that actually means in a little bit
25:35 here. The other one that I want to focus on here is Qwen 3. This is a lot newer.
25:41 Qwen 3 is so good. And they don't have a 70 billion parameter option, but they do
25:45 have all the other sizes that fit into those ranges that I mentioned
25:49 earlier. They've got 8 billion, 14 billion, and 32 billion parameters. And
25:52 the same kind of deal where you click on the size that you want and you've got
25:55 your command to install it here. And this is a reasoning LLM just like
26:01 DeepSeek R1. And then the other one that I want to mention here is Mistral Small.
26:05 I've had really good results with this as well. There are less options here,
26:08 but you've got 22 or 24 billion parameters, which is going to work well
26:12 with a 3090 graphics card or if you have a Mac M4 Pro with 24 GB of unified
26:18 memory. Really, really good model. And then also, there is a version of it that
26:22 is fine-tuned for coding specifically, called Devstral, which is another
26:26 really cool LLM worth checking out as well if you have the hardware to run it.
26:30 So, that is everything for just general recommendations for local LLMs to try
26:34 right now. This is the part of the master class that is going to become
26:38 outdated the fastest because there are new local LLMs coming out every single
26:42 month. I don't really know how long my recommendations will last. But in
26:45 general, you can just go to the model list in Olama, search through them,
26:49 find one that has the size that works with your graphics card, and just give it
26:52 a shot. You can install it and run it very easily with Olama. And the other
26:57 thing that I want to mention here is you don't always have to run open-source
27:01 large language models yourself. You can use a platform like OpenRouter. You can
27:05 just go to openrouter.ai, sign up, add in some API credits. You can try these
27:10 open-source LLMs yourself, maybe if you want to see what's powerful enough for
27:15 your agents before you invest in hardware to actually run them yourself.
27:18 And so within OpenRouter, I can just search for Qwen here. And I can go down
27:23 to Qwen and I can go to 32 billion. They have a free offering as well that
27:26 doesn't have the best rate limits. So I'll just go to this one right here,
27:31 Qwen 3 32B. So I can try the model out through OpenRouter. They actually host
27:35 it for me. So it's an open-source, non-local version, but now I can try it
27:39 in my agents to see if this is good. And then if it's good, it's like, okay, now
27:43 I want to buy a 3090 graphics card so that I can install it directly through
27:47 Olama instead. And so the 32 billion Qwen 3 is exactly what we're seeing here
27:51 in OpenRouter. And there are other platforms like Groq as well where you
27:55 can run these open-source large language models, not on your own infrastructure,
27:58 if you just want to do some testing beforehand or whatever that might
28:01 be. So I wanted to call that out as an alternative as well.
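Since OpenRouter is OpenAI API compatible too, trying one of these hosted open-source models follows the same pattern we'll keep seeing; the base URL below is OpenRouter's standard endpoint, but the model slug is just illustrative, so check their site for the exact ID:

    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="YOUR_OPENROUTER_KEY",  # placeholder
    )
    resp = client.chat.completions.create(
        model="qwen/qwen3-32b",  # illustrative slug; verify on openrouter.ai
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(resp.choices[0].message.content)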
28:04 But yeah, that's everything for my general recommendations for LLMs to try and use
28:09 in your agents. All right, it is time to take a quick breather. This is
28:12 everything that we've covered already in our master class. What is local AI? Why
28:17 we care about it? Why it's the future and hardware requirements. And I really
28:20 wanted to dive deep into this stuff because it sets the stage for everything
28:24 that we do when we actually build agents and deploy our infrastructure. And so
28:28 the last thing that I want to do with you before we really start to get into
28:33 building agents and setting up our package is I want to talk about some of
28:37 the tricky stuff that is usually pretty daunting for anyone getting into local
28:42 AI. I'm talking things like offloading models, quantization, environment
28:47 variables to handle things like flash attention, all the stuff that is really
28:51 important that I want to break down simply for you so you can feel confident
28:55 that you have everything set up right, that you know what goes into using local
29:00 LLMs. The first big concept to focus on here is quantization. And this is
29:04 crucial. It's how we can make large language models a lot smaller so they
29:10 can fit on our GPUs without hurting performance too much. We are lowering
29:14 the model precision here. And so basically what that means is we have
29:18 each of our parameters, all of our numbers for our LLMs that are 16 bits
29:23 with the full size, but we can lower the precision of each of those parameters to
29:28 8, 4, or 2 bits. Don't worry if you don't understand the technicalities of
29:31 that. Basically, it comes down to LLMs are just billions of numbers. That's the
29:35 parameters that we already covered. And we can make these numbers less precise
29:40 or smaller without losing much performance. So, we can fit larger LLMs
29:45 within a GPU that normally wouldn't even be close to running the full-size model.
29:50 Like with 32 billion parameter LLMs, for example, I was assuming a Q4
29:55 quantization like four bit per parameter in that diagram earlier. If you had the
30:00 full 16 bit parameter for the 32 billion parameter LLM, there's no way it could
30:06 fit on your Mac or your 3090 GPU, but we can use quantization to make it
30:10 possible. It's like rounding a number that has a long decimal to something
30:15 like 10.44 instead of this thing that has like 10 decimal points, but we're
30:19 doing it for each of the billions of parameters, those numbers that we have.
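If you want to see why this matters, here is the back-of-the-envelope math for a 32 billion parameter model, counting only the weights and ignoring context and overhead:

    params = 32e9
    for name, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4), ("Q2", 2)]:
        gb = params * bits / 8 / 1e9
        print(f"{name}: ~{gb:.0f} GB of weights")
    # FP16 ~64 GB, Q8 ~32 GB, Q4 ~16 GB, Q2 ~8 GB. Add a few GB for context,
    # and you can see why a Q4 32B model roughly fits a 24 GB card while the
    # full 16-bit version never could.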
30:23 And so just to give you a visual representation of this, you can also
30:27 quantize images just like you can quantize LLMs. And so we have our full
30:31 scale image on the left-hand side here, comparing it to different levels of
30:35 quantization. We have 16-bit, 8-bit, and 4-bit. And you can see that at first with
30:40 a 16-bit quantization, it almost looks the same. But then once we go down to
30:44 4-bit, you can very much see that we have a huge loss in quality for the image.
30:49 Now with images, it's more extreme than with LLMs. When we do an 8-bit or a 4-bit
30:54 quantization of an LLM, we don't actually lose that much performance the way we lose a lot
30:58 of quality with images. And so that's why it's so useful for us. And so I have
31:01 a table just to kind of describe what this looks like. So FP16, that's the
31:07 16-bit precision that all LLMs have as a base. That is the full size. The speed
31:11 is obviously going to be very slow because the model is a lot bigger, but
31:16 your quality is perfect compared to what it could be. I mean, obviously that
31:18 doesn't mean that you're going to get perfect answers all the time. I'm just
31:22 saying it's the 100% results from this LLM. And then going down to a Q8
31:28 precision, so it's half the size. The speed is going to be a lot better. And
31:33 the quality is near-perfect. So it's not like performance is cut in half just
31:37 because size is. You still have the same number of parameters. Each one is just a
31:42 bit less precise. And so you're still going to get almost the same results.
31:47 And then going down to a Q4, 4-bit, it's a fourth of the size. It's going to be very
31:52 fast compared to 16 bit. And the quality is still going to be great. Now, these
31:57 numbers are very vague on purpose. There's no great way for me to
32:01 quantify exactly the difference, especially because it changes per LLM
32:05 and your hardware and everything like that. So, I'm just being very general
32:09 here. And then once you get to Q2, the size goes down a lot. It's going to
32:13 be very very fast, but usually your performance starts to go down quite a
32:17 bit once you go down to a Q2. And then like the note that I have in the bottom
32:22 left here, a Q4 quantization is generally the best balance. And so when
32:26 you are thinking to yourself, which large language model should I run? What
32:31 size should I use? My rule of thumb is to pick the largest large language model
32:37 that can work with your hardware with a Q4 quantization. That is why I assumed
32:42 that in the table earlier. And then also like we saw in Olama earlier, it always
32:47 defaults to a Q4 quantization because the 16 bit is just so big compared to Q4
32:52 that most of the LLMs you couldn't even run yourself. And a Q4 of a 32 billion
32:59 parameter model is still going to be a lot more powerful than the full 7
33:02 billion parameter or 14 billion parameter because you don't actually
33:07 lose that much performance. So that is quantization. So just to make this very
33:11 practical for you, I'm back here in the model list for Qwen 3. We have all these
33:15 models that don't specify a quantization, but we can see that it
33:20 defaults to Q4 because if I click on any one of them, the quantization right here
33:26 is a Q4 KM. And don't worry about the KM. That's just a way to group
33:30 parameters. You have KS, KM, and KL. It's kind of outside of the scope of
33:33 what really matters for you. The big thing is the Q4 like the actual number
33:38 here. So Q4 quantization is the default for Qwen 3 32B and really any model in
33:44 Olama. But if we want to see the other quantized variants and we want to run
33:48 them, you can click on the view all. This is available no matter the LLM that
33:52 you're seeing in Olama. Now we can scroll through and see all the levels of
33:56 quantization for each of the parameter sizes for Qwen 3. So, if I scroll all
34:01 the way down, the absolute biggest version of Qwen 3 that you could run is the
34:08 full 16-bit of the 235 billion parameter Qwen 3. And it is a whopping 470 GB just
34:14 to install this. And there is no way that you're ever going to lay hands on
34:17 infrastructure to run this unless you're working for a very large enterprise. But
34:22 I can go down here, let's say, to 14 billion parameters and I can run the Q4
34:27 like this. So, you can click on any one that you want to run. Like let's say I
34:30 want to run Q8. I can click on this and then I have the command to pull and run
34:35 this specific quantization of the 14 billion parameter model. So each of the
34:39 quantized variants they have a unique ID within Olama. So you can very
34:42 specifically choose the one that you want. Again my general recommendation is
34:47 just to go with also what Olama recommends which is just defaulting to
34:51 Q4. Like if I go to DeepSeek R1, you can see that also defaults to Q4 no matter
34:56 the size that I pick. But if you do want to explore different quantizations, you
35:00 want to try to run the absolute full model for maybe something smaller like 7
35:04 billion or 14 billion, you can definitely do that through Olama and
35:08 really any other provider of local LLMs. So that is everything for quantization.
35:12 It's important to know how that works, but yes, generally stick with a Q4 of
35:17 the largest LLM that you can run. The next concept that is very important to
35:21 understand is offloading. Offloading is just splitting the layers of your
35:26 large language model between your GPU and your CPU and RAM. It's kind of
35:30 crazy, but large language models don't have to fit entirely in your GPU. All
35:36 large language models can be split into layers, layers of the different weights,
35:40 and you can have some of it running on your GPU. So, it's stored in your VRAM
35:45 and computed by the GPU. And then some of the large language models stored in
35:50 your RAM, computed by the CPU. Now, this does hurt performance a lot. And so,
35:55 generally, you want to avoid offloading if you can. You want to be able to fit
35:59 everything in your GPU, which by the way, the context, like your prompts for
36:04 your local LLMs, that is also stored in VRAM. And so, sometimes you'll see what
36:08 happens when you have very long conversations for a large language model
36:13 that barely fits in your GPU. That'll actually tip it over the edge. So, it
36:16 starts to offload some of it to the CPU and RAM. So keep that in mind when you
36:19 have longer conversations and all of a sudden things get really slow, you know
36:24 that offloading is happening. Sometimes this is necessary though as context
36:28 grows. And if you're only offloading a little bit of the LLM or a little bit of
36:32 the conversation, whatever to the CPU and RAM, it won't affect performance
36:36 that much. And so sometimes if you're trying to squeeze the biggest size you
36:41 can into your machine for an LLM, you can take advantage of offloading to run
36:45 something bigger or have a much larger conversation. Just know that usually it
36:49 kind of sucks. Like when I have offloading start to happen, my machine
36:53 gets bogged down and the responses are a lot slower. It's really not fun, but it
36:59 is possible. And fun fact, by the way, if your GPU is full and your CPU and RAM
37:04 is full, you can actually offload to storage, like literally using your hard
37:07 drive or SSD. That's when it's like incredibly slow and just terrible. But
37:11 just a fun fact: you can actually do that.
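If you ever do want to control offloading by hand, the ollama Python client accepts an options dictionary, and num_gpu is the number of layers to place on the GPU; leaving it unset lets Olama decide automatically. A rough sketch, assuming you've already pulled the model:

    import ollama

    response = ollama.chat(
        model="qwen3:14b",
        messages=[{"role": "user", "content": "Summarize offloading in one line."}],
        options={"num_gpu": 20},  # e.g. only 20 layers on the GPU, rest on CPU/RAM
    )
    print(response["message"]["content"])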
37:15 Now, the very last thing that I want to cover before we dive into some code, setting up the local AI package, and
37:21 building out some agents is a few very crucial parameters, environment
37:25 variables for Olama. So, these are environment variables that you can set
37:28 on your machine just like any other based on your operating system. And
37:32 Olama does have an FAQ for setting up some of these things, which I'll link to
37:36 in the description as well. But yeah, these are a bit more technical, so
37:40 people skip past setting this stuff up a lot, but it's actually really, really
37:44 important to make things very efficient when running local LLMs. So the first
37:49 environment variable is flash attention. You want to set this to one or true.
37:54 When you have this set to true, it's going to make the attention calculation
37:59 a lot more efficient. It sounds fancy, but basically large language models when
38:04 they are generating a response, they have to calculate which parts of your
38:08 prompt to pay the most attention to. That's the calculation. And you can make
38:12 it a lot more efficient without losing much performance at all by setting up
38:16 the flash attention, setting that to true. And then for another optimization,
38:21 just like we can quantize the LLM itself, you can also quantize or
38:27 compress the context. So your system prompt, the tool descriptions, your
38:31 prompt and conversation history, all that context that's being sent to your
38:36 LLM, you can quantize that as well. So Q4 is my general recommendation for
38:41 quantizing LLMs. Q8 is the general recommendation for quantizing the
38:46 context memory. It's a very simplified explanation, but it's really, really
38:50 useful because a long conversation can also take a lot of VRAM just like larger
38:55 LLM. And so it's good to compress that. And then the third environment variable,
38:58 this is actually probably the most crucial one to set up for Olama. There
39:02 is this crazy thing. I don't know why Olama does it, but by default, they
39:07 limit every single large language model to 2,000 tokens for the context limit,
39:13 which is just tiny compared to, you know, Gemini being 1 million tokens and
39:17 Claude being 200,000 tokens. Like, they handle very, very large prompts. And a
39:21 lot of local large language models can also handle large prompts. But Olama
39:25 will limit you by default to 2,000 tokens. And so you have to override that
39:30 yourself with this environment variable. And so generally I recommend starting
39:34 with about 8,000 tokens to start. You can move this all the way up to
39:38 something like 32,000 tokens if your local large language model supports
39:42 that. And if you view the model page on Olama, you can see the context length
39:46 that's supported by the LLM. But you definitely want to, you know, jack this
39:50 up more from just 2,000 because a lot of times when you have longer
39:53 conversations, you're going to get past 2,000 tokens very, very quickly. So, do
39:57 not miss this. If your large language model is starting to go completely off
40:02 the rails and ignore your system prompt and forget that it has these tools that
40:06 you gave it, it's probably because you reached the context length. And so, just
40:10 keep that in mind. I see people miss this a lot. And then the very last
40:14 environment variable, probably the least important out of all these four,
40:18 but if you're running a lot of different large language models at once and you're
40:22 trying to shove them all in your GPU, a lot of times you can have issues. And so
40:25 in Olama, you can limit the number of models that are allowed to be in your
40:29 memory at a single time. With this one, typically you want to set this to either
40:33 one or two. Definitely set this to just one if you are using large language
40:37 models that basically fill your GPU, like it's going to fit exactly into
40:41 your VRAM and you're not going to have room for another large language model.
40:44 But if you are running more smaller ones and maybe you could actually fit two on
40:48 your GPU with the VRAM that you have, you can set this to two. So again, more
40:52 technical overall, but it's very important to have these right. And we'll
40:55 get into the local AI package where I already have these set up in the
40:59 configuration.
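For reference, here is a minimal sketch of setting these if you launch the server yourself rather than through Docker; the variable names follow recent Olama releases, so double-check them against the FAQ mentioned below:

    import os
    import subprocess

    env = os.environ.copy()
    env.update({
        "OLLAMA_FLASH_ATTENTION": "1",     # more efficient attention calculation
        "OLLAMA_KV_CACHE_TYPE": "q8_0",    # quantize the context/KV cache to Q8
        "OLLAMA_CONTEXT_LENGTH": "8192",   # raise the small default context window
        "OLLAMA_MAX_LOADED_MODELS": "1",   # how many models may sit in memory at once
    })
    subprocess.run(["ollama", "serve"], env=env)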
41:02 And then by the way, this is the Olama FAQ that I referenced a minute ago that I'll have linked in the description. And so there's actually a
41:06 lot of good things to read into here, like being able to verify that your GPU
41:10 is compatible with Olama. How can you tell if the model's actually loaded on
41:13 your GPU? So, a lot of like sanity check things that they walk you through in the
41:17 FAQ as well. Also talking about environment variables, which I just
41:20 covered. And so, they've got some instructions here depending on your OS
41:23 how to get those set up. So, if there's anything that's confusing to you, this
41:26 is a very good resource to start with. So, I'm trying to make it possible for
41:30 you to look into things further if there's anything that doesn't quite make
41:33 sense for what I explained here. And of course, always let me know in the
41:35 comments if you have any questions on this stuff as well, especially the more
41:39 technical stuff that I just got to cover because it's so important even though I
41:43 know we really want to dive into the meat of things, which we are actually
41:47 going to do now. All right, here is everything that we have covered at this
41:50 point. And congratulations if you have made it this far because I covered all
41:55 the tricky stuff with quantization and the hardware requirements and offloading
41:59 and some of our little configuration and parameters. So, if you got all of that,
42:03 the rest of it is going to be a walk in the park as we start to dive into code,
42:07 getting all of our local AI set up and building out some agents. You understand
42:10 the foundation now that we're going to build on top of to make some cool stuff.
42:15 And so, now the next thing that we're going to do is talk about how we can use
42:19 local AI anywhere. We're going to dive into OpenAI compatibility and I'll show
42:23 you an example. We can take something that is using OpenAI right now,
42:27 transform it into something that is using Olama and a local LLM. So, we'll
42:31 actually dive into some code here. And I've got my fair share of no code stuff
42:35 in this master class as well, but I want to focus on both because I think it's
42:38 really important to use both code and no code whenever applicable. And that
42:42 applies to local AI just like building agents in general. So, I've already
42:45 promised a couple of times that I would dive into OpenAI API compatibility, what
42:50 it is, and why it's so important. And we're going to dive into this now so you
42:54 can really start to see how you can take existing agents and transform them into
42:59 being 100% local with local large language models without really having to
43:03 touch the code or your workflow at all. It is a beautiful thing because OpenAI
43:10 has created a standard for exposing large language models through an API.
43:14 It's called the chat completions API. It's kind of like how the Model Context
43:19 Protocol (MCP) is a standard for connecting agents to tools. The chat
43:23 completions API is a standard for exposing large language models over an
43:28 API. So you have this common endpoint along with a few other ones that all of
43:35 these providers implement. This is the way to access the large language model
43:39 to get a response based on some conversation history that you pass in.
43:43 So, Olama has implemented this as of February. We have other providers too:
43:49 Gemini is OpenAI compatible, so is Groq, and so is OpenRouter, which we saw earlier.
43:53 Almost every single provider is OpenAI API compatible. And so, not only is it
43:57 very easy to swap between large language models within a specific provider, it's
44:02 also very easy to swap between providers entirely. You can go from Gemini to
44:09 OpenAI or OpenAI to Olama or OpenAI to Groq just by changing basically one
44:13 piece of configuration pointing to a different base URL as it is called. So
44:18 you can access that provider and then the actual API endpoint that you hit
44:22 once you are connected to that specific provider is always the exact same and
44:26 the response that you get back is also always the exact same. And so Olama has
44:31 this implemented now. And I'll link to this article in the description as well
44:33 if you want to read through this because they have a really neat Python example.
44:37 It shows where we create an OpenAI client and the only thing we have to do
44:42 to connect to Olama instead of OpenAI is change this base URL. So now we are
44:47 pointing to Olama that is hosted locally instead of pointing to the URL for
44:51 OpenAI. So we'd reach out to them over the internet and talk to their LLMs. And
44:55 then with Olama, you don't actually need an API key because everything's running
44:58 locally. So you just need some placeholder value here. But there is no
45:02 authentication that is going on. You can set that up. I'm not going to dive into
45:05 that right now. But by default, because it's all just running locally, you don't
45:09 even need an API key to connect to Olama. And then once we have our OpenAI
45:13 client set up that is actually talking to Olama, not OpenAI, we can use it in
45:18 exactly the same way. But now we can specify a model that we have downloaded
45:22 locally already through Olama. We pass in our conversation history in the same way
45:27 and we access the response, like the content the AI produced and the token usage,
45:31 all those things that we get back from the response in the same way.
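Putting that together, the whole swap is roughly this; the model tag is just an example of something you would have pulled locally:

    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:11434/v1",  # Olama locally; point at another provider's URL to swap
        api_key="ollama",                      # placeholder, no auth by default
    )
    resp = client.chat.completions.create(
        model="qwen3:14b",                     # any model you've pulled locally
        messages=[{"role": "user", "content": "Hello from a local LLM!"}],
    )
    print(resp.choices[0].message.content)
    print(resp.usage.total_tokens)             # token usage comes back the same way too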
45:34 They've got a JavaScript example as well. They have a couple of examples
45:38 using different frameworks like the Vercel AI SDK and AutoGen. Really any
45:44 AI agent framework can work with OpenAI API compatibility to make it very easy
45:47 to swap between these different providers. Like Pydantic AI, my favorite
45:52 AI agent framework, also supports OpenAI API compatibility. So you can easily
45:57 swap between these different providers within your Pydantic AI agents.
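As a rough sketch of what that looks like in Pydantic AI, with the caveat that the class and attribute names follow recent releases and may differ slightly in the version you have installed:

    from pydantic_ai import Agent
    from pydantic_ai.models.openai import OpenAIModel
    from pydantic_ai.providers.openai import OpenAIProvider

    # Point the OpenAI-compatible model class at Olama instead of OpenAI.
    model = OpenAIModel(
        "qwen3:14b",
        provider=OpenAIProvider(base_url="http://localhost:11434/v1", api_key="ollama"),
    )
    agent = Agent(model, system_prompt="You are a helpful assistant.")
    result = agent.run_sync("Hello!")
    print(result.output)  # older releases expose this as result.data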
46:02 And so what I have for you now is two code bases that I want to cover. The first
46:07 one is the local AI package, which we'll dive into in a little bit. But right
46:13 now, we have all of the agents that we are going to be creating in this master
46:17 class. So I have a couple for N8N that are also available in this repository.
46:21 And then a couple of scripts that I want to share with you as well. And so the
46:25 very first thing that I want to show you is this simple script that I have called
46:31 OpenAI compatible demo. And so you can download this repository. I'll have this
46:34 linked in the description as well. There's instructions for downloading and
46:38 setting up everything in here. And this is all 100% local AI. And so with that,
46:43 I'm going to go over into my Windsurf here where I have this OpenAI compatible
46:47 demo set up. So I've got a comment at the top reminding us what the OpenAI API
46:52 compatibility looks like. We set our base URL to point to Olama hosted
46:59 locally and it's hosted on port 11434 by default. So I can actually show you
47:02 this. I have Ola running in a Docker container, which we're going to dive
47:05 into this when we set up the local AI package, but you can see that it is
47:11 being exposed on port 11434. And by the way, you can see the
47:14 127.0.0.1 in that URL that I have highlighted here, that is synonymous with localhost.
47:21 And so this right here, you could also replace with 127.0.0.1.
47:25 Just a little tidbit there. It's not super important. I just typically leave
47:28 it as localhost. And then you can change the port as well. I'm just sticking to
47:32 what the default is. And then again, we don't need to set our API key. We can
47:36 just set it to any value that we want here. We just need some placeholder, even
47:39 though there is no real authentication with Olama unless you configure
47:43 that. So that's OpenAI compatibility. And the important thing with this script
47:47 here is that I have two different configurations. I have one for
47:51 talking to OpenAI and then one for Olama. So with OpenAI, we set our base
47:57 URL to point to api.openai.com. We have our OpenAI API key set in our
48:01 environment variables. So you can just set all your environment variables here
48:05 and then rename this to .env. I've got instructions for that in the readme of
48:08 course. And then going back to the script, we are using GPT-4.1 nano for our
48:13 large language model. It's something super fast and cheap. And then for our
48:17 Olama configuration, we are setting the base URL here, localhost:11434,
48:22 or just whatever we have set in our environment variables. Same thing for
48:26 the API key. And then same thing for our large language model. And what I'm going
48:31 to be using in this case is Quen 3 14B. That is one of the large language models
48:34 that I showed you on the Olama website. Definitely a smaller one
48:38 compared to what I could run, but I just want to run something fast. And very
48:41 small large language models are great for simple tasks like summarization or
48:45 just basic chat. And that's what I'm going to be using here just for a simple
48:49 demo. And whether each configuration is enabled or not is just based on
48:53 what we have set in our environment variables. And the important thing here
48:59 is the code that runs for each of these configurations just as we go through
49:03 this demo is exactly the same. We are parameterizing the configuration for the
49:08 base URL and API key. So we are setting up the exact same OpenAI client just
49:13 like we saw in the Olama article but just changing the base URL and API key.
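Roughly, that parameterization looks like this (variable and environment-variable names here are illustrative, not necessarily the exact ones used in the repo's script):

```python
# Sketch: one client, two interchangeable configurations.
import os
from openai import OpenAI

use_ollama = os.getenv("USE_OLLAMA", "true").lower() == "true"

if use_ollama:
    base_url = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434/v1")
    api_key = os.getenv("OLLAMA_API_KEY", "ollama")   # placeholder value only
    model = os.getenv("OLLAMA_MODEL", "qwen3:14b")
else:
    base_url = "https://api.openai.com/v1"
    api_key = os.environ["OPENAI_API_KEY"]            # a real key is required here
    model = os.getenv("OPENAI_MODEL", "gpt-4.1-nano")

# Everything after this point is identical for both providers.
client = OpenAI(base_url=base_url, api_key=api_key)
```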
49:17 And so then, for example, when we use it right here, it's
49:22 client.chat.completions.create, calling the exact same function no matter if we're
49:26 using OpenAI or Olama. And then we're handling the response in the same way as
49:31 well. And so I'll go back to my terminal now. And so I went through all the steps
49:34 already to set up my virtual environment, install all of my dependencies. And so now I can run the
49:40 command OpenAI compatible demo. And now it's going to present the two
49:43 configuration options for me. And so I can run through OpenAI. So we'll go
49:46 ahead and do that first. And these two demos are going to look exactly the
49:50 same, but that is the point. And so we have our base URL here for OpenAI. We
49:54 have a basic example of a completion with GPT-4.1 nano. There we go. So this
49:59 is the model that was used. Here are the number of tokens. And this is our
50:03 response. And then I can press enter to see a streaming response now as well. So
50:07 we saw it type out our answer in real time. And then I can press enter one
50:10 more time. This is the last part of the demo, just a multi-turn conversation.
50:14 So we got a couple of messages here in our conversation history. So very nice
50:19 and simple. The point here is to now show you that I can run this and select
50:23 Olama now instead, and everything is going to look exactly the same, and all
50:27 of the code is the same as well. It is only our configuration that is
50:31 different. And so it will take a little bit when you first run this because
50:36 Olama has to load the large language model into your GPU. And so going to the
50:42 logs for Olama, I can show you what this looks like here. And so when we first
50:47 make a request when Quen 3 14B is not loaded into our GPU yet, you're going to
50:52 see a lot of logs come in here, and you'll have this container up and
50:54 running when you have the local AI package, which we'll cover in a little
50:57 bit. So it shows all the metadata about our model, like that it's Quen 3 14B. We can
51:05 see here that we have a Q4_K_M quantization like we saw on the Olama
51:09 website. What other information do we have here? There's just so much to
51:14 digest here. Yeah, another really important thing is we have the
51:19 context length. I have that set to 8,192, just like I recommended in the
51:22 environment variables. And then we can see that we offloaded all of the layers
51:26 to the GPU. So I don't have to do any offloading to the CPU or the RAM. I can
51:30 keep everything in the GPU, which is certainly ideal, like I said, to make
51:34 sure this is actually fast. And then when we get a response from Quen 3 14B,
51:41 we are calling the v1/chat/completions endpoint because it is OpenAI API
51:46 compatible. So that exact endpoint that we hit for OpenAI is the one that we are
51:50 hitting here with a large language model that is running entirely on our computer
51:54 in Olama. And so the response I get back, it's actually a reasoning LLM as
51:58 well. So we even have the thinking tokens here, which is super cool. And so
52:02 we got our response. It's just printing out the first part of it here just to
52:04 keep it short. And then I can press enter. And we can see a streaming demo
52:08 as well. And it's going to be a lot faster this time because we do already
52:11 have the model loaded into our GPU. And so that first request when it first has
52:15 to load a model is always the slower one. And then it's faster going forward
52:19 once that model is already loaded in our GPU. And then as long as we don't swap
52:24 to another large language model and use that one, then it will remain in our GPU
52:28 for some time. And so then all of our responses after are faster. And then we
52:33 just have the last part of our demo here with a multi-turn conversation. So we
52:37 can see conversation history in action as well, just not with streaming here.
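For reference, the streaming and multi-turn parts of the demo boil down to something like this (a sketch reusing the client and model from the configuration shown above):

```python
# Streaming: print tokens as they arrive instead of waiting for the full reply.
messages = [{"role": "user", "content": "Give me one fun fact about GPUs."}]
stream = client.chat.completions.create(model=model, messages=messages, stream=True)

reply = ""
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    reply += delta
    print(delta, end="", flush=True)
print()

# Multi-turn: append the assistant reply, then send a follow-up with the full history.
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "Summarize that fact in five words."})
follow_up = client.chat.completions.create(model=model, messages=messages)
print(follow_up.choices[0].message.content)
```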
52:40 And everything's a bit slower with this large language model because
52:43 it is a reasoning one. And so if you want faster
52:47 inference, you can always use a non-reasoning local LLM like Mistral or
52:52 Gemma, for example. So that is our very simple demo showing how this works. I
52:55 hope that you can see with this, and again, this works with other AI agent
52:59 frameworks like Agno or Pydantic AI or CrewAI as well. They all work in
53:03 this way, where you can use OpenAI API compatibility to swap between providers
53:08 so easily that you don't have to recreate things to use local AI. And that's
53:11 something so important that I want to communicate to you, because if I'm the
53:15 one introducing you to local AI, I also want to show you how it can very easily
53:19 fit into your existing systems and automations. All right. Now, we have
53:23 gotten to the part of the local AI master class that I'm actually the most
53:27 excited for because over the past months, I have very much been pouring my
53:31 heart and soul into building up something to make it infinitely easier
53:35 for you to get everything up and running for local AI. And that is the local AI
53:40 package. And so, right now, we're going to walk through installing it step by
53:44 step. I don't want you to miss anything here because it's so important to get
53:47 this up and running, get it all working well. Because if you have the local AI
53:51 package running on your machine and everything is working, you don't need
53:55 anything else to start building AI agents running 100% offline and
53:59 completely private. And so here's the thing. At this point, we've been
54:04 focusing mostly on Olama and running our local large language models. But there's
54:08 the whole other component to local AI that I introduced at the start of the
54:13 master class for our infrastructure: things like our database and local and
54:18 private web search, our user interface, agent monitoring. We have all these
54:23 other open-source platforms that we also want to run along with our large
54:27 language models and the local AI package is the solution to bring all of that
54:32 together curated for you to install in just a few steps. So, here is the GitHub
54:37 repository for the local AI package. I'll have this linked in the description
54:41 below. Just to be very clear, there are two GitHub repos for this master class.
54:45 We have this one that we covered earlier. This has our N8N and Python
54:49 agents that we'll cover in a bit, as well as the OpenAI compatible demo that
54:53 we saw earlier. So, you want to have this cloned and the local AI package as
54:57 well. Very easy to get both up and running. And if you scroll down in the
55:02 local AI package, I have very comprehensive instructions for setting
55:06 up everything, including how to deploy it to a private server in the cloud,
55:10 which we'll get into at the end of this master class, and a troubleshooting
55:13 section at the bottom. So, everything that I'm about to walk you through here,
55:17 there's instructions in the readme as well if you just want to circle back to
55:21 clarify anything. Also, I dive into all of the platforms that are included in
55:26 the local AI package. And this is very important because like I said, when you
55:30 want to build a 100% offline and private AI agent, it's a lot more than just the
55:35 large language model. You have all of the accompanying infrastructure like
55:39 your database and your UI. And so I have all that included. First of all, I have
55:44 N8N, which is our low/no-code workflow automation platform. We'll be building
55:48 an agent with N8N in the local AI package in a little bit once we have it
55:52 set up. We have Superbase for our open-source database. We have Olama. Of
55:56 course, we want to have this in the package as well for our LLMs. Open Web
56:01 UI, which gives us a ChatGPT-like interface for us to talk to our LLMs and
56:06 have things like conversation history. Very, very nice. So, we're looking at
56:09 this right here. This is included in the package. Then we have Flowwise. It's
56:13 similar to N8N. It's another really good tool to build AI agents with no or
56:18 low code. Quadrant, which is an open-source vector database. Neo4j, which is a
56:25 knowledge graph engine. And then CRXNG for open-source, completely free and
56:31 private web search. Caddy, which is going to be very important for us once
56:35 we deploy the local AI package to the cloud and we actually want to have
56:38 domains for our different services like N8N and Open Web UI. And then the last
56:42 thing is Langfuse. This is an open-source LLM engineering platform; it helps
56:47 us with agent observability. Now, some of these services are outside of the scope
56:52 for this local AI master class. I don't want to spend a half hour on every
56:56 single one of these services and make this a 10-hour video. I will be focusing
57:02 in this video on N8N, Superbase, Olama, Open WebUI, CRXNG, and then Caddy once
57:08 we deploy everything to the cloud. So, I do cover like half of these services.
57:12 And the other thing that I want to touch on here is that there are quite a few
57:16 things included here. And so you do need about 8 GB of RAM on your machine or
57:21 your cloud server to run everything. It is pretty big overall. And so you can
57:26 remove certain things like if you don't want Quadrant and Langfuse for example,
57:30 you can take those out of the package. More on that later. It doesn't have to
57:34 be super bloated, you can whittle this down to what you need. But yeah, there's
57:37 a lot of different things that go into building AI agents. And so I have all of
57:40 these services here so that no matter what you need, I've got you covered. And
57:44 so with that, we can now move on to installing the local AI package. And
57:48 these instructions will work for you on any operating system, any computer. Even
57:52 if you don't have a really good GPU to run local large language models, you
57:56 still could always use OpenAI or Anthropic, something like that, and then
58:00 run everything else locally to save on costs or just to have everything running
58:04 on your computer. And so there are a couple of prerequisites that you have to
58:08 have before you can do the instructions below. You need Python so you can run
58:12 the start script that boots everything up. Git or GitHub desktop so you can
58:16 clone this GitHub repository, bring it all onto your own machine. And then you
58:21 want Docker or Docker Desktop. And so I've got links for all of these. Docker
58:25 and Docker Desktop we need because all of these local AI services that I've
58:29 curated for you, they all run as individual Docker containers that are
58:34 all combined together in a stack. And so I'll actually show you: this is the end
58:36 result once we have everything up and running within your Docker Desktop. You
58:41 have this local AI Docker Compose stack that has all of the services running in
58:45 tandem, like Superbase, Redis, N8N, Flowwise, Caddy, and Neo4j. All of
58:50 these are running within this stack. That is what we're working towards right
58:54 now. And so make sure you have all these things installed. I've got links that'll
58:57 take you to installing no matter your operating system. Very easy to get all
59:01 of this up and running on your machine. Then we can move on to our first command
59:05 here, which is to clone this GitHub repository, bringing all of this code on
59:10 your machine so you can get everything running. And so you want to open up a
59:14 new terminal. So I've got a new PowerShell session open here. Going to
59:18 paste in this command. And I'm going to be doing this completely from scratch
59:22 with you. So you clone the repo and then I'm just going to change my directory
59:26 into local AI package, which was just created from this get clone command. So
59:31 those are the first two steps. The next thing is we have to configure all of our
59:36 environment variables. And believe it or not, this is actually the longest part
59:41 of the process. And once we have this taken care of, it's a breeze getting the
59:44 rest of this up and running. But there's a lot of configuration that we have to
59:49 set up for our different services like credentials for logging into our
59:54 Superbase dashboard or Neo4j. Things like our Superbase anonymous key and
59:59 private key. All these things we have to configure. And so within our terminal
60:04 here, you can do "code ." to open this within VS Code or Windsurf. I'll open this in
60:09 windsurf. You just want to open up this folder within your IDE and the specific
60:14 IDE that you use really doesn't matter. You just want to get to this .env.example
60:20 here. I'm going to copy it and then I'm going to paste it. And then I'm going to
60:24 rename this to .env. So we're taking the example and turning it into a .env file. So,
60:31 you want to make sure that you copy it and rename it like this. Then we can go
60:36 ahead and start setting all of our configuration. And I'll even zoom in on
60:39 this just so that it's very easy for you to see everything that we are setting up
60:44 here. So, first up, we have a couple of credentials for N8N. We have our
60:49 encryption key and our JWT secret. And it's very easy to generate these. In
60:53 fact, we'll be doing this a couple of times, but we'll use this open SSL
60:58 command to generate a random 32-character alphanumeric string that
61:02 we're going to use for things like our encryption key and JWT secret. And so,
61:08 OpenSSL is a command that is available for you by default on Linux and Macs.
61:12 You can just open up any terminal and run this command and it'll spit out a
61:16 long string that you can then just paste in for this value. For Windows, you
61:20 can't just open up any terminal and use OpenSSL, but you can use Git Bash, which
61:26 is going to come with GitHub Desktop when you install it. And so, I'll go
61:29 ahead and just search for that. If you just go to your search bar on your
61:32 bottom left on Windows and search for Git Bash, it's going to open up this
61:37 terminal like this. And so, I can go ahead and copy this command, go in here,
61:42 and paste it in. And then I can run it. And then, boom, there we go. This is I
61:45 know it's really small for you to see right now. I'm going to go ahead and
61:48 copy this because this is now the value that I can use for my encryption key.
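If you'd rather generate these values from Python (an option covered in a moment), a minimal equivalent looks something like this; the exact one-liner in the readme may differ:

```python
# Generate a random 32-character hex string suitable for an encryption key or JWT secret.
import secrets

print(secrets.token_hex(16))  # 16 random bytes -> 32 hex characters
```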
61:52 And then you want to do the exact same thing to generate a JWT secret. And then
61:57 the other way that you can do this if you don't want to install git bash or
62:01 it's not working for whatever reason, you can use Python to generate this as
62:05 well. So I can just copy this command and then I can go into the terminal here
62:10 and I can just paste this in. And so it's going to just like with OpenSSL
62:15 generate this random 32 character string that I can copy and then use for my JWT
62:20 secret. There we go. And so I am going to get in the weeds a little bit here
62:24 with each of these different parameters, but I really want to make sure that I'm
62:27 clear on how to set up everything for you so you can really walk through this
62:31 step by step with me. And like I said, setting up the environment variables is
62:35 the longest part by far for getting the local AI package set up. So if you bear
62:39 with me on this, you get through this configuration, you will have everything
62:43 running that you need for local AI for the LLMs and your infrastructure. So
62:47 that's everything for N8N. Now we have some secrets for Superbase. And there
62:53 are some instructions in the Superbase documentation for how to get some of
62:57 these values. So it's this link right here, which I have open up on my
63:01 browser. So we'll reference this in a little bit here. But first, we can
63:05 set up a couple of other things. The first thing we need to define is our
63:11 Postgress password. So, Superbase uses Postgress under the hood for the
63:14 database. And so, we want to set a password here that we'll use to connect
63:19 to Postgress within N8N or a connection string that we have for our Python code,
63:23 whatever that might be. And this value can be really anything that you want.
63:26 Just note that you have to be very careful about using special characters like
63:31 percent symbols. So if you ever have any issues with Postgress, it's probably
63:36 because you have special characters that are throwing it off. That's something
63:39 that I've seen happen quite a few times. And so like I said, I want to mention
63:42 troubleshooting steps and things to make sure that it is very clear for you. So
63:47 for this Postgress password here, I'm just going to say test Postgress pass.
63:51 I'm just going to give some kind of random value here. Just end with a
63:54 couple of numbers. I don't care that I'm exposing this information to you because
63:58 this is a local AI package. These passwords are for services that never
64:02 leave my computer. So, it's not like you could hack me by connecting to anything
64:08 here. And then we have a JWT secret. And this is where we get into this link
64:13 right here in the Superbase docs. And so they walk you through generating a JWT
64:18 secret and then using that to create both your anonymous and your service
64:22 role keys. If you're familiar with Superbase at all, we need both of these
64:27 pieces of information. The anonymous key is what we share to our front end. This
64:30 is our public key. And then the service role key has all permissions for
64:33 Superbase. We'll use this in our backends for things like our agents. And
64:39 so you can just go ahead and copy this JWT secret.
64:43 And then you can paste this in right here. This is 32 characters long just
64:47 like the things that we generated with OpenSSL. I'm just going to be using
64:52 exactly what Superbase tells me to. And then what you can do with this is you
64:56 can select the anonymous key. Click on generate JWT and then I can copy this
65:02 value and then I will paste this for my anonymous token. And so I'm just
65:06 replacing the default value there for the anonymous key. And then going back
65:10 and selecting the service key, I'm going to generate that one as well. So it
65:13 looks very similar. They'll always start with "ey", but these values are different
65:18 if you go towards the end. And so I'll go ahead and paste this for my service
65:22 role key. Boom. There we go. All right. And then for the Superbase dashboard
65:27 that we'll log into to see our tables and our SQL editor and authentication
65:31 and everything like that, we have our username here, which I'm just going to
65:35 keep as superbase. And then for the password, I can just say test superbase
65:39 pass. I'll just kind of use that as my common nomenclature here for my
65:42 passwords cuz I don't really care what that is right now. And then the last
65:45 thing that we have to set up is our pooler tenant ID. And it's not really
65:49 important to dive into what exactly this means. Just know that you can set this
65:52 to really anything that you want. I typically will just choose four digits
65:57 here, like 1000, for my pooler tenant ID. So that is everything that we need for
66:01 superbase. And actually most of the configuration is for superbase. Then we
66:06 have Neo4j. This is really simple. You can leave neo4j as the username, and
66:11 then I'll just say test Neo4j pass for my password here. So you just set the
66:15 password for the knowledge graph, and even if you're not using Neo4j, you still have
66:19 to set this, but it just takes two seconds. Then we have Langfuse. This is
66:23 for agent observability. We have a few secrets that we need here. And for these
66:28 values they can really just be whatever you want. It doesn't matter because
66:31 these are just passwords, just like we had passwords for things like Neo4j.
66:35 So I can just say test clickhouse pass, and then I can do test minio pass. And
66:43 it really doesn't matter here. A random Langfuse salt. I'm just doing
66:47 completely whack values here. You probably want something more secure in
66:51 this case, but I'm just doing something as a placeholder for now.
66:56 Yeah, there we go. Okay, good. And then the last thing that we need
66:59 for Langfuse is an encryption key. And this is also generated with OpenSSL like
67:04 we did for the N8N credentials. And so I'll go back to my git bash terminal.
67:08 And again, you can do this with Python as well. I'll just run the exact same
67:12 command. I'll get a different value this time. And so I'll go ahead and copy
67:16 that. You could technically use the same value over and over if you wanted to,
67:20 but obviously it's way more secure to use a different value for each of the
67:24 encryption keys that you generate with OpenSSL. So there we go. That is our
67:28 encryption key. And that is actually everything that we have to set up for
67:32 our environment variables when we are just running the local AI package on our
67:37 computer. Once we deploy it to the cloud and we actually want domains for our
67:41 different services like open web UI and N8N then we'll have to set up caddy. So
67:45 this is where we'll dive into domains and we'll get into this at the end of
67:49 the master class here. But everything past this point for environment
67:54 variables is completely optional. You can leave all of this exactly as it is
67:59 and everything will work. Most of this is just extra configuration for
68:03 superbase. So, Superbase is definitely the biggest service that's included in
68:08 this list of, you know, curated services for you. And so, there's a lot of
68:11 different configuration things you can play around with if you want to dive
68:15 more into this. You can definitely look at the same documentation page that we
68:19 were using for the Superbase Secrets. And so, you can scroll through this if
68:22 you want to learn more, like setting up email authentication or Google
68:27 authentication, and all of those different configuration things
68:32 for Superbase if you want to dive deeper. I'm not going to get into all
68:36 of this right now, because the core of getting Superbase up and running we
68:40 already have taken care of with the credentials that we set up at the top,
68:44 right here. These are just the base things, and so that's
68:47 what we'll stick to right now. So that is everything for our environment
68:51 variables. So then, going back to our readme, which I have open directly in
68:55 Windsurf now instead of my browser, we have finished our configuration, and I do
68:59 have a note here that you want to set things up for Caddy if you're deploying
69:03 to production. Obviously, we're doing that later, not right now, like I said. And
69:07 so with that we are good to start everything. Now before we spin up the
69:12 entire local AI package there is one thing that I want to cover. It's
69:14 important to cover this before we run things. If you don't want to run
69:19 everything in the package, cuz it is a lot, like maybe you only want to use half
69:22 of these services and you don't want Neo4j and Langfuse and Flowwise right
69:28 now, there are two options that you have. The easiest one right now is to go
69:33 into the docker compose file. This is the main file where all of the services
69:38 are curated together and you can just remove the services that you don't want
69:42 to include. So, for example, if you don't want Quadrant right now, cuz it is
69:46 actually one of the larger services, like 600 megabytes of RAM just
69:50 having this running, you can search for Quadrant, and you can just go ahead and
69:54 delete this service from the stack like that. Boom. Now I don't have Quadrant.
69:59 It won't spin up as a part of the stack anymore. And then also I have a volume
70:03 for Quadrant. So, you can remove that as well. Volumes, by the way, are how we are
70:08 able to persist data for these containers. So if we tear down
70:12 everything and then we spin it back up, we still are going to have our open web
70:17 UI conversations and our N8N workflows, everything in Superbase, like all that
70:21 is still going to be saved because we're storing it all in volumes. So we can do
70:25 whatever the heck we want with these containers. We can tear them down. We
70:28 can update them, which I'll show you how to do later. We can spin it back up. And
70:32 all of our data will always be persisted. So you don't have to worry
70:35 about losing information. And you can always back things up if you want to be
70:39 really secure, but I've never done that before and I've been updating this
70:42 package for months and months and months and all of my workflows from 6 months
70:46 ago are still there. I haven't lost anything. And so that's just a quick
70:50 caveat there for how you can remove services if you want. And then another
70:54 thing that we don't have available yet, but I'm very excited to kind
70:58 of talk about this right now. It's in beta right now. Me and
71:03 one other guy that's actually on my Dynamist team, Thomas, he's got a
71:07 YouTube channel as well. He's a great guy. We're working together on this.
71:09 He's actually been putting in most of the work creating a front-end
71:13 application for us to manage our local AI package. And one of the big things
71:17 with this is that we're going to make it possible for you to toggle on and off
71:22 the services that you want to have within your local AI package. So you can
71:27 very much customize the package to the services that you want to run. So you
71:31 can keep it lightweight just to the things you care about. Also, we'll be
71:34 able to manage environment variables and monitor the containers. Not all of this
71:38 is up and running at this point, but this is in beta. We're working on it.
71:41 I'm really excited for this. So, not available yet, but once this is
71:44 available, this will be a really good way for you to customize the package to
71:47 your needs. So, you don't have to go and edit the docker compose file directly.
71:51 So, that's something that I just wanted to get out of the way now. But, we can
71:57 start and actually execute our package now. Get all these containers up and
72:02 running. So the command that you run to start the local AI package is different
72:07 depending on your operating system and the hardware that you have. So for
72:13 example, if you are an Nvidia GPU user, you want to run this start_services.py
72:18 script. This boots up all of the containers and you want to specifically
72:23 pass in the profile of GPU NVIDIA. This is going to start Olama in a way where the
72:29 Olama container is able to leverage your GPU automatically. And then if you are
72:34 using an AMD GPU and you're on Linux, then you can run it this way. Which, by
72:38 the way, unfortunately, if you have an AMD GPU on Windows, you aren't able to
72:46 run Olama in a container. And it's the same thing with Mac computers.
72:49 Unfortunately, like you see right here, you cannot expose your GPU to the Docker
72:55 instance. And so if you are on an AMD GPU on Windows or running on a Mac, you cannot
73:01 run Olama in the local AI package. You just have to install it on your own
73:04 machine like I already showed you in this master class and then you'll just
73:08 run everything else through the local AI package and they can actually go out to
73:12 your machine and communicate to Olama directly. So just a small limitation for
73:17 Mac and AMD on Windows. But if you're running on Linux or an Nvidia GPU on
73:21 Windows like I'm using, then you can go ahead and run this command right here.
73:27 So if you can't run a GPU in the Olama container, then you can always just
73:32 start in CPU mode or you can run with a profile of none. This will actually make
73:36 it so that Olama never starts in the local AI package. So you can just
73:40 leverage the Olama that you have already running on your computer like I showed
73:43 you how to install already. So, just a couple of small caveats that I really
73:46 want to hit on there. I need to make sure that you're using the right
73:51 command. And so, in my case, I'm Nvidia on Windows. So, I'm going to copy this
73:55 command. Go back over into my terminal. I'll just clear it here. So, we have a
73:59 blank slate. And I'll paste in this command. And so, it's going to do quite
74:02 a few things initially. First, it's going to clone the Superbase repository
74:07 because Superbase actually manages the stack in a separate place. And so, we
74:10 have to pull that in. Then there's some configuration for CRXNG for our local
74:17 and private web search. And then I have a couple of warnings here saying that
74:20 the Flowwise username and password are not set. By the way, if
74:24 you want to set the Flowwise username and password, it's optional, but you can
74:29 do that if I scroll down right here. So you can set these values, those will
74:32 actually make those warnings go away, but you can also ignore them, too. So
74:35 anyway, I just wanted to mention that really quickly. But now what's happening
74:39 here is it starts by running all of the Superbase containers. And so there's
74:44 quite a bit that goes into Superbase, like I said. So we're running all of
74:47 that. It's getting all that spun up. And then once we run all of these, it's
74:51 going to move on to deploying the rest of our stack. And if you're running this
74:55 for the very first time, it will take a while to download all of these images.
74:59 They're not super small. There's a lot of infrastructure that we're starting up
75:03 here. And so it'll take a bit. You just have to be patient. Maybe go grab your
75:06 coffee or make your next meal, whatever that is. And then everything will be up
75:09 and running once you are back. And so yeah, now you can see that we are
75:13 running the rest of the containers here. Um, and so we'll just wait for that to
75:16 be done. And then I'll show you what that looks like in Docker Desktop as
75:19 well. And so I'll give it a second here just to finish. Uh, looks like my
75:24 terminal glitched a little bit. Like I was scrolling and so it kind of broke it
75:27 a bit. But anyway, everything is up and running now. It'll look like this where
75:30 it'll say all of the containers are healthy or running or started. And then
75:34 if I go into Docker Desktop and I expand the local AI Compose stack, you want to
75:39 make sure that you have a green dot for everything except for the Olama pull and
75:45 N8N import. These just run once initially and then they go down because
75:49 they're responsible for pulling some things for our local AI package. And so
75:53 yeah, I've got green dots for everything except for two right here. Now I'm
75:57 leaving this in here intentionally actually because there is a bug with
76:02 Superbase specifically if you are on Windows. So you'll see this issue where
76:08 the Superbase pooler is constantly restarting and that also affects N8N
76:12 because N8N relies on the Superbase pooler. So it's constantly restarting as
76:17 well. If you see this problem, I actually talk about this in the
76:21 troubleshooting section of the readme. If you scroll all the way down, if the
76:24 Superbase pooler is restarting, you can check out this GitHub issue. And so I
76:29 linked to this right here, and he tells you exactly which file you want to
76:33 change. It's this one right here. So it's docker/volumes/pooler/pooler.exs.
76:39 And you need to change the file's line endings to LF. And so I'll show you what I mean
76:43 by that. I'll show you exactly how to do this. It's like a super tiny random
76:47 thing, but this has tripped up so many people. So I want to include this
76:51 explicitly in the master class here. So you want to go within the superbase
76:56 folder, within docker, then volumes, then pooler, and then we have
77:00 pooler.exs. And basically, no matter your IDE, you can see the CRLF in the bottom right here.
77:08 You want to click on this and then change it to LF, and then make sure that
77:13 you save this file. Very easy to fix that. And then what you can do is you
77:19 can run the exact same command to spin everything up again. And so I'm going to
77:22 do this now. It's going to go through all the same steps. It'll be faster this
77:25 time because you already have everything pulled. And this, by the way, is how you
77:28 can just restart everything really quickly if you want to enforce new
77:31 environment variables or anything like that. So I want to include that
77:35 explicitly um for that reason as well. And I'll go ahead and close out of this.
77:39 And while this is all restarting, the other thing that I want to show you
77:42 in the readme is I also have instructions for upgrading the containers in the local AI package. So
77:49 when N8N has an update or Superbase has an update, it is your responsibility
77:53 because you're managing the infrastructure to update things yourself. And so you very simply just
77:58 have to run these three commands to update everything. You want to tear down
78:03 all of the containers and make sure you specify your profile like GPU Nvidia and
78:09 then you want to pull all of the latest containers and again specifying your
78:13 profile. And then once you do those two things, you'll have the most up-to-date
78:17 versions of the containers downloaded. So you can go ahead and run the start
78:21 services with your profile just like we just did to restart things. Very easy to
78:25 update everything. And even though we are completely tearing down our
78:29 containers here before we upgrade them, we aren't losing any information because
78:33 we are persisting things in the volumes that we have set up at the top of our
78:37 Docker Compose stack. And so this is where we store all of our data in our
78:41 database and workflows. All these things are persisted. So we don't have
78:45 to worry about losing them. Very easy to upgrade things and you still get to keep
78:48 everything. You don't have to make backups and things like that unless you
78:54 just want to be ultra ultra safe. So now we can go back to our Docker desktop and
79:00 we've got green dots for everything now since we fixed that pooler.exs issue.
79:04 The only thing that we don't have green dots for is the N8N import and then we
79:08 have our Olama pull as well because like I said those are the two things that
79:11 just have to run at the beginning and then they aren't ongoing processes like
79:16 the rest of our services. So, we have everything up and running. And if there
79:22 is anything that is a white dot besides Olama pull or N8N import, or if there's
79:27 anything that is constantly restarting, just feel free to post a comment and
79:31 I'll definitely be sure to help you out. And then also check out the
79:34 troubleshooting section as well. One thing that I'll mention really quick is
79:37 sometimes your N8N will constantly restart and it'll say something like the
79:42 N8N encryption key doesn't match what you have in the config. And the big
79:46 thing to keep in mind for that is you want to make sure that you set this
79:51 value for the encryption key before you ever run it for the first time.
79:53 Otherwise, it's going to generate some random default value and then if you
79:56 change this later, it won't match with what it expects. And so, yeah, my big
79:59 recommendation is like make sure you have everything set up in your
80:03 environment variables before you ever run the start services for the first
80:08 time. This should be run once you have your environment variables set up.
80:11 Otherwise, you risk any of these services creating default values that
80:14 then wouldn't match with the keys and things that you set up later. And so
80:18 with that, we can now go into our browser and actually explore all of
80:22 these local AI services that we have running on our computer now. Now over in
80:26 our browser, we can start visiting the different services that we have spun up.
80:30 Like here is N8N. You just have to go to localhost port 5678. It'll have you
80:35 create a local account when you first visit it. And then you'll have this
80:38 workflow view that should look very familiar to you if you have used N8N
80:41 in the past. And then we have Open Web UI, localhost port 8080. This is our
80:47 ChatGPT-like interface where we can directly talk to all of the models that we have
80:52 pulled in our Olama container. Really, really neat. And then we have localhost
80:57 port 8000 for our Superbase dashboard. The sign-in definitely isn't pretty
81:00 compared to the managed version of Superbase. But once you enter in your
81:04 username and password that you have set for the environment variables for the
81:07 dashboard, then you have the very typical view where we have our tables
81:11 and we've got our SQL editor. Everything that you're familiar with with
81:14 Superbase. And that's the key thing with all these different services. They all
81:18 will look the exact same for you, pretty much. Like another one, for example:
81:25 if I go to localhost port 3000, we have Langfuse. This is for agent
81:28 observability and monitoring. And this is something I'm not going to dive into
81:31 in this master class. Like I said, I'm not covering all the services. But yeah,
81:34 I just want to show that like every single one of these pretty much you can
81:39 access in your browser. And by the way, the way that we know the specific port
81:43 to access for each of these services is by taking a look at either what it tells
81:43 us in Docker Desktop. So, like, we can see that Neo4j is, let's see, port
81:53 7474. For CRXNG, it's port 8081. For Flowwise, it's port 3001. What's one
82:01 that we've seen already? Yeah, like Open Web UI is port 8080. So,
82:06 the port on the left is the one that we access in our browser. And then the port
82:11 on the right is what's mapped on the container. So, when we visit port 8080
82:16 on our computer, that goes into port 8080 on the container. And that's what
82:21 we have exposed. The other way that you can see the port that you need to use is
82:24 just by taking a look at this docker compose file. And you don't need to have
82:29 like a super good understanding of this docker compose file. But if you want to
82:33 customize your stack or even help me by making contributions to local AI
82:36 package, this is the main place to make changes. And so for example, I can go
82:41 down to flowwise and I can see that the port is 3001. Or if I go down to let's say N8N, we can
82:49 see that the port is 5678. And so the port is always going to be
82:52 there somewhere in the service that you have set up. Like for the Langfuse
82:56 worker, it's 3030. That's more of a behind-the-scenes kind of service. But
82:59 let me just find one more example for you here. Yeah, like Redis, for
83:04 example, is 6379. So you can see the ports in the Docker Compose as well. I
83:07 just want to call it out just to at least get you a little bit comfortable
83:11 and familiar with the Docker Compose file in case you want to customize
83:14 things. But the main thing is just leveraging what you see here in Docker
83:18 Desktop. Last thing in Docker Desktop really quickly, if you want to bring
83:21 more local large language models into the mix, you can do it without having to
83:25 restart anything. You just have to find the Olama container in the Docker
83:29 Compose stack. Head on over to the exec tab. And now here we can run any
83:33 commands that we'd want. We're directly within the container here. And we can
83:36 use Olama commands just like we did earlier on our host machine. And so for
83:40 example, I ran Olama list already. So I can see the large language models that
83:43 have already been pulled in my Olama container. If I want to pull more, I can
83:48 just do Olama pull and then find that ID for the model I want to use on the Olama
83:52 website. And like I said, you don't have to restart anything. If I pull it here,
83:56 it's now in the container and I can immediately start using it in Open Web
84:00 UI or N8N. We'll see that in a little bit. And so that's just really important
84:03 because a lot of times you're going to want to start to use different large
84:06 language models and you don't want to have to restart anything. The ones that
84:11 are brought into the machine by default are determined by this line right
84:15 here. So if you want to change the ones that are pulled by default: I just have
84:20 Quen 2.5 7B Instruct, a really small, lightweight one, that I have
84:23 brought into your Olama container by default. If you want to add in
84:27 different ones, you can just update this line right here to include multiple
84:32 Olama pulls. And so that way you can bring in Quen 3 or Mistral 3.1 small,
84:36 whatever you want. This is just the one I have by default. And then all the
84:40 other ones that you saw in my list here, I've pulled myself. All right. Now that
84:45 we have the local AI package up and running, it is time to build some
84:50 agents. Now, we get to use our local AI package to actually build out an
84:53 application. And so, I'm going to start by introducing you to Open Web UI, and
84:58 we'll use it to talk to our Olama LLM. So, we have an application kind of right
85:02 out of the box for us. Then I'll dive into building a local AI agent with N8N,
85:08 even connecting it to Open Web UI. So we have this custom agent that we built in
85:12 N8N and then we immediately have a really nice UI to chat with it. And then
85:16 we'll transition to Python building the exact same agent in Python as well. Like
85:21 I said, I want to focus on both no code and code to really make this a complete
85:24 master class so that whether you want to build with N8N or Python, you can see
85:28 how to connect to our different services that we have running locally like
85:32 Superbase and CRXNG and Open Web UI. So, we'll cover all of that and then I'll
85:36 get into deployments after this. But yeah, let's go ahead right now focus on
85:40 open web UI and building out some agents. So, back over in Open Web UI,
85:45 remember this is localhost port 8080. You want to set up your connection to
85:49 Olama so we can start talking with our local LLMs in this nice interface. And
85:54 so bottom left, go to the admin panel, then go to settings and then the
85:58 connections tab. Here we can set up our connections both to OpenAI with our API
86:02 key, which we're not going to do right now, but then also the Olama API. This
86:07 is what we want to set up. Now, usually by default, this value is just
86:12 localhost. And this is actually wrong. This is something that is so important
86:16 to understand. And this will apply when we set up credentials in N8N and Python
86:20 as well. When you are within a container, localhost means that you are
86:26 referencing still within the container. Open web UI needs to reach out to the
86:32 Olama container, not itself. So localhost is not correct here. This is
86:36 generally the default just because open web UI assumes that you're running on
86:39 your machine, and so then you would also have Olama running on your machine. So
86:42 local host usually works when you're outside of containers. But here we have
86:47 to change this. This is super important to get right. And so there are two
86:51 options we have. If you are running on a Mac or AMD on Windows and you want to
86:56 use Olama running on your machine, not within a container, then you want to use
87:00 host.docker.internal. This is the way in Docker to tell the container to look outside to the host
87:07 machine, where you're running the containers and where Olama is running
87:11 separately. Very important to know that. And then if you are running Olama in the
87:16 container like I am doing (I have Olama running in my Docker Desktop), you want
87:21 to change this to Olama. You're specifically calling out the service
87:26 that is running the Olama container in your Docker Compose stack. And the way
87:30 that we know that this is the name specifically is because we just go back
87:35 to our all-important Docker Compose file. Olama. So whenever there's an "x-" prefix
87:40 (an x and a dash), you just ignore that; it's just the thing after it. So, olama is the name
87:45 of our service running the container. And then if we wanted to connect to
87:49 something else like Flowwise, flowwise is the name of the service. Open Web UI,
87:55 it's open-web-ui. All of these top-level keys, these are the names we use when we
87:59 want our containers to be talking to each other. And all of this is possible
88:03 because they are within the same Docker network. And so I'll just show you that
88:06 so you know what I'm talking about here. If I go back to Docker Desktop, we have
88:11 this local AI Compose stack. All of these containers can now communicate
88:14 internally with each other by referencing the names, like Redis or
88:19 CRXNG. So, we'll be seeing that a lot when we're building out our agents as
88:22 well. So, I wanted to spend a couple minutes to focus on that. And so, you
88:25 can go ahead and click on save in the very bottom right. I know my face is
88:28 covering this right now, but you have a save button here. Make sure you actually
88:32 do that. And for this API key, I don't know why it's asking me to fill it
88:34 out. I don't really care about connecting to OpenAI. So I'll just put
88:37 some random value there and click save. And then boom, there we go. We are good.
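This localhost-versus-service-name distinction comes up every time code talks to Olama from inside a container, so here is a small reference sketch (the URLs use the default port discussed here, and the service name should match what's in your Docker Compose file):

```python
from openai import OpenAI

# Pick the base URL that matches where your code runs relative to Ollama.
BASE_URLS = {
    "on_host": "http://localhost:11434/v1",                       # script running directly on your machine
    "container_to_host": "http://host.docker.internal:11434/v1",  # container -> Ollama installed on the host
    "container_to_container": "http://ollama:11434/v1",           # container -> Ollama service in the same stack
}

client = OpenAI(base_url=BASE_URLS["container_to_container"], api_key="ollama")
```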
88:40 And then a lot of times with open web UI, it also helps to refresh otherwise
88:43 it doesn't load the models for some reason. So I just did a refresh of the
88:49 site here, Ctrl+F5. And now we can select all of the local LLMs that we
88:53 have pulled in our Olama container. And so, for example, I can do Quen 2.5 7B.
88:57 That's the one that I just have by default. I can say hello. And it's going
89:02 to take a little bit cuz it has to load this model onto my GPU just like we saw
89:06 with Quen 3 earlier. But then in a second here we'll get a response. And
89:10 there are actually multiple calls that are being done here. We have one to get
89:15 our response, one to get a title for our conversation on the left-hand side. And
89:19 then also if you click on the three dots here, you can see that it created a
89:23 couple of tags for this conversation. So couple of things that are fired off all
89:26 at once there. And I can test conversation history. What did I just
89:30 say? So yeah, I mean everything's working really well here. We have chat
89:34 history, conversation history on the lefth hand side. There's so much that we
89:37 get out of the box. And so I wanted to show you this really quickly. Now we can
89:41 move on to building an agent in N8N. And I'll even show you how to connect it to
89:46 Open Web UI as well through this N8N agent connector. Really exciting stuff.
89:49 So let's get right into it. So I'm going to start really simple here by building
89:53 a basic agent. The main thing that I want to focus on is just connecting to
89:57 our different local AI services. So I am going to assume that you have a basic
90:01 knowledge of N8N here because this is not an N8N master class. And so I'm
90:04 starting with a chat trigger so we can talk to our agent directly in the UI.
90:08 We'll connect this to open web UI in a bit as well. And then I want to connect
90:14 an AI agent node. And so what we want to do is connect Olama for the chat model and
90:18 then local Superbase for our conversation history, our agent memory.
90:22 And so for the chat model, I'm going to do the Olama chat model. I'm going to create
90:26 brand new credentials. You can see me do this from scratch. The URL that you want
90:32 for the base URL is exactly the same as what we just entered into open web UI.
90:35 And so if you are running Olama on your host machine, like on AMD on Windows, or
90:40 you are running on a Mac, or you just don't want to run the Olama container,
90:46 then it is host.docker.internal. And then if you are referencing the
90:49 Olama container, we just reference Olama. That's the name of the service
90:53 running the Olama container in our stack. And then the port is 11434 by
90:57 default. And you can test this connection. So it'll do a quick ping to
91:01 the container to make sure that we are good to go. And I'll even show you what
91:04 that looks like. So right here in my Olama container, I have the logs up. And
91:09 the last two requests were just a simple get request to the root endpoint. We
91:13 have two of those right here. And if I click on retry and I go back to the
91:18 logs, boom, we are at three now. So it made three requests. So it's just making
91:22 that simple ping each time to make sure the container is available. And so I'm
91:26 going to go ahead and click on save and then close out. So now we have our
91:29 credentials and then we can automatically select the model that we
91:33 have loaded now in our container. And so just to keep things really lightweight,
91:36 I'm going to go with the 7 billion parameter model right now from Quen 2.5.
91:40 Cool. All right. So that is everything that we need to connect Olama. It is
91:44 that easy. And then we could even test it right now. So, I'm going to go ahead
91:47 and save this workflow. And I'm going to just say hello. And uh we don't need the
91:52 conversation history or tools or anything at this point. We're already
91:55 getting a response here from the LLM. It's working on loading the model into
91:59 my GPU as we speak. And so there we go. We got our answer looking really good.
92:04 Cool. So now we can add memory as well. So I'm going to add Postgress because
92:08 remember Superbase uses Postgress under the hood. And then I'm going to create
92:12 brand new credentials here. And this is actually probably the hardest one to set
92:16 up out of all of the credentials for connecting to our local AI service. And
92:20 so I'm going to show you what the Docker Compose file looks like just that it's
92:24 clear how I'm getting these different values. And so I'll point out all of
92:28 them. So the first one, for our host, it is db, because this is the name of the
92:35 specific Superbase service that is the underlying Postgress
92:38 database. And I can show you how I got that really quick. If you go to the
92:42 superbase folder that we pull when we run that start services script, I go to
92:47 docker and then docker compose. If I search for db and there's quite a few
92:53 dependencies on db here. So let me find the actual reference to it. Where is db?
92:58 Here we go. So yeah, it's really short. Uh db is the name of our service that
93:03 actually is the Superbase DB. So this is the container name, which is what
93:07 you'll see in Docker Desktop. But then this is the underlying service that we
93:10 want to reference when we have our containers communicating with each
93:14 other. Like in this case we have our N8N container talking to our superbase
93:18 database container. And then the database and username are both going to
93:22 be Postgress. Those are the values that we have by default. If you scroll down a
93:26 bit in the .env, you can see these right here. The Postgress database is
93:29 postgres and the user is also postgres. And you can customize these
93:33 things, but these are some of the optional parameters that I didn't touch
93:36 in the setup with you. And so you can just leave those as is. Now the
93:41 Postgress password, this is one of them that we set. That was the first
93:44 Superbase value that we set there. Make sure that matches what you have in
93:48 the .env. And then everything else you can kind of leave as the defaults here. The port is
93:53 going to be 5432. So that is everything for setting up our connection to
93:57 Postgress. You can test this connection as well. And then we can move on to
94:01 adding in some tools and things like that as well. But yeah, this is like the
94:06 very first basic version of the agent that I wanted to show you. And hopefully
94:10 with this you can see how, no matter the service that you have running in the
94:13 local AI package, it's very easy to figure out how to connect to it, both
94:17 with the help of N8N because N8N always makes it really easy to connect to
94:20 things. Then also just knowing that like you just have to reference that service
94:25 name that we have for the container in the Docker Compose stack. That's how we
94:28 can talk to it. So you could add in Quadrant or you could add in Langfuse.
94:31 Like you can connect anything that you want into our agent here. And so now we
94:36 have conversation history. Next up, I want to show you how to build a bit more
94:40 of a complicated agent with N8N using some tools. And then also I'm going to
94:44 show you how to connect it to Open Web UI. And so right now this is a live
94:47 demo. Instead of connecting to one of the Olama LLMs, I'm going straight to
94:53 N8N. I have this custom N8N agent connector. And so we are talking to this
94:57 agent that I'll show you how to build in a little bit. This one has a tool to use
95:02 CRXNG for local and private web search. This is one of the platforms that we
95:06 have included in the local AI package. And so this response is going to take a
95:10 little bit here because it has to search the web. And the response that it
95:13 generates with this question is pretty long. Like there we go. Okay. So we got
95:16 the answer. It's pretty long. But yeah, we are able to search the internet now
95:21 with a local agent. N8N connected to open web UI. We're getting pretty fancy
95:25 here. And we also have the title that was generated on the left. And then we
95:30 have the tags here as well. And so the way that this all works, I'm going to
95:33 start by explaining how we can connect N8N to Open Web UI. And this is just
95:37 crucial. Makes it so easy for us to test agents locally as we are developing
95:42 them. And so if you go to the settings and the admin panel in the bottom left
95:47 and go to functions, open web UI has this thing called functions which gives
95:51 us the ability to add in custom functionality kind of as like custom
95:56 models that we can then use like you saw with the N8N agent connector. And so
96:02 what I have here is this thing that I call the N8N pipe. And I'll have a link
96:05 to this in the description as well. I created this myself and I uploaded it to
96:10 the open web UI directory of functions. And so you can go to this link right
96:14 here. You can even just Google the N8N pipe for open web UI. And then you click
96:19 on this get button. It'll just have you enter in the URL for your open web UI.
96:23 So I can just like paste in this right here. Click on import to open web UI and
96:28 it'll automatically redirect you to your open web UI instance. So you'll have
96:33 this function now. And we don't have to dive into the code for how all
96:36 of this works. I worked pretty hard to create this for you, actually quite a
96:40 while ago I made this. And the thing that we need to care about is
96:44 configuring this to talk to our N8N agent. And so if you click on the
96:49 valves, the setting icon in the top right, there are a few values that we
96:54 have to set. And so now I'm going to go over to showing you how to build things
96:57 in N8N. Then all of this will click and it'll make sense. Right now, looking at
97:00 these values, you're probably like, how the heck do I get all of these? But
97:02 don't worry, we'll dive into all of that. But first, let's go into our N8N
97:07 agent. I'll explain how all of this works. So, first of all, we have our
97:12 chat trigger that gives us the ability to communicate with our agent very
97:16 easily in the workflow. We have a new trigger now for the web hook. And so,
97:22 this is turning our agent into an API endpoint. So, we're able to talk to it
97:27 with other services like open web UI. And so to configure the web hook here,
97:31 you want to make sure that it is a post request type. And then you can define a
97:35 custom path here. Whatever you set here is going to determine what our URL is.
97:40 So we have our test URL. And then also if you toggle the workflow to active,
97:44 this is really important. The workflow in N8N does have to be active. Then you
97:49 have access to this production URL. And this is actually the first value that we
97:54 need to set within the valves for this open web UI function. We have our N8N
97:59 URL. And because this is a container talking to another container, we don't
98:03 actually want to use this localhost value that it has here for us. We want
98:08 to specify N8N because N8N again is the name of the service running the N8N
98:13 container in our Docker Compose stack. So N8N port 5678. And then this is the
98:18 custom URL that we can determine based on this. And then the other thing that
98:23 we want to do is set up header authentication. We don't want to expose
98:27 this endpoint without any kind of security. And so we want to set up some
98:31 authentication. And so you can select Header Auth from the authentication
98:34 dropdown. And then for the credentials here, I'll just create brand new ones to
98:38 show you what this looks like. The name needs to be Authorization with a capital
98:43 A. This has to be very specific. The name in the top left, the name of
98:46 your credentials, can be whatever you want, but this field has to be
98:51 Authorization. And then for the value, the way that we want to format this is
98:55 it's going to be Bearer, then a space, and then whatever you want
99:00 your bearer token to be. So the token is what you get to define, but the value needs to start
99:05 with Bearer, capital B, and a space. And then whatever you type after "Bearer "
99:09 goes in as the N8N bearer token in the valves. So you don't include "Bearer "
99:13 there because it's just assumed that the token is going to be
99:16 prefixed with that. So you just type in something like "test auth", which is what I
99:21 have. So my credential value is "Bearer test auth", and "test auth" is what I
99:24 enter in for the valve field. Now I already have mine set up. So I'm just going to
99:27 go ahead and close out of this. And then there's one last thing that we have to set up
99:30 for the webhook. And don't worry, this is the node that we spend the most time
99:33 with. You want to go to the dropdown here and change this to respond using
99:38 the Respond to Webhook node. This is very important because then at the end of our
99:40 workflow, when we get the response from our agent, we're going to send that back
99:45 to whatever requested our API, which is going to be Open Web UI in this case.
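If it helps to see the whole handshake in one place, here is a minimal Python sketch of what Open Web UI (or any other client) is effectively doing when it calls this webhook. The exact path, the chatInput/sessionId field names, and the "test auth" token are just the example values from this walkthrough, so swap in whatever you configured:

```python
import requests

# Production webhook URL: use the service name "n8n" from inside the Docker
# network, or localhost:5678 if you're calling it from your host machine.
N8N_WEBHOOK_URL = "http://localhost:5678/webhook/invoke-n8n-agent"  # example path

payload = {
    "chatInput": "What is the current price of the 5090 GPU?",
    "sessionId": "demo-session-1",
}

# Header Auth credential from N8N: name "Authorization",
# value "Bearer <your token>" (the example token here is "test auth").
headers = {"Authorization": "Bearer test auth"}

response = requests.post(N8N_WEBHOOK_URL, json=payload, headers=headers, timeout=120)
response.raise_for_status()

# The Respond to Webhook node sends back whatever we put in the output field.
print(response.json())
```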
99:48 And so that's everything for our configuration for the web hook. Now the
99:53 next thing that we have to do is we have to determine is open web UI sending in a
99:58 request to get a response for our main agent or is it just looking to generate
100:03 that conversation title or the tags for our conversation? Because like we were
100:06 looking at earlier, I'm going to close out of this for now and go back to a
100:10 conversation, our last conversation here. We get our main response, but then
100:15 also there is a request to an LLM to create a very simple title for our
100:19 conversation and the tags that we can see in the top right. And so our N8N
100:24 workflow actually gets invoked three separate times for just the first
100:30 message in a new conversation. And so we need to determine, are we getting a main
100:34 response? Like should we go to our main agent or should we just go to a simple
100:39 LLM that I have set up here to help generate the tags or title? And so the
100:44 way that we can determine that is whenever Open Web UI is requesting
100:47 something like a title for a conversation, it always prefixes the
100:53 prompt with three pound symbols, a space, and then the word "Task". And so we
100:58 can key off of this. The prompt is just coming in
101:02 from our webhook here. If it does start with that prefix, then we're just
101:06 going to go to this simple LLM, which is just Qwen 2.5 14B
101:11 Instruct. We have no tools, no memory or anything like our main agent because
101:14 we're just very simply going to generate that title or the tags. And I can even
101:19 show you in the execution history what that looks like. So in this case, we
101:22 have our web hook that comes in. The chat input starts with the triple pound
101:28 and task. And so sure enough, we are deeming it to be a metadata request, as
101:32 I'm calling it. And so then it goes down to this LLM that is just
101:36 generating some text here. We just have this JSON response with the tags for the
101:41 conversation, technology, hardware, and gaming. So we're asking about the price
101:45 of the 5090 GPU. And then we do the exact same thing to also generate the
101:51 title GPU specs. And so exactly what we see here is the title of this last
101:55 conversation. So I hope that makes sense. And then if it doesn't start with
101:59 the triple pound and "Task", so it's actually a real user request, then we go to our
102:03 main agent. We don't want our main agent to have to handle those super simple
102:06 tasks. You can also just use a really tiny LLM for them. This would be the perfect
102:11 case to use a super tiny LLM, even something like DeepSeek R1 1.5B,
102:15 because it's just such a simple task. Otherwise, though, we are going to
102:20 go to our main agent. And so I'm not going to dive into like all these nodes
102:24 in a ton of detail, but basically we are expecting the chat input to contain
102:30 the prompt for our agent. And the way that we know to expect chat input
102:35 specifically is because going back to the settings for the function here with
102:38 the valves, we are saying right here chat input. So you want to make sure
102:43 that the value that you put in here for input matches exactly with what you are
102:48 expecting from our web hook. And so chat input is the one that I have by default.
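Because the chatInput field and the "### Task" prefix quietly decide which branch the workflow takes, here is a tiny Python sketch of that routing logic. The field name and the exact wording of Open Web UI's task prompts are illustrative; match them to what you actually configured:

```python
def route_request(body: dict) -> str:
    """Mirror of the workflow routing: metadata requests vs. real user messages."""
    chat_input = body.get("chatInput", "")

    # Open Web UI prefixes its title/tag generation prompts with "### Task".
    if chat_input.lstrip().startswith("### Task"):
        return "metadata-llm"   # small model, no tools, no memory
    return "main-agent"         # full agent with web search and history


# The three invocations you see for the first message of a new conversation:
print(route_request({"chatInput": "What is the current price of the 5090 GPU?"}))
print(route_request({"chatInput": "### Task: Generate a concise title ..."}))
print(route_request({"chatInput": "### Task: Generate 1-3 broad tags ..."}))
```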
102:51 So you can just copy me if you want. Then we go into our agent where we're
102:55 hooked into Ollama and we've got our local Supabase. I already showed you
102:58 how to connect up all this and that looks exactly the same. The only thing
103:01 that is different now is we have a single tool to search the web with
103:07 SearXNG. So it's a web search tool. I have a description here just telling it what
103:11 is going to get back from using this tool. And then for the workflow ID, this
103:16 is if I go to add a node here and I just go for workflow tools, call N8N
103:23 workflow tool. So this is basically taking an N8N workflow and using it as a
103:28 tool for our agent. So this is the node that we have right here. But then I'm
103:32 referencing the ID of this N8N workflow. So this ID because I'm going to just
103:37 call the subworkflow that I have defined below. And again, I don't want to dive
103:40 into all the details of N8N right now and how this all works, but the agent is
103:44 going to decide the query. What should I search the web with? It decides that and
103:49 then it invokes this subworkflow here where we have our call to SearXNG. So the
103:54 name of the container service in our Docker Compose stack is just searxng
103:59 and it runs on port 8080. And then if you look at the SearXNG documentation, you
104:03 can look at how to invoke their API and things like this. So I'm just doing a
104:07 simple search here and then there are a few different nodes because what I want
104:10 to do is I want to split out the search and actually I can show you this by
104:14 going to an execution history where we're actually using this tool. So take
104:18 a look at this. So in this case the LLM decided to invoke this tool and the
104:23 query that it decided is current price of the 5090 GPU. So this is going along
104:28 with the conversation that we had last in Open Web UI. We get some results from
104:33 SearXNG, which is just going to be a bunch of different websites. And so, we don't
104:37 have the answer quite yet. We just have a bunch of resources that can help us
104:41 get there. And so, I'm going to split out. So, we have a bunch of different
104:45 websites. We're going to now limit to just one. I just want to pull one
104:48 website right now just to keep it really, really simple because now we're
104:52 going to actually visit that website. I'm going to make an HTTP request to
104:57 this website, which yeah, I mean, if it's literally an Nvidia official site
105:01 for the 5090, like this definitely has the information that we need. We're
105:04 going to make a request to it, and then we're also going to use this HTML node
105:08 to make sure that we are only selecting the body of the site. So, we take out
105:12 all the footers and headers and all that junk. So, we just have the key
105:15 information. And then that is what we aggregate and then return back to our AI
105:19 agent. So it now has the content, the core content of this website to get us
105:24 that answer. That is how we invoke our web search tool. And then at the very
105:29 end, we're just going to set this output field. And that's going to be the
105:33 response that we got back either from like generating a title or calling our
105:37 main agent. And this is really important: the output field specifically,
105:41 whatever we call it here, we have to make sure that it corresponds to this
105:46 value as the last thing we have to set in the settings for our Open Web UI
105:50 function. So "output" here has to match with "output" here, because that is what
105:55 we're going to return in this respond to web hook. Whatever open web UI gets back
105:59 it's getting back from what we return right here. So that is everything for
106:03 our agent. I could probably dive in quite a bit more into explaining how
106:06 this all works and building out a lot more complex agents, which I definitely
106:10 do with local AI in the Dynamis AI agent mastery course. So check that out if you
106:13 are interested. I just wanted to give you a simple example here showing how we
106:17 can talk to our different services like Ollama, Supabase, and SearXNG. And then
106:22 also open web UI as well. So once you have all these settings set, make sure
106:26 of course that you click on save. It's very, very important. These two things
106:29 at the bottom don't really matter, by the way. But yeah, click on save once
106:32 you have all of the settings there. And then you can go ahead and have a
106:36 conversation with your agent just like I did when I was demoing things before we
106:40 dove into the workflow. And by the way, this N8N agent that works with Open Web
106:44 UI, I have as a template for you. You can go ahead and download that in this
106:48 GitHub repository where I'm storing all the agents for this master class. So we
106:52 have the JSON for it right here. You can go ahead and download this file. Go into
106:57 your N8N instance. Click on the three dots in the top right once you've
107:01 created a new workflow. Import from file and then you can bring in that JSON
107:03 workflow. You'll just have to set up all your own credentials for things like
107:07 Ollama and Supabase and SearXNG. But then you'll be good to go and you can just go
107:11 through the same process that I did setting up the function in open web UI
107:15 and within like 15 minutes you'll have everything up and running to talk
107:20 to N8N in open web UI. Next up I want to create now the Python version of our
107:25 local AI agent. And so this is going to be a one-to-one translation. Exactly what
107:30 we built here in N8N, we are now going to do in Python. So I can show you how to
107:34 work with both no code and code with our local AI package. And so this GitHub
107:39 repo that has the N8N workflow we were just looking at and that OpenAI
107:43 compatible demo we saw earlier, this has pretty much everything for the agent. So
107:46 most of this repository is for this agent that we're about to dive into now
107:50 with Python. And in this readme here, I have very detailed instructions for
107:54 setting up everything. And a lot of what we do with the Python agent, especially
107:57 when we are configuring our environment variables, it's going to look very
108:01 similar to a lot of those values that we set in N8N. Like we have our base URL
108:05 here, which you'd want to set to something, you know, like
108:09 http://ollama:11434. We just need to add this /v1 on the end, which I guess is a little bit different,
108:14 but yeah, I've got instructions here for setting up all of our environment
108:18 variables, our API key, which you can actually use OpenAI or Open Router as
108:22 well with this agent, taking advantage of the OpenAI API compatibility. This is
108:27 a live example of this because you can change the base URL, API key, and the
108:31 LLM choice to something from Open Router or OpenAI, and then you're good to go
108:35 immediately. It's really, really easy. We will be using Olama in this case, of
108:39 course, though. And then you want to set your Supabase URL and service key.
108:42 You can get those from your environment variables. Same thing with SearXNG and
108:47 its base URL. We'll set that just like we did in N8N. We have our bearer token,
108:51 which in our case was "test auth". It's just whatever comes after the Bearer and the
108:55 space. And then the OpenAI API key you can ignore. That's just for the
108:58 compatible demo that we saw earlier. This is everything that we need for our
109:02 main agent now. And so we're using a Python library called FastAPI to turn
109:09 our AI agent into an API endpoint just like we did in N8N. And so FastAPI is
109:13 kind of what gives us this webhook, both the entry point and the exit for
109:17 our agent, and then everything in between is going to be the logic where we are
109:21 using our agent. And I'm going to be using Pydantic AI. It's my favorite AI
109:25 agent framework with Python right now. Makes it really easy to set up agents
109:29 and we'll dive into that here. And I don't want to get into the
109:32 nitty-gritty of the Python code here because this isn't a master class on
109:36 specifically building agents. I really just want to show you how we can be
109:40 connecting to our local AI services. This agent is 100% offline. Like I could
109:46 cut the internet to my machine and still use everything here. So we create our
109:50 Supabase client and the instance of our FastAPI app. I have some models
109:55 here that define the requests coming in and the responses going out. So we have
109:59 the chat input and the session ID just like we saw in N8N. And then the output
110:04 is going to be this output field. And so that corresponds to exactly what we're
110:07 expecting with those settings that we set up in the function in open web UI.
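As a rough sketch of what those models can look like with Pydantic (the exact field names in the repo may differ slightly, but the shape is what the Open Web UI function expects):

```python
from pydantic import BaseModel


class ChatRequest(BaseModel):
    """What the Open Web UI pipe function sends to the agent endpoint."""
    chatInput: str
    sessionId: str


class ChatResponse(BaseModel):
    """What we send back; must match the output field set in the valves."""
    output: str
```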
110:11 So this Python agent is also going to work directly with open web UI. And then
110:16 we have some dependencies for our Pydantic AI agent because it needs to have an
110:21 HTTP client and the SearXNG base URL to make those requests for the web search
110:25 tool. And then we're setting up our model here. It's an OpenAI model, but we
110:29 can override the base URL and API key to communicate with Ollama or Open Router as
110:34 well, like we will be doing. And then we create our Pydantic AI agent, just giving it
110:38 that model based on our environment variables. I've got a very simple system
110:43 prompt and then the dependencies here because we need that HTTP client to talk
110:48 to SearXNG. And then I'm just allowing it to retry twice. So if there's any kind
110:51 of error that comes up, the agent can retry automatically, which is one of the
110:55 really awesome things that we have in Pydantic AI. And then I'm also creating a
111:01 second agent here. This is the agent that is going to be responsible, like we
111:05 have in N8N, for handling the metadata for Open Web UI, like conversation titles
111:10 and tags for our conversation. And so it's an entirely separate agent because
111:14 we just have another system prompt. In this case, I'm just doing something
111:18 really simple here. We don't have any dependencies for this agent because it's
111:21 not going to be using the web search tool. And then for the model, I'm just
111:25 using the exact same model that we have for our primary agent. But like I shared
111:30 with N8N, you could make it so that this is like a much smaller model, like a one
111:33 or three billion parameter model because the task is just so basic or maybe like
111:38 a 7 billion parameter model. So you can tweak that if you want. Just for
111:41 simplicity's sake, I'm using the same LLM for both of these agents.
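To make the model override and the two-agent split concrete, here is a rough sketch, assuming a pydantic-ai version where OpenAIModel accepts base_url and api_key directly (newer releases route this through a provider object), and assuming a Qwen model tag you have actually pulled into Ollama:

```python
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel

# Ollama exposes an OpenAI-compatible API at /v1, so we point the "OpenAI"
# model at it. Use http://ollama:11434/v1 when running inside the stack.
model = OpenAIModel(
    "qwen2.5:14b-instruct",               # whatever model tag you pulled into Ollama
    base_url="http://localhost:11434/v1",
    api_key="ollama",                     # Ollama ignores the key, but one must be set
)

primary_agent = Agent(
    model,
    system_prompt="You are a helpful assistant with a web search tool.",
    retries=2,  # let the agent retry automatically if a tool call errors out
)

# Separate agent for Open Web UI metadata (titles/tags); it could use a much
# smaller model, but the same one is reused here for simplicity.
metadata_agent = Agent(
    model,
    system_prompt="Generate the requested conversation title or tags as JSON.",
)
```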
111:46 And then we get to our web search tool. So in Pydantic AI, the way that you give a
111:51 tool to your agent is you do @, then the name of your agent, and then
111:55 .tool, and then the function that you define below this decorator is going to be
112:00 given as a tool to the agent. And then this description that we have in the doc
112:04 string here that is given as a part of the prompt to your agent. So it knows
112:09 when and how to use this tool. And so the exact, you know, details of how
112:13 we're using SearXNG here, I won't dive into, but it is the exact same as what
112:17 we did in N8N where we make that request to the search endpoint of SearXNG. We
112:22 go through the page results here. We limit to just the top three results or I
112:26 could even change this to make it even simpler and just the top result. So we
112:30 have the smallest prompt possible to the LLM. And then we get the content of
112:34 that page specifically. And then we return that to our AI agent with some
112:40 JSON here. So now once it invokes this tool, it has a full page back with
112:44 information to help answer the question from the user. It has that web search
112:47 complete now. And then we have some security here to make sure that the
112:51 bearer token matches what we get into our API endpoint. So that's that header
112:55 authentication that we set up in N8N. So this part right here where we're
112:59 verifying the header authentication that corresponds to this verify token
113:03 function. And then we have a function to fetch conversation history to store a
113:08 new message in conversation history. So both of these are just making requests
113:12 to our locally hosted Supabase using that Supabase client that we created
113:16 above. And then we have the definition for our actual API endpoint. And so in
113:23 N8N we were using invoke N8N agent for our path to our agent. So this was our
113:29 production URL. In this FastAPI endpoint, our endpoint is /invoke-
113:34 python-agent. And then we're specifically expecting the chat input
113:39 and session ID. So that is our chat request type right here. And
113:43 then, sorry, I highlighted the wrong thing. We have our response model here
113:46 that has the output field. So we're defining the exact types for the inputs
113:50 and the outputs for this API endpoint. And then we're also using this verify
113:55 token to protect our endpoint at the start. And then the key thing here, if
113:59 the chat input starts with that task, then we're going to call our metadata
114:02 agent. And so it's just going to spit out the title or the tags, whatever that
114:06 might be. Otherwise, we're going to fetch the conversation history, format
114:11 that for Pydantic AI, store the user's message so that we have that
114:15 conversation history stored, create our dependencies so that we can communicate
114:20 with CRXNG, and then we'll just do agent.run. We'll pass in the latest
114:24 message from the user, the past conversation history and the
114:27 dependencies that we created. So it can use those when it invokes the web search
114:31 tool, and then we just get the response back from the agent, and you can
114:34 print that out in the terminal as well, and then we'll just store it in
114:38 Supabase and then return the output field. So I'm going kind of fast here. There
114:42 are definitely a lot more videos on my channel where I break down in more
114:45 detail building out agents with Pydantic AI and turning them into API endpoints
114:49 and things like that. But yeah, I'm just going a little bit faster here.
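Since the web search tool is the piece that mirrors the SearXNG subworkflow from N8N, here is a simplified standalone sketch of how such a tool can look in Pydantic AI. The /search endpoint with format=json is SearXNG's real API; the dependency class, the stand-in model string, and the result handling are reduced versions of what's in the repo:

```python
from dataclasses import dataclass

import httpx
from pydantic_ai import Agent, RunContext


@dataclass
class Deps:
    http_client: httpx.AsyncClient
    searxng_base_url: str  # e.g. http://localhost:8081, or http://searxng:8080 in-stack


agent = Agent("openai:gpt-4o", deps_type=Deps)  # stand-in; really the Ollama-backed model


@agent.tool
async def web_search(ctx: RunContext[Deps], query: str) -> str:
    """Search the web for the given query and return the text of the top result."""
    # Ask SearXNG for JSON results (same endpoint the N8N subworkflow calls).
    resp = await ctx.deps.http_client.get(
        f"{ctx.deps.searxng_base_url}/search",
        params={"q": query, "format": "json"},
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])
    if not results:
        return "No results found."

    # Visit just the top result and hand its raw content back to the agent,
    # mirroring the "limit to one site, fetch it, keep the body" flow in N8N.
    page = await ctx.deps.http_client.get(results[0]["url"])
    return page.text[:8000]  # crude truncation to keep the prompt small
```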
114:52 And then the last thing is with any kind of exception that we encounter, we're
114:55 just going to return a response to the front end saying that there was an issue
114:59 and then specifying what that is. And then we are using Uvicorn to host our
115:06 API endpoint specifically on port 8055. So that is everything for our Python
115:11 agent exactly the same as what we set up in N8N. And now going to the readme
115:16 here, I'll open up the preview. The way that we can run this agent, we just have
115:20 to open up a terminal just like we did with the OpenAI compatible demo. I've
115:25 got instructions here for setting up the database table, which is using
115:30 the same table as the one in N8N, and N8N creates it automatically. So, if you
115:34 have already been using the N8N agent, you don't actually have to run this SQL
115:38 here. And then you want to obviously set up your environment variables like
115:42 we covered, open your virtual environment, and install the requirements
115:46 there. And then you can go ahead and run the command python main.py.
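For reference, here is a compressed sketch of what that FastAPI wrapper boils down to. The endpoint path, port, and bearer check follow what we just walked through, while the agent calls and Supabase history functions are reduced to placeholders:

```python
import os

import uvicorn
from fastapi import Depends, FastAPI, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
from pydantic import BaseModel

app = FastAPI()
security = HTTPBearer()


class ChatRequest(BaseModel):
    chatInput: str
    sessionId: str


class ChatResponse(BaseModel):
    output: str


def verify_token(creds: HTTPAuthorizationCredentials = Depends(security)) -> None:
    # Same idea as the N8N header auth: the part after "Bearer " must match.
    if creds.credentials != os.environ.get("BEARER_TOKEN", "test auth"):
        raise HTTPException(status_code=401, detail="Invalid bearer token")


@app.post("/invoke-python-agent", response_model=ChatResponse)
async def invoke_agent(request: ChatRequest, _: None = Depends(verify_token)) -> ChatResponse:
    try:
        if request.chatInput.lstrip().startswith("### Task"):
            answer = "..."  # placeholder: run the metadata agent for titles/tags
        else:
            answer = "..."  # placeholder: fetch history, run the main agent, store messages
        return ChatResponse(output=answer)
    except Exception as exc:
        # On any error, still return something the front end can display.
        return ChatResponse(output=f"Sorry, something went wrong: {exc}")


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8055)
```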
115:52 And so this will start the API endpoint. So it'll just be hanging here because
115:56 now it's waiting for requests to come in on port 8055. And so what I can do is
116:03 I can go back to open web UI. I can go to the admin panel functions. Go to the
116:08 settings. I can now change this URL. So everything else is the same. I have my
116:11 bearer token, the input field, and the output field the same as N8N. The only
116:16 thing I have to change now is my URL. And so I know this is an N8N pipe and I
116:20 have N8N in the name everywhere, but this does work with just any API
116:23 endpoint that we have created with this format here. And so I'm going to say for
116:27 my URL, it's actually going to be host.docker.internal because I have my API endpoint for
116:33 Python running outside on my host machine. So I need my open web UI
116:38 container to go outside to my host machine. And then specifically the port
116:43 is going to be 8055. And then for the endpoint here, I'm going to
116:46 delete this webhook path because it's now /invoke-python-agent. Take a look at that. All right.
116:52 Boom. So I'm going to go ahead and save this. And then I can go over to my chat.
116:57 And it says N8N agent connector here still. But this is actually talking to
117:00 my Python agent now. So I'll go ahead and start by asking it the exact same
117:05 question that I asked the N8N agent. And I do have this pipe set up to always say
117:08 that it's calling N8N, but this is indeed calling our Python API endpoint.
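By the way, if you ever want to sanity-check the Python endpoint without going through Open Web UI at all, you can hit it directly from your host machine. A quick test call, assuming the same example token and field names used here:

```python
import requests

resp = requests.post(
    "http://localhost:8055/invoke-python-agent",
    json={
        "chatInput": "What is the current price of the 5090 GPU?",
        "sessionId": "manual-test",
    },
    headers={"Authorization": "Bearer test auth"},
    timeout=300,  # a web search plus a local LLM can take a while
)
print(resp.status_code, resp.json())
```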
117:12 And we can see that now. So there we go. We got all the requests coming in, the
117:16 response from the agent, and then also the metadata for the title and the tags
117:20 for the conversation. Take a look at that. So we got our title here. We have
117:24 our tags, and then we have our answer. It's a starting price of $2,000, though
117:28 it's a lot more right now. The starting price is kind of
117:32 misleading, but like yeah, this is a good answer. And it did use SearXNG to do
117:36 that web search for us. This is really, really neat. Now, the last thing that we
117:40 want to do for our Python agent before we can work on deploying things to the
117:45 cloud, to a private server in the cloud, is we want to containerize it. Now, the
117:49 reason that we want to do this, and this is the Docker file that I have set up to
117:53 turn our Python agent into a container, just like our local AI services, is if
117:59 we have our agent containerized, then we can have it communicate within the
118:03 Docker network just like we have our different local AI services
118:07 communicating with each other. Because right now running directly with Python
118:12 to communicate with Ollama, for example, we need our URL to be localhost, not
118:18 ollama. Remember, you can only use the specific name of the container service
118:23 when you are within the docker compose stack. And so we'd have to actually say
118:28 localhost right now. But if we add the container for the agent into the stack
118:32 as well, then we can communicate directly within the private network.
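To make that concrete, here is a small sketch of how the same configuration flips between a host run and an in-stack run. The variable names are placeholders, but the hosts and ports match what we've been using (ollama and searxng are the service names in the package's Compose stack):

```python
import os

# Running the agent directly on your host machine (python main.py):
host_run = {
    "LLM_BASE_URL": "http://localhost:11434/v1",
    "SEARXNG_BASE_URL": "http://localhost:8081",
}

# Running the agent as a container inside the same Docker Compose stack:
in_stack_run = {
    "LLM_BASE_URL": "http://ollama:11434/v1",
    "SEARXNG_BASE_URL": "http://searxng:8080",
}

# Pick one set via an env var so the same code works in both places.
config = in_stack_run if os.environ.get("RUNNING_IN_DOCKER") else host_run
print(config)
```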
118:36 Like I can say ollama, and then for searxng I could use this URL instead. Right now
118:41 we have to actually use localhost port 8081. And so it's really nice for
118:46 security reasons and just to keep your deployment in a nice package, to have
118:51 the agents that you're running in the same network as your infrastructure. And
118:55 so that's what we're going to do right now. And so within the readme that I
118:59 have for instructions on setting up everything, I have the instructions that
119:02 we follow to run things with Python. I also have the instructions to run it
119:07 with Docker. And so all you want to do is run this single command. It's
119:11 actually very easy because I have this Docker file set up to turn our agent
119:15 into a container and I've got security and everything taken care of. We're
119:19 running it on port 8055 just like we did with Python. And then I have this very
119:24 simple Docker Compose file. It's just a single service that we're going to tack
119:28 on to all of the other services that we already have running for the local AI
119:32 package. And I'm calling this one the Python local AI agent. And so we're
119:36 using all of our environment variables from our ENV just like we did with the
119:40 local AI package. And then what I have at the top here is I am including the
119:45 docker compose file for the local AI package. So that just kind of solidifies
119:49 the connection there. Otherwise, you'll get this kind of weird error that says
119:52 there are orphaned containers when you run this, even though they aren't
119:55 actually. And so this is optional. You'll just get an orphan container
119:58 warning that you can ignore. But if you don't want to have that warning, you can
120:01 include this right here. You just have to make sure that this path corresponds
120:06 to the path to your docker compose in the local AI package. So in my case, I
120:10 just had to go up two directories and then go into the local AI package
120:13 folder. So yeah, this is optional, but I want to include this here just to
120:18 make things in tiptop shape for you. So yeah, this is the docker compose. And
120:21 then what we can do now is I'll go back over to my terminal and I will paste in
120:25 this command. And what this will do is it will start or restart my Python local
120:31 AI agent container. And make sure that you specify this here because if you
120:35 don't then it's going to try to rebuild the entire local AI package because we
120:38 have this include. So very important you want to just rebuild or build for the
120:43 first time this agent container. And so I'll go ahead and run this and it's
120:47 going to give me those Flowise warnings, since I don't have my username
120:49 and password set, but remember we can ignore those. But anyway, it's going to
120:52 build the Python local AI agent container here. And there's a couple of
120:56 steps that it has to do. It has to update some internal packages and then
121:00 also install all of the pip packages we have for our Python requirements for
121:04 things like FastAPI and Pydantic AI. So it'll take a little bit to complete,
121:07 usually just a minute or two. And so I'll go ahead and pause and come back
121:11 once this is done. And there we go. Your output should look something like this.
121:14 It goes through all the build steps and then it says that the container is
121:18 started at the bottom. And this is now in our local AI Docker Compose stack. So
121:23 going back over to Docker Desktop. It'll take a little bit to find it here
121:26 because there are so many services that we have here. But if we scroll down,
121:29 okay, there we go. At the bottom here, we have our Python local AI agent
121:34 waiting on port 8055 just like when we ran it directly with Python, but now it
121:38 is within a container that is directly within our stack. And so now, like I was
121:41 saying, this is super important, so I'm hitting on it again. Now when we set up
121:45 our environment variables for our container, we are going to be
121:49 referencing the service names of our different local AI services like searxng or
121:55 ollama instead of localhost. And so this whole localhost versus
121:58 host.docker.internal versus service name thing, that's the thing I see people get tripped up on
122:03 the most when they're configuring different things in a Docker environment
122:06 like the local AI package. That's why I'm spending a good amount of time
122:09 really hammering that in because I want you to get it right. And of course, if
122:12 you have any issues that come up with this, just let me know. I'd love to help
122:16 walk you through what exactly your configuration should look like. And so,
122:20 we have our agent now up and running in a container. And I'm not going to go and
122:23 demo this again right now because the next thing that we're going to move into
122:26 doing and then I'll give a final demo here is deploying everything to a
122:30 private server that we have in the cloud. All right. So, we have really
122:34 gotten through all of the hard stuff already. So, if you have made it this
122:38 far, congratulations. You really have what it takes to start building AI
122:43 agents with local AI now and the sky is the limit for what you can accomplish.
122:47 And so the last thing that I want to really focus on here in this master
122:50 class is taking everything that we've been building on our own computer with
122:54 our infrastructure and our agents and deploying it to a private machine in the
122:59 cloud because then we can have our entire infrastructure and agents running
123:02 24/7. We don't have to rely on having our own computer up all the time. It's
123:05 really nice to have it there because then we can also share it with other
123:10 people as well. So local AI is still considered local as long as it's running
123:14 on a machine in the cloud that you control. And so this is not just going
123:18 to OpenAI or Supabase and paying for their API. This is still us running
123:22 everything ourselves on a private server. That's what I'm going to show
123:25 you how to do right now. And this process that I cover with you is going
123:29 to work no matter the cloud provider that you end up choosing. And there are
123:32 some caveats to that that I'll explain in a little bit. But yeah, you can pick
123:36 from a lot of different options. So the cloud platform that we will be deploying
123:41 to today is Digital Ocean. I use Digital Ocean a lot. It's where I deploy most of
123:46 my AI agents. So I highly recommend it. And the best part about Digital Ocean is
123:51 they have both GPU machines if you need to have a lot of power for your local
123:56 LLMs and they have very affordable CPU instances. If you want to deploy
124:00 everything for the local AI package except Ollama, you can definitely go a
124:03 more hybrid route if you don't want to pay a lot because these GPU instances in
124:07 the cloud can be pretty expensive, like $1, $2, even $5 per hour. So, what you
124:12 can do with a hybrid setup is deploy everything in the local AI package. So,
124:16 at least you have all that locally and you're not paying for those
124:19 subscriptions. But then you could still use something like OpenAI, Open Router
124:23 or Anthropic for your LLMs. So, Digital Ocean gives us the ability to do both,
124:27 and we'll dive into that when we set things up. Another really good option
124:33 for GPU instances is TensorDock. TensorDock is not as nice looking to me as
124:37 Digital Ocean. I generally feel like I have a better experience with Digital
124:40 Ocean, but I have deployed the local AI package to TensorDock before on a 4090
124:46 GPU that they offer for 37 cents an hour. It's very affordable for GPU
124:49 instances. And so, this is a good platform as well. And then also if
124:55 you're okay with not running Ollama on a GPU instance, like you just want a very
124:59 affordable way to host everything in a local AI package except the LLMs, then
125:03 you can use Hostinger. Hostinger is another really, really good option. Super,
125:08 super affordable, like $7 a month for a KVM 2, which I'd recommend getting if you
125:14 want to deploy everything except Ollama, because the requirement for the local AI
125:17 package, aside from running the more resource-intensive local LLMs, is you have
125:23 to have 8 GB of RAM. So don't get a cloud machine that has 4 or 2 GB. You
125:27 want to have 8 GB of RAM then you'll be good to go. So you can literally do it
125:31 for $7 a month through Hostinger and it's going to be something like $28 a
125:35 month through Digital Ocean unless you want a GPU instance. So I just want to
125:39 spend a couple minutes talking about different platform options. The one
125:43 thing I will say is that the local AI package runs as a bunch of Docker
125:47 containers, right? And so what you have to avoid is using a platform like
125:54 RunPod. So RunPod is a platform for running local AI. The problem is when
125:59 you pay for a GPU instance, you don't actually get the underlying machine.
126:05 You're just sshing into a container. So, you're accessing a container. And I'll
126:09 just save you the pain right now. It is so hard and basically impossible to run
126:15 Docker containers within Docker containers. So, you really can't run the
126:20 local AI package on RunPod. There are other platforms as well like Lambda Labs
126:25 is another one that I've used before. not for the local AI package for other
126:28 things, but this also runs containers. Like, you're accessing a container, so you
126:34 can't do the local AI package. Vast.ai is another option, but with this also,
126:39 you're renting a GPU, but you're accessing a container. So, you again
126:43 can't run the local AI package. And so, based on the platform that you choose,
126:47 you have to make sure that you are accessing the underlying machine when
126:51 you rent a GPU instance like Digital Ocean is the one that I will be using.
126:56 You could use a GPU instance through Google Cloud or Azure or AWS if you want
127:00 to go more enterprise. Those all give you access to the underlying machine.
127:04 It's your own private server just like we have in Digital Ocean. So you can use
127:07 that to deploy the local AI package and the agent that we built with Python. So
127:11 that's what we're going to do right now. So once you are signed into Digital
127:14 Ocean and you have your profile and billing set up and you have a project
127:18 created or you can just use the default one, now we can go ahead and create our
127:22 private server in the cloud to host the local AI package and our agent. And so
127:25 you can click on create in the top right. And there are two options here.
127:29 If you want a CPU instance, so that hybrid approach where you're hosting
127:34 everything except for the LLMs, you can select a droplet. Otherwise, what we're
127:37 going to be doing right now, so I can demo the full thing, is we will
127:42 create a GPU droplet. Now, these are going to be more expensive. Like I said,
127:48 like running an H100 GPU is $3.40 an hour. It's pretty expensive. But like I
127:52 said at the start of this master class, I know so many businesses that are
127:56 willing to put tens of thousands of dollars per year into running their own
127:59 infrastructure and LLMs. And that biggest cost that contributes to that
128:03 being tens of thousands of dollars is having a GPU droplet that is running 24/7
128:09 in the cloud. So the hybrid approach I definitely recommend if you don't want
128:12 to pay more, you could go as low as $7 a month with Hostinger. So, there's a very
128:16 wide range of options for you depending on what you want to pay hosting the
128:21 package, LLMs, and your agents. And the other thing I will say is that if you
128:25 want to, you could just create this instance for a day and poke around with
128:29 things and then tear it down. So, you only have to pay, you know, like 20
128:31 bucks or something like that. So, there's a lot of different options for
128:35 flexibility here. And so, I'm going to pick the Toronto data center because
128:38 there's more options for GPUs available here and it's relatively close to me.
128:42 And then for the image, I'll select AI/ML ready and it's recommended because
128:47 you get Linux bundled with all the required GPU drivers, and it does run the
128:52 Ubuntu distribution of Linux. And so this process that I'm going to walk you
128:55 through for deploying local AI package to the cloud is going to work for any
129:00 Ubuntu instance that you have running on AWS or Hostinger or TensorDock. It's just
129:05 a very standard distribution of Linux. And then for the GPU, there are a couple
129:09 of different options that we have here with Digital Ocean. H100 is an absolute
129:15 beast. 80 GB of VRAM, so it could easily run Q4 large language models over 100
129:21 billion parameters, and it even has 240 GB of RAM. So, I'm not going to run this one. I'm
129:24 just kind of pointing out that this is an absolute beast. I think the one that
129:27 I'm going to choose here is going to be the RTX 6000 Ada. So, it's 48 GB of
129:34 VRAM. So, this is enough to run 70-billion-parameter or smaller LLMs at
129:40 a Q4 quantization and it comes with 64 GB of RAM and it's going to be about
129:45 $1.90 per hour. So, I'm going to select this and then I have an SSH key that is
129:50 created already. If you don't have an SSH key, then you can click on this
129:54 button to add one. And then you can follow the instructions on the right
129:57 hand side here. No matter your OS, they got instructions to help you out. You
130:00 just have to paste in your public key and then give it a name. So, I've got
130:03 mine added already. And then the only other thing that I really have to select
130:07 here is a unique name. So, I'll just say local AI package. And I'll just say GPU
130:11 because I already have the regular version just deployed on a CPU instance.
130:16 And then for my project, I will select Dynamis. I can add it along with my
130:19 other instance that I've got up and running. And there we go. So now I can
130:22 go ahead and just click on create GPU droplet. It is that easy to get our
130:27 instance ready for us to access it and start installing everything and getting
130:30 everything up and running just like we did on our computer. And so I'll go
130:34 ahead and pause and come back once our machine is created in just a few
130:37 minutes. And boom, there we go. Just after a minute, we have our GPU droplet
130:42 up and running. And so the one thing I will say is I had to request access to
130:46 create a GPU instance on the Digital Ocean platform. However, they approved
130:50 it in less than 24 hours. So, it's very easy to get that if you do want to
130:53 create a GPU instance. Otherwise, you can just create one of their normal
130:58 droplets, one of their CPU instances. Now, before we connect to this machine,
131:01 the one thing that you want to take note of is the public IPv4 address. We'll use
131:06 this to set up subdomains in a little bit. And so, this is how we get to it
131:12 for a GPU droplet. For a CPU droplet, it looks a little bit different. You'll
131:15 usually see the IPv4 somewhere at the top right here. And so, take note of
131:18 that. Save it for later. We'll be using that in a little bit. And then to
131:22 connect to our droplet, we can either do it through SSH with our IPv4 and the SSH
131:27 key that we set up when we configured this instance or we can access the web
131:31 console. For a CPU instance, usually you go to like an access tab and then you
131:35 can launch the web console. For the GPU instance, I can just click this to
131:38 launch it right here. So we have a separate window that comes up and boom,
131:42 we now have access to our instance. It is that easy to get connected. And now
131:46 we can go through the same process that we did on our computer to install the
131:51 local AI package. Now there are some different steps that we have to take and
131:55 that's why I'm including this at the end of the master class especially. Um so if
131:59 you scroll down in the read me here there are some specific instructions for
132:03 deploying to the cloud. And so you have to make sure that you have a Linux
132:07 machine preferably on the Ubuntu distribution, which is what we
132:11 are using. And then there are a couple of extra steps. And so the first thing
132:15 that we have to set up is our firewall. We have to open up a couple of ports so
132:20 that we can access our machine from the internet. Set up our subdomains for
132:24 things like N8N and Open Web UI. And so you want to take this command UFW
132:29 enable. I'll just go ahead and paste it in, and you can
132:33 just type Y to continue here. It warns it may disrupt SSH connections, but we don't
132:36 really care. So ufw enable, that's enabling the firewall. And then you want
132:41 to copy this command to allow both ports 80 and 443. And so 80 is HTTP and then
132:48 443 is HTTPS. And then you can just do the last command here, UFW reload. So
132:54 now we have those ports available for us to communicate with all of our services.
132:59 And so this is the entry point to Caddy. Caddy is the service in the local AI
133:03 package that is going to allow us to set up subdomains for all of our services.
133:08 And so any kind of communication to our droplet is going to go through Caddy,
133:13 and then Caddy will distribute to our different services based on the port or
133:17 the subdomain that we are using. So this is called a reverse proxy. You've
133:21 probably heard of Nginx or Traefik before. Caddy is something very
133:25 much like that. And it also makes it so that we can get HTTPS so we can have
133:29 secure endpoints set up automatically and it manages that encryption for us.
133:33 It's a very beautiful platform. So, let me scroll back down to the specific
133:36 steps for deploying to the cloud. There is a quick warning here about how Docker
133:40 manages ports and things, but this is as secure as we possibly can make it. So,
133:43 trust me, we've put a lot of effort and I actually had someone from the Dynamis
133:48 community, um, Benny right here. He actually helped me a lot with security
133:53 for the local AI package. So, thank you, Benny, for helping out with that. We're
133:56 really making sure, because the whole point of local AI is that you want to
133:59 be private and secure, that this package handles
134:03 all the best practices for that. So very much top of mind for us. And then we
134:07 can go ahead and go through the usual steps for setting up the local AI
134:11 package. The only other thing we have to do that's unique for cloud is we have to
134:16 set up A records with our DNS provider so we can have our subdomains set up for our
134:20 different services. So we'll get into that in a little bit. But first I just
134:24 want to get the local AI package up and running. And so I'm going to go ahead
134:28 and paste this command here to clone the repository. Git comes automatically
134:33 installed with our GPU droplets. And then I can change my directory into the
134:38 local AI package. And let me zoom in on this a little bit here so it's very easy
134:41 for you to see, because now what I want to do is copy
134:45 the .env.example file to a new file called .env. Then if I do an ls command we can see all the
134:51 files that are available in our directory. I guess it doesn't show hidden files, so
134:56 I do ls -a there. Now we can see the .env and .env.example. And so now I can do nano .env.
135:05 This is going to give us basically a text editor directly in the terminal
135:08 here so that we can set all of our environment variables just like we did
135:12 with the local AI package on our computer. So this time I'm not going to
135:15 go into the nitty-gritty details of setting up all these environment
135:19 variables because it is the exact same process. In fact, you can literally
135:23 reuse all of the secrets that you already set up when you hosted it on
135:27 your own computer as long as those are actually secure. Like I know a lot of
135:29 times you might just do some kind of placeholder stuff when it's just running
135:32 on your computer and then you want it more secure in the cloud. So make sure
135:36 you have real values for everything, but yeah, you can reuse a lot of the same
135:39 things. Um though for best security practice, you probably do want to make
135:42 everything different. But yeah, the one thing that I want to focus on with you
135:46 here that does change is our configuration for Caddy. Now that we are
135:50 deploying to the cloud, we want to have subdomains for our different services.
135:54 And so like N8N for example, we want to set the hostname for that. You want to
135:59 do the same thing for N8N, Open Web UI, Supabase, and then your Let's Encrypt
136:04 email. Obviously you want to uncomment that as well because this is the email
136:08 that you want to set for your SSL encryption. And so I'm just going to do
136:13 cole@dynamis.ai. And you can just set this to whatever email you want to use. And so
136:18 basically what you want to do here is just uncomment the line for each of the
136:23 services that you want to have subdomains created for. So, if you're
136:26 also using Flowise, which I'm just not in this master class here, but if you
136:30 are, then you want to uncomment this line as well. If you're using Langfuse,
136:34 then you uncomment this line. I'm going to leave them commented right now just
136:38 for simplicity sake. The two that I would generally recommend not
136:43 uncommenting ever are Ollama and SearXNG. We don't really want to expose them through
136:45 a subdomain because we're just going to use them as internal tooling for our
136:49 agents and applications that we have running on this server. So, we want to
136:53 keep those nice and private. But for everything that we do want to expose
136:56 that is protected with a username and password, like N8N, Open Web UI, and
137:01 Supabase, we can uncomment those. And so we got that set up now. But we have to
137:05 obviously provide real values for them as well. And so for example I'm just
137:09 going to say n8nyt, for YouTube, and then dynamis.ai. So you want to
137:17 define your exact URL that you want to have for this domain. Obviously, it has
137:21 to be a domain that you control because we'll go and we'll set up the records in
137:25 the DNS in a little bit. And so, I'll do the exact same thing for open web UI.
137:28 So, it's open web UI and I'm just doing YT because I already have the local AI
137:33 package hosted on my domain. And so, I can't do just open web UI because it's
137:36 already taken. So, openwebuiyt.dynamis.ai. And then finally, for Supabase, it'll
137:44 be supabaseyt.dynamis.ai. Boom. All right. There we go. So, go
137:47 ahead and take care of this. set up the rest of your environment variables and
137:51 then we can go ahead and move on. And the way that you exit out of this and
137:55 save your changes is you do Ctrl+X (or Command+X on Mac), type Y, and then hit
138:03 Enter. So again, that is Ctrl+X, then type Y, then press Enter. That is how you
138:08 exit out. And so now if I do a cat of the .env, this is how you can print it out
138:11 in the terminal. So we can verify that the changes that we made, like everything
138:16 for Caddy, are indeed made. So do that. Also change all the other environment
138:18 variables as well. I'm going to do that off camera and come back once that is
138:22 taken care of. All right. So my environment variables are all set. Now
138:26 the very last thing we have to do before we start all of our services is we need
138:32 to set up DNS. And so remember, copy your IPv4 address and then head on over
138:38 to your DNS provider. And so this process is going to look very similar no
138:42 matter the DNS provider that you have. Like I'm using Namecheap here. A lot of
138:47 people use Hostinger or Bluehost. You are able to with all these providers go
138:50 to something that is usually called like advanced DNS or manage DNS and then you
138:55 can set up custom A records here which we're going to do to set up a connection
139:01 for all of our subdomains to the IPv4 of our digital ocean droplet or whatever
139:05 cloud provider you are using. And so I'm going to go ahead and click on add new
139:09 record. It's going to be an A record for the host. It's going to be the subdomain
139:14 that I want. So, N8NYT for example. And then for the IP address, I just paste in
139:19 the IPv4 of my Digital Ocean GPU droplet. And then I'll go ahead and
139:23 click on the check here to save changes. And then I'll just go ahead and do the
139:28 same thing for openwebuiyt. I can't forget the yt. There we go. And
139:33 then for you, it might be more than just three, but for me, the only other one
139:36 that I have here right now is Supabase because I'm just keeping it very, very
139:41 simple. So, supabaseyt, and then paste in the IP again. Okay, there we go.
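Before starting the stack, it can save some head-scratching to confirm those A records have actually propagated, since Caddy will try to provision certificates for the hostnames right away. Here is a quick hypothetical check; the subdomains are the example ones from this walkthrough, so use yours along with your droplet's real IPv4:

```python
import socket

DROPLET_IP = "203.0.113.10"  # placeholder: your droplet's public IPv4
SUBDOMAINS = [
    "n8nyt.dynamis.ai",
    "openwebuiyt.dynamis.ai",
    "supabaseyt.dynamis.ai",
]

for host in SUBDOMAINS:
    try:
        resolved = socket.gethostbyname(host)
        status = "OK" if resolved == DROPLET_IP else f"points at {resolved}, not the droplet"
    except socket.gaierror:
        status = "does not resolve yet"
    print(f"{host}: {status}")
```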
139:45 Boom. All right, so we have all of our records set up. And it's very important
139:49 to do this before you run things for the first time because otherwise Caddy gets
139:52 very confused. It tries to use these subdomains that you don't actually have
139:57 set up yet. And so take care of that. Then we can go back into our instance
140:01 here and run the last command. And so going back to the readme here, if I
140:06 scroll all the way down to deploying to the cloud, there is a specific parameter
140:11 that you want to add. This is very important for deploying to the cloud
140:15 because when you select the environment of public, it's going to close off a lot
140:20 more ports to make this very very secure. So any of the services that you
140:25 access from outside of the droplet, it has to go through caddy. So we use the
140:30 reverse proxy as the only entry point into any of our local AI services. This
140:34 is how we can make things as airtight as possible. We have security in mind like
140:39 I said. And so make sure that you run this command with the environment
140:43 specified. We didn't do this locally because when we are running things
140:45 locally, we don't care about security as much because it's not like our machine
140:49 is accessible to the internet like a cloud server is. And so it defaults to
140:54 the environment of private which just doesn't do as much security stuff. And
140:57 so go ahead and run this. And then of course make sure that you're using the
141:01 correct profile. So if you're using just a CPU instance with a regular droplet or
141:05 hostinger or whatever, you'd want to change this to CPU instead of GPU
141:10 Nvidia. But in our case, because we are paying the $2 an hour for a killer GPU
141:16 droplet. I can go ahead and run this command with the profile of GPU Nvidia.
141:20 Now, I left this error in here intentionally because I want to show you
141:24 what it looks like. If you get "unknown shorthand flag: -p", that means that you
141:29 don't actually have Docker Compose installed. And this happens for some
141:32 cloud providers. And there's a very easy fix for this that I want to walk you
141:35 through. So you can even test this. Just docker compose. It'll say that compose
141:40 is not a docker command. And so going back to the readme here, I have a couple
141:44 of commands that you just have to run if this happens to you. This is at the
141:47 bottom of the deploying to the cloud section of the readme. So you can just
141:51 copy these one at a time, bring them into your droplet or your machine,
141:55 wherever you're hosting it, and go ahead and run them. And so I'm just going to
141:58 do this off camera really quickly. I'm just going to copy each of these into my
142:02 droplet. It's very easy. You can just run all of these. They're really fast as
142:06 well. So none of them are going to take very long. This is just going to get
142:10 everything ready for you so that Docker Compose is a valid command. So you can
142:14 then run the start services script. So there we go. All right. I went ahead and
142:18 ran all of those. I'll clear my terminal again and then go back to the main
142:23 command here to start our services. Boom. All right. So now we pulled
142:26 everything from Supabase, set up our SearXNG config. Now we are pulling our
142:31 Supabase containers. So again, it's the same process as running on our computer, where
142:35 it'll pull Supabase, it'll run everything for Supabase, then it'll
142:38 pull and run everything for the rest of our services. So I'll pause and come
142:42 back once this is all complete. And boom, there we go. We have all of our
142:46 services up and running. You should see green check marks across the board like
142:51 this. We are good to go. And we don't have Docker Desktop, so it's not as easy
142:55 to dive into the logs for our containers. But one quick sanity check
142:59 that you can do just in the terminal is run the command docker ps -a. This will
143:03 give you a list of all of our containers that are running here. We can make sure
143:06 that all of them are running, that we don't see any that are constantly
143:10 restarting or ones that are down. So we do have two that are exited, but these
143:14 are the N8N import and the Ollama pull. These are the two that we know should be
143:17 exited. Just make sure that everything is good to go. Then we can head on over
143:22 to our browser. And because we have DNS set up already and we configured Caddy, we
143:25 can now navigate to our different services. Like I can go to
143:30 n8nyt.dynamis.ai. And boom, there we go. It's having us set up our owner account. Or we can
143:39 just go to openwebuiyt.dynamis.ai. Boom. And there is our Open Web UI. All
143:43 right. So I'll go ahead and get started. Uh, we'll have to create our account.
143:46 I'll do this off camera, but yeah, you just create your first-time accounts for
143:49 everything. And then we'll do the same thing for, let's do, supabaseyt.dynamis.ai.
143:55 And boom, there we go. So, all of our services are up and running. And so now
143:59 we can log into these and create our accounts and we can interact with our
144:03 agents and bring them in. We can work with Ollama in the same way. And so,
144:06 let's go ahead and do that. I'll just go ahead and create these accounts off
144:09 camera. So, I've got my accounts created for N8N and then also open WebUI. And
144:13 you can do the same for all the other accounts you might have to create for
144:16 things like Langfuse as well. And then within Open WebUI, we'll go to the
144:21 admin panel, settings, connections. Make sure that your Ollama API is set
144:25 correctly to reference the ollama service. Usually this will default to
144:30 localhost or host.docker.internal, so you have to change that here. You also have to set
144:34 the OpenAI API key, just to any kind of random value. It's just a little
144:38 bug in Open WebUI. Then click on save, and then you can go back and your models
144:41 will be loaded. Now, one thing I found with Open WebUI: after you change
144:46 the Ollama base URL, you have to do a full refresh of the website. Otherwise,
144:49 you'll get an error when you use the LLM. So, just a small tidbit
144:52 there, but yeah, we'll go ahead and select the model
144:55 that's pulled by default. And you can pull other ones into your Ollama
144:57 container as well, like we already covered. You don't have to restart anything for that.
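To spell those settings out, these are the values being entered under Admin Panel -> Settings -> Connections (the base URL uses the compose service name, as described above):

    Ollama API base URL:  http://ollama:11434      # the service name inside the stack, not localhost
    OpenAI API key:       any-placeholder-value    # works around the Open WebUI quirk mentioned above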
145:00 So, let's run a little test. I'll just say hello, and we'll see if we
145:04 can load the model now. And boom, look at how fast that was, because we have a
145:08 killer GPU instance right now. I could run much larger LLMs if I wanted to. Um,
145:14 so yeah, let's see. What did I just say? All right, we'll do another test here.
145:17 And yeah, look at how fast that is. It's blazing fast because everything is
145:20 running locally on the same infrastructure. There's no network
145:24 delays. And so we have a powerful GPU, no network delays. We get some blazing
145:28 fast responses from these LLMs right now. And so I don't want to go and test
145:34 everything with N8N again. But what I do want to show you how to do right now is
145:39 take the Python agent that we have in this repository and deploy this onto the
145:44 cloud as well. Adding it into the local AI Docker Compose stack just like we did
145:49 on our computer, but now hosting it all in the cloud. So that's the very last
145:52 thing that I want to cover with you for our cloud deployment. And so just like
145:56 with the local AI package, we can follow the instructions here in the readme to
145:59 get everything up and running on our machine. And so the first thing we have
146:03 to do is we need to clone our repository. And so I'm going to copy
146:07 this command, go back over into my terminal here for my instance. And I
146:11 step back one directory level by the way. So I'm now at the same place where
146:14 I have the local AI package so we can run them side by side. So I'll paste
146:19 this command to clone ottomator-agents. And then I can cd into it. And then I
146:22 also want to change my directory into the Python local AI agent folder specifically.
146:28 And so now, doing an ls -a, we can see the .env.example. So, just like
146:33 we did before, I'll copy this and turn it into a .env. And then I can do nano .env.
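Put together, the sequence on the droplet looks roughly like this — the repo URL and folder names are my best reading of what's shown here, so double-check them against the readme:

    # clone the agents repo next to the local AI package and prepare its .env
    git clone https://github.com/coleam00/ottomator-agents.git
    cd ottomator-agents/python-local-ai-agent
    cp .env.example .env
    nano .env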
146:39 And there we can edit all of our environment variables. And so because
146:42 we're running this in the Docker container, attached to the local AI
146:46 stack, the way that I reference the LLM is just by calling out the service
146:51 name. So it's the ollama service, port 11434, /v1. And then the API key is just that placeholder, ollama. Then for the
146:58 LLM choice, if I want to get the exact ID of one that I already had pulled,
147:01 I'll actually show you how to do this really quick. So, I'm going to do
147:06 Ctrl+X, Y, Enter to save and exit. And then the way that you can exec into a
147:12 container is docker exec -it, then the name of our container, which
147:15 is ollama. We already have this running. And then /bin/bash. And so what this is going to do is, now,
147:22 instead of being within our machine, we are within our Ollama container. And so
147:28 now I can run the command ollama list, and then I can see the LLMs that I have
147:33 available to me. So I have Qwen 2.5 7B. So I'm going to go ahead and copy this
147:37 ID. I don't have it memorized. So this is my way to go and reference it really
147:41 quickly. And then this is also how you can access each of your containers when
147:44 you don't have Docker Desktop. You just do docker exec -it, the name of your
147:50 container, and then /bin/bash. Kind of like how we had that exec tab in Docker Desktop.
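As a compact reference, that whole detour is just a few commands (the container name ollama matches what's shown in docker ps -a):

    # open a shell inside the Ollama container and list the pulled models
    docker exec -it ollama /bin/bash
    ollama list    # copy the exact model ID from this output
    exit           # back to the host machine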
147:53 And then once I'm done in here, I can just do exit. And now I'm back
147:58 within my host machine, my GPU droplet. So that's another little tidbit, another
148:02 golden nugget I wanted to give you there. But yeah, we'll go back into our
148:05 environment variables here. And I have that ID for Qwen 2.5 7B copied. So I'll
148:12 paste that in. And boom, there we go. And then for our Supabase URL, it's
148:17 going to be http:// and then it is kong. So I guess that I
148:21 should have been more clear on this when I set things up locally. So I'll be sure
148:24 to update the documentation for this, but it's going to be kong, port 8000,
148:29 because Kong is the gateway service that sits in front of Supabase, including the
148:33 dashboard. And then the service key, well, I'm just going to go ahead and get
148:37 that from my local AI package because I have this set up in environment
148:40 variables. And so I just have to go and reference my environment variables here
148:46 to get my service role key. And boom, there we go. Okay. So I'm going to go
148:49 ahead and paste this in. And I'm just going to delete this instance
148:51 after. I don't really care that I'm exposing this right now. And then for
148:57 the SearXNG base URL, it's http://searxng:8080. And then for my bearer token, I
149:01 just have it set to test off. And then we don't need to set the OpenAI API key
149:05 because that was just for the OpenAI compatible demo earlier. So that is all
149:09 of my configuration for this container. I've got to be really clear on this, so I'll
149:12 update the docs for it. But otherwise we are looking good for our
149:17 environment variables. So Ctrl+X, Y, Enter to save. You can do a cat .env just
149:22 for that sanity check to make sure that everything's saved. We are looking good.
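Pulled together, the finished .env looks roughly like this — the variable names below are illustrative placeholders, so match them against the repo's .env.example rather than copying them as-is:

    # illustrative sketch of the agent's .env for running inside the local AI stack
    LLM_BASE_URL=http://ollama:11434/v1     # the Ollama service on the compose network
    LLM_API_KEY=ollama                      # placeholder value; Ollama doesn't check it
    LLM_CHOICE=qwen2.5:7b                   # use the exact ID shown by `ollama list`
    SUPABASE_URL=http://kong:8000           # Kong fronts Supabase inside the stack
    SUPABASE_SERVICE_KEY=<service role key from the local AI package .env>
    SEARXNG_BASE_URL=http://searxng:8080
    BEARER_TOKEN=<whatever token you want the agent to require>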
149:27 All right. And so now within the readme here, so I'll go back to the
149:30 instructions. We did all this already. We changed our directory. We set up our
149:33 environment variables and configured them. Now we need to run this stuff in
149:38 our SQL editor in Supabase. This is how we can get our table set up, because we
149:42 haven't run things with N8N first. So we don't have this table created already.
149:47 And so now I just have to sign into Supabase here. So I've got my username,
149:51 which is supabase. And then I'm just copying and pasting the username and
149:55 password that I have set here in my environment variables. And so
150:00 I'll go to my SQL editor and go back here. I know I'm moving kind of quick
150:03 here, but I got these instructions laid out in the readme. I'm going to paste
150:08 this like that and go ahead and click on run. And then boom, there we go. So now
150:12 if I go into tables and search for n8n, we have n8n_chat_histories, a new
150:16 currently empty table. All right, looking good. And then going back after
150:21 we do that, now we can run the agent. So I just have to take this command right
150:24 here and then I'll go back to my droplet. And the one thing that I
150:28 mentioned earlier, but I want to cover it again. If I go into the... hold on, I
150:33 need to change my directory back. So, ottomator-agents and then the Python local
150:38 AI agent folder. If I go into my Docker Compose file, you have to make sure that the include path
150:42 is correct. And so, I'm going to update this by the time you get your hands on
150:44 it, where it's just going to be going two levels back. That's what we
150:47 need to do. So, make sure that we reference the right path to the local AI
150:52 package on our machine, and then Ctrl+X, Y, Enter to save. That's because we have
150:57 to go back from the Python local AI agent folder, then back from the ottomator-agents
151:00 directory, and then within that same directory, we have the local AI package.
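In other words, the include entry in the agent's docker-compose.yml has to point back at the local AI package relative to wherever you cloned things. A sketch of what to check — the exact path depends on your folder layout:

    # open the agent's compose file and check where it pulls the package in from
    nano docker-compose.yml
    # the include should walk two directory levels up to the local AI package, e.g.:
    #   include:
    #     - ../../local-ai-packaged/docker-compose.yml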
151:04 So, we're good to go. Now, I can go ahead and paste in the command here to
151:08 build our agent and include it in the local AI stack. And so, it's going to
151:11 have to build everything. Takes a minute like we saw already. So, I'll pause and
151:15 come back once this is done. And all right, there we go. About 30 seconds
151:18 later and we are good to go. So, now I can do the docker ps -a again. And this
151:23 time, if I look through this list very carefully, it takes a little bit here, I
151:28 should be able to see my Python agent. There we go. Local AI, Python, local AI
151:32 agent. And it starts with local AI because it is a part of that docker
151:36 compose stack. And by the way, I can do docker exec -it, then the Python local AI
151:46 agent container name, then /bin/bash. I can run this as well. And then what I can do is, if I do
151:51 a printenv command, I can see all the environment variables that are set
151:54 within this container. That's everything that we set up in the .env. So I'm being
151:59 very comprehensive with this master class, showing you how you can tinker
152:02 around with different things like accessing your containers and seeing the
152:05 environment variables, making sure that everything that we specified in the .env is
152:09 actually taking effect here. And sure enough, it is. So we are looking good.
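That check is just a short sequence — substitute the agent container's actual name from the docker ps -a output:

    # shell into the agent container and confirm the .env values made it in
    docker exec -it <python-agent-container-name> /bin/bash
    printenv | sort    # look for the values you set in the .env
    exit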
152:13 So I'll go ahead and exit. We're back on our host machine now. We have our
152:18 container up and running and also it's running on port 8055. And so now we can
152:23 go back to Open WebUI at openwebuiyt.dynamous.ai, and we can set up our pipe. And so I'm
152:30 going to go to the admin panel functions. We don't have a function
152:33 here. So I have to import it. And so, what I can do, I'll actually do this
152:36 here: you can literally Google n8n pipe open web UI. And it'll
152:40 bring you to the one that I have here. You just have to sign into open web UI.
152:44 I'll click on get. And then this time for my URL instead of being something on
152:49 localhost, I'm going to put in my actual subdomain here. So import to Open WebUI,
152:53 and then boom, we have our pipe. So I'll click on save, confirm, and then within
152:59 the valves here, I can set all of my values. So I'll just click on default
153:02 for all of these so I can get a starting point here. And then yeah chat input is
153:07 good, output is good, the bearer token is test off, and then for my URL, it's going
153:14 to be http:// and then the name of my service, python local AI
153:24 agent, port 8055. Let me get that right: 8055, and then /invoke-python-agent. I believe I
153:32 have this memorized. I think we are good there. So, going back, if I clear this
153:37 and run a docker ps -a, it is indeed called python local AI agent. That is
153:42 the name of our service, so Open WebUI is able to connect to the agent directly
153:46 with this name, because we are deploying it in the same Docker network. And so I
153:51 think we are looking good. All right, so I'm going to go ahead and click on save.
153:56 All right, and then go back and start a new chat. And then also, like I said, a lot
154:00 of times it helps just to refresh Open WebUI
154:03 completely. All right, there we go. And then now, instead of... Oh, I have to
154:07 actually enable it. Let me go back to the admin panel. Under Functions, you have to make
154:11 sure this is ticked on so that we have the pipe enabled. Now going back
154:15 here and I'll refresh as well. We've got our pipe selected. And now I can say
154:20 hello. And there we go. Super super fast. We got a response from our Python
154:26 agent. Take a look at that. And then also going into my database here, you
154:29 can see that I have all these messages in the n8n chat histories table. We'll
154:33 take a look at that. All right. And then we can also ask it to do web search. I
154:38 can say, like, what is the latest LLM from Anthropic, for example. So it has to
154:44 do a quick SearXNG search, leveraging that. The latest is Claude Opus 4. All
154:49 right. And man that was so fast. We have no network delays now because everything
154:53 is running on the same network and we have an absolute killer GPU. So this is
154:56 so cool. Also, one thing that I want to mention is sometimes depending on your
155:02 cloud provider, SearXNG will not start successfully. There's one thing you have
155:05 to do. It's just a really small tidbit. If you run into this issue where the
155:09 SearXNG container is constantly restarting, what you want to do is go to
155:14 your local AI package and then run the command chmod 755 searxng. That's the
155:22 SearXNG folder. And so the searxng folder is responsible for storing the
155:26 configuration that we have for SearXNG by default. Sometimes you don't have
155:29 permission to write to it, and it needs to. So I'm going to update
155:33 the troubleshooting to include this. But yeah, just a small tidbit. And then you
155:37 can just go ahead and run the command to start everything again. Obviously,
155:41 you have to go back one directory first, then you can run it and restart
155:45 everything. It's that easy to restart things to make changes take effect for your
155:49 package, and then you'll be good to go.
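For reference, that troubleshooting step is roughly the following — the folder name is the searxng folder described above, and the restart command should use whatever profile you started with:

    # from the local AI package folder: make the SearXNG config folder writable
    chmod 755 searxng

    # then restart the stack so the change takes effect, e.g.:
    python3 start_services.py --profile gpu-nvidia --environment private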
155:54 So yeah, we have everything working here. And that's pretty much it for the master class. Now we have our local AI
155:58 package up and running, with an agent in the same network as well. We're communicating
156:02 with it within Open WebUI directly. There is so much that we have gotten through
156:07 now. So congratulations for making it this far. All right, I'm going to be totally
156:12 honest. This master class was very hard to make, but it was so worth it.
156:16 And I hope that you got a lot out of this. We really covered it all. All the
156:21 way from starting with what is local AI and why we should care about it to
156:25 deploying it on our machine, building agents, deploying it to the cloud, and
156:28 configuring everything with DNS. Like man, we basically did everything you
156:32 could possibly need to get the foundation laid out to build anything
156:36 that you want with local AI and local AI agents. And so the very last thing that
156:40 I want to cover here is just a couple of additional resources that I have for you
156:45 now that you know how local AI works and how to get it set up. You want to dive
156:48 into building more complex agents with it now. And so there's a few things that
156:52 I want to call out for you. So starting with my YouTube channel, I have a lot of
156:56 videos on my channel diving more specifically into building more complex
157:00 AI agents with local AI. And the main resource that I want to point you to
157:03 right now if you really want to go deeper into building agents with local
157:08 AI is the Ultimate N8N RAG AI Agent Template, Local AI Edition. And so this
157:13 is using the local AI package, and I dive really deep into RAG and local AI, which
157:17 was outside of the scope of this master class because that's more about building
157:21 agents versus setting up local AI. But this is a great video to dive into. Um,
157:25 and then also I've got to call out the Dynamous community again because, man, I
157:29 put so much effort into building local AI into a core part of this course here.
157:34 And so, like I said at the start of this master class, when I build the full
157:38 agent out throughout the AI agent mastery course, local AI is an option
157:42 the entire time and I show exactly how to set up everything for local AI using
157:47 the local AI package. Like, I really have this ingrained into everything in
157:51 Dynamous and in my YouTube channel. This local AI package is the core of
157:56 everything that I do with local AI. So great resources for you. With that, that
158:01 is everything that I have for this master class. So I know this is my third
158:05 time saying it, but congratulations if you made it this far. You now have what
158:08 it takes to really build anything that you want with local AI and you can use
158:12 these additional resources to go much further as well. So I hope to see you in
158:15 the Dynamous community. Let me know in the comments if you have any questions
158:19 on anything that I dove into here, because I know that it is a lot to
158:23 digest, but I'm trying my best to make it as digestible as I possibly can. So,
158:27 with that, if you appreciated this master class and you're looking forward
158:32 to more things local AI or AI agents, I'd really appreciate a like and a
158:35 subscribe. And with that, I will see you
$

The Ultimate Guide to Local AI and AI Agents (The Future is Here)

@ColeMedin 2:38:37 29 chapters
// chapters
// description

The future of AI and AI agents is running everything locally - your LLMs, databases, agent tooling, knowledge bases, and automation platforms. No cloud dependencies, no data leaving your machine, and full control over your AI infrastructure. There are pros and cons to local vs. cloud AI, but the advantages of cloud AI are diminishing every day and we're heading toward a future where running your own LLMs is a must - so master this now. In this masterclass, I'll walk you through everything you n

now: 0:00
// tags
[AI agents and automation][developer tools and coding][productivity and workflows][security and privacy][hardware setup and infrastructure]