// transcript — 7060 segments
0:00 Welcome to the Local AI Masterclass
0:02 Welcome to the video that I have put the most effort into creating by far on my
0:08 channel to date. This is the local AI master class and we're going to dive
0:12 into everything you need to know about local AI. What it is, why it's so
0:16 important for you no matter what you are building with AI. How you can run your
0:20 own large language models and self-host your own infrastructure. How you can
0:26 build 100% private and offline AI agents and deploy them to the cloud in a secure
0:31 way. I have everything that you need to get started here even if you haven't
0:35 done anything with local AI before and I take things pretty far. There is a lot
0:38 of value packed into this for you. So, buckle up, enjoy the ride and follow
0:41 Agenda for the Local AI Masterclass
0:43 along as well. So, first things first, let's start with an agenda for the
0:47 master class. There are so many things that I cannot wait to share with you.
0:51 And I have very detailed chapters for this YouTube video so you can easily
0:55 navigate between everything that I'm going to show you. I just want to make
0:59 it super easy for you to get exactly what you want out of this master class.
1:03 Nothing more, nothing less. We'll start by diving into what is local AI and I
1:07 have a quick demo to make this very, very hands-on. And then with that, we'll
1:12 get into the why. Why local AI? Why should you care about it? Why do I
1:16 believe so firmly that it is the future of AI? I'll dive into all my reasoning
1:20 there. And then we'll get into hardware requirements because these local LLMs
1:24 are beasts and you have to have specific hardware to be able to run them. So I'll
1:28 dive into all that based on different large language models and some
1:32 alternatives as well. Then we'll get into all of the tricky stuff. There are
1:35 a few things that are usually pretty daunting for people. So I want to break
1:39 down those barriers just to make you super confident running your own local
1:43 LLMs and infrastructure. And then with that, we'll get into how you can use
local AI anywhere. Because Ollama and other solutions for running your own
1:52 large language models, they are OpenAI API compatible. I'll get into what that
1:56 means when we get to this point. But basically, any agents that you already
2:00 have running with Python or N8N, whatever. If you're using OpenAI or
Gemini or Anthropic, you can very easily swap them to use local AI instead. So
2:09 you can turn your existing agents into ones that are 100% offline, free, and
2:15 private. And then with that, we will get into the local AI package. This is a set
2:20 of services that I've curated for you to run your entire local AI infrastructure
2:25 like your UI, your database, your large language models, and a lot more. This is
2:28 where we really start to build out our full infrastructure. I'll walk you
2:32 through setting up the local AI package, getting into the nitty-gritty details to
2:35 make sure that you have everything set up at this point. And then once we have
2:40 that set up, we can dive into building a fully local AI agent with N8N. And then
2:44 we'll transition that same agent into Python as well. So that you'll see once
2:49 we have the local AI package set up, how you can build a 100% offline and private
2:54 agent both with no code and with code. And then we'll take those agents and
2:58 deploy them to the cloud, specifically on the Digital Ocean platform. But I'll
3:02 walk you through a process that you can use no matter the cloud provider that
you are using. And we'll deploy things in a very secure way, both for the
package, our infrastructure, and the AI agent itself. And then last, I want
3:14 to end with some additional resources just to make sure you have everything
3:17 that you need to take this master class forward and really use this to build any
3:21 AI agent that you could possibly want 100% local. And also, if you are
3:26 interested in mastering more than just local AI, but building entire AI agents
3:31 in a local AI environment or even with cloud AI, definitely check out
dynamis.ai. This is my community for early AI adopters just like yourself.
3:40 And a big part of this community is the AI agent mastery course where I dive
3:45 super deep into my full process for building AI agents. I'm talking planning
and prototyping and coding and using AI coding assistants and building full
3:55 frontends for our AI agents and securing things and deploying things. There's a
3:59 lot more coming soon for this course as well. I'm very actively working on it.
4:03 And a big part of this course is the complete agent that I build throughout
4:07 it. I build both with cloud AI and local AI. And so this master class will help
4:11 you get very very comfortable with local AI. But when it comes to building
4:15 complex agents and really getting deep into building out agents, then you
4:19 definitely want to check out the AI agent mastery course here in Dynamis.AI.
4:23 So with that, let's get back to the master class, diving into what local AI
4:28 is all about. Let's start by laying the foundation. What is local AI in the
4:32 first place? Well, very simply put, local AI is running your own large
4:36 language models and infrastructure like your database and your UI entirely on
your own machine, 100% offline. So think about when you typically want
to build an AI agent. You need a large language model, maybe like GPT-4.1
or Claude 4, and then you need something like your database, like Supabase, and
you need a way to create a user interface. You have all these different
4:58 components for your agent and typically you're using APIs to access things that
5:03 are hosted on your behalf. But with local AI, we can take all of these
5:07 things completely in our own control running them ourselves. So this is
5:11 possible through open-source large language models and software. So
5:15 everything is running on your own hardware instead of you paying for APIs.
So we run the large language model ourselves on our own machine instead of
5:25 paying for the OpenAI API for example. And so for large language models, there
5:29 are thousands of different open source large language models available for us
5:32 to use in a lot of different ways. And some of these you've probably heard of
before, like DeepSeek R1, Qwen 3, Mistral 3.1, Llama 4. These are just a
5:41 couple of examples of the most popular ones that you've probably heard of
5:44 before. We'll be tinkering around with using some of these in this master
5:48 class. And then we also have open-source software. So all of our infrastructure
that goes along with our agents and LLMs, things like Ollama for running our
LLMs, Supabase for our database, N8N for our no-code/low-code workflow automations,
and Open WebUI to have a nice user interface to talk to our agents and
6:07 LLMs. And we'll dive into using all of these as well. Now, because local AI
6:12 means running large language models on our own computer, it's not as easy as
just going to claude.ai or chatgpt.com and typing in a prompt. We have to
6:22 actually install something, but it still is very easy to get started. So, let me
6:25 show you right now with a hands-on example. So, here we are within the
website for Ollama. This is just ollama.com. I'll have a link to this in the
description of the video. This is one of the open-source platforms that allows
6:39 us to very easily download and run local large language models. And so, you just
6:42 have to go to their homepage here and click on this nice big download button.
6:45 You can install it for Windows, Mac or Linux. It really works for any operating
6:49 system. Then once you have it up and running on your machine, you can open up
6:53 any terminal. Like I'm on Windows here, so I'm in a PowerShell session and I can
run Ollama commands now to do things like view the models that I have available on
7:03 my machine. I can download models and I can run them as well. And the way that I
7:08 know how to pull and run specific models is I can just go to this models tab in
7:12 their navigation and I can browse and filter through all of the open source
7:16 LLMs that are available to me like DeepSeek R1. Almost everyone is familiar
7:20 with DeepSeek. It just totally blew up back in February and March. We have
Gemma 3, Qwen 3, Llama 4, a few of them that I mentioned earlier when we had the
7:30 presentation up. And so we can click into any one of these like I can go into
7:35 DeepSeek R1 for example and then I have the command right here that I can copy
7:39 to download and run this specific model in my terminal. And there are a lot of
7:44 different model variants of DeepSeek R1. So we'll get into different sizes and
7:47 hardware requirements and what that all means in a little bit, but I'll just
7:50 take one of them and run it as an example. So I'll just do a really small
7:54 one right now. I'll do a 1.5 billion parameter large language model. And
7:58 again, I'll explain what that means in a little bit. I can copy this command.
It's just ollama run and then the unique ID of this large language model. So I'll
8:06 go back into my terminal. I'll clear it here and then paste in this command. And
8:09 so first it's going to have to pull this large language model. And the total size
8:15 for this is 1.1 GB. And so it'll have to download it. And then because I used the
8:20 run command, it will immediately get me into a chat interface with the model
once it's downloaded. Also, if you don't want to run it right away, you
just want to install it, you can do ollama pull instead of ollama run. And
then again, to view the models that you have installed already,
you can just do the ollama list command like I did earlier. And so, right now,
8:39 I'll pause and come back once it's installed in about 30 seconds. All
8:43 right, it is now installed. And now I can just send in a message like hello.
8:47 And then boom, we are now talking to a large language model. But instead of it
8:50 being hosted somewhere else and we're just using a website, this is running on
8:54 my own infrastructure, the large language model and all the billions of
8:59 parameters are getting loaded onto my graphics card and running the inference.
9:02 That's what it's called when we're generating a response from the LLM
directly within this terminal here. And so I can ask another question like
9:13 what is the best GPU right now? We'll see what it says. So it's thinking
first. This is actually a thinking model. DeepSeek R1 is a reasoning LLM.
And then it gives us an answer: its top GPU models today are the 3080 and the RX 6700.
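As an aside, this same chat can be driven from code: Ollama serves a local REST API on port 11434 by default. Here's a minimal sketch, assuming the server is running and the 1.5 billion parameter DeepSeek R1 model from above has already been pulled:

```python
import json
import urllib.request

# Minimal sketch: ask a locally running Ollama server a question.
# Assumes Ollama is up on its default port 11434 and the model
# has already been pulled with `ollama pull` or `ollama run`.
def ask_ollama(prompt: str, model: str = "deepseek-r1:1.5b") -> str:
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON response instead of a token stream
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# With the server running, you could then do:
# print(ask_ollama("What is the best GPU right now?"))
```

This is the same request the terminal chat is making under the hood; we'll use this idea properly later when building agents in Python.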
9:27 Obviously, we have a training cutoff for local large language models just like we
9:30 do with ones in the cloud like GPT. And so the information is a little outdated
9:34 here, but yeah, this is a good answer. So we have a large language model that
9:38 we're talking to directly on our machine. And then to close out of this,
9:42 I can just do control D or command D on Mac. And if I do list, we have all the
9:47 other models that you saw earlier, plus now this one that I just installed. So
these are all available for me to run again just with that ollama run command.
9:54 And it won't have to reinstall if you already have it installed. Run just
9:58 installs it if you don't have it yet already. So that is just a quick demo of
using Ollama. We'll dive a lot more into Ollama later, like how we can actually
10:07 use it within our Python code and within our N8N workflows. This is just our
10:11 Why Local AI? (Local AI vs. Cloud AI)
10:11 quick way to try it out within the terminal. Now, to really get into why we
10:15 should care about local AI now that we know what it is, I want to cover the
10:19 pros and cons of local AI and what I like to call cloud AI. That's just when
10:23 you're paying for things to be hosted for you, like using Claude or Gemini or
10:28 using the cloud version of N8N instead of hosting it yourself. And I also want
10:32 to cover the advantages of each because I don't want to sugarcoat things and
10:35 just hype up this master class by telling you that you should always use
10:39 local AI. That is certainly not the case. There is a time and place for both
10:43 of these categories here, but there are so many use cases where local AI is
10:50 absolutely crucial. You have no idea how many businesses I have talked to that
10:54 are willing to put tens of thousands of dollars into running their own LLMs and
10:58 infrastructure because privacy and security is so crucial for the things
11:02 that they're building with AI. And that actually gets into the first advantage
11:06 here of local AI, which is privacy and security. You can run things 100%
11:11 offline. The data that you're giving to your LLMs as prompts, it now doesn't
11:16 leave your hardware. It stays entirely within your own control. And for a lot
11:20 of businesses, that is 100% crucial, especially when they're in highly
regulated industries like the health industry, finance, even real estate.
11:28 Like there's so many use cases where you're working with intellectual
11:32 property or just really sensitive information. You don't want to be
sending your data off to an LLM provider like Google or OpenAI or Anthropic. And
11:41 so as a business owner, you should definitely be paying attention to this
11:45 if you are working with automation use cases where you're dealing with any kind
11:48 of sensitive data. And then also if you're a freelancer, you're starting an
11:52 AI automation agency, anything where you're building for other businesses,
11:55 you are going to have so many opportunities open up to you when you're
11:59 able to work with local AI because you can handle those use cases now where
12:03 they need to work with sensitive data and you can't just go and use the OpenAI
12:08 API. And that is the main advantage of local AI. It is a very big deal. But
12:11 there are a few other things that are worth focusing on as well. Starting with
model fine-tuning, you can take any open-source large language model and
12:20 add additional training on top with your own data. Basically making it a domain
12:24 expert on your business or the problem that you are solving. It's so so
12:29 powerful. You can make models through fine-tuning more powerful than the best
12:33 of the best in the cloud depending on what you are able to fine-tune with
12:38 depending on the data that you have. And you can do fine-tuning with some cloud
12:42 models like with GPT, but your options are pretty limited and it can be quite
12:46 expensive. And so it definitely is a huge advantage to local AI. And local AI
in general can be very cost-effective, including the infrastructure as well. So
12:56 your LLMs and your infrastructure. You run it all yourself and you pay for
13:00 nothing besides the electricity bill if it's running on your computer at your
13:04 house or if you have some private server in the cloud. You just have to pay for
that server and that's it. There's no N8N bill, no Supabase bill, no OpenAI
13:13 bill. You can save a lot of money. It's really, really nice. And on top of that,
13:17 when everything is running on your own infrastructure, the agents that you
13:22 create can run on the same server, the same place as your infrastructure. And
13:26 so it can actually be faster because you don't have network delays calling APIs
13:30 for all your different services for your LLMs and your database and things like
13:34 that. And then with that, we can now get into the advantages of cloud AI.
13:38 Starting with it's a lot easier to set up. There's a reason why I have to have
13:42 this master class for you in the first place. There are some initial hurdles
13:47 that we have to jump over to really have everything fully set up for our local
13:51 LLMs and infrastructure. And you just don't have that with cloud AI because
13:54 you can very simply call into these APIs. You just have to sign up and get
13:58 an API key and that's about it. So, it certainly is easier to get up and
14:02 running and there's less maintenance overall because they are hosting things
for you. Supabase is hosting the database for you. OpenAI is hosting the
14:09 LLM for you. So, you don't have to manage things on your own hardware. With
14:13 Local AI, you have to apply patches and updates if you have a private server in
14:17 the cloud. You have to manage your own hardware if you're running on your own
14:20 computer, making sure that it's on 24/7, if you want your database on 24/7, that
14:24 kind of thing. It's just less maintenance with cloud AI. And then
14:28 probably the biggest advantage of cloud AI overall is that you have better
models available to you. Claude 4 Sonnet or Opus, for example, is more powerful
14:40 than any local AI that you could run. So we have this gap here and this gap was a
lot bigger at one point, even a year ago. The best cloud LLMs absolutely crushed
the best local LLMs, and that gap is starting to diminish. And so I really
14:55 see a future where that gap is diminished entirely and all the best
14:59 local LLMs are actually on par with the best cloud ones. That's the future I
see. That's why I think that local AI is such
15:07 a big deal because the advantages of local AI, those are just going to get
15:12 more prevalent over time when businesses realize they really want private and
15:15 secure solutions. And then the advantages of cloud AI, I think those
are actually going to diminish over time. That's the key: minimal setup,
15:24 less maintenance. Well, those advantages are going to go away as we have
15:28 platforms and better instructions and solutions to make the setup and
15:32 maintenance easier for local AI and we have the gap that's continuing to
15:35 diminish between the power of these LLMs. All these advantages are going to
15:39 actually go away and then it'll just completely make sense to use local AI
15:43 honestly probably for like every single solution in the future. That's really
15:47 what I see us heading towards. And then the last advantage to cloud AI which
15:51 also I think will go away over time is that you have some features out of the
box like you have memory that's built directly into ChatGPT. Gemini has web
15:59 search baked in even when you use it through the API like these kind of
16:02 capabilities that are out of the box that you have to implement yourself with
16:07 local AI maybe as tools for your agent and you can definitely do that but it is
16:10 nice that these things are out of the box for cloud AI. So those are the pros
16:15 and cons between the two. I hope that this makes it very clear for you to pick
16:19 right now for your own use case. Should I implement local AI or cloud AI? A lot
16:24 of it comes down to the security and privacy requirements for your use case.
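To make the "OpenAI API compatible" point from earlier concrete: switching an existing agent from cloud to local is usually just a base-URL change, because Ollama exposes the same chat-completions route that the OpenAI API uses. Here's a minimal standard-library sketch; the port is Ollama's default, and the model names are just examples of ones you'd have access to. If you use the official openai Python package, the equivalent swap is simply passing base_url to the client constructor.

```python
import json
import urllib.request

# Build the chat-completions request an OpenAI-style client sends.
# Swapping cloud -> local means changing only the base URL (and using
# whatever model name you've pulled locally).
def chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Same request shape, different target (a real OpenAI call would also
# need an Authorization header with your API key):
cloud = chat_request("https://api.openai.com/v1", "gpt-4.1", "Hello!")
local = chat_request("http://localhost:11434/v1", "deepseek-r1:1.5b", "Hello!")
print(local.full_url)  # http://localhost:11434/v1/chat/completions

# Sending the local one (requires Ollama running):
# with urllib.request.urlopen(local) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

The body and response shape are identical in both cases, which is exactly why existing OpenAI-based agents can go 100% local with so little rework.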
16:26 Hardware Requirements for Local LLMs
16:28 Now, the next big thing that we need to talk about for local AI is hardware
16:33 requirements. Cuz here's the thing, large language models are very resource
intensive. You can't just run any LLM on any computer. And the reason for that is
16:43 large language models are made up of billions or even trillions of numbers
16:47 called parameters. And they're all connected together in a web that looks
16:51 kind of like this. This is a very simplified view with just a few
16:54 parameters here. But each of the parameters are nodes and they're
16:58 connected together. The input layer is where our prompt comes in and our prompt
17:02 is fed through all these hidden layers and then we have the output at the end.
17:05 This is the response we get back from the LLM. But like I said, this is a very
simplified view. GPT-4, for example, like you can see on the right-hand side, is
17:15 estimated to have 1.4 trillion parameters. And so, if you want to fit
17:20 an entire large language model into your graphics card, you have to store all of
17:25 these numbers. And even though we can handle gigabytes at a time in our
graphics cards through what is called VRAM, storing billions or trillions of
17:34 numbers is absolutely insane. And so that's why large language models, you
17:37 actually have to have a pretty good graphics card if you want to run some of
the best ones. And so looking at Ollama here, when we see these different sizes,
17:47 going back to their model list, like 1.5 billion parameters or 27 billion
17:51 parameters, there are different sizes for the local LLMs. Obviously, the
larger a local LLM that you are running, the more performance you are going to
17:59 get, but you are going to be limited to what you are capable of running with
18:04 your graphics card or your hardware. So, with that in mind, I now want to dive
18:07 into the nitty-gritty details with you so you know exactly the kind of models
18:11 that you can run, the kind of speeds you can expect depending on your hardware.
18:15 And if you want to invest in new hardware to run local AI, I've got some
18:19 recommendations as well. So there are generally four primary size ranges for
18:25 large language models based on the speed and the power that you are looking for.
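The per-size VRAM figures that follow can be ballparked with a quick calculation: at Q4 quantization each parameter takes about half a byte, plus some runtime overhead for things like the KV cache. The 15% overhead factor below is my own rough fudge to line up with typical real-world usage, not an exact formula:

```python
# Back-of-the-envelope VRAM estimate for a quantized model.
# bits=4 matches the Q4 quantization assumed throughout this section;
# the 1.15 overhead factor (KV cache, runtime buffers) is a rough
# approximation of mine, not an exact rule.
def estimate_vram_gb(billions_of_params: float, bits: int = 4,
                     overhead: float = 1.15) -> float:
    bytes_per_param = bits / 8  # 4 bits = 0.5 bytes per parameter
    return billions_of_params * bytes_per_param * overhead

for size in (7, 14, 32, 70):
    print(f"{size}B parameters at Q4: ~{estimate_vram_gb(size):.0f} GB VRAM")
# roughly: 7B ~ 4 GB, 14B ~ 8 GB, 32B ~ 18 GB, 70B ~ 40 GB
```

Those rough numbers line up with the size ranges discussed next, and the same arithmetic explains why an 8-bit quantization roughly doubles the requirement.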
18:29 You have models that are around seven or 8 billion parameters. Those are
18:33 generally the smallest that I'd recommend trying to run. There are a lot
18:37 of smaller LLMs available like 1 billion parameters or three billion parameters,
18:41 but I'm so unimpressed when I use those LLMs that I don't really want to focus
18:45 on them here. 7 billion parameters is still tiny compared to the large cloud
AI models like Claude or GPT, but you can get pretty good results with them
18:55 for just simple chat use cases. And so for these models, assuming a Q4
18:59 quantization, which I'll get into quantization in a little bit, it's
19:02 basically just a way to make the LLM a lot smaller without hurting performance
19:06 that much, a 7 billion parameter model will need about four to 5 GB of VRAM on
19:11 your graphics card. And so if you have something like a 3060 Ti from Nvidia
19:16 with 8 GB of VRAM, you can very comfortably run a 7 billion parameter
19:20 model and you can expect to get very roughly around 25 to 35 tokens per
19:27 second. A token is roughly equivalent to a word. And so your local large language
19:31 model at 7 billion parameters with this graphics card will get about 25 to 35
19:37 words per second out on the screen to you being streamed out. And then if you
19:42 use much more powerful hardware like a 3090 to run a 7 billion parameter model
19:47 then you'll just jack up the speed a lot more. So that's 7 billion or 8 billion
19:51 parameters. Another very common size is something around 14 billion parameters.
19:57 This will take about 8 to 10 GB of VRAM. And so just a couple of options for
this. You have the 4070 Ti Super, which has 16 GB of VRAM, or you could go as
20:08 low as 12 GB of VRAM with the 3080 Ti. And you could expect to get about 15 to
20:13 25 words per second. And then this is where you start to get into basic tool
20:17 calling. So I find that when you are building with a 7 billion parameter
20:21 model, they don't do tool calling very well. So you can't really build that
20:25 powerful of agents around a 7 billion parameter model. But once you get to
20:29 something around 14 billion parameters, that's when I see agents being able to
20:33 really accept instructions well around tools and system prompts and leveraging
20:37 tools to do things on our behalf. That's when we can really start to use LLMs to
20:42 make things that are agentic. And then the next big category of LLMs
is somewhere between 30 and 34 billion parameters. You see a lot of LLMs that
fall in that size range. This will typically need 16 to 20 GB of VRAM.
20:57 And so a 3090 is a really good example of a graphics card that can run this. It
21:03 has 24 GB of VRAM. I actually have two 3090s myself. And I'll have a link to my
21:08 exact PC that I built for running local AI in the description of this video. So
I have two 3090s, which we'll need in a second for a 70 billion parameter model, but
21:16 one is enough for a 32 billion parameter model. And then also Macs with their new
21:22 M4 chips are very powerful with their unified memory architecture. So if you
21:27 get a Mac M4 Pro with 24 GB of unified memory, you can also run 32 billion
21:32 parameter models. Now the speed isn't going to be the best necessarily, and
21:36 again, this does depend a lot on your computer overall, but you can expect
21:40 something around 10 to 15, maybe up to 20 tokens per second. and 32 billion
21:46 parameters is when you really start to see LLMs that are actually pretty
21:50 impressive. Like 7 billion and 14 billion, they are disappointing quite a
21:54 bit. I'll be totally honest. Especially when you try to use them with more
21:58 complicated agentic tasks. 32 billion when you start to get into this range is
when I'm actually genuinely impressed. I'm like, "Oh, this is actually pretty
22:05 close to the performance of some of the best cloud AI." And then 70 billion
parameters. This is going to take about 35 to 40 GB of VRAM. For most consumer
GPUs like 3090s and 4090s, even 5090s, that's not actually enough VRAM. And so
22:22 this is when you have to start to split a large language model across multiple
22:28 GPUs which solutions like Olama will actually help you do this right out of
22:31 the box. So it's not this insane setup even though it might feel kind of
22:35 daunting like oh I have to split the layers of my LLM between GPUs. It's not
actually that complicated. And so two 3090s or two 4090s, that will be necessary.
Or you could have more of an enterprise-grade GPU like an H100. So
Nvidia has a lot of these non-consumer grade GPUs that have a lot more VRAM to
22:53 handle things like 70 billion parameter models. And the speed won't be the best
if you're using something like two 3090s, especially because performance is hurt
when you have to split an LLM between GPUs. You could expect something like 8
to 12 tokens per second. And obviously, if you have the most complex
agents and you're really trying to match the performance of cloud AI as
much as possible, that's when you'd want to use a 70 billion parameter model. And
23:16 then if you're investing in hardware to run local AI, I have a couple of quick
23:20 recommendations here. And a lot of this depends on the size of the model that's
23:24 going to be good enough for your use case. And so I'll dive into some
23:28 alternatives for running local AI directly if you want to do testing
23:32 before you buy infrastructure. I'll get into that in a little bit, but
23:35 recommended builds. If you want to spend around $800 to build a PC, I'd recommend
getting a 4060 Ti graphics card and then 32 GB of RAM. If you want to spend
23:47 $2,000, I'd recommend either getting a PC with a 3090 and 64 GB of RAM or
23:53 getting that Mac M4 Pro with 24 GB of unified memory. And then lastly, if you
23:58 want to spend $4,000, which is about what I spent for my PC, then I'd
24:02 recommend getting two 3090 graphics cards, and I got both of mine used for
around $700 each. And then also getting 128 GB of RAM, or you can get a
24:15 Mac M4 Max with 64 GB of unified memory. So, I wanted to really get into the
24:19 nitty-gritty details there. So, I know I spent a good amount of time diving into
24:22 super specific numbers, but I hope this is really helpful for you. No matter the
24:26 large language model or your hardware, you now know generally where you're at
24:30 for what you can run. So, to go along with that information overload, I want
24:34 to give you some specifics, individual LLMs that you can try right now based on
24:38 the size range that you know will work for your hardware. So, just a couple of
recommendations here. The first one that I want to focus on is DeepSeek R1. This
24:47 is the most popular local LLM ever. It completely blew up a few months ago. And
24:52 the best part about DeepSeek R1 is they have an option that fits into each of
24:56 the size ranges that I just covered in that chart. So they have a 7 billion
25:00 parameter, 14, 32, and 70. The exact numbers that I mentioned earlier. And
25:05 then there is also the full real version of R1, which is 671 billion parameters.
25:10 I'm sorry though, you probably don't have the hardware to run that unless
25:13 you're spending tens of thousands on your infrastructure. So, probably stick
25:16 with one of these based on your graphics card or if you have a Mac computer, pick
25:19 the one that'll work for you and just try it out. You can click on any one of
25:23 these sizes here. And then here's your command to download and run it. And this
25:28 is defaulting to a Q4 quantization, which is what I was assuming in the
25:31 chart earlier. And again, I will cover what that actually means in a little bit
here. The other one that I want to focus on here is Qwen 3. This is a lot newer.
Qwen 3 is so good. And they don't have a 70 billion parameter option, but they do
have all the other sizes that fit into those ranges that I mentioned
25:49 earlier. Like they got 8 billion, 14 billion, and 32 billion parameters. And
25:52 the same kind of deal where you click on the size that you want and you've got
25:55 your command to install it here. And this is a reasoning LLM just like
26:01 DeepSeek R1. And then the other one that I want to mention here is Mistral Small.
26:05 I've had really good results with this as well. There are less options here,
26:08 but you've got 22 or 24 billion parameters, which is going to work well
26:12 with a 3090 graphics card or if you have a Mac M4 Pro with 24 GB of unified
26:18 memory. Really, really good model. And then also, there is a version of it that
is fine-tuned for coding specifically, called Devstral, which is another
26:26 really cool LLM worth checking out as well if you have the hardware to run it.
So, that is everything for just general recommendations for local LLMs to try
right now. This is the part of the master class that is going to become
outdated the fastest because there are new local LLMs coming out every single
month. I don't really know how long my recommendations will last for. But in
general, you can just go to the model list in Ollama, search through them,
find one that has the size that works with your graphics card, and just give it
a shot. You can install it and run it very easily with Ollama. And the other
thing that I want to mention here is you don't always have to run open-source
large language models yourself. You can use a platform like Open Router. You can
just go to openrouter.ai, sign up, add in some API credits. You can try these
open-source LLMs yourself, maybe if you want to see what's powerful enough for
27:15 your agents before you invest in hardware to actually run them yourself.
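Since Open Router also speaks the OpenAI-style chat-completions API, trying a hosted open-source model before buying hardware is the same kind of request, just aimed at their endpoint with an API key. A hedged sketch: the model id "qwen/qwen3-32b" is my best guess at their naming, so check their model page for the exact id, and the key below is a placeholder.

```python
import json
import urllib.request

OPENROUTER_KEY = "YOUR_API_KEY_HERE"  # placeholder; create one at openrouter.ai

# Same OpenAI-style body as a local Ollama call; only the URL and the
# auth header differ. The model id is an assumption -- copy the exact
# id from Open Router's model page.
def openrouter_request(model: str, prompt: str) -> urllib.request.Request:
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {OPENROUTER_KEY}",
        },
    )

req = openrouter_request("qwen/qwen3-32b", "Hello!")
print(req.full_url)
```

If the model performs well hosted, the same open-source weights pulled locally through Ollama should behave comparably on your own hardware, which is exactly the try-before-you-buy workflow described here.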
And so within Open Router, I can just search for Qwen here. And I can go down
to Qwen and I can go to 32 billion. They have a free offering as well that
27:26 doesn't have the best rate limits. So I'll just go to this one right here,
Qwen 3 32B. So I can try the model out through Open Router. They actually host
it for me. So it's an open-source non-local version, but now I can try it
27:39 in my agents to see if this is good. And then if it's good, it's like, okay, now
27:43 I want to buy a 3090 graphics card so that I can install it directly through
27:47 um Olama instead. And so the 32 billion quen 3 is exactly what we're seeing here
27:51 in open router. And there are other platforms like Grock as well where you
27:55 can run these open source large language models um not on your own infrastructure
27:58 if you just want to do some testing before beforehand or whatever that might
28:01 be. So I wanted to call that out as an alternative as well. But yeah, that's
28:04 everything for my general recommendations for LLMs to try and use
28:09 in your agents. All right, it is time to take a quick breather. This is
28:12 everything that we've covered already in our master class. What is local AI? Why
28:17 we care about it? Why it's the future and hardware requirements. And I really
28:20 wanted to dive deep into this stuff because it sets the stage for everything
28:24 that we do when we actually build agents and deploy our infrastructure. And so
28:28 the last thing that I want to do with you before we really start to get into
28:33 building agents and setting up our package is I want to talk about some of
28:37 the tricky stuff that is usually pretty daunting for anyone getting into local
28:42 AI. I'm talking about things like offloading models, quantization, and environment
28:47 variables to handle things like flash attention, all the stuff that is really
28:51 important that I want to break down simply for you so you can feel confident
28:55 that you have everything set up right, that you know what goes into using local
29:00 LLMs. The first big concept to focus on here is quantization. And this is
29:04 crucial. It's how we can make large language models a lot smaller so they
29:10 can fit on our GPUs without hurting performance too much. We are lowering
29:14 the model precision here. What that basically means is that
29:18 each of our parameters, all of the numbers for our LLMs, are 16 bits
29:23 at full size, but we can lower the precision of each of those parameters to
29:28 8, 4, or 2 bits. Don't worry if you don't understand the technicalities of
29:31 that. Basically, it comes down to LLMs are just billions of numbers. That's the
29:35 parameters that we already covered. And we can make these numbers less precise
29:40 or smaller without losing much performance. So, we can fit larger LLMs
29:45 within a GPU that normally wouldn't even be close to running the full-size model.
29:50 Like with 32 billion parameter LLMs, for example, I was assuming a Q4
29:55 quantization like four bit per parameter in that diagram earlier. If you had the
30:00 full 16 bit parameter for the 32 billion parameter LLM, there's no way it could
30:06 fit on your Mac or your 3090 GPU, but we can use quantization to make it
30:10 possible. It's like rounding a number with a long decimal to something
30:15 like 10.44 instead of something with like 10 decimal places, but we're
30:19 doing it for each of the billions of parameters, those numbers that we have.
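To make the size math concrete, here is a quick back-of-the-envelope sketch in Python. The 20% overhead factor is my own rough assumption for the KV cache and runtime, not an official formula:

```python
def model_size_gb(params_billion: float, bits_per_param: int,
                  overhead: float = 1.2) -> float:
    """Rough memory estimate: parameters * bits per parameter / 8 bits per byte,
    padded ~20% for the KV cache and runtime (an assumed fudge factor)."""
    return params_billion * bits_per_param / 8 * overhead

# A 32 billion parameter model at full 16-bit precision vs. Q4 quantization:
print(round(model_size_gb(32, 16), 1))  # 76.8 GB -> nowhere near a 24 GB 3090
print(round(model_size_gb(32, 4), 1))   # 19.2 GB -> fits in 24 GB of VRAM
```

This is why Q4 is the sweet spot: same parameter count, roughly a quarter of the memory.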
30:23 And so just to give you a visual representation of this, you can also
30:27 quantize images just like you can quantize LLMs. And so we have our full
30:31 scale image on the left-hand side here comparing it to different levels of
30:35 quantization. We have 16-bit, 8-bit, and 4-bit. And you can see that at first with
30:40 a 16-bit quantization, it almost looks the same. But then once we go down to
30:44 4-bit, you can very much see that we have a huge loss in quality for the image.
30:49 Now with images, it's more extreme than with LLMs. When we do an 8-bit or 4-bit
30:54 quantization of an LLM, we don't lose nearly as much performance as we lose
30:58 quality with images. And so that's why it's so useful for us. And so I have
31:01 a table just to kind of describe what this looks like. So FP16, that's the
31:07 16-bit precision that all LLMs have as a base. That is the full size. The speed
31:11 is obviously going to be very slow because the model is a lot bigger, but
31:16 your quality is perfect compared to what it could be. I mean, obviously that
31:18 doesn't mean that you're going to get perfect answers all the time. I'm just
31:22 saying it's the 100% result from this LLM. And then going down to a Q8
31:28 precision, it's half the size. The speed is going to be a lot better. And
31:33 the quality is near-perfect. So it's not like performance is cut in half just
31:37 because size is. You still have the same number of parameters. Each one is just a
31:42 bit less precise. And so you're still going to get almost the same results.
31:47 And then going down to Q4, 4-bit, it's a fourth of the size. It's going to be very
31:52 fast compared to 16-bit. And the quality is still going to be great. Now, these
31:57 numbers are vague on purpose. There's no precise way for me to
32:01 quantify the exact difference, especially because it changes per LLM
32:05 and your hardware and everything like that. So, I'm just being very general
32:09 here. And then once you get to Q2, the size goes down a lot. It's going to
32:13 be very, very fast, but usually your performance starts to go down quite a
32:17 bit once you go down to Q2. And then like the note that I have in the bottom
32:22 left here, a Q4 quantization is generally the best balance. And so when
32:26 you are thinking to yourself, which large language model should I run? What
32:31 size should I use? My rule of thumb is to pick the largest large language model
32:37 that can work with your hardware with a Q4 quantization. That is why I assumed
32:42 that in the table earlier. And then also, like we saw in Ollama earlier, it always
32:47 defaults to a Q4 quantization because the 16-bit is just so big compared to Q4
32:52 that you couldn't even run most of the LLMs yourself. And a Q4 of a 32 billion
32:59 parameter model is still going to be a lot more powerful than the full 7
33:02 billion or 14 billion parameter model because you don't actually
33:07 lose that much performance. So that is quantization. So just to make this very
33:11 practical for you, I'm back here in the model list for Qwen 3. We have all these
33:15 models that don't specify a quantization, but we can see that it
33:20 defaults to Q4 because if I click on any one of them, the quantization right here
33:26 is Q4_K_M. And don't worry about the K_M. That's just a way to group
33:30 parameters. You have K_S, K_M, and K_L. It's kind of outside of the scope of
33:33 what really matters for you. The big thing is the Q4, the actual number
33:38 here. So Q4 quantization is the default for Qwen 3 32B and really any model in
33:44 Ollama. But if we want to see the other quantized variants and we want to run
33:48 them, we can click on "View all". This is available no matter the LLM that
33:52 you're seeing in Ollama. Now we can scroll through and see all the levels of
33:56 quantization for each of the parameter sizes for Qwen 3. So, if I scroll all
34:01 the way down, the absolute biggest version of Qwen 3 that I can run is the
34:08 full 16-bit of the 235 billion parameter Qwen 3. And it is a whopping 470 GB just
34:14 to install this. And there is no way that you're ever going to lay hands on
34:17 infrastructure to run this unless you're working for a very large enterprise. But
34:22 I can go down here, let's say, to 14 billion parameters and I can run the Q4
34:27 like this. So, you can click on any one that you want to run. Let's say I
34:30 want to run Q8. I can click on this and then I have the command to pull and run
34:35 this specific quantization of the 14 billion parameter model. So each of the
34:39 quantized variants has a unique ID within Ollama. So you can very
34:42 specifically choose the one that you want. Again, my general recommendation is
34:47 just to go with what Ollama recommends, which is defaulting to
34:51 Q4. Like if I go to DeepSeek R1, you can see that it also defaults to Q4 no matter
34:56 the size that I pick. But if you do want to explore different quantizations, you
35:00 want to try to run the absolute full model for maybe something smaller like 7
35:04 billion or 14 billion, you can definitely do that through Ollama and
35:08 really any other provider of local LLMs. So that is everything for quantization.
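On the command line, pulling a specific quantized variant looks something like this. The tag names are illustrative; check the model's "View all" page in the Ollama library for the exact tags:

```shell
# Pulling the library default gets the Q4_K_M quantization:
ollama pull qwen3:32b

# Or pick an explicit quantized variant by its tag, e.g. an 8-bit 14B build:
ollama pull qwen3:14b-q8_0
ollama run qwen3:14b-q8_0
```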
35:12 It's important to know how that works, but yes, generally stick with a Q4 of
35:17 the largest LLM that you can run. The next concept that is very important to
35:21 understand is offloading. Offloading is simply splitting the layers of your
35:26 large language model between your GPU and your CPU and RAM. It's kind of
35:30 crazy, but large language models don't have to fit entirely in your GPU. All
35:36 large language models can be split into layers, layers of the different weights,
35:40 and you can have some of it running on your GPU, so it's stored in your VRAM
35:45 and computed by the GPU, and then some of the large language model stored in
35:50 your RAM, computed by the CPU. Now, this does hurt performance a lot. And so,
35:55 generally, you want to avoid offloading if you can. You want to be able to fit
35:59 everything in your GPU, which by the way, the context, like your prompts for
36:04 your local LLMs, that is also stored in VRAM. And so, sometimes you'll see what
36:08 happens when you have very long conversations for a large language model
36:13 that barely fit in your GPU. That'll actually tip it over the edge. So, it
36:16 starts to offload some of it to the CPU and RAM. So keep that in mind when you
36:19 have longer conversations and all of a sudden things get really slow, you know
36:24 that offloading is happening. Sometimes this is necessary though as context
36:28 grows. And if you're only offloading a little bit of the LLM or a little bit of
36:32 the conversation, whatever to the CPU and RAM, it won't affect performance
36:36 that much. And so sometimes if you're trying to squeeze the biggest size you
36:41 can into your machine for an LLM, you can take advantage of offloading to run
36:45 something bigger or have a much larger conversation. Just know that usually it
36:49 kind of sucks. Like when I have offloading start to happen, my machine
36:53 gets bogged down and the responses are a lot slower. It's really not fun, but it
36:59 is possible. And fun fact, by the way, if your GPU is full and your CPU and RAM
37:04 is full, you can actually offload to storage, like literally using your hard
37:07 drive or SSD. That's when it's like incredibly slow and just terrible. But
37:11 just fun fact, you can actually do that. Now, the very last thing that I want to
37:15 cover before we dive into some code, setting up the local AI package, and
37:21 building out some agents is a few very crucial parameters, environment
37:25 variables for Ollama. So, these are environment variables that you can set
37:28 on your machine just like any other, based on your operating system. And
37:32 Ollama does have an FAQ for setting up some of these things, which I'll link to
37:36 in the description as well. But yeah, these are a bit more technical, so
37:40 people skip past setting this stuff up a lot, but it's actually really, really
37:44 important to make things very efficient when running local LLMs. So the first
37:49 environment variable is flash attention. You want to set this to one or true.
37:54 When you have this set to true, it's going to make the attention calculation
37:59 a lot more efficient. It sounds fancy, but basically large language models when
38:04 they are generating a response, they have to calculate which parts of your
38:08 prompt to pay the most attention to. That's the calculation. And you can make
38:12 it a lot more efficient without losing much performance at all by setting
38:16 flash attention to true. And then for another optimization,
38:21 just like we can quantize the LLM itself, you can also quantize or
38:27 compress the context. So your system prompt, the tool descriptions, your
38:31 prompt and conversation history, all that context that's being sent to your
38:36 LLM, you can quantize that as well. So Q4 is my general recommendation for
38:41 quantizing LLMs. Q8 is the general recommendation for quantizing the
38:46 context memory. It's a very simplified explanation, but it's really, really
38:50 useful because a long conversation can also take a lot of VRAM, just like a larger
38:55 LLM. And so it's good to compress that. And then the third environment variable,
38:58 this is actually probably the most crucial one to set up for Ollama. There
39:02 is this crazy thing. I don't know why Ollama does it, but by default, they
39:07 limit every single large language model to 2,048 tokens for the context limit,
39:13 which is just tiny compared to, you know, Gemini being 1 million tokens and
39:17 Claude being 200,000 tokens. They handle very, very large prompts. And a
39:21 lot of local large language models can also handle large prompts. But Ollama
39:25 will limit you to 2,048 tokens by default. And so you have to override that
39:30 yourself with this environment variable. And so generally I recommend starting
39:34 with about 8,192 tokens. You can move this all the way up to
39:38 something like 32,000 tokens if your local large language model supports
39:42 that. And if you view the model page on Ollama, you can see the context length
39:46 that's supported by the LLM. But you definitely want to, you know, jack this
39:50 up from just 2,048 because a lot of times when you have longer
39:53 conversations, you're going to get past 2,048 tokens very, very quickly. So, do
39:57 not miss this. If your large language model is starting to go completely off
40:02 the rails and ignore your system prompt and forget that it has these tools that
40:06 you gave it, it's probably because you reached the context length. And so, just
40:10 keep that in mind. I see people miss this a lot. And then the very last
40:14 environment variable, probably the least important out of all these four:
40:18 if you're running a lot of different large language models at once and you're
40:22 trying to shove them all into your GPU, a lot of times you can have issues. And so
40:25 in Ollama, you can limit the number of models that are allowed to be in your
40:29 memory at a single time. With this one, typically you want to set it to either
40:33 one or two. Definitely set this to just one if you are using large language
40:37 models that just barely fit on your GPU, like it's going to fit exactly into
40:41 your VRAM and you're not going to have room for another large language model.
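These Ollama environment variables can be set like any others on your OS; here is a sketch for Linux or macOS. The variable names come from the Ollama FAQ and may differ between versions, so verify them there:

```shell
# More efficient attention calculation
export OLLAMA_FLASH_ATTENTION=1
# Quantize the context (KV cache) to 8-bit
export OLLAMA_KV_CACHE_TYPE=q8_0
# Raise the tiny default context limit to 8,192 tokens
export OLLAMA_CONTEXT_LENGTH=8192
# Only keep one model loaded in VRAM at a time
export OLLAMA_MAX_LOADED_MODELS=1
```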
40:44 But if you are running more smaller ones and maybe you could actually fit two on
40:48 your GPU with the VRAM that you have, you can set this to two. So again, more
40:52 technical overall, but it's very important to have these right. And we'll
40:55 get into the local AI package where I already have these set up in the
40:59 configuration. And then by the way, this is the Ollama FAQ that I referenced a
41:02 minute ago that I'll have linked in the description. And so there's actually a
41:06 lot of good things to read into here, like being able to verify that your GPU
41:10 is compatible with Ollama, or how you can tell if the model's actually loaded on
41:13 your GPU. So, a lot of sanity-check things that they walk you through in the
41:17 FAQ as well. It also talks about environment variables, which I just
41:20 covered. And so, they've got some instructions here depending on your OS
41:23 how to get those set up. So, if there's anything that's confusing to you, this
41:26 is a very good resource to start with. So, I'm trying to make it possible for
41:30 you to look into things further if there's anything that doesn't quite make
41:33 sense for what I explained here. And of course, always let me know in the
41:35 comments if you have any questions on this stuff as well, especially the more
41:39 technical stuff that I just got to cover because it's so important even though I
41:43 know we really want to dive into the meat of things, which we are actually
41:47 going to do now. All right, here is everything that we have covered at this
41:50 point. And congratulations if you have made it this far because I covered all
41:55 the tricky stuff with quantization and the hardware requirements and offloading
41:59 and some of our little configuration and parameters. So, if you got all of that,
42:03 the rest of it is going to be a walk in the park as we start to dive into code,
42:07 getting all of our local AI set up and building out some agents. You understand
42:10 the foundation now that we're going to build on top of to make some cool stuff.
42:15 And so, now the next thing that we're going to do is talk about how we can use
42:19 local AI anywhere. We're going to dive into OpenAI compatibility and I'll show
42:23 you an example. We can take something that is using OpenAI right now and
42:27 transform it into something that is using Ollama and a local LLM. So, we'll
42:31 actually dive into some code here. And I've got my fair share of no code stuff
42:35 in this master class as well, but I want to focus on both because I think it's
42:38 really important to use both code and no code whenever applicable. And that
42:42 applies to local AI just like building agents in general. So, I've already
42:45 promised a couple of times that I would dive into OpenAI API compatibility, what
42:50 it is, and why it's so important. And we're going to dive into this now so you
42:54 can really start to see how you can take existing agents and transform them into
42:59 being 100% local with local large language models without really having to
43:03 touch the code or your workflow at all. It is a beautiful thing because OpenAI
43:10 has created a standard for exposing large language models through an API.
43:14 It's called the chat completions API. It's kind of like how model context
43:19 protocol MCP is a standard for connecting agents to tools. The chat
43:23 completions API is a standard for exposing large language models over an
43:28 API. So you have this common endpoint along with a few other ones that all of
43:35 these providers implement. This is the way to access the large language model
43:39 to get a response based on some conversation history that you pass in.
43:43 So, Ollama is implementing this as of February. We have other providers too:
43:49 Gemini is OpenAI-compatible, Groq is, and so is OpenRouter, which we saw earlier.
43:53 Almost every single provider is OpenAI API-compatible. And so, not only is it
43:57 very easy to swap between large language models within a specific provider, it's
44:02 also very easy to swap between providers entirely. You can go from Gemini to
44:09 OpenAI, or OpenAI to Ollama, or OpenAI to Groq, just by changing basically one
44:13 piece of configuration pointing to a different base URL as it is called. So
44:18 you can access that provider and then the actual API endpoint that you hit
44:22 once you are connected to that specific provider is always the exact same and
44:26 the response that you get back is also always the exact same. And so Ollama has
44:31 this implemented now. And I'll link to this article in the description as well
44:33 if you want to read through it, because they have a really neat Python example.
44:37 It shows that when we create an OpenAI client, the only thing we have to do
44:42 to connect to Ollama instead of OpenAI is change this base URL. So now we are
44:47 pointing to Ollama, hosted locally, instead of pointing to the URL for
44:51 OpenAI, where we'd reach out over the internet and talk to their LLMs. And
44:55 then with Ollama, you don't actually need an API key because everything's running
44:58 locally. So you just need some placeholder value here. But there is no
45:02 authentication going on. You can set that up; I'm not going to dive into
45:05 that right now. But by default, because it's all just running locally, you don't
45:09 even need an API key to connect to Ollama. And then once we have our OpenAI
45:13 client set up that is actually talking to Ollama, not OpenAI, we can use it in
45:18 exactly the same way. But now we can specify a model that we have downloaded
45:22 locally already through Ollama. We pass in our conversation history in the same way
45:27 and we access the response, like the content the AI produced, the token usage,
45:31 all those things that we get back from the response, in the same way.
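To illustrate why the swap is so trivial, here is a minimal sketch (my own illustration, not the article's code): the endpoint path and request body are identical for every OpenAI-compatible provider, so only the base URL changes. The model names here are just examples.

```python
import json

def chat_request(base_url: str, model: str, messages: list) -> tuple:
    """Build the URL and JSON body for a chat completions call.
    The /chat/completions path is the same for every compatible provider."""
    url = base_url.rstrip("/") + "/chat/completions"
    body = json.dumps({"model": model, "messages": messages}).encode()
    return url, body

msgs = [{"role": "user", "content": "Hello!"}]
openai_url, _ = chat_request("https://api.openai.com/v1", "gpt-4.1-nano", msgs)
ollama_url, _ = chat_request("http://localhost:11434/v1", "qwen3:14b", msgs)
# Only the host differs; the endpoint and the payload shape are identical.
```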
45:34 They've got a JavaScript example as well. They have a couple of examples
45:38 using different frameworks like the Vercel AI SDK and AutoGen. Really, any
45:44 AI agent framework can work with OpenAI API compatibility, making it very easy
45:47 to swap between these different providers. Like Pydantic AI, my favorite
45:52 AI agent framework, also supports OpenAI API compatibility. So you can easily
45:57 swap between these different providers within your Pydantic AI agents. And
46:02 so what I have for you now is two code bases that I want to cover. The first
46:07 one is the local AI package, which we'll dive into in a little bit. But right
46:13 now, we have all of the agents that we are going to be creating in this master
46:17 class. So I have a couple for N8N that are also available in this repository.
46:21 And then a couple of scripts that I want to share with you as well. And so the
46:25 very first thing that I want to show you is this simple script that I have called
46:31 OpenAI compatible demo. And so you can download this repository. I'll have this
46:34 linked in the description as well. There's instructions for downloading and
46:38 setting up everything in here. And this is all 100% local AI. And so with that,
46:43 I'm going to go over into my Windsurf here where I have this OpenAI compatible
46:47 demo set up. So I've got a comment at the top reminding us what the OpenAI API
46:52 compatibility looks like. We set our base URL to point to Ollama hosted
46:59 locally, and it's hosted on port 11434 by default. So I can actually show you
47:02 this. I have Ollama running in a Docker container, which we're going to dive
47:05 into when we set up the local AI package, but you can see that it is
47:11 being exposed on port 11434. And by the way, you can see the
47:14 127.0.0.1 in that URL that I have highlighted here; that is synonymous with localhost.
47:21 And so this right here, you could also replace with 127.0.0.1.
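If you want to sanity-check that endpoint yourself, a raw curl request might look like this (the model name is just an example; it needs to be one you have already pulled):

```shell
# Hit Ollama's OpenAI-compatible endpoint directly on the default port:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:14b",
    "messages": [{"role": "user", "content": "Say hello in five words."}]
  }'
```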
47:25 Just a little tidbit there. It's not super important. I just typically leave
47:28 it as localhost. And then you can change the port as well. I'm just sticking to
47:32 what the default is. And then again, we don't need to set our API key. We can
47:36 just set it to any value that we want here. We just need some placeholder even
47:39 though there is no authentication with Ollama unless you configure
47:43 that. So that's OpenAI compatibility. And the important thing with this script
47:47 here is that I have two different configurations: one for
47:51 talking to OpenAI and one for Ollama. So with OpenAI, we set our base
47:57 URL to point to api.openai.com. We have our OpenAI API key set in our
48:01 environment variables. So you can just set all your environment variables here
48:05 and then rename this to .env. I've got instructions for that in the readme, of
48:08 course. And then going back to the script, we are using GPT-4.1 Nano for our
48:13 large language model. It's something super fast and cheap. And then for our
48:17 Ollama configuration, we are setting the base URL here, localhost:11434,
48:22 or just whatever we have set in our environment variables. Same thing for
48:26 the API key. And then same thing for our large language model. And what I'm going
48:31 to be using in this case is Qwen 3 14B. That is one of the large language models
48:34 that I showed you on the Ollama website. Definitely a smaller one
48:38 compared to what I could run, but I just want to run something fast. And very
48:41 small large language models are great for simple tasks like summarization or
48:45 just basic chat. And that's what I'm going to be using here just for a simple
48:49 demo. And so whether it's enabled or not, this configuration is just based on
48:53 what we have set for our environment variables. And the important thing here
48:59 is the code that runs for each of these configurations just as we go through
49:03 this demo is exactly the same. We are parameterizing the configuration for the
49:08 base URL and API key. So we are setting up the exact same OpenAI client just
49:13 like we saw in the Ollama article, just changing the base URL and API key.
49:17 And so then, for example, when we use it right here, it's
49:22 client.chat.completions.create, calling the exact same function no matter if we're
49:26 using OpenAI or Ollama. And then we're handling the response in the same way as
49:31 well. And so I'll go back to my terminal now. And so I went through all the steps
49:34 already to set up my virtual environment, install all of my dependencies. And so now I can run the
49:40 command OpenAI compatible demo. And now it's going to present the two
49:43 configuration options for me. And so I can run through OpenAI. So we'll go
49:46 ahead and do that first. And these two demos are going to look exactly the
49:50 same, but that is the point. And so we have our base URL here for OpenAI. We
49:54 have a basic example of a completion with GPT-4.1 Nano. There we go. So this
49:59 is the model that was used. Here are the number of tokens. And this is our
50:03 response. And then I can press enter to see a streaming response now as well. So
50:07 we saw it type out our answer in real time. And then I can press enter one
50:10 more time. This is the last part of the demo: a multi-turn conversation.
50:14 So we got a couple of messages here in our conversation history. So very nice
50:19 and simple. The point here is to now show you that I can run this and select
50:23 Ollama now instead, and everything is going to look exactly the same, and all
50:27 of the code is the same as well. It is only our configuration that is
50:31 different. And so it will take a little bit when you first run this because
50:36 Ollama has to load the large language model into your GPU. And so, going to the
50:42 logs for Ollama, I can show you what this looks like. And so when we first
50:47 make a request when Qwen 3 14B is not loaded into our GPU yet, you're going to
50:52 see a lot of logs come in here, and you'll have this container up and
50:54 running when you have the local AI package, which we'll cover in a little
50:57 bit. So it shows all the metadata about our model, like that it's Qwen 3 14B. We can
51:05 see here that we have a Q4_K_M quantization like we saw on the Ollama
51:09 website. What other information do we have here? There's just so much to
51:14 digest here. Yeah, another really important thing is we have the
51:19 context length. I have that set to 8,192, just like I recommended in the
51:22 environment variables. And then we can see that we offloaded all of the layers
51:26 to the GPU. So I don't have to do any offloading to the CPU or the RAM. I can
51:30 keep everything in the GPU, which is certainly ideal, like I said, to make
51:34 sure this is actually fast. And then when we get a response from Qwen 3 14B,
51:41 we are calling the v1/chat/completions endpoint because it is OpenAI
51:46 API-compatible. So that exact endpoint that we hit for OpenAI is the one that we are
51:50 hitting here with a large language model that is running entirely on our computer
51:54 in Ollama. And so the response I get back, it's actually a reasoning LLM as
51:58 well. So we even have the thinking tokens here, which is super cool. And so
52:02 we got our response. It's just printing out the first part of it here just to
52:04 keep it short. And then I can press enter. And we can see a streaming demo
52:08 as well. And it's going to be a lot faster this time because we do already
52:11 have the model loaded into our GPU. And so that first request when it first has
52:15 to load a model is always the slower one. And then it's faster going forward
52:19 once that model is already loaded in our GPU. And then as long as we don't swap
52:24 to another large language model and use that one, then it will remain in our GPU
52:28 for some time. And so then all of our responses after are faster. And then we
52:33 just have the last part of our demo here with a multi-turn conversation. So we
52:37 can see conversation history in action as well, just not with streaming here.
52:40 And everything's a bit slower with this large language model because
52:43 it is a reasoning one. And so if you want faster
52:47 inference, you can always use a non-reasoning local LLM like Mistral or
52:52 Gemma, for example. So that is our very simple demo showing how this works. I
52:55 hope that you can see with this, and again, this works with other AI agent
52:59 frameworks like Agno or Pydantic AI or CrewAI as well. They all work in
53:03 this way, where you can use OpenAI API compatibility to swap between providers
53:08 so easily so you don't have to recreate things to use local AI and that's
53:11 something so important that I want to communicate with you because if I'm the
53:15 one introducing you to local AI I also want to show you how it can very easily
53:19 fit into your existing systems and automations. All right. Now, we have
53:23 gotten to the part of the local AI master class that I'm actually the most
53:27 excited for because over the past months, I have very much been pouring my
53:31 heart and soul into building up something to make it infinitely easier
53:35 for you to get everything up and running for local AI. And that is the local AI
53:40 package. And so, right now, we're going to walk through installing it step by
53:44 step. I don't want you to miss anything here because it's so important to get
53:47 this up and running, get it all working well. Because if you have the local AI
53:51 package running on your machine and everything is working, you don't need
53:55 anything else to start building AI agents running 100% offline and
53:59 completely private. And so here's the thing. At this point, we've been
54:04 focusing mostly on Ollama and running our local large language models. But there's
54:08 a whole other component to local AI that I introduced at the start of the
54:13 master class: our infrastructure. Things like our database, local and
54:18 private web search, our user interface, agent monitoring. We have all these
54:23 other open-source platforms that we also want to run along with our large
54:27 language models and the local AI package is the solution to bring all of that
54:32 together curated for you to install in just a few steps. So, here is the GitHub
54:37 repository for the local AI package. I'll have this linked in the description
54:41 below. Just to be very clear, there are two GitHub repos for this master class.
54:45 We have this one that we covered earlier. This has our N8N and Python
54:49 agents that we'll cover in a bit, as well as the OpenAI compatible demo that
54:53 we saw earlier. So, you want to have this cloned and the local AI package as
54:57 well. Very easy to get both up and running. And if you scroll down in the
55:02 local AI package, I have very comprehensive instructions for setting
55:06 up everything, including how to deploy it to a private server in the cloud,
55:10 which we'll get into at the end of this master class, and a troubleshooting
55:13 section at the bottom. So, everything that I'm about to walk you through here,
55:17 there's instructions in the readme as well if you just want to circle back to
55:21 clarify anything. Also, I dive into all of the platforms that are included in
55:26 the local AI package. And this is very important because like I said, when you
55:30 want to build a 100% offline and private AI agent, it's a lot more than just the
55:35 large language model. You have all of the accompanying infrastructure like
55:39 your database and your UI. And so I have all of that included. First, I have
55:44 n8n, which is our low/no-code workflow automation platform. We'll be building
55:48 an agent with n8n in the local AI package in a little bit once we have it
55:52 set up. We have Supabase for our open-source database. We have Ollama. Of
55:56 course, we want to have this in the package as well for our LLMs. Open WebUI,
56:01 which gives us a ChatGPT-like interface for us to talk to our LLMs and
56:06 have things like conversation history. Very, very nice. So, we're looking at
56:09 this right here. This is included in the package. Then we have Flowise. It's
56:13 similar to n8n. It's another really good tool to build AI agents with no/low
56:18 code. Qdrant, which is an open-source vector database. Neo4j, which is a
56:25 knowledge graph engine. Then SearXNG for open-source, completely free and
56:31 private web search. Caddy, which is going to be very important for us once
56:35 we deploy the local AI package to the cloud and we actually want to have
56:38 domains for our different services like n8n and Open WebUI. And then the last
56:42 thing is Langfuse, an open-source LLM engineering platform that helps
56:47 us with agent observability. Now, some of these services are outside the scope
56:52 for this local AI master class. I don't want to spend a half hour on every
56:56 single one of these services and make this a 10-hour video. I will be focusing
57:02 in this video on n8n, Supabase, Ollama, Open WebUI, SearXNG, and then Caddy once
57:08 we deploy everything to the cloud. So, I do cover about half of these services.
57:12 And the other thing that I want to touch on here is that there are quite a few
57:16 things included here. And so you do need about 8 GB of RAM on your machine or
57:21 your cloud server to run everything. It is pretty big overall. And so you can
57:26 remove certain things. If you don't want Qdrant and Langfuse, for example,
57:30 you can take those out of the package. More on that later. It doesn't have to
57:34 be super bloated; you can whittle this down to what you need. But yeah, there's
57:37 a lot of different things that go into building AI agents. And so I have all of
57:40 these services here so that no matter what you need, I've got you covered. And
57:44 so with that, we can now move on to installing the local AI package. And
57:48 these instructions will work for you on any operating system, any computer. Even
57:52 if you don't have a really good GPU to run local large language models, you
57:56 still could always use OpenAI or Anthropic, something like that, and then
58:00 run everything else locally to save on costs or just to have everything running
58:04 on your computer. And so there are a couple of prerequisites that you have to
58:08 have before you can do the instructions below. You need Python so you can run
58:12 the start script that boots everything up. Git or GitHub desktop so you can
58:16 clone this GitHub repository, bring it all onto your own machine. And then you
58:21 want Docker or Docker Desktop. And so I've got links for all of these. Docker
58:25 and Docker Desktop we need because all of these local AI services that I've
58:29 curated for you, they all run as individual Docker containers that are
58:34 all combined together in a stack. And so I'll actually show you this is the end
58:36 result once we have everything up and running within your Docker Desktop. You
58:41 have this local AI Docker Compose stack that has all of the services running in
58:45 tandem, like Supabase, Redis, n8n, Flowise, Caddy, and Neo4j. All of
58:50 these are running within this stack. That is what we're working towards right
58:54 now. And so make sure you have all these things installed. I've got links that'll
58:57 take you to installing no matter your operating system. Very easy to get all
59:01 of this up and running on your machine. Then we can move on to our first command
59:05 here, which is to clone this GitHub repository, bringing all of this code on
59:10 your machine so you can get everything running. And so you want to open up a
59:14 new terminal. So I've got a new PowerShell session open here. Going to
59:18 paste in this command. And I'm going to be doing this completely from scratch
59:22 with you. So, you clone the repo, and then I'm just going to change my directory
59:26 into the local AI package folder, which was just created by this git clone command. So
59:31 those are the first two steps. The next thing is we have to configure all of our
59:36 environment variables. And believe it or not, this is actually the longest part
59:41 of the process. And once we have this taken care of, it's a breeze getting the
59:44 rest of this up and running. But there's a lot of configuration that we have to
59:49 set up for our different services, like credentials for logging into our
59:54 Supabase dashboard or Neo4j, and things like our Supabase anonymous key and
59:59 service role key. All these things we have to configure. And so within our terminal
60:04 here, you can run code . to open this within VS Code or Windsurf. I'll open this in
60:09 Windsurf. You just want to open up this folder within your IDE; the specific
60:14 IDE that you use really doesn't matter. You just want to get to this .env.example
60:20 file. I'm going to copy it and then I'm going to paste it. And then I'm going to
60:24 rename this to .env. So, we're taking the example file and turning it into a .env file. So,
60:31 you want to make sure that you copy it and rename it like this. Then we can go
60:36 ahead and start setting all of our configuration. And I'll even zoom in on
60:39 this just so that it's very easy for you to see everything that we are setting up
60:44 here. So, first up, we have a couple of credentials for N8N. We have our
60:49 encryption key and our JWT secret. And it's very easy to generate these. In
60:53 fact, we'll be doing this a couple of times, but we'll use this openssl
60:58 command to generate a random 32-character alphanumeric string that
61:02 we're going to use for things like our encryption key and JWT secret. And so,
61:08 OpenSSL is a command that is available by default on Linux and macOS.
61:12 You can just open up any terminal and run this command and it'll spit out a
61:16 long string that you can then just paste in for this value. For Windows, you
61:20 can't just open up any terminal and use OpenSSL, but you can use Git Bash, which
61:26 is going to come with GitHub Desktop when you install it. And so, I'll go
61:29 ahead and just search for that. If you just go to your search bar on your
61:32 bottom left on Windows and search for Git Bash, it's going to open up this
61:37 terminal like this. And so, I can go ahead and copy this command, go in here,
61:42 and paste it in. And then I can run it. And boom, there we go. I
61:45 know it's really small for you to see right now. I'm going to go ahead and
61:48 copy this because this is now the value that I can use for my encryption key.
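As a quick reference, generating one of these 32-character secrets boils down to a single command. The exact flags here are my sketch (16 random bytes rendered as 32 hex characters); the repo's .env.example documents the exact command it expects:

```shell
# OpenSSL: built in on Linux/macOS; on Windows, run it inside Git Bash.
openssl rand -hex 16

# Python fallback if OpenSSL isn't available on your machine.
python3 -c "import secrets; print(secrets.token_hex(16))"
```

Run it once per secret, so the encryption key and the JWT secret each get a unique value.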
61:52 And then you want to do the exact same thing to generate a JWT secret. And then
61:57 the other way that you can do this if you don't want to install git bash or
62:01 it's not working for whatever reason, you can use Python to generate this as
62:05 well. So I can just copy this command and then I can go into the terminal here
62:10 and I can just paste this in. And so it's going to just like with OpenSSL
62:15 generate this random 32 character string that I can copy and then use for my JWT
62:20 secret. There we go. And so I am going to get in the weeds a little bit here
62:24 with each of these different parameters, but I really want to make sure that I'm
62:27 clear on how to set up everything for you so you can really walk through this
62:31 step by step with me. And like I said, setting up the environment variables is
62:35 the longest part by far for getting the local AI package set up. So if you bear
62:39 with me on this, you get through this configuration, you will have everything
62:43 running that you need for local AI for the LLMs and your infrastructure. So
62:47 that's everything for n8n. Now we have some secrets for Supabase. And there
62:53 are some instructions in the Supabase documentation for how to get some of
62:57 these values. So it's this link right here, which I have open in my
63:01 browser. We'll reference this in a little bit. But first, we can
63:05 set up a couple of other things. The first thing we need to define is our
63:11 Postgres password. Supabase uses Postgres under the hood for the
63:14 database. And so, we want to set a password here that we'll use to connect
63:19 to Postgres within n8n, or in a connection string that we have for our Python code,
63:23 whatever that might be. And this value can be really anything that you want.
63:26 Just note that you have to be very careful about using special characters like
63:31 percent symbols. So if you ever have any issues with Postgres, it's probably
63:36 because you have special characters that are throwing it off. That's something
63:39 that I've seen happen quite a few times. And so, like I said, I want to mention
63:42 troubleshooting steps to make sure that everything is very clear for you. So
63:47 for this Postgres password here, I'm just going to say test Postgres pass.
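On that note about special characters: the reason something like % causes trouble is that Postgres connection URIs treat it as an escape character, so it has to be percent-encoded. A quick way to see the encoded form of a password ('p%ss' below is just a hypothetical example):

```shell
# Percent-encode a string for safe use inside a connection URI.
python3 -c "from urllib.parse import quote; print(quote('p%ss', safe=''))"
# prints: p%25ss
```

Sticking to plain letters and digits, like the test password here, sidesteps the issue entirely.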
63:51 I'm just going to give some kind of random value here. Just end with a
63:54 couple of numbers. I don't care that I'm exposing this information to you because
63:58 this is a local AI package. These passwords are for services that never
64:02 leave my computer. So, it's not like you could hack me by connecting to anything
64:08 here. And then we have a JWT secret. And this is where we get into this link
64:13 right here in the Supabase docs. They walk you through generating a JWT
64:18 secret and then using that to create both your anonymous and your service
64:22 role keys. If you're familiar with Supabase at all, we need both of these
64:27 pieces of information. The anonymous key is what we share with our front end. This
64:30 is our public key. And then the service role key has all permissions for
64:33 Supabase. We'll use this in our backends for things like our agents. And
64:39 so you can just go ahead and copy this JWT secret.
64:43 And then you can paste it in right here. This is 32 characters long, just
64:47 like the things that we generated with OpenSSL. I'm just going to use
64:52 exactly what Supabase tells me to. And then what you can do with this is you
64:56 can select the anonymous key, click on generate JWT, and then I can copy this
65:02 value, and then I will paste this in for my anonymous token. And so I'm just
65:06 replacing the default value there for the anonymous key. And then going back
65:10 and selecting the service key, I'm going to generate that one as well. It
65:13 looks very similar. They'll always start with "eyJ", but these values are different
65:18 if you go towards the end. And so I'll go ahead and paste this in for my service
65:22 role key. Boom. There we go. All right. And then for the Supabase dashboard
65:27 that we'll log into to see our tables and our SQL editor and authentication
65:31 and everything like that, we have our username here, which I'm just going to
65:35 keep as supabase. And then for the password, I can just say test supabase
65:39 pass. I'll just kind of use that as my common nomenclature here for my
65:42 passwords, because I don't really care what that is right now. And then the last
65:45 thing that we have to set up is our pooler tenant ID. It's not really
65:49 important to dive into what exactly this means. Just know that you can set this
65:52 to really anything that you want. I typically will just choose four digits
65:57 here, like 1000, for my pooler tenant ID. So that is everything that we need for
66:01 Supabase. And actually, most of the configuration is for Supabase. Then we
66:06 have Neo4j. This is really simple. You can leave Neo4j as the username, and
66:11 then I'll just say test Neo4j pass for my password here. So you just set the
66:15 password for the knowledge graph, and even if you're not using Neo4j, you still have
66:19 to set this. But yeah, it just takes two seconds. Then we have Langfuse. This is
66:23 for agent observability. We have a few secrets that we need here. And for these
66:28 values, they can really just be whatever you want. It doesn't matter, because
66:31 these are just passwords, just like we had passwords for things like Neo4j.
66:35 So I can just say test ClickHouse pass, and then I can do test Minio pass. And
66:43 it really doesn't matter here. A random Langfuse salt. I'm just using
66:47 throwaway values here. You probably want something more secure in
66:51 this case, but I'm just doing something as a placeholder for now.
66:56 There we go. Okay, good. And then the last thing that we need
66:59 for Langfuse is an encryption key. And this is also generated with OpenSSL like
67:04 we did for the n8n credentials. And so I'll go back to my Git Bash terminal.
67:08 And again, you can do this with Python as well. I'll just run the exact same
67:12 command. I'll get a different value this time. And so I'll go ahead and copy
67:16 that. You could technically use the same value over and over if you wanted to,
67:20 but obviously it's way more secure to use a different value for each of the
67:24 encryption keys that you generate with OpenSSL. So there we go. That is our
67:28 encryption key. And that is actually everything that we have to set up for
67:32 our environment variables when we are just running the local AI package on our
67:37 computer. Once we deploy it to the cloud and we actually want domains for our
67:41 different services like Open WebUI and n8n, then we'll have to set up Caddy. So
67:45 this is where we'll dive into domains and we'll get into this at the end of
67:49 the master class here. But everything past this point for environment
67:54 variables is completely optional. You can leave all of this exactly as it is
67:59 and everything will work. Most of this is just extra configuration for
68:03 Supabase. Supabase is definitely the biggest service included in
68:08 this list of curated services. And so, there are a lot of
68:11 different configuration options you can play around with if you want to dive
68:15 more into this. You can definitely look at the same documentation page that we
68:19 were using for the Supabase secrets. You can scroll through this if
68:22 you want to learn more, like setting up email authentication or Google
68:27 authentication, diving more into all of those different configuration options
68:32 for Supabase. I'm not going to get into all
68:36 of this right now, because the core of getting Supabase up and running we
68:40 already have taken care of with the credentials that we set up at the top
68:44 right here. These are just the base things, and that's
68:47 what we'll stick to right now. So that is everything for our environment
68:51 variables. So then, going back to our readme, which I have open directly in
68:55 Windsurf now instead of my browser, we have finished our configuration. And I do
68:59 have a note here that you want to set things up for Caddy if you're deploying
69:03 to production. Obviously, we're doing that later, not right now, like I said. And
1:09:07 Customizing the Local AI Package
69:07 so with that we are good to start everything. Now before we spin up the
69:12 entire local AI package there is one thing that I want to cover. It's
69:14 important to cover this before we run things. If you don't want to run
69:19 everything in the package, because it is a lot, like maybe you only want to use half
69:22 of these services and you don't want Neo4j and Langfuse and Flowise right
69:28 now, there are two options that you have. The easiest one right now is to go
69:33 into the Docker Compose file. This is the main file where all of the services
69:38 are curated together, and you can just remove the services that you don't want
69:42 to include. So, for example, if you don't want Qdrant right now (it is
69:46 actually one of the larger services, around 600 megabytes of RAM just
69:50 having it running), you can search for Qdrant, and you can just go ahead and
69:54 delete this service from the stack like that. Boom. Now I don't have Qdrant.
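For a sense of what you're deleting, a service entry in docker-compose.yml looks something like this (the fields shown are illustrative, not the package's exact definition). Removing the whole block, plus its named volume under the top-level volumes: key, takes it out of the stack:

```yaml
  qdrant:
    image: qdrant/qdrant
    restart: unless-stopped
    ports:
      - "6333:6333"
    volumes:
      - qdrant_storage:/qdrant/storage
```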
69:59 It won't spin up as a part of the stack anymore. And then I also have a volume
70:03 for Qdrant, so you can remove that as well. Volumes, by the way, are how we are
70:08 able to persist data for these containers. So if we tear down
70:12 everything and then we spin it back up, we still are going to have our open web
70:17 UI conversations and our n8n workflows, everything in Supabase, like all that
70:21 is still going to be saved because we're storing it all in volumes. So we can do
70:25 whatever the heck we want with these containers. We can tear them down. We
70:28 can update them, which I'll show you how to do later. We can spin it back up. And
70:32 all of our data will always be persisted. So you don't have to worry
70:35 about losing information. And you can always back things up if you want to be
70:39 really secure, but I've never done that before and I've been updating this
70:42 package for months and months and months and all of my workflows from 6 months
70:46 ago are still there. I haven't lost anything. And so that's just a quick
70:50 caveat there for how you can remove services if you want. And then another
70:54 thing that we don't have available yet, but I'm very excited to
70:58 talk about right now; it's in beta. Me and
71:03 one other guy on my Dynamous team, Thomas, who has a
71:07 YouTube channel as well (he's a great guy), are working together on this.
71:09 He's actually been putting in most of the work, creating a front-end
71:13 application for us to manage our local AI package. And one of the big things
71:17 with this is that we're going to make it possible for you to toggle on and off
71:22 the services that you want to have within your local AI package. So you can
71:27 very much customize the package to the services that you want to run. So you
71:31 can keep it lightweight just to the things you care about. Also, we'll be
71:34 able to manage environment variables and monitor the containers. Not all of this
71:38 is up and running at this point, but this is in beta. We're working on it.
71:41 I'm really excited for this. So, not available yet, but once this is
71:44 available, this will be a really good way for you to customize the package to
71:47 your needs. So, you don't have to go and edit the docker compose file directly.
71:51 So, that's something that I just wanted to get out of the way now. But, we can
71:57 start and actually execute our package now. Get all these containers up and
1:11:59 Running the Local AI Package
72:02 running. So the command that you run to start the local AI package is different
72:07 depending on your operating system and the hardware that you have. So for
72:13 example, if you are an Nvidia GPU user, you want to run this start_services.py
72:18 script. This boots up all of the containers, and you want to specifically
72:23 pass in the profile gpu-nvidia. This is going to start Ollama in a way where the
72:29 Ollama container is able to leverage your GPU automatically. And then if you are
72:34 using an AMD GPU and you're on Linux, then you can run it this way. Which by
72:38 the way, unfortunately, if you have an AMD GPU on Windows, you aren't able to
72:46 run Ollama in a container. And it's the same thing with Mac computers.
72:49 Unfortunately, like you see right here, you cannot expose your GPU to the Docker
72:55 instance. And so if you have an AMD GPU on Windows or are running on a Mac, you cannot
73:01 run Ollama in the local AI package. You just have to install it on your own
73:04 machine, like I already showed you in this master class, and then you'll just
73:08 run everything else through the local AI package, and those services can go out to
73:12 your machine and communicate with Ollama directly. So, just a small limitation for
73:17 Mac and AMD on Windows. But if you're running on Linux or an Nvidia GPU on
73:21 Windows like I'm using, then you can go ahead and run this command right here.
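For reference, the hardware profiles can be invoked like this (profile names are assumed from the walkthrough; check the repo's README for the exact flags):

```shell
python start_services.py --profile gpu-nvidia   # NVIDIA GPU
python start_services.py --profile gpu-amd      # AMD GPU (Linux only)
python start_services.py --profile cpu          # run Ollama on the CPU
python start_services.py --profile none         # don't start Ollama in the stack
```

You run exactly one of these, matching your hardware.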
73:27 So if you can't run a GPU in the Ollama container, then you can always just
73:32 start in CPU mode, or you can run with a profile of none. This will actually make
73:36 it so that Ollama never starts in the local AI package, so you can just
73:40 leverage the Ollama that you already have running on your computer, like I showed
73:43 you how to install already. So, just a couple of small caveats that I really
73:46 want to hit on there. I need to make sure that you're using the right
73:51 command. And so, in my case, I'm Nvidia on Windows. So, I'm going to copy this
73:55 command. Go back over into my terminal. I'll just clear it here. So, we have a
73:59 blank slate. And I'll paste in this command. And so, it's going to do quite
74:02 a few things initially. First, it's going to clone the Supabase repository,
74:07 because Supabase actually manages its stack in a separate place. And so, we
74:10 have to pull that in. Then there's some configuration for SearXNG for our local
74:17 and private web search. And then I have a couple of warnings here saying that
74:20 the Flowise username and password are not set. By the way, if
74:24 you want to set the Flowise username and password, it's optional, but you can
74:29 do that if I scroll down right here. So you can set these values, and that will
74:32 actually make those warnings go away, but you can also just ignore them. So
74:35 anyway, I just wanted to mention that really quickly. But now what's happening
74:39 here is it starts by running all of the Supabase containers. And so there's
74:44 quite a bit that goes into Supabase, like I said. So we're running all of
74:47 that. It's getting all that spun up. And then once we run all of these, it's
74:51 going to move on to deploying the rest of our stack. And if you're running this
74:55 for the very first time, it will take a while to download all of these images.
74:59 They're not super small. There's a lot of infrastructure that we're starting up
75:03 here. And so it'll take a bit. You just have to be patient. Maybe go grab your
75:06 coffee or make your next meal, whatever that is. And then everything will be up
75:09 and running once you are back. And so yeah, now you can see that we are
75:13 running the rest of the containers here. Um, and so we'll just wait for that to
75:16 be done. And then I'll show you what that looks like in Docker Desktop as
75:19 well. And so I'll give it a second here just to finish. Looks like my
75:24 terminal glitched a little bit; I was scrolling, and it kind of broke it
75:27 a bit. But anyway, everything is up and running now. It'll look like this, where
75:30 it'll say all of the containers are healthy or running or started. And then
75:34 if I go into Docker Desktop and I expand the local AI Compose stack, you want to
75:39 make sure that you have a green dot for everything except for the Ollama pull and
75:45 n8n import. These just run once initially and then they shut down, because
75:49 they're responsible for pulling some things for our local AI package. And so
75:53 yeah, I've got green dots for everything except for those two right here. Now I'm
75:57 leaving this in here intentionally actually because there is a bug with
76:02 Supabase specifically if you are on Windows. So you'll see this issue where
76:08 the Supabase pooler is constantly restarting, and that also affects n8n,
76:12 because n8n relies on the Supabase pooler, so it's constantly restarting as
76:17 well. If you see this problem, I actually talk about it in the
76:21 troubleshooting section of the readme. If you scroll all the way down: if the
76:24 Supabase pooler is restarting, you can check out this GitHub issue. And so I
76:29 linked to this right here, and he tells you exactly which file you want to
76:33 change. It's this one right here: docker/volumes/pooler/pooler.exs.
76:39 And you need to change the file to end in lf. And so I'll show you what I mean
76:43 by that. I'll show you exactly how to do this. It's like a super tiny random
76:47 thing, but this has tripped up so many people. So I want to include this
76:51 explicitly in the master class here. So you want to go within the supabase
76:56 folder, within docker/volumes, and then it's within pooler, and then we have
77:00 pooler.exs. And basically, no matter your IDE, you can see the CRLF in the bottom right here.
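If you'd rather do this from the terminal than click in the IDE status bar, a one-line sed performs the same CRLF-to-LF conversion (the path below is assembled from the folders just mentioned, so adjust it if your layout differs):

```shell
# Strip the trailing carriage return from every line, editing in place.
# (On macOS sed, the flag takes an argument: sed -i '' 's/\r$//' <file>)
sed -i 's/\r$//' supabase/docker/volumes/pooler/pooler.exs
```

The save is implicit with -i; then re-run the start script so the pooler container picks up the fixed file.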
77:08 You want to click on this, change it to LF, and then make sure that
77:13 you save the file. Very easy to fix that. And then what you can do is you
77:19 can run the exact same command to spin everything up again. And so I'm going to
77:22 do this now. It's going to go through all the same steps. It'll be faster this
77:25 time because you already have everything pulled. And this, by the way, is how you
77:28 can just restart everything really quickly if you want to enforce new
77:31 environment variables or anything like that. So I want to include that
77:35 explicitly um for that reason as well. And I'll go ahead and close out of this.
77:39 And while this is all restarting, the other thing that I want to show you
77:42 in the readme is that I also have instructions for upgrading the containers in the local AI package. So
77:49 when n8n has an update or Supabase has an update, it is your responsibility,
77:53 because you're managing the infrastructure, to update things yourself. And so you very simply just
77:58 have to run these three commands to update everything. You want to tear down
78:03 all of the containers, making sure you specify your profile, like gpu-nvidia, and
78:09 then you want to pull all of the latest containers, again specifying your
78:13 profile. And then once you do those two things, you'll have the most up-to-date
78:17 versions of the containers downloaded. So you can go ahead and run the start
78:21 services with your profile just like we just did to restart things. Very easy to
78:25 update everything. And even though we are completely tearing down our
78:29 containers here before we upgrade them, we aren't losing any information because
78:33 we are persisting things in the volumes that we have set up at the top of our
78:37 Docker Compose stack. And so this is where we store all of our data in our
78:41 database and workflows. All these things are persisted, so we don't have
78:45 to worry about losing them. Very easy to upgrade things and you still get to keep
78:48 everything. You don't have to make backups and things like that unless you
78:54 just want to be ultra safe. So now we can go back to our Docker Desktop, and
79:00 we've got green dots for everything now, since we fixed that pooler.exs issue.
79:04 The only things that we don't have green dots for are the n8n import and
79:08 our Ollama pull, because like I said, those are the two things that
79:11 just have to run at the beginning, and then they aren't ongoing processes like
79:16 the rest of our services. So, we have everything up and running. And if there
79:22 is anything that is a white dot besides the Ollama pull or n8n import, or if there's
79:27 anything that is constantly restarting, just feel free to post a comment and
79:31 I'll definitely be sure to help you out. And then also check out the
79:34 troubleshooting section as well. One thing that I'll mention really quick is
79:37 sometimes your n8n will constantly restart, and it'll say something like the
79:42 n8n encryption key doesn't match what you have in the config. And the big
79:46 thing to keep in mind for that is you want to make sure that you set this
79:51 value for the encryption key before you ever run it for the first time.
79:53 Otherwise, it's going to generate some random default value and then if you
79:56 change this later, it won't match with what it expects. And so, yeah, my big
79:59 recommendation is like make sure you have everything set up in your
80:03 environment variables before you ever run the start services for the first
80:08 time. This should be run once you have your environment variables set up.
80:11 Otherwise, you risk any of these services creating default values that
80:14 then wouldn't match with the keys and things that you set up later. And so
80:18 with that, we can now go into our browser and actually explore all of
80:22 these local AI services that we have running on our computer now. Now over in
1:20:24 Testing Our Local AI Services
80:26 our browser, we can start visiting the different services that we have spun up.
80:30 Like here is n8n. You just have to go to localhost port 5678. It'll have you
80:35 create a local account when you first visit it. And then you'll have this
80:38 workflow view that should look very familiar to you if you have used n8n
80:41 in the past. And then we have Open WebUI at localhost port 8080. This is our
80:47 ChatGPT-like interface where we can directly talk to all of the models that we have
80:52 pulled in our Ollama container. Really, really neat. And then we have localhost
80:57 port 8000 for our Supabase dashboard. The sign-in definitely isn't pretty
81:00 compared to the managed version of Supabase. But once you enter the
81:04 username and password that you set in the environment variables for the
81:07 dashboard, then you have the very typical view where we have our tables
81:11 and we've got our SQL editor. Everything that you're familiar with from
81:14 Supabase. And that's the key thing with all these different services. They all
81:18 will look exactly the same for you, pretty much. Another one, for example:
81:25 if I go to localhost port 3000, we have Langfuse. This is for agent
81:28 observability and monitoring. And this is something I'm not going to dive into
81:31 in this master class. Like I said, I'm not covering all the services. But yeah,
81:34 I just want to show that like every single one of these pretty much you can
81:39 access in your browser. And by the way, the way that we know the specific port
81:43 to access for each of these services is by taking a look at either what it tells
81:48 us in Docker Desktop. So, like, we can see that Neo4j is, let's see, port
81:53 7474. For SearXNG, it's port 8081. For Flowise, it's port 3001. What's one
82:01 that we've seen already? Yeah, like Open WebUI is port 8080. So,
82:06 the port on the left is the one that we access in our browser. And then the port
82:11 on the right is what's mapped on the container. So, when we visit port 8080
82:16 on our computer, that goes into port 8080 on the container. And that's what
82:21 we have exposed. The other way that you can see the port that you need to use is
82:24 just by taking a look at this docker compose file. And you don't need to have
82:29 like a super good understanding of this docker compose file. But if you want to
82:33 customize your stack or even help me by making contributions to local AI
82:36 package, this is the main place to make changes. And so for example, I can go
82:41 down to Flowise, and I can see that the port is 3001. Or if I go down to, let's say, n8n, we can
82:49 see that the port is 5678. And so the port is always going to be
82:52 there somewhere in the service that you have set up. Like for the Langfuse
82:56 worker, it's 3030. That's more of a behind-the-scenes kind of service. But
82:59 let me just find one more example for you here. Redis, for
83:04 example, is 6379. So you can see the ports in the Docker Compose as well. I
83:07 just want to call it out just to at least get you a little bit comfortable
83:11 and familiar with the Docker Compose file in case you want to customize
83:14 things. But the main thing is just leveraging what you see here in Docker
83:18 Desktop. Last thing in Docker Desktop really quickly, if you want to bring
83:21 more local large language models into the mix, you can do it without having to
83:25 restart anything. You just have to find the Ollama container in the Docker
83:29 Compose stack. Head on over to the exec tab. And now here we can run any
83:33 commands that we'd want. We're directly within the container here. And we can
83:36 use Ollama commands just like we did earlier on our host machine. And so for
83:40 example, I ran ollama list already. So I can see the large language models that
83:43 have already been pulled in my Ollama container. If I want to pull more, I can
83:48 just do ollama pull and then find the ID for the model I want to use on the Ollama
83:52 website. And like I said, you don't have to restart anything. If I pull it here,
83:56 it's now in the container and I can immediately start using it in Open Web
84:00 UI or N8N. We'll see that in a little bit. And so that's just really important
84:03 because a lot of times you're going to want to start to use different large
84:06 language models and you don't want to have to restart anything. The ones that
84:11 are pulled into the container by default are determined by this line right
84:15 here. So if you want to change the ones that are pulled by default: I just have
84:20 Qwen 2.5 7B Instruct, a really small, lightweight one, that I have
84:23 brought into your Ollama container by default. If you want to add in
84:27 different ones, you can just update this line right here to include multiple
84:32 ollama pull commands. And that way you can bring in Qwen 3 or Mistral Small 3.1,
84:36 whatever you want. This is just the one I have by default. And then all the
84:40 other ones that you saw in my list here, I've pulled myself. All right. Now that
1:24:41 Testing Ollama within Open WebUI
84:45 we have the local AI package up and running, it is time to build some
84:50 agents. Now, we get to use our local AI package to actually build out an
84:53 application. And so, I'm going to start by introducing you to Open Web UI, and
84:58 we'll use it to talk to our Ollama LLM. So, we have an application kind of right
85:02 out of the box for us. Then I'll dive into building a local AI agent with N8N,
85:08 even connecting it to Open Web UI. So we have this custom agent that we built in
85:12 N8N and then we immediately have a really nice UI to chat with it. And then
85:16 we'll transition to Python building the exact same agent in Python as well. Like
85:21 I said, I want to focus on both no code and code to really make this a complete
85:24 master class so that whether you want to build with N8N or Python, you can see
85:28 how to connect to our different services that we have running locally like
85:32 Supabase and SearXNG and Open Web UI. So, we'll cover all of that and then I'll
85:36 get into deployments after this. But yeah, let's go ahead right now focus on
85:40 open web UI and building out some agents. So, back over in Open Web UI,
85:45 remember this is localhost port 8080. You want to set up your connection to
85:49 Ollama so we can start talking with our local LLMs in this nice interface. And
85:54 so bottom left, go to the admin panel, then go to settings and then the
85:58 connections tab. Here we can set up our connections both to OpenAI with our API
86:02 key, which we're not going to do right now, but then also the Ollama API. This
86:07 is what we want to set up. Now, usually by default, this value is just
86:12 localhost. And this is actually wrong. This is something that is so important
86:16 to understand. And this will apply when we set up credentials in N8N and Python
86:20 as well. When you are within a container, localhost means that you are
86:26 referencing still within the container. Open Web UI needs to reach out to the
86:32 Ollama container, not itself. So localhost is not correct here. This is
86:36 generally the default just because Open Web UI assumes that you're running on
86:39 your machine and so then you would also have Ollama running on your machine. So
86:42 localhost usually works when you're outside of containers. But here we have
86:47 to change this. This is super important to get right. And so there are two
86:51 options we have. If you are running on a Mac or AMD on Windows and you want to
86:56 use Ollama running on your machine, not within a container, then you want to use
87:00 host.docker.internal. This is the way in Docker to tell the container to look outside to the host
87:07 machine where you're running the containers, where Ollama is running
87:11 separately. Very important to know that. And then if you are running Ollama in a
87:16 container, like I am doing (I have Ollama running in my Docker Desktop), you want
87:21 to change this to ollama. You're specifically calling out the service
87:26 that is running the Ollama container in your Docker Compose stack. And the way
87:30 that we know that this is the name specifically is because we just go back
87:35 to our all-important Docker Compose file: ollama. So whenever there's an x and a
87:40 dash, you just ignore that; it's just the thing after it. So ollama is the name
87:45 of our service running the container. And then if we wanted to connect to
87:49 something else like Flowise, flowise is the name of the service. Open Web UI,
87:55 it's open-web-ui. All of these top-level keywords, these are the names when we
87:59 want our containers to be talking to each other. And all of this is possible
88:03 because they are within the same Docker network. And so I'll just show you that
88:06 so you know what I'm talking about here. If I go back to Docker Desktop, we have
88:11 this local AI Compose stack. All of these containers can now communicate
88:14 internally with each other by referencing the names like redis or
88:19 searxng. So, we'll be seeing that a lot when we're building out our agents as
88:22 well. So, I wanted to spend a couple minutes to focus on that. And so, you
88:25 can go ahead and click on save in the very bottom right. I know my face is
88:28 covering this right now, but you have a save button here. Make sure you actually
88:32 do that. Um, and for this API key, I don't know why it's asking me to fill it
88:34 out. I don't really care about connecting to OpenAI. So I'll just put
88:37 some random value there and click save. And then boom, there we go. We are good.
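That localhost-versus-container rule is worth internalizing, because it comes back when we set up credentials in N8N and Python too. A minimal sketch of the decision, assuming Ollama's default port 11434 and the service name ollama from the Compose stack:

```python
def ollama_base_url(ollama_in_container: bool) -> str:
    """Pick the Ollama base URL for a service that itself runs in Docker.

    Inside a container, "localhost" refers to the container itself, so we
    either target the `ollama` service on the shared Docker network or reach
    out to the host via the special `host.docker.internal` hostname.
    """
    if ollama_in_container:
        # Ollama runs as the `ollama` service in the same Compose stack.
        return "http://ollama:11434"
    # Ollama runs directly on the host machine (e.g. Mac, or AMD on Windows).
    return "http://host.docker.internal:11434"
```

The same two URLs are exactly what goes into the connection settings for any service in the stack that needs to talk to Ollama.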
88:40 And then a lot of times with open web UI, it also helps to refresh otherwise
88:43 it doesn't load the models for some reason. So I just did a refresh of the
88:49 site here. Control F5. And then now we can select all of the local LLMs that we
88:53 have pulled in our Ollama container. And so for example, I can do Qwen 2.5 7B.
88:57 That's the one that I just have by default. I can say hello. And it's going
89:02 to take a little bit cuz it has to load this model onto my GPU just like we saw
89:06 with Qwen 3 earlier. But then in a second here we'll get a response. And
89:10 there are actually multiple calls that are being done here. We have one to get
89:15 our response, one to get a title for our conversation on the left-hand side. And
89:19 then also if you click on the three dots here, you can see that it created a
89:23 couple of tags for this conversation. So a couple of things that are fired off all
89:26 at once there. And I can test conversation history. What did I just
89:30 say? So yeah, I mean everything's working really well here. We have chat
89:34 history, conversation history on the left-hand side. There's so much that we
89:37 get out of the box. And so I wanted to show you this really quickly. Now we can
89:41 move on to building an agent in N8N. And I'll even show you how to connect it to
89:46 Open Web UI as well through this N8N agent connector. Really exciting stuff.
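As an aside, the model dropdown we just used in Open Web UI is ultimately backed by Ollama's GET /api/tags endpoint, so you can query the same list yourself. A small sketch, assuming the default port and the response shape from Ollama's API docs:

```python
import json
import urllib.request


def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from an Ollama GET /api/tags response body."""
    return [model["name"] for model in payload.get("models", [])]


def list_ollama_models(base_url: str = "http://localhost:11434") -> list[str]:
    """List the models already pulled into Ollama via its /api/tags endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return parse_model_names(json.load(resp))
```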
1:29:49 Building a Local n8n AI Agent
89:49 So let's get right into it. So I'm going to start really simple here by building
89:53 a basic agent. The main thing that I want to focus on is just connecting to
89:57 our different local AI services. So I am going to assume that you have a basic
90:01 knowledge of N8N here because this is not an N8N master class. And so I'm
90:04 starting with a chat trigger so we can talk to our agent directly in the UI.
90:08 We'll connect this to open web UI in a bit as well. And then I want to connect
90:14 an AI agent node. And so what we want to do is connect Ollama for the chat model
90:18 and then local Supabase for our conversation history, our agent memory.
90:22 And so for the chat model I'm going to do the Ollama chat model. I'm going to create
90:26 brand new credentials. You can see me do this from scratch. The URL that you want
90:32 for the base URL is exactly the same as what we just entered into open web UI.
90:35 And so if you are running Ollama on your host machine, like on AMD on Windows, or
90:40 you are running on a Mac, or you just don't want to run the Ollama container,
90:46 then it is host.docker.internal. And then if you are referencing the
90:49 Ollama container, we just reference ollama. That's the name of the service
90:53 running the Ollama container in our stack. And then the port is 11434 by
90:57 default. And you can test this connection. So it'll do a quick ping to
91:01 the container to make sure that we are good to go. And I'll even show you what
91:04 that looks like. So right here in my Ollama container, I have the logs up. And
91:09 the last two requests were just a simple get request to the root endpoint. We
91:13 have two of those right here. And if I click on retry and I go back to the
91:18 logs, boom, we are at three now. So it made three requests. So it's just making
91:22 that simple ping each time to make sure the container is available. And so I'm
91:26 going to go ahead and click on save and then close out. So now we have our
91:29 credentials and then we can automatically select the model that we
91:33 have loaded now in our container. And so just to keep things really lightweight,
91:36 I'm going to go with the 7 billion parameter model right now from Qwen 2.5.
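That credential test we watched in the logs is just a lightweight GET against the Ollama base URL, which answers with "Ollama is running". A sketch of the same check in Python:

```python
import urllib.request


def ping_ollama(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers at base_url.

    Ollama replies to a plain GET on its root endpoint, mirroring the quick
    ping we just saw N8N's credential test make in the container logs.
    """
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, DNS failure, or timeout: the server isn't reachable.
        return False
```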
91:40 Cool. All right. So that is everything that we need to connect Ollama. It is
91:44 that easy. And then we could even test it right now. So, I'm going to go ahead
91:47 and save this workflow. And I'm going to just say hello. And uh we don't need the
91:52 conversation history or tools or anything at this point. We're already
91:55 getting a response here from the LLM. It's working on loading the model into
91:59 my GPU as we speak. And so there we go. We got our answer looking really good.
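For reference, a chat turn like that "hello" is a single POST to Ollama's /api/chat endpoint when you call it directly. A sketch in plain Python; the model tag below is just an example:

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str) -> dict:
    """Shape a request body for Ollama's POST /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one complete JSON response instead of a stream
    }


def chat_with_ollama(base_url: str, model: str, prompt: str) -> str:
    """Send the prompt to Ollama and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(build_chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)["message"]["content"]
```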
92:04 Cool. So now we can add memory as well. So I'm going to add Postgres because,
92:08 remember, Supabase uses Postgres under the hood. And then I'm going to create
92:12 brand new credentials here. And this is actually probably the hardest one to set
92:16 up out of all of the credentials for connecting to our local AI services. And
92:20 so I'm going to show you what the Docker Compose file looks like just that it's
92:24 clear how I'm getting these different values. And so I'll point out all of
92:28 them. So the first one, for our host, it is db, because this is the name of the
92:35 specific Supabase service that is the underlying Postgres
92:38 database. And I can show you how I got that really quick. If you go to the
92:42 supabase folder that we pull when we run that start services script, I go to
92:47 docker and then docker compose. If I search for db and there's quite a few
92:53 dependencies on db here. So let me find the actual reference to it. Where is db?
92:58 Here we go. So yeah, it's really short. db is the name of our service that
93:03 actually is the Supabase DB. So this is the container name; this is what
93:07 you'll see in docker desktop. But then this is the underlying service that we
93:10 want to reference when we have our containers communicating with each
93:14 other. Like in this case we have our N8N container talking to our Supabase
93:18 database container. And then the database and username are both going to
93:22 be postgres. Those are the values that we have by default. If you scroll down a
93:26 bit in the .env, you can see these right here. The Postgres database is
93:29 postgres and the user is also postgres. And you can customize these
93:33 things but these are some of the optional parameters that I didn't touch
93:36 in the setup with you. And so you can just leave those as is. Now the
93:41 Postgres password, this is one of them that we set. That was the first
93:44 Supabase value that we set there. Make sure that matches what you have in
93:48 the .env. And then everything else you can kind of leave as the defaults here. The port is
93:53 going to be 5432. So that is everything for setting up our connection to
93:57 Postgres. You can test this connection as well. And then we can move on to
94:01 adding in some tools and things like that as well. But yeah, this is like the
94:06 very first basic version of the agent that I wanted to show you. And hopefully
94:10 with this you can see how, no matter the service that you have running in the
94:13 local AI package, it's very easy to figure out how to connect to it, both
94:17 with the help of N8N because N8N always makes it really easy to connect to
94:20 things. Then also just knowing that like you just have to reference that service
94:25 name that we have for the container in the Docker Compose stack. That's how we
94:28 can talk to it. So you could add in Qdrant or you could add in Langfuse.
94:31 Like you can connect anything that you want into our agent here. And so now we
94:36 have conversation history. Next up, I want to show you how to build a bit more
94:40 of a complicated agent with N8N using some tools. And then also I'm going to
94:44 show you how to connect it to Open Web UI. And so right now this is a live
94:47 demo. Instead of connecting to one of the Ollama LLMs, I'm going straight to
94:53 N8N. I have this custom N8N agent connector. And so we are talking to this
94:57 agent that I'll show you how to build in a little bit. This one has a tool to use
95:02 SearXNG for local and private web search. This is one of the platforms that we
95:06 have included in the local AI package. And so this response is going to take a
95:10 little bit here because it has to search the web. And the response that it
95:13 generates with this question is pretty long. Like there we go. Okay. So we got
95:16 the answer. It's pretty long. But yeah, we are able to search the internet now
95:21 with a local agent. N8N connected to open web UI. We're getting pretty fancy
95:25 here. And we also have the title that was generated on the left. And then we
95:30 have the tags here as well. And so the way that this all works, I'm going to
95:33 start by explaining how we can connect N8N to Open Web UI. And this is just
95:37 crucial. Makes it so easy for us to test agents locally as we are developing
95:42 them. And so if you go to the settings and the admin panel in the bottom left
95:47 and go to functions, open web UI has this thing called functions which gives
95:51 us the ability to add in custom functionality kind of as like custom
95:56 models that we can then use like you saw with the N8N agent connector. And so
96:02 what I have here is this thing that I call the N8N pipe. And I'll have a link
96:05 to this in the description as well. I created this myself and I uploaded it to
96:10 the open web UI directory of functions. And so you can go to this link right
96:14 here. You can even just Google the N8N pipe for open web UI. And then you click
96:19 on this get button. It'll just have you enter in the URL for your open web UI.
96:23 So I can just like paste in this right here. Click on import to open web UI and
96:28 it'll automatically redirect you to your open web UI instance. So you'll have
96:33 this function now. And we don't have to dive into the code for how all
96:36 of this works. I worked pretty hard to create this for you. Uh actually quite a
96:40 while ago I made this. And the thing that we need to care about is
96:44 configuring this to talk to our N8N agent. And so if you click on the
96:49 valves, the setting icon in the top right, there are a few values that we
96:54 have to set. And so now I'm going to go over to showing you how to build things
96:57 in N8N. Then all of this will click and it'll make sense. Right now, looking at
97:00 these values, you're probably like, how the heck do I get all of these? But
97:02 don't worry, we'll dive into all of that. But first, let's go into our N8N
97:07 agent. I'll explain how all of this works. So, first of all, we have our
97:12 chat trigger that gives us the ability to communicate with our agent very
97:16 easily in the workflow. We have a new trigger now for the web hook. And so,
97:22 this is turning our agent into an API endpoint. So, we're able to talk to it
97:27 with other services like open web UI. And so to configure the web hook here,
97:31 you want to make sure that it is a post request type. And then you can define a
97:35 custom path here. Whatever you set here is going to determine what our URL is.
97:40 So we have our test URL. And then also if you toggle the workflow to active,
97:44 this is really important. The workflow in N8N does have to be active. Then you
97:49 have access to this production URL. And this is actually the first value that we
97:54 need to set within the valves for this open web UI function. We have our N8N
97:59 URL. And because this is a container talking to another container, we don't
98:03 actually want to use this localhost value that it has here for us. We want
98:08 to specify N8N because N8N again is the name of the service running the N8N
98:13 container in our Docker Compose stack. So N8N port 5678. And then this is the
98:18 custom URL that we can determine based on this. And then the other thing that
98:23 we want to do is set up header authentication. We don't want to expose
98:27 this endpoint without any kind of security. And so we want to set up some
98:31 authentication. And so you can select header off from the authentication
98:34 dropdown. And then for the credentials here, I'll just create brand new ones to
98:38 show you what this looks like. The name needs to be Authorization with a capital
98:43 A. This has to be very specific. The name in the top left, the name of
98:46 your credentials, can be whatever you want, but this has to be
98:51 Authorization. And then the value here, the way that we want to format this is:
98:55 it's going to be Bearer, then a space, and then whatever you want
99:00 your bearer token to be. So this is what you get to define, but it needs to start
99:05 with Bearer, capital B, and a space. And then whatever you type after Bearer and
99:09 the space, this goes in as the N8N bearer token. So you don't include Bearer and a
99:13 space here, because it's just assumed that it's going to be
99:16 prefixed with that. So you just type in, like, test-auth is what I
99:21 have. So my bearer token is Bearer test-auth, like that. And then this is what I
99:24 enter in for this field. Now I already have mine set up. So I'm just going to
99:27 go ahead and close out of this. And then the last thing that we have to set up
99:30 for the web hook. And don't worry, this is the node that we spend the most time
99:33 with. You want to go to the drop down here and change this to respond using
99:38 the Respond to Webhook node. Very important, because then at the end of our
99:40 workflow and we get the response from our agent, we're going to send that back
99:45 to whatever requested our API which is going to be open web UI in this case.
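Putting those webhook pieces together, here is a sketch of the request that ends up hitting N8N: a POST with a JSON body and the Authorization header. The webhook path and token below are placeholders, and the chatInput/sessionId field names match what the workflow expects:

```python
import json
import urllib.request


def build_agent_request(webhook_url: str, bearer_token: str,
                        chat_input: str, session_id: str) -> urllib.request.Request:
    """Build the POST that a client like Open Web UI sends to the N8N webhook.

    The header must be exactly `Authorization: Bearer <token>` to pass the
    header auth we configured on the webhook node.
    """
    body = json.dumps({"chatInput": chat_input, "sessionId": session_id}).encode()
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {bearer_token}",
        },
        method="POST",
    )


def call_n8n_agent(request: urllib.request.Request) -> dict:
    """Fire the request and decode the JSON body from Respond to Webhook."""
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.load(resp)
```

If the token or the Authorization header name is off, the webhook's header auth simply rejects the request.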
99:48 And so that's everything for our configuration for the web hook. Now the
99:53 next thing that we have to do is we have to determine is open web UI sending in a
99:58 request to get a response for our main agent or is it just looking to generate
100:03 that conversation title or the tags for our conversation? Because like we were
100:06 looking at earlier, I'm going to close out of this for now and go back to a
100:10 conversation, our last conversation here. We get our main response, but then
100:15 also there is a request to an LLM to create a very simple title for our
100:19 conversation and the tags that we can see in the top right. And so our N8N
100:24 workflow actually gets invoked three separate times for just the first
100:30 message in a new conversation. And so we need to determine, are we getting a main
100:34 response? Like should we go to our main agent or should we just go to a simple
100:39 LLM that I have set up here to help generate the tags or title? And so the
100:44 way that we can determine that is whenever Open Web UI is requesting
100:47 something like a title for a conversation, it always prefixes the
100:53 prompt with three pound symbols, a space, and then the word task. And so we
100:58 can key off of this. If the prompt starts with this, and that prompt just
101:02 is coming in from our web hook here. If it does start with it, then we're just
101:06 going to go to this simple LLM. We're just going to be using Qwen 2.5 14B
101:11 Instruct. We have no tools, no memory or anything like our main agent, because
101:14 we're just very simply going to generate that title or the tags. And I can even
101:19 show you in the execution history what that looks like. So in this case, we
101:22 have our web hook that comes in. The chat input starts with the triple pound
101:28 and task. And so sure enough, we are deeming it to be a metadata request, as
101:32 I'm calling it. And so then it goes down to this LLM that is just
101:36 generating some text here. We just have this JSON response with the tags for the
101:41 conversation, technology, hardware, and gaming. So we're asking about the price
101:45 of the 5090 GPU. And then we do the exact same thing to also generate the
101:51 title GPU specs. And so exactly what we see here is the title of this last
101:55 conversation. So I hope that makes sense. And then if it doesn't start with
101:59 the triple pound and task, so it's actually a real user request, then we go to our
102:03 main agent. We don't want our main agent to have to handle those super simple
102:06 tasks. This would actually be the perfect
102:11 case to use a super tiny LLM, even something like DeepSeek R1 1.5B,
102:15 because it's just such a simple task. Otherwise, though, we are going to
102:20 go to our main agent. And so I'm not going to dive into like all these nodes
102:24 in a ton of detail, but basically we are expecting the chat input to contain
102:30 the prompt for our agent. And the way that we know to expect chat input
102:35 specifically is because going back to the settings for the function here with
102:38 the valves, we are saying right here chat input. So you want to make sure
102:43 that the value that you put in here for input matches exactly with what you are
102:48 expecting from our web hook. And so chat input is the one that I have by default.
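That branch, metadata request versus real user message, boils down to a single prefix check on the incoming prompt, since the "### Task" prefix is what Open Web UI prepends. In Python it is one line:

```python
def is_metadata_request(chat_input: str) -> bool:
    """True when Open Web UI is asking for a title/tags rather than sending a
    real user message: those prompts start with three pound symbols, a space,
    and the word Task."""
    return chat_input.startswith("### Task")
```

This mirrors the If node in the N8N workflow: metadata requests route to the small, tool-free LLM, everything else goes to the main agent.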
102:51 So you can just copy me if you want. Then we go into our agent where we're
102:55 hooked into Ollama and we've got our local Supabase. I already showed you
102:58 how to connect up all this and that looks exactly the same. The only thing
103:01 that is different now is we have a single tool to search the web with
103:07 SearXNG. So it's a web search tool. I have a description here just telling it what
103:11 is going to get back from using this tool. And then for the workflow ID, this
103:16 is: if I go to add a node here and I just go for workflow tools, the Call N8N
103:23 Workflow Tool. So this is basically taking an N8N workflow and using it as a
103:28 tool for our agent. So this is the node that we have right here. But then I'm
103:32 referencing the ID of this N8N workflow, this ID, because I'm going to just
103:37 call the subworkflow that I have defined below. And again, I don't want to dive
103:40 into all the details of N8N right now and how this all works, but the agent is
103:44 going to decide the query. What should I search the web with? It decides that and
103:49 then it invokes this sub workflow here where we have our call to SearXNG. So the
103:54 name of the container service in our Docker Compose stack is just searxng,
103:59 and it runs on port 8080. And then if you look at the SearXNG documentation, you
104:03 can look at how to invoke their API and things like this. So I'm just doing a
104:07 simple search here and then there are a few different nodes because what I want
104:10 to do is I want to split out the search results. And actually I can show you this by
104:14 going to an execution history where we're actually using this tool. So take
104:18 a look at this. So in this case the LLM decided to invoke this tool and the
104:23 query that it decided is current price of the 5090 GPU. So this is going along
104:28 with the conversation that we had last in Open Web UI. We get some results from
104:33 SearXNG, which is just going to be a bunch of different websites. And so, we don't
104:37 have the answer quite yet. We just have a bunch of resources that can help us
104:41 get there. And so, I'm going to split out. So, we have a bunch of different
104:45 websites. We're going to now limit to just one. I just want to pull one
104:48 website right now just to keep it really, really simple because now we're
104:52 going to actually visit that website. I'm going to make an HTTP request to
104:57 this website, which yeah, I mean, if it's literally an Nvidia official site
105:01 for the 5090, like this definitely has the information that we need. We're
105:04 going to make a request to it, and then we're also going to use this HTML node
105:08 to make sure that we are only selecting the body of the site. So, we take out
105:12 all the footers and headers and all that junk. So, we just have the key
105:15 information. And then that is what we aggregate and then return back to our AI
105:19 agent. So it now has the content, the core content of this website to get us
105:24 that answer. That is how we invoke our web search tool. And then at the very
105:29 end, we're just going to set this output field. And that's going to be the
105:33 response that we got back either from like generating a title or calling our
105:37 main agent. And this is really important: the output field specifically,
105:41 whatever we call it here, we have to make sure that it corresponds to this
105:46 value as the last thing we have to set for the settings for our open web UI
105:50 function. So output here has to match with output here because that is what
105:55 we're going to return in this respond to web hook. Whatever open web UI gets back
105:59 it's getting back from what we return right here. So that is everything for
106:03 our agent. I could probably dive in quite a bit more into explaining how
106:06 this all works and building out a lot more complex agents, which I definitely
106:10 do with local AI in the Dynamis AI agent mastery course. So check that out if you
106:13 are interested. I just wanted to give you a simple example here showing how we
106:17 can talk to our different services like Ollama, Supabase, and SearXNG. And then
106:22 also open web UI as well. So once you have all these settings set, make sure
106:26 of course that you click on save. It's very, very important. These two things
106:29 at the bottom don't really matter, by the way. But yeah, click on save once
106:32 you have all of the settings there. And then you can go ahead and have a
106:36 conversation with your agent just like I did when I was demoing things before we
106:40 dove into the workflow. And by the way, this N8N agent that works with Open Web
106:44 UI, I have as a template for you. You can go ahead and download that in this
106:48 GitHub repository where I'm storing all the agents for this masterclass. So we
106:52 have the JSON for it right here. You can go ahead and download this file. Go into
106:57 your N8N instance. Click on the three dots in the top right once you've
107:01 created a new workflow. Import from file and then you can bring in that JSON
107:03 workflow. You'll just have to set up all your own credentials for things like
107:07 Ollama and Supabase and SearXNG. But then you'll be good to go and you can just go
107:11 through the same process that I did setting up the function in open web UI
107:15 and within like 15 minutes you'll have everything up and running to talk
1:47:18 Building a Local Python AI Agent
107:20 to N8N in open web UI. Next up I want to create now the Python version of our
107:25 local AI agent. And so this is going to be a one-to-one translation. Exactly what
107:30 we built here in N8N, we are now going to do in Python. So I can show you how to
107:34 work with both no-code and code with our local AI package. And so this GitHub
107:39 repo that has the N8N workflow we were just looking at and that OpenAI
107:43 compatible demo we saw earlier, this has pretty much everything for the agent. So
107:46 most of this repository is for this agent that we're about to dive into now
107:50 with Python. And in this readme here, I have very detailed instructions for
107:54 setting up everything. And a lot of what we do with the Python agent, especially
107:57 when we are configuring our environment variables, it's going to look very
108:01 similar to a lot of those values that we set in N8N. Like we have our base URL
108:05 here, which you'd want to set to something like http://ollama:11434. We just
108:09 need to add this /v1 on the end, which I guess is a little bit different,
108:14 but yeah, I've got instructions here for setting up all of our environment
108:18 variables, our API key, which you can actually use OpenAI or Open Router as
108:22 well with this agent, taking advantage of the OpenAI API compatibility. This is
108:27 a live example of this because you can change the base URL, API key, and the
108:31 LLM choice to something from Open Router or OpenAI, and then you're good to go
108:35 immediately. It's really, really easy. We will be using Olama in this case, of
108:39 course, though. And then you want to set your superb basease URL and service key.
108:42 You can get that from your environment variables. Same thing with SearXNG with
108:47 that base URL. We'll set that just like we did in N8N. We have our bearer token,
108:51 which in our case was test-auth; it's just whatever comes after the Bearer and the
108:55 space. And then the OpenAI API key you can ignore. That's just for the
108:58 compatible demo that we saw earlier. This is everything that we need for our
109:02 main agent now. And so we're using a Python library called FastAPI to turn
109:09 our AI agent into an API endpoint just like we did in N8N. And so FastAPI is
109:13 kind of what gives us this web hook both with the entry point and the exit for
109:17 our agent and then everything in between is going to be the logic where we are
109:21 using our agent. And I'm going to be using Pydantic AI. It's my favorite AI
109:25 agent framework with Python right now. Makes it really easy to set up agents,
109:29 and we'll dive into that here. And I don't want to get into the
109:32 nitty-gritty of the Python code here because this isn't a master class on
109:36 specifically building agents. I really just want to show you how we can be
109:40 connecting to our local AI services. This agent is 100% offline. Like I could
109:46 cut the internet to my machine and still use everything here. So we create our
109:50 Supabase client and the instance of our FastAPI app. I have some models
109:55 here that define the requests coming in and the responses going out. So we have
109:59 the chat input and the session ID just like we saw in N8N. And then the output
110:04 is going to be this output field. And so that corresponds to exactly what we're
110:07 expecting with those settings that we set up in the function in open web UI.
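As a rough sketch of those input and output shapes: the real agent defines them as Pydantic models for FastAPI, but plain dataclasses are used here just to illustrate the JSON structure. The field names (chat_input, session_id, output) follow what's described in the video.

```python
import json
from dataclasses import dataclass, asdict

# Sketch only: the actual agent uses Pydantic models with FastAPI.
# Field names match the video's description of the request/response.

@dataclass
class ChatRequest:
    chat_input: str   # the latest user message
    session_id: str   # identifies the conversation history

@dataclass
class ChatResponse:
    output: str       # the single field open web UI reads back

req = ChatRequest(chat_input="What is the price?", session_id="abc-123")
print(json.dumps(asdict(req)))  # the JSON body the webhook expects
```

This is the same contract the open web UI function was configured with earlier: two fields in, one field out.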
110:11 So this Python agent is also going to work directly with open web UI. And then
110:16 we have some dependencies for our Pydantic AI agent because it needs an
110:21 HTTP client and the SearXNG base URL to make those requests for the web search
110:25 tool. And then we're setting up our model here. It's an OpenAI model, but we
110:29 can override the base URL and API key to communicate with Ollama or Open Router as
110:34 well, like we will be doing. And then we create our Pydantic AI agent, just giving it
110:38 that model based on our environment variables. I've got a very simple system
110:43 prompt and then the dependencies here because we need that HTTP client to talk
110:48 to SearXNG. And then I'm just allowing it to retry twice. So if there's any kind
110:51 of error that comes up, the agent can retry automatically, which is one of the
110:55 really awesome things that we have in Pydantic AI. And then I'm also creating a
111:01 second agent here. This is the agent that is going to be responsible like we
111:05 have in N8N for handling the metadata for open web UI, like conversation titles
111:10 and tags for our conversation. It's an entirely separate agent because
111:14 it just has another system prompt. In this case, I'm doing something
111:18 really simple here. We don't have any dependencies for this agent because it's
111:21 not going to be using the web search tool. And then for the model, I'm just
111:25 using the exact same model that we have for our primary agent. But like I shared
111:30 with N8N, you could make it so that this is like a much smaller model, like a one
111:33 or three billion parameter model because the task is just so basic or maybe like
111:38 a 7 billion parameter model. So you can tweak that if you want. Just for
111:41 simplicity's sake, I'm using the same LLM for both of these agents.
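To make the two-agent setup concrete, here's a minimal sketch. The class below is an illustrative stand-in for Pydantic AI's Agent, not its real API; it just shows that both agents share one model and differ only in system prompt and tools.

```python
from dataclasses import dataclass, field

# Illustrative stand-in for Pydantic AI's Agent class (not the real API).
@dataclass
class AgentSketch:
    model: str
    system_prompt: str
    tools: list = field(default_factory=list)
    retries: int = 2  # the video allows the agent to retry twice on errors

LLM_CHOICE = "qwen3:14b"  # hypothetical value of an LLM choice env variable

primary_agent = AgentSketch(
    model=LLM_CHOICE,
    system_prompt="You are a helpful assistant with web search.",
    tools=["web_search"],
)
metadata_agent = AgentSketch(
    model=LLM_CHOICE,  # same LLM for simplicity; could be a smaller model
    system_prompt="Generate a short title or tags for this conversation.",
)
```

The point is the shape: one model client, two system prompts, and only the primary agent gets the web search tool.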
111:46 And then we get to our web search tool. So in Pydantic AI, the way that you give a
111:51 tool to your agent is you write @, then the name of your agent, then
111:55 .tool, and then the function that you define below this is going to be
112:00 given as a tool to the agent. And the description that we have in the
112:04 docstring here is given as a part of the prompt to your agent. So it knows
112:09 when and how to use this tool. The exact details of how
112:13 we're using SearXNG here I won't dive into, but it is the exact same as what
112:17 we did in N8N, where we make that request to the search endpoint of SearXNG. We
112:22 go through the page results here. We limit to just the top three results, or I
112:26 could even change this to make it simpler and use just the top result, so we
112:30 have the smallest prompt possible for the LLM. And then we get the content of
112:34 that page specifically. And then we return that to our AI agent as some
112:40 JSON here. So now once it invokes this tool, it has a full page back with
112:44 information to help answer the question from the user. It has that web search
112:47 complete now. And then we have some security here to make sure that the
112:51 bearer token matches what comes into our API endpoint. So that's that header
112:55 authentication that we set up in N8N. So this part right here where we're
112:59 verifying the header authentication that corresponds to this verify token
113:03 function. And then we have a function to fetch conversation history and one to store a
113:08 new message in conversation history. Both of these are just making requests
113:12 to our locally hosted Supabase using that Supabase client that we created
113:16 above. And then we have the definition for our actual API endpoint. And so in
113:23 N8N we were using invoke N8N agent for our path to our agent. So this was our
113:29 production URL. In this FastAPI app, our endpoint is
113:34 /invoke-python-agent. And we're specifically expecting the chat input
113:39 and session ID. So that is our chat request type right here. And
113:43 then sorry I highlighted the wrong thing. We have our response model here
113:46 that has the output field. So we're defining the exact types for the inputs
113:50 and the outputs for this API endpoint. And then we're also using this verify
113:55 token to protect our endpoint at the start. And then the key thing here, if
113:59 the chat input starts with that task prefix, then we're going to call our metadata
114:02 agent. And so it's just going to spit out the title or the tags, whatever that
114:06 might be. Otherwise, we're going to fetch the conversation history, format
114:11 that for Pydantic AI, store the user's message so that we have that
114:15 conversation history stored, create our dependencies so that we can communicate
114:20 with SearXNG, and then we'll just do agent.run. We'll pass in the latest
114:24 message from the user, the past conversation history, and the
114:27 dependencies that we created so it can use those when it invokes the web search
114:31 tool. Then we just get the response back from the agent, and you can
114:34 print that out in the terminal as well, and then we'll just store it in
114:38 Supabase and return the output field. So I'm going kind of fast here.
114:42 There are definitely a lot more videos on my channel where I break down in more
114:45 detail building out agents with Pydantic AI and turning them into API endpoints
114:49 and things like that. But yeah, I'm just going a little bit faster here.
114:52 And then the last thing is with any kind of exception that we encounter, we're
114:55 just going to return a response to the front end saying that there was an issue
114:59 and then specifying what that is. And then we are using Uvicorn to host our
115:06 API endpoint specifically on port 8055. So that is everything for our Python
115:11 agent exactly the same as what we set up in N8N. And now going to the readme
115:16 here, I'll open up the preview. The way that we can run this agent, we just have
115:20 to open up a terminal just like we did with the OpenAI compatible demo. I've
115:25 got instructions here for setting up the database table, which is using
115:30 the same table as the one in N8N, and N8N creates it automatically. So, if you
115:34 have already been using the N8N agent, you don't actually have to run this SQL
115:38 here. And then you want to set up your environment variables like
115:42 we covered, activate your virtual environment, and install the requirements
115:46 there. And then you can go ahead and run the command python main.py.
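Once it's running, a request to the endpoint looks roughly like this. The URL, port, path, and token mirror the setup described in the video, but treat them as placeholders for your own values.

```python
import json
import urllib.request

# Build (but don't send) a request to the locally running agent.
# URL, path, and token are placeholders mirroring the video's setup.
def build_invoke_request(chat_input: str, session_id: str,
                         token: str = "your-bearer-token",
                         url: str = "http://localhost:8055/invoke-python-agent"):
    body = json.dumps({"chat_input": chat_input, "session_id": session_id}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # must match the server's token
        },
    )

req = build_invoke_request("How much does it cost?", "session-1")
```

Sending it with `urllib.request.urlopen(req)` should return a JSON body containing the single `output` field.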
115:52 And so this will start the API endpoint. So it'll just be hanging here because
115:56 now it's waiting for requests to come in on port 8055. And so what I can do is
116:03 I can go back to open web UI. I can go to the admin panel functions. Go to the
116:08 settings. I can now change this URL. So everything else is the same. I have my
116:11 bearer token, the input field, and the output field the same as N8N. The only
116:16 thing I have to change now is my URL. And so I know this is an N8N pipe and I
116:20 have N8N in the name everywhere, but this does work with just any API
116:23 endpoint that we have created with this format here. And so I'm going to say for
116:27 my URL, it's actually going to be host.docker.internal, because I have my API endpoint for
116:33 Python running outside on my host machine. So I need my open web UI
116:38 container to go outside to my host machine. And then specifically the port
116:43 is going to be 8055. And then for the endpoint here, I'm going to
116:46 delete this webhook path because ours is invoke-python-agent. Take a look at that. All right.
116:52 Boom. So I'm going to go ahead and save this. And then I can go over to my chat.
116:57 And it says N8N agent connector here still. But this is actually talking to
117:00 my Python agent now. So I'll go ahead and start by asking it the exact same
117:05 question that I asked the N8N agent. And I do have this pipe set up to always say
117:08 that it's calling N8N, but this is indeed calling our Python API endpoint.
117:12 And we can see that now. So there we go. We got all the requests coming in, the
117:16 response from the agent, and then also the metadata for the title and the tags
117:20 for the conversation. Take a look at that. So we got our title here. We have
117:24 our tags, and then we have our answer. It's a starting price of $2,000, though
117:28 it's a lot more right now. The starting price is kind of
117:32 misleading, but yeah, this is a good answer, and it did use SearXNG to do
117:36 that web search for us. This is really, really neat. Now, the last thing that we
1:57:37 Containerizing our Local Python Agent
24:29 Specific Local LLM Recommendations
24:30 for what you can run. So, to go along with that information overload, I want
24:34 to give you some specifics, individual LLMs that you can try right now based on
24:38 the size range that you know will work for your hardware. So, just a couple of
24:42 recommendations here. The first one that I want to focus on is DeepSeek R1. This
24:47 is the most popular local LLM ever. It completely blew up a few months ago. And
24:52 the best part about DeepSeek R1 is they have an option that fits into each of
24:56 the size ranges that I just covered in that chart. So they have a 7 billion
25:00 parameter, 14, 32, and 70. The exact numbers that I mentioned earlier. And
25:05 then there is also the full real version of R1, which is 671 billion parameters.
25:10 I'm sorry though, you probably don't have the hardware to run that unless
25:13 you're spending tens of thousands on your infrastructure. So, probably stick
25:16 with one of these based on your graphics card or if you have a Mac computer, pick
25:19 the one that'll work for you and just try it out. You can click on any one of
25:23 these sizes here. And then here's your command to download and run it. And this
25:28 is defaulting to a Q4 quantization, which is what I was assuming in the
25:31 chart earlier. And again, I will cover what that actually means in a little bit
25:35 here. The other one that I want to focus on is Qwen 3. This is a lot newer.
25:41 Qwen 3 is so good. And they don't have a 70 billion parameter option, but they do
25:45 have all the other sizes that fit into those ranges that I mentioned
25:49 earlier. Like they got 8 billion, 14 billion, and 32 billion parameters. And
25:52 the same kind of deal where you click on the size that you want and you've got
25:55 your command to install it here. And this is a reasoning LLM just like
26:01 DeepSeek R1. And then the other one that I want to mention here is Mistral Small.
26:05 I've had really good results with this as well. There are less options here,
26:08 but you've got 22 or 24 billion parameters, which is going to work well
26:12 with a 3090 graphics card or if you have a Mac M4 Pro with 24 GB of unified
26:18 memory. Really, really good model. And then also, there is a version of it that
26:22 is fine-tuned for coding specifically called Devstral, which is another
26:26 really cool LLM worth checking out as well if you have the hardware to run it.
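As a rough sense of scale for these recommended sizes, you can estimate a Q4 download at about half a gigabyte per billion parameters plus some overhead. The overhead factor here is my own back-of-the-envelope approximation, not an official formula; check the actual sizes on each model's Ollama page.

```python
# Back-of-the-envelope estimate, not an official formula: Q4 stores roughly
# 4 bits (0.5 bytes) per parameter, plus ~20% overhead for metadata and
# layers that aren't quantized as aggressively.
def approx_q4_gigabytes(billions_of_params: float, overhead: float = 1.2) -> float:
    return billions_of_params * 0.5 * overhead

for size in (7, 14, 32, 70):  # the size ranges mentioned in the video
    print(f"{size}B at Q4 ≈ {approx_q4_gigabytes(size):.1f} GB")
```

That puts a 7B model around 4 GB and a 32B model around 19 GB, which lines up with why a 24 GB card like a 3090 tops out around the 32B class.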
26:30 So, that is everything for just general recommendations for local LLMs to try
26:34 right now. This is the part of the master class that is going to become
26:38 outdated the fastest because there are new local LLMs coming out every single
26:42 month. I don't really know how long my recommendations will last. But in
26:45 general, you can just go to the model list in Ollama, search around,
26:49 find one that has a size that works with your graphics card, and just give it
26:52 a shot. You can install it and run it very easily with Ollama. And the other
26:57 thing that I want to mention here is you don't always have to run open-source
27:01 large language models yourself. You can use a platform like Open Router. You can
27:05 just go to openrouter.ai, sign up, and add in some API credits. You can try these
27:10 open-source LLMs there, maybe if you want to see what's powerful enough for
27:15 your agents before you invest in hardware to actually run them yourself.
27:18 And so within Open Router, I can just search for Qwen here. And I can go down
27:23 to Qwen and I can go to 32 billion. They have a free offering as well that
27:26 doesn't have the best rate limits. So I'll just go to this one right here,
27:31 Qwen 3 32B. So I can try the model out through Open Router. They actually host
27:35 it for me. So it's an open-source, non-local version, but now I can try it
27:39 in my agents to see if it's good. And if it is, then it's like, okay, now
27:43 I want to buy a 3090 graphics card so that I can install it directly through
27:47 Ollama instead. And so the 32 billion Qwen 3 is exactly what we're seeing here
27:51 in Open Router. And there are other platforms like Groq as well where you
27:55 can run these open-source large language models not on your own infrastructure
27:58 if you just want to do some testing beforehand or whatever that might
28:01 be. So I wanted to call that out as an alternative as well. But yeah, that's
28:04 everything for my general recommendations for LLMs to try and use
28:07 Quantization (Run Bigger LLMs)
28:09 in your agents. All right, it is time to take a quick breather. This is
28:12 everything that we've covered already in our master class. What is local AI? Why
28:17 we care about it? Why it's the future and hardware requirements. And I really
28:20 wanted to dive deep into this stuff because it sets the stage for everything
28:24 that we do when we actually build agents and deploy our infrastructure. And so
28:28 the last thing that I want to do with you before we really start to get into
28:33 building agents and setting up our package is I want to talk about some of
28:37 the tricky stuff that is usually pretty daunting for anyone getting into local
28:42 AI. I'm talking things like offloading models, quantization, environment
28:47 variables to handle things like flash attention, all the stuff that is really
28:51 important that I want to break down simply for you so you can feel confident
28:55 that you have everything set up right, that you know what goes into using local
29:00 LLMs. The first big concept to focus on here is quantization. And this is
29:04 crucial. It's how we can make large language models a lot smaller so they
29:10 can fit on our GPUs without hurting performance too much. We are lowering
29:14 the model precision here. What that basically means is we have
29:18 each of our parameters, all of the numbers for our LLMs, at 16 bits
29:23 at the full size, but we can lower the precision of each of those parameters to
29:28 8, 4, or 2 bits. Don't worry if you don't understand the technicalities of
29:31 that. Basically, it comes down to LLMs are just billions of numbers. That's the
29:35 parameters that we already covered. And we can make these numbers less precise
29:40 or smaller without losing much performance. So, we can fit larger LLMs
29:45 within a GPU that normally wouldn't even be close to running the full-size model.
29:50 Like with 32 billion parameter LLMs, for example, I was assuming a Q4
29:55 quantization, like 4 bits per parameter, in that diagram earlier. If you had the
30:00 full 16-bit parameters for the 32 billion parameter LLM, there's no way it could
30:06 fit on your Mac or your 3090 GPU, but we can use quantization to make it
30:10 possible. It's like rounding a number that has a long decimal to something
30:15 like 10.44 instead of something that has like 10 decimal places, but we're
30:19 doing it for each of the billions of parameters, those numbers that we have.
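The arithmetic behind that is simple: one billion parameters at one byte per parameter is about 1 GB of weights, so the precision directly sets the model's footprint. A quick sketch (weights only; context and overhead come on top):

```python
# Weights-only size of a model at a given precision.
# 1 billion parameters at 1 byte each ≈ 1 GB (ignoring overhead).
def weight_gigabytes(params_billion: float, bits_per_param: int) -> float:
    bytes_per_param = bits_per_param / 8
    return params_billion * bytes_per_param

for bits in (16, 8, 4, 2):
    print(f"32B model at {bits}-bit: {weight_gigabytes(32, bits):.0f} GB")
# A 32B model needs ~64 GB at FP16 but only ~16 GB at Q4,
# which is how it can fit on a 24 GB card like a 3090.
```

This is exactly the claim in the video: the full 16-bit 32B model has no chance on a 24 GB GPU, while the Q4 version fits.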
30:23 And so just to give you a visual representation of this, you can also
30:27 quantize images just like you can quantize LLMs. And so we have our full
30:31 scale image on the left-hand side here, comparing it to different levels of
30:35 quantization. We have 16-bit, 8-bit, and 4-bit. And you can see that at first with
30:40 a 16-bit quantization, it almost looks the same. But then once we go down to
30:44 4-bit, you can very much see that we have a huge loss in quality for the image.
30:49 Now with images, it's more extreme than with LLMs. When we do an 8-bit or 4-bit
30:54 quantization of an LLM, we don't actually lose that much performance the way we
30:58 lose a lot of quality with images. And so that's why it's so useful for us. And so I have
31:01 a table just to kind of describe what this looks like. So FP16, that's the
31:07 16-bit precision that all LLMs have as a base. That is the full size. The speed
31:11 is obviously going to be very slow because the model is a lot bigger, but
31:16 your quality is perfect compared to what it could be. I mean, obviously that
31:18 doesn't mean that you're going to get perfect answers all the time. I'm just
31:22 saying it's the 100% results from this LLM. And then going down to a Q8
31:28 precision, so it's half the size. The speed is going to be a lot better. And
31:33 the quality is near-perfect. So it's not like performance is cut in half just
31:37 because size is. You still have the same number of parameters. Each one is just a
31:42 bit less precise. And so you're still going to get almost the same results.
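You can see the same effect on a single number. This toy function snaps a value onto an n-bit grid over a fixed range, loosely analogous to (though much simpler than) how real weight quantization works:

```python
# Toy illustration: snap a value onto an n-bit grid between lo and hi.
# Real LLM quantization is block-wise and more sophisticated than this.
def quantize(value: float, bits: int, lo: float = -1.0, hi: float = 1.0) -> float:
    levels = 2 ** bits - 1                  # number of steps on the grid
    step = (hi - lo) / levels
    return lo + round((value - lo) / step) * step

x = 0.123456
for bits in (8, 4, 2):
    print(f"{bits}-bit: {quantize(x, bits):+.4f}")
```

At 8 bits the value barely moves; at 2 bits the rounding error gets large, mirroring the quality drop described in the table.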
31:47 And then going down to a Q4, 4-bit, it's a fourth the size. It's going to be very
31:52 fast compared to 16 bit. And the quality is still going to be great. Now, these
31:57 numbers are very vague on purpose. There's not really a way for me to
32:01 quantify exactly the difference, especially because it changes per LLM
32:05 and your hardware and everything like that. So, I'm just being very general
32:09 here. And then once you get to Q2, the size goes down a lot. It's going to
32:13 be very very fast, but usually your performance starts to go down quite a
32:17 bit once you go down to a Q2. And then like the note that I have in the bottom
32:22 left here, a Q4 quantization is generally the best balance. And so when
32:26 you are thinking to yourself, which large language model should I run? What
32:31 size should I use? My rule of thumb is to pick the largest large language model
32:37 that can work with your hardware with a Q4 quantization. That is why I assumed
32:42 that in the table earlier. And then also, like we saw in Ollama earlier, it always
32:47 defaults to a Q4 quantization, because the 16-bit is just so big compared to Q4
32:52 that most of the LLMs you couldn't even run yourself. And a Q4 of a 32 billion
32:59 parameter model is still going to be a lot more powerful than the full 7
33:02 billion or 14 billion parameter model because you don't actually
33:07 lose that much performance. So that is quantization. So just to make this very
33:11 practical for you, I'm back here in the model list for Qwen 3. We have all these
33:15 models that don't specify a quantization, but we can see that it
33:20 defaults to Q4 because if I click on any one of them, the quantization right here
33:26 is Q4_K_M. And don't worry about the K_M. That's just a way to group
33:30 parameters. You have K_S, K_M, and K_L. It's kind of outside of the scope of
33:33 what really matters for you. The big thing is the Q4, the actual number
33:38 here. So Q4 quantization is the default for Qwen 3 32B and really any model in
33:44 Ollama. But if we want to see the other quantized variants and we want to run
33:48 them, you can click on the view all. This is available no matter the LLM that
33:52 you're seeing in Ollama. Now we can scroll through and see all the levels of
33:56 quantization for each of the parameter sizes for Qwen 3. So, if I scroll all
34:01 the way down, the absolute biggest version of Qwen 3 that I can run is the
34:08 full 16-bit of the 235 billion parameter Qwen 3. And it is a whopping 470 GB just
34:14 to install this. And there is no way that you're ever going to lay hands on
34:17 infrastructure to run this unless you're working for a very large enterprise. But
34:22 I can go down here, let's say, to 14 billion parameters and I can run the Q4
34:27 like this. So, you can click on any one that you want to run. Like let's say I
34:30 want to run Q8. I can click on this and then I have the command to pull and run
34:35 this specific quantization of the 14 billion parameter model. So each of the
34:39 quantized variants have a unique tag within Ollama, so you can very
34:42 specifically choose the one that you want. Again, my general recommendation is
34:47 just to go with what Ollama recommends, which is just defaulting to
34:51 Q4. Like if I go to DeepSeek R1, you can see that it also defaults to Q4 no matter
34:56 the size that I pick. But if you do want to explore different quantizations, you
35:00 want to try to run the absolute full model for maybe something smaller like 7
35:04 billion or 14 billion parameters, you can definitely do that through Ollama and
35:08 really any other provider of local LLMs. So that is everything for quantization.
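The rule of thumb from the video, pick the largest model whose Q4 fits your VRAM, can be sketched as a tiny helper. The 1.2x headroom factor is my own guess to leave room for context, not a measured number.

```python
# Rule-of-thumb helper: largest model whose Q4 weights fit in VRAM,
# with a guessed 1.2x headroom factor to leave room for context.
SIZES_B = [7, 8, 14, 32, 70]  # parameter counts mentioned in the video

def largest_q4_that_fits(vram_gb: float, headroom: float = 1.2):
    fitting = [b for b in SIZES_B if b * 0.5 * headroom <= vram_gb]
    return max(fitting) if fitting else None

print(largest_q4_that_fits(24))  # a 24 GB card like a 3090
```

For a 24 GB card this lands on the 32B class, matching the 3090 recommendation earlier in the video.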
35:12 It's important to know how that works, but yes, generally stick with a Q4 of
35:17 the largest LLM that you can run. The next concept that is very important to
35:21 understand is offloading. Offloading is just splitting the layers of your
35:26 large language model between your GPU and your CPU and RAM. It's kind of
35:30 crazy, but large language models don't have to fit entirely in your GPU. All
35:36 large language models can be split into layers, layers of the different weights,
35:40 and you can have some of it running on your GPU. So, it's stored in your VRAM
35:45 and computed by the GPU. And then some of the large language model is stored in
35:50 your RAM, computed by the CPU. Now, this does hurt performance a lot. And so,
35:55 generally, you want to avoid offloading if you can. You want to be able to fit
35:59 everything in your GPU, which by the way, the context, like your prompts for
36:04 your local LLMs, that is also stored in VRAM. And so, sometimes you'll see what
36:08 happens when you have very long conversations with a large language model
36:13 that barely fits in your GPU. That'll actually tip it over the edge. So, it
36:16 starts to offload some of it to the CPU and RAM. So keep that in mind when you
36:19 have longer conversations and all of a sudden things get really slow, you know
36:24 that offloading is happening. Sometimes this is necessary though as context
36:28 grows. And if you're only offloading a little bit of the LLM or a little bit of
36:32 the conversation, whatever, to the CPU and RAM, it won't affect performance
36:36 that much. And so sometimes if you're trying to squeeze the biggest size you
36:41 can into your machine for an LLM, you can take advantage of offloading to run
36:45 something bigger or have a much larger conversation. Just know that usually it
36:49 kind of sucks. Like when I have offloading start to happen, my machine
36:53 gets bogged down and the responses are a lot slower. It's really not fun, but it
36:59 is possible. And fun fact, by the way, if your GPU is full and your CPU and RAM
37:04 are full, you can actually offload to storage, like literally using your hard
37:07 drive or SSD. That's when it's like incredibly slow and just terrible. But
37:11 just fun fact, you can actually do that. Now, the very last thing that I want to
37:15 cover before we dive into some code, setting up the local AI package, and
37:21 building out some agents is a few very crucial parameters, environment
37:25 variables for Ollama. So, these are environment variables that you can set
37:28 on your machine just like any other based on your operating system. And
37:32 Ollama does have an FAQ for setting up some of these things, which I'll link to
37:36 in the description as well. But yeah, these are a bit more technical, so
37:40 people skip past setting this stuff up a lot, but it's actually really, really
37:44 important to make things very efficient when running local LLMs. So the first
37:49 environment variable is flash attention. You want to set this to one or true.
37:54 When you have this set to true, it's going to make the attention calculation
37:59 a lot more efficient. It sounds fancy, but basically large language models when
38:04 they are generating a response, they have to calculate which parts of your
38:08 prompt to pay the most attention to. That's the calculation. And you can make
38:12 it a lot more efficient without losing much performance at all by setting up
38:16 the flash attention, setting that to true. And then for another optimization,
38:21 just like we can quantize the LLM itself, you can also quantize or
38:27 compress the context. So your system prompt, the tool descriptions, your
38:31 prompt and conversation history, all that context that's being sent to your
38:36 LLM, you can quantize that as well. So Q4 is my general recommendation for
38:41 quantizing LLMs. Q8 is the general recommendation for quantizing the
38:46 context memory. It's a very simplified explanation, but it's really, really
38:50 useful because a long conversation can also take a lot of VRAM, just like a larger
38:55 LLM. And so it's good to compress that. And then the third environment variable,
38:58 this is actually probably the most crucial one to set up for Ollama. There
39:02 is this crazy thing. I don't know why Ollama does it, but by default, they
39:07 limit every single large language model to 2,048 tokens for the context limit,
39:13 which is just tiny compared to, you know, Gemini being 1 million tokens and
39:17 Claude being 200,000 tokens. Like, they handle very, very large prompts. And a
39:21 lot of local large language models can also handle large prompts. But Ollama
39:25 will limit you by default to 2,048 tokens. And so you have to override that
39:30 yourself with this environment variable. And so generally I recommend starting
39:34 with about 8,000 tokens to start. You can move this all the way up to
39:38 something like 32,000 tokens if your local large language model supports
39:42 that. And if you view the model page on Ollama, you can see the context length
39:46 that's supported by the LLM. But you definitely want to, you know, jack this
39:50 up from just 2,048 because a lot of times when you have longer
39:53 conversations, you're going to get past 2,048 tokens very, very quickly. So, do
39:57 not miss this. If your large language model is starting to go completely off
40:02 the rails and ignore your system prompt and forget that it has these tools that
40:06 you gave it, it's probably because you reached the context length. And so, just
40:10 keep that in mind. I see people miss this a lot. And then the very last
40:14 environment variable, probably the least important out of all these four,
40:18 but if you're running a lot of different large language models at once and you're
40:22 trying to shove them all in your GPU, a lot of times you can have issues. And so
40:25 in Ollama, you can limit the number of models that are allowed to be in your
40:29 memory at a single time. With this one, typically you want to set this to either
40:33 one or two. Definitely set this to just one if you are using large language
40:37 models that basically fill your GPU. Like it's going to fit exactly into
40:41 your VRAM and you're not going to have room for another large language model.
40:44 But if you are running smaller ones and maybe you could actually fit two on
40:48 your GPU with the VRAM that you have, you can set this to two. So again, more
40:52 technical overall, but it's very important to have these right. And we'll
40:55 get into the local AI package where I already have these set up in the
40:59 configuration. And then by the way, this is the Ollama FAQ that I referenced a
41:02 minute ago that I'll have linked in the description. And so there's actually a
41:06 lot of good things to read into here, like being able to verify that your GPU
41:10 is compatible with Ollama, or how you can tell if the model's actually loaded on
41:13 your GPU? So, a lot of like sanity check things that they walk you through in the
41:17 FAQ as well. Also talking about environment variables, which I just
41:20 covered. And so, they've got some instructions here depending on your OS
41:23 how to get those set up. So, if there's anything that's confusing to you, this
41:26 is a very good resource to start with. So, I'm trying to make it possible for
41:30 you to look into things further if there's anything that doesn't quite make
41:33 sense for what I explained here. And of course, always let me know in the
41:35 comments if you have any questions on this stuff as well, especially the more
41:39 technical stuff that I just got to cover because it's so important even though I
41:43 know we really want to dive into the meat of things, which we are actually
41:47 going to do now. All right, here is everything that we have covered at this
41:50 point. And congratulations if you have made it this far because I covered all
41:55 the tricky stuff with quantization and the hardware requirements and offloading
41:59 and some of our little configuration and parameters. So, if you got all of that,
42:03 the rest of it is going to be a walk in the park as we start to dive into code,
42:07 getting all of our local AI set up and building out some agents. You understand
42:10 the foundation now that we're going to build on top of to make some cool stuff.
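Before moving on, the four Ollama environment variables from the previous section can be collected in one place. The variable names here follow Ollama's FAQ at the time of writing, so double-check them for your Ollama version, and note that they have to be set in the environment of the Ollama server process itself, not in your agent.

```python
import os

# The four Ollama environment variables discussed earlier. Names follow
# Ollama's FAQ at the time of writing; verify them for your version, and
# note they must be set where the Ollama server runs, not in your agent.
os.environ["OLLAMA_FLASH_ATTENTION"] = "1"      # efficient attention calculation
os.environ["OLLAMA_KV_CACHE_TYPE"] = "q8_0"     # quantize context memory to Q8
os.environ["OLLAMA_CONTEXT_LENGTH"] = "8192"    # raise the small default context
os.environ["OLLAMA_MAX_LOADED_MODELS"] = "1"    # one model in VRAM at a time
```

In the local AI package covered later, these are already wired into the configuration for you.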
42:15 And so, now the next thing that we're going to do is talk about how we can use
42:19 local AI anywhere. We're going to dive into OpenAI compatibility and I'll show
42:23 you an example. We can take something that is using OpenAI right now,
42:27 transform it into something that is using Ollama and a local LLM. So, we'll
42:31 actually dive into some code here. And I've got my fair share of no code stuff
42:35 in this master class as well, but I want to focus on both because I think it's
42:38 really important to use both code and no code whenever applicable. And that
42:42 applies to local AI just like building agents in general. So, I've already
42:45 promised a couple of times that I would dive into OpenAI API compatibility, what
42:50 it is, and why it's so important. And we're going to dive into this now so you
42:54 can really start to see how you can take existing agents and transform them into
42:59 being 100% local with local large language models without really having to
43:03 touch the code or your workflow at all. It is a beautiful thing because OpenAI
43:10 has created a standard for exposing large language models through an API.
43:14 It's called the chat completions API. It's kind of like how model context
43:19 protocol MCP is a standard for connecting agents to tools. The chat
43:23 completions API is a standard for exposing large language models over an
43:28 API. So you have this common endpoint along with a few other ones that all of
43:35 these providers implement. This is the way to access the large language model
43:39 to get a response based on some conversation history that you pass in.
43:43 So, Ollama has implemented this as of February. We have other providers too:
43:49 Gemini is OpenAI compatible, Groq is, and so is Open Router, which we saw earlier.
43:53 Almost every single provider is OpenAI API compatible. And so, not only is it
43:57 very easy to swap between large language models within a specific provider, it's
44:02 also very easy to swap between providers entirely. You can go from Gemini to
44:09 OpenAI, or OpenAI to Ollama, or OpenAI to Groq, just by changing basically one
44:13 piece of configuration pointing to a different base URL as it is called. So
44:18 you can access that provider and then the actual API endpoint that you hit
44:22 once you are connected to that specific provider is always the exact same and
44:26 the response that you get back is also always the exact same. And so Ollama has
44:31 this implemented now. And I'll link to this article in the description as well
44:33 if you want to read through this because they have a really neat Python example.
44:37 It shows where we create an OpenAI client and the only thing we have to do
44:42 to connect to Ollama instead of OpenAI is change this base URL. So now we are
44:47 pointing to Ollama that is hosted locally instead of pointing to the URL for
44:51 OpenAI. So we'd reach out to them over the internet and talk to their LLMs. And
44:55 then with Ollama, you don't actually need an API key because everything's running
44:58 locally. So you just need some placeholder value here. But there is no
45:02 authentication that is going on. You can set that up. I'm not going to dive into
45:05 that right now. But by default, because it's all just running locally, you don't
45:09 even need an API key to connect to Ollama. And then once we have our OpenAI
45:13 client set up that is actually talking to Ollama, not OpenAI, we can use it in
45:18 exactly the same way. But now we can specify a model that we have downloaded
45:22 locally already through Ollama. We pass in our conversation history in the same way
45:27 and we access the response, the content the AI produced, the token usage,
45:31 like all those things that we get back from the response in the same way.
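Concretely, the contract looks like this. Here is a stdlib-only sketch where only the base URL differs per provider; the endpoint path and request body shape are identical everywhere (the model name is just an example):

```python
import json

# The only provider-specific piece is the base URL; the endpoint path
# and the JSON body shape are the same everywhere.
PROVIDERS = {
    "openai": "https://api.openai.com/v1",
    "ollama": "http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
}

def chat_request(provider: str, model: str, messages: list):
    """Build the (url, json_body) for a chat completions call."""
    url = f"{PROVIDERS[provider]}/chat/completions"
    body = json.dumps({"model": model, "messages": messages})
    return url, body

url, _ = chat_request("ollama", "qwen3:14b",
                      [{"role": "user", "content": "Hello!"}])
print(url)  # http://localhost:11434/v1/chat/completions
```

Swap `"ollama"` for `"openai"` and only the URL changes; the rest of the request is untouched, which is exactly why providers are interchangeable.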
45:34 They've got a JavaScript example as well. They have a couple of examples
45:38 using different frameworks like the Vercel AI SDK and AutoGen. Really any
45:44 AI agent framework can work with OpenAI API compatibility to make it very easy
45:47 to swap between these different providers. Pydantic AI, my favorite
45:52 AI agent framework, also supports OpenAI API compatibility. So you can easily
45:57 within your Pydantic AI agents swap between these different providers. And
46:02 so what I have for you now is two code bases that I want to cover. The first
46:07 one is the local AI package, which we'll dive into in a little bit. But right
46:13 now, we have all of the agents that we are going to be creating in this master
46:17 class. So I have a couple for N8N that are also available in this repository.
46:21 And then a couple of scripts that I want to share with you as well. And so the
46:25 very first thing that I want to show you is this simple script that I have called
46:31 OpenAI compatible demo. And so you can download this repository. I'll have this
46:34 linked in the description as well. There's instructions for downloading and
46:38 setting up everything in here. And this is all 100% local AI. And so with that,
46:43 I'm going to go over into my windsurf here where I have this OpenAI compatible
46:47 demo set up. So I've got a comment at the top reminding us what the OpenAI API
46:52 compatibility looks like. We set our base URL to point to Ollama hosted
46:59 locally and it's hosted on port 11434 by default. So I can actually show you
47:02 this. I have Ollama running in a Docker container, which we're going to dive
47:05 into when we set up the local AI package, but you can see that it is
47:11 being exposed on port 11434. And by the way, you can see the
47:14 127.0.0.1 in that URL that I have highlighted here, that is synonymous with localhost.
47:21 And so this right here, you could also replace with 127.0.0.1.
47:25 Just a little tidbit there. It's not super important. I just typically leave
47:28 it as localhost. And then you can change the port as well. I'm just sticking to
47:32 what the default is. And then again, we don't need to set our API key. We can
47:36 just set it to any value that we want here. We just need some placeholder even
47:39 though there is no authentication with Ollama for real unless you configure
47:43 that. So that's OpenAI compatibility. And the important thing with this script
47:47 here is I have two different configurations here. I have one for
47:51 talking to OpenAI and then one for Ollama. So with OpenAI, we set our base
47:57 URL to point to api.openai.com. We have our OpenAI API key set in our
48:01 environment variables. So you can just set all your environment variables here
48:05 and then rename this to .env. I've got instructions for that in the readme of
48:08 course. And then going back to the script, we are using GPT-4.1 Nano for our
48:13 large language model. That's something super fast and cheap. And then for our
48:17 Ollama configuration, we are setting the base URL here, localhost:11434,
48:22 or just whatever we have set in our environment variables. Same thing for
48:26 the API key. And then same thing for our large language model. And what I'm going
48:31 to be using in this case is Qwen 3 14B. That is one of the large language models
48:34 that I showed you within the Ollama website. Definitely a smaller one
48:38 compared to what I could run, but I just want to run something fast. And very
48:41 small large language models are great for simple tasks like summarization or
48:45 just basic chat. And that's what I'm going to be using here just for a simple
48:49 demo. And so whether it's enabled or not, this configuration is just based on
48:53 what we have set for our environment variables. And the important thing here
48:59 is the code that runs for each of these configurations just as we go through
49:03 this demo is exactly the same. We are parameterizing the configuration for the
49:08 base URL and API key. So we are setting up the exact same OpenAI client just
49:13 like we saw in the Olama article but just changing the base URL and API key.
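That parameterized-config pattern can be sketched like this. This is a dependency-free stand-in: the env var names and the dict returned by make_client are illustrative, and the real script builds an actual OpenAI client where the comment indicates:

```python
import os
from dataclasses import dataclass

@dataclass
class LLMConfig:
    base_url: str
    api_key: str
    model: str

# Hypothetical env var names for illustration; the repo's .env may differ.
OPENAI = LLMConfig(
    base_url="https://api.openai.com/v1",
    api_key=os.getenv("OPENAI_API_KEY", ""),
    model="gpt-4.1-nano",
)
OLLAMA = LLMConfig(
    base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434/v1"),
    api_key="ollama",  # placeholder; Ollama has no auth by default
    model="qwen3:14b",
)

def make_client(cfg: LLMConfig):
    # With the real openai package this would be:
    #   from openai import OpenAI
    #   return OpenAI(base_url=cfg.base_url, api_key=cfg.api_key)
    # Returned as a dict here so the sketch stays dependency-free.
    return {"base_url": cfg.base_url, "api_key": cfg.api_key}

client = make_client(OLLAMA)
print(client["base_url"])
```

Everything downstream of make_client is identical for both configs; only the LLMConfig you pass in changes.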
49:17 And so then, for example, when we use it right here, it's
49:22 client.chat.completions.create, calling the exact same function no matter if we're
49:26 using OpenAI or Ollama. And then we're handling the response in the same way as
49:31 well. And so I'll go back to my terminal now. And so I went through all the steps
49:34 already to set up my virtual environment, install all of my dependencies. And so now I can run the
49:40 command OpenAI compatible demo. And now it's going to present the two
49:43 configuration options for me. And so I can run through OpenAI. So we'll go
49:46 ahead and do that first. And these two demos are going to look exactly the
49:50 same, but that is the point. And so we have our base URL here for OpenAI. We
49:54 have a basic example of a completion with GPT-4.1 Nano. There we go. So this
49:59 is the model that was used. Here are the number of tokens. And this is our
50:03 response. And then I can press enter to see a streaming response now as well. So
50:07 we saw it type out our answer in real time. And then I can press enter one
50:10 more time. This is the last part of the demo: a multi-turn conversation.
50:14 So we got a couple of messages here in our conversation history. So very nice
50:19 and simple. The point here is to now show you that I can run this and select
50:23 Ollama now instead and everything is going to look exactly the same and all
50:27 of the code is the same as well. It is only our configuration that is
50:31 different. And so it will take a little bit when you first run this because
50:36 Ollama has to load the large language model into your GPU. And so going to the
50:42 logs for Ollama, I can show you what this looks like here. And so when we first
50:47 make a request when Qwen 3 14B is not loaded into our GPU yet, you're going to
50:52 see a lot of logs come in here. And you'll have this container up and
50:54 running when you have the local AI package which we'll cover in a little
50:57 bit. So it shows all the metadata about our model, like it's Qwen 3 14B. We can
51:05 see here that we have a Q4_K_M quantization like we saw on the Ollama
51:09 website. What other information do we have here? There's just so much to
51:14 digest here. Yeah, another really important thing is we have the
51:19 context length. I have that set to 8,192 just like I recommended in the
51:22 environment variables. And then we can see that we offloaded all of the layers
51:26 to the GPU. So I don't have to do any offloading to the CPU or the RAM. I can
51:30 keep everything in the GPU, which is certainly ideal, like I said, to make
51:34 sure this is actually fast. And then when we get a response from Qwen 3 14B,
51:41 we are calling the v1/chat/completions endpoint because it is OpenAI API
51:46 compatible. So that exact endpoint that we hit for OpenAI is the one that we are
51:50 hitting here with a large language model that is running entirely on our computer
51:54 in Ollama. And so the response I get back, it's actually a reasoning LLM as
51:58 well. So we even have the thinking tokens here, which is super cool. And so
52:02 we got our response. It's just printing out the first part of it here just to
52:04 keep it short. And then I can press enter. And we can see a streaming demo
52:08 as well. And it's going to be a lot faster this time because we do already
52:11 have the model loaded into our GPU. And so that first request when it first has
52:15 to load a model is always the slower one. And then it's faster going forward
52:19 once that model is already loaded in our GPU. And then as long as we don't swap
52:24 to another large language model and use that one, then it will remain in our GPU
52:28 for some time. And so then all of our responses after are faster. And then we
52:33 just have the last part of our demo here with a multi-turn conversation. So we
52:37 can see conversation history in action as well, just not with streaming here.
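The multi-turn part boils down to maintaining a growing list of role/content messages and re-sending the whole list on each request, since the chat completions API itself is stateless. A minimal sketch:

```python
# Conversation history in the chat completions format: a list of
# role/content messages that the client re-sends every request.
history = [{"role": "system", "content": "You are a helpful assistant."}]

def add_turn(history, user_msg, assistant_msg):
    """Append one user turn and the assistant's reply to the history."""
    history.append({"role": "user", "content": user_msg})
    history.append({"role": "assistant", "content": assistant_msg})
    return history

add_turn(history, "What is 2+2?", "4")
add_turn(history, "And doubled?", "8")  # the model saw the prior turns
print(len(history))  # 5 messages: 1 system + 2 user + 2 assistant
```

This shape is identical whether the list is sent to OpenAI or to Ollama, which is why the demo's multi-turn code doesn't change between providers.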
52:40 And everything's a bit slower with this large language model because
52:43 it is a reasoning one. And so certainly, if you want faster
52:47 inference, you can always use a non-reasoning local LLM like Mistral or
52:52 Gemma for example. So that is our very simple demo showing how this works. I
52:55 hope that you can see with this and again this works with other AI agent
52:59 frameworks like Agno or Pydantic AI or CrewAI as well. They all work in
53:03 this way where you can use OpenAI API compatibility to swap between providers
53:08 so easily, so you don't have to recreate things to use local AI. And that's
53:11 something so important that I want to communicate with you because if I'm the
53:15 one introducing you to local AI I also want to show you how it can very easily
53:19 fit into your existing systems and automations. All right. Now, we have
53:23 gotten to the part of the local AI master class that I'm actually the most
53:27 excited for because over the past months, I have very much been pouring my
53:31 heart and soul into building up something to make it infinitely easier
53:35 for you to get everything up and running for local AI. And that is the local AI
53:40 package. And so, right now, we're going to walk through installing it step by
53:44 step. I don't want you to miss anything here because it's so important to get
53:47 this up and running, get it all working well. Because if you have the local AI
53:51 package running on your machine and everything is working, you don't need
53:55 anything else to start building AI agents running 100% offline and
53:59 completely private. And so here's the thing. At this point, we've been
54:04 focusing mostly on Ollama and running our local large language models. But there's
54:08 the whole other component to local AI that I introduced at the start of the
54:13 master class for our infrastructure: things like our database and local and
54:18 private web search, our user interface, agent monitoring. We have all these
54:23 other open-source platforms that we also want to run along with our large
54:27 language models and the local AI package is the solution to bring all of that
54:32 together curated for you to install in just a few steps. So, here is the GitHub
54:37 repository for the local AI package. I'll have this linked in the description
54:41 below. Just to be very clear, there are two GitHub repos for this master class.
54:45 We have this one that we covered earlier. This has our N8N and Python
54:49 agents that we'll cover in a bit, as well as the OpenAI compatible demo that
54:53 we saw earlier. So, you want to have this cloned and the local AI package as
54:57 well. Very easy to get both up and running. And if you scroll down in the
55:02 local AI package, I have very comprehensive instructions for setting
55:06 up everything, including how to deploy it to a private server in the cloud,
55:10 which we'll get into at the end of this master class, and a troubleshooting
55:13 section at the bottom. So, everything that I'm about to walk you through here,
55:17 there's instructions in the readme as well if you just want to circle back to
55:21 clarify anything. Also, I dive into all of the platforms that are included in
55:26 the local AI package. And this is very important because like I said, when you
55:30 want to build a 100% offline and private AI agent, it's a lot more than just the
55:35 large language model. You have all of the accompanying infrastructure like
55:39 your database and your UI. And so I have all that included. First of all, I have
55:44 N8N, that is our low/no-code workflow automation platform. We'll be building
55:48 an agent with N8N in the local AI package in a little bit once we have it
55:52 set up. We have Supabase for our open-source database. We have Ollama. Of
55:56 course, we want to have this in the package as well for our LLMs. Open Web
56:01 UI, which gives us a ChatGPT-like interface for us to talk to our LLMs and
56:06 have things like conversation history. Very, very nice. So, we're looking at
56:09 this right here. This is included in the package. Then we have Flowise. It's
56:13 similar to N8N. It's another really good tool to build AI agents with no/low
56:18 code. Qdrant, which is an open-source vector database. Neo4j, which is a
56:25 knowledge graph engine. And then SearXNG for open-source, completely free and
56:31 private web search. Caddy, which is going to be very important for us once
56:35 we deploy the local AI package to the cloud and we actually want to have
56:38 domains for our different services like N8N and Open WebUI. And then the last
56:42 thing is Langfuse. This is an open-source LLM engineering platform. It helps
56:47 us with agent observability. Now, some of these services are outside of the scope
56:52 for this local AI master class. I don't want to spend a half hour on every
56:56 single one of these services and make this a 10-hour video. I will be focusing
57:02 in this video on N8N, Supabase, Ollama, Open WebUI, SearXNG, and then Caddy once
57:08 we deploy everything to the cloud. So, I do cover like half of these services.
57:12 And the other thing that I want to touch on here is that there are quite a few
57:16 things included here. And so you do need about 8 GB of RAM on your machine or
57:21 your cloud server to run everything. It is pretty big overall. And so you can
57:26 remove certain things like if you don't want Qdrant and Langfuse for example,
57:30 you can take those out of the package. More on that later. It doesn't have to
57:34 be super bloated, you can whittle this down to what you need. But yeah, there's
57:37 a lot of different things that go into building AI agents. And so I have all of
57:40 these services here so that no matter what you need, I've got you covered. And
57:44 so with that, we can now move on to installing the local AI package. And
57:48 these instructions will work for you on any operating system, any computer. Even
57:52 if you don't have a really good GPU to run local large language models, you
57:56 still could always use OpenAI or Anthropic, something like that, and then
58:00 run everything else locally to save on costs or just to have everything running
58:04 on your computer. And so there are a couple of prerequisites that you have to
58:08 have before you can do the instructions below. You need Python so you can run
58:12 the start script that boots everything up. Git or GitHub desktop so you can
58:16 clone this GitHub repository, bring it all onto your own machine. And then you
58:21 want Docker or Docker Desktop. And so I've got links for all of these. Docker
58:25 and Docker Desktop we need because all of these local AI services that I've
58:29 curated for you, they all run as individual Docker containers that are
58:34 all combined together in a stack. And so I'll actually show you this is the end
58:36 result once we have everything up and running within your Docker Desktop. You
58:41 have this local AI Docker Compose stack that has all of the services running in
58:45 tandem, like Supabase and Redis and N8N and Flowise, Caddy, Neo4j. All of
58:50 these are running within this stack. That is what we're working towards right
58:54 now. And so make sure you have all these things installed. I've got links that'll
58:57 take you to installing no matter your operating system. Very easy to get all
59:01 of this up and running on your machine. Then we can move on to our first command
59:05 here, which is to clone this GitHub repository, bringing all of this code on
59:10 your machine so you can get everything running. And so you want to open up a
59:14 new terminal. So I've got a new PowerShell session open here. Going to
59:18 paste in this command. And I'm going to be doing this completely from scratch
59:22 with you. So you clone the repo and then I'm just going to change my directory
59:26 into local AI package, which was just created from this git clone command. So
59:31 those are the first two steps. The next thing is we have to configure all of our
59:36 environment variables. And believe it or not, this is actually the longest part
59:41 of the process. And once we have this taken care of, it's a breeze getting the
59:44 rest of this up and running. But there's a lot of configuration that we have to
59:49 set up for our different services like credentials for logging into our
59:54 Supabase dashboard or Neo4j, things like our Supabase anonymous key and
59:59 private key. All these things we have to configure. And so within our terminal
60:04 here, you can do code dot to open this within VS Code or Windsurf. Open this in
60:09 Windsurf. You just want to open up this folder within your IDE, and the specific
60:14 IDE that you use really doesn't matter. You just want to get to this .env.example
60:20 here. I'm going to copy it and then I'm going to paste it. And then I'm going to
60:24 rename this to .env. So we're taking the example, turning it into a .env file. So,
60:31 you want to make sure that you copy it and rename it like this. Then we can go
60:36 ahead and start setting all of our configuration. And I'll even zoom in on
60:39 this just so that it's very easy for you to see everything that we are setting up
60:44 here. So, first up, we have a couple of credentials for N8N. We have our
60:49 encryption key and our JWT secret. And it's very easy to generate these. In
60:53 fact, we'll be doing this a couple of times, but we'll use this OpenSSL
60:58 command to generate a random 32-character alphanumeric string that
61:02 we're going to use for things like our encryption key and JWT secret. And so,
61:08 OpenSSL is a command that is available for you by default on Linux and Macs.
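If you'd rather skip OpenSSL entirely, Python's stdlib can generate an equivalent secret. This mirrors the Python alternative mentioned below, though the exact one-liner in the README may differ:

```python
import secrets

# Roughly equivalent to `openssl rand -hex 16`: 16 random bytes,
# printed as 32 hex characters.
key = secrets.token_hex(16)
print(key)       # a different 32-character hex string every run
print(len(key))  # 32
```

Run it once per secret you need (encryption key, JWT secret, and so on) so each value is independent.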
61:12 You can just open up any terminal and run this command and it'll spit out a
61:16 long string that you can then just paste in for this value. For Windows, you
61:20 can't just open up any terminal and use OpenSSL, but you can use Git Bash, which
61:26 is going to come with GitHub Desktop when you install it. And so, I'll go
61:29 ahead and just search for that. If you just go to your search bar on your
61:32 bottom left on Windows and search for Git Bash, it's going to open up this
61:37 terminal like this. And so, I can go ahead and copy this command, go in here,
61:42 and paste it in. And then I can run it. And then, boom, there we go. This is I
61:45 know it's really small for you to see right now. I'm going to go ahead and
61:48 copy this because this is now the value that I can use for my encryption key.
61:52 And then you want to do the exact same thing to generate a JWT secret. And then
61:57 the other way that you can do this if you don't want to install git bash or
62:01 it's not working for whatever reason, you can use Python to generate this as
62:05 well. So I can just copy this command and then I can go into the terminal here
62:10 and I can just paste this in. And so it's going to just like with OpenSSL
62:15 generate this random 32 character string that I can copy and then use for my JWT
62:20 secret. There we go. And so I am going to get in the weeds a little bit here
62:24 with each of these different parameters, but I really want to make sure that I'm
62:27 clear on how to set up everything for you so you can really walk through this
62:31 step by step with me. And like I said, setting up the environment variables is
62:35 the longest part by far for getting the local AI package set up. So if you bear
62:39 with me on this, you get through this configuration, you will have everything
62:43 running that you need for local AI for the LLMs and your infrastructure. So
62:47 that's everything for N8N. Now we have some secrets for Supabase. And there
62:53 are some instructions in the Supabase documentation for how to get some of
62:57 these values. So it's this link right here, which I have open up on my
63:01 browser. So we'll reference this in a little bit here. But first, we can
63:05 set up a couple of other things. The first thing we need to define is our
63:11 Postgres password. So, Supabase uses Postgres under the hood for the
63:14 database. And so, we want to set a password here that we'll use to connect
63:19 to Postgres within N8N or a connection string that we have for our Python code,
63:23 whatever that might be. And this value can be really anything that you want.
63:26 Just note that you have to be very careful about using special characters like
63:31 percent symbols. So if you ever have any issues with Postgres, it's probably
63:36 because you have special characters that are throwing it off. That's something
63:39 that I've seen happen quite a few times. And so like I said, I want to mention
63:42 troubleshooting steps and things to make sure that it is very clear for you. So
63:47 for this Postgres password here, I'm just going to say test Postgres pass.
63:51 I'm just going to give some kind of random value here. Just end with a
63:54 couple of numbers. I don't care that I'm exposing this information to you because
63:58 this is a local AI package. These passwords are for services that never
64:02 leave my computer. So, it's not like you could hack me by connecting to anything
64:08 here. And then we have a JWT secret. And this is where we get into this link
64:13 right here in the Supabase docs. And so they walk you through generating a JWT
64:18 secret and then using that to create both your anonymous and your service
64:22 role keys. If you're familiar with Supabase at all, we need both of these
64:27 pieces of information. The anonymous key is what we share to our front end. This
64:30 is our public key. And then the service role key has all permissions for
64:33 Supabase. We'll use this in our backends for things like our agents. And
64:39 so you can just go ahead and copy this JWT secret.
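For the curious, those anon and service_role keys are just JWTs, HS256-signed tokens whose payload names the role, which is why they both start with "ey". A rough stdlib sketch (the claim fields are simplified; use the generator in the Supabase docs for real keys):

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_key(jwt_secret: str, role: str) -> str:
    """Build a minimal HS256 JWT carrying the given role.
    Simplified sketch: real Supabase keys also carry iss/iat/exp claims."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps({"role": role}).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = b64url(hmac.new(jwt_secret.encode(), signing_input,
                          hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

anon_key = make_key("your-32-char-jwt-secret", "anon")
print(anon_key[:2])  # ey
```

Because the header always base64-encodes to the same prefix, every JWT starts with "ey"; the role claim and signature are what make the anon and service_role keys differ toward the end.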
64:43 And then you can paste this in right here. This is 32 characters long just
64:47 like the things that we generated with OpenSSL. I'm just going to be using
64:52 exactly what Supabase tells me to. And then what you can do with this is you
64:56 can select the anonymous key. Click on generate JWT and then I can copy this
65:02 value and then I will paste this for my anonymous token. And so I'm just
65:06 replacing the default value there for the anonymous key. And then going back
65:10 and selecting the service key, I'm going to generate that one as well. So it
65:13 looks very similar. They'll always start with ey, but these values are different
65:18 if you go towards the end. And so I'll go ahead and paste this for my service
65:22 role key. Boom. There we go. All right. And then for the Supabase dashboard
65:27 that we'll log into to see our tables and our SQL editor and authentication
65:31 and everything like that, we have our username here, which I'm just going to
65:35 keep as supabase. And then for the password, I can just say test supabase
65:39 pass. I'll just kind of use that as my common nomenclature here for my
65:42 passwords cuz I don't really care what that is right now. And then the last
65:45 thing that we have to set up is our pooler tenant ID. And it's not really
65:49 important to dive into what exactly this means. Just know that you can set this
65:52 to really anything that you want. Like I typically will just choose four digits
65:57 here like 1000 for my pooler tenant ID. So that is everything that we need for
66:01 Supabase. And actually most of the configuration is for Supabase. Then we
66:06 have Neo4j. This is really simple. You can leave Neo4j for the username and
66:11 then I'll just say test Neo4j pass for my password here. So you just set the
66:15 password for the knowledge graph, and even if you're not using Neo4j you still have
66:19 to set this but yeah it just takes two seconds. Then we have langfuse. This is
66:23 for agent observability. We have a few secrets that we need here. And for these
66:28 values they can really just be whatever you want. It doesn't matter because
66:31 these are just passwords just like we had passwords for things like Neo4j.
66:35 So I can just say test ClickHouse pass. And then I can do test MinIO pass. And
66:43 I mean it really doesn't matter here. Random Langfuse salt. I'm just doing
66:47 completely whack values here. You probably want something more secure in
66:51 this case, but I'm just doing something as a placeholder for now.
66:56 Yeah, there we go. Okay, good. And then the last thing that we need
66:59 for Langfuse is an encryption key. And this is also generated with OpenSSL like
67:04 we did for the N8N credentials. And so I'll go back to my git bash terminal.
67:08 And again, you can do this with Python as well. I'll just run the exact same
67:12 command. I'll get a different value this time. And so I'll go ahead and copy
67:16 that. You could technically use the same value over and over if you wanted to,
67:20 but obviously it's way more secure to use a different value for each of the
67:24 encryption keys that you generate with OpenSSL. So there we go. That is our
67:28 encryption key. And that is actually everything that we have to set up for
67:32 our environment variables when we are just running the local AI package on our
67:37 computer. Once we deploy it to the cloud and we actually want domains for our
67:41 different services like Open WebUI and N8N, then we'll have to set up Caddy. So
67:45 this is where we'll dive into domains and we'll get into this at the end of
67:49 the master class here. But everything past this point for environment
67:54 variables is completely optional. You can leave all of this exactly as it is
67:59 and everything will work. Most of this is just extra configuration for
68:03 Supabase. So, Supabase is definitely the biggest service that's included in
68:08 this list of curated services for you. And so, there's a lot of
68:11 different configuration things you can play around with if you want to dive
68:15 more into this. You can definitely look at the same documentation page that we
68:19 were using for the Supabase secrets. And so, you can scroll through this if
68:22 you want to learn more, like setting up email authentication or Google
68:27 authentication, diving more into all of those different configuration things
68:32 for Supabase if you want to dive more into that. I'm not going to get into all
68:36 of this right now because the core of getting Supabase up and running we
68:40 already have taken care of with the credentials that we set up at the top
68:44 right here. These are just the base things, and so that's
68:47 what we'll stick to right now. So that is everything for our environment
68:51 variables. So then going back to our readme now which I have open directly in
68:55 Windsurf now instead of my browser. We have finished our configuration and I do
68:59 have a note here that you want to set things up for Caddy if you're deploying
69:03 to production. Obviously we're doing that later not right now like I said and
69:07 so with that we are good to start everything. Now before we spin up the
69:12 entire local AI package there is one thing that I want to cover. It's
69:14 important to cover this before we run things. If you don't want to run
69:19 everything in the package cuz it is a lot like maybe you only want to use half
69:22 of these services and you don't want Neo4j and Langfuse and Flowwise right
69:28 now. There are two options that you have. The easiest one right now is to go
69:33 into the docker compose file. This is the main file where all of the services
69:38 are curated together and you can just remove the services that you don't want
69:42 to include. So, for example, if you don't want Qdrant right now, cuz it is
69:46 actually one of the larger services. It's like 600 megabytes of RAM just
69:50 having this running, you can search for Qdrant, and you can just go ahead and
69:54 delete this service from the stack like that. Boom. Now I don't have Qdrant.
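For reference, a service entry in the Compose file looks roughly like this (the image and volume names here are illustrative; check the actual docker-compose.yml):

```yaml
services:
  qdrant:
    image: qdrant/qdrant
    restart: unless-stopped
    volumes:
      - qdrant_storage:/qdrant/storage   # named volume persists the data

volumes:
  qdrant_storage:
```

To remove a service cleanly, delete both its block under `services:` and its entry under `volumes:`.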
69:59 It won't spin up as a part of the stack anymore. And then also I have a volume
70:03 for Qdrant. So, you can remove that as well. Volumes, by the way, is how we are
70:08 able to persist data for these containers. So if we tear down
70:12 everything and then we spin it back up, we still are going to have our open web
70:17 UI conversations and our N8N workflows, everything in Superbase, like all that
70:21 is still going to be saved because we're storing it all in volumes. So we can do
70:25 whatever the heck we want with these containers. We can tear them down. We
70:28 can update them, which I'll show you how to do later. We can spin it back up. And
70:32 all of our data will always be persisted. So you don't have to worry
70:35 about losing information. And you can always back things up if you want to be
70:39 really secure, but I've never done that before and I've been updating this
70:42 package for months and months and months and all of my workflows from 6 months
70:46 ago are still there. I haven't lost anything. And so that's just a quick
70:50 caveat there for how you can remove services if you want. And then another
70:54 thing that we don't have available yet, but I'm very excited to, you know, kind
70:58 of talk about this right now. It's in beta right now. We are creating it, me and
71:03 one other guy that's actually on my Dynamist team, Thomas. He's got a
71:07 YouTube channel as well. He's a great guy. We're working together on this.
71:09 He's actually been putting in most of the work creating a front-end
71:13 application for us to manage our local AI package. And one of the big things
71:17 with this is that we're going to make it possible for you to toggle on and off
71:22 the services that you want to have within your local AI package. So you can
71:27 very much customize the package to the services that you want to run. So you
71:31 can keep it lightweight just to the things you care about. Also, we'll be
71:34 able to manage environment variables and monitor the containers. Not all of this
71:38 is up and running at this point, but this is in beta. We're working on it.
71:41 I'm really excited for this. So, not available yet, but once this is
71:44 available, this will be a really good way for you to customize the package to
71:47 your needs. So, you don't have to go and edit the docker compose file directly.
71:51 So, that's something that I just wanted to get out of the way now. But, we can
71:57 start and actually execute our package now. Get all these containers up and
72:02 running. So the command that you run to start the local AI package is different
72:07 depending on your operating system and the hardware that you have. So for
72:13 example, if you are an Nvidia GPU user, you want to run this start_services.py
72:18 script. This boots up all of the containers and you want to specifically
72:23 pass in the profile of gpu-nvidia. This is going to start Ollama in a way where the
72:29 Ollama container is able to leverage your GPU automatically. And then if you are
72:34 using an AMD GPU and you're on Linux, then you can run it this way. Which by
72:38 the way, unfortunately, if you have an AMD GPU on Windows, you aren't able to
72:46 run Ollama in a container. And it's the same thing with Mac computers.
72:49 Unfortunately, like you see right here, you cannot expose your GPU to the Docker
72:55 instance. And so if you have an AMD GPU on Windows or are running on Mac, you cannot
73:01 run Ollama in the local AI package. You just have to install it on your own
73:04 machine like I already showed you in this master class and then you'll just
73:08 run everything else through the local AI package and they can actually go out to
73:12 your machine and communicate with Ollama directly. So just a small limitation for
73:17 Mac and AMD on Windows. But if you're running on Linux or an Nvidia GPU on
73:21 Windows like I'm using, then you can go ahead and run this command right here.
73:27 So if you can't run a GPU in the Ollama container, then you can always just
73:32 start in CPU mode or you can run with a profile of none. This will actually make
73:36 it so that Ollama never starts in the local AI package. So you can just
73:40 leverage the Ollama that you have already running on your computer like I showed
73:43 you how to install already. So, just a couple of small caveats that I really
73:46 want to hit on there. I need to make sure that you're using the right
73:51 command. And so, in my case, I'm Nvidia on Windows. So, I'm going to copy this
73:55 command. Go back over into my terminal. I'll just clear it here. So, we have a
73:59 blank slate. And I'll paste in this command. And so, it's going to do quite
74:02 a few things initially. First, it's going to clone the Supabase repository
74:07 because Supabase actually manages the stack in a separate place. And so, we
74:10 have to pull that in. Then there's some configuration for SearXNG for our local
74:17 and private web search. And then I have a couple of warnings here saying that
74:20 the Flowise username and password are not set. By the way, if
74:24 you want to set the Flowise username and password, it's optional, but you can
74:29 do that if I scroll down right here. So you can set these values, those will
74:32 actually make those warnings go away, but you can also ignore them, too. So
74:35 anyway, I just wanted to mention that really quickly. But now what's happening
74:39 here is it starts by running all of the Supabase containers. And so there's
74:44 quite a bit that goes into Supabase, like I said. So we're running all of
74:47 that. It's getting all that spun up. And then once we run all of these, it's
74:51 going to move on to deploying the rest of our stack. And if you're running this
74:55 for the very first time, it will take a while to download all of these images.
74:59 They're not super small. There's a lot of infrastructure that we're starting up
75:03 here. And so it'll take a bit. You just have to be patient. Maybe go grab your
75:06 coffee or make your next meal, whatever that is. And then everything will be up
75:09 and running once you are back. And so yeah, now you can see that we are
75:13 running the rest of the containers here. Um, and so we'll just wait for that to
75:16 be done. And then I'll show you what that looks like in Docker Desktop as
75:19 well. And so I'll give it a second here just to finish. Uh, looks like my
75:24 terminal glitched a little bit. Like I was scrolling and so it kind of broke it
75:27 a bit. But anyway, everything is up and running now. It'll look like this where
75:30 it'll say all of the containers are healthy or running or started. And then
75:34 if I go into Docker Desktop and I expand the local AI Compose stack, you want to
75:39 make sure that you have a green dot for everything except for the Ollama pull and
75:45 N8N import. These just run once initially and then they go down because
75:49 they're responsible for pulling some things for our local AI package. And so
75:53 yeah, I've got green dots for everything except for two right here. Now I'm
75:57 leaving this in here intentionally actually because there is a bug with
76:02 Supabase specifically if you are on Windows. So you'll see this issue where
76:08 the Supabase pooler is constantly restarting and that also affects N8N
76:12 because N8N relies on the Supabase pooler. So it's constantly restarting as
76:17 well. If you see this problem, I actually talk about this in the
76:21 troubleshooting section of the readme. If you scroll all the way down, if the
76:24 Supabase pooler is restarting, you can check out this GitHub issue. And so I
76:29 linked to this right here, and he tells you exactly which file you want to
76:33 change. It's this one right here: docker/volumes/pooler/pooler.exs.
76:39 And you need to change the file to use LF line endings. And so I'll show you what I mean
76:43 by that. I'll show you exactly how to do this. It's like a super tiny random
76:47 thing, but this has tripped up so many people. So I want to include this
76:51 explicitly in the master class here. So you want to go within the supabase
76:56 folder, within docker/volumes, and then it's within pooler, and then we have
77:00 pooler.exs, and basically no matter your IDE, you can see the CRLF in the bottom right here.
77:08 You want to click on this and then change it to LF and then make sure that
77:13 you save this file. Very easy to fix that. And then what you can do is you
77:19 can run the exact same command to spin everything up again. And so I'm going to
77:22 do this now. It's going to go through all the same steps. It'll be faster this
77:25 time because you already have everything pulled. And this, by the way, is how you
77:28 can just restart everything really quickly if you want to enforce new
77:31 environment variables or anything like that. So I want to include that
77:35 explicitly um for that reason as well. And I'll go ahead and close out of this.
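If you'd rather script that line-ending fix than click around in an IDE, the swap is just a byte replacement. A minimal Python sketch; the commented path is illustrative, matching where the file lives in the package:

```python
from pathlib import Path

def crlf_to_lf(path: Path) -> None:
    """Rewrite a file in place with Unix (LF) line endings."""
    path.write_bytes(path.read_bytes().replace(b"\r\n", b"\n"))

# Example (illustrative path):
# crlf_to_lf(Path("supabase/docker/volumes/pooler/pooler.exs"))
```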
77:39 And while this is all restarting, the other thing that I want to show you
77:42 in the readme is I also have instructions for upgrading the containers in the local AI package. So
77:49 when N8N has an update or Supabase has an update, it is your responsibility
77:53 because you're managing the infrastructure to update things yourself. And so you very simply just
77:58 have to run these three commands to update everything. You want to tear down
78:03 all of the containers and make sure you specify your profile like GPU Nvidia and
78:09 then you want to pull all of the latest containers and again specifying your
78:13 profile. And then once you do those two things, you'll have the most up-to-date
78:17 versions of the containers downloaded. So you can go ahead and run the start
78:21 services with your profile just like we just did to restart things. Very easy to
78:25 update everything. And even though we are completely tearing down our
78:29 containers here before we upgrade them, we aren't losing any information because
78:33 we are persisting things in the volumes that we have set up at the top of our
78:37 Docker Compose stack. And so this is where we store all of our data in our
78:41 database and workflows. All these things are persisted. So we don't have
78:45 to worry about losing them. Very easy to upgrade things and you still get to keep
78:48 everything. You don't have to make backups and things like that unless you
78:54 just want to be ultra ultra safe. So now we can go back to our Docker desktop and
79:00 we've got green dots for everything now since we fixed that pooler.exs issue.
79:04 The only things that we don't have green dots for are the N8N import and then we
79:08 have our Ollama pull as well because, like I said, those are the two things that
79:11 just have to run at the beginning and then they aren't ongoing processes like
79:16 the rest of our services. So, we have everything up and running. And if there
79:22 is anything that is a white dot besides Ollama pull or N8N import or if there's
79:27 anything that is constantly restarting, just feel free to post a comment and
79:31 I'll definitely be sure to help you out. And then also check out the
79:34 troubleshooting section as well. One thing that I'll mention really quick is
79:37 sometimes your N8N will constantly restart and it'll say something like the
79:42 N8N encryption key doesn't match what you have in the config. And the big
79:46 thing to keep in mind for that is you want to make sure that you set this
79:51 value for the encryption key before you ever run it for the first time.
79:53 Otherwise, it's going to generate some random default value and then if you
79:56 change this later, it won't match with what it expects. And so, yeah, my big
79:59 recommendation is like make sure you have everything set up in your
80:03 environment variables before you ever run the start services for the first
80:08 time. This should be run once you have your environment variables set up.
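To make that concrete, here's a small sketch of a pre-flight check you could run before the first start. The key names are illustrative; consult the package's .env.example for the actual variables:

```python
# Illustrative key names -- check the package's .env.example for the
# real variables (n8n encryption key, Postgres password, etc.).
REQUIRED_KEYS = ["N8N_ENCRYPTION_KEY", "POSTGRES_PASSWORD"]

def missing_env(env: dict) -> list:
    """Return required keys that are unset or empty, so you can set them
    before services generate mismatched defaults on first run."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

# Example: missing_env(dict(os.environ)) before running start_services.py
```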
80:11 Otherwise, you risk any of these services creating default values that
80:14 then wouldn't match with the keys and things that you set up later. And so
80:18 with that, we can now go into our browser and actually explore all of
80:22 these local AI services that we have running on our computer now. Now over in
80:26 our browser, we can start visiting the different services that we have spun up.
80:30 Like here is N8N. You just have to go to localhost port 5678. It'll have you
80:35 create a local account when you first visit it. And then you'll have this
80:38 workflow view that should look very familiar to you if you have used N8N
80:41 in the past. And then we have Open Web UI at localhost port 8080. This is our
80:47 ChatGPT-like interface where we can directly talk to all of the models that we have
80:52 pulled in our Ollama container. Really, really neat. And then we have localhost
80:57 port 8000 for our Supabase dashboard. The sign-in definitely isn't pretty
81:00 compared to the managed version of Supabase. But once you enter in your
81:04 username and password that you have set for the environment variables for the
81:07 dashboard, then you have the very typical view where we have our tables
81:11 and we've got our SQL editor. Everything that you're familiar with with
81:14 Supabase. And that's the key thing with all these different services. They all
81:18 will look pretty much the exact same for you. Like another one, for example,
81:25 if I go to localhost port 3000, we have Langfuse. This is for agent
81:28 observability and monitoring. And this is something I'm not going to dive into
81:31 in this master class. Like I said, I'm not covering all the services. But yeah,
81:34 I just want to show that like every single one of these pretty much you can
81:39 access in your browser. And by the way, the way that we know the specific port
81:43 to access for each of these services is by taking a look at either what it tells
81:48 us in Docker Desktop. So like we can see that Neo4j is, let's see, port
81:53 7474. For SearXNG, it's port 8081. For Flowise, it's port 3001. What's one
82:01 that we've seen already? Yeah, like Open Web UI is port 8080. So,
82:06 the port on the left is the one that we access in our browser. And then the port
82:11 on the right is what's mapped on the container. So, when we visit port 8080
82:16 on our computer, that goes into port 8080 on the container. And that's what
82:21 we have exposed. The other way that you can see the port that you need to use is
82:24 just by taking a look at this docker compose file. And you don't need to have
82:29 like a super good understanding of this docker compose file. But if you want to
82:33 customize your stack or even help me by making contributions to the local AI
82:36 package, this is the main place to make changes. And so for example, I can go
82:41 down to Flowise and I can see that the port is 3001. Or if I go down to let's say N8N, we can
82:49 see that the port is 5678. And so the port is always going to be
82:52 there somewhere in the service that you have set up. Like for the Langfuse
82:56 worker, it's 3030. That's more of a behind-the-scenes kind of service. But
82:59 let me just find one more example for you here. Yeah, like Redis for
83:04 example is 6379. So you can see the ports in the Docker Compose as well. I
83:07 just want to call it out just to at least get you a little bit comfortable
83:11 and familiar with the Docker Compose file in case you want to customize
83:14 things. But the main thing is just leveraging what you see here in Docker
83:18 Desktop. Last thing in Docker Desktop really quickly, if you want to bring
83:21 more local large language models into the mix, you can do it without having to
83:25 restart anything. You just have to find the Ollama container in the Docker
83:29 Compose stack. Head on over to the exec tab. And now here we can run any
83:33 commands that we'd want. We're directly within the container here. And we can
83:36 use Ollama commands just like we did earlier on our host machine. And so for
83:40 example, I ran ollama list already. So I can see the large language models that
83:43 have already been pulled in my Ollama container. If I want to pull more, I can
83:48 just do ollama pull and then find that ID for the model I want to use on the Ollama
83:52 website. And like I said, you don't have to restart anything. If I pull it here,
83:56 it's now in the container and I can immediately start using it in Open Web
84:00 UI or N8N. We'll see that in a little bit. And so that's just really important
84:03 because a lot of times you're going to want to start to use different large
84:06 language models and you don't want to have to restart anything. The ones that
84:11 are brought into the machine by default are determined by this line right
84:15 here. So if you want to change the ones that are pulled by default, I just have
84:20 Qwen 2.5 7B Instruct, a really small, lightweight one that I have
84:23 brought into your Ollama container by default. If you want to add in
84:27 different ones, you can just update this line right here to include multiple
84:32 ollama pulls. And so that way you can bring in Qwen 3 or Mistral Small 3.1,
84:36 whatever you want. This is just the one I have by default. And then all the
84:40 other ones that you saw in my list here, I've pulled myself. All right. Now that
84:45 we have the local AI package up and running, it is time to build some
84:50 agents. Now, we get to use our local AI package to actually build out an
84:53 application. And so, I'm going to start by introducing you to Open Web UI, and
84:58 we'll use it to talk to our Ollama LLM. So, we have an application kind of right
85:02 out of the box for us. Then I'll dive into building a local AI agent with N8N,
85:08 even connecting it to Open Web UI. So we have this custom agent that we built in
85:12 N8N and then we immediately have a really nice UI to chat with it. And then
85:16 we'll transition to Python building the exact same agent in Python as well. Like
85:21 I said, I want to focus on both no code and code to really make this a complete
85:24 master class so that whether you want to build with N8N or Python, you can see
85:28 how to connect to our different services that we have running locally like
85:32 Supabase and SearXNG and Open Web UI. So, we'll cover all of that and then I'll
85:36 get into deployments after this. But yeah, let's go ahead right now focus on
85:40 open web UI and building out some agents. So, back over in Open Web UI,
85:45 remember this is localhost port 8080. You want to set up your connection to
85:49 Ollama so we can start talking with our local LLMs in this nice interface. And
85:54 so bottom left, go to the admin panel, then go to settings and then the
85:58 connections tab. Here we can set up our connections both to OpenAI with our API
86:02 key, which we're not going to do right now, but then also the Ollama API. This
86:07 is what we want to set up. Now, usually by default, this value is just
86:12 localhost. And this is actually wrong. This is something that is so important
86:16 to understand. And this will apply when we set up credentials in N8N and Python
86:20 as well. When you are within a container, localhost means that you are
86:26 referencing still within the container. Open Web UI needs to reach out to the
86:32 Ollama container, not itself. So localhost is not correct here. This is
86:36 generally the default just because Open Web UI assumes that you're running on
86:39 your machine, and so then you would also have Ollama running on your machine. So
86:42 localhost usually works when you're outside of containers. But here we have
86:47 to change this. This is super important to get right. And so there are two
86:51 options we have. If you are running on a Mac or AMD on Windows and you want to
86:56 use Ollama running on your machine, not within a container, then you want to use
87:00 host.docker.internal. This is the way in Docker to tell the container to look outside to the host
87:07 machine, where you're running the containers and running Ollama
87:11 separately. Very important to know that. And then if you are running Ollama in the
87:16 container like I am doing. I have Ollama running in my Docker Desktop. You want
87:21 to change this to ollama. You're specifically calling out the service
87:26 that is running the Ollama container in your Docker Compose stack. And the way
87:30 that we know that this is the name specifically is because we just go back
87:35 to our all-important Docker Compose file. Ollama. So whenever there's an x and a
87:40 dash prefix, you just ignore that. It's just the thing after it. So, ollama is the name
87:45 of our service running the container. And then if we wanted to connect to
87:49 something else like Flowise, flowise is the name of the service. Open Web UI,
87:55 it's open-webui. All of these top-level keywords, these are the names when we
87:59 want our containers to be talking to each other. And all of this is possible
88:03 because they are within the same Docker network. And so I'll just show you that
88:06 so you know what I'm talking about here. If I go back to Docker Desktop, we have
88:11 this local AI Compose stack. All of these containers can now communicate
88:14 internally with each other by referencing the names like redis or
88:19 searxng. So, we'll be seeing that a lot when we're building out our agents as
88:22 well. So, I wanted to spend a couple minutes to focus on that. And so, you
88:25 can go ahead and click on save in the very bottom right. I know my face is
88:28 covering this right now, but you have a save button here. Make sure you actually
88:32 do that. And for this API key, I don't know why it's asking me to fill it
88:34 out. I don't really care about connecting to OpenAI. So I'll just put
88:37 some random value there and click save. And then boom, there we go. We are good.
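The addressing rule we just applied, plain localhost on the host, the service name between containers, and host.docker.internal from a container out to the host, can be sketched like this (service name ollama and default port 11434 as discussed):

```python
def ollama_base_url(caller_in_container: bool, ollama_in_compose: bool) -> str:
    """Base URL an app should use to reach Ollama on its default port.

    - Caller on the host itself: plain localhost.
    - Container -> Ollama container on the same Compose network: service name.
    - Container -> Ollama installed on the host (Mac, AMD on Windows):
      host.docker.internal.
    """
    if not caller_in_container:
        return "http://localhost:11434"
    if ollama_in_compose:
        return "http://ollama:11434"
    return "http://host.docker.internal:11434"
```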
88:40 And then a lot of times with open web UI, it also helps to refresh otherwise
88:43 it doesn't load the models for some reason. So I just did a refresh of the
88:49 site here. Control F5. And then now we can select all of the local LLMs that we
88:53 have pulled in our Ollama container. And so for example, I can do Qwen 2.5 7B.
88:57 That's the one that I just have by default. I can say hello. And it's going
89:02 to take a little bit because it has to load this model onto my GPU just like we saw
89:06 with Qwen 3 earlier. But then in a second here we'll get a response. And
89:10 there are actually multiple calls that are being done here. We have one to get
89:15 our response, one to get a title for our conversation on the left-hand side. And
89:19 then also if you click on the three dots here, you can see that it created a
89:23 couple of tags for this conversation. So couple of things that are fired off all
89:26 at once there. And I can test conversation history. What did I just
89:30 say? So yeah, I mean everything's working really well here. We have chat
89:34 history, conversation history on the left-hand side. There's so much that we
89:37 get out of the box. And so I wanted to show you this really quickly. Now we can
89:41 move on to building an agent in N8N. And I'll even show you how to connect it to
89:46 Open Web UI as well through this N8N agent connector. Really exciting stuff.
89:49 So let's get right into it. So I'm going to start really simple here by building
89:53 a basic agent. The main thing that I want to focus on is just connecting to
89:57 our different local AI services. So I am going to assume that you have a basic
90:01 knowledge of N8N here because this is not an N8N master class. And so I'm
90:04 starting with a chat trigger so we can talk to our agent directly in the UI.
90:08 We'll connect this to open web UI in a bit as well. And then I want to connect
90:14 an AI agent node. And so what we want to do is connect Ollama for the chat model and
90:18 then local Supabase for our conversation history, our agent memory.
90:22 And so for the chat model I'm going to do the Ollama chat model. I'm going to create
90:26 brand new credentials. You can see me do this from scratch. The URL that you want
90:32 for the base URL is exactly the same as what we just entered into open web UI.
90:35 And so if you are running Ollama on your host machine, like with an AMD GPU on Windows or
90:40 you are running on a Mac or you just don't want to run the Ollama container,
90:46 then it is host.docker.internal. And then if you are referencing the
90:49 Ollama container, we just reference ollama. That's the name of the service
90:53 running the Ollama container in our stack. And then the port is 11434 by
90:57 default. And you can test this connection. So it'll do a quick ping to
91:01 the container to make sure that we are good to go. And I'll even show you what
91:04 that looks like. So right here in my Ollama container, I have the logs up. And
91:09 the last two requests were just a simple get request to the root endpoint. We
91:13 have two of those right here. And if I click on retry and I go back to the
91:18 logs, boom, we are at three now. So it made three requests. So it's just making
91:22 that simple ping each time to make sure the container is available. And so I'm
91:26 going to go ahead and click on save and then close out. So now we have our
91:29 credentials and then we can automatically select the model that we
91:33 have loaded now in our container. And so just to keep things really lightweight,
91:36 I'm going to go with the 7 billion parameter model right now from Quen 2.5.
91:40 Cool. All right. So that is everything that we need to connect Ollama. It is
91:44 that easy. And then we could even test it right now. So, I'm going to go ahead
91:47 and save this workflow. And I'm going to just say hello. And uh we don't need the
91:52 conversation history or tools or anything at this point. We're already
91:55 getting a response here from the LLM. It's working on loading the model into
91:59 my GPU as we speak. And so there we go. We got our answer looking really good.
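By the way, the credential test we ran a moment ago is nothing magic: it's just an HTTP GET against Ollama's root endpoint. Roughly, in Python:

```python
import urllib.request

def ping_ollama(base_url: str = "http://ollama:11434") -> bool:
    """Return True if Ollama answers a GET on its root endpoint,
    mimicking the simple ping the credential test performs."""
    try:
        with urllib.request.urlopen(base_url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False
```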
92:04 Cool. So now we can add memory as well. So I'm going to add Postgres because,
92:08 remember, Supabase uses Postgres under the hood. And then I'm going to create
92:12 brand new credentials here. And this is actually probably the hardest one to set
92:16 up out of all of the credentials for connecting to our local AI service. And
92:20 so I'm going to show you what the Docker Compose file looks like just that it's
92:24 clear how I'm getting these different values. And so I'll point out all of
92:28 them. So the first one, our host, is db because this is the name of the
92:35 specific Supabase service that we have that is the underlying Postgres
92:38 database. And I can show you how I got that really quick. If you go to the
92:42 supabase folder that we pull when we run that start_services script, I go to
92:47 docker and then docker-compose. If I search for db, there are quite a few
92:53 dependencies on db here. So let me find the actual reference to it. Where is db?
92:58 Here we go. So yeah, it's really short. db is the name of our service that
93:03 actually is the Supabase DB. So this is the container name, and this is what
93:07 you'll see in docker desktop. But then this is the underlying service that we
93:10 want to reference when we have our containers communicating with each
93:14 other. Like in this case we have our N8N container talking to our Supabase
93:18 database container. And then the database and username are both going to
93:22 be postgres. Those are the values that we have by default. If you scroll down a
93:26 bit in the .env you can see these right here. The Postgres database is
93:29 postgres and the user is also postgres. And you can customize these
93:33 things, but these are some of the optional parameters that I didn't touch
93:36 in the setup with you. And so you can just leave those as is. Now the
93:41 Postgres password, this is one that we set. That was the first
93:44 Supabase value that we set there. Make sure you have that match what you have in
93:48 the .env. And then everything else you can kind of leave as the defaults here. The port is
93:53 going to be 5432. So that is everything for setting up our connection to
93:57 Postgres. You can test this connection as well. And then we can move on to
94:01 adding in some tools and things like that as well. But yeah, this is like the
94:06 very first basic version of the agent that I wanted to show you. And hopefully
94:10 with this you can see how no matter the service that you have running in the
94:13 local AI package, it's very easy to figure out how to connect to it, both
94:17 with the help of N8N because N8N always makes it really easy to connect to
94:20 things. Then also just knowing that like you just have to reference that service
94:25 name that we have for the container in the Docker Compose stack. That's how we
94:28 can talk to it. So you could add in Qdrant or you could add in Langfuse.
94:31 Like you can connect anything that you want into our agent here. And so now we
94:36 have conversation history. Next up, I want to show you how to build a bit more
94:40 of a complicated agent with N8N using some tools. And then also I'm going to
94:44 show you how to connect it to Open Web UI. And so right now this is a live
94:47 demo. Instead of connecting to one of the Ollama LLMs, I'm going straight to
94:53 N8N. I have this custom N8N agent connector. And so we are talking to this
94:57 agent that I'll show you how to build in a little bit. This one has a tool to use
95:02 SearXNG for local and private web search. This is one of the platforms that we
95:06 have included in the local AI package. And so this response is going to take a
95:10 little bit here because it has to search the web. And the response that it
95:13 generates with this question is pretty long. Like there we go. Okay. So we got
95:16 the answer. It's pretty long. But yeah, we are able to search the internet now
95:21 with a local agent. N8N connected to open web UI. We're getting pretty fancy
95:25 here. And we also have the title that was generated on the left. And then we
95:30 have the tags here as well. And so the way that this all works, I'm going to
95:33 start by explaining how we can connect N8N to Open Web UI. And this is just
95:37 crucial. Makes it so easy for us to test agents locally as we are developing
95:42 them. And so if you go to the settings and the admin panel in the bottom left
95:47 and go to functions, open web UI has this thing called functions which gives
95:51 us the ability to add in custom functionality kind of as like custom
95:56 models that we can then use like you saw with the N8N agent connector. And so
96:02 what I have here is this thing that I call the N8N pipe. And I'll have a link
96:05 to this in the description as well. I created this myself and I uploaded it to
96:10 the open web UI directory of functions. And so you can go to this link right
96:14 here. You can even just Google the N8N pipe for open web UI. And then you click
96:19 on this get button. It'll just have you enter in the URL for your open web UI.
96:23 So I can just like paste in this right here. Click on import to open web UI and
96:28 it'll automatically redirect you to your open web UI instance. So you'll have
96:33 this function now. And we don't have to dive into the code for how all
96:36 of this works. I worked pretty hard to create this for you, actually quite a
96:40 while ago. And the thing that we need to care about is
96:44 configuring this to talk to our N8N agent. And so if you click on the
96:49 valves, the setting icon in the top right, there are a few values that we
96:54 have to set. And so now I'm going to go over to showing you how to build things
96:57 in N8N. Then all of this will click and it'll make sense. Right now, looking at
97:00 these values, you're probably like, how the heck do I get all of these? But
97:02 don't worry, we'll dive into all of that. But first, let's go into our N8N
97:07 agent. I'll explain how all of this works. So, first of all, we have our
97:12 chat trigger that gives us the ability to communicate with our agent very
97:16 easily in the workflow. We have a new trigger now for the web hook. And so,
97:22 this is turning our agent into an API endpoint. So, we're able to talk to it
97:27 with other services like open web UI. And so to configure the web hook here,
97:31 you want to make sure that it is a post request type. And then you can define a
97:35 custom path here. Whatever you set here is going to determine what our URL is.
97:40 So we have our test URL. And then also if you toggle the workflow to active,
97:44 this is really important. The workflow in N8N does have to be active. Then you
97:49 have access to this production URL. And this is actually the first value that we
97:54 need to set within the valves for this open web UI function. We have our N8N
97:59 URL. And because this is a container talking to another container, we don't
98:03 actually want to use this localhost value that it has here for us. We want
98:08 to specify n8n because n8n again is the name of the service running the N8N
98:13 container in our Docker Compose stack. So n8n, port 5678. And then this is the
98:18 custom URL that we can determine based on this. And then the other thing that
98:23 we want to do is set up header authentication. We don't want to expose
98:27 this endpoint without any kind of security. And so we want to set up some
98:31 authentication. And so you can select Header Auth from the authentication
98:34 dropdown. And then for the credentials here, I'll just create brand new ones to
98:38 show you what this looks like. The name needs to be Authorization with a capital
98:43 A. This has to be very specific. The name in the top left, the name of
98:46 your credentials, can be whatever you want, but this has to be
98:51 authorization. And then for the value here, the way that we want to format this
98:55 is "Bearer", then a space, and then whatever you want your bearer token to be.
99:00 So the token is what you get to define, but the value needs to start with
99:05 "Bearer", capital B, and a space. And then whatever you type after "Bearer "
99:09 goes in as the N8N bearer token in the valves. So there you don't include the
99:13 "Bearer " prefix, because it's just assumed that the value is going to be
99:16 prefixed with that. So you just type in something like "test auth", which is what
99:21 I have. So my header value is "Bearer test auth", and the token alone is what I
99:24 enter in for the valve field. Now, I already have mine set up, so I'm just going to
99:27 go ahead and close out of this. And then the last thing that we have to set up
99:30 for the web hook. And don't worry, this is the node that we spend the most time
99:33 with. You want to go to the dropdown here and change this to respond using
99:38 the Respond to Webhook node. Very important, because then at the end of our
99:40 workflow and we get the response from our agent, we're going to send that back
99:45 to whatever requested our API which is going to be open web UI in this case.
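To pull the webhook pieces together, here's a small Python sketch of both sides of that exchange. It is illustrative only: the path /webhook/invoke-n8n-agent, the field names chatInput and sessionId, and the "test auth" token are assumptions mirroring the values described in this walkthrough, not guaranteed defaults.

```python
import json
import urllib.request

def build_invoke_request(base_url: str, token: str, chat_input: str,
                         session_id: str) -> urllib.request.Request:
    """Build the POST that a client (like Open Web UI) sends to the N8N webhook."""
    body = json.dumps({"chatInput": chat_input, "sessionId": session_id}).encode()
    return urllib.request.Request(
        base_url + "/webhook/invoke-n8n-agent",  # hypothetical production path
        data=body,
        headers={
            "Content-Type": "application/json",
            # The header value is "Bearer", a space, then your token.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )

def verify_bearer(header_value: str, expected_token: str) -> bool:
    """Receiving side: accept only the exact 'Bearer <token>' string."""
    return header_value == f"Bearer {expected_token}"

req = build_invoke_request("http://n8n:5678", "test auth", "Hi there!", "session-1")
# You would pass `req` to urllib.request.urlopen(req) to actually call the agent.
```

Note that the client sends the full "Bearer test auth" string, while the N8N credential and the valve each store only the token after the space, which is exactly the mismatch people trip over.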
99:48 And so that's everything for our configuration for the web hook. Now the
99:53 next thing that we have to do is we have to determine is open web UI sending in a
99:58 request to get a response for our main agent or is it just looking to generate
100:03 that conversation title or the tags for our conversation? Because like we were
100:06 looking at earlier, I'm going to close out of this for now and go back to a
100:10 conversation, our last conversation here. We get our main response, but then
100:15 also there is a request to an LLM to create a very simple title for our
100:19 conversation and the tags that we can see in the top right. And so our N8N
100:24 workflow actually gets invoked three separate times for just the first
100:30 message in a new conversation. And so we need to determine, are we getting a main
100:34 response? Like should we go to our main agent or should we just go to a simple
100:39 LLM that I have set up here to help generate the tags or title? And so the
100:44 way that we can determine that is whenever Open Web UI is requesting
100:47 something like a title for a conversation, it always prefixes the
100:53 prompt with three pound symbols, a space, and then the word task. And so we
100:58 can key off of this. If the prompt starts with this, and that prompt just
101:02 is coming in from our web hook here. If it does start with it, then we're just
101:06 going to go to this simple LLM. We're just going to be using Qwen 2.5 14B
101:11 instruct. We have no tools, no memory or anything like our main agent because
101:14 we're just very simply going to generate that title or the tags. And I can even
101:19 show you in the execution history what that looks like. So in this case, we
101:22 have our web hook that comes in. The chat input starts with the triple pound
101:28 and "Task". And so, sure enough, we deem it to be a metadata request, as
101:32 I'm calling it. And so it then goes down to this LLM that is just
101:36 generating some text here. We just have this JSON response with the tags for the
101:41 conversation, technology, hardware, and gaming. So we're asking about the price
101:45 of the 5090 GPU. And then we do the exact same thing to also generate the
101:51 title GPU specs. And so exactly what we see here is the title of this last
101:55 conversation. So I hope that makes sense. And then, if it doesn't start with
101:59 "Task" and the triple pound, then it's actually our request, and we go to our
102:03 main agent. We don't want our main agent to have to handle those super simple
102:06 tasks. You can also just use a really tiny LLM. Like, this would be the perfect
102:11 case to actually use a super tiny LLM, even something like DeepSeek R1 1.5B,
102:15 because it's just such a simple task. Otherwise, though, we are going to
102:20 go to our main agent. And so I'm not going to dive into like all these nodes
102:24 in a ton of detail, but basically we are expecting the chat input to contain
102:30 the prompt for our agent. And the way that we know to expect chat input
102:35 specifically is because going back to the settings for the function here with
102:38 the valves, we are saying right here chat input. So you want to make sure
102:43 that the value that you put in here for input matches exactly with what you are
102:48 expecting from our web hook. And so chat input is the one that I have by default.
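That routing check is simple enough to sketch in a few lines of Python. Treat the exact "### Task" prefix string and the chatInput field name as assumptions to verify against your own Open Web UI version; they follow the pattern described above.

```python
# Three pound symbols, a space, then the word "Task" -- the prefix Open Web UI
# puts on metadata prompts (titles, tags) as described in this walkthrough.
METADATA_PREFIX = "### Task"

def is_metadata_request(chat_input: str) -> bool:
    """True when the prompt is a title/tags request, not a real user message."""
    return chat_input.startswith(METADATA_PREFIX)

def route(payload: dict) -> str:
    """Decide which branch of the workflow a webhook payload should take."""
    chat_input = payload.get("chatInput", "")
    return "metadata_llm" if is_metadata_request(chat_input) else "main_agent"
```

So a payload like {"chatInput": "### Task: Generate a title"} goes to the small metadata LLM, while an ordinary question goes to the main agent.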
102:51 So you can just copy me if you want. Then we go into our agent where we're
102:55 hooked into Ollama and we've got our local Supabase. I already showed you
102:58 how to connect up all this and that looks exactly the same. The only thing
103:01 that is different now is we have a single tool to search the web with
103:07 SearXNG. So it's a web search tool. I have a description here just telling it what
103:11 is going to get back from using this tool. And then for the workflow ID, this
103:16 is, if I go to add a node here and search for workflow tools: the Call n8n
103:23 Workflow Tool. So this is basically taking an N8N workflow and using it as a
103:28 tool for our agent. So this is the node that we have right here. But then I'm
103:32 referencing the ID of this N8N workflow. So this ID because I'm going to just
103:37 call the subworkflow that I have defined below. And again, I don't want to dive
103:40 into all the details of N8N right now and how this all works, but the agent is
103:44 going to decide the query. What should I search the web with? It decides that and
103:49 then it invokes this subworkflow here where we have our call to SearXNG. So the
103:54 name of the container service in our Docker Compose stack is just searxng,
103:59 and it runs on port 8080. And then if you look at the SearXNG documentation, you
104:03 can look at how to invoke their API and things like this. So I'm just doing a
104:07 simple search here and then there are a few different nodes because what I want
104:10 to do is I want to split out the search and actually I can show you this by
104:14 going to an execution history where we're actually using this tool. So take
104:18 a look at this. So in this case the LLM decided to invoke this tool and the
104:23 query that it decided is current price of the 5090 GPU. So this is going along
104:28 with the conversation that we had last in Open Web UI. We get some results from
104:33 SearXNG, which is just going to be a bunch of different websites. And so, we don't
104:37 have the answer quite yet. We just have a bunch of resources that can help us
104:41 get there. And so, I'm going to split out. So, we have a bunch of different
104:45 websites. We're going to now limit to just one. I just want to pull one
104:48 website right now just to keep it really, really simple because now we're
104:52 going to actually visit that website. I'm going to make an HTTP request to
104:57 this website, which yeah, I mean, if it's literally an Nvidia official site
105:01 for the 5090, like this definitely has the information that we need. We're
105:04 going to make a request to it, and then we're also going to use this HTML node
105:08 to make sure that we are only selecting the body of the site. So, we take out
105:12 all the footers and headers and all that junk. So, we just have the key
105:15 information. And then that is what we aggregate and then return back to our AI
105:19 agent. So it now has the content, the core content of this website to get us
105:24 that answer. That is how we invoke our web search tool. And then at the very
105:29 end, we're just going to set this output field. And that's going to be the
105:33 response that we got back either from like generating a title or calling our
105:37 main agent. And this is really important: whatever we call the output field
105:41 here, we have to make sure that it corresponds to this value, which is the
105:46 last thing we have to set in the settings for our Open Web UI
105:50 function. So "output" here has to match "output" here, because that is what
105:55 we're going to return in this respond to web hook. Whatever open web UI gets back
105:59 it's getting back from what we return right here. So that is everything for
106:03 our agent. I could probably dive in quite a bit more into explaining how
106:06 this all works and building out a lot more complex agents, which I definitely
106:10 do with local AI in the Dynamous AI Agent Mastery course. So check that out if you
106:13 are interested. I just wanted to give you a simple example here showing how we
106:17 can talk to our different services like Ollama, Supabase, and SearXNG. And then
106:22 also open web UI as well. So once you have all these settings set, make sure
106:26 of course that you click on save. It's very, very important. These two things
106:29 at the bottom don't really matter, by the way. But yeah, click on save once
106:32 you have all of the settings there. And then you can go ahead and have a
106:36 conversation with your agent just like I did when I was demoing things before we
106:40 dove into the workflow. And by the way, this N8N agent that works with Open Web
106:44 UI, I have as a template for you. You can go ahead and download that in this
106:48 GitHub repository where I'm storing all the agents for this masterclass. So we
106:52 have the JSON for it right here. You can go ahead and download this file. Go into
106:57 your N8N instance. Click on the three dots in the top right once you've
107:01 created a new workflow. Import from file and then you can bring in that JSON
107:03 workflow. You'll just have to set up all your own credentials for things like
107:07 Ollama and Supabase and SearXNG. But then you'll be good to go, and you can just go
107:11 through the same process that I did setting up the function in open web UI
107:15 and within like 15 minutes you'll have everything up and running to talk
107:20 to N8N in open web UI. Next up I want to create now the Python version of our
107:25 local AI agent. And so this is going to be a one-to-one translation. Exactly what
107:30 we built here in N8N, we are now going to do in Python. So I can show you how to
107:34 work with both no-code and code with our local AI package. And so this GitHub
107:39 repo that has the N8N workflow we were just looking at and that OpenAI
107:43 compatible demo we saw earlier, this has pretty much everything for the agent. So
107:46 most of this repository is for this agent that we're about to dive into now
107:50 with Python. And in this readme here, I have very detailed instructions for
107:54 setting up everything. And a lot of what we do with the Python agent, especially
107:57 when we are configuring our environment variables, it's going to look very
108:01 similar to a lot of those values that we set in N8N. Like we have our base URL
108:05 here, which you'd want to set to something like http://ollama:11434.
108:09 We just need to add this /v1 on the end, which I guess is a little bit different,
108:14 but yeah, I've got instructions here for setting up all of our environment
108:18 variables, our API key, which you can actually use OpenAI or Open Router as
108:22 well with this agent, taking advantage of the OpenAI API compatibility. This is
108:27 a live example of this because you can change the base URL, API key, and the
108:31 LLM choice to something from Open Router or OpenAI, and then you're good to go
108:35 immediately. It's really, really easy. We will be using Ollama in this case, of
108:39 course, though. And then you want to set your Supabase URL and service key.
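As a rough sketch of what that configuration boils down to, here's a small settings loader with illustrative defaults. The variable names and default URLs here are assumptions mirroring the values discussed, not the repo's exact .env schema, so check them against the readme.

```python
def load_settings(env: dict) -> dict:
    """Resolve agent settings from an environment mapping, with local-stack defaults.
    All names and defaults are illustrative, not the repo's exact schema."""
    return {
        # OpenAI-compatible base URL; note the /v1 suffix Ollama expects.
        "llm_base_url": env.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        # Ollama ignores the API key, but OpenAI/OpenRouter need a real one.
        "llm_api_key": env.get("LLM_API_KEY", "ollama"),
        "supabase_url": env.get("SUPABASE_URL", "http://localhost:8000"),
        "supabase_service_key": env.get("SUPABASE_SERVICE_KEY", ""),
        "searxng_base_url": env.get("SEARXNG_BASE_URL", "http://localhost:8081"),
        # Just the token: whatever came after "Bearer " in the N8N credential.
        "bearer_token": env.get("BEARER_TOKEN", "test auth"),
    }
```

Swapping to OpenRouter or OpenAI is then just overriding LLM_BASE_URL, LLM_API_KEY, and the model name, which is the OpenAI API compatibility point made above.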
108:42 You can get that from your environment variables. Same thing with SearXNG with
108:47 that base URL. We'll set that just like we did in N8N. We have our bearer token,
108:51 which in our case was "test auth". It's just whatever comes after the "Bearer" and
108:55 the space. And then the OpenAI API key you can ignore. That's just for the
108:58 compatible demo that we saw earlier. This is everything that we need for our
109:02 main agent now. And so we're using a Python library called FastAPI to turn
109:09 our AI agent into an API endpoint just like we did in N8N. And so FastAPI is
109:13 kind of what gives us this web hook, both with the entry point and the exit for
109:17 our agent and then everything in between is going to be the logic where we are
109:21 using our agent. And I'm going to be using Pydantic AI. It's my favorite AI
109:25 agent framework with Python right now. Makes it really easy to set up agents,
109:29 so we'll dive into that here. And I don't want to get into the
109:32 nitty-gritty of the Python code here because this isn't a master class on
109:36 specifically building agents. I really just want to show you how we can be
109:40 connecting to our local AI services. This agent is 100% offline. Like I could
109:46 cut the internet to my machine and still use everything here. So we create our
109:50 Supabase client and the instance of our FastAPI app. I have some models
109:55 here that define the requests coming in and the response going out. So we have
109:59 the chat input and the session ID just like we saw in N8N. And then the output
110:04 is going to be this output field. And so that corresponds to exactly what we're
110:07 expecting with those settings that we set up in the function in open web UI.
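Shape-wise, the request and response this endpoint exchanges with Open Web UI look roughly like the sketch below. It uses stdlib dataclasses rather than the Pydantic models in the actual repo, and the chatInput/sessionId/output field names are assumptions following the convention described here.

```python
from dataclasses import dataclass

@dataclass
class ChatRequest:
    """What the webhook/endpoint receives per message."""
    chatInput: str   # the user's prompt, or the "### Task ..." metadata prompt
    sessionId: str   # keys the conversation history rows in the database

@dataclass
class ChatResponse:
    """What goes back to the caller."""
    output: str      # must match the "output" valve in the Open Web UI function

req = ChatRequest(chatInput="Hi there!", sessionId="session-1")
resp = ChatResponse(output="Hello from the agent.")
```

The one field name that matters most is output: if it doesn't match the output valve in the Open Web UI function, the UI gets back nothing.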
110:11 So this Python agent is also going to work directly with open web UI. And then
110:16 we have some dependencies for our Pydantic AI agent because it needs to have an
110:21 HTTP client and the SearXNG base URL to make those requests for the web search
110:25 tool. And then we're setting up our model here. It's an OpenAI model, but we
110:29 can override the base URL and API key to communicate with Ollama or OpenRouter as
110:34 well, like we will be doing. And then we create our Pydantic AI agent, just getting
110:38 that model based on our environment variables. I've got a very simple system
110:43 prompt and then the dependencies here because we need that HTTP client to talk
110:48 to SearXNG. And then I'm just allowing it to retry twice. So if any kind
110:51 of error comes up, the agent can retry automatically, which is one of the
110:55 really awesome things that we have in Pydantic AI. And then I'm also creating a
111:01 second agent here. This is the agent that is going to be responsible like we
111:05 have in N8N for handling the metadata for Open Web UI, like conversation titles
111:10 and tags for our conversation. And so it's an entirely separate agent because
111:14 it just has another system prompt. In this case, I'm just doing something
111:18 really simple here. We don't have any dependencies for this agent because it's
111:21 not going to be using the web search tool. And then for the model, I'm just
111:25 using the exact same model that we have for our primary agent. But like I shared
111:30 with N8N, you could make it so that this is like a much smaller model, like a one
111:33 or three billion parameter model because the task is just so basic or maybe like
111:38 a 7 billion parameter model. So you can tweak that if you want. Just for
111:41 simplicity's sake, I'm using the same LLM for both of these agents.
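Both agents will lean on SearXNG for search, so here's roughly what the query URL underneath that tool looks like. This is a sketch: the /search path with format=json is SearXNG's JSON API as I understand it, and the searxng:8080 base URL assumes the Docker Compose service name from earlier.

```python
from urllib.parse import urlencode

def build_search_url(base_url: str, query: str) -> str:
    """Build a SearXNG JSON search URL; the caller GETs it and reads 'results'."""
    params = urlencode({"q": query, "format": "json"})
    return f"{base_url}/search?{params}"

# The agent decides the query; inside the stack the base URL is the service name.
url = build_search_url("http://searxng:8080", "current price of the 5090 GPU")
```

The response is a JSON object whose results list holds candidate pages; the tool then fetches one or a few of those pages for the actual content.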
111:46 And then we get to our web search tool. So in Pydantic AI, the way that you give a
111:51 tool to your agent is you do @, then the name of your agent, then
111:55 .tool, and then the function that you define below this is now going to be
112:00 given as a tool to the agent. And then this description that we have in the doc
112:04 string here that is given as a part of the prompt to your agent. So it knows
112:09 when and how to use this tool. And so the exact, you know, details of how
112:13 we're using SearXNG here, I won't dive into, but it is the exact same as what
112:17 we did in N8N, where we make that request to the search endpoint of SearXNG. We
112:22 go through the page results here. We limit to just the top three results or I
112:26 could even change this to make it even simpler and just the top result. So we
112:30 have the smallest prompt possible to the LLM. And then we get the content of
112:34 that page specifically. And then we return that to our AI agent with some
112:40 JSON here. So now once it invokes this tool, it has a full page back with
112:44 information to help answer the question from the user. It has that web search
112:47 complete now. And then we have some security here to make sure that the
112:51 bearer token matches what we get into our API endpoint. So that's that header
112:55 authentication that we set up in N8N. So this part right here where we're
112:59 verifying the header authentication that corresponds to this verify token
113:03 function. And then we have a function to fetch conversation history to store a
113:08 new message in conversation history. So both of these are just making requests
113:12 to our locally hosted Supabase using that Supabase client that we created
113:16 above. And then we have the definition for our actual API endpoint. And so in
113:23 N8N, we were using invoke-n8n-agent for our path to our agent. So this was our
113:29 production URL. In this FastAPI endpoint, our endpoint is
113:34 /invoke-python-agent. And then we're specifically expecting the chat input
113:39 and session ID. So that is our chat request type right here. And
113:43 then sorry I highlighted the wrong thing. We have our response model here
113:46 that has the output field. So we're defining the exact types for the inputs
113:50 and the outputs for this API endpoint. And then we're also using this verify
113:55 token to protect our endpoint at the start. And then the key thing here, if
113:59 the chat input starts with that task, then we're going to call our metadata
114:02 agent. And so it's just going to spit out the title or the tags, whatever that
114:06 might be. Otherwise, we're going to fetch the conversation history, format
114:11 that for Pydantic AI, store the user's message so that we have that
114:15 conversation history stored, create our dependencies so that we can communicate
114:20 with SearXNG, and then we'll just do agent.run. We'll pass in the latest
114:24 message from the user, the past conversation history and the
114:27 dependencies that we created. So it can use those when it invokes the web search
114:31 tool. And then we just get the response back from the agent, and you can
114:34 print that out in the terminal as well, and then we'll just store it in
114:38 Supabase and then return the output field. So I'm going kind of fast here.
114:42 There are definitely a lot more videos on my channel where I break down in more
114:45 detail building out agents with Pydantic AI and turning them into API endpoints
114:49 and things like that. But yeah, just going a little bit faster here.
114:52 And then the last thing is with any kind of exception that we encounter, we're
114:55 just going to return a response to the front end saying that there was an issue
114:59 and then specifying what that is. And then we are using Uvicorn to host our
115:06 API endpoint specifically on port 8055. So that is everything for our Python
115:11 agent exactly the same as what we set up in N8N. And now going to the readme
115:16 here, I'll open up the preview. The way that we can run this agent, we just have
115:20 to open up a terminal just like we did with the OpenAI compatible demo. I've
115:25 got instructions here for setting up the database table, which is using
115:30 the same table as the one in N8N and N8N creates it automatically. So, if you
115:34 have already been using the N8N agent, you don't actually have to run this SQL
115:38 here. And then you want to obviously set up your environment variables like
115:42 we covered. Then activate your virtual environment and install the requirements
115:46 there. And then you can go ahead and run the command python main.py.
115:52 And so this will start the API endpoint. So it'll just be hanging here because
115:56 now it's waiting for requests to come in on port 8055. And so what I can do is
116:03 I can go back to open web UI. I can go to the admin panel functions. Go to the
116:08 settings. I can now change this URL. So everything else is the same. I have my
116:11 bearer token, the input field, and the output field the same as N8N. The only
116:16 thing I have to change now is my URL. And so I know this is an N8N pipe and I
116:20 have N8N in the name everywhere, but this does work with just any API
116:23 endpoint that we have created with this format here. And so I'm going to say for
116:27 my URL, it's actually going to be host.docker.internal, because I have my API endpoint for
116:33 Python running outside on my host machine. So I need my open web UI
116:38 container to go outside to my host machine. And then specifically the port
116:43 is going to be 8055. And then the endpoint here, I'm going to
116:46 delete this web hook here, because it's /invoke-python-agent. Take a look at that. All right.
116:52 Boom. So I'm going to go ahead and save this. And then I can go over to my chat.
116:57 And it says N8N agent connector here still. But this is actually talking to
117:00 my Python agent now. So I'll go ahead and start by asking it the exact same
117:05 question that I asked the N8N agent. And I do have this pipe set up to always say
117:08 that it's calling N8N, but this is indeed calling our Python API endpoint.
117:12 And we can see that now. So there we go. We got all the requests coming in, the
117:16 response from the agent, and then also the metadata for the title and the tags
117:20 for the conversation. Take a look at that. So we got our title here. We have
117:24 our tags, and then we have our answer. It's a starting price of $2,000, though
117:28 it's a lot more right now; the starting price is kind of
117:32 misleading. But yeah, this is a good answer, and it did use SearXNG to do
117:36 that web search for us. This is really, really neat. Now, the last thing that we
117:40 want to do for our Python agent before we can work on deploying things to the
117:45 cloud, to a private server in the cloud, is we want to containerize it. Now, the
117:49 reason that we want to do this, and this is the Docker file that I have set up to
117:53 turn our Python agent into a container, just like our local AI services, is if
117:59 we have our agent containerized, then we can have it communicate within the
118:03 Docker network just like we have our different local AI services
118:07 communicating with each other. Because right now running directly with Python
118:12 to communicate with Ollama, for example, we need our URL to be localhost, not
118:18 ollama. Remember, you can only use the specific name of the container service
118:23 when you are within the docker compose stack. And so we'd have to actually say
118:28 localhost right now. But if we add the container for the agent into the stack
118:32 as well, then we can communicate directly within the private network.
118:36 Like, I can just say ollama, and then for SearXNG I could use this URL instead. Right now,
118:41 we have to actually use localhost port 8081. And so it's really nice for
118:46 security reasons, and just to make your deployment a nice package, to have
118:51 the agents that you're running in the same network as your infrastructure. And
118:55 so that's what we're going to do right now. And so, within the readme that I
118:59 have for instructions on setting up everything, I have the instructions that
119:02 we follow to run things with Python. I also have the instructions to run it
119:07 with Docker. And so all you want to do is run this single command. It's
119:11 actually very easy because I have this Docker file set up to turn our agent
119:15 into a container and I've got security and everything taken care of. We're
119:19 running it on port 8055 just like we did with Python. And then I have this very
119:24 simple Docker Compose file. It's just a single service that we're going to tack
119:28 on to all of the other services that we already have running for the local AI
119:32 package. And I'm calling this one the Python local AI agent. And so we're
119:36 using all of our environment variables from our ENV just like we did with the
119:40 local AI package. And then what I have at the top here is I am including the
119:45 docker compose file for the local AI package. So that just kind of solidifies
119:49 the connection there. Otherwise, you'll get this kind of weird error that says
119:52 there are orphaned containers when you run this, even though they aren't
119:55 actually. And so this is optional. You'll just get an orphan container
119:58 warning that you can ignore. But if you don't want to have that warning, you can
120:01 include this right here. You just have to make sure that this path corresponds
120:06 to the path to your docker compose in the local AI package. So in my case, I
120:10 just had to go up two directories and then go into the local AI package
120:13 folder. So yeah, this is optional, but I wanted to include this here just to
120:18 make things tip-top shape for you. So yeah, this is the Docker Compose. And
120:21 then what we can do now is I'll go back over to my terminal and I will paste in
120:25 this command. And what this will do is it will start or restart my Python local
120:31 AI agent container. And make sure that you specify this here because if you
120:35 don't then it's going to try to rebuild the entire local AI package because we
120:38 have this include. So very important you want to just rebuild or build for the
120:43 first time this agent container. And so I'll go ahead and run this and it's
120:47 going to give me those Flowise warnings. So I don't have my username
120:49 and password set, but remember we can ignore those. But anyway, it's going to
120:52 build the Python local AI agent container here. And there's a couple of
120:56 steps that it has to do. It has to update some internal packages and then
121:00 also install all of the pip packages we have for our Python requirements for
121:04 things like FastAPI and Pydantic AI. So it'll take a little bit to complete,
121:07 usually just a minute or two. And so I'll go ahead and pause and come back
121:11 once this is done. And there we go. Your output should look something like this.
121:14 It goes through all the build steps and then it says that the container is
121:18 started at the bottom. And this is now in our local AI Docker Compose stack. So
121:23 going back over to Docker Desktop. It'll take a little bit to find it here
121:26 because there are so many services that we have here. But if we scroll down,
121:29 okay, there we go. At the bottom here, we have our Python local AI agent
121:34 waiting on port 8055 just like when we ran it directly with Python, but now it
121:38 is within a container that is directly within our stack. And so now, like I was
121:41 saying, this is super important, so I'm hitting on it again. Now when we set up
121:45 our environment variables for our container, we are going to be
121:49 referencing the service names of our different local AI services, like searxng or
121:55 ollama, instead of localhost. And so this whole localhost versus
121:58 host.docker.internal versus service name thing, that's what I see people get tripped up on
122:03 the most when they're configuring different things in a Docker environment
122:06 like the local AI package. That's why I'm spending a good amount of time
122:09 really hammering that in because I want you to get it right. And of course, if
122:12 you have any issues that come up with this, just let me know. I'd love to help
122:16 walk you through what exactly your configuration should look like. And so,
122:20 we have our agent now up and running in a container. And I'm not going to go and
122:23 demo this again right now because the next thing that we're going to move into
122:26 doing and then I'll give a final demo here is deploying everything to a
122:30 private server that we have in the cloud. All right. So, we have really
2:02:31 Introduction to Deploying & Cloud Provider Options
122:34 gotten through all of the hard stuff already. So, if you have made it this
122:38 far, congratulations. You really have what it takes to start building AI
122:43 agents with local AI now and the sky is the limit for what you can accomplish.
122:47 And so the last thing that I want to really focus on here in this master
122:50 class is taking everything that we've been building on our own computer with
122:54 our infrastructure and our agents and deploying it to a private machine in the
122:59 cloud because then we can have our entire infrastructure and agents running
123:02 24/7. We don't have to rely on having our own computer up all the time. It's
123:05 really nice to have it there because then we can also share it with other
123:10 people as well. So local AI is still considered local as long as it's running
123:14 on a machine in the cloud that you control. And so this is not just going
123:18 to OpenAI or Supabase and paying for their API. This is still us running
123:22 everything ourselves on a private server. That's what I'm going to show
123:25 you how to do right now. And this process that I cover with you is going
123:29 to work no matter the cloud provider that you end up choosing. And there are
123:32 some caveats to that that I'll explain in a little bit. But yeah, you can pick
123:36 from a lot of different options. So the cloud platform that we will be deploying
123:41 to today is Digital Ocean. I use Digital Ocean a lot. It's where I deploy most of
123:46 my AI agents. So I highly recommend it. And the best part about Digital Ocean is
123:51 they have both GPU machines if you need to have a lot of power for your local
123:56 LLMs and they have very affordable CPU instances. If you want to deploy
124:00 everything for the local AI package except Olama, you can definitely go a
124:03 more hybrid route if you don't want to pay a lot because these GPU instances in
124:07 the cloud can be pretty expensive like one, two, even $5 per hour. So, what you
124:12 can do with a hybrid setup is deploy everything in the local AI package. So,
124:16 at least you have all that locally and you're not paying for those
124:19 subscriptions. But then you could still use something like OpenAI, Open Router
124:23 or Anthropic for your LLMs. So, Digital Ocean gives us the ability to do both,
124:27 and we'll dive into that when we set things up. Another really good option
124:33 for GPU instances is TensorDock. TensorDock is not as nice looking to me as
124:37 Digital Ocean. I generally feel like I have a better experience with Digital
124:40 Ocean, but I have deployed the local AI package to TensorDock before on a 4090
124:46 GPU that they offer for 37 cents an hour. It's very affordable for GPU
124:49 instances. And so, this is a good platform as well. And then also if
124:55 you're okay with not running Olama on a GPU instance, like you just want a very
124:59 affordable way to host everything in a local AI package except the LLMs, then
125:03 you can use Hostinger. Hostinger is another really, really good option. Super,
125:08 super affordable, like $7 a month for a KVM 2, which I'd recommend getting if you
125:14 want to deploy everything except Olama because the requirement for the local AI
125:17 package except for running the more resource intense local LLMs is you have
125:23 to have 8 GB of RAM. So don't get a cloud machine that has four or 2 GB. You
125:27 want to have 8 GB of RAM then you'll be good to go. So you can literally do it
125:31 for $7 a month through Hostinger and it's going to be something like $28 a
125:35 month through Digital Ocean unless you want a GPU instance. So I just want to
125:39 spend a couple minutes talking about different platform options. The one
125:43 thing I will say is that the local AI package runs as a bunch of Docker
125:47 containers, right? And so what you have to avoid is using a platform like
125:54 RunPod. So RunPod is a platform for running local AI. The problem is when
125:59 you pay for a GPU instance, you don't actually get the underlying machine.
126:05 You're just sshing into a container. So, you're accessing a container. And I'll
126:09 just save you the pain right now. It is so hard and basically impossible to run
126:15 Docker containers within Docker containers. So, you really can't run the
126:20 local AI package on RunPod. There are other platforms as well. Lambda Labs
126:25 is another one that I've used before, not for the local AI package but for other
126:28 things, and this also runs containers. You're accessing a container, so you
126:34 can't do the local AI package. Vast.ai is another option, but with this one too,
126:39 you're renting a GPU while you're accessing a container. So, you again
126:43 can't run the local AI package. And so, based on the platform that you choose,
126:47 you have to make sure that you are accessing the underlying machine when
126:51 you rent a GPU instance like Digital Ocean is the one that I will be using.
126:56 You could use a GPU instance through Google Cloud or Azure or AWS if you want
127:00 to go more enterprise. Those all give you access to the underlying machine.
127:04 It's your own private server just like we have in Digital Ocean. So you can use
127:07 that to deploy the local AI package and the agent that we built with Python. So
2:07:11 Deploying the Local AI Package to the Cloud
127:11 that's what we're going to do right now. So once you are signed into Digital
127:14 Ocean and you have your profile and billing set up and you have a project
127:18 created or you can just use the default one, now we can go ahead and create our
127:22 private server in the cloud to host the local AI package and our agent. And so
127:25 you can click on create in the top right. And there are two options here.
127:29 If you want a CPU instance, so that hybrid approach where you're hosting
127:34 everything except for the LLMs, you can select a droplet. Otherwise, what we're
127:37 going to be doing right now so I can demo the full thing is we will
127:42 create a GPU droplet. Now, these are going to be more expensive. Like I said,
127:48 like running an H100 GPU is $3.40 an hour. It's pretty expensive. But like I
127:52 said at the start of this master class, I know so many businesses that are
127:56 willing to put tens of thousands of dollars per year into running their own
127:59 infrastructure and LLMs. And that biggest cost that contributes to that
128:03 being tens of thousands of dollars is having a GPU droplet that is running 24/7
128:09 in the cloud. So the hybrid approach I definitely recommend if you don't want
128:12 to pay more, you could go as low as $7 a month with Hostinger. So, there's a very
128:16 wide range of options for you depending on what you want to pay hosting the
128:21 package, LLMs, and your agents. And the other thing I will say is that if you
128:25 want to, you could just create this instance for a day and poke around with
128:29 things and then tear it down. So, you only have to pay, you know, like 20
128:31 bucks or something like that. So, there's a lot of different options for
128:35 flexibility here. And so, I'm going to pick the Toronto data center because
128:38 there's more options for GPUs available here and it's relatively close to me.
128:42 And then for the image, I'll select AI/ML ready and it's recommended because
128:47 you get Linux bundled with all the required GPU drivers, and it runs the
128:52 Ubuntu distribution of Linux. And so this process that I'm going to walk you
128:55 through for deploying local AI package to the cloud is going to work for any
129:00 Ubuntu instance that you have running on AWS or Hostinger or TensorDock. It's just
129:05 a very standard distribution of Linux. And then for the GPU, there are a couple
129:09 of different options that we have here with Digital Ocean. H100 is an absolute
129:15 beast: 80 GB of VRAM, so it could easily run Q4-quantized large language models over 100
129:21 billion parameters, and it comes with 240 GB of RAM. So, I'm not going to run this one. I'm
129:24 just kind of pointing out that this is an absolute beast. I think the one that
129:27 I'm going to choose here is going to be the RTX 6000 Ada. So, it's 48 GB of
129:34 VRAM. So, this is enough to run 70 billion parameters or smaller of LLMs at
129:40 a Q4 quantization and it comes with 64 GB of RAM and it's going to be about
129:45 $1.90 per hour. So, I'm going to select this and then I have an SSH key that is
129:50 created already. If you don't have an SSH key, then you can click on this
129:54 button to add one. And then you can follow the instructions on the right
129:57 hand side here. No matter your OS, they got instructions to help you out. You
130:00 just have to paste in your public key and then give it a name. So, I've got
130:03 mine added already. And then the only other thing that I really have to select
130:07 here is a unique name. So, I'll just say local AI package. And I'll just say GPU
130:11 because I already have the regular version just deployed on a CPU instance.
130:16 And then for my project, I will select Dynamous. I can add it along with my
130:19 other instance that I've got up and running. And there we go. So now I can
130:22 go ahead and just click on create GPU droplet. It is that easy to get our
130:27 instance ready for us to access it and start installing everything and getting
130:30 everything up and running just like we did on our computer. And so I'll go
130:34 ahead and pause and come back once our machine is created in just a few
130:37 minutes. And boom, there we go. Just after a minute, we have our GPU droplet
130:42 up and running. And so the one thing I will say is I had to request access to
130:46 create a GPU instance on the Digital Ocean platform. However, they approved
130:50 it in less than 24 hours. So, it's very easy to get that if you do want to
130:53 create a GPU instance. Otherwise, you can just create one of their normal
130:58 droplets, one of their CPU instances. Now, before we connect to this machine,
131:01 the one thing that you want to take note of is the public IPv4 address. We'll use
131:06 this to set up subdomains in a little bit. And so, this is how we get to it
131:12 for a GPU droplet. For a CPU droplet, it looks a little bit different. You'll
131:15 usually see the IPv4 somewhere at the top right here. And so, take note of
131:18 that. Save it for later. We'll be using that in a little bit. And then to
131:22 connect to our droplet, we can either do it through SSH with our IPv4 and the SSH
131:27 key that we set up when we configured this instance or we can access the web
131:31 console. For a CPU instance, usually you go to like an access tab and then you
131:35 can launch the web console. For the GPU instance, I can just click this to
131:38 launch it right here. So we have a separate window that comes up and boom,
131:42 we now have access to our instance. It is that easy to get connected. And now
131:46 we can go through the same process that we did on our computer to install the
131:51 local AI package. Now there are some different steps that we have to take and
131:55 that's why I'm including this at the end of the master class especially. Um so if
131:59 you scroll down in the read me here there are some specific instructions for
132:03 deploying to the cloud. And so you have to make sure that you have a Linux
132:07 machine preferably on the Ubuntu distribution which is that is what we
132:11 are using. And then there are a couple of extra steps. And so the first thing
132:15 that we have to set up is our firewall. We have to open up a couple of ports so
132:20 that we can access our machine from the internet. Set up our subdomains for
132:24 things like n8n and Open WebUI. And so you want to take this command, ufw
132:29 enable. I'll just go ahead and paste it in, and you can
132:33 just type Y to continue here. It warns it may disrupt SSH connections, but we don't
132:36 really care. So ufw enable enables the firewall. And then you want
132:41 to copy this command to allow both ports 80 and 443. And so 80 is HTTP and then
132:48 443 is HTTPS. And then you can just do the last command here, UFW reload. So
132:54 now we have those ports available for us to communicate with all of our services.
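In shell form, the firewall setup just described looks roughly like this (you'll need sudo if you're not logged in as root; DigitalOcean droplets log you in as root by default):

```shell
ufw enable          # turn the firewall on (type y at the prompt)
ufw allow 80/tcp    # HTTP  - the entry point for Caddy
ufw allow 443/tcp   # HTTPS - the entry point for Caddy
ufw reload          # apply the new rules
ufw status          # sanity check: 80 and 443 should be listed as allowed
```

Everything else stays closed, so all outside traffic has to come in through those two ports.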
132:59 And so this is the entry point to Caddy. Caddy is the service in the local AI
133:03 package that is going to allow us to set up subdomains for all of our services.
133:08 And so any kind of communication to our droplet is going to go through Caddy,
133:13 and then Caddy will distribute to our different services based on the port or
133:17 the subdomain that we are using. So this is called a reverse proxy. You've
133:21 probably heard of NGINX or Traefik before. Caddy is something very
133:25 much like that. And it also makes it so that we can get HTTPS so we can have
133:29 secure endpoints set up automatically and it manages that encryption for us.
133:33 It's a very beautiful platform. So, let me scroll back down to the specific
133:36 steps for deploying to the cloud. There is a quick warning here about how Docker
133:40 manages ports and things, but this is as secure as we possibly can make it. So,
133:43 trust me, we've put a lot of effort and I actually had someone from the Dynamis
133:48 community, um, Benny right here. He actually helped me a lot with security
133:53 for the local AI package. So, thank you, Benny, for helping out with that. We're
133:56 really making sure that because local AI like the whole thing is like you want to
133:59 be private and secure and so we're making sure that this package handles
134:03 all the best practices for that. So very much top of mind for us. Um and then we
134:07 can go ahead and go through the usual steps for setting up the local AI
134:11 package. The only other thing we have to do that's unique for cloud is we have to
134:16 set up A records with our DNS provider so we can have our subdomains set up for our
134:20 different services. So we'll get into that in a little bit. But first I just
134:24 want to get the local AI package up and running. And so I'm going to go ahead
134:28 and paste this command here to clone the repository. Git comes automatically
134:33 installed with our GPU droplets. And then I can change my directory into the
134:38 local AI package. And let me zoom in on this a little bit here so it's very easy
134:41 for you to see, because now what I want to do is copy
134:45 the .env.example file to a new file called .env. So then if I do an ls command so we can see all the
134:51 files that are available in our directory... I guess it doesn't show dotfiles, so
134:56 I do ls -a there. Now we can see the .env and .env.example. And so now I can do nano .env.
135:05 This is going to give us a basically a text editor directly in the terminal
135:08 here so that we can set all of our environment variables just like we did
135:12 with the local AI package on our computer. So this time I'm not going to
135:15 go into the nitty-gritty details of setting up all these environment
135:19 variables because it is the exact same process. In fact, you can literally
135:23 reuse all of the secrets that you already set up when you hosted it on
135:27 your own computer as long as those are actually secure. Like I know a lot of
135:29 times you might just do some kind of placeholder stuff when it's just running
135:32 on your computer and then you want it more secure in the cloud. So make sure
135:36 you have real values for everything, but yeah, you can reuse a lot of the same
135:39 things. Um though for best security practice, you probably do want to make
135:42 everything different. But yeah, the one thing that I want to focus on with you
135:46 here that does change is our configuration for Caddy. Now that we are
135:50 deploying to the cloud, we want to have subdomains for our different services.
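For reference, the Caddy section of the .env ends up looking something like this. The exact variable names come from the package's .env.example, so treat these names and hostnames as illustrative:

```shell
# Subdomains served through Caddy (uncomment only what you want exposed)
N8N_HOSTNAME=n8n-yt.yourdomain.com
WEBUI_HOSTNAME=openwebui-yt.yourdomain.com
SUPABASE_HOSTNAME=supabase-yt.yourdomain.com
LETSENCRYPT_EMAIL=you@yourdomain.com
# Leave the Ollama and SearXNG hostnames commented out -
# those services stay internal-only
```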
135:54 And so like n8n, for example, we want to set the hostname for that. You want to
135:59 do the same thing for n8n, Open WebUI, and Supabase, and then your Let's Encrypt
136:04 email. Obviously you want to uncomment that as well, because this is the email
136:08 that you want to set for your SSL encryption. And so I'm just going to do
136:13 cole@dynamous.ai. And you can just set this to whatever email you want to use. And so
136:18 basically what you want to do here is just uncomment the line for each of the
136:23 services that you want to have subdomains created for. So, if you're
136:26 also using Flowwise, which I'm just not in this master class here, but if you
136:30 are, then you want to uncomment this line as well. If you're using Langfuse,
136:34 then you uncomment this line. I'm going to leave them commented right now just
136:38 for simplicity's sake. The two that I would generally recommend never
136:43 uncommenting are Ollama and SearXNG. We don't really want to expose them through
136:45 a subdomain because we're just going to use them as internal tooling for our
136:49 agents and applications that we have running on this server. So, we want to
136:53 keep those nice and private. But for everything that we do want to expose
136:56 that is protected with a username and password, like n8n, Open WebUI, and
137:01 Supabase, we can uncomment those. And so we got that set up now. But we have to
137:05 obviously provide real values for them as well. And so for example, I'm just
137:09 going to say n8n-yt, YT for YouTube, .dynamous.ai. So you want to
137:17 define the exact URL that you want to have for this domain. Obviously, it has
137:21 to be a domain that you control, because we'll go and we'll set up the records in
137:25 the DNS in a little bit. And so, I'll do the exact same thing for Open WebUI.
137:28 So, it's openwebui- and I'm just adding -yt because I already have the local AI
137:33 package hosted on my domain. And so, I can't do just openwebui because it's
137:36 already taken. So, openwebui-yt.dynamous.ai. And then finally, for Supabase, it'll
137:44 be supabase-yt.dynamous.ai. Boom. All right. There we go. So, go
137:47 ahead and take care of this. set up the rest of your environment variables and
137:51 then we can go ahead and move on. And the way that you exit out of this and
137:55 save your changes is you do Ctrl+X (or Command+X on Mac), type Y, and then hit
138:03 Enter. So again, that is Ctrl+X, then type Y, then press Enter. That is how you
138:08 exit out. And so now if I do a cat of the .env, this is how you can print it out
138:11 in the terminal. So we can verify that the changes that we made, like everything
138:16 for Caddy, are indeed made. So do that. Also change all the other environment
138:18 variables as well. I'm going to do that off camera and come back once that is
138:22 taken care of. All right. So my environment variables are all set. Now
138:26 the very last thing we have to do before we start all of our services is we need
138:32 to set up DNS. And so remember, copy your IPv4 address and then head on over
138:38 to your DNS provider. And so this process is going to look very similar no
138:42 matter the DNS provider that you have. Like I'm using Namecheap here. A lot of
138:47 people use Hostinger or Bluehost. You are able to with all these providers go
138:50 to something that is usually called like advanced DNS or manage DNS and then you
138:55 can set up custom A records here which we're going to do to set up a connection
139:01 for all of our subdomains to the IPv4 of our digital ocean droplet or whatever
139:05 cloud provider you are using. And so I'm going to go ahead and click on add new
139:09 record. It's going to be an A record, and for the host, it's going to be the subdomain
139:14 that I want. So, n8n-yt, for example. And then for the IP address, I just paste in
139:19 the IPv4 of my Digital Ocean GPU droplet. And then I'll go ahead and
139:23 click on the check here to save changes. And then I'll just go ahead and do the
139:28 same thing for openwebui-yt. I can't forget the -yt. There we go. And
139:33 then for you, it might be more than just three, but for me, the only other one
139:36 that I have here right now is Supabase because I'm just keeping it very, very
139:41 simple. So, supabase-yt, and then paste in the IP again. Okay, there we go.
139:45 Boom. All right, so we have all of our records set up. And it's very important
139:49 to do this before you run things for the first time because otherwise Caddy gets
139:52 very confused. It tries to use these subdomains that you don't actually have
139:57 set up yet. And so take care of that. Then we can go back into our instance
140:01 here and run the last command. And so going back to the readme here, if I
140:06 scroll all the way down to deploying to the cloud, there is a specific parameter
140:11 that you want to add. This is very important for deploying to the cloud
140:15 because when you select the environment of public, it's going to close off a lot
140:20 more ports to make this very very secure. So any of the services that you
140:25 access from outside of the droplet, it has to go through caddy. So we use the
140:30 reverse proxy as the only entry point into any of our local AI services. This
140:34 is how we can make things as airtight as possible. We have security in mind like
140:39 I said. And so make sure that you run this command with the environment
140:43 specified. We didn't do this locally because when we are running things
140:45 locally, we don't care about security as much because it's not like our machine
140:49 is accessible to the internet like a cloud server is. And so it defaults to
140:54 the environment of private which just doesn't do as much security stuff. And
140:57 so go ahead and run this. And then of course make sure that you're using the
141:01 correct profile. So if you're using just a CPU instance with a regular droplet or
141:05 Hostinger or whatever, you'd want to change this to cpu instead of gpu-nvidia.
141:10 But in our case, because we are paying roughly $2 an hour for a killer GPU
141:16 droplet, I can go ahead and run this command with the profile of gpu-nvidia.
141:20 Now, I left this error in here intentionally because I want to show you
141:24 what it looks like. If you get "unknown shorthand flag: 'p'", that means that you
141:29 don't actually have Docker Compose installed. And this happens for some
141:32 cloud providers. And there's a very easy fix for this that I want to walk you
141:35 through. So you can even test this. Just docker compose. It'll say that compose
141:40 is not a docker command. And so going back to the readme here, I have a couple
141:44 of commands that you just have to run if this happens to you. This is at the
141:47 bottom of the deploying to the cloud section of the readme. So you can just
141:51 copy these one at a time, bring them into your droplet or your machine,
141:55 wherever you're hosting it, and go ahead and run them. And so I'm just going to
141:58 do this off camera really quickly. I'm just going to copy each of these into my
142:02 droplet. It's very easy. You can just run all of these. They're really fast as
142:06 well. So none of them are going to take very long. This is just going to get
142:10 everything ready for you so that Docker Compose is a valid command. So you can
142:14 then run the start services script. So there we go. All right. I went ahead and
142:18 ran all of those. I'll clear my terminal again and then go back to the main
142:23 command here to start our services. Boom. All right. So now we pulled
142:26 everything from Supabase, set up our SearXNG config. Now we are pulling our
142:31 Supabase containers. So again, same process as running on our computer, where
142:35 it'll pull Supabase, it'll run everything for Supabase, then it'll
142:38 pull and run everything for the rest of our services. So I'll pause and come
142:42 back once this is all complete. And boom, there we go. We have all of our
142:46 services up and running. You should see green check marks across the board like
142:51 this. We are good to go. And we don't have Docker Desktop, so it's not as easy
142:55 to dive into the logs for our containers. But one quick sanity check
142:59 that you can do just in the terminal is run the command docker ps -a. This will
143:03 give you a list of all of our containers that are running here. We can make sure
143:06 that all of them are running, that we don't see any that are constantly
143:10 restarting or ones that are down. So we do have two that are exited, but these
143:14 are the n8n import and the Ollama pull containers. These are the two that we know should be
143:17 exited. Just make sure that everything is good to go. Then we can head on over
2:23:18 Testing Our Deployed Local AI Package
143:22 to our browser. And because we have DNS set up already and we configured Caddy, we
143:25 can now navigate to our different services. Like I can go to
143:30 n8n-yt.dynamous.ai. And boom, there we go. It's having us set up our owner account. Or we can
143:39 just go to openwebui-yt.dynamous.ai. Boom. And there is our Open WebUI. All
143:43 right. So I'll go ahead and get started. Uh, we'll have to create our account.
143:46 I'll do this off camera, but yeah, you just create your first-time accounts for
143:49 everything. And then we'll do the same thing for, let's do, supabase-yt.dynamous.ai.
143:55 And boom, there we go. So, all of our services are up and running. And so now
143:59 we can log into these and create our accounts and we can interact with our
144:03 agents and bring them in. We can work with Ollama in the same way. And so,
144:06 let's go ahead and do that. I'll just go ahead and create these accounts off
144:09 camera. So, I've got my accounts created for N8N and then also open WebUI. And
144:13 you can do the same for all the other accounts you might have to create for
144:16 things like Langfuse as well. And then within Open Web UI, we'll go to the
144:21 admin panel, settings, connections. Make sure that your Ollama API is set
144:25 correctly to reference the service ollama. Usually this will default to
144:30 localhost or host.docker.internal, so you have to change that there. You have to set
144:34 the OpenAI API key as well, just to any kind of random value. It's just a little
144:38 bug in open web UI. Then click on save and then you can go back and your models
144:41 will be loaded. Now, one thing I found with Open WebUI: after you change
144:46 the Ollama base URL, you have to do a full refresh of the website. Otherwise,
144:49 you'll get an error when you use the LLM. So, just a really small tangent
144:52 there, a little tidbit there, but yeah, we'll go ahead and select the model
144:55 that's pulled by default. And you can pull other ones into your Ollama
144:57 container as well, like we already covered. You don't have to restart
145:00 things. So, let's run a little test. I'll just say hello, and we'll see if we
145:04 can load the model now. And boom, look at how fast that was, because we have a
145:08 killer GPU instance right now. I could run much larger LLMs if I wanted to. Um,
145:14 so yeah, let's see. What did I just say? All right, we'll do another test here.
145:17 And yeah, look at how fast that is. It's blazing fast because everything is
145:20 running locally on the same infrastructure. There's no network
145:24 delays. And so we have a powerful GPU, no network delays. We get some blazing
145:28 fast responses from these LLMs right now. And so I don't want to go and test
2:25:32 Deploying Our Python AI Agent to the Cloud
145:34 everything with N8N again. But what I do want to show you how to do right now is
145:39 take the Python agent that we have in this repository and deploy this onto the
145:44 cloud as well. Adding it into the local AI Docker Compose stack just like we did
145:49 on our computer, but now hosting it all in the cloud. So that's the very last
145:52 thing that I want to cover with you for our cloud deployment. And so just like
145:56 with the local AI package, we can follow the instructions here in the readme to
145:59 get everything up and running on our machine. And so the first thing we have
146:03 to do is we need to clone our repository. And so I'm going to copy
146:07 this command, go back over into my terminal here for my instance. And I
146:11 step back one directory level by the way. So I'm now at the same place where
146:14 I have the local AI package so we can run them side by side. So I'll paste
146:19 this command to clone ottomator-agents. And then I can cd into it. And then I
146:22 also want to change my directory into the python-local-ai-agent folder specifically.
146:28 And so now, doing an ls -a, we can see the .env.example. So I'm going to, just like
146:33 we did before, copy this and turn it into a .env. And then I can do nano .env.
146:39 And there we can edit all of our environment variables. And so because
146:42 we're running this in a Docker container, attaching it to the local AI
146:46 stack, the way that I reference Ollama is going to be just calling out the service
146:51 name. So http://ollama:11434/v1. And then the API key is just that placeholder there, ollama. For the
146:58 LLM choice, if I want to get the exact ID of one that I already had pulled,
147:01 I'll actually show you how to do this really quick. So, I'm going to do
147:06 Ctrl+X, Y, Enter to save and exit. And then the way that you can exec into a
147:12 container is docker exec -it, then the name of our container, which
147:15 is ollama. We already have this running. And then /bin/bash. And so what this is going to do is now
147:22 instead of being within our machine, we are within our Ollama container. And so
147:28 now I can run the command ollama list, and then I can see the LLMs that I have
147:33 available to me. So I have Qwen 2.5 7B. So I'm going to go ahead and copy this
147:37 ID. I don't have it memorized. So this is my way to go and reference it really
147:41 quickly. And then this is also how you can access each of your containers when
147:44 you don't have Docker Desktop. You just do docker exec -it, the name of your
147:50 container, and then /bin/bash. So kind of like how we had that exec tab in Docker
147:53 desktop. And then once I'm done in here, I can just do exit. And now I'm back
147:58 within my host machine, my GPU droplet. So that's another little tidbit, another
148:02 golden nugget I wanted to give you there. But yeah, we'll go back into our
148:05 environment variables here. And I have that ID for Qwen 2.5 7B copied. So I'll
148:12 paste that in. And boom, there we go. And then for our Supabase URL, it's
148:17 going to be http:// and then it is kong. So I guess that I
148:21 should have been more clear on this when I set things up locally. So I'll be sure
148:24 to update the documentation for this, but it's going to be kong, port 8000,
148:29 because Kong is the service that we have in Supabase specifically for the
148:33 dashboard. And then the service key, well, I'm just going to go ahead and get
148:37 that from my local AI package because I have this set up in environment
148:40 variables. And so I just have to go and reference my environment variables here
148:46 to get my service role key. And boom, there we go. Okay. So I'm going to go
148:49 ahead and paste this in. And um now I'm just going to delete this instance
148:51 after. I don't really care that I'm exposing this right now. And then for
148:57 the SearXNG base URL, it's http://searxng:8080. And then for my bearer token, I
149:01 just have it set to test auth. And then we don't need to set the OpenAI API key
149:05 because that was just for the OpenAI compatible demo earlier. So that is all
149:09 of my configuration for this container. I got to be really clear on this. I'll
149:12 update the docs for this. But uh otherwise we are looking good for our
149:17 environment variables. So Ctrl+X, Y, Enter to save. You can do a cat .env just
149:22 for that sanity check to make sure that everything's saved. We are looking good.
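My resulting .env for the agent looks roughly like this. The variable names are guesses at the repo's conventions, but the values (the service names and ports inside the Docker network) are the ones from this section:

```shell
LLM_BASE_URL=http://ollama:11434/v1   # Ollama service, OpenAI-compatible path
LLM_API_KEY=ollama                    # placeholder - Ollama ignores the key
LLM_CHOICE=qwen2.5:7b                 # or the exact ID from `ollama list`
SUPABASE_URL=http://kong:8000         # Kong fronts the Supabase stack
SUPABASE_SERVICE_KEY=<service-role-key-from-the-package-.env>
SEARXNG_BASE_URL=http://searxng:8080
SEARXNG_BEARER_TOKEN=<your-bearer-token>
```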
149:27 All right. And so now within the readme here, so I'll go back to the
149:30 instructions. We did all this already. We changed our directory. We set up our
149:33 environment variables and configured them. Now we need to run this stuff in
149:38 our SQL editor in Superbase. This is how we can get our table set up because we
149:42 haven't run things with N8N first. So we don't have this table created already.
149:47 And so now I just have to sign into Supabase here. So I've got my username,
149:51 which is supabase. And then I'm just copying and pasting the username and
149:55 password that I have set here in my environment variables. And so
150:00 I'll go to my SQL editor and go back here. I know I'm moving kind of quick
150:03 here, but I got these instructions laid out in the readme. I'm going to paste
150:08 this like that and go ahead and click on run. And then boom, there we go. So now
150:12 if I go into tables and search for n8n, we have n8n_chat_histories, a new
150:16 currently empty table. All right, looking good. And then going back after
150:21 we do that, now we can run the agent. So I just have to take this command right
150:24 here and then I'll go back to my droplet. And the one thing that I
150:28 mentioned earlier, but I want to cover again: if I go into the, uh, hold on, I
150:33 need to change my directory back. So, ottomator-agents and then python-local-ai-
150:38 agent. If I go into my docker-compose file, you have to make sure that the include path
150:42 is correct. And so, I'm going to update this by the time you get your hands on
150:44 it here where it's just going to be going two levels back. That's what we
150:47 need to do. So, make sure that we reference the right path to the local AI
150:52 package on our machine, and then Ctrl+X, Y, Enter to save. That's because we have
150:57 to go back from python-local-ai-agent, then back from the ottomator-agents
151:00 directory, and then within that same directory, we have the local AI package.
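Sketched out, the directory layout the compose include expects looks like this (directory names are placeholders matching how the repos were cloned side by side):

```shell
# ~/ (home directory of the droplet user)
# |-- local-ai-package/              <- the local AI package repo
# `-- ottomator-agents/
#     `-- python-local-ai-agent/
#         `-- docker-compose.yml     <- include path goes two levels up,
#                                       then into local-ai-package/
```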
151:04 So, we're good to go. Now, I can go ahead and paste in the command here to
151:08 build our agent and include it in the local AI stack. And so, it's going to
151:11 have to build everything. Takes a minute like we saw already. So, I'll pause and
151:15 come back once this is done. And all right, there we go. About 30 seconds
151:18 later and we are good to go. So, now I can do the docker ps -a again. And this
151:23 time if I look through this list very carefully, take a little bit here, I
151:28 should be able to see my Python agent. There we go. Local AI, Python, local AI
151:32 agent. And it starts with local AI because it is a part of that docker
151:36 compose stack. And by the way, I can do docker exec -it and then python-local-ai-
151:46 agent /bin/bash. I can run this as well. And then what I can do is if I do
151:51 a printenv command, I can see all the environment variables that are set
151:54 within this container. That's everything that we set up in the .env. So I'm being
151:59 very comprehensive with this master class, showing you how you can tinker
152:02 around with different things like accessing your containers and seeing the
152:05 environment variables, making sure that everything that we specified in the .env is
152:09 actually taking effect here. And sure enough, it is. So we are looking good.
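What printenv shows is exactly what the agent reads at startup. A minimal sketch of that, using hypothetical variable names rather than the package's exact keys:

```python
import os

# Simulate two variables the container would receive from the .env file
# (these names are illustrative examples, not the package's actual keys):
os.environ.setdefault("LLM_BASE_URL", "http://ollama:11434")
os.environ.setdefault("AGENT_PORT", "8055")

# Inside the container, the agent reads them the same way printenv lists them:
base_url = os.environ["LLM_BASE_URL"]
port = int(os.environ.get("AGENT_PORT", "8055"))  # fall back to the default port
print(base_url, port)
```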
2:32:12 Testing Our Full Agent Setup
152:13 So I'll go ahead and exit. We're back in our root machine now. We have our
152:18 container up and running and also it's running on port 8055. And so now we can
152:23 go back to Open WebUI at our subdomain and we can set up our pipe. And so I'm
152:30 going to go to the admin panel functions. We don't have a function
152:33 here. So I have to import it. And so what I can do, I'll actually do this
152:36 here: you can literally Google n8n pipe Open WebUI. And it'll
152:40 bring you to the one that I have here. You just have to sign into open web UI.
152:44 I'll click on get. And then this time for my URL instead of being something on
152:49 local host, I'm going to copy my actual subdomain here. So import to open web UI
152:53 and then boom, we have our pipe. So I'll click on save, confirm, and then within
152:59 the valves here, I can set all of my values. So I'll just click on default
153:02 for all of these so I can get a starting point here. And then, yeah, chat input is
153:07 good, output is good, the bearer token is my test token, and then for my URL it's going
153:14 to be http:// and then the name of my service, python-local-ai-agent, on
153:24 port 8055, and then /invoke-python-agent. I believe I
153:32 have this memorized. I think we are good there. So, going back, if I clear this
153:37 and run a docker ps -a, it is indeed called python-local-ai-agent. That is
153:42 the name of our service, so Open WebUI is able to connect to the agent directly
153:46 with this name because we are deploying it in the same Docker network. And so I
153:51 think we are looking good. All right, so I'm going to go ahead and click on save.
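A small sketch of how that URL is assembled. The service name here is my reconstruction of the compose service from the walkthrough:

```python
# On a shared Docker Compose network, containers resolve each other by
# service name, so no localhost or public domain is needed.
SERVICE = "python-local-ai-agent"  # assumed compose service name
PORT = 8055                        # port the agent listens on
ROUTE = "invoke-python-agent"      # API route from the walkthrough

url = f"http://{SERVICE}:{PORT}/{ROUTE}"
print(url)  # http://python-local-ai-agent:8055/invoke-python-agent
```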
153:56 all right and then go back and start a new chat and then also like I said a lot
154:00 of times it helps just to refresh Open WebUI completely.
154:03 All right, there we go. And then now... oh, I have to
154:07 actually enable it. Let me go back to the admin panel. Functions: you have to make
154:11 sure this is ticked on, so that we have the pipe enabled. Now going back
154:15 here and I'll refresh as well. We've got our pipe selected. And now I can say
154:20 hello. And there we go. Super super fast. We got a response from our Python
154:26 agent. Take a look at that. And then also going into my database here, you
154:29 can see that I have all these messages in the n8n_chat_histories table. We'll
154:33 come back to that. All right. And then we can also ask it to do web search. I
154:38 can say, what is the latest LLM from Anthropic, for example. So it has to
154:44 do a quick SearXNG search, leveraging that. The latest is Claude Opus 4. All
154:49 right. And man that was so fast. We have no network delays now because everything
154:53 is running on the same network and we have an absolute killer GPU. So this is
154:56 so cool. Also, one thing that I want to mention is sometimes depending on your
155:02 cloud provider, SearXNG will not start successfully. There's one thing you have
155:05 to do. It's just a really small tidbit. If you run into this issue where the
155:09 SearXNG container is constantly restarting, what you want to do is go to
155:14 your local AI package and then run the command chmod 755 searxng. That's the
155:22 SearXNG folder. And so the searxng folder is responsible for storing the
155:26 configuration that we have for SearXNG by default. Sometimes you don't have
155:29 permissions to write this file and it needs to do so. So I'm going to update
155:33 the troubleshooting to include this. But yeah, just a small tidbit. And then you
155:37 can just go ahead and run the command to start everything again. Obviously,
155:41 you have to go back one directory, then you can run this and restart
155:45 everything. It's that easy to restart things to make changes take effect for your
155:49 package and then you'll be good to go. So yeah, we have everything working
155:54 here. So this is pretty much it for the master class. Now we have our local AI
155:58 package up and running with an agent and the network as well. We're communicating
156:02 to it within Open Web UI directly. There is so much that we have gotten through
156:07 now. So congratulations for making it this far. All right, I'm going to be totally
2:36:08 Additional Resources
33:08 Downloading Quantized LLMs in Ollama
33:11 practical for you, I'm back here in the model list for Qwen 3. We have all these
33:15 models that don't specify a quantization, but we can see that it
33:20 defaults to Q4 because if I click on any one of them, the quantization right here
33:26 is Q4_K_M. And don't worry about the K_M. That's just a way to group
33:30 parameters. You have K_S, K_M, and K_L. It's kind of outside of the scope of
33:33 what really matters for you. The big thing is the Q4 like the actual number
33:38 here. So Q4 quantization is the default for Qwen 3 32B and really any model in
33:44 Ollama. But if we want to see the other quantized variants and we want to run
33:48 them, you can click on the view all. This is available no matter the LLM that
33:52 you're seeing in Ollama. Now we can scroll through and see all the levels of
33:56 quantization for each of the parameter sizes for Qwen 3. So, if I scroll all
34:01 the way down, the absolute biggest version of Qwen 3 that I can run is the
34:08 full 16-bit of the 235 billion parameter model. And it is a whopping 470 GB just
34:14 to install this. And there is no way that you're ever going to lay hands on
34:17 infrastructure to run this unless you're working for a very large enterprise. But
34:22 I can go down here, let's say, to 14 billion parameters and I can run the Q4
34:27 like this. So, you can click on any one that you want to run. Like let's say I
34:30 want to run Q8. I can click on this and then I have the command to pull and run
34:35 this specific quantization of the 14 billion parameter model. So each of the
34:39 quantized variants has a unique ID within Ollama. So you can very
34:42 specifically choose the one that you want. Again my general recommendation is
34:47 just to go with what Ollama also recommends, which is just defaulting to
34:51 Q4. Like if I go to DeepSeek R1, you can see that also defaults to Q4 no matter
34:56 the size that I pick. But if you do want to explore different quantizations, you
35:00 want to try to run the absolute full model for maybe something smaller like 7
35:04 billion or 14 billion, you can definitely do that through a lama and
35:08 really any other provider of local LLMs. So that is everything for quantization.
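The arithmetic behind those download sizes is simple: parameters times bits per weight, divided by 8 bits per byte. A quick sketch (approximate; real model files add a little overhead):

```python
def model_size_gb(params_billion: float, bits: int) -> float:
    """Approximate model size in GB: parameters x bits per weight / 8 bits per byte."""
    return params_billion * bits / 8

# Full 16-bit, 235 billion parameters: matches the ~470 GB figure above.
print(model_size_gb(235, 16))  # 470.0

# Q4 stores each weight in ~4 bits, so a 14B model shrinks to about 7 GB.
print(model_size_gb(14, 4))    # 7.0
```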
35:12 It's important to know how that works, but yes, generally stick with a Q4 of
35:16 Offloading (Splitting LLMs between GPU + CPU)
35:17 the largest LLM that you can run. The next concept that is very important to
35:21 understand is offloading. Offloading is just splitting the layers of your
35:26 large language model between your GPU and your CPU and RAM. It's kind of
35:30 crazy, but large language models don't have to fit entirely in your GPU. All
35:36 large language models can be split into layers, layers of the different weights,
35:40 and you can have some of it running on your GPU. So, it's stored in your VRAM
35:45 and computed by the GPU. And then some of the large language model stored in
35:50 your RAM, computed by the CPU. Now, this does hurt performance a lot. And so,
35:55 generally, you want to avoid offloading if you can. You want to be able to fit
35:59 everything in your GPU, which by the way, the context, like your prompts for
36:04 your local LLMs, that is also stored in VRAM. And so, sometimes you'll see what
36:08 happens when you have very long conversations for a large language model
36:13 that barely fit in your GPU. That'll actually tip it over the edge. So, it
36:16 starts to offload some of it to the CPU and RAM. So keep that in mind when you
36:19 have longer conversations and all of a sudden things get really slow, you know
36:24 that offloading is happening. Sometimes this is necessary though as context
36:28 grows. And if you're only offloading a little bit of the LLM or a little bit of
36:32 the conversation, whatever to the CPU and RAM, it won't affect performance
36:36 that much. And so sometimes if you're trying to squeeze the biggest size you
36:41 can into your machine for an LLM, you can take advantage of offloading to run
36:45 something bigger or have a much larger conversation. Just know that usually it
36:49 kind of sucks. Like when I have offloading start to happen, my machine
36:53 gets bogged down and the responses are a lot slower. It's really not fun, but it
36:59 is possible. And fun fact, by the way, if your GPU is full and your CPU and RAM
37:04 is full, you can actually offload to storage, like literally using your hard
37:07 drive or SSD. That's when it's like incredibly slow and just terrible. But
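The offloading tradeoff can be sketched as a rough calculation. This is a simplification: it ignores the VRAM your context (the KV cache) also consumes, which is exactly why a long conversation can tip a borderline model into offloading:

```python
def gpu_fraction(model_gb: float, vram_gb: float) -> float:
    """Rough fraction of the model's layers that fit in VRAM; the remainder
    is offloaded to system RAM (or, worst case, to disk)."""
    return min(1.0, vram_gb / model_gb)

# A ~20 GB model on a 24 GB card fits entirely, so no offloading:
print(gpu_fraction(20, 24))  # 1.0

# A ~40 GB model on the same card offloads about 40% to CPU and RAM:
print(round(1 - gpu_fraction(40, 24), 2))  # 0.4
```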
37:11 Critical Ollama Configuration
3:23 Dynamous AI Agent Mastery
3:26 interested in mastering more than just local AI, but building entire AI agents
3:31 in a local AI environment or even with cloud AI, definitely check out
3:35 dynamous.ai. This is my community for early AI adopters just like yourself.
3:40 And a big part of this community is the AI agent mastery course where I dive
3:45 super deep into my full process for building AI agents. I'm talking planning
3:50 and prototyping and coding and using AI coding assistants and building full
3:55 frontends for our AI agents and securing things and deploying things. There's a
3:59 lot more coming soon for this course as well. I'm very actively working on it.
4:03 And a big part of this course is the complete agent that I build throughout
4:07 it. I build both with cloud AI and local AI. And so this master class will help
4:11 you get very very comfortable with local AI. But when it comes to building
4:15 complex agents and really getting deep into building out agents, then you
4:19 definitely want to check out the AI agent mastery course here in Dynamous.ai.
4:23 So with that, let's get back to the master class, diving into what local AI
4:28 is all about. Let's start by laying the foundation. What is local AI in the
4:32 first place? Well, very simply put, local AI is running your own large
4:36 language models and infrastructure like your database and your UI entirely on
4:42 your own machine 100% offline. So when you think about when you typically want
4:46 to build an AI agent, you need a large language model, maybe like GPT-4.1
4:51 or Claude 4, and then you need something like your database, like Supabase, and
4:55 you need a way to create a user interface. You have all these different
4:58 components for your agent and typically you're using APIs to access things that
5:03 are hosted on your behalf. But with local AI, we can take all of these
5:07 things completely in our own control running them ourselves. So this is
5:11 possible through open-source large language models and software. So
5:15 everything is running on your own hardware instead of you paying for APIs.
5:20 So we run the large language model ourselves on our own machine instead of
5:25 paying for the OpenAI API for example. And so for large language models, there
5:29 are thousands of different open source large language models available for us
5:32 to use in a lot of different ways. And some of these you've probably heard of
5:37 before, like DeepSeek R1, Qwen 3, Mistral 3.1, Llama 4. These are just a
5:41 few examples of the most popular ones that you've probably heard of
5:44 before. We'll be tinkering around with using some of these in this master
5:48 class. And then we also have open-source software. So all of our infrastructure
5:53 that goes along with our agents and LLMs, things like Ollama for running our
5:58 LLMs, Supabase for our database, n8n for our no/low-code workflow automations,
6:04 and open web UI to have a nice user interface to talk to our agents and
6:07 LLMs. And we'll dive into using all of these as well. Now, because local AI
6:12 means running large language models on our own computer, it's not as easy as
6:18 just going to claude.ai or chatgpt.com and typing in a prompt. We have to
6:22 actually install something, but it still is very easy to get started. So, let me
6:25 show you right now with a hands-on example. So, here we are within the
6:30 website for Ollama. This is ollama.com. I'll have a link to this in the
6:33 description of the video. This is one of the open-source platforms that allows
6:39 us to very easily download and run local large language models. And so, you just
6:42 have to go to their homepage here and click on this nice big download button.
6:45 You can install it for Windows, Mac or Linux. It really works for any operating
6:49 system. Then once you have it up and running on your machine, you can open up
6:53 any terminal. Like I'm on Windows here, so I'm in a PowerShell session and I can
6:58 run Ollama commands now to do things like view the models that I have available on
7:03 my machine. I can download models and I can run them as well. And the way that I
7:08 know how to pull and run specific models is I can just go to this models tab in
7:12 their navigation and I can browse and filter through all of the open source
7:16 LLMs that are available to me like DeepSeek R1. Almost everyone is familiar
7:20 with DeepSeek. It just totally blew up back in February and March. We have
7:25 Gemma 3, Qwen 3, Llama 4, a few of them that I mentioned earlier when we had the
7:30 presentation up. And so we can click into any one of these like I can go into
7:35 DeepSeek R1 for example and then I have the command right here that I can copy
7:39 to download and run this specific model in my terminal. And there are a lot of
7:44 different model variants of DeepSeek R1. So we'll get into different sizes and
7:47 hardware requirements and what that all means in a little bit, but I'll just
7:50 take one of them and run it as an example. So I'll just do a really small
7:54 one right now. I'll do a 1.5 billion parameter large language model. And
7:58 again, I'll explain what that means in a little bit. I can copy this command.
8:01 It's just ollama run and then the unique ID of this large language model. So I'll
8:06 go back into my terminal. I'll clear it here and then paste in this command. And
8:09 so first it's going to have to pull this large language model. And the total size
8:15 for this is 1.1 GB. And so it'll have to download it. And then because I used the
8:20 run command, it will immediately get me into a chat interface with the model
8:24 once it's downloaded. Also, if you don't want to run it right away, you
8:27 just want to install it, you can do ollama pull instead of ollama run. And
8:32 then again, to view the models that you have available to you installed already,
8:36 you can just do the ollama list command like I did earlier. And so, right now,
8:39 I'll pause and come back once it's installed in about 30 seconds. All
8:43 right, it is now installed. And now I can just send in a message like hello.
8:47 And then boom, we are now talking to a large language model. But instead of it
8:50 being hosted somewhere else and we're just using a website, this is running on
8:54 my own infrastructure, the large language model and all the billions of
8:59 parameters are getting loaded onto my graphics card and running the inference.
9:02 That's what it's called when we're generating a response from the LLM
9:06 directly within this terminal here. And so I can ask another question like:
9:13 what is the best GPU right now? We'll see what it says. So it's thinking
9:16 first. This is actually a thinking model. DeepSeek R1 is a reasoning LLM.
9:20 And then it gives us an answer. Its top GPU models today: RTX 3080, RX 6700.
9:27 Obviously, we have a training cutoff for local large language models just like we
9:30 do with ones in the cloud like GPT. And so the information is a little outdated
9:34 here, but yeah, this is a good answer. So we have a large language model that
9:38 we're talking to directly on our machine. And then to close out of this,
9:42 I can just do control D or command D on Mac. And if I do list, we have all the
9:47 other models that you saw earlier, plus now this one that I just installed. So
9:50 these are all available for me to run again just with that ollama run command.
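As a preview of using Ollama from code rather than the terminal: it serves a local REST API on port 11434 by default. A minimal sketch using only the standard library (assumes Ollama is running and the demo model is already pulled; the helper names are mine):

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(model: str, prompt: str) -> bytes:
    # stream=False asks for one JSON object instead of a token-by-token stream
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(model: str, prompt: str) -> str:
    req = request.Request(OLLAMA_URL, data=build_payload(model, prompt),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:  # requires `ollama serve` to be running
        return json.loads(resp.read())["response"]

# Usage (only works with Ollama running locally):
# print(generate("deepseek-r1:1.5b", "hello"))
```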
9:54 And it won't have to reinstall if you already have it installed. Run just
9:58 installs it if you don't have it yet already. So that is just a quick demo of
10:03 using Ollama. We'll dive a lot more into Ollama later, like how we can actually
10:07 use it within our Python code and within our n8n workflows. This is just our
10:11 quick way to try it out within the terminal. Now, to really get into why we
10:15 should care about local AI now that we know what it is, I want to cover the
10:19 pros and cons of local AI and what I like to call cloud AI. That's just when
10:23 you're paying for things to be hosted for you, like using Claude or Gemini or
10:28 using the cloud version of N8N instead of hosting it yourself. And I also want
10:32 to cover the advantages of each because I don't want to sugarcoat things and
10:35 just hype up this master class by telling you that you should always use
10:39 local AI. That is certainly not the case. There is a time and place for both
10:43 of these categories here, but there are so many use cases where local AI is
10:50 absolutely crucial. You have no idea how many businesses I have talked to that
10:54 are willing to put tens of thousands of dollars into running their own LLMs and
10:58 infrastructure because privacy and security is so crucial for the things
11:02 that they're building with AI. And that actually gets into the first advantage
11:06 here of local AI, which is privacy and security. You can run things 100%
11:11 offline. The data that you're giving to your LLMs as prompts, it now doesn't
11:16 leave your hardware. It stays entirely within your own control. And for a lot
11:20 of businesses, that is 100% crucial, especially when they're in highly
11:24 regulated industries like healthcare, finance, even real estate.
11:28 Like there's so many use cases where you're working with intellectual
11:32 property or just really sensitive information. You don't want to be
11:36 sending your data off to an LLM provider like Google or OpenAI or Anthropic. And
11:41 so as a business owner, you should definitely be paying attention to this
11:45 if you are working with automation use cases where you're dealing with any kind
11:48 of sensitive data. And then also if you're a freelancer, you're starting an
11:52 AI automation agency, anything where you're building for other businesses,
11:55 you are going to have so many opportunities open up to you when you're
11:59 able to work with local AI because you can handle those use cases now where
12:03 they need to work with sensitive data and you can't just go and use the OpenAI
12:08 API. And that is the main advantage of local AI. It is a very big deal. But
12:11 there are a few other things that are worth focusing on as well. Starting with
12:16 model fine-tuning, you can take any open-source large language model and
12:20 add additional training on top with your own data. Basically making it a domain
12:24 expert on your business or the problem that you are solving. It's so so
12:29 powerful. You can make models through fine-tuning more powerful than the best
12:33 of the best in the cloud depending on what you are able to fine-tune with
12:38 depending on the data that you have. And you can do fine-tuning with some cloud
12:42 models like with GPT, but your options are pretty limited and it can be quite
12:46 expensive. And so it definitely is a huge advantage to local AI. And local AI
12:52 in general can be very cost-effective, including the infrastructure as well. So
12:56 your LLMs and your infrastructure. You run it all yourself and you pay for
13:00 nothing besides the electricity bill if it's running on your computer at your
13:04 house or if you have some private server in the cloud. You just have to pay for
13:07 that server and that's it. There's no n8n bill, no Supabase bill, no OpenAI
13:13 bill. You can save a lot of money. It's really, really nice. And on top of that,
13:17 when everything is running on your own infrastructure, the agents that you
13:22 create can run on the same server, the same place as your infrastructure. And
13:26 so it can actually be faster because you don't have network delays calling APIs
13:30 for all your different services for your LLMs and your database and things like
13:34 that. And then with that, we can now get into the advantages of cloud AI.
13:38 Starting with it's a lot easier to set up. There's a reason why I have to have
13:42 this master class for you in the first place. There are some initial hurdles
13:47 that we have to jump over to really have everything fully set up for our local
13:51 LLMs and infrastructure. And you just don't have that with cloud AI because
13:54 you can very simply call into these APIs. You just have to sign up and get
13:58 an API key and that's about it. So, it certainly is easier to get up and
14:02 running and there's less maintenance overall because they are hosting things
14:05 for you. Supabase is hosting the database for you. OpenAI is hosting the
14:09 LLM for you. So, you don't have to manage things on your own hardware. With
14:13 Local AI, you have to apply patches and updates if you have a private server in
14:17 the cloud. You have to manage your own hardware if you're running on your own
14:20 computer, making sure that it's on 24/7, if you want your database on 24/7, that
14:24 kind of thing. It's just less maintenance with cloud AI. And then
14:28 probably the biggest advantage of cloud AI overall is that you have better
14:34 models available to you. Claude 4 Sonnet or Opus, for example, is more powerful
14:40 than any local AI that you could run. So we have this gap here and this gap was a
14:45 lot bigger at one point, even a year ago. The best cloud LLMs absolutely crushed
14:50 the best local LLMs, and that gap is starting to diminish. And so I really
14:55 see a future where that gap is diminished entirely and all the best
14:59 local LLMs are actually on par with the best cloud ones. That's the future I
15:03 see. That's why I think that local AI is such
15:07 a big deal because the advantages of local AI, those are just going to get
15:12 more prevalent over time when businesses realize they really want private and
15:15 secure solutions. And then the advantages of cloud AI, I think those
15:19 are actually going to diminish over time. Think about it: minimal setup,
15:24 less maintenance. Well, those advantages are going to go away as we have
15:28 platforms and better instructions and solutions to make the setup and
15:32 maintenance easier for local AI and we have the gap that's continuing to
15:35 diminish between the power of these LLMs. All these advantages are going to
15:39 actually go away and then it'll just completely make sense to use local AI
15:43 honestly probably for like every single solution in the future. That's really
15:47 what I see us heading towards. And then the last advantage to cloud AI which
15:51 also I think will go away over time is that you have some features out of the
15:55 box like you have memory that's built directly into chat GPT. Gemini has web
15:59 search baked in even when you use it through the API like these kind of
16:02 capabilities that are out of the box that you have to implement yourself with
16:07 local AI maybe as tools for your agent and you can definitely do that but it is
16:10 nice that these things are out of the box for cloud AI. So those are the pros
16:15 and cons between the two. I hope that this makes it very clear for you to pick
16:19 right now for your own use case. Should I implement local AI or cloud AI? A lot
16:24 of it comes down to the security and privacy requirements for your use case.
16:28 Now, the next big thing that we need to talk about for local AI is hardware
16:33 requirements. Cuz here's the thing, large language models are very resource
16:39 intensive. You can't just run any LLM on any computer. And the reason for that is
16:43 large language models are made up of billions or even trillions of numbers
16:47 called parameters. And they're all connected together in a web that looks
16:51 kind of like this. This is a very simplified view with just a few
16:54 parameters here. But each of the parameters are nodes and they're
16:58 connected together. The input layer is where our prompt comes in and our prompt
17:02 is fed through all these hidden layers and then we have the output at the end.
17:05 This is the response we get back from the LLM. But like I said, this is a very
17:10 simplified view. GPT-4, for example, like you can see on the right hand side, is
17:15 estimated to have 1.4 trillion parameters. And so, if you want to fit
17:20 an entire large language model into your graphics card, you have to store all of
17:25 these numbers. And even though we can handle gigabytes at a time in our
17:28 graphics cards through what is called VRAM, storing billions or trillions of
17:34 numbers is absolutely insane. And so that's why large language models, you
17:37 actually have to have a pretty good graphics card if you want to run some of
17:42 the best ones. And so looking at Olama here, when we see these different sizes,
17:47 going back to their model list, like 1.5 billion parameters or 27 billion
17:51 parameters, there are different sizes for the local LLMs. Obviously, the
17:56 larger a local LLM that you are running, the more performance you are going to
17:59 get, but you are going to be limited to what you are capable of running with
18:04 your graphics card or your hardware. So, with that in mind, I now want to dive
18:07 into the nitty-gritty details with you so you know exactly the kind of models
18:11 that you can run, the kind of speeds you can expect depending on your hardware.
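To make "billions of parameters" concrete before the size rundown, here's a toy count for a tiny fully-connected net. Real LLMs use transformer layers, but the counting principle, that every connection is a stored number, is the same:

```python
def count_params(layer_sizes):
    # Each layer contributes (inputs x outputs) weights plus one bias per output.
    return sum(i * o + o for i, o in zip(layer_sizes, layer_sizes[1:]))

# A toy net: 4 inputs -> two hidden layers of 8 -> 2 outputs.
print(count_params([4, 8, 8, 2]))  # 130

# GPT-4 is estimated at ~1.4 trillion parameters, roughly ten billion times
# this toy net, and every one of those numbers has to live in memory to run.
```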
18:15 And if you want to invest in new hardware to run local AI, I've got some
18:19 recommendations as well. So there are generally four primary size ranges for
18:25 large language models based on the speed and the power that you are looking for.
18:29 You have models that are around seven or 8 billion parameters. Those are
18:33 generally the smallest that I'd recommend trying to run. There are a lot
18:37 of smaller LLMs available like 1 billion parameters or three billion parameters,
18:41 but I'm so unimpressed when I use those LLMs that I don't really want to focus
18:45 on them here. 7 billion parameters is still tiny compared to the large cloud
18:51 AI models like Claude or GPT, but you can get pretty good results with them
18:55 for just simple chat use cases. And so for these models, assuming a Q4
18:59 quantization, which I'll get into quantization in a little bit, it's
19:02 basically just a way to make the LLM a lot smaller without hurting performance
19:06 that much, a 7 billion parameter model will need about four to 5 GB of VRAM on
19:11 your graphics card. And so if you have something like a 3060 Ti from Nvidia
19:16 with 8 GB of VRAM, you can very comfortably run a 7 billion parameter
19:20 model and you can expect to get very roughly around 25 to 35 tokens per
19:27 second. A token is roughly equivalent to a word. And so your local large language
19:31 model at 7 billion parameters with this graphics card will get about 25 to 35
19:37 words per second out on the screen to you being streamed out. And then if you
19:42 use much more powerful hardware like a 3090 to run a 7 billion parameter model
19:47 then you'll just jack up the speed a lot more. So that's 7 billion or 8 billion
19:51 parameters. Another very common size is something around 14 billion parameters.
19:57 This will take about 8 to 10 GB of VRAM. And so just a couple of options for
20:01 this. You have the 4070 Ti, which is usually 16 GB of VRAM, or you could go as
20:08 low as 12 GB of VRAM with the 3080 Ti. And you could expect to get about 15 to
20:13 25 words per second. And then this is where you start to get into basic tool
20:17 calling. So I find that when you are building with a 7 billion parameter
20:21 model, they don't do tool calling very well. So you can't really build that
20:25 powerful of agents around a 7 billion parameter model. But once you get to
20:29 something around 14 billion parameters, that's when I see agents being able to
20:33 really accept instructions well around tools and system prompts and leveraging
20:37 tools to do things on our behalf. That's when we can really start to use LLMs to
20:42 make things that are agentic. And then the next big category of LLMs
20:47 is somewhere between 30 and 34 billion parameters. You see a lot of LLMs that
20:51 fall in that size range. This will typically need 16 to 20 GB of VRAM.
20:57 And so a 3090 is a really good example of a graphics card that can run this. It
21:03 has 24 GB of VRAM. I actually have two 3090s myself. And I'll have a link to my
21:08 exact PC that I built for running local AI in the description of this video. So
21:12 I have two 3090s, which we'll need in a second for a 70 billion parameter, but
21:16 one is enough for a 32 billion parameter model. And then also Macs with their new
21:22 M4 chips are very powerful with their unified memory architecture. So if you
21:27 get a Mac M4 Pro with 24 GB of unified memory, you can also run 32 billion
21:32 parameter models. Now the speed isn't going to be the best necessarily, and
21:36 again, this does depend a lot on your computer overall, but you can expect
21:40 something around 10 to 15, maybe up to 20 tokens per second. And 32 billion
21:46 parameters is when you really start to see LLMs that are actually pretty
21:50 impressive. Like 7 billion and 14 billion, they are disappointing quite a
21:54 bit. I'll be totally honest. Especially when you try to use them with more
21:58 complicated agentic tasks. 32 billion when you start to get into this range is
22:02 when I'm actually genuinely impressed. I'm like, "Oh, this is actually pretty
22:05 close to the performance of some of the best cloud AI." And then 70 billion
22:10 parameters. This is going to take about 35 to 40 GB of VRAM. For most consumer
22:19 GPUs like 3090s and 4090s, even 5090s, that's not actually enough VRAM. And so
22:22 this is when you have to start to split a large language model across multiple
22:28 GPUs, which solutions like Ollama will actually help you do right out of
22:31 the box. So it's not this insane setup even though it might feel kind of
22:35 daunting like oh I have to split the layers of my LLM between GPUs. It's not
22:38 actually that complicated. And so two 3090s or two 4090s, that will be necessary.
22:44 Or you could have more of an enterprise-grade GPU like an H100. So
22:49 Nvidia has a lot of these non-consumer grade GPUs that have a lot more VRAM to
22:53 handle things like 70 billion parameter models. And the speed won't be the best
22:58 if you're using something like two 3090s, especially because performance is hurt
23:01 when you have to split an LLM between GPUs. You could expect something like 8
23:06 to 12 tokens per second. And this is obviously for when you have the most complex
23:09 agents and you're really trying to match the performance of cloud AI as
23:13 much as possible; that's when you'd want to use a 70 billion parameter model. And
23:16 then if you're investing in hardware to run local AI, I have a couple of quick
23:20 recommendations here. And a lot of this depends on the size of the model that's
23:24 going to be good enough for your use case. And so I'll dive into some
23:28 alternatives for running local AI directly if you want to do testing
23:32 before you buy infrastructure. I'll get into that in a little bit, but
23:35 recommended builds. If you want to spend around $800 to build a PC, I'd recommend
23:41 getting a 4060Ti graphics card and then 32 GB of RAM. If you want to spend
23:47 $2,000, I'd recommend either getting a PC with a 3090 and 64 GB of RAM or
23:53 getting that Mac M4 Pro with 24 GB of unified memory. And then lastly, if you
23:58 want to spend $4,000, which is about what I spent for my PC, then I'd
24:02 recommend getting two 3090 graphics cards, and I got both of mine used for
24:07 around $700 each. Um, and then also getting 128 GB of RAM, or you can get a
24:15 Mac M4 Max with 64 GB of unified memory. So, I wanted to really get into the
24:19 nitty-gritty details there. So, I know I spent a good amount of time diving into
24:22 super specific numbers, but I hope this is really helpful for you. No matter the
24:26 large language model or your hardware, you now know generally where you're at
24:30 for what you can run. So, to go along with that information overload, I want
24:34 to give you some specifics, individual LLMs that you can try right now based on
24:38 the size range that you know will work for your hardware. So, just a couple of
24:42 recommendations here. The first one that I want to focus on is DeepSeek R1. This
24:47 is the most popular local LLM ever. It completely blew up a few months ago. And
24:52 the best part about DeepSeek R1 is they have an option that fits into each of
24:56 the size ranges that I just covered in that chart. So they have a 7 billion
25:00 parameter, 14, 32, and 70. The exact numbers that I mentioned earlier. And
25:05 then there is also the full real version of R1, which is 671 billion parameters.
25:10 I'm sorry though, you probably don't have the hardware to run that unless
25:13 you're spending tens of thousands on your infrastructure. So, probably stick
25:16 with one of these based on your graphics card or if you have a Mac computer, pick
25:19 the one that'll work for you and just try it out. You can click on any one of
25:23 these sizes here. And then here's your command to download and run it. And this
25:28 is defaulting to a Q4 quantization, which is what I was assuming in the
25:31 chart earlier. And again, I will cover what that actually means in a little bit
25:35 here. The other one that I want to focus on here is Qwen 3. This is a lot newer.
25:41 Qwen 3 is so good. And they don't have a 70 billion parameter option, but they do
25:45 have all the other sizes that fit into those ranges that I mentioned
25:49 earlier. Like they've got 8 billion, 14 billion, and 32 billion parameters. And
25:52 the same kind of deal where you click on the size that you want and you've got
25:55 your command to install it here. And this is a reasoning LLM just like
26:01 DeepSeek R1. And then the other one that I want to mention here is Mistral Small.
26:05 I've had really good results with this as well. There are less options here,
26:08 but you've got 22 or 24 billion parameters, which is going to work well
26:12 with a 3090 graphics card or if you have a Mac M4 Pro with 24 GB of unified
26:18 memory. Really, really good model. And then also, there is a version of it that
26:22 is fine-tuned for coding specifically, called Devstral, which is another
26:26 really cool LLM worth checking out as well if you have the hardware to run it.
26:30 So, that is everything for just general recommendations for local LLMs to try
26:34 right now. This is the part of the master class that is going to become
26:38 outdated the fastest because there are new local LLMs coming out every single
26:42 month. I don't really know how long my recommendations will last. But in
26:45 general, you can just go to the model list in Ollama, search through it,
26:49 find one that has a size that works with your graphics card, and just give it
26:52 a shot. You can install it and run it very easily with Ollama. And the other
26:57 thing that I want to mention here is you don't always have to run open-source
27:01 large language models yourself. You can use a platform like OpenRouter. You can
27:05 just go to openrouter.ai, sign up, add in some API credits, and try these
27:10 open-source LLMs yourself. Maybe if you want to see what's powerful enough for
27:15 your agents before you invest in hardware to actually run them yourself.
27:18 And so within OpenRouter, I can just search for Qwen here. And I can go down
27:23 to Qwen and I can go to 32 billion. They have a free offering as well that
27:26 doesn't have the best rate limits. So I'll just go to this one right here,
27:31 Qwen 3 32B. So I can try the model out through OpenRouter. They actually host
27:35 it for me. So it's an open-source, non-local version, but now I can try it
27:39 in my agents to see if this is good. And then if it's good, it's like, okay, now
27:43 I want to buy a 3090 graphics card so that I can install it directly through
27:47 Ollama instead. And so the 32 billion Qwen 3 is exactly what we're seeing here
27:51 in OpenRouter. And there are other platforms like Groq as well where you
27:55 can run these open-source large language models not on your own infrastructure
27:58 if you just want to do some testing beforehand or whatever that might
28:01 be. So I wanted to call that out as an alternative as well. But yeah, that's
28:04 everything for my general recommendations for LLMs to try and use
28:09 in your agents. All right, it is time to take a quick breather. This is
28:12 everything that we've covered already in our master class: what is local AI, why
28:17 we care about it, why it's the future, and hardware requirements. And I really
28:20 wanted to dive deep into this stuff because it sets the stage for everything
28:24 that we do when we actually build agents and deploy our infrastructure. And so
28:28 the last thing that I want to do with you before we really start to get into
28:33 building agents and setting up our package is I want to talk about some of
28:37 the tricky stuff that is usually pretty daunting for anyone getting into local
28:42 AI. I'm talking things like offloading models, quantization, environment
28:47 variables to handle things like flash attention, all the stuff that is really
28:51 important that I want to break down simply for you so you can feel confident
28:55 that you have everything set up right, that you know what goes into using local
29:00 LLMs. The first big concept to focus on here is quantization. And this is
29:04 crucial. It's how we can make large language models a lot smaller so they
29:10 can fit on our GPUs without hurting performance too much. We are lowering
29:14 the model precision here. And what that basically means is we have
29:18 each of our parameters, all of the numbers for our LLMs, at 16 bits
29:23 at the full size, but we can lower the precision of each of those parameters to
29:28 8, 4, or 2 bits. Don't worry if you don't understand the technicalities of
29:31 that. Basically, it comes down to LLMs are just billions of numbers. That's the
29:35 parameters that we already covered. And we can make these numbers less precise
29:40 or smaller without losing much performance. So, we can fit larger LLMs
29:45 within a GPU that normally wouldn't even be close to running the full-size model.
29:50 Like with 32 billion parameter LLMs, for example, I was assuming a Q4
29:55 quantization, like four bits per parameter, in that diagram earlier. If you had the
30:00 full 16-bit parameters for the 32 billion parameter LLM, there's no way it could
30:06 fit on your Mac or your 3090 GPU, but we can use quantization to make it
30:10 possible. It's like rounding a number that has a long decimal to something
30:15 like 10.44 instead of something that has like 10 decimal places, but we're
30:19 doing it for each of the billions of parameters, those numbers that we have.
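To put rough numbers on that: the weights take roughly parameter count times bits per parameter divided by 8. A back-of-the-envelope sketch (real downloads run a bit larger because of file metadata and layers kept at higher precision):

```python
def approx_weight_size_gb(params_billions: float, bits_per_param: int) -> float:
    """Rough size of an LLM's weights in gigabytes: params * bits / 8.

    Ignores metadata and the layers that stay at higher precision, so
    treat this as an estimate, not an exact download size.
    """
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal GB

# A 32B model at full 16-bit precision vs. a Q4 (4-bit) quantization:
full = approx_weight_size_gb(32, 16)  # ~64 GB, far too big for a 24 GB GPU
q4 = approx_weight_size_gb(32, 4)     # ~16 GB, fits on a 3090/4090 with room for context
print(f"32B @ FP16: ~{full:.0f} GB, @ Q4: ~{q4:.0f} GB")
```

This is why a Q4 of a 32B model fits on a single 24 GB card while the FP16 version never could.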
30:23 And so just to give you a visual representation of this, you can also
30:27 quantize images just like you can quantize LLMs. And so we have our full
30:31 scale image on the left-hand side here comparing it to different levels of
30:35 quantization. We have 16-bit, 8-bit, and 4-bit. And you can see that at first with
30:40 a 16-bit quantization, it almost looks the same. But then once we go down to
30:44 4-bit, you can very much see that we have a huge loss in quality for the image.
30:49 Now with images, it's more extreme than with LLMs. When we do an 8-bit or a 4-bit
30:54 quantization of an LLM, we don't actually lose that much performance, whereas we lose a lot
30:58 of quality with images. And so that's why it's so useful for us. And so I have
31:01 a table just to kind of describe what this looks like. So FP16, that's the
31:07 16-bit precision that all LLMs have as a base. That is the full size. The speed
31:11 is obviously going to be very slow because the model is a lot bigger, but
31:16 your quality is perfect compared to what it could be. I mean, obviously that
31:18 doesn't mean that you're going to get perfect answers all the time. I'm just
31:22 saying it's the 100% results from this LLM. And then going down to a Q8
31:28 precision, so it's half the size. The speed is going to be a lot better. And
31:33 the quality is near-perfect. So it's not like performance is cut in half just
31:37 because size is. You still have the same number of parameters. Each one is just a
31:42 bit less precise. And so you're still going to get almost the same results.
31:47 And then going down to a Q4, 4-bit, it's a fourth the size. It's going to be very
31:52 fast compared to 16-bit. And the quality is still going to be great. Now, these
31:57 numbers are very vague on purpose. There's no great way for me to
32:01 quantify exactly the difference, especially because it changes per LLM
32:05 and your hardware and everything like that. So, I'm just being very general
32:09 here. And then once you get to Q2, the size goes down a lot. It's going to
32:13 be very very fast, but usually your performance starts to go down quite a
32:17 bit once you go down to a Q2. And then like the note that I have in the bottom
32:22 left here, a Q4 quantization is generally the best balance. And so when
32:26 you are thinking to yourself, which large language model should I run? What
32:31 size should I use? My rule of thumb is to pick the largest large language model
32:37 that can work with your hardware with a Q4 quantization. That is why I assumed
32:42 that in the table earlier. And then also, like we saw in Ollama earlier, it always
32:47 defaults to a Q4 quantization because the 16-bit is just so big compared to Q4
32:52 that most of the LLMs you couldn't even run yourself. And a Q4 of a 32 billion
32:59 parameter model is still going to be a lot more powerful than the full 7
33:02 billion parameter or 14 billion parameter because you don't actually
33:07 lose that much performance. So that is quantization. So just to make this very
33:11 practical for you, I'm back here in the model list for Qwen 3. We have all these
33:15 models that don't specify a quantization, but we can see that it
33:20 defaults to Q4 because if I click on any one of them, the quantization right here
33:26 is Q4_K_M. And don't worry about the K_M. That's just a way to group
33:30 parameters. You have K_S, K_M, and K_L. It's kind of outside of the scope of
33:33 what really matters for you. The big thing is the Q4 like the actual number
33:38 here. So Q4 quantization is the default for Qwen 3 32B and really any model in
33:44 Ollama. But if we want to see the other quantized variants and we want to run
33:48 them, you can click on the view all. This is available no matter the LLM that
33:52 you're seeing in Ollama. Now we can scroll through and see all the levels of
33:56 quantization for each of the parameter sizes for Qwen 3. So, if I scroll all
34:01 the way down, the absolute biggest version of Qwen 3 that's available is the
34:08 full 16-bit of the 235 billion parameter Qwen 3. And it is a whopping 470 GB just
34:14 to install this. And there is no way that you're ever going to lay hands on
34:17 infrastructure to run this unless you're working for a very large enterprise. But
34:22 I can go down here, let's say, to 14 billion parameters and I can run the Q4
34:27 like this. So, you can click on any one that you want to run. Like let's say I
34:30 want to run Q8. I can click on this and then I have the command to pull and run
34:35 this specific quantization of the 14 billion parameter model. So each of the
34:39 quantized variants has a unique ID within Ollama. So you can very
34:42 specifically choose the one that you want. Again, my general recommendation is
34:47 just to go with what Ollama recommends, which is just defaulting to
34:51 Q4. Like if I go to DeepSeek R1, you can see that also defaults to Q4 no matter
34:56 the size that I pick. But if you do want to explore different quantizations, you
35:00 want to try to run the absolute full model for maybe something smaller like 7
35:04 billion or 14 billion, you can definitely do that through Ollama and
35:08 really any other provider of local LLMs. So that is everything for quantization.
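That rule of thumb (the largest model whose Q4 weights fit in your VRAM) can be sketched as a tiny helper. The 0.8 headroom factor, reserving VRAM for context and overhead, is my own assumption, not an Ollama number:

```python
# Common parameter sizes from this section, in billions.
SIZES_B = [7, 14, 32, 70]

def largest_q4_model(vram_gb: float, headroom: float = 0.8):
    """Largest model (billions of params) whose Q4 weights fit in VRAM.

    Q4 weights take roughly params * 4 bits / 8 = half a byte per
    parameter; the headroom factor reserves room for the context (KV
    cache). Returns None if not even the smallest size fits.
    """
    budget_gb = vram_gb * headroom
    fitting = [s for s in SIZES_B if s * 0.5 <= budget_gb]  # s/2 GB at Q4
    return max(fitting) if fitting else None

print(largest_q4_model(24))  # a 24 GB 3090/4090: 32B at Q4 fits, 70B does not
print(largest_q4_model(8))   # an 8 GB card: 7B fits, 14B is too tight
```

Plug in your own card's VRAM; if the answer disappoints, that's the signal to consider offloading or a smaller quantization.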
35:12 It's important to know how that works, but yes, generally stick with a Q4 of
35:17 the largest LLM that you can run. The next concept that is very important to
35:21 understand is offloading. All offloading is is splitting the layers of your
35:26 large language model between your GPU and your CPU and RAM. It's kind of
35:30 crazy, but large language models don't have to fit entirely in your GPU. All
35:36 large language models can be split into layers, layers of the different weights,
35:40 and you can have some of it running on your GPU. So, it's stored in your VRAM
35:45 and computed by the GPU. And then some of the large language model is stored in
35:50 your RAM, computed by the CPU. Now, this does hurt performance a lot. And so,
35:55 generally, you want to avoid offloading if you can. You want to be able to fit
35:59 everything in your GPU, which by the way, the context, like your prompts for
36:04 your local LLMs, that is also stored in VRAM. And so, sometimes you'll see what
36:08 happens when you have very long conversations with a large language model
36:13 that barely fits in your GPU. That'll actually tip it over the edge. So, it
36:16 starts to offload some of it to the CPU and RAM. So keep that in mind when you
36:19 have longer conversations and all of a sudden things get really slow, you know
36:24 that offloading is happening. Sometimes this is necessary though as context
36:28 grows. And if you're only offloading a little bit of the LLM or a little bit of
36:32 the conversation, whatever to the CPU and RAM, it won't affect performance
36:36 that much. And so sometimes if you're trying to squeeze the biggest size you
36:41 can into your machine for an LLM, you can take advantage of offloading to run
36:45 something bigger or have a much larger conversation. Just know that usually it
36:49 kind of sucks. Like when I have offloading start to happen, my machine
36:53 gets bogged down and the responses are a lot slower. It's really not fun, but it
36:59 is possible. And fun fact, by the way, if your GPU is full and your CPU and RAM
37:04 is full, you can actually offload to storage, like literally using your hard
37:07 drive or SSD. That's when it's like incredibly slow and just terrible. But
37:11 just fun fact, you can actually do that. Now, the very last thing that I want to
37:15 cover before we dive into some code, setting up the local AI package, and
37:21 building out some agents is a few very crucial parameters, environment
37:25 variables for Ollama. So, these are environment variables that you can set
37:28 on your machine just like any other, based on your operating system. And
37:32 Ollama does have an FAQ for setting up some of these things, which I'll link to
37:36 in the description as well. But yeah, these are a bit more technical, so
37:40 people skip past setting this stuff up a lot, but it's actually really, really
37:44 important to make things very efficient when running local LLMs. So the first
37:49 environment variable is flash attention. You want to set this to one or true.
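As a sketch of what this looks like in practice, here are the four settings this section covers, using the variable names from Ollama's documentation. Note these are assumptions to verify against your install: availability of `OLLAMA_KV_CACHE_TYPE` and `OLLAMA_CONTEXT_LENGTH` depends on your Ollama version, and they must be set in the environment of the `ollama serve` process itself, not just in a client script:

```python
import os

# The four Ollama server settings discussed in this section. Set these in
# the environment that launches `ollama serve` (shell profile, systemd
# unit, etc.); values here follow the recommendations in the text.
OLLAMA_SETTINGS = {
    "OLLAMA_FLASH_ATTENTION": "1",    # more efficient attention calculation
    "OLLAMA_KV_CACHE_TYPE": "q8_0",   # quantize the context (KV cache) to 8-bit
    "OLLAMA_CONTEXT_LENGTH": "8192",  # override the tiny default context limit
    "OLLAMA_MAX_LOADED_MODELS": "1",  # how many models may share your GPU
}

for name, value in OLLAMA_SETTINGS.items():
    os.environ[name] = value
    print(f"{name}={value}")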
37:54 When you have this set to true, it's going to make the attention calculation
37:59 a lot more efficient. It sounds fancy, but basically large language models when
38:04 they are generating a response, they have to calculate which parts of your
38:08 prompt to pay the most attention to. That's the calculation. And you can make
38:12 it a lot more efficient without losing much performance at all by setting up
38:16 the flash attention, setting that to true. And then for another optimization,
38:21 just like we can quantize the LLM itself, you can also quantize or
38:27 compress the context. So your system prompt, the tool descriptions, your
38:31 prompt and conversation history, all that context that's being sent to your
38:36 LLM, you can quantize that as well. So Q4 is my general recommendation for
38:41 quantizing LLMs. Q8 is the general recommendation for quantizing the
38:46 context memory. It's a very simplified explanation, but it's really, really
38:50 useful because a long conversation can also take a lot of VRAM, just like a larger
38:55 LLM. And so it's good to compress that. And then the third environment variable,
38:58 this is actually probably the most crucial one to set up for Ollama. There
39:02 is this crazy thing. I don't know why Ollama does it, but by default, they
39:07 limit every single large language model to 2,000 tokens for the context limit,
39:13 which is just tiny compared to, you know, Gemini being 1 million tokens and
39:17 Claude being 200,000 tokens. Like, they handle very, very large prompts. And a
39:21 lot of local large language models can also handle large prompts. But Ollama
39:25 will limit you by default to 2,000 tokens. And so you have to override that
39:30 yourself with this environment variable. And so generally I recommend starting
39:34 with about 8,000 tokens. You can move this all the way up to
39:38 something like 32,000 tokens if your local large language model supports
39:42 that. And if you view the model page on Ollama, you can see the context length
39:46 that's supported by the LLM. But you definitely want to, you know, jack this
39:50 up more from just 2,000 because a lot of times when you have longer
39:53 conversations, you're going to get past 2,000 tokens very, very quickly. So, do
39:57 not miss this. If your large language model is starting to go completely off
40:02 the rails and ignore your system prompt and forget that it has these tools that
40:06 you gave it, it's probably because you reached the context length. And so, just
40:10 keep that in mind. I see people miss this a lot. And then the very last
40:14 environment variable, probably the least important out of all these four,
40:18 but if you're running a lot of different large language models at once and you're
40:22 trying to shove them all in your GPU, a lot of times you can have issues. And so
40:25 in Ollama, you can limit the number of models that are allowed to be in your
40:29 memory at a single time. With this one, typically you want to set this to either
40:33 one or two. Definitely set this to just one if you are using large language
40:37 models that basically fill your GPU, like it's going to fit exactly into
40:41 your VRAM and you're not going to have room for another large language model.
40:44 But if you are running more smaller ones and maybe you could actually fit two on
40:48 your GPU with the VRAM that you have, you can set this to two. So again, more
40:52 technical overall, but it's very important to have these right. And we'll
40:55 get into the local AI package where I already have these set up in the
40:59 configuration. And then by the way, this is the Ollama FAQ that I referenced a
41:02 minute ago that I'll have linked in the description. And so there's actually a
41:06 lot of good things to read into here, like being able to verify that your GPU
41:10 is compatible with Ollama. How can you tell if the model's actually loaded on
41:13 your GPU? So, a lot of like sanity check things that they walk you through in the
41:17 FAQ as well. Also talking about environment variables, which I just
41:20 covered. And so, they've got some instructions here depending on your OS
41:23 how to get those set up. So, if there's anything that's confusing to you, this
41:26 is a very good resource to start with. So, I'm trying to make it possible for
41:30 you to look into things further if there's anything that doesn't quite make
41:33 sense for what I explained here. And of course, always let me know in the
41:35 comments if you have any questions on this stuff as well, especially the more
41:39 technical stuff that I just got to cover because it's so important even though I
41:43 know we really want to dive into the meat of things, which we are actually
41:45 Ollama's OpenAI API Compatibility
41:47 going to do now. All right, here is everything that we have covered at this
41:50 point. And congratulations if you have made it this far because I covered all
41:55 the tricky stuff with quantization and the hardware requirements and offloading
41:59 and some of our little configuration and parameters. So, if you got all of that,
42:03 the rest of it is going to be a walk in the park as we start to dive into code,
42:07 getting all of our local AI set up and building out some agents. You understand
42:10 the foundation now that we're going to build on top of to make some cool stuff.
42:15 And so, now the next thing that we're going to do is talk about how we can use
42:19 local AI anywhere. We're going to dive into OpenAI compatibility and I'll show
42:23 you an example. We can take something that is using OpenAI right now,
42:27 transform it into something that is using Ollama and a local LLM. So, we'll
42:31 actually dive into some code here. And I've got my fair share of no code stuff
42:35 in this master class as well, but I want to focus on both because I think it's
42:38 really important to use both code and no code whenever applicable. And that
42:42 applies to local AI just like building agents in general. So, I've already
42:45 promised a couple of times that I would dive into OpenAI API compatibility, what
42:50 it is, and why it's so important. And we're going to dive into this now so you
42:54 can really start to see how you can take existing agents and transform them into
42:59 being 100% local with local large language models without really having to
43:03 touch the code or your workflow at all. It is a beautiful thing because OpenAI
43:10 has created a standard for exposing large language models through an API.
43:14 It's called the chat completions API. It's kind of like how the Model Context
43:19 Protocol (MCP) is a standard for connecting agents to tools. The chat
43:23 completions API is a standard for exposing large language models over an
43:28 API. So you have this common endpoint along with a few other ones that all of
43:35 these providers implement. This is the way to access the large language model
43:39 to get a response based on some conversation history that you pass in.
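Concretely, that common endpoint is a POST to `{base_url}/v1/chat/completions` with a JSON list of messages. A standard-library sketch pointed at Ollama's default local address (the model tag is an assumption, and the request is built but not sent, since sending needs a running server):

```python
import json
import urllib.request

# Every OpenAI-compatible provider exposes this same endpoint shape;
# only the base URL (and API key) changes between providers.
BASE_URL = "http://localhost:11434/v1"  # Ollama's default local address

payload = {
    "model": "qwen3:32b",  # any model you've pulled locally (assumption)
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
}

request = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        # Ollama ignores the key, but the header is part of the standard.
        "Authorization": "Bearer ollama",
    },
)

# Sending is left commented out so this sketch runs without a live server:
# with urllib.request.urlopen(request) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
print(request.full_url)
```

The response body is equally standardized: the reply text lives at `choices[0].message.content` no matter which provider answered.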
43:43 So, Ollama has implemented this as of February. We have other providers too:
43:49 Gemini is OpenAI compatible, Groq is, OpenRouter (which we saw earlier) is.
43:53 Almost every single provider is OpenAI API compatible. And so, not only is it
43:57 very easy to swap between large language models within a specific provider, it's
44:02 also very easy to swap between providers entirely. You can go from Gemini to
44:09 OpenAI, or OpenAI to Ollama, or OpenAI to Groq, just by changing basically one
44:13 piece of configuration pointing to a different base URL as it is called. So
44:18 you can access that provider and then the actual API endpoint that you hit
44:22 once you are connected to that specific provider is always the exact same and
44:26 the response that you get back is also always the exact same. And so Ollama has
44:31 this implemented now. And I'll link to this article in the description as well
44:33 if you want to read through this because they have a really neat Python example.
44:37 It shows how we create an OpenAI client, and the only thing we have to do
44:42 to connect to Ollama instead of OpenAI is change this base URL. So now we are
44:47 pointing to Ollama, which is hosted locally, instead of pointing to the URL for
44:51 OpenAI. So we'd reach out to them over the internet and talk to their LLMs. And
44:55 then with Ollama, you don't actually need an API key because everything's running
44:58 locally. So you just need some placeholder value here. But there is no
45:02 authentication that is going on. You can set that up. I'm not going to dive into
45:05 that right now. But by default, because it's all just running locally, you don't
45:09 even need an API key to connect to Ollama. And then once we have our OpenAI
45:13 client set up that is actually talking to Ollama, not OpenAI, we can use it in
45:18 exactly the same way. But now we can specify a model that we have downloaded
45:22 locally already through Ollama. We pass in our conversation history in the same way
45:27 and we access the response like the content the AI produced the token usage
45:31 like all those things that we get back from the response in the same way.
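And that one piece of configuration really is just the base URL (plus the API key). A sketch of the swap, using the providers' documented OpenAI-compatible base URLs; the function name and shape here are illustrative, not from any library:

```python
# Swapping providers = swapping one base URL. The endpoint shape and the
# response shape stay identical across OpenAI-compatible providers.
PROVIDERS = {
    "openai": "https://api.openai.com/v1",
    "ollama": "http://localhost:11434/v1",  # local, no real key needed
    "openrouter": "https://openrouter.ai/api/v1",
}

def client_config(provider: str, api_key: str = "ollama") -> dict:
    """Return the kwargs you'd pass to an OpenAI-style client constructor."""
    return {"base_url": PROVIDERS[provider], "api_key": api_key}

print(client_config("ollama"))
# Same agent code, different provider, one changed line:
print(client_config("openrouter", api_key="sk-or-..."))
```

This is exactly why you can prototype against OpenRouter and later point the same agent at a 3090 running Ollama.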
45:34 They've got a JavaScript example as well. They have a couple of examples
45:38 using different frameworks like the Vercel AI SDK and AutoGen. Really any
45:44 AI agent framework can work with OpenAI API compatibility to make it very easy
45:47 to swap between these different providers. Like Pydantic AI, my favorite
45:52 AI agent framework, also supports OpenAI API compatibility. So you can easily
45:57 within your Pydantic AI agents swap between these different providers. And
45:58 OpenAI Compatible Demo
4:26 What is Local AI?
4:28 is all about. Let's start by laying the foundation. What is local AI in the
4:32 first place? Well, very simply put, local AI is running your own large
4:36 language models and infrastructure like your database and your UI entirely on
4:42 your own machine 100% offline. So when you think about when you typically want
4:46 to build an AI agent, you need a large language model, maybe like GPT-4.1
4:51 or Claude 4, and then you need something like your database, like Supabase, and
4:55 you need a way to create a user interface. You have all these different
4:58 components for your agent and typically you're using APIs to access things that
5:03 are hosted on your behalf. But with local AI, we can take all of these
5:07 things completely in our own control running them ourselves. So this is
5:11 possible through open-source large language models and software. So
5:15 everything is running on your own hardware instead of you paying for APIs.
5:20 So we run the large language model ourselves on our own machine instead of
5:25 paying for the OpenAI API for example. And so for large language models, there
5:29 are thousands of different open source large language models available for us
5:32 to use in a lot of different ways. And some of these you've probably heard of
5:37 before, like DeepSeek R1, Qwen 3, Mistral 3.1, Llama 4. These are just a
5:41 couple of examples of the most popular ones that you've probably heard of
5:44 before. We'll be tinkering around with using some of these in this master
5:48 class. And then we also have open-source software. So all of our infrastructure
5:53 that goes along with our agents and LLMs, things like Ollama for running our
5:58 LLMs, Supabase for our database, N8N for our no-code/low-code workflow automations,
6:04 and open web UI to have a nice user interface to talk to our agents and
6:07 LLMs. And we'll dive into using all of these as well. Now, because local AI
6:12 means running large language models on our own computer, it's not as easy as
6:18 just going to claude.ai or chatgpt.com and typing in a prompt. We have to
6:22 actually install something, but it still is very easy to get started. So, let me
6:25 show you right now with a hands-on example. So, here we are within the
6:30 website for Ollama. This is ollama.com. I'll have a link to this in the
6:33 description of the video. This is one of the open-source platforms that allows
6:39 us to very easily download and run local large language models. And so, you just
6:42 have to go to their homepage here and click on this nice big download button.
6:45 You can install it for Windows, Mac or Linux. It really works for any operating
6:49 system. Then once you have it up and running on your machine, you can open up
6:53 any terminal. Like I'm on Windows here, so I'm in a PowerShell session and I can
6:58 run Ollama commands now to do things like view the models that I have available on
7:03 my machine. I can download models and I can run them as well. And the way that I
7:08 know how to pull and run specific models is I can just go to this models tab in
7:12 their navigation and I can browse and filter through all of the open source
7:16 LLMs that are available to me like DeepSeek R1. Almost everyone is familiar
7:20 with DeepSeek. It just totally blew up back in February and March. We have
7:25 Gemma 3, Qwen 3, Llama 4, a few of them that I mentioned earlier when we had the
7:30 presentation up. And so we can click into any one of these like I can go into
7:35 DeepSeek R1 for example and then I have the command right here that I can copy
7:39 to download and run this specific model in my terminal. And there are a lot of
7:44 different model variants of DeepSeek R1. So we'll get into different sizes and
7:47 hardware requirements and what that all means in a little bit, but I'll just
7:50 take one of them and run it as an example. So I'll just do a really small
7:54 one right now. I'll do a 1.5 billion parameter large language model. And
7:58 again, I'll explain what that means in a little bit. I can copy this command.
8:01 It's just ollama run and then the unique ID of this large language model. So I'll
8:06 go back into my terminal. I'll clear it here and then paste in this command. And
8:09 so first it's going to have to pull this large language model. And the total size
8:15 for this is 1.1 GB. And so it'll have to download it. And then because I used the
8:20 run command, it will immediately get me into a chat interface with the model
8:24 once it's downloaded. Also, if you don't want to run it right away, you
8:27 just want to install it, you can do ollama pull instead of ollama run. And
8:32 then again, to view the models that you have installed already,
8:36 you can just do the ollama list command like I did earlier. And so, right now,
8:39 I'll pause and come back once it's installed in about 30 seconds. All
8:43 right, it is now installed. And now I can just send in a message like hello.
8:47 And then boom, we are now talking to a large language model. But instead of it
8:50 being hosted somewhere else and we're just using a website, this is running on
8:54 my own infrastructure, the large language model and all the billions of
8:59 parameters are getting loaded onto my graphics card and running the inference.
9:02 That's what it's called when we're generating a response from the LLM
9:06 directly within this terminal here. And so I can ask another question like um
9:13 what is the best GPU right now? We'll see what it says. So it's thinking
9:16 first. This is actually a thinking model. DeepSeek R1 is a reasoning LLM.
9:20 And then it gives us an answer: its top GPU models today, the 3080, the RX 6700.
9:27 Obviously, we have a training cutoff for local large language models just like we
9:30 do with ones in the cloud like GPT. And so the information is a little outdated
9:34 here, but yeah, this is a good answer. So we have a large language model that
9:38 we're talking to directly on our machine. And then to close out of this,
9:42 I can just do Ctrl+D, or Cmd+D on Mac. And if I do list, we have all the
9:47 other models that you saw earlier, plus now this one that I just installed. So
9:50 these are all available for me to run again just with that ollama run command.
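As a quick reference, the three commands from this demo look like this (the deepseek-r1:1.5b tag is the 1.5 billion parameter variant pulled above; check the Ollama model library for current tags):

```shell
# Download a model without starting a chat session
ollama pull deepseek-r1:1.5b

# Download (if needed) and open an interactive chat in the terminal
ollama run deepseek-r1:1.5b

# List every model installed locally
ollama list
```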
9:54 And it won't have to reinstall if you already have it installed. Run just
9:58 downloads it if you don't have it yet. So that is just a quick demo of
10:03 using Ollama. We'll dive a lot more into Ollama later, like how we can actually
10:07 use it within our Python code and within our N8N workflows. This is just our
10:11 quick way to try it out within the terminal. Now, to really get into why we
10:15 should care about local AI now that we know what it is, I want to cover the
10:19 pros and cons of local AI and what I like to call cloud AI. That's just when
10:23 you're paying for things to be hosted for you, like using Claude or Gemini or
10:28 using the cloud version of N8N instead of hosting it yourself. And I also want
10:32 to cover the advantages of each because I don't want to sugarcoat things and
10:35 just hype up this master class by telling you that you should always use
10:39 local AI. That is certainly not the case. There is a time and place for both
10:43 of these categories here, but there are so many use cases where local AI is
10:50 absolutely crucial. You have no idea how many businesses I have talked to that
10:54 are willing to put tens of thousands of dollars into running their own LLMs and
10:58 infrastructure because privacy and security is so crucial for the things
11:02 that they're building with AI. And that actually gets into the first advantage
11:06 here of local AI, which is privacy and security. You can run things 100%
11:11 offline. The data that you're giving to your LLMs as prompts, it now doesn't
11:16 leave your hardware. It stays entirely within your own control. And for a lot
11:20 of businesses, that is 100% crucial, especially when they're in highly
11:24 regulated industries like healthcare, finance, even real estate.
11:28 Like there's so many use cases where you're working with intellectual
11:32 property or just really sensitive information. You don't want to be
11:36 sending your data off to an LLM provider like Google or OpenAI or Anthropic. And
11:41 so as a business owner, you should definitely be paying attention to this
11:45 if you are working with automation use cases where you're dealing with any kind
11:48 of sensitive data. And then also if you're a freelancer, you're starting an
11:52 AI automation agency, anything where you're building for other businesses,
11:55 you are going to have so many opportunities open up to you when you're
11:59 able to work with local AI because you can handle those use cases now where
12:03 they need to work with sensitive data and you can't just go and use the OpenAI
12:08 API. And that is the main advantage of local AI. It is a very big deal. But
12:11 there are a few other things that are worth focusing on as well. Starting with
12:16 model fine-tuning, you can take any open-source large language model and
12:20 add additional training on top with your own data. Basically making it a domain
12:24 expert on your business or the problem that you are solving. It's so so
12:29 powerful. You can make models through fine-tuning more powerful than the best
12:33 of the best in the cloud depending on what you are able to fine-tune with
12:38 depending on the data that you have. And you can do fine-tuning with some cloud
12:42 models like with GPT, but your options are pretty limited and it can be quite
12:46 expensive. And so it definitely is a huge advantage to local AI. And local AI
12:52 in general can be very cost-effective, including the infrastructure as well. So
12:56 your LLMs and your infrastructure. You run it all yourself and you pay for
13:00 nothing besides the electricity bill if it's running on your computer at your
13:04 house or if you have some private server in the cloud. You just have to pay for
13:07 that server and that's it. There's no N8N bill, no Supabase bill, no OpenAI
13:13 bill. You can save a lot of money. It's really, really nice. And on top of that,
13:17 when everything is running on your own infrastructure, the agents that you
13:22 create can run on the same server, the same place as your infrastructure. And
13:26 so it can actually be faster because you don't have network delays calling APIs
13:30 for all your different services for your LLMs and your database and things like
13:34 that. And then with that, we can now get into the advantages of cloud AI.
13:38 Starting with it's a lot easier to set up. There's a reason why I have to have
13:42 this master class for you in the first place. There are some initial hurdles
13:47 that we have to jump over to really have everything fully set up for our local
13:51 LLMs and infrastructure. And you just don't have that with cloud AI because
13:54 you can very simply call into these APIs. You just have to sign up and get
13:58 an API key and that's about it. So, it certainly is easier to get up and
14:02 running and there's less maintenance overall because they are hosting things
14:05 for you. Supabase is hosting the database for you. OpenAI is hosting the
14:09 LLM for you. So, you don't have to manage things on your own hardware. With
14:13 Local AI, you have to apply patches and updates if you have a private server in
14:17 the cloud. You have to manage your own hardware if you're running on your own
14:20 computer, making sure that it's on 24/7, if you want your database on 24/7, that
14:24 kind of thing. It's just less maintenance with cloud AI. And then
14:28 probably the biggest advantage of cloud AI overall is that you have better
14:34 models available to you. Claude 4 Sonnet or Opus, for example, is more powerful
14:40 than any local AI that you could run. So we have this gap here, and this gap was a
14:45 lot bigger at one point even a year ago. The best cloud LLMs absolutely crushed
14:50 the best local LLMs, and that gap is starting to diminish. And so I really
14:55 see a future where that gap is diminished entirely and all the best
14:59 local LLMs are actually on par with the best cloud ones. That's the future I
15:03 see. That's why I think that local AI is such
15:07 a big deal because the advantages of local AI, those are just going to get
15:12 more prevalent over time when businesses realize they really want private and
15:15 secure solutions. And then the advantages of cloud AI, I think those
15:19 are actually going to diminish over time. That's the key: minimal setup,
15:24 less maintenance. Well, those advantages are going to go away as we have
15:28 platforms and better instructions and solutions to make the setup and
15:32 maintenance easier for local AI and we have the gap that's continuing to
15:35 diminish between the power of these LLMs. All these advantages are going to
15:39 actually go away and then it'll just completely make sense to use local AI
15:43 honestly probably for like every single solution in the future. That's really
15:47 what I see us heading towards. And then the last advantage to cloud AI which
15:51 also I think will go away over time is that you have some features out of the
15:55 box like you have memory that's built directly into ChatGPT. Gemini has web
15:59 search baked in even when you use it through the API like these kind of
16:02 capabilities that are out of the box that you have to implement yourself with
16:07 local AI maybe as tools for your agent and you can definitely do that but it is
16:10 nice that these things are out of the box for cloud AI. So those are the pros
16:15 and cons between the two. I hope that this makes it very clear for you to pick
16:19 right now for your own use case. Should I implement local AI or cloud AI? A lot
16:24 of it comes down to the security and privacy requirements for your use case.
16:28 Now, the next big thing that we need to talk about for local AI is hardware
16:33 requirements. Cuz here's the thing, large language models are very resource-
16:39 intensive. You can't just run any LLM on any computer. And the reason for that is
16:43 large language models are made up of billions or even trillions of numbers
16:47 called parameters. And they're all connected together in a web that looks
16:51 kind of like this. This is a very simplified view with just a few
16:54 parameters here. But each of the parameters are nodes and they're
16:58 connected together. The input layer is where our prompt comes in and our prompt
17:02 is fed through all these hidden layers and then we have the output at the end.
17:05 This is the response we get back from the LLM. But like I said, this is a very
17:10 simplified view. GPT-4, for example, like you can see on the right-hand side, is
17:15 estimated to have 1.4 trillion parameters. And so, if you want to fit
17:20 an entire large language model into your graphics card, you have to store all of
17:25 these numbers. And even though we can handle gigabytes at a time in our
17:28 graphics cards through what is called VRAM, storing billions or trillions of
17:34 numbers is absolutely insane. And so that's why large language models, you
17:37 actually have to have a pretty good graphics card if you want to run some of
17:42 the best ones. And so looking at Ollama here, when we see these different sizes,
17:47 going back to their model list, like 1.5 billion parameters or 27 billion
17:51 parameters, there are different sizes for the local LLMs. Obviously, the
17:56 larger a local LLM that you are running, the more performance you are going to
17:59 get, but you are going to be limited to what you are capable of running with
18:04 your graphics card or your hardware. So, with that in mind, I now want to dive
18:07 into the nitty-gritty details with you so you know exactly the kind of models
18:11 that you can run, the kind of speeds you can expect depending on your hardware.
18:15 And if you want to invest in new hardware to run local AI, I've got some
18:19 recommendations as well. So there are generally four primary size ranges for
18:25 large language models based on the speed and the power that you are looking for.
18:29 You have models that are around seven or 8 billion parameters. Those are
18:33 generally the smallest that I'd recommend trying to run. There are a lot
18:37 of smaller LLMs available like 1 billion parameters or three billion parameters,
18:41 but I'm so unimpressed when I use those LLMs that I don't really want to focus
18:45 on them here. 7 billion parameters is still tiny compared to the large cloud
18:51 AI models like Claude or GPT, but you can get pretty good results with them
18:55 for just simple chat use cases. And so for these models, assuming a Q4
18:59 quantization, which I'll get into quantization in a little bit, it's
19:02 basically just a way to make the LLM a lot smaller without hurting performance
19:06 that much, a 7 billion parameter model will need about 4 to 5 GB of VRAM on
19:11 your graphics card. And so if you have something like a 3060 Ti from Nvidia
19:16 with 8 GB of VRAM, you can very comfortably run a 7 billion parameter
19:20 model and you can expect to get very roughly around 25 to 35 tokens per
19:27 second. A token is roughly equivalent to a word. And so your local large language
19:31 model at 7 billion parameters with this graphics card will get about 25 to 35
19:37 words per second out on the screen to you being streamed out. And then if you
19:42 use much more powerful hardware like a 3090 to run a 7 billion parameter model
19:47 then you'll just jack up the speed a lot more. So that's 7 billion or 8 billion
19:51 parameters. Another very common size is something around 14 billion parameters.
19:57 This will take about 8 to 10 GB of VRAM. And so just a couple of options for
20:01 this. You have the 4070 Ti Super, which has 16 GB of VRAM, or you could go as
20:08 low as 12 GB of VRAM with the 3080 Ti. And you could expect to get about 15 to
20:13 25 words per second. And then this is where you start to get into basic tool
20:17 calling. So I find that when you are building with a 7 billion parameter
20:21 model, they don't do tool calling very well. So you can't really build that
20:25 powerful of agents around a 7 billion parameter model. But once you get to
20:29 something around 14 billion parameters, that's when I see agents being able to
20:33 really accept instructions well around tools and system prompts and leveraging
20:37 tools to do things on our behalf. That's when we can really start to use LLMs to
20:42 make things that are agentic. And then the next big category of LLMs
20:47 is somewhere between 30 and 34 billion parameters. You see a lot of LLMs that
20:51 fall in that size range. This will typically need 16 to 20 GB of VRAM.
20:57 And so a 3090 is a really good example of a graphics card that can run this. It
21:03 has 24 GB of VRAM. I actually have two 3090s myself. And I'll have a link to my
21:08 exact PC that I built for running local AI in the description of this video. So
21:12 I have two 3090s, which we'll need in a second for a 70 billion parameter model, but
21:16 one is enough for a 32 billion parameter model. And then also Macs with their new
21:22 M4 chips are very powerful with their unified memory architecture. So if you
21:27 get a Mac M4 Pro with 24 GB of unified memory, you can also run 32 billion
21:32 parameter models. Now the speed isn't going to be the best necessarily, and
21:36 again, this does depend a lot on your computer overall, but you can expect
21:40 something around 10 to 15, maybe up to 20 tokens per second. And 32 billion
21:46 parameters is when you really start to see LLMs that are actually pretty
21:50 impressive. Like 7 billion and 14 billion, they are disappointing quite a
21:54 bit. I'll be totally honest. Especially when you try to use them with more
21:58 complicated agentic tasks. 32 billion when you start to get into this range is
22:02 when I'm actually genuinely impressed. I'm like, "Oh, this is actually pretty
22:05 close to the performance of some of the best cloud AI." And then 70 billion
22:10 parameters. This is going to take about 35 to 40 GB of VRAM. For most consumer
22:19 GPUs like 3090s and 4090s, even 5090s, that's not actually enough VRAM. And so
22:22 this is when you have to start to split a large language model across multiple
22:28 GPUs, which solutions like Ollama will actually help you do right out of
22:31 the box. So it's not this insane setup even though it might feel kind of
22:35 daunting like oh I have to split the layers of my LLM between GPUs. It's not
22:38 actually that complicated. And so two 3090s or two 4090s, that will be necessary.
22:44 Or you could have more of an enterprise-grade GPU like an H100. So
22:49 Nvidia has a lot of these non-consumer-grade GPUs that have a lot more VRAM to
22:53 handle things like 70 billion parameter models. And the speed won't be the best
22:58 if you're using something like two 3090s, especially because performance is hurt
23:01 when you have to split an LLM between GPUs. You could expect something like 8
23:06 to 12 tokens per second. And obviously, if you have the most complex
23:09 agents where you're really trying to match the performance of cloud AI as
23:13 much as possible, that's when you'd want to use a 70 billion parameter model. And
23:16 then if you're investing in hardware to run local AI, I have a couple of quick
23:20 recommendations here. And a lot of this depends on the size of the model that's
23:24 going to be good enough for your use case. And so I'll dive into some
23:28 alternatives for running local AI directly if you want to do testing
23:32 before you buy infrastructure. I'll get into that in a little bit, but
23:35 recommended builds. If you want to spend around $800 to build a PC, I'd recommend
23:41 getting a 4060 Ti graphics card and then 32 GB of RAM. If you want to spend
23:47 $2,000, I'd recommend either getting a PC with a 3090 and 64 GB of RAM or
23:53 getting that Mac M4 Pro with 24 GB of unified memory. And then lastly, if you
23:58 want to spend $4,000, which is about what I spent for my PC, then I'd
24:02 recommend getting two 3090 graphics cards, and I got both of mine used for
24:07 around $700 each. And then also getting 128 GB of RAM, or you can get a
24:15 Mac M4 Max with 64 GB of unified memory. So, I wanted to really get into the
24:19 nitty-gritty details there. So, I know I spent a good amount of time diving into
24:22 super specific numbers, but I hope this is really helpful for you. No matter the
24:26 large language model or your hardware, you now know generally where you're at
24:30 for what you can run. So, to go along with that information overload, I want
24:34 to give you some specifics, individual LLMs that you can try right now based on
24:38 the size range that you know will work for your hardware. So, just a couple of
24:42 recommendations here. The first one that I want to focus on is Deepseek R1. This
24:47 is the most popular local LLM ever. It completely blew up a few months ago. And
24:52 the best part about DeepSeek R1 is they have an option that fits into each of
24:56 the size ranges that I just covered in that chart. So they have a 7 billion
25:00 parameter, 14, 32, and 70. The exact numbers that I mentioned earlier. And
25:05 then there is also the full real version of R1, which is 671 billion parameters.
25:10 I'm sorry though, you probably don't have the hardware to run that unless
25:13 you're spending tens of thousands on your infrastructure. So, probably stick
25:16 with one of these based on your graphics card or if you have a Mac computer, pick
25:19 the one that'll work for you and just try it out. You can click on any one of
25:23 these sizes here. And then here's your command to download and run it. And this
25:28 is defaulting to a Q4 quantization, which is what I was assuming in the
25:31 chart earlier. And again, I will cover what that actually means in a little bit
25:35 here. The other one that I want to focus on here is Qwen 3. This is a lot newer.
25:41 Qwen 3 is so good. And they don't have a 70 billion parameter option, but they do
25:45 have all the other sizes that fit into those ranges that I mentioned
25:49 earlier. Like they've got 8 billion, 14 billion, and 32 billion parameters. And
25:52 the same kind of deal where you click on the size that you want and you've got
25:55 your command to install it here. And this is a reasoning LLM just like
26:01 DeepSeek R1. And then the other one that I want to mention here is Mistral Small.
26:05 I've had really good results with this as well. There are fewer options here,
26:08 but you've got 22 or 24 billion parameters, which is going to work well
26:12 with a 3090 graphics card or if you have a Mac M4 Pro with 24 GB of unified
26:18 memory. Really, really good model. And then also, there is a version of it that
26:22 is fine-tuned for coding specifically called Devstral, which is another
26:26 really cool LLM worth checking out as well if you have the hardware to run it.
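As a hedged sketch, the Ollama pulls for these recommendations look roughly like this. The exact tags are assumptions based on Ollama's usual naming and can change as new releases land, so confirm each tag on the model's page in the Ollama library first:

```shell
# DeepSeek R1 distills at the sizes discussed above
ollama pull deepseek-r1:14b
ollama pull deepseek-r1:32b

# Qwen 3 at comparable sizes
ollama pull qwen3:14b
ollama pull qwen3:32b

# Mistral Small, and its coding-focused sibling Devstral
ollama pull mistral-small
ollama pull devstral
```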
26:30 So, that is everything for just general recommendations for local LLMs to try
26:34 right now. This is the part of the master class that is going to become
26:38 outdated the fastest because there are new local LLMs coming out every single
26:42 month. I don't really know how long my recommendations will last for. But in
26:45 general, you can just go to the model list in Ollama, find one
26:49 that has a size that works with your graphics card and just give it
26:52 a shot. You can install it and run it very easily with Ollama. And the other
27:01 thing that I want to mention here is you don't always have to run open-source
27:05 large language models yourself. You can use a platform like OpenRouter. You can
27:05 just go to openrouter.ai, sign up, add in some API credits. You can try these
27:10 open-source LLMs yourself. Maybe if you want to see what's powerful enough for
27:15 your agents before you invest in hardware to actually run them yourself.
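OpenRouter exposes an OpenAI-compatible chat completions endpoint, so trying a hosted open-source model from code can look roughly like this sketch. The endpoint URL and the `qwen/qwen3-32b` model slug are assumptions based on OpenRouter's conventions; check the model's page for the exact slug. The sketch only builds the request so the shape is visible; actually sending it is one HTTP POST away.

```python
import json

# Hypothetical OpenRouter chat request (OpenAI-compatible schema).
API_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_chat_request(prompt: str,
                       model: str = "qwen/qwen3-32b",
                       api_key: str = "YOUR_OPENROUTER_KEY"):
    """Return (headers, body) for a chat completion request."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    return headers, body

headers, body = build_chat_request("What is the best GPU right now?")
```

If the hosted model holds up in your agent, the same prompts can then move to the locally installed version with no change to the messages you send.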
27:18 And so within OpenRouter, I can just search for Qwen here. And I can go down
27:23 to Qwen and I can go to 32 billion. They have a free offering as well that
27:26 doesn't have the best rate limits. So I'll just go to this one right here,
27:31 Qwen 3 32B. So I can try the model out through OpenRouter. They actually host
27:35 it for me. So it's an open-source, non-local version, but now I can try it
27:39 in my agents to see if this is good. And then if it's good, it's like, okay, now
27:43 I want to buy a 3090 graphics card so that I can install it directly through
27:47 Ollama instead. And so the 32 billion Qwen 3 is exactly what we're seeing here
27:51 in OpenRouter. And there are other platforms like Groq as well where you
27:55 can run these open-source large language models not on your own infrastructure
27:58 if you just want to do some testing beforehand or whatever that might
28:01 be. So I wanted to call that out as an alternative as well. But yeah, that's
28:04 everything for my general recommendations for LLMs to try and use
28:09 in your agents. All right, it is time to take a quick breather. This is
28:12 everything that we've covered already in our master class. What is local AI? Why
28:17 we care about it? Why it's the future and hardware requirements. And I really
28:20 wanted to dive deep into this stuff because it sets the stage for everything
28:24 that we do when we actually build agents and deploy our infrastructure. And so
28:28 the last thing that I want to do with you before we really start to get into
28:33 building agents and setting up our package is I want to talk about some of
28:37 the tricky stuff that is usually pretty daunting for anyone getting into local
28:42 AI. I'm talking things like offloading models, quantization, environment
28:47 variables to handle things like flash attention, all the stuff that is really
28:51 important that I want to break down simply for you so you can feel confident
28:55 that you have everything set up right, that you know what goes into using local
29:00 LLMs. The first big concept to focus on here is quantization. And this is
29:04 crucial. It's how we can make large language models a lot smaller so they
29:10 can fit on our GPUs without hurting performance too much. We are lowering
29:14 the model precision here. And so basically what that means is we have
29:18 each of our parameters, all of our numbers for our LLMs that are 16 bits
29:23 with the full size, but we can lower the precision of each of those parameters to
29:28 8, 4, or 2 bits. Don't worry if you don't understand the technicalities of
29:31 that. Basically, it comes down to LLMs are just billions of numbers. That's the
29:35 parameters that we already covered. And we can make these numbers less precise
29:40 or smaller without losing much performance. So, we can fit larger LLMs
29:45 within a GPU that normally wouldn't even be close to running the full-size model.
29:50 Like with 32 billion parameter LLMs, for example, I was assuming a Q4
29:55 quantization, like 4 bits per parameter, in that chart earlier. If you had the
30:00 full 16-bit parameters for the 32 billion parameter LLM, there's no way it could
30:06 fit on your Mac or your 3090 GPU, but we can use quantization to make it
30:10 possible. It's like rounding a number that has a long decimal to something
30:15 like 10.44 instead of this thing that has like 10 decimal places, but we're
30:19 doing it for each of the billions of parameters, those numbers that we have.
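That rounding analogy can be made concrete with a toy sketch. This is a simplified symmetric scheme for illustration, not the exact math behind Ollama's Q4 variants:

```python
# Toy 4-bit symmetric quantization: squeeze each weight into one of
# the integer levels -7..7, storing only the level plus one shared scale.
def quantize_q4(weights):
    scale = max(abs(w) for w in weights) / 7  # map the largest weight to level 7
    levels = [round(w / scale) for w in weights]    # the 4-bit integers we'd store
    dequantized = [lvl * scale for lvl in levels]   # reconstructed weights
    return levels, dequantized, scale

weights = [0.1234, -0.9876, 0.5555, -0.3141]
levels, approx, scale = quantize_q4(weights)
# Every reconstructed weight lands within half a quantization step of the
# original, so behavior barely changes while each number shrinks to 4 bits.
```

Real quantizers do this per block of weights with per-block scales (that's what the K_S/K_M/K_L grouping refers to), but the core idea is exactly this rounding.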
30:23 And so just to give you a visual representation of this, you can also
30:27 quantize images just like you can quantize LLMs. And so we have our full
30:31 scale image on the left-hand side here, comparing it to different levels of
30:35 quantization. We have 16-bit, 8-bit, and 4-bit. And you can see that at first with
30:40 a 16-bit quantization, it almost looks the same. But then once we go down to
30:44 4-bit, you can very much see that we have a huge loss in quality for the image.
30:49 Now with images, it's more extreme than LLMs. When we do an 8-bit or a 4-bit
30:54 quantization of an LLM, we don't actually lose that much performance, whereas we lose a lot
30:58 of quality with images. And so that's why it's so useful for us. And so I have
31:01 a table just to kind of describe what this looks like. So FP16, that's the
31:07 16-bit precision that all LLMs have as a base. That is the full size. The speed
31:11 is obviously going to be very slow because the model is a lot bigger, but
31:16 your quality is perfect compared to what it could be. I mean, obviously that
31:18 doesn't mean that you're going to get perfect answers all the time. I'm just
31:22 saying it's the 100% results from this LLM. And then going down to a Q8
31:28 precision, so it's half the size. The speed is going to be a lot better. And
31:33 the quality is near-perfect. So it's not like performance is cut in half just
31:37 because size is. You still have the same number of parameters. Each one is just a
31:42 bit less precise. And so you're still going to get almost the same results.
31:47 And then going down to a Q4, 4-bit, it's a fourth the size. It's going to be very
31:52 fast compared to 16 bit. And the quality is still going to be great. Now, these
31:57 numbers are very vague on purpose. There's not really a way for me to quantify
32:01 exactly the difference, especially because it changes per LLM
32:05 and your hardware and everything like that. So, I'm just being very general
32:09 here. And then once you get to Q2, the size goes down a lot. It's going to
32:13 be very very fast, but usually your performance starts to go down quite a
32:17 bit once you go down to a Q2. And then like the note that I have in the bottom
32:22 left here, a Q4 quantization is generally the best balance. And so when
32:26 you are thinking to yourself, which large language model should I run? What
32:31 size should I use? My rule of thumb is to pick the largest large language model
32:37 that can work with your hardware with a Q4 quantization. That is why I assumed
32:42 that in the table earlier. And then also, like we saw in Ollama earlier, it always
32:47 defaults to a Q4 quantization because the 16-bit is just so big compared to Q4
32:52 that most of the LLMs you couldn't even run yourself. And a Q4 of a 32 billion
32:59 parameter model is still going to be a lot more powerful than the full 7
33:02 billion or 14 billion parameter model because you don't actually
33:07 lose that much performance. So that is quantization. So just to make this very
33:11 practical for you, I'm back here in the model list for Qwen 3. We have all these
33:15 models that don't specify a quantization, but we can see that it
33:20 defaults to Q4 because if I click on any one of them, the quantization right here
33:26 is a Q4_K_M. And don't worry about the K_M. That's just a way to group
33:30 parameters. You have K_S, K_M, and K_L. It's kind of outside of the scope of
33:33 what really matters for you. The big thing is the Q4 like the actual number
33:38 here. So Q4 quantization is the default for Qwen 3 32B and really any model in
33:44 Ollama. But if we want to see the other quantized variants and we want to run
33:48 them, you can click on the view all. This is available no matter the LLM that
33:52 you're seeing in Ollama. Now we can scroll through and see all the levels of
33:56 quantization for each of the parameter sizes for Qwen 3. So, if I scroll all
34:01 the way down, the absolute biggest version of Qwen 3 that I can get is the
34:08 full 16-bit of the 235 billion parameter Qwen 3. And it is a whopping 470 GB just
34:14 to install this. And there is no way that you're ever going to lay hands on
34:17 infrastructure to run this unless you're working for a very large enterprise. But
34:22 I can go down here, let's say, to 14 billion parameters and I can run the Q4
34:27 like this. So, you can click on any one that you want to run. Like let's say I
34:30 want to run Q8. I can click on this and then I have the command to pull and run
34:35 this specific quantization of the 14 billion parameter model. So each of the
34:39 quantized variants has a unique ID within Ollama. So you can very
34:42 specifically choose the one that you want. Again my general recommendation is
34:47 just to go with what Ollama recommends, which is defaulting to
34:51 Q4. Like if I go to DeepSeek R1, you can see that also defaults to Q4 no matter
34:56 the size that I pick. But if you do want to explore different quantizations, you
35:00 want to try to run the absolute full model for maybe something smaller like 7
35:04 billion or 14 billion, you can definitely do that through Ollama and
35:08 really any other provider of local LLMs. So that is everything for quantization.
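A back-of-the-envelope way to turn these quantization levels into sizes is parameters times bits per parameter. This counts weights only; real downloads and VRAM use run somewhat higher because of K_M-style mixed precision, the KV cache, and runtime overhead, so treat it as a floor:

```python
# Rough weight size: parameters (billions) x bits-per-parameter / 8 bits-per-byte.
def weight_size_gb(params_billion: float, bits: int) -> float:
    return params_billion * bits / 8

# The full 16-bit 32B model vs. its Q4 version:
fp16 = weight_size_gb(32, 16)  # 64.0 GB - no consumer GPU holds this
q4 = weight_size_gb(32, 4)     # 16.0 GB - fits a 24 GB 3090 with room for context
q4_7b = weight_size_gb(7, 4)   # 3.5 GB - consistent with the ~4-5 GB VRAM figure
                               #          from earlier once overhead is added
```

Going from 16-bit to Q4 divides the footprint by four, which is exactly why Ollama defaults its tags to Q4.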
35:12 It's important to know how that works, but yes, generally stick with a Q4 of
35:17 the largest LLM that you can run. The next concept that is very important to
35:21 understand is offloading. All offloading is, is splitting the layers of your
35:26 large language model between your GPU and your CPU and RAM. It's kind of
35:30 crazy, but large language models don't have to fit entirely in your GPU. All
35:36 large language models can be split into layers, layers of the different weights,
35:40 and you can have some of it running on your GPU. So, it's stored in your VRAM
35:45 and computed by the GPU. And then some of the large language model is stored in
35:50 your RAM, computed by the CPU. Now, this does hurt performance a lot. And so,
35:55 generally, you want to avoid offloading if you can. You want to be able to fit
35:59 everything in your GPU, which by the way, the context, like your prompts for
36:04 your local LLMs, that is also stored in VRAM. And so, sometimes you'll see what
36:08 happens when you have very long conversations for a large language model
36:13 that barely fits in your GPU. That'll actually tip it over the edge. So, it
36:16 starts to offload some of it to the CPU and RAM. So keep that in mind when you
36:19 have longer conversations and all of a sudden things get really slow, you know
36:24 that offloading is happening. Sometimes this is necessary though as context
36:28 grows. And if you're only offloading a little bit of the LLM or a little bit of
36:32 the conversation, whatever to the CPU and RAM, it won't affect performance
36:36 that much. And so sometimes if you're trying to squeeze the biggest size you
36:41 can into your machine for an LLM, you can take advantage of offloading to run
36:45 something bigger or have a much larger conversation. Just know that usually it
36:49 kind of sucks. Like when I have offloading start to happen, my machine
36:53 gets bogged down and the responses are a lot slower. It's really not fun, but it
36:59 is possible. And fun fact, by the way, if your GPU is full and your CPU and RAM
37:04 is full, you can actually offload to storage, like literally using your hard
37:07 drive or SSD. That's when it's like incredibly slow and just terrible. But
37:11 just fun fact, you can actually do that. Now, the very last thing that I want to
37:15 cover before we dive into some code, setting up the local AI package, and
37:21 building out some agents is a few very crucial parameters, environment
37:25 variables for Ollama. So, these are environment variables that you can set
37:28 on your machine just like any other based on your operating system. And
37:32 Ollama does have an FAQ for setting up some of these things, which I'll link to
37:36 in the description as well. But yeah, these are a bit more technical, so
37:40 people skip past setting this stuff up a lot, but it's actually really, really
37:44 important to make things very efficient when running local LLMs. So the first
37:49 environment variable is flash attention. You want to set this to one or true.
37:54 When you have this set to true, it's going to make the attention calculation
37:59 a lot more efficient. It sounds fancy, but basically large language models when
38:04 they are generating a response, they have to calculate which parts of your
38:08 prompt to pay the most attention to. That's the calculation. And you can make
38:12 it a lot more efficient without losing much performance at all by setting up
38:16 the flash attention, setting that to true. And then for another optimization,
38:21 just like we can quantize the LLM itself, you can also quantize or
38:27 compress the context. So your system prompt, the tool descriptions, your
38:31 prompt and conversation history, all that context that's being sent to your
38:36 LLM, you can quantize that as well. So Q4 is my general recommendation for
38:41 quantizing LLMs. Q8 is the general recommendation for quantizing the
38:46 context memory. It's a very simplified explanation, but it's really, really
38:50 useful because a long conversation can also take a lot of VRAM just like a larger
38:55 LLM. And so it's good to compress that. And then the third environment variable,
38:58 this is actually probably the most crucial one to set up for Ollama. There
39:02 is this crazy thing. I don't know why Ollama does it, but by default, they
39:07 limit every single large language model to 2,000 tokens for the context limit,
39:13 which is just tiny compared to, you know, Gemini being 1 million tokens and
39:17 Claude being 200,000 tokens. Like, they handle very, very large prompts. And a
39:21 lot of local large language models can also handle large prompts. But Ollama
39:25 will limit you by default to 2,000 tokens. And so you have to override that
39:30 yourself with this environment variable. And so generally I recommend starting
39:34 with about 8,000 tokens. You can move this all the way up to
39:38 something like 32,000 tokens if your local large language model supports
39:42 that. And if you view the model page on Ollama, you can see the context length
39:46 that's supported by the LLM. But you definitely want to, you know, jack this
39:50 up more from just 2,000 because a lot of times when you have longer
39:53 conversations, you're going to get past 2,000 tokens very, very quickly. So, do
39:57 not miss this. If your large language model is starting to go completely off
40:02 the rails and ignore your system prompt and forget that it has these tools that
40:06 you gave it, it's probably because you reached the context length. And so, just
40:10 keep that in mind. I see people miss this a lot. And then the very last
40:14 environment variable, probably the least important out of all these four,
40:18 but if you're running a lot of different large language models at once and you're
40:22 trying to shove them all in your GPU, a lot of times you can have issues. And so
40:25 in Ollama, you can limit the number of models that are allowed to be in your
40:29 memory at a single time. With this one, typically you want to set this to either
40:33 one or two. Definitely set this to just one if you are using large language
40:37 models that are basically a perfect fit for your GPU, like it's going to fit exactly into
40:41 your VRAM and you're not going to have room for another large language model.
40:44 But if you are running more smaller ones and maybe you could actually fit two on
40:48 your GPU with the VRAM that you have, you can set this to two. So again, more
40:52 technical overall, but it's very important to have these right. And we'll
40:55 get into the local AI package where I already have these set up in the
40:59 configuration. And then by the way, this is the Ollama FAQ that I referenced a
41:02 minute ago that I'll have linked in the description. And so there's actually a
41:06 lot of good things to read into here, like being able to verify that your GPU
41:10 is compatible with Ollama. How can you tell if the model's actually loaded on
41:13 your GPU? So, a lot of like sanity check things that they walk you through in the
41:17 FAQ as well. Also talking about environment variables, which I just
41:20 covered. And so, they've got some instructions here depending on your OS
41:23 how to get those set up. So, if there's anything that's confusing to you, this
41:26 is a very good resource to start with. So, I'm trying to make it possible for
41:30 you to look into things further if there's anything that doesn't quite make
41:33 sense for what I explained here. And of course, always let me know in the
41:35 comments if you have any questions on this stuff as well, especially the more
41:39 technical stuff that I just got to cover because it's so important even though I
41:43 know we really want to dive into the meat of things, which we are actually
41:47 going to do now. All right, here is everything that we have covered at this
41:50 point. And congratulations if you have made it this far because I covered all
41:55 the tricky stuff with quantization and the hardware requirements and offloading
41:59 and some of our little configuration and parameters. So, if you got all of that,
42:03 the rest of it is going to be a walk in the park as we start to dive into code,
42:07 getting all of our local AI set up and building out some agents. You understand
42:10 the foundation now that we're going to build on top of to make some cool stuff.
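As a concrete reference for the four Ollama settings covered a moment ago, here is how they could be set on Linux or macOS. This is a sketch based on the variable names in the Ollama FAQ; names and defaults can vary by Ollama version and OS, so check the FAQ linked in the description for your setup:

```shell
# Make the attention calculation more efficient
export OLLAMA_FLASH_ATTENTION=1

# Quantize the KV cache (context memory) to 8-bit, per the Q8 recommendation
export OLLAMA_KV_CACHE_TYPE=q8_0

# Raise the default context window from ~2,000 tokens to 8,192
export OLLAMA_CONTEXT_LENGTH=8192

# Keep only one model loaded in VRAM at a time
export OLLAMA_MAX_LOADED_MODELS=1
```

On Windows these would go in as system environment variables instead, and Ollama has to be restarted to pick them up.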
42:15 And so, now the next thing that we're going to do is talk about how we can use
42:19 local AI anywhere. We're going to dive into OpenAI compatibility and I'll show
42:23 you an example. We can take something that is using OpenAI right now, and
42:27 transform it into something that is using Ollama and a local LLM. So, we'll
42:31 actually dive into some code here. And I've got my fair share of no code stuff
42:35 in this master class as well, but I want to focus on both because I think it's
42:38 really important to use both code and no code whenever applicable. And that
42:42 applies to local AI just like building agents in general. So, I've already
42:45 promised a couple of times that I would dive into OpenAI API compatibility, what
42:50 it is, and why it's so important. And we're going to dive into this now so you
42:54 can really start to see how you can take existing agents and transform them into
42:59 being 100% local with local large language models without really having to
43:03 touch the code or your workflow at all. It is a beautiful thing because OpenAI
43:10 has created a standard for exposing large language models through an API.
43:14 It's called the chat completions API. It's kind of like how model context
43:19 protocol MCP is a standard for connecting agents to tools. The chat
43:23 completions API is a standard for exposing large language models over an
43:28 API. So you have this common endpoint along with a few other ones that all of
43:35 these providers implement. This is the way to access the large language model
43:39 to get a response based on some conversation history that you pass in.
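To make that "common endpoint" concrete, here's a minimal stdlib-only sketch of what a chat completions request boils down to: the same URL path and JSON body no matter the provider, with only the base URL changing. The model tag and localhost port are just the Ollama defaults used later in this video; nothing here actually goes over the network.

```python
import json

def build_chat_request(base_url: str, model: str, messages: list) -> tuple:
    """Build the standard chat-completions URL and JSON request body."""
    url = base_url.rstrip("/") + "/chat/completions"
    body = json.dumps({"model": model, "messages": messages})
    return url, body

# The only provider-specific piece is the base URL:
url, body = build_chat_request(
    "http://localhost:11434/v1",  # Ollama locally; swap for any compatible provider
    "qwen3:14b",
    [{"role": "user", "content": "Hello!"}],
)
print(url)  # http://localhost:11434/v1/chat/completions
```

Point the same builder at `https://api.openai.com/v1` and the URL path and body shape stay identical — that is the whole compatibility story.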
43:43 So, Ollama is implementing this as of February. We have other providers like
43:49 Gemini that are OpenAI compatible, and Groq, and OpenRouter, which we saw earlier.
43:53 Almost every single provider is OpenAI API compatible. And so, not only is it
43:57 very easy to swap between large language models within a specific provider, it's
44:02 also very easy to swap between providers entirely. You can go from Gemini to
44:09 OpenAI or OpenAI to Ollama or OpenAI to Groq just by changing basically one
44:13 piece of configuration pointing to a different base URL as it is called. So
44:18 you can access that provider and then the actual API endpoint that you hit
44:22 once you are connected to that specific provider is always the exact same and
44:26 the response that you get back is also always the exact same. And so Ollama has
44:31 this implemented now. And I'll link to this article in the description as well
44:33 if you want to read through this because they have a really neat Python example.
44:37 It shows where we create an OpenAI client and the only thing we have to do
44:42 to connect to Ollama instead of OpenAI is change this base URL. So now we are
44:47 pointing to Ollama that is hosted locally instead of pointing to the URL for
44:51 OpenAI. So we'd reach out to them over the internet and talk to their LLMs. And
44:55 then with Ollama, you don't actually need an API key because everything's running
44:58 locally. So you just need some placeholder value here. But there is no
45:02 authentication that is going on. You can set that up. I'm not going to dive into
45:05 that right now. But by default, because it's all just running locally, you don't
45:09 even need an API key to connect to Ollama. And then once we have our OpenAI
45:13 client set up that is actually talking to Ollama, not OpenAI, we can use it in
45:18 exactly the same way. But now we can specify a model that we have downloaded
45:22 locally already through Ollama. We pass in our conversation history in the same way
45:27 and we access the response, like the content the AI produced and the token usage,
45:31 all those things that we get back from the response in the same way.
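Because the response shape is standardized too, accessing the content and token usage is provider-agnostic. Here's a sketch using a hand-written sample response dict — the field names follow the chat completions standard, but the values are made up for illustration:

```python
# Illustrative chat-completions response body; the same shape comes back
# from OpenAI, Ollama, or any other compatible provider.
sample_response = {
    "model": "qwen3:14b",
    "choices": [
        {"message": {"role": "assistant", "content": "Hello there!"},
         "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 9, "completion_tokens": 4, "total_tokens": 13},
}

# The content the AI produced and the token usage, accessed the same way
# regardless of which provider served the request.
content = sample_response["choices"][0]["message"]["content"]
total_tokens = sample_response["usage"]["total_tokens"]
print(content, total_tokens)  # Hello there! 13
```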
45:34 They've got a JavaScript example as well. They have a couple of examples
45:38 using different frameworks like the Vercel AI SDK and AutoGen. Really any
45:44 AI agent framework can work with OpenAI API compatibility to make it very easy
45:47 to swap between these different providers. Like Pydantic AI, my favorite
45:52 AI agent framework, also supports OpenAI API compatibility. So you can easily
45:57 within your Pydantic AI agents swap between these different providers. And
46:02 so what I have for you now is two code bases that I want to cover. The first
46:07 one is the local AI package, which we'll dive into in a little bit. But right
46:13 now, we have all of the agents that we are going to be creating in this master
46:17 class. So I have a couple for N8N that are also available in this repository.
46:21 And then a couple of scripts that I want to share with you as well. And so the
46:25 very first thing that I want to show you is this simple script that I have called
46:31 OpenAI compatible demo. And so you can download this repository. I'll have this
46:34 linked in the description as well. There's instructions for downloading and
46:38 setting up everything in here. And this is all 100% local AI. And so with that,
46:43 I'm going to go over into my windsurf here where I have this OpenAI compatible
46:47 demo set up. So I've got a comment at the top reminding us what the OpenAI API
46:52 compatibility looks like. We set our base URL to point to Ollama hosted
46:59 locally and it's hosted on port 11434 by default. So I can actually show you
47:02 this. I have Ollama running in a Docker container, which we're going to dive
47:05 into this when we set up the local AI package, but you can see that it is
47:11 being exposed on port 11434. And by the way, you can see the
47:14 127.0.0.1 in that URL that I have highlighted here, that is synonymous with localhost.
47:21 And so this right here, you could also replace with 127.0.0.1.
47:25 Just a little tidbit there. It's not super important. I just typically leave
47:28 it as localhost. And then you can change the port as well. I'm just sticking to
47:32 what the default is. And then again, we don't need to set our API key. We can
47:36 just set it to any value that we want here. We just need some placeholder even
47:39 though there is no authentication with Ollama for real unless you configure
47:43 that. So that's OpenAI compatibility. And the important thing with this script
47:47 here is I have two different configurations here. I have one for
47:51 talking to OpenAI and then one for Ollama. So with OpenAI, we set our base
47:57 URL to point to api.openai.com. We have our OpenAI API key set in our
48:01 environment variables. So you can just set all your environment variables here
48:05 and then rename this to .env. I've got instructions for that in the readme of
48:08 course. And then going back to the script, we are using GPT-4.1 nano for our
48:13 large language model. It's something super fast and cheap. And then for our
48:17 Ollama configuration, we are setting the base URL here, localhost:11434,
48:22 or just whatever we have set in our environment variables. Same thing for
48:26 the API key. And then same thing for our large language model. And what I'm going
48:31 to be using in this case is Qwen 3 14B. That is one of the large language models
48:34 that I showed you within the Ollama website. Definitely a smaller one
48:38 compared to what I could run, but I just want to run something fast. And very
48:41 small large language models are great for simple tasks like summarization or
48:45 just basic chat. And that's what I'm going to be using here just for a simple
48:49 demo. And so whether it's enabled or not, this configuration is just based on
48:53 what we have set for our environment variables. And the important thing here
48:59 is the code that runs for each of these configurations just as we go through
49:03 this demo is exactly the same. We are parameterizing the configuration for the
49:08 base URL and API key. So we are setting up the exact same OpenAI client just
49:13 like we saw in the Ollama article, but just changing the base URL and API key.
49:17 And so then, for example, when we use it right here, it's
49:22 client.chat.completions.create, calling the exact same function no matter if we're
49:26 using OpenAI or Ollama. And then we're handling the response in the same way as
49:31 well. And so I'll go back to my terminal now. And so I went through all the steps
49:34 already to set up my virtual environment, install all of my dependencies. And so now I can run the
49:40 command OpenAI compatible demo. And now it's going to present the two
49:43 configuration options for me. And so I can run through OpenAI. So we'll go
49:46 ahead and do that first. And these two demos are going to look exactly the
49:50 same, but that is the point. And so we have our base URL here for OpenAI. We
49:54 have a basic example of a completion with GPT-4.1 nano. There we go. So this
49:59 is the model that was used. Here are the number of tokens. And this is our
50:03 response. And then I can press enter to see a streaming response now as well. So
50:07 we saw it type out our answer in real time. And then I can press enter one
50:10 more time. This is the last part of the demo. Just a multi-turn conversation.
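The multi-turn part of the demo is really just a growing `messages` list that gets resent on every request. A minimal stdlib sketch — the reply strings here are stand-ins for real model output:

```python
def add_turn(messages: list, user_text: str, assistant_text: str) -> list:
    """Append one user/assistant exchange to the conversation history."""
    messages.append({"role": "user", "content": user_text})
    messages.append({"role": "assistant", "content": assistant_text})
    return messages

history = [{"role": "system", "content": "You are a helpful assistant."}]
add_turn(history, "Hi!", "Hello! How can I help?")   # turn 1
add_turn(history, "What's 2+2?", "2 + 2 = 4.")       # turn 2

# The full history is what gets sent as `messages` on the next request,
# which is how the LLM "remembers" earlier turns.
print([m["role"] for m in history])
# ['system', 'user', 'assistant', 'user', 'assistant']
```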
50:14 So we got a couple of messages here in our conversation history. So very nice
50:19 and simple. The point here is to now show you that I can run this and select
50:23 Ollama now instead and everything is going to look exactly the same and all
50:27 of the code is the same as well. It is only our configuration that is
50:31 different. And so it will take a little bit when you first run this because
50:36 Ollama has to load the large language model into your GPU. And so going to the
50:42 logs for Ollama, I can show you what this looks like here. And so when we first
50:47 make a request when Qwen 3 14B is not loaded into our GPU yet, you're going to
50:52 see a lot of logs come in here. And you'll have this container up and
50:54 running when you have the local AI package which we'll cover in a little
50:57 bit. So it shows all the metadata about our model, like it's Qwen 3 14B. We can
51:05 see here that we have a Q4_K_M quantization like we saw on the Ollama
51:09 website. What other information do we have here? There's just so much to
51:14 digest here. Yeah, another really important thing is we have the
51:19 context length. I have that set to 8,192 just like I recommended in the
51:22 environment variables. And then we can see that we offloaded all of the layers
51:26 to the GPU. So I don't have to do any offloading to the CPU or the RAM. I can
51:30 keep everything in the GPU, which is certainly ideal, like I said, to make
51:34 sure this is actually fast. And then when we get a response from Qwen 3 14B,
51:41 we are calling the v1/chat/completions endpoint because it is OpenAI API
51:46 compatible. So that exact endpoint that we hit for OpenAI is the one that we are
51:50 hitting here with a large language model that is running entirely on our computer
51:54 in Ollama. And so the response I get back, it's actually a reasoning LLM as
51:58 well. So we even have the thinking tokens here, which is super cool. And so
52:02 we got our response. It's just printing out the first part of it here just to
52:04 keep it short. And then I can press enter. And we can see a streaming demo
52:08 as well. And it's going to be a lot faster this time because we do already
52:11 have the model loaded into our GPU. And so that first request when it first has
52:15 to load a model is always the slower one. And then it's faster going forward
52:19 once that model is already loaded in our GPU. And then as long as we don't swap
52:24 to another large language model and use that one, then it will remain in our GPU
52:28 for some time. And so then all of our responses after are faster. And then we
52:33 just have the last part of our demo here with a multi-turn conversation. So we
52:37 can see conversation history in action as well, just not with streaming here.
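The streaming we saw a moment ago arrives as a sequence of chunks whose `delta.content` pieces you concatenate in order. Sketched here with hard-coded chunks standing in for the real server-sent events — the shape follows the chat completions streaming format, the text is invented:

```python
# Illustrative stream chunks; in reality these arrive one at a time
# over HTTP as the model generates tokens.
chunks = [
    {"choices": [{"delta": {"content": "Local "}}]},
    {"choices": [{"delta": {"content": "AI "}}]},
    {"choices": [{"delta": {"content": "rocks!"}}]},
    {"choices": [{"delta": {}}]},  # final chunk carries no content
]

# Assemble the full response by joining each chunk's content delta.
text = "".join(c["choices"][0]["delta"].get("content", "") for c in chunks)
print(text)  # Local AI rocks!
```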
52:40 And everything's a bit slower with this large language model because
52:43 it is a reasoning one. And so certainly, if you want faster
52:47 inference, you can always use a non-reasoning local LLM like Mistral or
52:52 Gemma for example. So that is our very simple demo showing how this works. I
52:55 hope that you can see with this and again this works with other AI agent
52:59 frameworks like Agno or Pydantic AI or CrewAI as well. They all work in
53:03 this way where you can use OpenAI API compatibility to swap between providers
53:08 so easily, so you don't have to recreate things to use local AI, and that's
53:11 something so important that I want to communicate with you because if I'm the
53:15 one introducing you to local AI I also want to show you how it can very easily
53:19 fit into your existing systems and automations. All right. Now, we have
53:20 Introducing the Local AI Package
53:23 gotten to the part of the local AI master class that I'm actually the most
53:27 excited for because over the past months, I have very much been pouring my
53:31 heart and soul into building up something to make it infinitely easier
53:35 for you to get everything up and running for local AI. And that is the local AI
53:40 package. And so, right now, we're going to walk through installing it step by
53:44 step. I don't want you to miss anything here because it's so important to get
53:47 this up and running, get it all working well. Because if you have the local AI
53:51 package running on your machine and everything is working, you don't need
53:55 anything else to start building AI agents running 100% offline and
53:59 completely private. And so here's the thing. At this point, we've been
54:04 focusing mostly on Ollama and running our local large language models. But there's
54:08 the whole other component to local AI that I introduced at the start of the
54:13 master class for our infrastructure: things like our database and local and
54:18 private web search, our user interface, agent monitoring. We have all these
54:23 other open-source platforms that we also want to run along with our large
54:27 language models and the local AI package is the solution to bring all of that
54:32 together curated for you to install in just a few steps. So, here is the GitHub
54:37 repository for the local AI package. I'll have this linked in the description
54:41 below. Just to be very clear, there are two GitHub repos for this master class.
54:45 We have this one that we covered earlier. This has our N8N and Python
54:49 agents that we'll cover in a bit, as well as the OpenAI compatible demo that
54:53 we saw earlier. So, you want to have this cloned and the local AI package as
54:57 well. Very easy to get both up and running. And if you scroll down in the
55:02 local AI package, I have very comprehensive instructions for setting
55:06 up everything, including how to deploy it to a private server in the cloud,
55:10 which we'll get into at the end of this master class, and a troubleshooting
55:13 section at the bottom. So, everything that I'm about to walk you through here,
55:17 there's instructions in the readme as well if you just want to circle back to
55:21 clarify anything. Also, I dive into all of the platforms that are included in
55:26 the local AI package. And this is very important because like I said, when you
55:30 want to build a 100% offline and private AI agent, it's a lot more than just the
55:35 large language model. You have all of the accompanying infrastructure like
55:39 your database and your UI. And so I have all that included. First of all, I have
55:44 N8N that is our low/no-code workflow automation platform. We'll be building
55:48 an agent with N8N in the local AI package in a little bit once we have it
55:52 set up. We have Supabase for our open-source database. We have Ollama. Of
55:56 course, we want to have this in the package as well for our LLMs. Open Web
56:01 UI, which gives us a ChatGPT-like interface for us to talk to our LLMs and
56:06 have things like conversation history. Very, very nice. So, we're looking at
56:09 this right here. This is included in the package. Then we have Flowise. It's
56:13 similar to N8N, another really good tool to build AI agents with no/low
56:18 code. Qdrant, which is an open-source vector database. Neo4j, which is a
56:25 knowledge graph engine. And then SearXNG for open-source, completely free and
56:31 private web search. Caddy, which is going to be very important for us once
56:35 we deploy the local AI package to the cloud and we actually want to have
56:38 domains for our different services like N8N and Open WebUI. And then the last
56:42 thing is Langfuse. This is an open-source LLM engineering platform that helps
56:47 us with agent observability. Now, some of these services are outside of the scope
56:52 for this local AI master class. I don't want to spend a half hour on every
56:56 single one of these services and make this a 10-hour video. I will be focusing
57:02 in this video on N8N, Supabase, Ollama, Open WebUI, SearXNG, and then Caddy once
57:08 we deploy everything to the cloud. So, I do cover like half of these services.
57:12 And the other thing that I want to touch on here is that there are quite a few
57:16 things included here. And so you do need about 8 GB of RAM on your machine or
57:21 your cloud server to run everything. It is pretty big overall. And so you can
57:26 remove certain things like if you don't want Quadrant and Langfuse for example,
57:30 you can take those out of the package. More on that later. It doesn't have to
57:34 be super bloated, you can whittle this down to what you need. But yeah, there's
57:37 a lot of different things that go into building AI agents. And so I have all of
57:40 these services here so that no matter what you need, I've got you covered. And
57:42 Instructions for Installing the Local AI Package
6:08 Run Your 1st Local LLM in 5 Minutes w/ Ollama
6:12 means running large language models on our own computer, it's not as easy as
6:18 just going to claude.ai or chatgpt.com and typing in a prompt. We have to
6:22 actually install something, but it still is very easy to get started. So, let me
6:25 show you right now with a hands-on example. So, here we are within the
6:30 website for Ollama. This is just ollama.com. I'll have a link to this in the
6:33 description of the video. This is one of the open-source platforms that allows
6:39 us to very easily download and run local large language models. And so, you just
6:42 have to go to their homepage here and click on this nice big download button.
6:45 You can install it for Windows, Mac or Linux. It really works for any operating
6:49 system. Then once you have it up and running on your machine, you can open up
6:53 any terminal. Like I'm on Windows here, so I'm in a PowerShell session and I can
6:58 run Ollama commands now to do things like view the models that I have available on
7:03 my machine. I can download models and I can run them as well. And the way that I
7:08 know how to pull and run specific models is I can just go to this models tab in
7:12 their navigation and I can browse and filter through all of the open source
7:16 LLMs that are available to me like DeepSeek R1. Almost everyone is familiar
7:20 with DeepSeek. It just totally blew up back in February and March. We have
7:25 Gemma 3, Qwen 3, Llama 4, a few of them that I mentioned earlier when we had the
7:30 presentation up. And so we can click into any one of these like I can go into
7:35 DeepSeek R1 for example and then I have the command right here that I can copy
7:39 to download and run this specific model in my terminal. And there are a lot of
7:44 different model variants of DeepSeek R1. So we'll get into different sizes and
7:47 hardware requirements and what that all means in a little bit, but I'll just
7:50 take one of them and run it as an example. So I'll just do a really small
7:54 one right now. I'll do a 1.5 billion parameter large language model. And
7:58 again, I'll explain what that means in a little bit. I can copy this command.
8:01 It's just ollama run and then the unique ID of this large language model. So I'll
8:06 go back into my terminal. I'll clear it here and then paste in this command. And
8:09 so first it's going to have to pull this large language model. And the total size
8:15 for this is 1.1 GB. And so it'll have to download it. And then because I used the
8:20 run command, it will immediately get me into a chat interface with the model
8:24 once it's downloaded. Also, if you don't want to run it right away, you
8:27 just want to install it, you can do ollama pull instead of ollama run. And
8:32 then again, to view the models that you have available to you installed already,
8:36 you can just do the ollama list command like I did earlier. And so, right now,
8:39 I'll pause and come back once it's installed in about 30 seconds. All
8:43 right, it is now installed. And now I can just send in a message like hello.
8:47 And then boom, we are now talking to a large language model. But instead of it
8:50 being hosted somewhere else and we're just using a website, this is running on
8:54 my own infrastructure, the large language model and all the billions of
8:59 parameters are getting loaded onto my graphics card and running the inference.
9:02 That's what it's called when we're generating a response from the LLM
9:06 directly within this terminal here. And so I can ask another question like um
9:13 what is the best GPU right now? We'll see what it says. So it's thinking
9:16 first. This is actually a thinking model. Deepseek R1 is a reasoning LLM.
9:20 And then it gives us an answer: its top GPU models today, the 3080 and RX 6700.
9:27 Obviously, we have a training cutoff for local large language models just like we
9:30 do with ones in the cloud like GPT. And so the information is a little outdated
9:34 here, but yeah, this is a good answer. So we have a large language model that
9:38 we're talking to directly on our machine. And then to close out of this,
9:42 I can just do control D or command D on Mac. And if I do list, we have all the
9:47 other models that you saw earlier, plus now this one that I just installed. So
9:50 these are all available for me to run again just with that ollama run command.
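As a rough sanity check on that 1.1 GB download from earlier: a quantized model's on-disk size is roughly its parameter count times bits per weight. The ~6 effective bits per weight below is a ballpark figure for Q4_K_M on small models (it keeps some tensors at higher precision, so it lands above a flat 4 bits), not an exact spec:

```python
def approx_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough on-disk size of a quantized model in gigabytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# ~6 effective bits/weight is a ballpark for Q4_K_M on a small model
size = approx_size_gb(1.5, 6)
print(size)  # 1.125 -> close to the 1.1 GB download seen above
```

The same arithmetic explains why an unquantized FP16 copy of the same model would be about 3 GB.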
9:54 And it won't have to reinstall if you already have it installed. Run just
9:58 installs it if you don't have it yet already. So that is just a quick demo of
10:03 using Ollama. We'll dive a lot more into Ollama later, like how we can actually
10:07 use it within our Python code and within our N8N workflows. This is just our
10:11 quick way to try it out within the terminal. Now, to really get into why we
10:15 should care about local AI now that we know what it is, I want to cover the
10:19 pros and cons of local AI and what I like to call cloud AI. That's just when
10:23 you're paying for things to be hosted for you, like using Claude or Gemini or
10:28 using the cloud version of N8N instead of hosting it yourself. And I also want
10:32 to cover the advantages of each because I don't want to sugarcoat things and
10:35 just hype up this master class by telling you that you should always use
10:39 local AI. That is certainly not the case. There is a time and place for both
10:43 of these categories here, but there are so many use cases where local AI is
10:50 absolutely crucial. You have no idea how many businesses I have talked to that
10:54 are willing to put tens of thousands of dollars into running their own LLMs and
10:58 infrastructure because privacy and security is so crucial for the things
11:02 that they're building with AI. And that actually gets into the first advantage
11:06 here of local AI, which is privacy and security. You can run things 100%
11:11 offline. The data that you're giving to your LLMs as prompts, it now doesn't
11:16 leave your hardware. It stays entirely within your own control. And for a lot
11:20 of businesses, that is 100% crucial, especially when they're in highly
11:24 regulated industries like the health industry, finance, even real estate.
11:28 Like there's so many use cases where you're working with intellectual
11:32 property or just really sensitive information. You don't want to be
11:36 sending your data off to an LLM provider like Google or OpenAI or Anthropic. And
11:41 so as a business owner, you should definitely be paying attention to this
11:45 if you are working with automation use cases where you're dealing with any kind
11:48 of sensitive data. And then also if you're a freelancer, you're starting an
11:52 AI automation agency, anything where you're building for other businesses,
11:55 you are going to have so many opportunities open up to you when you're
11:59 able to work with local AI because you can handle those use cases now where
12:03 they need to work with sensitive data and you can't just go and use the OpenAI
12:08 API. And that is the main advantage of local AI. It is a very big deal. But
12:11 there are a few other things that are worth focusing on as well. Starting with
12:16 model fine-tuning, you can take any open-source large language model and
12:20 add additional training on top with your own data. Basically making it a domain
12:24 expert on your business or the problem that you are solving. It's so so
12:29 powerful. You can make models through fine-tuning more powerful than the best
12:33 of the best in the cloud depending on what you are able to fine-tune with
12:38 depending on the data that you have. And you can do fine-tuning with some cloud
12:42 models like with GPT, but your options are pretty limited and it can be quite
12:46 expensive. And so it definitely is a huge advantage to local AI. And local AI
12:52 in general can be very cost-effective, including the infrastructure as well. So
12:56 your LLMs and your infrastructure. You run it all yourself and you pay for
13:00 nothing besides the electricity bill if it's running on your computer at your
13:04 house or if you have some private server in the cloud. You just have to pay for
13:07 that server and that's it. There's no n8n bill, no Supabase bill, no OpenAI
13:13 bill. You can save a lot of money. It's really, really nice. And on top of that,
13:17 when everything is running on your own infrastructure, the agents that you
13:22 create can run on the same server, the same place as your infrastructure. And
13:26 so it can actually be faster because you don't have network delays calling APIs
13:30 for all your different services for your LLMs and your database and things like
13:34 that. And then with that, we can now get into the advantages of cloud AI.
13:38 Starting with it's a lot easier to set up. There's a reason why I have to have
13:42 this master class for you in the first place. There are some initial hurdles
13:47 that we have to jump over to really have everything fully set up for our local
13:51 LLMs and infrastructure. And you just don't have that with cloud AI because
13:54 you can very simply call into these APIs. You just have to sign up and get
13:58 an API key and that's about it. So, it certainly is easier to get up and
14:02 running and there's less maintenance overall because they are hosting things
14:05 for you. Supabase is hosting the database for you. OpenAI is hosting the
14:09 LLM for you. So, you don't have to manage things on your own hardware. With
14:13 Local AI, you have to apply patches and updates if you have a private server in
14:17 the cloud. You have to manage your own hardware if you're running on your own
14:20 computer, making sure that it's on 24/7, if you want your database on 24/7, that
14:24 kind of thing. It's just less maintenance with cloud AI. And then
14:28 probably the biggest advantage of cloud AI overall is that you have better
14:34 models available to you. Claude 4 Sonnet or Opus, for example, is more powerful
14:40 than any local AI that you could run. So we have this gap here, and this gap was a
14:45 lot bigger at one point. Even a year ago, the best cloud LLMs absolutely crushed
14:50 the best local LLMs, and that gap is starting to diminish. And so I really
14:55 see a future where that gap is diminished entirely and all the best
14:59 local LLMs are actually on par with the best cloud ones. That's the future I
15:03 see. That's why I think that local AI is such
15:07 a big deal because the advantages of local AI, those are just going to get
15:12 more prevalent over time when businesses realize they really want private and
15:15 secure solutions. And then the advantages of cloud AI, I think those
15:19 are actually going to diminish over time. That's the key: minimal setup,
15:24 less maintenance. Well, those advantages are going to go away as we have
15:28 platforms and better instructions and solutions to make the setup and
15:32 maintenance easier for local AI and we have the gap that's continuing to
15:35 diminish between the power of these LLMs. All these advantages are going to
15:39 actually go away and then it'll just completely make sense to use local AI
15:43 honestly probably for like every single solution in the future. That's really
15:47 what I see us heading towards. And then the last advantage to cloud AI which
15:51 also I think will go away over time is that you have some features out of the
15:55 box, like you have memory that's built directly into ChatGPT. Gemini has web
15:59 search baked in even when you use it through the API like these kind of
16:02 capabilities that are out of the box that you have to implement yourself with
16:07 local AI maybe as tools for your agent and you can definitely do that but it is
16:10 nice that these things are out of the box for cloud AI. So those are the pros
16:15 and cons between the two. I hope that this makes it very clear for you to pick
16:19 right now for your own use case. Should I implement local AI or cloud AI? A lot
16:24 of it comes down to the security and privacy requirements for your use case.
16:28 Now, the next big thing that we need to talk about for local AI is hardware
16:33 requirements. Because here's the thing, large language models are very
16:39 resource-intensive. You can't just run any LLM on any computer. And the reason for that is
16:43 large language models are made up of billions or even trillions of numbers
16:47 called parameters. And they're all connected together in a web that looks
16:51 kind of like this. This is a very simplified view with just a few
16:54 parameters here. But each of the parameters are nodes and they're
16:58 connected together. The input layer is where our prompt comes in and our prompt
17:02 is fed through all these hidden layers and then we have the output at the end.
17:05 This is the response we get back from the LLM. But like I said, this is a very
17:10 simplified view. GPT-4, for example, like you can see on the right-hand side, is
17:15 estimated to have 1.4 trillion parameters. And so, if you want to fit
17:20 an entire large language model into your graphics card, you have to store all of
17:25 these numbers. And even though we can handle gigabytes at a time in our
17:28 graphics cards through what is called VRAM, storing billions or trillions of
17:34 numbers is absolutely insane. And so that's why large language models, you
17:37 actually have to have a pretty good graphics card if you want to run some of
17:42 the best ones. And so looking at Ollama here, when we see these different sizes,
17:47 going back to their model list, like 1.5 billion parameters or 27 billion
17:51 parameters, there are different sizes for the local LLMs. Obviously, the
17:56 larger a local LLM that you are running, the more performance you are going to
17:59 get, but you are going to be limited to what you are capable of running with
18:04 your graphics card or your hardware. So, with that in mind, I now want to dive
18:07 into the nitty-gritty details with you so you know exactly the kind of models
18:11 that you can run, the kind of speeds you can expect depending on your hardware.
18:15 And if you want to invest in new hardware to run local AI, I've got some
18:19 recommendations as well. So there are generally four primary size ranges for
18:25 large language models based on the speed and the power that you are looking for.
18:29 You have models that are around 7 or 8 billion parameters. Those are
18:33 generally the smallest that I'd recommend trying to run. There are a lot
18:37 of smaller LLMs available, like 1 billion or 3 billion parameters,
18:41 but I'm so unimpressed when I use those LLMs that I don't really want to focus
18:45 on them here. 7 billion parameters is still tiny compared to the large cloud
18:51 AI models like Claude or GPT, but you can get pretty good results with them
18:55 for just simple chat use cases. And so for these models, assuming a Q4
18:59 quantization, which I'll get into quantization in a little bit, it's
19:02 basically just a way to make the LLM a lot smaller without hurting performance
19:06 that much, a 7 billion parameter model will need about 4 to 5 GB of VRAM on
19:11 your graphics card. And so if you have something like a 3060 Ti from Nvidia
19:16 with 8 GB of VRAM, you can very comfortably run a 7 billion parameter
19:20 model and you can expect to get very roughly around 25 to 35 tokens per
19:27 second. A token is roughly equivalent to a word. And so your local large language
19:31 model at 7 billion parameters with this graphics card will get about 25 to 35
19:37 words per second out on the screen to you being streamed out. And then if you
19:42 use much more powerful hardware like a 3090 to run a 7 billion parameter model
19:47 then you'll just jack up the speed a lot more. So that's 7 billion or 8 billion
19:51 parameters. Another very common size is something around 14 billion parameters.
19:57 This will take about 8 to 10 GB of VRAM. And so just a couple of options for
20:01 this. You have the 4070 Ti Super, which has 16 GB of VRAM, or you could go as
20:08 low as 12 GB of VRAM with the 3080 Ti. And you could expect to get about 15 to
20:13 25 words per second. And then this is where you start to get into basic tool
20:17 calling. So I find that when you are building with a 7 billion parameter
20:21 model, they don't do tool calling very well. So you can't really build that
20:25 powerful of agents around a 7 billion parameter model. But once you get to
20:29 something around 14 billion parameters, that's when I see agents being able to
20:33 really accept instructions well around tools and system prompts and leveraging
20:37 tools to do things on our behalf. That's when we can really start to use LLMs to
20:42 make things that are agentic. And then the next big category of LLMs
20:47 is somewhere between 30 and 34 billion parameters. You see a lot of LLMs that
20:51 fall in that size range. This will typically need 16 to 20 gigabytes of VRAM.
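The VRAM figures in these size tiers follow from simple arithmetic: parameters times bytes per parameter, plus some runtime overhead. A small sketch of that estimate; the 20% overhead factor is my own ballpark assumption, not an official formula:

```python
# Rough VRAM estimate for a quantized model: parameters * bytes-per-parameter,
# plus ~20% overhead for the context (KV cache) and runtime buffers.
# The 20% overhead factor is an assumption -- real usage varies with the
# runtime, the context length, and the model architecture.
def estimate_vram_gb(params_billions, bits_per_param=4, overhead=1.2):
    bytes_per_param = bits_per_param / 8
    return params_billions * bytes_per_param * overhead

# The four size tiers from this section, assuming Q4 (4 bits per parameter):
for size in (7, 14, 32, 70):
    print(f"{size}B @ Q4 = about {estimate_vram_gb(size):.1f} GB VRAM")
```

The numbers this prints line up with the ranges above: roughly 4 GB for 7B, 8 GB for 14B, 19 GB for 32B, and 42 GB for 70B at a Q4 quantization.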
20:57 And so a 3090 is a really good example of a graphics card that can run this. It
21:03 has 24 GB of VRAM. I actually have two 3090s myself. And I'll have a link to my
21:08 exact PC that I built for running local AI in the description of this video. So
21:12 I have two 3090s, which we'll need in a second for a 70 billion parameter model, but
21:16 one is enough for a 32 billion parameter model. And then also Macs with their new
21:22 M4 chips are very powerful with their unified memory architecture. So if you
21:27 get a Mac M4 Pro with 24 GB of unified memory, you can also run 32 billion
21:32 parameter models. Now the speed isn't going to be the best necessarily, and
21:36 again, this does depend a lot on your computer overall, but you can expect
21:40 something around 10 to 15, maybe up to 20 tokens per second. And 32 billion
21:46 parameters is when you really start to see LLMs that are actually pretty
21:50 impressive. Like 7 billion and 14 billion, they are disappointing quite a
21:54 bit. I'll be totally honest. Especially when you try to use them with more
21:58 complicated agentic tasks. 32 billion, when you start to get into this range, is
22:02 when I'm actually genuinely impressed. I'm like, "Oh, this is actually pretty
22:05 close to the performance of some of the best cloud AI." And then 70 billion
22:10 parameters. This is going to take about 35 to 40 GB of VRAM. For most consumer
22:19 GPUs like 3090s and 4090s, even 5090s, that's not enough VRAM on one card. And so
22:22 this is when you have to start to split a large language model across multiple
22:28 GPUs, which solutions like Ollama will actually help you do right out of
22:31 the box. So it's not this insane setup even though it might feel kind of
22:35 daunting like oh I have to split the layers of my LLM between GPUs. It's not
22:38 actually that complicated. And so two 3090s or two 4090s, that will be necessary.
22:44 Or you could have more of an enterprise-grade GPU like an H100. So
22:49 Nvidia has a lot of these non-consumer grade GPUs that have a lot more VRAM to
22:53 handle things like 70 billion parameter models. And the speed won't be the best
22:58 if you're using something like two 3090s, especially because performance is hurt
23:01 when you have to split an LLM between GPUs. You could expect something like 8
23:06 to 12 tokens per second. And this is obviously if you have the most complex
23:09 agents and you're really trying to match the performance of cloud AI as
23:13 much as possible, that's when you'd want to use a 70 billion parameter model. And
23:16 then if you're investing in hardware to run local AI, I have a couple of quick
23:20 recommendations here. And a lot of this depends on the size of the model that's
23:24 going to be good enough for your use case. And so I'll dive into some
23:28 alternatives for running local AI directly if you want to do testing
23:32 before you buy infrastructure. I'll get into that in a little bit. But for
23:35 recommended builds: if you want to spend around $800 to build a PC, I'd recommend
23:41 getting a 4060 Ti graphics card and then 32 GB of RAM. If you want to spend
23:47 $2,000, I'd recommend either getting a PC with a 3090 and 64 GB of RAM or
23:53 getting that Mac M4 Pro with 24 GB of unified memory. And then lastly, if you
23:58 want to spend $4,000, which is about what I spent for my PC, then I'd
24:02 recommend getting two 3090 graphics cards, and I got both of mine used for
24:07 around $700 each. And then also getting 128 GB of RAM, or you can get a
24:15 Mac M4 Max with 64 GB of unified memory. So, I wanted to really get into the
24:19 nitty-gritty details there. So, I know I spent a good amount of time diving into
24:22 super specific numbers, but I hope this is really helpful for you. No matter the
24:26 large language model or your hardware, you now know generally where you're at
24:30 for what you can run. So, to go along with that information overload, I want
24:34 to give you some specifics, individual LLMs that you can try right now based on
24:38 the size range that you know will work for your hardware. So, just a couple of
24:42 recommendations here. The first one that I want to focus on is DeepSeek R1. This
24:47 is the most popular local LLM ever. It completely blew up a few months ago. And
24:52 the best part about DeepSeek R1 is they have an option that fits into each of
24:56 the size ranges that I just covered in that chart. So they have a 7 billion
25:00 parameter, 14, 32, and 70. The exact numbers that I mentioned earlier. And
25:05 then there is also the full real version of R1, which is 671 billion parameters.
25:10 I'm sorry though, you probably don't have the hardware to run that unless
25:13 you're spending tens of thousands on your infrastructure. So, probably stick
25:16 with one of these based on your graphics card or if you have a Mac computer, pick
25:19 the one that'll work for you and just try it out. You can click on any one of
25:23 these sizes here. And then here's your command to download and run it. And this
25:28 is defaulting to a Q4 quantization, which is what I was assuming in the
25:31 chart earlier. And again, I will cover what that actually means in a little bit
25:35 here. The other one that I want to focus on here is Qwen 3. This is a lot newer.
25:41 Qwen 3 is so good. And they don't have a 70 billion parameter option, but they do
25:45 have all the other sizes that fit into those ranges that I mentioned
25:49 earlier. Like they've got 8 billion, 14 billion, and 32 billion parameters. And
25:52 the same kind of deal where you click on the size that you want and you've got
25:55 your command to install it here. And this is a reasoning LLM just like
26:01 DeepSeek R1. And then the other one that I want to mention here is Mistral Small.
26:05 I've had really good results with this as well. There are fewer options here,
26:08 but you've got 22 or 24 billion parameters, which is going to work well
26:12 with a 3090 graphics card or if you have a Mac M4 Pro with 24 GB of unified
26:18 memory. Really, really good model. And then also, there is a version of it that
26:22 is fine-tuned for coding specifically called Devstral, which is another
26:26 really cool LLM worth checking out as well if you have the hardware to run it.
26:30 So, that is everything for just general recommendations for local LLMs to try
26:34 right now. This is the part of the master class that is going to become
26:38 outdated the fastest because there are new local LLMs coming out every single
26:42 month. I don't really know how long my recommendations will last. But in
26:45 general, you can just go to the model list in Ollama, search around,
26:49 find one that has a size that works with your graphics card, and just give it
26:52 a shot. You can install it and run it very easily with Ollama. And the other
26:57 thing that I want to mention here is you don't always have to run open-source
27:01 large language models yourself. You can use a platform like OpenRouter. You can
27:05 just go to openrouter.ai, sign up, and add in some API credits. You can try these
27:10 open-source LLMs yourself, maybe if you want to see what's powerful enough for
27:15 your agents before you invest in hardware to actually run them yourself.
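Under the hood, OpenRouter exposes an OpenAI-compatible chat completions endpoint, so trying a hosted open-source model is just an HTTP request. A minimal sketch; the model slug `qwen/qwen3-32b` and the placeholder key are assumptions to verify on openrouter.ai:

```python
# Sketch: build a chat completion request for OpenRouter's OpenAI-compatible
# API. The model slug and API key below are placeholders -- check openrouter.ai
# for the exact slug and use your own key.
import json
import urllib.request

def build_request(model, prompt, api_key):
    """Build (but do not send) a chat completion request for OpenRouter."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_request("qwen/qwen3-32b", "Say hello in five words.", "YOUR_API_KEY")
print(req.full_url)  # https://openrouter.ai/api/v1/chat/completions
# With a real key, urllib.request.urlopen(req) would send it and return JSON.
```

Because the request shape is the OpenAI one, swapping between a model hosted here and a model you run yourself later is mostly a matter of changing the base URL and the model name.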
27:18 And so within OpenRouter, I can just search for Qwen here. And I can go down
27:23 to Qwen and I can go to 32 billion. They have a free offering as well that
27:26 doesn't have the best rate limits. So I'll just go to this one right here,
27:31 Qwen 3 32B. So I can try the model out through OpenRouter. They actually host
27:35 it for me. So it's an open-source, non-local version, but now I can try it
27:39 in my agents to see if this is good. And then if it's good, it's like, okay, now
27:43 I want to buy a 3090 graphics card so that I can install it directly through
27:47 Ollama instead. And so the 32 billion Qwen 3 is exactly what we're seeing here
27:51 in OpenRouter. And there are other platforms like Groq as well where you
27:55 can run these open-source large language models not on your own infrastructure
27:58 if you just want to do some testing beforehand or whatever that might
28:01 be. So I wanted to call that out as an alternative as well. But yeah, that's
28:04 everything for my general recommendations for LLMs to try and use
28:09 in your agents. All right, it is time to take a quick breather. This is
28:12 everything that we've covered already in our master class. What is local AI? Why
28:17 we care about it? Why it's the future and hardware requirements. And I really
28:20 wanted to dive deep into this stuff because it sets the stage for everything
28:24 that we do when we actually build agents and deploy our infrastructure. And so
28:28 the last thing that I want to do with you before we really start to get into
28:33 building agents and setting up our package is I want to talk about some of
28:37 the tricky stuff that is usually pretty daunting for anyone getting into local
28:42 AI. I'm talking things like offloading models, quantization, environment
28:47 variables to handle things like flash attention, all the stuff that is really
28:51 important that I want to break down simply for you so you can feel confident
28:55 that you have everything set up right, that you know what goes into using local
29:00 LLMs. The first big concept to focus on here is quantization. And this is
29:04 crucial. It's how we can make large language models a lot smaller so they
29:10 can fit on our GPUs without hurting performance too much. We are lowering
29:14 the model precision here. And so basically what that means is we have
29:18 each of our parameters, all of our numbers for our LLMs, at 16 bits
29:23 for the full size, but we can lower the precision of each of those parameters to
29:28 8, 4, or 2 bits. Don't worry if you don't understand the technicalities of
29:31 that. Basically, it comes down to LLMs are just billions of numbers. That's the
29:35 parameters that we already covered. And we can make these numbers less precise
29:40 or smaller without losing much performance. So, we can fit larger LLMs
29:45 within a GPU that normally wouldn't even be close to running the full-size model.
29:50 Like with 32 billion parameter LLMs, for example, I was assuming a Q4
29:55 quantization, like 4 bits per parameter, in that chart earlier. If you had the
30:00 full 16-bit parameters for the 32 billion parameter LLM, there's no way it could
30:06 fit on your Mac or your 3090 GPU, but we can use quantization to make it
30:10 possible. It's like rounding a number that has a long decimal to something
30:15 like 10.44 instead of this thing that has like 10 decimal places, but we're
30:19 doing it for each of the billions of parameters, those numbers that we have.
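That rounding analogy can be made concrete with a toy sketch. This uses plain uniform quantization over a fixed range; real schemes (like the grouped K-quants Ollama uses) also store per-group scale factors, so this is only illustrative:

```python
# A toy version of quantization: snap each "weight" to the nearest of
# 2**bits - 1 + 1 evenly spaced levels over a fixed range. Fewer bits means
# fewer levels, which means a bigger rounding error per number.
def quantize(weights, bits, lo=-1.0, hi=1.0):
    levels = 2 ** bits - 1               # number of steps between lo and hi
    step = (hi - lo) / levels
    return [lo + round((w - lo) / step) * step for w in weights]

weights = [0.3172, -0.8541, 0.0009, 0.6666]
for bits in (8, 4, 2):
    q = quantize(weights, bits)
    worst = max(abs(w - x) for w, x in zip(weights, q))
    print(f"{bits}-bit worst rounding error: {worst:.4f}")
```

At 8 bits the worst error is under 0.004; at 2 bits it balloons to about a third of the whole range, which is exactly the quality cliff described below for Q2.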
30:23 And so just to give you a visual representation of this, you can also
30:27 quantize images just like you can quantize LLMs. And so we have our full
30:31 scale image on the left-hand side here, comparing it to different levels of
30:35 quantization. We have 16-bit, 8-bit, and 4-bit. And you can see that at first with
30:40 a 16-bit quantization, it almost looks the same. But then once we go down to
30:44 4-bit, you can very much see that we have a huge loss in quality for the image.
30:49 Now, with images it's more extreme than with LLMs. When we do an 8-bit or a 4-bit
30:54 quantization of an LLM, we don't actually lose that much performance, whereas we
30:58 lose a lot of quality with images. And so that's why it's so useful for us. And so I have
31:01 a table just to kind of describe what this looks like. So FP16, that's the
31:07 16-bit precision that all LLMs have as a base. That is the full size. The speed
31:11 is obviously going to be very slow because the model is a lot bigger, but
31:16 your quality is perfect compared to what it could be. I mean, obviously that
31:18 doesn't mean that you're going to get perfect answers all the time. I'm just
31:22 saying it's the 100% results from this LLM. And then going down to a Q8
31:28 precision, that's half the size. The speed is going to be a lot better. And
31:33 the quality is near-perfect. So it's not like performance is cut in half just
31:37 because size is. You still have the same number of parameters. Each one is just a
31:42 bit less precise. And so you're still going to get almost the same results.
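The size column of that table is simple arithmetic: file size scales linearly with bits per parameter. A quick sketch for a 32-billion-parameter model (real downloads run slightly larger because of metadata and per-group scale factors):

```python
# File size scales linearly with bits per parameter: Q8 is half of FP16,
# Q4 a fourth, Q2 an eighth. Computed here for a 32B-parameter model.
PARAMS = 32e9
sizes = {name: PARAMS * bits / 8 / 1e9 for name, bits in
         [("FP16", 16), ("Q8", 8), ("Q4", 4), ("Q2", 2)]}
for name, gb in sizes.items():
    print(f"{name}: ~{gb:.0f} GB")
# prints FP16: ~64 GB, Q8: ~32 GB, Q4: ~16 GB, Q2: ~8 GB
```

This is also why a 24 GB 3090 can hold the Q4 of a 32 billion parameter model but not the Q8.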
31:47 And then going down to a Q4, 4-bit, it's a fourth the size. It's going to be very
31:52 fast compared to 16 bit. And the quality is still going to be great. Now, these
31:57 numbers are very vague on purpose. There's not really a way for me to
32:01 quantify exactly the difference, especially because it changes per LLM
32:05 and your hardware and everything like that. So, I'm just being very general
32:09 here. And then once you get to Q2, the size goes down a lot. It's going to
32:13 be very very fast, but usually your performance starts to go down quite a
32:17 bit once you go down to a Q2. And then like the note that I have in the bottom
32:22 left here, a Q4 quantization is generally the best balance. And so when
32:26 you are thinking to yourself, which large language model should I run? What
32:31 size should I use? My rule of thumb is to pick the largest large language model
32:37 that can work with your hardware with a Q4 quantization. That is why I assumed
32:42 that in the table earlier. And then also, like we saw in Ollama earlier, it always
32:47 defaults to a Q4 quantization, because the 16-bit is just so big compared to Q4
32:52 that most of the LLMs you couldn't even run yourself. And a Q4 of a 32 billion
32:59 parameter model is still going to be a lot more powerful than the full 7
33:02 billion or 14 billion parameter model because you don't actually
33:07 lose that much performance. So that is quantization. So just to make this very
33:11 practical for you, I'm back here in the model list for Qwen 3. We have all these
33:15 models that don't specify a quantization, but we can see that it
33:20 defaults to Q4 because if I click on any one of them, the quantization right here
33:26 is Q4_K_M. And don't worry about the K_M. That's just a way to group
33:30 parameters. You have K_S, K_M, and K_L. It's kind of outside of the scope of
33:33 what really matters for you. The big thing is the Q4 like the actual number
33:38 here. So Q4 quantization is the default for Qwen 3 32B and really any model in
33:44 Ollama. But if we want to see the other quantized variants and we want to run
33:48 them, you can click on the view all. This is available no matter the LLM that
33:52 you're seeing in Ollama. Now we can scroll through and see all the levels of
33:56 quantization for each of the parameter sizes for Qwen 3. So, if I scroll all
34:01 the way down, the absolute biggest version of Qwen 3 available is the
34:08 full 16-bit of the 235 billion parameter Qwen 3. And it is a whopping 470 GB just
34:14 to install this. And there is no way that you're ever going to lay hands on
34:17 infrastructure to run this unless you're working for a very large enterprise. But
34:22 I can go down here, let's say, to 14 billion parameters and I can run the Q4
34:27 like this. So, you can click on any one that you want to run. Like let's say I
34:30 want to run Q8. I can click on this and then I have the command to pull and run
34:35 this specific quantization of the 14 billion parameter model. So each of the
34:39 quantized variants has a unique ID within Ollama. So you can very
34:42 specifically choose the one that you want. Again, my general recommendation is
34:47 just to go with what Ollama recommends, which is just defaulting to
34:51 Q4. Like if I go to DeepSeek R1, you can see that also defaults to Q4 no matter
34:56 the size that I pick. But if you do want to explore different quantizations, you
35:00 want to try to run the absolute full model for maybe something smaller like 7
35:04 billion or 14 billion, you can definitely do that through Ollama and
35:08 really any other provider of local LLMs. So that is everything for quantization.
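That rule of thumb from earlier, pick the largest model your hardware can handle at a Q4 quantization, can be written as a tiny helper. It assumes roughly 0.5 bytes per parameter at Q4 plus 20% overhead, which is my own ballpark rather than an exact formula:

```python
# Rule of thumb as code: given a VRAM budget in GB, pick the largest model
# (in billions of parameters) that still fits at a Q4 quantization.
# 0.5 bytes/param at Q4 plus ~20% overhead is an assumption, not exact.
def largest_q4_model(vram_gb, sizes_b=(7, 14, 32, 70)):
    fits = [s for s in sizes_b if s * 0.5 * 1.2 <= vram_gb]
    return max(fits) if fits else None

print(largest_q4_model(24))  # a 3090's 24 GB -> 32 (billion parameters)
print(largest_q4_model(8))   # a 3060 Ti's 8 GB -> 7
```

Plug in your own card's VRAM and the answer matches the tiers from the hardware section.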
35:12 It's important to know how that works, but yes, generally stick with a Q4 of
35:17 the largest LLM that you can run. The next concept that is very important to
35:21 understand is offloading. Offloading is just splitting the layers of your
35:26 large language model between your GPU and your CPU and RAM. It's kind of
35:30 crazy, but large language models don't have to fit entirely in your GPU. All
35:36 large language models can be split into layers, layers of the different weights,
35:40 and you can have some of it running on your GPU. So, it's stored in your VRAM
35:45 and computed by the GPU. And then some of the large language model is stored in
35:50 your RAM, computed by the CPU. Now, this does hurt performance a lot. And so,
35:55 generally, you want to avoid offloading if you can. You want to be able to fit
35:59 everything in your GPU, which by the way, the context, like your prompts for
36:04 your local LLMs, that is also stored in VRAM. And so, sometimes you'll see what
36:08 happens when you have very long conversations with a large language model
36:13 that barely fits in your GPU. That'll actually tip it over the edge. So, it
36:16 starts to offload some of it to the CPU and RAM. So keep that in mind: when you
36:19 have longer conversations and all of a sudden things get really slow, you know
36:24 that offloading is happening. Sometimes this is necessary though as context
36:28 grows. And if you're only offloading a little bit of the LLM or a little bit of
36:32 the conversation, whatever to the CPU and RAM, it won't affect performance
36:36 that much. And so sometimes if you're trying to squeeze the biggest size you
36:41 can into your machine for an LLM, you can take advantage of offloading to run
36:45 something bigger or have a much larger conversation. Just know that usually it
36:49 kind of sucks. Like when I have offloading start to happen, my machine
36:53 gets bogged down and the responses are a lot slower. It's really not fun, but it
36:59 is possible. And fun fact, by the way, if your GPU is full and your CPU and RAM
37:04 are full, you can actually offload to storage, like literally using your hard
37:07 drive or SSD. That's when it's like incredibly slow and just terrible. But
37:11 just fun fact, you can actually do that. Now, the very last thing that I want to
37:15 cover before we dive into some code, setting up the local AI package, and
37:21 building out some agents is a few very crucial parameters, environment
37:25 variables for Ollama. So, these are environment variables that you can set
37:28 on your machine just like any other based on your operating system. And
37:32 Ollama does have an FAQ for setting up some of these things, which I'll link to
37:36 in the description as well. But yeah, these are a bit more technical, so
37:40 people skip past setting this stuff up a lot, but it's actually really, really
37:44 important to make things very efficient when running local LLMs. So the first
37:49 environment variable is flash attention. You want to set this to one or true.
37:54 When you have this set to true, it's going to make the attention calculation
37:59 a lot more efficient. It sounds fancy, but basically large language models when
38:04 they are generating a response, they have to calculate which parts of your
38:08 prompt to pay the most attention to. That's the calculation. And you can make
38:12 it a lot more efficient without losing much performance at all by setting up
38:16 the flash attention, setting that to true. And then for another optimization,
38:21 just like we can quantize the LLM itself, you can also quantize or
38:27 compress the context. So your system prompt, the tool descriptions, your
38:31 prompt and conversation history, all that context that's being sent to your
38:36 LLM, you can quantize that as well. So Q4 is my general recommendation for
38:41 quantizing LLMs. Q8 is the general recommendation for quantizing the
38:46 context memory. It's a very simplified explanation, but it's really, really
38:50 useful because a long conversation can also take a lot of VRAM just like larger
38:55 LLM. And so it's good to compress that. And then the third environment variable,
38:58 this is actually probably the most crucial one to set up for Ollama. There
39:02 is this crazy thing. I don't know why Ollama does it, but by default, they
39:07 limit every single large language model to 2,000 tokens for the context limit,
39:13 which is just tiny compared to, you know, Gemini being 1 million tokens and
39:17 Claude being 200,000 tokens. Like, they handle very, very large prompts. And a
39:21 lot of local large language models can also handle large prompts. But Ollama
39:25 will limit you by default to 2,000 tokens. And so you have to override that
39:30 yourself with this environment variable. And so generally I recommend starting
39:34 with about 8,000 tokens. You can move this all the way up to
39:38 something like 32,000 tokens if your local large language model supports
39:42 that. And if you view the model page on Ollama, you can see the context length
39:46 that's supported by the LLM. But you definitely want to, you know, jack this
39:50 up more from just 2,000 because a lot of times when you have longer
39:53 conversations, you're going to get past 2,000 tokens very, very quickly. So, do
39:57 not miss this. If your large language model is starting to go completely off
40:02 the rails and ignore your system prompt and forget that it has these tools that
40:06 you gave it, it's probably because you reached the context length. And so, just
40:10 keep that in mind. I see people miss this a lot. And then the very last
40:14 environment variable, probably the least important out of all these four,
40:18 but if you're running a lot of different large language models at once and you're
40:22 trying to shove them all in your GPU, a lot of times you can have issues. And so
40:25 in Ollama, you can limit the number of models that are allowed to be in your
40:29 memory at a single time. With this one, typically you want to set this to either
40:33 one or two. Definitely set this to just one if you are using large language
40:37 models that basically fill your GPU, like it's going to fit exactly into
40:41 your VRAM and you're not going to have room for another large language model.
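For reference, here is a sketch of how those four settings might be exported before starting the Ollama server. The variable names are the ones I believe current Ollama versions use, but they can change between releases, so verify them against the Ollama FAQ:

```shell
# Sketch: Ollama tuning via environment variables. Verify the exact names and
# accepted values against the Ollama FAQ for your version.
export OLLAMA_FLASH_ATTENTION=1       # more efficient attention calculation
export OLLAMA_KV_CACHE_TYPE=q8_0      # quantize the context (KV cache) to 8-bit
export OLLAMA_CONTEXT_LENGTH=8192     # raise the small default context limit
export OLLAMA_MAX_LOADED_MODELS=1     # how many models may sit in memory at once
```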
40:44 But if you are running smaller ones and maybe you could actually fit two on
40:48 your GPU with the VRAM that you have, you can set this to two. So again, more
40:52 technical overall, but it's very important to have these right. And we'll
40:55 get into the local AI package where I already have these set up in the
40:59 configuration. And then by the way, this is the Ollama FAQ that I referenced a
41:02 minute ago that I'll have linked in the description. And so there's actually a
41:06 lot of good things to read into here, like being able to verify that your GPU
41:10 is compatible with Ollama. How can you tell if the model's actually loaded on
41:13 your GPU? So, a lot of like sanity check things that they walk you through in the
41:17 FAQ as well. Also talking about environment variables, which I just
41:20 covered. And so, they've got some instructions here depending on your OS
41:23 how to get those set up. So, if there's anything that's confusing to you, this
41:26 is a very good resource to start with. So, I'm trying to make it possible for
41:30 you to look into things further if there's anything that doesn't quite make
41:33 sense for what I explained here. And of course, always let me know in the
41:35 comments if you have any questions on this stuff as well, especially the more
41:39 technical stuff that I just got to cover because it's so important even though I
41:43 know we really want to dive into the meat of things, which we are actually
41:47 going to do now. All right, here is everything that we have covered at this
41:50 point. And congratulations if you have made it this far because I covered all
41:55 the tricky stuff with quantization and the hardware requirements and offloading
41:59 and some of the configuration parameters. So, if you got all of that,
42:03 the rest of it is going to be a walk in the park as we start to dive into code,
42:07 getting all of our local AI set up and building out some agents. You understand
42:10 the foundation now that we're going to build on top of to make some cool stuff.
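Before moving on, the two Ollama environment variables covered above can be sketched in code. A minimal Python sketch, assuming you launch `ollama serve` yourself; the variable names come from the Ollama FAQ, and the values are the recommendations from this section:

```python
import os

# Ollama reads these at startup (names from the Ollama FAQ):
#   OLLAMA_CONTEXT_LENGTH    - default context window (roughly 2,000 tokens out of the box)
#   OLLAMA_MAX_LOADED_MODELS - how many models may sit in memory at once
ollama_env = {
    **os.environ,
    "OLLAMA_CONTEXT_LENGTH": "8192",   # ~8k tokens, the starting point recommended here
    "OLLAMA_MAX_LOADED_MODELS": "1",   # keep at 1 if one model nearly fills your VRAM
}

# You would then launch the server with this environment, e.g.:
#   subprocess.Popen(["ollama", "serve"], env=ollama_env)
print(ollama_env["OLLAMA_CONTEXT_LENGTH"])
```

In a plain shell, the equivalent is exporting the variables (for example `export OLLAMA_CONTEXT_LENGTH=8192`) before running `ollama serve`.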
42:15 And so, now the next thing that we're going to do is talk about how we can use
42:19 local AI anywhere. We're going to dive into OpenAI compatibility and I'll show
42:23 you an example. We can take something that is using OpenAI right now,
43:27 transform it into something that is using Ollama and a local LLM. So, we'll
42:31 actually dive into some code here. And I've got my fair share of no code stuff
42:35 in this master class as well, but I want to focus on both because I think it's
42:38 really important to use both code and no code whenever applicable. And that
42:42 applies to local AI just like building agents in general. So, I've already
42:45 promised a couple of times that I would dive into OpenAI API compatibility, what
42:50 it is, and why it's so important. And we're going to dive into this now so you
42:54 can really start to see how you can take existing agents and transform them into
42:59 being 100% local with local large language models without really having to
43:03 touch the code or your workflow at all. It is a beautiful thing because OpenAI
43:10 has created a standard for exposing large language models through an API.
43:14 It's called the chat completions API. It's kind of like how the Model Context
43:19 Protocol (MCP) is a standard for connecting agents to tools. The chat
43:23 completions API is a standard for exposing large language models over an
43:28 API. So you have this common endpoint along with a few other ones that all of
43:35 these providers implement. This is the way to access the large language model
43:39 to get a response based on some conversation history that you pass in.
43:43 So, Ollama has implemented this as of February. We have other providers too:
43:49 Gemini is OpenAI compatible, so is Groq, and OpenRouter, which we saw earlier.
43:53 Almost every single provider is OpenAI API compatible. And so, not only is it
43:57 very easy to swap between large language models within a specific provider, it's
44:02 also very easy to swap between providers entirely. You can go from Gemini to
44:09 OpenAI, or OpenAI to Ollama, or OpenAI to Groq, just by changing basically one
44:13 piece of configuration pointing to a different base URL as it is called. So
44:18 you can access that provider and then the actual API endpoint that you hit
44:22 once you are connected to that specific provider is always the exact same and
44:26 the response that you get back is also always the exact same. And so Ollama has
44:31 this implemented now. And I'll link to this article in the description as well
44:33 if you want to read through this because they have a really neat Python example.
44:37 It shows where we create an OpenAI client and the only thing we have to do
44:42 to connect to Ollama instead of OpenAI is change this base URL. So now we are
44:47 pointing to Ollama that is hosted locally instead of pointing to the URL for
44:51 OpenAI. So we'd reach out to them over the internet and talk to their LLMs. And
44:55 then with Ollama, you don't actually need an API key because everything's running
44:58 locally. So you just need some placeholder value here. But there is no
45:02 authentication that is going on. You can set that up. I'm not going to dive into
45:05 that right now. But by default, because it's all just running locally, you don't
45:09 even need an API key to connect to Ollama. And then once we have our OpenAI
45:13 client set up that is actually talking to Ollama, not OpenAI, we can use it in
45:18 exactly the same way. But now we can specify a model that we have downloaded
45:22 locally already through Ollama. We pass in our conversation history in the same way
45:27 and we access the response, like the content the AI produced, the token usage,
45:31 like all those things that we get back from the response in the same way.
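That base-URL swap is the whole trick. As a sketch of what the article's example boils down to, here is the same call made with only the Python standard library; the model name and placeholder key are illustrative, and the request shape is the standard chat completions payload:

```python
import json
import urllib.request

def chat(base_url: str, api_key: str, model: str, messages: list) -> dict:
    """POST to the /chat/completions endpoint of any OpenAI-compatible server."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps({"model": model, "messages": messages}).encode(),
        headers={
            "Content-Type": "application/json",
            # Ollama ignores the key, but the header keeps the request
            # identical to what you would send to api.openai.com.
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

messages = [{"role": "user", "content": "Say hello in one sentence."}]

# With Ollama running locally (uncomment to try):
# reply = chat("http://localhost:11434/v1", "ollama", "qwen3:14b", messages)
# print(reply["choices"][0]["message"]["content"])
```

Swapping the base URL to `https://api.openai.com/v1` and passing a real key is the only change needed to talk to OpenAI instead.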
45:34 They've got a JavaScript example as well. They have a couple of examples
45:38 using different frameworks like the Vercel AI SDK and AutoGen. Really any
45:44 AI agent framework can work with OpenAI API compatibility to make it very easy
45:47 to swap between these different providers. Pydantic AI, my favorite
45:52 AI agent framework, also supports OpenAI API compatibility. So you can easily
45:57 within your Pydantic AI agents swap between these different providers. And
46:02 so what I have for you now is two code bases that I want to cover. The first
46:07 one is the local AI package, which we'll dive into in a little bit. But right
46:13 now, we have all of the agents that we are going to be creating in this master
46:17 class. So I have a couple for N8N that are also available in this repository.
46:21 And then a couple of scripts that I want to share with you as well. And so the
46:25 very first thing that I want to show you is this simple script that I have called
46:31 OpenAI compatible demo. And so you can download this repository. I'll have this
46:34 linked in the description as well. There's instructions for downloading and
46:38 setting up everything in here. And this is all 100% local AI. And so with that,
46:43 I'm going to go over into my Windsurf here where I have this OpenAI compatible
46:47 demo set up. So I've got a comment at the top reminding us what the OpenAI API
46:52 compatibility looks like. We set our base URL to point to Ollama hosted
46:59 locally and it's hosted on port 11434 by default. So I can actually show you
47:02 this. I have Ollama running in a Docker container, which we're going to dive
47:05 into when we set up the local AI package, but you can see that it is
47:11 being exposed on port 11434. And by the way, you can see the
47:14 127.0.0.1 in that URL that I have highlighted here, that is synonymous with localhost.
47:21 And so this right here, you could also replace with 127.0.0.1.
47:25 Just a little tidbit there. It's not super important. I just typically leave
47:28 it as localhost. And then you can change the port as well. I'm just sticking to
47:32 what the default is. And then again, we don't need to set our API key. We can
47:36 just set it to any value that we want here. We just need some placeholder even
47:39 though there is no authentication with Ollama for real unless you configure
47:43 that. So that's OpenAI compatibility. And the important thing with this script
47:47 here is I have two different configurations here. I have one for
47:51 talking to OpenAI and then one for Ollama. So with OpenAI, we set our base
47:57 URL to point to api.openai.com. We have our OpenAI API key set in our
48:01 environment variables. So you can just set all your environment variables here
48:05 and then rename this to .env. I've got instructions for that in the readme of
48:08 course. And then going back to the script, we are using GPT-4.1 Nano for our
48:13 large language model. It's something super fast and cheap. And then for our
48:17 Ollama configuration, we are setting the base URL here, localhost:11434,
48:22 or just whatever we have set in our environment variables. Same thing for
48:26 the API key. And then same thing for our large language model. And what I'm going
48:31 to be using in this case is Qwen3 14B. That is one of the large language models
48:34 that I showed you within the Ollama website. Definitely a smaller one
48:38 compared to what I could run, but I just want to run something fast. And very
48:41 small large language models are great for simple tasks like summarization or
48:45 just basic chat. And that's what I'm going to be using here just for a simple
48:49 demo. And so whether it's enabled or not, this configuration is just based on
48:53 what we have set for our environment variables. And the important thing here
48:59 is the code that runs for each of these configurations just as we go through
49:03 this demo is exactly the same. We are parameterizing the configuration for the
49:08 base URL and API key. So we are setting up the exact same OpenAI client just
49:13 like we saw in the Ollama article, but just changing the base URL and API key.
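The demo script's pattern of parameterizing only the configuration can be sketched like this; the dictionary keys and model names here are illustrative stand-ins, not the exact variables from the repository:

```python
import os

# Two provider configurations, one code path. The agent code never changes;
# only the values below decide whether you talk to OpenAI or Ollama.
PROVIDERS = {
    "openai": {
        "base_url": "https://api.openai.com/v1",
        "api_key": os.environ.get("OPENAI_API_KEY", ""),
        "model": "gpt-4.1-nano",
    },
    "ollama": {
        "base_url": "http://localhost:11434/v1",
        "api_key": "ollama",  # placeholder; Ollama has no auth by default
        "model": "qwen3:14b",
    },
}

def client_config(provider: str) -> dict:
    """Return the kwargs you would hand to OpenAI(base_url=..., api_key=...)."""
    cfg = PROVIDERS[provider]
    return {"base_url": cfg["base_url"], "api_key": cfg["api_key"]}

print(client_config("ollama")["base_url"])
```

Everything downstream (the client creation, the chat completion call, the response handling) stays byte-for-byte identical between the two providers.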
49:17 And so then for example when we use it right here, it's
49:22 client.chat.completions.create, calling the exact same function no matter if we're
49:26 using OpenAI or Ollama. And then we're handling the response in the same way as
49:31 well. And so I'll go back to my terminal now. And so I went through all the steps
49:34 already to set up my virtual environment, install all of my dependencies. And so now I can run the
49:40 command OpenAI compatible demo. And now it's going to present the two
49:43 configuration options for me. And so I can run through OpenAI. So we'll go
49:46 ahead and do that first. And these two demos are going to look exactly the
49:50 same, but that is the point. And so we have our base URL here for OpenAI. We
49:54 have a basic example of a completion with GPT-4.1 Nano. There we go. So this
49:59 is the model that was used. Here are the number of tokens. And this is our
50:03 response. And then I can press enter to see a streaming response now as well. So
50:07 we saw it type out our answer in real time. And then I can press enter one
50:10 more time. This is the last part of the demo. Just say multi-turn conversation.
50:14 So we got a couple of messages here in our conversation history. So very nice
50:19 and simple. The point here is to now show you that I can run this and select
50:23 Ollama now instead and everything is going to look exactly the same and all
50:27 of the code is the same as well. It is only our configuration that is
50:31 different. And so it will take a little bit when you first run this because
50:36 Ollama has to load the large language model into your GPU. And so going to the
50:42 logs for Ollama, I can show you what this looks like here. And so when we first
50:47 make a request when Qwen3 14B is not loaded into our GPU yet, you're going to
50:52 see a lot of logs come in here, and you'll have this container up and
50:54 running when you have the local AI package which we'll cover in a little
50:57 bit. So it shows all the metadata about our model, like it's Qwen3 14B. We can
51:05 see here that we have a Q4_K_M quantization like we saw on the Ollama
51:09 website. What other information do we have here? There's just so much to
51:14 digest here. Another really important thing is we have the
51:19 context length. I have that set to 8,192 just like I recommended in the
51:22 environment variables. And then we can see that we offloaded all of the layers
51:26 to the GPU. So I don't have to do any offloading to the CPU or the RAM. I can
51:30 keep everything in the GPU, which is certainly ideal, like I said, to make
51:34 sure this is actually fast. And then when we get a response from Qwen3 14B,
51:41 we are calling the v1/chat/completions endpoint because it is OpenAI API
51:46 compatible. So that exact endpoint that we hit for OpenAI is the one that we are
51:50 hitting here with a large language model that is running entirely on our computer
51:54 in Ollama. And so the response I get back, it's actually a reasoning LLM as
51:58 well. So we even have the thinking tokens here, which is super cool. And so
52:02 we got our response. It's just printing out the first part of it here just to
52:04 keep it short. And then I can press enter. And we can see a streaming demo
52:08 as well. And it's going to be a lot faster this time because we do already
52:11 have the model loaded into our GPU. And so that first request when it first has
52:15 to load a model is always the slower one. And then it's faster going forward
52:19 once that model is already loaded in our GPU. And then as long as we don't swap
52:24 to another large language model and use that one, then it will remain in our GPU
52:28 for some time. And so then all of our responses after are faster. And then we
52:33 just have the last part of our demo here with a multi-turn conversation. So we
52:37 can see conversation history in action as well, just not with streaming here.
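Under the hood, multi-turn conversation is nothing more than resending a growing message list, since the model itself is stateless. A minimal sketch (the message contents are made up for illustration):

```python
# The model is stateless: multi-turn chat just means resending a growing
# list of messages on every /v1/chat/completions call.
history = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Name a planet."},
]

# Append the model's reply (shown here as a stand-in value) and the next
# user turn before making the following request.
history.append({"role": "assistant", "content": "Mars."})
history.append({"role": "user", "content": "How many moons does it have?"})

# The full list goes back to the server on the next request, which is why
# long conversations blow past a 2,000-token context window so quickly.
print(len(history))
```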
52:40 And everything's a bit slower with this large language model because
52:43 it is a reasoning one. So if you want faster
52:47 inference, you can always use a non-reasoning local LLM like Mistral or
52:52 Gemma, for example. So that is our very simple demo showing how this works. I
52:55 hope that you can see with this and again this works with other AI agent
52:59 frameworks like Agno or Pydantic AI or CrewAI as well. They all work in
53:03 this way, where you can use OpenAI API compatibility to swap between providers
53:08 so easily, so you don't have to recreate things to use local AI. And that's
53:11 something so important that I want to communicate with you because if I'm the
53:15 one introducing you to local AI I also want to show you how it can very easily
53:19 fit into your existing systems and automations. All right. Now, we have
53:23 gotten to the part of the local AI master class that I'm actually the most
53:27 excited for because over the past months, I have very much been pouring my
53:31 heart and soul into building up something to make it infinitely easier
53:35 for you to get everything up and running for local AI. And that is the local AI
53:40 package. And so, right now, we're going to walk through installing it step by
53:44 step. I don't want you to miss anything here because it's so important to get
53:47 this up and running, get it all working well. Because if you have the local AI
53:51 package running on your machine and everything is working, you don't need
53:55 anything else to start building AI agents running 100% offline and
53:59 completely private. And so here's the thing. At this point, we've been
54:04 focusing mostly on Ollama and running our local large language models. But there's
54:08 the whole other component to local AI that I introduced at the start of the
54:13 master class for our infrastructure: things like our database, local and
54:18 private web search, our user interface, agent monitoring. We have all these
54:23 other open-source platforms that we also want to run along with our large
54:27 language models and the local AI package is the solution to bring all of that
54:32 together curated for you to install in just a few steps. So, here is the GitHub
54:37 repository for the local AI package. I'll have this linked in the description
54:41 below. Just to be very clear, there are two GitHub repos for this master class.
54:45 We have this one that we covered earlier. This has our N8N and Python
54:49 agents that we'll cover in a bit, as well as the OpenAI compatible demo that
54:53 we saw earlier. So, you want to have this cloned and the local AI package as
54:57 well. Very easy to get both up and running. And if you scroll down in the
55:02 local AI package, I have very comprehensive instructions for setting
55:06 up everything, including how to deploy it to a private server in the cloud,
55:10 which we'll get into at the end of this master class, and a troubleshooting
55:13 section at the bottom. So, everything that I'm about to walk you through here,
55:17 there's instructions in the readme as well if you just want to circle back to
55:21 clarify anything. Also, I dive into all of the platforms that are included in
55:26 the local AI package. And this is very important because like I said, when you
55:30 want to build a 100% offline and private AI agent, it's a lot more than just the
55:35 large language model. You have all of the accompanying infrastructure like
55:39 your database and your UI. And so I have all that included. First of all, I have
55:44 N8N, that is our low-code/no-code workflow automation platform. We'll be building
55:48 an agent with N8N in the local AI package in a little bit once we have it
55:52 set up. We have Supabase for our open-source database. We have Ollama. Of
55:56 course, we want to have this in the package as well for our LLMs. Open Web
56:01 UI, which gives us a ChatGPT-like interface for us to talk to our LLMs and
56:06 have things like conversation history. Very, very nice. So, we're looking at
56:09 this right here. This is included in the package. Then we have Flowise. It's
56:13 similar to N8N. It's another really good tool to build AI agents with no/low
56:18 code. Qdrant, which is an open-source vector database. Neo4j, which is a
56:25 knowledge graph engine. And then SearXNG for open-source, completely free and
56:31 private web search. Caddy, which is going to be very important for us once
56:35 we deploy the local AI package to the cloud and we actually want to have
56:38 domains for our different services like N8N and Open WebUI. And then the last
56:42 thing is Langfuse. This is an open-source LLM engineering platform. It helps
56:47 us with agent observability. Now, some of these services are outside of the scope
56:52 for this local AI master class. I don't want to spend a half hour on every
56:56 single one of these services and make this a 10-hour video. I will be focusing
57:02 in this video on N8N, Supabase, Ollama, Open WebUI, SearXNG, and then Caddy once
57:08 we deploy everything to the cloud. So, I do cover like half of these services.
57:12 And the other thing that I want to touch on here is that there are quite a few
57:16 things included here. And so you do need about 8 GB of RAM on your machine or
57:21 your cloud server to run everything. It is pretty big overall. And so you can
57:26 remove certain things, like if you don't want Qdrant and Langfuse for example,
57:30 you can take those out of the package. More on that later. It doesn't have to
57:34 be super bloated, you can whittle this down to what you need. But yeah, there's
57:37 a lot of different things that go into building AI agents. And so I have all of
57:40 these services here so that no matter what you need, I've got you covered. And
57:44 so with that, we can now move on to installing the local AI package. And
57:48 these instructions will work for you on any operating system, any computer. Even
57:52 if you don't have a really good GPU to run local large language models, you
57:56 still could always use OpenAI or Anthropic, something like that, and then
58:00 run everything else locally to save on costs or just to have everything running
58:04 on your computer. And so there are a couple of prerequisites that you have to
58:08 have before you can do the instructions below. You need Python so you can run
58:12 the start script that boots everything up. Git or GitHub desktop so you can
58:16 clone this GitHub repository, bring it all onto your own machine. And then you
58:21 want Docker or Docker Desktop. And so I've got links for all of these. Docker
58:25 and Docker Desktop we need because all of these local AI services that I've
58:29 curated for you, they all run as individual Docker containers that are
58:34 all combined together in a stack. And so I'll actually show you, this is the end
58:36 result once we have everything up and running within your Docker Desktop. You
58:41 have this local AI Docker Compose stack that has all of the services running in
58:45 tandem, like Supabase and Redis and N8N and Flowise, Caddy, Neo4j. All of
58:50 these are running within this stack. That is what we're working towards right
58:54 now. And so make sure you have all these things installed. I've got links that'll
58:57 take you to installing no matter your operating system. Very easy to get all
59:01 of this up and running on your machine. Then we can move on to our first command
59:05 here, which is to clone this GitHub repository, bringing all of this code on
59:10 your machine so you can get everything running. And so you want to open up a
59:14 new terminal. So I've got a new PowerShell session open here. Going to
59:18 paste in this command. And I'm going to be doing this completely from scratch
59:22 with you. So you clone the repo and then I'm just going to change my directory
59:26 into local AI package, which was just created from this git clone command. So
59:31 those are the first two steps. The next thing is we have to configure all of our
59:36 environment variables. And believe it or not, this is actually the longest part
59:41 of the process. And once we have this taken care of, it's a breeze getting the
59:44 rest of this up and running. But there's a lot of configuration that we have to
59:49 set up for our different services like credentials for logging into our
59:54 Supabase dashboard or Neo4j. Things like our Supabase anonymous key and
59:59 private key. All these things we have to configure. And so within our terminal
60:04 here, you can do `code .` to open this within VS Code or Windsurf. I'll open this in
60:09 Windsurf. You just want to open up this folder within your IDE, and the specific
60:14 IDE that you use really doesn't matter. You just want to get to this .env.example
60:20 here. I'm going to copy it and then I'm going to paste it. And then I'm going to
60:24 rename this to .env. So we're taking the example, turning it into a .env file. So,
60:31 you want to make sure that you copy it and rename it like this. Then we can go
60:36 ahead and start setting all of our configuration. And I'll even zoom in on
60:39 this just so that it's very easy for you to see everything that we are setting up
60:44 here. So, first up, we have a couple of credentials for N8N. We have our
60:49 encryption key and our JWT secret. And it's very easy to generate these. In
60:53 fact, we'll be doing this a couple of times, but we'll use this open SSL
60:58 command to generate a random 32-character alphanumeric string that
61:02 we're going to use for things like our encryption key and JWT secret. And so,
61:08 OpenSSL is a command that is available for you by default on Linux and macOS.
61:12 You can just open up any terminal and run this command and it'll spit out a
61:16 long string that you can then just paste in for this value. For Windows, you
61:20 can't just open up any terminal and use OpenSSL, but you can use Git Bash, which
61:26 is going to come with GitHub Desktop when you install it. And so, I'll go
61:29 ahead and just search for that. If you just go to your search bar on your
61:32 bottom left on Windows and search for Git Bash, it's going to open up this
61:37 terminal like this. And so, I can go ahead and copy this command, go in here,
61:42 and paste it in. And then I can run it. And then, boom, there we go. I
61:45 know it's really small for you to see right now. I'm going to go ahead and
61:48 copy this because this is now the value that I can use for my encryption key.
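For reference, the Python route mentioned as an alternative can be as simple as the standard library's `secrets` module; the exact one-liner in the repo's readme may differ, but the idea is the same:

```python
import secrets

# Python equivalent of the OpenSSL approach: a random 32-character hex
# string for values like the N8N encryption key or JWT secret.
def generate_secret(n_bytes: int = 16) -> str:
    return secrets.token_hex(n_bytes)  # 16 bytes -> 32 hex characters

print(len(generate_secret()))
```

Run it once per secret so every key gets its own independent random value.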
61:52 And then you want to do the exact same thing to generate a JWT secret. And then
61:57 the other way that you can do this if you don't want to install git bash or
62:01 it's not working for whatever reason, you can use Python to generate this as
62:05 well. So I can just copy this command and then I can go into the terminal here
62:10 and I can just paste this in. And so it's going to just like with OpenSSL
62:15 generate this random 32 character string that I can copy and then use for my JWT
62:20 secret. There we go. And so I am going to get in the weeds a little bit here
62:24 with each of these different parameters, but I really want to make sure that I'm
62:27 clear on how to set up everything for you so you can really walk through this
62:31 step by step with me. And like I said, setting up the environment variables is
62:35 the longest part by far for getting the local AI package set up. So if you bear
62:39 with me on this, you get through this configuration, you will have everything
62:43 running that you need for local AI for the LLMs and your infrastructure. So
62:47 that's everything for N8N. Now we have some secrets for Supabase. And there
62:53 are some instructions in the Supabase documentation for how to get some of
62:57 these values. So it's this link right here, which I have open up on my
63:01 browser. So we'll reference this in a little bit here. But first, we can
63:05 set up a couple of other things. The first thing we need to define is our
63:11 Postgres password. So, Supabase uses Postgres under the hood for the
63:14 database. And so, we want to set a password here that we'll use to connect
63:19 to Postgres within N8N or a connection string that we have for our Python code,
63:23 whatever that might be. And this value can be really anything that you want.
63:26 Just note that you have to be very careful about using special characters like
63:31 percent symbols. So if you ever have any issues with Postgres, it's probably
63:36 because you have special characters that are throwing it off. That's something
63:39 that I've seen happen quite a few times. And so like I said, I want to mention
63:42 troubleshooting steps and things to make sure that it is very clear for you. So
63:47 for this Postgres password here, I'm just going to say test Postgres pass.
63:51 I'm just going to give some kind of random value here. Just end with a
63:54 couple of numbers. I don't care that I'm exposing this information to you because
63:58 this is a local AI package. These passwords are for services that never
64:02 leave my computer. So, it's not like you could hack me by connecting to anything
64:08 here. And then we have a JWT secret. And this is where we get into this link
64:13 right here in the Supabase docs. And so they walk you through generating a JWT
64:18 secret and then using that to create both your anonymous and your service
64:22 role keys. If you're familiar with Supabase at all, we need both of these
64:27 pieces of information. The anonymous key is what we share to our front end. This
64:30 is our public key. And then the service role key has all permissions for
64:33 Supabase. We'll use this in our backends for things like our agents. And
64:39 so you can just go ahead and copy this JWT secret.
64:43 And then you can paste this in right here. This is 32 characters long just
64:47 like the things that we generated with OpenSSL. I'm just going to be using
64:52 exactly what Supabase tells me to. And then what you can do with this is you
64:56 can select the anonymous key. Click on generate JWT and then I can copy this
65:02 value and then I will paste this for my anonymous token. And so I'm just
65:06 replacing the default value there for the anonymous key. And then going back
65:10 and selecting the service key, I'm going to generate that one as well. So it
65:13 looks very similar. They'll always start with ey, but these values are different
65:18 if you go towards the end. And so I'll go ahead and paste this for my service
65:22 role key. Boom. There we go. All right. And then for the Supabase dashboard
65:27 that we'll log into to see our tables and our SQL editor and authentication
65:31 and everything like that, we have our username here, which I'm just going to
65:35 keep as supabase. And then for the password, I can just say test supabase
65:39 pass. I'll just kind of use that as my common nomenclature here for my
65:42 passwords because I don't really care what that is right now. And then the last
65:45 thing that we have to set up is our pooler tenant ID. And it's not really
65:49 important to dive into what exactly this means. Just know that you can set this
65:52 to really anything that you want. Like I typically will just choose four digits
65:57 here, like 1000, for my pooler tenant ID. So that is everything that we need for
66:01 Supabase. And actually most of the configuration is for Supabase. Then we
66:06 have Neo4j. This is really simple. You can leave Neo4j for the username and
66:11 then I'll just say test Neo4j pass for my password here. So you just set the
66:15 password for the knowledge graph, and even if you're not using Neo4j you still have
66:19 to set this, but it just takes two seconds. Then we have Langfuse. This is
66:23 for agent observability. We have a few secrets that we need here. And for these
66:28 values they can really just be whatever you want. It doesn't matter because
66:31 these are just passwords, just like we had passwords for things like Neo4j.
66:35 So I can just say test clickhouse pass. And then I can do test minio pass. And
66:43 I mean, it really doesn't matter here. A random Langfuse salt. I'm just doing
66:47 completely whack values here. You probably want something more secure in
66:51 this case, but I'm just doing something as a placeholder for now.
66:56 There we go. Okay, good. And then the last thing that we need
66:59 for Langfuse is an encryption key. And this is also generated with OpenSSL like
67:04 we did for the N8N credentials. And so I'll go back to my git bash terminal.
67:08 And again, you can do this with Python as well. I'll just run the exact same
67:12 command. I'll get a different value this time. And so I'll go ahead and copy
67:16 that. You could technically use the same value over and over if you wanted to,
67:20 but obviously it's way more secure to use a different value for each of the
67:24 encryption keys that you generate with OpenSSL. So there we go. That is our
67:28 encryption key. And that is actually everything that we have to set up for
67:32 our environment variables when we are just running the local AI package on our
67:37 computer. Once we deploy it to the cloud and we actually want domains for our
67:41 different services like Open WebUI and N8N, then we'll have to set up Caddy. So
67:45 this is where we'll dive into domains and we'll get into this at the end of
67:49 the master class here. But everything past this point for environment
67:54 variables is completely optional. You can leave all of this exactly as it is
67:59 and everything will work. Most of this is just extra configuration for
68:08 Supabase. So, Supabase is definitely the biggest service that's included in
68:08 this list of, you know, curated services for you. And so, there's a lot of
68:11 different configuration things you can play around with if you want to dive
68:15 more into this. You can definitely look at the same documentation page that we
68:22 were using for the Supabase secrets. And so, you can scroll through this if
68:27 you want to learn more, like setting up email authentication or Google
68:32 authentication, diving more into all of those different configuration things
68:36 for Supabase if you want to. I'm not going to get into all
68:40 of this right now because the core of getting Supabase up and running we
68:44 already have taken care of with the credentials that we set up at the top
68:47 right here. And so these are just the base things, and that's
68:47 what we'll stick to right now. So that is everything for our environment
68:51 variables. So then going back to our README, which I have open directly in
68:55 Windsurf now instead of my browser, we have finished our configuration and I do
68:59 have a note here that you want to set things up for Caddy if you're deploying
69:03 to production. Obviously we're doing that later not right now like I said and
69:07 so with that we are good to start everything. Now before we spin up the
69:12 entire local AI package there is one thing that I want to cover. It's
69:14 important to cover this before we run things. If you don't want to run
69:19 everything in the package, cuz it is a lot, maybe you only want to use half
69:22 of these services and you don't want Neo4j and Langfuse and Flowise right
69:28 now. There are two options that you have. The easiest one right now is to go
69:33 into the docker compose file. This is the main file where all of the services
69:38 are curated together and you can just remove the services that you don't want
69:42 to include. So, for example, if you don't want Qdrant right now, cuz it is
69:46 actually one of the larger services, it's like 600 megabytes of RAM just
69:50 having this running, you can search for Qdrant, and you can just go ahead and
69:54 delete this service from the stack like that. Boom. Now I don't have Qdrant.
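For illustration, a service entry in a Compose file looks roughly like this (the names and values here are a hypothetical sketch, not the package's exact definition); deleting the whole block removes the service, and the matching entry under the top-level volumes key can go too:

```yaml
services:
  qdrant:                        # delete this whole block to drop the service
    image: qdrant/qdrant
    ports:
      - "6333:6333"
    volumes:
      - qdrant_storage:/qdrant/storage

volumes:
  qdrant_storage: {}             # and remove the matching volume entry here
```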
69:59 It won't spin up as a part of the stack anymore. And then I also have a volume
70:03 for Qdrant. So, you can remove that as well. Volumes, by the way, are how we are
70:08 able to persist data for these containers. So if we tear down
70:12 everything and then we spin it back up, we still are going to have our open web
70:17 UI conversations and our N8N workflows, everything in Supabase, like all that
70:21 is still going to be saved because we're storing it all in volumes. So we can do
70:25 whatever the heck we want with these containers. We can tear them down. We
70:28 can update them, which I'll show you how to do later. We can spin it back up. And
70:32 all of our data will always be persisted. So you don't have to worry
70:35 about losing information. And you can always back things up if you want to be
70:39 really secure, but I've never done that before and I've been updating this
70:42 package for months and months and months and all of my workflows from 6 months
70:46 ago are still there. I haven't lost anything. And so that's just a quick
70:50 caveat there for how you can remove services if you want. And then another
70:54 thing that we don't have available yet, but I'm very excited to
70:58 talk about right now. It's in beta. We are creating it, me and
71:03 one other guy who's actually on my Dynamous team, Thomas. He's got a
71:07 YouTube channel as well. He's a great guy. We're working together on this.
71:09 He's actually been putting in most of the work creating a front-end
71:13 application for us to manage our local AI package. And one of the big things
71:17 with this is that we're going to make it possible for you to toggle on and off
71:22 the services that you want to have within your local AI package. So you can
71:27 very much customize the package to the services that you want to run. So you
71:31 can keep it lightweight just to the things you care about. Also, we'll be
71:34 able to manage environment variables and monitor the containers. Not all of this
71:38 is up and running at this point, but this is in beta. We're working on it.
71:41 I'm really excited for this. So, not available yet, but once this is
71:44 available, this will be a really good way for you to customize the package to
71:47 your needs. So, you don't have to go and edit the docker compose file directly.
71:51 So, that's something that I just wanted to get out of the way now. But, we can
71:57 start and actually execute our package now. Get all these containers up and
72:02 running. So the command that you run to start the local AI package is different
72:07 depending on your operating system and the hardware that you have. So for
72:13 example, if you are an Nvidia GPU user, you want to run this start_services.py
72:18 script. This boots up all of the containers, and you want to specifically
72:23 pass in the profile gpu-nvidia. This is going to start Ollama in a way where the
72:29 Ollama container is able to leverage your GPU automatically. And then if you are
72:34 using an AMD GPU and you're on Linux, then you can run it this way. Which by
72:38 the way, unfortunately, if you have an AMD GPU on Windows, you aren't able to
72:46 run Ollama in a container. And it's the same thing with Mac computers.
72:49 Unfortunately, like you see right here, you cannot expose your GPU to the Docker
72:55 instance. And so if you have an AMD GPU on Windows or are running on a Mac, you cannot
73:01 run Ollama in the local AI package. You just have to install it on your own
73:04 machine like I already showed you in this master class, and then you'll just
73:08 run everything else through the local AI package, and it can actually go out to
73:12 your machine and communicate with Ollama directly. So just a small limitation for
73:17 Mac and AMD on Windows. But if you're running on Linux or an Nvidia GPU on
73:21 Windows like I'm using, then you can go ahead and run this command right here.
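As a sketch, the start commands by platform look like this (the profile names here follow the convention just described; treat the exact flags as an assumption and check the README in your copy of the package):

```shell
# Nvidia GPU (Windows or Linux):
python start_services.py --profile gpu-nvidia

# AMD GPU on Linux:
python start_services.py --profile gpu-amd

# CPU only:
python start_services.py --profile cpu

# Don't start Ollama in the stack at all (use a host install instead):
python start_services.py --profile none
```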
73:27 So if you can't use a GPU in the Ollama container, then you can always just
73:32 start in CPU mode, or you can run with a profile of none. This will actually make
73:36 it so that Ollama never starts in the local AI package. So you can just
73:40 leverage the Ollama that you have already running on your computer like I showed
73:43 you how to install already. So, just a couple of small caveats that I really
73:46 want to hit on there. I need to make sure that you're using the right
73:51 command. And so, in my case, I'm Nvidia on Windows. So, I'm going to copy this
73:55 command. Go back over into my terminal. I'll just clear it here. So, we have a
73:59 blank slate. And I'll paste in this command. And so, it's going to do quite
74:02 a few things initially. First, it's going to clone the Supabase repository
74:07 because Supabase actually manages its stack in a separate place. And so, we
74:10 have to pull that in. Then there's some configuration for SearXNG for our local
74:17 and private web search. And then I have a couple of warnings here saying that
74:20 the Flowise username and password are not set. By the way, if
74:24 you want to set the Flowise username and password, it's optional, but you can
74:29 do that if I scroll down right here. So you can set these values, those will
74:32 actually make those warnings go away, but you can also ignore them, too. So
74:35 anyway, I just wanted to mention that really quickly. But now what's happening
74:39 here is it starts by running all of the Supabase containers. And so there's
74:44 quite a bit that goes into Supabase, like I said. So we're running all of
74:47 that. It's getting all that spun up. And then once we run all of these, it's
74:51 going to move on to deploying the rest of our stack. And if you're running this
74:55 for the very first time, it will take a while to download all of these images.
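While you wait, you can watch the containers come up from a second terminal; `docker ps` is standard Docker CLI and shows each container's status as it starts:

```shell
# List running containers with their status and published ports.
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
```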
74:59 They're not super small. There's a lot of infrastructure that we're starting up
75:03 here. And so it'll take a bit. You just have to be patient. Maybe go grab your
75:06 coffee or make your next meal, whatever that is. And then everything will be up
75:09 and running once you are back. And so yeah, now you can see that we are
75:13 running the rest of the containers here. Um, and so we'll just wait for that to
75:16 be done. And then I'll show you what that looks like in Docker Desktop as
75:19 well. And so I'll give it a second here just to finish. Uh, looks like my
75:24 terminal glitched a little bit. Like I was scrolling and so it kind of broke it
75:27 a bit. But anyway, everything is up and running now. It'll look like this where
75:30 it'll say all of the containers are healthy or running or started. And then
75:34 if I go into Docker Desktop and I expand the local AI Compose stack, you want to
75:39 make sure that you have a green dot for everything except for the Ollama pull and
75:45 N8N import. These just run once initially and then they go down because
75:49 they're responsible for pulling some things for our local AI package. And so
75:53 yeah, I've got green dots for everything except for two right here. Now I'm
75:57 leaving this in here intentionally actually because there is a bug with
76:02 Supabase specifically if you are on Windows. So you'll see this issue where
76:08 the Supabase pooler is constantly restarting, and that also affects N8N
76:12 because N8N relies on the Supabase pooler. So it's constantly restarting as
76:17 well. If you see this problem, I actually talk about this in the
76:21 troubleshooting section of the readme. If you scroll all the way down, if the
76:24 Supabase pooler is restarting, you can check out this GitHub issue. And so I
76:29 linked to this right here, and he tells you exactly which file you want to
76:33 change. It's this one right here: docker/volumes/pooler/pooler.exs.
76:39 And you need to change the file's line endings to LF. And so I'll show you what I mean
76:43 by that. I'll show you exactly how to do this. It's like a super tiny random
76:47 thing, but this has tripped up so many people. So I want to include this
76:51 explicitly in the master class here. So you want to go within the supabase
76:56 folder, within docker, then volumes, then pooler, and then we have
77:00 pooler.exs. And basically, no matter your IDE, you can see the CRLF in the bottom right here.
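If you'd rather fix it from the command line than from the editor's status bar, stripping the carriage returns works too; a sketch using sed (the file path is the one mentioned above):

```shell
# Convert CRLF line endings to LF by deleting the trailing \r on each line.
sed -i 's/\r$//' supabase/docker/volumes/pooler/pooler.exs
```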
77:08 You want to click on this and then change it to lf and then make sure that
77:13 you save this file. Very easy to fix that. And then what you can do is you
77:19 can run the exact same command to spin everything up again. And so I'm going to
77:22 do this now. It's going to go through all the same steps. It'll be faster this
77:25 time because you already have everything pulled. And this, by the way, is how you
77:28 can just restart everything really quickly if you want to enforce new
77:31 environment variables or anything like that. So I want to include that
77:35 explicitly for that reason as well. And I'll go ahead and close out of this.
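For reference, restarting and upgrading both reuse the same profile flag. A sketch assuming the gpu-nvidia profile and a Compose project named localai (both are assumptions; check the README for the exact commands in your copy):

```shell
# Restart / apply new environment variables (same command as the first start):
python start_services.py --profile gpu-nvidia

# Upgrade flow: stop the containers, pull newer images, then start again.
docker compose -p localai --profile gpu-nvidia down
docker compose -p localai --profile gpu-nvidia pull
python start_services.py --profile gpu-nvidia
```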
77:39 And while this is all restarting, the other thing that I want to show you
77:42 in the readme is I also have instructions for upgrading the containers in the local AI package. So
77:49 when N8N has an update or Supabase has an update, it is your responsibility
77:53 because you're managing the infrastructure to update things yourself. And so you very simply just
77:58 have to run these three commands to update everything. You want to tear down
78:03 all of the containers and make sure you specify your profile, like gpu-nvidia, and
78:09 then you want to pull all of the latest images, again specifying your
78:13 profile. And then once you do those two things, you'll have the most up-to-date
78:17 versions of the containers downloaded. So you can go ahead and run
78:21 start_services.py with your profile just like we just did to restart things. Very easy to
78:25 update everything. And even though we are completely tearing down our
78:29 containers here before we upgrade them, we aren't losing any information because
78:33 we are persisting things in the volumes that we have set up at the top of our
78:37 Docker Compose stack. And so this is where we store all of our data, our
78:41 database and workflows. All these things are persisted. So we don't have
78:45 to worry about losing them. Very easy to upgrade things and you still get to keep
78:48 everything. You don't have to make backups and things like that unless you
78:54 just want to be ultra ultra safe. So now we can go back to our Docker desktop and
79:00 we've got green dots for everything now since we fixed that pooler.exs issue.
79:04 The only thing that we don't have green dots for is the N8N import and then we
79:08 have our Ollama pull as well because, like I said, those are the two things that
79:11 just have to run at the beginning and then they aren't ongoing processes like
79:16 the rest of our services. So, we have everything up and running. And if there
79:22 is anything that is a white dot besides the Ollama pull or N8N import, or if there's
79:27 anything that is constantly restarting, just feel free to post a comment and
79:31 I'll definitely be sure to help you out. And then also check out the
79:34 troubleshooting section as well. One thing that I'll mention really quick is
79:37 sometimes your N8N will constantly restart and it'll say something like the
79:42 N8N encryption key doesn't match what you have in the config. And the big
79:46 thing to keep in mind for that is you want to make sure that you set this
79:51 value for the encryption key before you ever run it for the first time.
79:53 Otherwise, it's going to generate some random default value and then if you
79:56 change this later, it won't match with what it expects. And so, yeah, my big
79:59 recommendation is like make sure you have everything set up in your
80:03 environment variables before you ever run the start services for the first
80:08 time. This should be run once you have your environment variables set up.
80:11 Otherwise, you risk any of these services creating default values that
80:14 then wouldn't match with the keys and things that you set up later. And so
80:18 with that, we can now go into our browser and actually explore all of
80:22 these local AI services that we have running on our computer now. Now over in
80:26 our browser, we can start visiting the different services that we have spun up.
80:30 Like here is N8N. You just have to go to localhost port 5678. It'll have you
80:35 create a local account when you first visit it. And then you'll have this
80:38 workflow view that should look very familiar to you if you have used N8N
80:41 in the past. And then we have Open Web UI, localhost port 8080. This is our
80:47 ChatGPT-like interface where we can directly talk to all of the models that we have
80:52 pulled in our Ollama container. Really, really neat. And then we have localhost
80:57 port 8000 for our Supabase dashboard. The sign-in definitely isn't pretty
81:00 compared to the managed version of Supabase. But once you enter in your
81:04 username and password that you have set for the environment variables for the
81:07 dashboard, then you have the very typical view where we have our tables
81:11 and we've got our SQL editor. Everything that you're familiar with with
81:14 Supabase. And that's the key thing with all these different services. They all
81:18 will look the exact same for you, pretty much. Like another one, for example:
81:25 if I go to localhost port 3000, we have Langfuse. This is for agent
81:28 observability and monitoring. And this is something I'm not going to dive into
81:31 in this master class. Like I said, I'm not covering all the services. But yeah,
81:34 I just want to show that like every single one of these pretty much you can
81:39 access in your browser. And by the way, the way that we know the specific port
81:43 to access for each of these services is by taking a look at either what it tells
81:48 us in Docker Desktop. So, like, we can see that Neo4j is, let's see, port
81:53 7474. For SearXNG, it's port 8081. For Flowise, it's port 3001. What's one
82:01 that we've seen already? Yeah, like Open Web UI is port 8080. So,
82:06 the port on the left is the one that we access in our browser. And then the port
82:11 on the right is what's mapped on the container. So, when we visit port 8080
82:16 on our computer, that goes into port 8080 on the container. And that's what
82:21 we have exposed. The other way that you can see the port that you need to use is
82:24 just by taking a look at this docker compose file. And you don't need to have
82:29 like a super good understanding of this docker compose file. But if you want to
82:33 customize your stack or even help me by making contributions to the local AI
82:36 package, this is the main place to make changes. And so for example, I can go
82:41 down to Flowise and I can see that the port is 3001. Or if I go down to, let's say, N8N, we can
82:49 see that the port is 5678. And so the port is always going to be
82:52 there somewhere in the service that you have set up. Like for the Langfuse
82:56 worker, it's 3030. That's more of a behind-the-scenes kind of service. But
82:59 let me just find one more example for you here. Yeah, like Redis, for
83:04 example is 6379. So you can see the ports in the Docker Compose as well. I
83:07 just want to call it out just to at least get you a little bit comfortable
83:11 and familiar with the Docker Compose file in case you want to customize
83:14 things. But the main thing is just leveraging what you see here in Docker
83:18 Desktop. Last thing in Docker Desktop really quickly, if you want to bring
83:21 more local large language models into the mix, you can do it without having to
83:25 restart anything. You just have to find the Ollama container in the Docker
83:29 Compose stack. Head on over to the exec tab. And now here we can run any
83:33 commands that we'd want. We're directly within the container here. And we can
83:36 use Ollama commands just like we did earlier on our host machine. And so, for
83:40 example, I ran ollama list already. So I can see the large language models that
83:43 have already been pulled in my Ollama container. If I want to pull more, I can
83:48 just do ollama pull and then find the ID for the model I want to use on the Ollama
83:52 website. And like I said, you don't have to restart anything. If I pull it here,
83:56 it's now in the container and I can immediately start using it in Open Web
84:00 UI or N8N. We'll see that in a little bit. And so that's just really important
84:03 because a lot of times you're going to want to start to use different large
84:06 language models and you don't want to have to restart anything. The ones that
84:11 are brought into the machine by default is determined by this line right
84:15 here. So if you want to change the ones that are pulled by default: I just have
84:20 Qwen 2.5 7B Instruct, a really small, lightweight model, that I have
84:23 brought into your Ollama container by default. If you want to add in
84:27 different ones, you can just update this line right here to include multiple
84:32 Ollama pulls. And so that way you can bring in Qwen 3 or Mistral Small 3.1,
84:36 whatever you want. This is just the one I have by default. And then all the
84:40 other ones that you saw in my list here, I've pulled myself. All right. Now that
84:45 we have the local AI package up and running, it is time to build some
84:50 agents. Now, we get to use our local AI package to actually build out an
84:53 application. And so, I'm going to start by introducing you to Open Web UI, and
84:58 we'll use it to talk to our Ollama LLM. So, we have an application kind of right
85:02 out of the box for us. Then I'll dive into building a local AI agent with N8N,
85:08 even connecting it to Open Web UI. So we have this custom agent that we built in
85:12 N8N and then we immediately have a really nice UI to chat with it. And then
85:16 we'll transition to Python building the exact same agent in Python as well. Like
85:21 I said, I want to focus on both no code and code to really make this a complete
85:24 master class so that whether you want to build with N8N or Python, you can see
85:28 how to connect to our different services that we have running locally like
85:32 Supabase and SearXNG and Open Web UI. So, we'll cover all of that and then I'll
85:36 get into deployments after this. But yeah, let's go ahead right now focus on
85:40 open web UI and building out some agents. So, back over in Open Web UI,
85:45 remember this is localhost port 8080. You want to set up your connection to
85:49 Ollama so we can start talking with our local LLMs in this nice interface. And
85:54 so bottom left, go to the admin panel, then go to settings and then the
85:58 connections tab. Here we can set up our connections both to OpenAI with our API
86:02 key, which we're not going to do right now, but then also the Ollama API. This
86:07 is what we want to set up. Now, usually by default, this value is just
86:12 localhost. And this is actually wrong. This is something that is so important
86:16 to understand. And this will apply when we set up credentials in N8N and Python
86:20 as well. When you are within a container, localhost means that you are
86:26 still referencing within the container. Open Web UI needs to reach out to the
86:32 Ollama container, not itself. So localhost is not correct here. This is
86:36 generally the default just because Open Web UI assumes that you're running on
86:39 your machine, and so then you would also have Ollama running on your machine. So
86:42 localhost usually works when you're outside of containers. But here we have
86:47 to change this. This is super important to get right. And so there are two
86:51 options we have. If you are running on a Mac or AMD on Windows and you want to
86:56 use Ollama running on your machine, not within a container, then you want to use
87:00 host.docker.internal. This is the way in Docker to tell the container to look outside to the host
87:07 machine, where you're running the containers and where Ollama is running
87:11 separately. Very important to know that. And then if you are running Ollama in the
87:16 container like I am doing, I have Ollama running in my Docker Desktop, you want
87:21 to change this to ollama. You're specifically calling out the service
87:26 that is running the Ollama container in your Docker Compose stack. And the way
87:30 that we know that this is the name specifically is because we just go back
87:35 to our all-important Docker Compose file. Ollama. So whenever there's an x and a
87:40 dash, you just ignore that. It's just the thing after it. So, ollama is the name
87:45 of our service running the container. And then if we wanted to connect to
87:49 something else like Flowise, flowise is the name of the service. Open Web UI,
87:55 it's open-webui. All of these top-level keys, these are the names when we
87:59 want our containers to be talking to each other. And all of this is possible
88:03 because they are within the same Docker network. And so I'll just show you that
88:06 so you know what I'm talking about here. If I go back to Docker Desktop, we have
88:11 this local AI Compose stack. All of these containers can now communicate
88:14 internally with each other by referencing the names, like Redis or
88:19 SearXNG. So, we'll be seeing that a lot when we're building out our agents as
88:22 well. So, I wanted to spend a couple minutes to focus on that. And so, you
88:25 can go ahead and click on save in the very bottom right. I know my face is
88:28 covering this right now, but you have a save button here. Make sure you actually
88:32 do that. And for this API key, I don't know why it's asking me to fill it
88:34 out. I don't really care about connecting to OpenAI. So I'll just put
88:37 some random value there and click save. And then boom, there we go. We are good.
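To recap the connection rule from this section, the two possible Ollama base URLs look like this (the service name ollama and the default port 11434 are the values used in this walkthrough):

```
# Ollama running as a service inside the Compose stack:
http://ollama:11434

# Ollama installed on the host machine (Mac, or AMD GPU on Windows):
http://host.docker.internal:11434
```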
88:40 And then a lot of times with Open Web UI, it also helps to refresh, otherwise
88:43 it doesn't load the models for some reason. So I just did a refresh of the
88:49 site here. Control F5. And then now we can select all of the local LLMs that we
88:53 have pulled in our Ollama container. And so, for example, I can use Qwen 2.5 7B.
88:57 That's the one that I just have by default. I can say hello. And it's going
89:02 to take a little bit cuz it has to load this model onto my GPU just like we saw
89:06 with Qwen 3 earlier. But then in a second here we'll get a response. And
89:10 there are actually multiple calls that are being done here. We have one to get
89:15 our response, one to get a title for our conversation on the left-hand side. And
89:19 then also if you click on the three dots here, you can see that it created a
89:23 couple of tags for this conversation. So a couple of things are fired off all
89:26 at once there. And I can test conversation history. What did I just
89:30 say? So yeah, I mean everything's working really well here. We have chat
89:34 history, conversation history on the left-hand side. There's so much that we
89:37 get out of the box. And so I wanted to show you this really quickly. Now we can
89:41 move on to building an agent in N8N. And I'll even show you how to connect it to
89:46 Open Web UI as well through this N8N agent connector. Really exciting stuff.
89:49 So let's get right into it. So I'm going to start really simple here by building
89:53 a basic agent. The main thing that I want to focus on is just connecting to
89:57 our different local AI services. So I am going to assume that you have a basic
90:01 knowledge of N8N here because this is not an N8N master class. And so I'm
90:04 starting with a chat trigger so we can talk to our agent directly in the UI.
90:08 We'll connect this to open web UI in a bit as well. And then I want to connect
90:14 an AI agent node. And so what we want to do is connect Ollama for the chat model and
90:18 then local Supabase for our conversation history, our agent memory.
90:22 And so for the chat model, I'm going to pick the Ollama chat model. I'm going to create
90:26 brand new credentials. You can see me do this from scratch. The URL that you want
90:32 for the base URL is exactly the same as what we just entered into open web UI.
90:35 And so if you are running Olama on your host machine like an AMD on Windows or
90:40 you are running on a Mac or you just don't want to run the Olama container,
90:46 then it is host.doccker.in. And then if you are referencing the
90:49 Olama container, we just reference Olama. That's the name of the service
90:53 running the Olama container in our stack. And then the port is 11434 by
90:57 default. And you can test this connection. So it'll do a quick ping to
91:01 the container to make sure that we are good to go. And I'll even show you what
91:04 that looks like. So right here in my Ollama container, I have the logs up. And
91:09 the last two requests were just simple GET requests to the root endpoint. We
91:13 have two of those right here. And if I click on retry and I go back to the
91:18 logs, boom, we are at three now. So it made three requests. So it's just making
91:22 that simple ping each time to make sure the container is available. And so I'm
91:26 going to go ahead and click on save and then close out. So now we have our
91:29 credentials and then we can automatically select the model that we
91:33 have loaded now in our container. And so just to keep things really lightweight,
91:36 I'm going to go with the 7 billion parameter model right now from Qwen 2.5.
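If that test ping ever fails, you can hit the same service yourself; `/api/tags` is the Ollama API route that lists pulled models (use the host-side URL here, since you're calling from your own terminal rather than from another container):

```shell
# From the host, list the models the Ollama container has pulled.
curl http://localhost:11434/api/tags
```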
91:40 Cool. All right. So that is everything that we need to connect Ollama. It is
91:44 that easy. And then we could even test it right now. So, I'm going to go ahead
91:47 and save this workflow. And I'm going to just say hello. And uh we don't need the
91:52 conversation history or tools or anything at this point. We're already
91:55 getting a response here from the LLM. It's working on loading the model into
91:59 my GPU as we speak. And so there we go. We got our answer looking really good.
92:04 Cool. So now we can add memory as well. So I'm going to add Postgres because,
92:08 remember, Supabase uses Postgres under the hood. And then I'm going to create
92:12 brand new credentials here. And this is actually probably the hardest one to set
92:16 up out of all of the credentials for connecting to our local AI service. And
92:20 so I'm going to show you what the Docker Compose file looks like, just so it's
92:24 clear how I'm getting these different values. And so I'll point out all of
92:28 them. So the first one, for our host, it is db, because this is the name of the
92:35 specific Supabase service that is the underlying Postgres
92:38 database. And I can show you how I got that really quick. If you go to the
92:42 supabase folder that we pull when we run that start_services script, I go to
92:47 docker and then docker compose. If I search for db and there's quite a few
92:53 dependencies on db here. So let me find the actual reference to it. Where is db?
92:58 Here we go. So yeah, it's really short. db is the name of our service that
93:03 actually is the Supabase DB. So this is the container name, this is what
93:07 you'll see in docker desktop. But then this is the underlying service that we
93:10 want to reference when we have our containers communicating with each
93:14 other. Like in this case, we have our N8N container talking to our Supabase
93:18 database container. And then the database and username are both going to
93:22 be postgres. Those are the values that we have by default. If you scroll down a
93:26 bit in the .env, you can see these right here. The Postgres database is
93:29 postgres and the user is also postgres. And you can customize these
93:33 things but these are some of the optional parameters that I didn't touch
93:36 in the setup with you. And so you can just leave those as is. Now the
93:41 Postgres password, this is one of the values that we set. That was the first
93:44 Supabase value that we set there. Make sure it matches what you have in
93:48 the .env. And then everything else you can kind of leave as the defaults here. The port is
93:53 going to be 5432. So that is everything for setting up our connection to
93:57 Postgres. You can test this connection as well. And then we can move on to
94:01 adding in some tools and things like that as well. But yeah, this is like the
94:06 very first basic version of the agent that I wanted to show you. And hopefully
94:10 with this you can see how, no matter the service that you have running in the
94:13 local AI package, it's very easy to figure out how to connect to it, both
94:17 with the help of N8N because N8N always makes it really easy to connect to
94:20 things. Then also just knowing that like you just have to reference that service
94:25 name that we have for the container in the Docker Compose stack. That's how we
94:28 can talk to it. So you could add in Qdrant or you could add in Langfuse.
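As one more recap of connecting a service, the Postgres credentials assembled above look like this (db, postgres, and 5432 are the walkthrough's defaults; the password is whatever you set in your .env):

```
Host:     db          (the Supabase Postgres service name in the stack)
Database: postgres
User:     postgres
Password: <your Postgres password from the .env>
Port:     5432
```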
94:31 Like you can connect anything that you want into our agent here. And so now we
94:36 have conversation history. Next up, I want to show you how to build a bit more
94:40 of a complicated agent with N8N using some tools. And then also I'm going to
94:44 show you how to connect it to Open Web UI. And so right now this is a live
94:47 demo. Instead of connecting to one of the Olama LLMs, I'm going straight to
94:53 N8N. I have this custom N8N agent connector. And so we are talking to this
94:57 agent that I'll show you how to build in a little bit. This one has a tool to use
95:02 SearXNG for local and private web search. This is one of the platforms that we
95:06 have included in the local AI package. And so this response is going to take a
95:10 little bit here because it has to search the web. And the response that it
95:13 generates with this question is pretty long. Like there we go. Okay. So we got
95:16 the answer. It's pretty long. But yeah, we are able to search the internet now
95:21 with a local agent, N8N connected to Open Web UI. We're getting pretty fancy
95:25 here. And we also have the title that was generated on the left. And then we
95:30 have the tags here as well. And so the way that this all works, I'm going to
95:33 start by explaining how we can connect N8N to Open Web UI. And this is just
95:37 crucial. Makes it so easy for us to test agents locally as we are developing
95:42 them. And so if you go to the settings and the admin panel in the bottom left
95:47 and go to functions, open web UI has this thing called functions which gives
95:51 us the ability to add in custom functionality kind of as like custom
95:56 models that we can then use like you saw with the N8N agent connector. And so
96:02 what I have here is this thing that I call the N8N pipe. And I'll have a link
96:05 to this in the description as well. I created this myself and I uploaded it to
96:10 the open web UI directory of functions. And so you can go to this link right
96:14 here. You can even just Google the N8N pipe for open web UI. And then you click
96:19 on this get button. It'll just have you enter in the URL for your open web UI.
96:23 So I can just like paste in this right here. Click on import to open web UI and
96:28 it'll automatically redirect you to your open web UI instance. So you'll have
96:33 this function now. And we don't have to dive into the code for how all
96:36 of this works. I worked pretty hard to create this for you. Actually, quite a
96:40 while ago I made this. And the thing that we need to care about is
96:44 configuring this to talk to our N8N agent. And so if you click on the
96:49 valves, the setting icon in the top right, there are a few values that we
96:54 have to set. And so now I'm going to go over to showing you how to build things
96:57 in N8N. Then all of this will click and it'll make sense. Right now, looking at
97:00 these values, you're probably like, how the heck do I get all of these? But
97:02 don't worry, we'll dive into all of that. But first, let's go into our N8N
97:07 agent. I'll explain how all of this works. So, first of all, we have our
97:12 chat trigger that gives us the ability to communicate with our agent very
97:16 easily in the workflow. We have a new trigger now for the web hook. And so,
97:22 this is turning our agent into an API endpoint. So, we're able to talk to it
97:27 with other services like open web UI. And so to configure the web hook here,
97:31 you want to make sure that it is a post request type. And then you can define a
97:35 custom path here. Whatever you set here is going to determine what our URL is.
97:40 So we have our test URL. And then also if you toggle the workflow to active,
97:44 this is really important. The workflow in n8n does have to be active. Then you
97:49 have access to this production URL. And this is actually the first value that we
97:54 need to set within the valves for this open web UI function. We have our N8N
97:59 URL. And because this is a container talking to another container, we don't
98:03 actually want to use this localhost value that it has here for us. We want
98:08 to specify N8N because N8N again is the name of the service running the N8N
98:13 container in our Docker Compose stack. So N8N port 5678. And then this is the
98:18 custom URL that we can determine based on this. And then the other thing that
98:23 we want to do is set up header authentication. We don't want to expose
98:27 this endpoint without any kind of security. And so we want to set up some
98:31 authentication. And so you can select header off from the authentication
98:34 dropdown. And then for the credentials here, I'll just create brand new ones to
98:38 show you what this looks like. The name needs to be authorization with a capital
98:43 A. This has to be very specific. The name in the top left and the name of
98:46 your credentials. This can be whatever you want, but this has to be
98:51 authorization. And then the value here, the way that we want to format this is:
98:55 it's going to be Bearer, then a space, and then whatever you want
99:00 your bearer token to be. So this is what you get to define, but it needs to start
99:05 with Bearer, capital B, and a space. And then whatever you type after the Bearer
99:09 and the space goes in as the n8n bearer token. So you don't include the Bearer and
99:13 space there, because it's just assumed that it's going to be prefixed with that.
99:16 So you just type in something like test auth, which is what I
99:21 have. So my full value is Bearer test auth, like that. And then this is what I
99:24 enter in for this field. Now I already have mine set up. So I'm just going to
99:27 go ahead and close out of this. And then the last thing that we have to set up
99:30 for the web hook. And don't worry, this is the node that we spend the most time
99:33 with. You want to go to the drop down here and change this to respond using
99:38 the respond to web hook node. very important because then at the end of our
99:40 workflow and we get the response from our agent, we're going to send that back
99:45 to whatever requested our API which is going to be open web UI in this case.
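To make that webhook configuration concrete, here is a hedged sketch of the POST request Open WebUI's pipe would send to the n8n webhook. The field names `chatInput`/`sessionId`, the path, and the `test auth` token are illustrative assumptions, not pulled from the repo:

```python
import json
import urllib.request

# Hypothetical values: inside the Compose stack, the n8n service is
# reached by its service name rather than localhost.
N8N_URL = "http://n8n:5678/webhook/invoke-n8n-agent"
BEARER_TOKEN = "test auth"  # whatever you typed after "Bearer " in the credential

def build_webhook_request(chat_input: str, session_id: str) -> urllib.request.Request:
    """Build the POST request Open WebUI's pipe would send to the n8n webhook."""
    payload = json.dumps({"chatInput": chat_input, "sessionId": session_id}).encode()
    return urllib.request.Request(
        N8N_URL,
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Header auth: name "Authorization", value "Bearer <your token>"
            "Authorization": f"Bearer {BEARER_TOKEN}",
        },
    )

req = build_webhook_request("What is the price of the 5090?", "session-1")
print(req.get_header("Authorization"))  # Bearer test auth
```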
99:48 And so that's everything for our configuration for the web hook. Now the
99:53 next thing that we have to do is we have to determine is open web UI sending in a
99:58 request to get a response for our main agent or is it just looking to generate
100:03 that conversation title or the tags for our conversation? Because like we were
100:06 looking at earlier, I'm going to close out of this for now and go back to a
100:10 conversation, our last conversation here. We get our main response, but then
100:15 also there is a request to an LLM to create a very simple title for our
100:19 conversation and the tags that we can see in the top right. And so our n8n
100:24 workflow actually gets invoked three separate times for just the first
100:30 message in a new conversation. And so we need to determine, are we getting a main
100:34 response? Like should we go to our main agent or should we just go to a simple
100:39 LLM that I have set up here to help generate the tags or title? And so the
100:44 way that we can determine that is whenever Open Web UI is requesting
100:47 something like a title for a conversation, it always prefixes the
100:53 prompt with three pound symbols, a space, and then the word task. And so we
100:58 can key off of this. If the prompt starts with this, and that prompt just
101:02 is coming in from our web hook here. If it does start with it, then we're just
101:06 going to go to this simple LLM, we're just going to be using Qwen 2.5 14B
101:11 Instruct. We have no tools, no memory or anything like our main agent because
101:14 we're just very simply going to generate that title or the tags. And I can even
101:19 show you in the execution history what that looks like. So in this case, we
101:22 have our web hook that comes in. The chat input starts with the triple pound
101:28 and task. And so sure enough, we are deeming it to be a metadata request, as
101:32 I'm calling it. And so then it goes down to this LLM that is just
101:36 generating some text here. We just have this JSON response with the tags for the
101:41 conversation, technology, hardware, and gaming. So we're asking about the price
101:45 of the 5090 GPU. And then we do the exact same thing to also generate the
101:51 title GPU specs. And so exactly what we see here is the title of this last
101:55 conversation. So I hope that makes sense. And then if it doesn't start with
101:59 the triple pound and task, so it's actually a real request, then we go to our
102:03 main agent. We don't want our main agent to have to handle those super simple
102:06 tasks. You can also just use a really tiny LLM. Like this would be the perfect
102:11 case to actually use a super tiny LLM, even something like DeepSeek R1 1.5B,
102:15 because it's just such a simple task. Otherwise though we are going to
102:20 go to our main agent. And so I'm not going to dive into like all these nodes
102:24 in a ton of detail, but basically we are expecting the chat input to contain
102:30 the prompt for our agent. And the way that we know to expect chat input
102:35 specifically is because going back to the settings for the function here with
102:38 the valves, we are saying right here chat input. So you want to make sure
102:43 that the value that you put in here for input matches exactly with what you are
102:48 expecting from our web hook. And so chat input is the one that I have by default.
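The routing decision described here fits in a few lines of Python. The exact `### Task` prefix casing is an assumption based on the description of three pound symbols, a space, and the word task:

```python
TASK_PREFIX = "### Task"  # three pound symbols, a space, then "Task" (assumed casing)

def is_metadata_request(chat_input: str) -> bool:
    """Open WebUI prefixes its title/tag generation prompts with this marker,
    so we can route those to a small, tool-free LLM instead of the main agent."""
    return chat_input.startswith(TASK_PREFIX)

print(is_metadata_request("### Task: Generate a concise title"))  # True
print(is_metadata_request("What is the price of the 5090?"))      # False
```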
102:51 So you can just copy me if you want. Then we go into our agent where we're
102:55 hooked into Ollama and we've got our local Supabase. I already showed you
102:58 how to connect up all this and that looks exactly the same. The only thing
103:01 that is different now is we have a single tool to search the web with
103:07 SearXNG. So it's a web search tool. I have a description here just telling it what
103:11 it is going to get back from using this tool. And then for the workflow ID, this
103:16 is if I go to add a node here and I go for workflow tools, call n8n
103:23 workflow tool. So this is basically taking an N8N workflow and using it as a
103:28 tool for our agent. So this is the node that we have right here. But then I'm
103:32 referencing the ID of this N8N workflow. So this ID because I'm going to just
103:37 call the subworkflow that I have defined below. And again, I don't want to dive
103:40 into all the details of n8n right now and how this all works, but the agent is
103:44 going to decide the query. What should I search the web with? It decides that and
103:49 then it invokes this sub workflow here where we have our call to SearXNG. So the
103:54 name of the container service in our Docker Compose stack is just searxng
103:59 and it runs on port 8080. And then if you look at the SearXNG documentation, you
104:03 can look at how to invoke their API and things like this. So I'm just doing a
104:07 simple search here and then there are a few different nodes because what I want
104:10 to do is I want to split out the search and actually I can show you this by
104:14 going to an execution history where we're actually using this tool. So take
104:18 a look at this. So in this case the LLM decided to invoke this tool and the
104:23 query that it decided is current price of the 5090 GPU. So this is going along
104:28 with the conversation that we had last in open web UI. We get some results from
104:33 SearXNG, which is just going to be a bunch of different websites. And so, we don't
104:37 have the answer quite yet. We just have a bunch of resources that can help us
104:41 get there. And so, I'm going to split out. So, we have a bunch of different
104:45 websites. We're going to now limit to just one. I just want to pull one
104:48 website right now just to keep it really, really simple because now we're
104:52 going to actually visit that website. I'm going to make an HTTP request to
104:57 this website, which yeah, I mean, if it's literally an Nvidia official site
105:01 for the 5090, like this definitely has the information that we need. We're
105:04 going to make a request to it, and then we're also going to use this HTML node
105:08 to make sure that we are only selecting the body of the site. So, we take out
105:12 all the footers and headers and all that junk. So, we just have the key
105:15 information. And then that is what we aggregate and then return back to our AI
105:19 agent. So it now has the content, the core content of this website to get us
105:24 that answer. That is how we invoke our web search tool. And then at the very
105:29 end, we're just going to set this output field. And that's going to be the
105:33 response that we got back either from like generating a title or calling our
105:37 main agent. And this is really important. the output field specifically
105:41 whatever we call it here we have to make sure that that is corresponding to this
105:46 value as the last thing we have to set for the settings for our open web UI
105:50 function. So output here has to match with output here because that is what
105:55 we're going to return in this respond to web hook. Whatever open web UI gets back
105:59 it's getting back from what we return right here. So that is everything for
106:03 our agent. I could probably dive in quite a bit more into explaining how
106:06 this all works and building out a lot more complex agents, which I definitely
106:10 do with local AI in the Dynamis AI agent mastery course. So check that out if you
106:13 are interested. I just wanted to give you a simple example here showing how we
106:17 can talk to our different services like Ollama, Supabase, and SearXNG. And then
106:22 also open web UI as well. So once you have all these settings set, make sure
106:26 of course that you click on save. It's very, very important. These two things
106:29 at the bottom don't really matter, by the way. But yeah, click on save once
106:32 you have all of the settings there. And then you can go ahead and have a
106:36 conversation with your agent just like I did when I was demoing things before we
106:40 dove into the workflow. And by the way, this n8n agent that works with Open Web
106:44 UI, I have as a template for you. You can go ahead and download that in this
106:48 GitHub repository where I'm storing all the agents for this masterclass. So we
106:52 have the JSON for it right here. You can go ahead and download this file. Go into
106:57 your N8N instance. Click on the three dots in the top right once you've
107:01 created a new workflow. Import from file and then you can bring in that JSON
107:03 workflow. You'll just have to set up all your own credentials for things like
107:07 Ollama and Supabase and SearXNG. But then you'll be good to go and you can just go
107:11 through the same process that I did setting up the function in open web UI
107:15 and within like 15 minutes you'll have everything up and running to talk
107:20 to N8N in open web UI. Next up, I want to create the Python version of our
107:25 local AI agent. And so this is going to be a one-to-one translation. Exactly what
107:30 we built here in N8N, we are now going to do in Python. So I can show you how to
107:34 work with both no-code and code with our local AI package. And so this GitHub
107:39 repo that has the n8n workflow we were just looking at and that OpenAI
107:43 compatible demo we saw earlier, this has pretty much everything for the agent. So
107:46 most of this repository is for this agent that we're about to dive into now
107:50 with Python. And in this readme here, I have very detailed instructions for
107:54 setting up everything. And a lot of what we do with the Python agent, especially
107:57 when we are configuring our environment variables, it's going to look very
108:01 similar to a lot of those values that we set in N8N. Like we have our base URL
108:05 here, which you'd want to set to something like http://ollama:11434.
108:09 We just need to add this /v1 on the end, which I guess is a little bit different,
108:14 but yeah, I've got instructions here for setting up all of our environment
108:18 variables, our API key, which you can actually use OpenAI or Open Router as
108:22 well with this agent, taking advantage of the OpenAI API compatibility. This is
108:27 a live example of this because you can change the base URL, API key, and the
108:31 LLM choice to something from Open Router or OpenAI, and then you're good to go
108:35 immediately. It's really, really easy. We will be using Olama in this case, of
108:39 course, though. And then you want to set your Supabase URL and service key.
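A hedged sketch of that provider-swap idea — change the base URL, key, and model and you point at Ollama, OpenRouter, or OpenAI. The environment variable names here are illustrative, not necessarily the ones in the repo:

```python
def llm_config(env: dict) -> dict:
    """Assemble OpenAI-compatible client settings for the agent.

    Pointing base_url at Ollama's /v1 endpoint (or at OpenRouter/OpenAI)
    is all it takes to swap providers. Variable names are assumptions.
    """
    return {
        "base_url": env.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        "api_key": env.get("LLM_API_KEY", "ollama"),  # Ollama ignores the key
        "model": env.get("LLM_CHOICE", "qwen2.5:14b-instruct"),
    }

# Inside the Compose stack, the service name replaces localhost:
cfg = llm_config({"LLM_BASE_URL": "http://ollama:11434/v1"})
print(cfg["base_url"])  # http://ollama:11434/v1
```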
108:42 You can get that from your environment variables. Same thing with SearXNG with
108:47 that base URL. We'll set that just like we did in N8N. We have our bearer token,
108:51 which in our case was test auth. It's just whatever comes after the Bearer and the
108:55 space. And then the OpenAI API key you can ignore. That's just for the
108:58 compatible demo that we saw earlier. This is everything that we need for our
109:02 main agent now. And so we're using a Python library called FastAPI to turn
109:09 our AI agent into an API endpoint just like we did in N8N. And so FastAPI is
109:13 kind of what gives us this web hook, both the entry point and the exit for
109:17 our agent, and then everything in between is going to be the logic where we are
109:21 using our agent. And I'm going to be using Pydantic AI. It's my favorite AI
109:25 agent framework with Python right now. Makes it really easy to set up agents,
109:29 so we'll dive into that here. And I don't want to get into the
109:32 nitty-gritty of the Python code here because this isn't a master class on
109:36 specifically building agents. I really just want to show you how we can be
109:40 connecting to our local AI services. This agent is 100% offline. Like I could
109:46 cut the internet to my machine and still use everything here. So we create our
109:50 Supabase client and the instance of our FastAPI endpoint. I have some models
109:55 here that define the requests coming in and the response going out. So we have
109:59 the chat input and the session ID just like we saw in N8N. And then the output
110:04 is going to be this output field. And so that corresponds to exactly what we're
110:07 expecting with those settings that we set up in the function in open web UI.
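The request and response shapes can be sketched like this; the actual repo uses Pydantic models, so stdlib dataclasses are used here only to stay dependency-free, and the exact field names are assumptions based on the valves described above:

```python
from dataclasses import dataclass

# Sketch only: the real code uses Pydantic models, but the field
# shapes are the same idea.
@dataclass
class ChatRequest:
    chatInput: str   # must match the "input field" valve in Open WebUI
    sessionId: str   # used as the conversation-history key

@dataclass
class ChatResponse:
    output: str      # must match the "output field" valve in Open WebUI

resp = ChatResponse(output="The RTX 5090 starts at $1,999.")
print(resp.output)
```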
110:11 So this Python agent is also going to work directly with open web UI. And then
110:16 we have some dependencies for our Pydantic AI agent because it needs to have an
110:21 HTTP client and the SearXNG base URL to make those requests for the web search
110:25 tool. And then we're setting up our model here. It's an OpenAI model, but we
110:29 can override the base URL and API key to communicate with Ollama or Open Router as
110:34 well like we will be doing. And then we create our Pydantic AI agent just getting
110:38 that model based on our environment variables. I've got a very simple system
110:43 prompt and then the dependencies here because we need that HTTP client to talk
110:48 to SearXNG. And then I'm just allowing it to retry twice. So if there's any kind
110:51 of error that comes up, the agent can retry automatically, which is one of the
110:55 really awesome things that we have in Pydantic AI. And then I'm also creating a
111:01 second agent here. This is the agent that is going to be responsible like we
111:05 have in N8N for handling the metadata for open web UI like conversation titles
111:10 and tags for our conversation. And so it's an entirely separate agent because
111:14 we just have another system prompt. In this case, I'm just doing something
111:18 really simple here. We don't have any dependencies for this agent because it's
111:21 not going to be using the web search tool. And then for the model, I'm just
111:25 using the exact same model that we have for our primary agent. But like I shared
111:30 with N8N, you could make it so that this is like a much smaller model, like a one
111:33 or three billion parameter model because the task is just so basic or maybe like
111:38 a 7 billion parameter model. So you can tweak that if you want. Just for
111:41 simplicity's sake, I'm using the same LLM for both of these agents.
111:46 And then we get to our web search tool. So in Pydantic AI, the way that you give a
111:51 tool to your agent is you do @, then the name of your agent, then
111:55 .tool, and then the function that you define below this is now going to be
112:00 given as a tool to the agent. And then this description that we have in the doc
112:04 string here that is given as a part of the prompt to your agent. So it knows
112:09 when and how to use this tool. And so the exact, you know, details of how
112:13 we're using SearXNG here, I won't dive into, but it is the exact same as what
112:17 we did in N8N where we make that request to the search endpoint of SearXNG. We
112:22 go through the page results here. We limit to just the top three results or I
112:26 could even change this to make it even simpler and just the top result. So we
112:30 have the smallest prompt possible to the LLM. And then we get the content of
112:34 that page specifically. And then we return that to our AI agent with some
112:40 JSON here. So now once it invokes this tool, it has a full page back with
112:44 information to help answer the question from the user. It has that web search
112:47 complete now. And then we have some security here to make sure that the
112:51 bearer token matches what comes into our API endpoint. So that's that header
112:55 authentication that we set up in N8N. So this part right here where we're
112:59 verifying the header authentication that corresponds to this verify token
113:03 function. And then we have functions to fetch conversation history and store a
113:08 new message in conversation history. So both of these are just making requests
113:12 to our locally hosted Supabase using that Supabase client that we created
113:16 above. And then we have the definition for our actual API endpoint. And so in
113:23 N8N we were using invoke N8N agent for our path to our agent. So this was our
113:29 production URL. In this FastAPI endpoint, our endpoint is
113:34 /invoke-python-agent. And then we're specifically expecting the chat input
113:39 and session ID. So that is our chat request type right here. And
113:43 then, sorry, I highlighted the wrong thing. We have our response model here
113:46 that has the output field. So we're defining the exact types for the inputs
113:50 and the outputs for this API endpoint. And then we're also using this verify
113:55 token to protect our endpoint at the start. And then the key thing here, if
113:59 the chat input starts with that task, then we're going to call our metadata
114:02 agent. And so it's just going to spit out the title or the tags, whatever that
114:06 might be. Otherwise, we're going to fetch the conversation history, format
114:11 that for Pydantic AI, store the user's message so that we have that
114:15 conversation history stored, create our dependencies so that we can communicate
114:20 with SearXNG, and then we'll just do agent.run. We'll pass in the latest
114:24 message from the user, the past conversation history and the
114:27 dependencies that we created. So it can use those when it invokes the web search
114:31 tool. And then we just get the response back from the agent, and you can
114:34 print that out in the terminal as well, and then we'll just store it in
114:38 Supabase and then return the output field. So I'm going kind of fast here.
114:42 There are definitely a lot more videos on my channel where I break down in more
114:45 detail building out agents with Pydantic AI and turning them into API endpoints
114:49 and things like that. But yeah, just going a little bit faster here.
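The SearXNG request at the heart of that web search tool can be sketched as a simple URL builder. SearXNG's `/search` endpoint can return JSON when `format=json` is enabled in its settings; the service name and port here follow the video, the helper itself is illustrative:

```python
from urllib.parse import urlencode

def searxng_search_url(base_url: str, query: str) -> str:
    """Build the SearXNG search request the web-search tool makes.

    The agent decides the query; we then fetch the result list, visit
    the top page(s), and hand the page body back to the agent.
    """
    params = urlencode({"q": query, "format": "json"})
    return f"{base_url}/search?{params}"

# Inside the Compose stack, SearXNG is reached by service name:
print(searxng_search_url("http://searxng:8080", "current price of the 5090 GPU"))
```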
114:52 And then the last thing is with any kind of exception that we encounter, we're
114:55 just going to return a response to the front end saying that there was an issue
114:59 and then specifying what that is. And then we are using Uvicorn to host our
115:06 API endpoint specifically on port 8055. So that is everything for our Python
115:11 agent exactly the same as what we set up in N8N. And now going to the readme
115:16 here, I'll open up the preview. The way that we can run this agent, we just have
115:20 to open up a terminal just like we did with the OpenAI compatible demo. I've
115:25 got instructions here for setting up the database table, which is using
115:30 the same table as the one in N8N, and N8N creates it automatically. So, if you
115:34 have already been using the N8N agent, you don't actually have to run this SQL
115:38 here. And then you want to obviously set up your environment variables like
115:42 we covered. Open your virtual environment and install the requirements
115:46 there. And then you can go ahead and run the command python main.py.
115:52 And so this will start the API endpoint. So it'll just be hanging here because
115:56 now it's waiting for requests to come in on port 8055. And so what I can do is
116:03 I can go back to open web UI. I can go to the admin panel functions. Go to the
116:08 settings. I can now change this URL. So everything else is the same. I have my
116:11 bearer token, the input field, and the output field the same as N8N. The only
116:16 thing I have to change now is my URL. And so I know this is an N8N pipe and I
116:20 have N8N in the name everywhere, but this does work with just any API
116:23 endpoint that we have created with this format here. And so I'm going to say for
116:27 my URL, it's actually going to be host.docker.internal because I have my API endpoint for
116:33 Python running outside on my host machine. So I need my open web UI
116:38 container to go outside to my host machine. And then specifically the port
116:43 is going to be 8055. And then for the endpoint here, I'm going to
116:46 swap out this web hook path because ours is invoke-python-agent. Take a look at that. All right.
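A small sketch of that routing decision — from inside the Open WebUI container, an agent running on the host needs `host.docker.internal`, while a containerized agent in the same stack is reached by service name. The service name here is a placeholder for illustration:

```python
def agent_base_url(agent_in_compose_stack: bool) -> str:
    """Where Open WebUI (a container) should look for the Python agent.

    If the agent runs on the host (plain `python main.py`), the container
    has to go *out* to the host via host.docker.internal; once the agent
    is containerized in the same stack, its service name works instead.
    """
    if agent_in_compose_stack:
        return "http://python-local-ai-agent:8055"  # assumed service name
    return "http://host.docker.internal:8055"

print(agent_base_url(False))  # http://host.docker.internal:8055
```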
116:52 Boom. So I'm going to go ahead and save this. And then I can go over to my chat.
116:57 And it says N8N agent connector here still. But this is actually talking to
117:00 my Python agent now. So I'll go ahead and start by asking it the exact same
117:05 question that I asked the N8N agent. And I do have this pipe set up to always say
117:08 that it's calling n8n, but this is indeed calling our Python API endpoint.
117:12 And we can see that now. So there we go. We got all the requests coming in, the
117:16 response from the agent, and then also the metadata for the title and the tags
117:20 for the conversation. Take a look at that. So we got our title here. We have
117:24 our tags, and then we have our answer. It's a starting price of $2,000, though
117:28 it's a lot more right now. The starting price is kind of
117:32 misleading, but yeah, this is a good answer. And it did use SearXNG to do
117:36 that web search for us. This is really, really neat. Now, the last thing that we
117:40 want to do for our Python agent before we can work on deploying things to the
117:45 cloud, to a private server in the cloud, is we want to containerize it. Now, the
117:49 reason that we want to do this, and this is the Docker file that I have set up to
117:53 turn our Python agent into a container, just like our local AI services, is if
117:59 we have our agent containerized, then we can have it communicate within the
118:03 Docker network just like we have our different local AI services
118:07 communicating with each other. Because right now running directly with Python
118:12 to communicate with Ollama, for example, we need our URL to be localhost, not
118:18 ollama. Remember, you can only use the specific name of the container service
118:23 when you are within the Docker Compose stack. And so we'd have to actually say
118:28 localhost right now. But if we add the container for the agent into the stack
118:32 as well, then we can communicate directly within the private network.
118:36 Like I can say ollama, and then for SearXNG I could use this URL instead. Right now
118:41 we have to actually use localhost port 8081. And so it's really nice for
118:46 security reasons and just to make your deployment a nice package to have
118:51 the agents that you're running in the same network as your infrastructure. And
118:55 so that's what we're going to do right now. And so within the read me that I
118:59 have for instructions on setting up everything. I have the instructions that
119:02 we follow to run things with Python. I also have the instructions to run it
119:07 with Docker. And so all you want to do is run this single command. It's
119:11 actually very easy because I have this Docker file set up to turn our agent
119:15 into a container and I've got security and everything taken care of. We're
119:19 running it on port 8055 just like we did with Python. And then I have this very
119:24 simple Docker Compose file. It's just a single service that we're going to tack
119:28 on to all of the other services that we already have running for the local AI
119:32 package. And I'm calling this one the Python local AI agent. And so we're
119:36 using all of our environment variables from our ENV just like we did with the
119:40 local AI package. And then what I have at the top here is I am including the
119:45 docker compose file for the local AI package. So that just kind of solidifies
119:49 the connection there. Otherwise, you'll get this kind of weird error that says
119:52 there are orphaned containers when you run this, even though they aren't
119:55 actually. And so this is optional. You'll just get an orphan container
119:58 warning that you can ignore. But if you don't want to have that warning, you can
120:01 include this right here. You just have to make sure that this path corresponds
120:06 to the path to your docker compose in the local AI package. So in my case, I
120:10 just had to go up two directories and then go into the local AI package
120:13 folder. So yeah, this is optional, but I wanted to include this here just to
120:18 make things in tiptop shape for you. So yeah, this is the docker compose. And
120:21 then what we can do now is I'll go back over to my terminal and I will paste in
120:25 this command. And what this will do is it will start or restart my Python local
120:31 AI agent container. And make sure that you specify this here because if you
120:35 don't then it's going to try to rebuild the entire local AI package because we
120:38 have this include. So very important you want to just rebuild or build for the
120:43 first time this agent container. And so I'll go ahead and run this and it's
120:47 going to give me those Flowise warnings, since I don't have my username
120:49 and password set, but remember we can ignore those. But anyway, it's going to
120:52 build the Python local AI agent container here. And there's a couple of
120:56 steps that it has to do. It has to update some internal packages and then
121:00 also install all of the pip packages we have for our Python requirements for
121:04 things like FastAPI and Pydantic AI. So it'll take a little bit to complete,
121:07 usually just a minute or two. And so I'll go ahead and pause and come back
121:11 once this is done. And there we go. Your output should look something like this.
121:14 It goes through all the build steps and then it says that the container is
121:18 started at the bottom. And this is now in our local AI Docker Compose stack. So
121:23 going back over to Docker Desktop. It'll take a little bit to find it here
121:26 because there are so many services that we have here. But if we scroll down,
121:29 okay, there we go. At the bottom here, we have our Python local AI agent
121:34 waiting on port 8055 just like when we ran it directly with Python, but now it
121:38 is within a container that is directly within our stack. And so now, like I was
121:41 saying, this is super important, so I'm hitting on it again. Now when we set up
121:45 our environment variables for our container, we are going to be
121:49 referencing the service names of our different local AI services like searxng or
121:55 ollama instead of localhost. And so this whole localhost versus
121:58 host.docker.internal versus using the service name thing, that's what I see people get tripped up on
122:03 the most when they're configuring different things in a Docker environment
122:06 like the local AI package. That's why I'm spending a good amount of time
122:09 really hammering that in because I want you to get it right. And of course, if
122:12 you have any issues that come up with this, just let me know. I'd love to help
122:16 walk you through what exactly your configuration should look like. And so,
122:20 we have our agent now up and running in a container. And I'm not going to go and
122:23 demo this again right now because the next thing that we're going to move into
122:26 doing and then I'll give a final demo here is deploying everything to a
122:30 private server that we have in the cloud. All right. So, we have really
122:34 gotten through all of the hard stuff already. So, if you have made it this
122:38 far, congratulations. You really have what it takes to start building AI
122:43 agents with local AI now and the sky is the limit for what you can accomplish.
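The localhost-versus-service-name distinction from a moment ago can be sketched as three variants of the same setting (the variable name and ports here are illustrative defaults, not necessarily the exact ones from the repo):

```bash
# Agent running directly on your machine (no container):
OLLAMA_BASE_URL=http://localhost:11434

# Agent in a container, Ollama running on the host machine:
OLLAMA_BASE_URL=http://host.docker.internal:11434

# Agent container inside the same Docker Compose stack as Ollama:
OLLAMA_BASE_URL=http://ollama:11434
```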
122:47 And so the last thing that I want to really focus on here in this master
122:50 class is taking everything that we've been building on our own computer with
122:54 our infrastructure and our agents and deploying it to a private machine in the
122:59 cloud because then we can have our entire infrastructure and agents running
123:02 24/7. We don't have to rely on having our own computer up all the time. It's
123:05 really nice to have it there because then we can also share it with other
123:10 people as well. So local AI is still considered local as long as it's running
123:14 on a machine in the cloud that you control. And so this is not just going
123:18 to OpenAI or Supabase and paying for their APIs. This is still us running
123:22 everything ourselves on a private server. That's what I'm going to show
123:25 you how to do right now. And this process that I cover with you is going
123:29 to work no matter the cloud provider that you end up choosing. And there are
123:32 some caveats to that that I'll explain in a little bit. But yeah, you can pick
123:36 from a lot of different options. So the cloud platform that we will be deploying
123:41 to today is Digital Ocean. I use Digital Ocean a lot. It's where I deploy most of
123:46 my AI agents. So I highly recommend it. And the best part about Digital Ocean is
123:51 they have both GPU machines if you need to have a lot of power for your local
123:56 LLMs and they have very affordable CPU instances. If you want to deploy
124:00 everything for the local AI package except Ollama, you can definitely go a
124:03 more hybrid route if you don't want to pay a lot because these GPU instances in
124:07 the cloud can be pretty expensive like one, two, even $5 per hour. So, what you
124:12 can do with a hybrid setup is deploy everything in the local AI package. So,
124:16 at least you have all that locally and you're not paying for those
124:19 subscriptions. But then you could still use something like OpenAI, Open Router
124:23 or Anthropic for your LLMs. So, Digital Ocean gives us the ability to do both,
124:27 and we'll dive into that when we set things up. Another really good option
124:33 for GPU instances is TensorDock. TensorDock is not as nice looking to me as
124:37 Digital Ocean. I generally feel like I have a better experience with Digital
124:40 Ocean, but I have deployed the local AI package to TensorDock before on a 4090
124:46 GPU that they offer for 37 cents an hour. It's very affordable for GPU
124:49 instances. And so, this is a good platform as well. And then also if
124:55 you're okay with not running Ollama on a GPU instance, like you just want a very
124:59 affordable way to host everything in the local AI package except the LLMs, then
125:03 you can use Hostinger. Hostinger is another really, really good option. Super,
125:08 super affordable, like $7 a month for a KVM 2, which I'd recommend getting if you
125:14 want to deploy everything except Ollama, because the requirement for the local AI
125:17 package, aside from running the more resource-intense local LLMs, is you have
125:23 to have 8 GB of RAM. So don't get a cloud machine that has 4 or 2 GB. You
125:27 want to have 8 GB of RAM, then you'll be good to go. So you can literally do it
125:31 for $7 a month through Hostinger and it's going to be something like $28 a
125:35 month through Digital Ocean unless you want a GPU instance. So I just want to
125:39 spend a couple minutes talking about different platform options. The one
125:43 thing I will say is that the local AI package runs as a bunch of Docker
125:47 containers, right? And so what you have to avoid is using a platform like
125:54 RunPod. So RunPod is a platform for running local AI. The problem is when
125:59 you pay for a GPU instance, you don't actually get the underlying machine.
126:05 You're just sshing into a container. So, you're accessing a container. And I'll
126:09 just save you the pain right now. It is so hard and basically impossible to run
126:15 Docker containers within Docker containers. So, you really can't run the
126:20 local AI package on RunPod. There are other platforms as well, like Lambda Labs,
126:25 another one that I've used before, not for the local AI package but for other
126:28 things. This also runs containers, like you're accessing a container, so you
126:34 can't do the local AI package there. Vast.ai is another option where you're
126:39 renting a GPU, but again you're accessing a container. So, you again
126:43 can't run the local AI package. And so, based on the platform that you choose,
126:47 you have to make sure that you are accessing the underlying machine when
126:51 you rent a GPU instance like Digital Ocean is the one that I will be using.
126:56 You could use a GPU instance through the Google cloud or Azure or AWS if you want
127:00 to go more enterprise. Those all give you access to the underlying machine.
127:04 It's your own private server just like we have in Digital Ocean. So you can use
127:07 that to deploy the local AI package and the agent that we built with Python. So
127:11 that's what we're going to do right now. So once you are signed into Digital
127:14 Ocean and you have your profile and billing set up and you have a project
127:18 created or you can just use the default one, now we can go ahead and create our
127:22 private server in the cloud to host the local AI package and our agent. And so
127:25 you can click on create in the top right. And there are two options here.
127:29 If you want a CPU instance, so that hybrid approach where you're hosting
127:34 everything except for the LLMs, you can select a droplet. Otherwise, what we're
127:37 going to be doing right now so I can demo the full thing is we will
127:42 create a GPU droplet. Now, these are going to be more expensive. Like I said,
127:48 like running an H100 GPU is $3.40 an hour. It's pretty expensive. But like I
127:52 said at the start of this master class, I know so many businesses that are
127:56 willing to put tens of thousands of dollars per year into running their own
127:59 infrastructure and LLMs. And that biggest cost that contributes to that
128:03 being tens of thousands of dollars is having a GPU droplet that is running 24/7
128:09 in the cloud. So the hybrid approach I definitely recommend if you don't want
128:12 to pay more, you could go as low as $7 a month with Hostinger. So, there's a very
128:16 wide range of options for you depending on what you want to pay hosting the
128:21 package, LLMs, and your agents. And the other thing I will say is that if you
128:25 want to, you could just create this instance for a day and poke around with
128:29 things and then tear it down. So, you only have to pay, you know, like 20
128:31 bucks or something like that. So, there's a lot of different options for
128:35 flexibility here. And so, I'm going to pick the Toronto data center because
128:38 there's more options for GPUs available here and it's relatively close to me.
128:42 And then for the image, I'll select AI/ML ready, and it's recommended because
128:47 you get Linux bundled with all the required GPU drivers, and it does run the
128:52 Ubuntu distribution of Linux. And so this process that I'm going to walk you
128:55 through for deploying local AI package to the cloud is going to work for any
129:00 Ubuntu instance that you have running on AWS or Hostinger or TensorDock. It's just
129:05 a very standard distribution of Linux. And then for the GPU, there are a couple
129:09 of different options that we have here with Digital Ocean. H100 is an absolute
129:15 beast: 80 GB of VRAM, so it could easily run Q4-quantized large language models over 100
129:21 billion parameters, and it even comes with 240 GB of RAM. So, I'm not going to run this one. I'm
129:24 just kind of pointing out that this is an absolute beast. I think the one that
129:27 I'm going to choose here is going to be the RTX 6000 Ada. So, it's 48 GB of
129:34 VRAM. So, this is enough to run 70 billion parameters or smaller of LLMs at
129:40 a Q4 quantization and it comes with 64 GB of RAM and it's going to be about
129:45 $1.90 per hour. So, I'm going to select this and then I have an SSH key that is
129:50 created already. If you don't have an SSH key, then you can click on this
129:54 button to add one. And then you can follow the instructions on the right
129:57 hand side here. No matter your OS, they got instructions to help you out. You
130:00 just have to paste in your public key and then give it a name. So, I've got
130:03 mine added already. And then the only other thing that I really have to select
130:07 here is a unique name. So, I'll just say local AI package. And I'll just say GPU
130:11 because I already have the regular version just deployed on a CPU instance.
130:16 And then for my project, I will select Dynamous. I can add it along with my
130:19 other instance that I've got up and running. And there we go. So now I can
130:22 go ahead and just click on create GPU droplet. It is that easy to get our
130:27 instance ready for us to access it and start installing everything and getting
130:30 everything up and running just like we did on our computer. And so I'll go
130:34 ahead and pause and come back once our machine is created in just a few
130:37 minutes. And boom, there we go. Just after a minute, we have our GPU droplet
130:42 up and running. And so the one thing I will say is I had to request access to
130:46 create a GPU instance on the Digital Ocean platform. However, they approved
130:50 it in less than 24 hours. So, it's very easy to get that if you do want to
130:53 create a GPU instance. Otherwise, you can just create one of their normal
130:58 droplets, one of their CPU instances. Now, before we connect to this machine,
131:01 the one thing that you want to take note of is the public IPv4 address. We'll use
131:06 this to set up subdomains in a little bit. And so, this is how we get to it
131:12 for a GPU droplet. For a CPU droplet, it looks a little bit different. You'll
131:15 usually see the IPv4 somewhere at the top right here. And so, take note of
131:18 that. Save it for later. We'll be using that in a little bit. And then to
131:22 connect to our droplet, we can either do it through SSH with our IPv4 and the SSH
131:27 key that we set up when we configured this instance or we can access the web
131:31 console. For a CPU instance, usually you go to like an access tab and then you
131:35 can launch the web console. For the GPU instance, I can just click this to
131:38 launch it right here. So we have a separate window that comes up and boom,
131:42 we now have access to our instance. It is that easy to get connected. And now
131:46 we can go through the same process that we did on our computer to install the
131:51 local AI package. Now there are some different steps that we have to take and
131:55 that's why I'm including this at the end of the master class especially. Um so if
131:59 you scroll down in the read me here there are some specific instructions for
132:03 deploying to the cloud. And so you have to make sure that you have a Linux
132:07 machine preferably on the Ubuntu distribution which is that is what we
132:11 are using. And then there are a couple of extra steps. And so the first thing
132:15 that we have to set up is our firewall. We have to open up a couple of ports so
132:20 that we can access our machine from the internet. Set up our subdomains for
132:24 things like N8N and Open Web UI. And so you want to take this command UFW
132:29 enable. I'll just go ahead and paste it in, and you can
132:33 just type Y to continue here. It warns that it may disrupt SSH connections, but we don't
132:36 really care. So ufw enable, so we're enabling the firewall. And then you want
132:41 to copy this command to allow both ports 80 and 443. And so 80 is HTTP and then
132:48 443 is HTTPS. And then you can just do the last command here, UFW reload. So
132:54 now we have those ports available for us to communicate with all of our services.
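Collected in one place, the firewall commands just described look like this (run as root on the Ubuntu droplet; check the README for the exact wording it uses):

```shell
ufw enable       # turn the firewall on (type y to confirm)
ufw allow 80     # HTTP, used by Caddy for certificate challenges
ufw allow 443    # HTTPS, where all service traffic will arrive
ufw reload       # apply the new rules
```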
132:59 And so this is the entry point to Caddy. Caddy is the service in the local AI
133:03 package that is going to allow us to set up subdomains for all of our services.
133:08 And so any kind of communication to our droplet is going to go through Caddy,
133:13 and then Caddy will distribute to our different services based on the port or
133:17 the subdomain that we are using. So this is called a reverse proxy. You've
133:21 probably heard of Nginx or Traefik before. Caddy is something very
133:25 much like that. And it also makes it so that we can get HTTPS so we can have
133:29 secure endpoints set up automatically and it manages that encryption for us.
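Conceptually, the subdomain routing described here boils down to Caddyfile entries like the following (a simplified, hypothetical sketch with placeholder domains; the package generates the real config from your environment variables):

```
n8nyt.example.com {
    reverse_proxy n8n:5678
}

openwebuiyt.example.com {
    reverse_proxy open-webui:8080
}
```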
133:33 It's a very beautiful platform. So, let me scroll back down to the specific
133:36 steps for deploying to the cloud. There is a quick warning here about how Docker
133:40 manages ports and things, but this is as secure as we possibly can make it. So,
133:43 trust me, we've put a lot of effort in, and I actually had someone from the Dynamous
133:48 community, um, Benny right here. He actually helped me a lot with security
133:53 for the local AI package. So, thank you, Benny, for helping out with that. We're
133:56 really making sure that, because with local AI the whole point is that you want to
133:59 be private and secure, this package handles
134:03 all the best practices for that. So very much top of mind for us. Um and then we
134:07 can go ahead and go through the usual steps for setting up the local AI
134:11 package. The only other thing we have to do that's unique for cloud is we have to
134:16 set up A records with our DNS provider so we can have our subdomains set up for our
134:20 different services. So we'll get into that in a little bit. But first I just
134:24 want to get the local AI package up and running. And so I'm going to go ahead
134:28 and paste this command here to clone the repository. Git comes automatically
134:33 installed with our GPU droplets. And then I can change my directory into the
134:38 local AI package. And let me zoom in on this a little bit here so it's very easy
134:41 for you to see, because now what I want to do is copy
134:45 the .env.example file to a new file called .env. So then if I do an ls command we can see all the
134:51 files that are available in our directory. I guess it doesn't show dotfiles, so
134:56 I do ls -a there. Now we can see the .env and .env.example. And so now I can do nano .env.
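The commands in this step, in order (the repository URL shown is the local AI package's public repo as an assumption — verify it against the README you're following):

```shell
git clone https://github.com/coleam00/local-ai-packaged.git
cd local-ai-packaged
cp .env.example .env   # dotfiles are hidden by plain ls; use ls -a
nano .env              # edit environment variables in the terminal
```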
135:05 This is going to give us a basically a text editor directly in the terminal
135:08 here so that we can set all of our environment variables just like we did
135:12 with the local AI package on our computer. So this time I'm not going to
135:15 go into the nitty-gritty details of setting up all these environment
135:19 variables because it is the exact same process. In fact, you can literally
135:23 reuse all of the secrets that you already set up when you hosted it on
135:27 your own computer as long as those are actually secure. Like I know a lot of
135:29 times you might just do some kind of placeholder stuff when it's just running
135:32 on your computer and then you want it more secure in the cloud. So make sure
135:36 you have real values for everything, but yeah, you can reuse a lot of the same
135:39 things. Um though for best security practice, you probably do want to make
135:42 everything different. But yeah, the one thing that I want to focus on with you
135:46 here that does change is our configuration for Caddy. Now that we are
135:50 deploying to the cloud, we want to have subdomains for our different services.
135:54 And so like N8N for example, we want to set the hostname for that. You want to
135:59 do the same thing for N8N, Open WebUI, and Supabase, and then your Let's Encrypt
136:04 email. Obviously you want to uncomment that as well because this is the email
136:08 that you want to set for your SSL encryption. And so I'm just going to do
136:13 cole@dynamous.ai. And you can just set this to whatever email that you want to use. And so
136:18 basically what you want to do here is just uncomment the line for each of the
136:23 services that you want to have subdomains created for. So, if you're
136:26 also using Flowwise, which I'm just not in this master class here, but if you
136:30 are, then you want to uncomment this line as well. If you're using Langfuse,
136:34 then you uncomment this line. I'm going to leave them commented right now just
136:38 for simplicity's sake. The two that I would generally recommend not
136:43 uncommenting ever are Ollama and SearXNG. We don't really want to expose them through
136:45 a subdomain because we're just going to use them as internal tooling for our
136:49 agents and applications that we have running on this server. So, we want to
136:53 keep those nice and private. But for everything that we do want to expose
136:56 that is protected with a username and password, like N8N, Open WebUI, and
137:01 Supabase, we can uncomment those. And so we got that set up now. But we have to
137:05 obviously provide real values for them as well. And so for example I'm just
137:09 going to say n8nyt, with the YT for YouTube, so I'll do n8nyt.dynamous.ai. So you want to
137:17 define the exact URL that you want to have for this domain. Obviously, it has
137:21 to be a domain that you control because we'll go and we'll set up the records in
137:25 the DNS in a little bit. And so, I'll do the exact same thing for Open WebUI.
137:28 So, it's openwebui and I'm just adding YT because I already have the local AI
137:33 package hosted on my domain. And so, I can't do just openwebui because it's
137:36 already taken. So, openwebuiyt.dynamous.ai. And then finally, for Supabase, it'll
137:44 be supabaseyt.dynamous.ai. Boom. All right. There we go. So, go
137:47 ahead and take care of this. set up the rest of your environment variables and
137:51 then we can go ahead and move on. And the way that you exit out of this and
137:55 save your changes is you do Ctrl+X, type Y, and then hit
138:03 Enter. So again, that is Ctrl+X, then type Y, then press Enter. That is how you
138:08 exit out. And so now if I do a cat of the .env, this is how you can print it out
138:11 in the terminal. So we can verify that the changes that we made, like everything
138:16 for Caddy, are indeed made. So do that. Also change all the other environment
138:18 variables as well. I'm going to do that off camera and come back once that is
138:22 taken care of. All right. So my environment variables are all set. Now
138:26 the very last thing we have to do before we start all of our services is we need
138:32 to set up DNS. And so remember, copy your IPv4 address and then head on over
138:38 to your DNS provider. And so this process is going to look very similar no
138:42 matter the DNS provider that you have. Like I'm using Namecheap here. A lot of
138:47 people use Hostinger or Bluehost. You are able to with all these providers go
138:50 to something that is usually called like advanced DNS or manage DNS and then you
138:55 can set up custom A records here which we're going to do to set up a connection
139:01 for all of our subdomains to the IPv4 of our digital ocean droplet or whatever
139:05 cloud provider you are using. And so I'm going to go ahead and click on add new
139:09 record. It's going to be an A record for the host. It's going to be the subdomain
139:14 that I want. So, N8NYT for example. And then for the IP address, I just paste in
139:19 the IPv4 of my Digital Ocean GPU droplet. And then I'll go ahead and
139:23 click on the check here to save changes. And then I'll just go ahead and do the
139:28 same thing for openwebuiyt. I can't forget the YT. There we go. And
139:33 then for you, it might be more than just three, but for me, the only other one
139:36 that I have here right now is Supabase because I'm just keeping it very, very
139:41 simple. So, supabaseyt, and then paste in the IP again. Okay, there we go.
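The end result in the DNS panel is three A records, all pointing at the droplet (203.0.113.10 is a placeholder; use your own IPv4 address):

```
Type  Host         Value          TTL
A     n8nyt        203.0.113.10   Automatic
A     openwebuiyt  203.0.113.10   Automatic
A     supabaseyt   203.0.113.10   Automatic
```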
139:45 Boom. All right, so we have all of our records set up. And it's very important
139:49 to do this before you run things for the first time because otherwise Caddy gets
139:52 very confused. It tries to use these subdomains that you don't actually have
139:57 set up yet. And so take care of that. Then we can go back into our instance
140:01 here and run the last command. And so going back to the readme here, if I
140:06 scroll all the way down to deploying to the cloud, there is a specific parameter
140:11 that you want to add. This is very important for deploying to the cloud
140:15 because when you select the environment of public, it's going to close off a lot
140:20 more ports to make this very very secure. So any of the services that you
140:25 access from outside of the droplet, it has to go through caddy. So we use the
140:30 reverse proxy as the only entry point into any of our local AI services. This
140:34 is how we can make things as airtight as possible. We have security in mind like
140:39 I said. And so make sure that you run this command with the environment
140:43 specified. We didn't do this locally because when we are running things
140:45 locally, we don't care about security as much because it's not like our machine
140:49 is accessible to the internet like a cloud server is. And so it defaults to
140:54 the environment of private which just doesn't do as much security stuff. And
140:57 so go ahead and run this. And then of course make sure that you're using the
141:01 correct profile. So if you're using just a CPU instance with a regular droplet or
141:05 Hostinger or whatever, you'd want to change this to cpu instead of
141:10 gpu-nvidia. But in our case, because we are paying the $2 an hour for a killer GPU
141:16 droplet, I can go ahead and run this command with the profile of gpu-nvidia.
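Assuming the start script's flags match what is shown on screen, the command being run is roughly the following (the script and flag names are taken from the package's README at the time of recording — verify against your copy):

```shell
python3 start_services.py --profile gpu-nvidia --environment public

# Hybrid/CPU-only droplet (everything except the GPU-hosted LLMs):
# python3 start_services.py --profile cpu --environment public
```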
141:20 Now, I left this error in here intentionally because I want to show you
141:24 what it looks like. If you get unknown shorthand flag: -p, that means that you
141:29 don't actually have docker compose installed. And this happens for some
141:32 cloud providers. And there's a very easy fix for this that I want to walk you
141:35 through. So you can even test this. Just docker compose. It'll say that compose
141:40 is not a docker command. And so going back to the readme here, I have a couple
141:44 of commands that you just have to run if this happens to you. This is at the
141:47 bottom of the deploying to the cloud section of the readme. So you can just
141:51 copy these one at a time, bring them into your droplet or your machine,
141:55 wherever you're hosting it, and go ahead and run them. And so I'm just going to
141:58 do this off camera really quickly. I'm just going to copy each of these into my
142:02 droplet. It's very easy. You can just run all of these. They're really fast as
142:06 well. So none of them are going to take very long. This is just going to get
142:10 everything ready for you so that Docker Compose is a valid command. So you can
142:14 then run the start services script. So there we go. All right. I went ahead and
142:18 ran all of those. I'll clear my terminal again and then go back to the main
142:23 command here to start our services. Boom. All right. So now we pulled
142:26 everything from Supabase, set up our SearXNG config. Now we are pulling our
142:31 Supabase containers. So again, same process as running on our computer, where
142:35 it'll pull Supabase, it'll run everything for Supabase, then it'll
142:38 pull and run everything for the rest of our services. So I'll pause and come
142:42 back once this is all complete. And boom, there we go. We have all of our
142:46 services up and running. You should see green check marks across the board like
142:51 this. We are good to go. And we don't have Docker Desktop, so it's not as easy
142:55 to dive into the logs for our containers. But one quick sanity check
142:59 that you can do just in the terminal is run the command docker ps -a. This will
143:03 give you a list of all of our containers that are running here. We can make sure
143:06 that all of them are running, that we don't see any that are constantly
143:10 restarting or ones that are down. So we do have two that are exited, but these
143:14 are the n8n import and the Ollama pull. These are the two that we know should be
143:17 exited. Just make sure that everything is good to go. Then we can head on over
143:22 to our browser. And because we have DNS set up already and we configured Caddy, we
143:25 can now navigate to our different services. Like I can go to
143:30 n8nyt.dynamous.ai. And boom, there we go. It's having us set up our owner account. Or um we can
143:39 just go to openwebuiyt.dynamous.ai. Boom. And there is our Open WebUI. All
143:43 right. So I'll go ahead and get started. Uh, we'll have to create our account.
143:46 I'll do this off camera, but yeah, you just create your first-time accounts for
143:49 everything. And then we'll do the same thing for, let's do, supabaseyt.dynamous.ai.
143:55 And boom, there we go. So, all of our services are up and running. And so now
143:59 we can log into these and create our accounts and we can interact with our
144:03 agents and bring them in. We can work with Ollama in the same way. And so,
144:06 let's go ahead and do that. I'll just go ahead and create these accounts off
144:09 camera. So, I've got my accounts created for N8N and then also open WebUI. And
144:13 you can do the same for all the other accounts you might have to create for
144:16 things like Langfuse as well. And then within Open Web UI, we'll go to the
144:21 admin panel, settings, connections. Make sure that your Ollama API is set
144:25 correctly to reference the service ollama. Usually this will default to
144:30 localhost or host.docker.internal. So you can change that there. You have to set
144:34 the OpenAI API key as well, just to any kind of random value. It's just a little
144:38 bug in open web UI. Then click on save and then you can go back and your models
144:41 will be loaded. Now, one thing I found with Open WebUI: after you change
144:46 the Ollama base URL, you have to do a full refresh of the website. Otherwise,
144:49 you'll get an error when you use the LLM. So, just a really small tangent
144:52 there, a little tidbit there, but yeah, we'll go ahead and select the model
144:55 that's pulled by default. And you can pull other ones into your Ollama
144:57 container as well, like we already covered. You don't have to restart
145:00 things. So, let's run a little test. I'll just say hello, and we'll see if we
145:04 can load the model now. And boom, look at how fast that was, because we have a
145:08 killer GPU instance right now. I could run much larger LLMs if I wanted to. Um,
145:14 so yeah, let's see. What did I just say? All right, we'll do another test here.
145:17 And yeah, look at how fast that is. It's blazing fast because everything is
145:20 running locally on the same infrastructure. There's no network
145:24 delays. And so we have a powerful GPU, no network delays. We get some blazing
145:28 fast responses from these LLMs right now. And so I don't want to go and test
145:34 everything with N8N again. But what I do want to show you how to do right now is
145:39 take the Python agent that we have in this repository and deploy this onto the
145:44 cloud as well. Adding it into the local AI Docker Compose stack just like we did
145:49 on our computer, but now hosting it all in the cloud. So that's the very last
145:52 thing that I want to cover with you for our cloud deployment. And so just like
145:56 with the local AI package, we can follow the instructions here in the readme to
145:59 get everything up and running on our machine. And so the first thing we have
146:03 to do is we need to clone our repository. And so I'm going to copy
146:07 this command, go back over into my terminal here for my instance. And I
146:11 step back one directory level by the way. So I'm now at the same place where
146:14 I have the local AI package so we can run them side by side. So I'll paste
146:19 this command to clone ottomator-agents. And then I can cd into it. And then I
146:22 also want to change my directory within the Python local AI agent specifically.
146:28 And so now doing an ls -a we can see the .env.example. So I'm going to, just like
146:33 we did before, copy this and turn it into a .env. And then I can do nano .env.
146:39 And there we can edit all of our environment variables. And so because
146:42 we're running this in the Docker container, attaching it to the local AI
146:46 stack, the way that I reference Ollama is going to be just calling out the service
146:51 name. So http://ollama, port 11434, /v1. And then the API key is just that placeholder there for Ollama. For the
146:58 LLM choice, if I want to get the exact ID of one that I already had pulled,
147:01 I'll actually show you how to do this really quick. So, I'm going to do
147:06 Ctrl+X, Y, Enter to save and exit. And then the way that you can exec into a
147:12 container is docker exec -it and then the name of our container, which
147:15 is ollama. We already have this running. And then /bin/bash. And so what this is going to do is now,
147:22 instead of being within our machine, we are within our Ollama container. And so
147:28 now I can run the command ollama list and then I can see the LLMs that I have
147:33 available to me. So I have Qwen 2.5 7B. So I'm going to go ahead and copy this
147:37 ID. I don't have it memorized. So this is my way to go and reference it really
147:41 quickly. And then this is also how you can access each of your containers when
147:44 you don't have Docker Desktop. You just do docker exec -it, the name of your
147:50 container, and then /bin/bash. So kind of like how we had that exec tab in Docker
147:53 desktop. And then once I'm done in here, I can just do exit. And now I'm back
147:58 within my host machine, my GPU droplet. So that's another little tidbit, another
148:02 golden nugget I wanted to give you there. But yeah, we'll go back into our
148:05 environment variables here. And I have that ID for Qwen 2.5 7B copied. So I'll
148:12 paste that in. And boom, there we go. And then for our Supabase URL, it's
148:17 going to be http:// and then it is kong. So I guess that I
148:21 should have been more clear on this when I set things up locally. So I'll be sure
148:24 to update the documentation for this, but it's going to be kong, port 8000,
148:29 because Kong is the service that we have in Supabase specifically for the
148:33 dashboard. And then the service key, well, I'm just going to go ahead and get
148:37 that from my local AI package because I have this set up in environment
148:40 variables. And so I just have to go and reference my environment variables here
148:46 to get my service role key. And boom, there we go. Okay. So I'm going to go
148:49 ahead and paste this in. And um now I'm just going to delete this instance
148:51 after. I don't really care that I'm exposing this right now. And then for
148:57 the SearXNG base URL, it's http://searxng, port 8080. And then for my bearer token, I
149:01 just have it set to test off. And then we don't need to set the OpenAI API key
149:05 because that was just for the OpenAI compatible demo earlier. So that is all
149:09 of my configuration for this container. I got to be really clear on this. I'll
149:12 update the docs for this. But uh otherwise we are looking good for our
149:17 environment variables. So Ctrl+X, Y, Enter to save. You can do a cat .env just
149:22 for that sanity check to make sure that everything's saved. We are looking good.
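Put together, the values configured above look roughly like this as a .env sketch. The variable names here are illustrative assumptions — use whatever names the agent's .env template actually defines, and substitute your own model ID and keys:

```shell
# Illustrative .env sketch — variable names are assumptions; values follow the walkthrough
LLM_MODEL_ID=<id copied from ollama list>   # the Qwen 2.5 7B model ID
SUPABASE_URL=http://kong:8000               # Kong, Supabase's gateway, by service name in the Docker network
SUPABASE_SERVICE_KEY=<your service role key>
SEARXNG_BASE_URL=http://searxng:8080        # SearXNG by its service name
BEARER_TOKEN=<your bearer token>
```

Note that the hostnames are Docker Compose service names, not localhost — that only works because the agent runs on the same Docker network as Supabase and SearXNG.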
149:27 All right. And so now within the readme here, so I'll go back to the
149:30 instructions. We did all this already. We changed our directory. We set up our
149:33 environment variables and configured them. Now we need to run this stuff in
149:38 our SQL editor in Supabase. This is how we can get our table set up, because we
149:42 haven't run things with n8n first, so we don't have this table created already.
149:47 And so now I just have to sign into Supabase here. So I've got my username,
149:51 which is supabase. And then I'm just copy and pasting the username and
149:55 password that I have set here in my environment variables. And so
150:00 I'll go to my SQL editor and go back here. I know I'm moving kind of quick
150:03 here, but I got these instructions laid out in the readme. I'm going to paste
150:08 this like that and go ahead and click on run. And then boom, there we go. So now
150:12 if I go into tables and search for n8n, we have n8n_chat_histories, a new,
150:16 currently empty table. All right, looking good. And then going back, after
150:21 we do that, now we can run the agent. So I just have to take this command right
150:24 here and then I'll go back to my droplet. And the one thing that I
150:28 mentioned earlier, but I want to cover again. Hold on, I
150:33 need to change my directory back. So, automator agents and then python local
150:38 AI. If I go into my docker compose, you have to make sure that the include path
150:42 is correct. And so, I'm going to update this by the time you get your hands on
150:44 it here where it's just going to be going two levels back. That's what we
150:47 need to do. So, make sure that we reference the right path to the local AI
150:52 package on our machine, and then Ctrl+X, Y, Enter to save. That's because we have
150:57 to go back from Python local AI agent, then back from the automator agents
151:00 directory, and then within that same directory, we have the local AI package.
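To see why two levels back is right, here is a quick sketch using throwaway directories — the folder names are illustrative stand-ins for the real ones on your droplet:

```shell
# Recreate the layout: the agent sits two levels below the directory
# that also contains the local AI package.
base=$(mktemp -d)
mkdir -p "$base/automator-agents/python-local-ai-agent" "$base/local-ai-packaged"

# From the agent directory, ../../ climbs back to $base, where the package lives,
# so a compose include path like ../../local-ai-packaged/... resolves correctly.
cd "$base/automator-agents/python-local-ai-agent"
ls -d ../../local-ai-packaged
```

If `ls` resolves the path, the include path in the docker compose file will resolve too, since both are evaluated relative to the agent's directory.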
151:04 So, we're good to go. Now, I can go ahead and paste in the command here to
151:08 build our agent and include it in the local AI stack. And so, it's going to
151:11 have to build everything. Takes a minute like we saw already. So, I'll pause and
151:15 come back once this is done. And all right, there we go. About 30 seconds
151:18 later and we are good to go. So, now I can do the docker ps -a again. And this
151:23 time if I look through this list very carefully, take a little bit here, I
151:28 should be able to see my Python agent. There we go. Local AI, Python, local AI
151:32 agent. And it starts with local AI because it is a part of that docker
151:36 compose stack. And by the way, I can do docker exec -it with the name of the
151:46 Python agent container and then /bin/bash. I can run this as well. And then what I can do is if I do
151:51 a printenv command, I can see all the environment variables that are set
151:54 within this container. That's everything that we set up in the .env. So I'm being
151:59 very comprehensive with this master class, showing you how you can tinker
152:02 around with different things like accessing your containers and seeing the
152:05 environment variables, making sure that everything that we specified in the .env is
152:09 actually taking effect here. And sure enough, it is. So we are looking good.
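That check, sketched as commands — the docker lines are shown as comments since they need the running stack, and the container name is an assumption based on the service name used here; printenv itself behaves the same on the host, as the runnable lines show:

```shell
# Inside the stack (needs Docker; container name is an assumption):
#   docker exec -it python-local-ai-agent /bin/bash
#   printenv        # lists every variable loaded from the .env file
#   exit            # back to the host

# printenv works identically on the host — quick demo with a hypothetical variable:
export DEMO_VAR=from-dot-env
printenv DEMO_VAR   # prints: from-dot-env
```

This is a handy sanity check whenever a container seems to ignore your .env changes — if the value doesn't show up in printenv inside the container, the container was not rebuilt or the variable name doesn't match.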
152:13 So I'll go ahead and exit. We're back on our host machine now. We have our
152:18 container up and running, and it's running on port 8055. And so now we can
152:23 go back to Open WebUI at our subdomain, and we can set up our pipe. And so I'm
152:30 going to go to the admin panel functions. We don't have a function
152:33 here. So I have to import it. I'll actually do this
152:36 here: you can literally Google n8n pipe Open WebUI, and it'll
152:40 bring you to the one that I have here. You just have to sign into Open WebUI.
152:44 I'll click on get. And then this time for my URL instead of being something on
152:49 localhost, I'm going to copy my actual subdomain here. So, import to Open WebUI,
152:53 and then boom, we have our pipe. So I'll click on save, confirm, and then within
152:59 the valves here, I can set all of my values. So I'll just click on default
153:02 for all of these so I can get a starting point here. And then, yeah, chat input is
153:07 good, output is good, the bearer token is test off, and then for my URL, it's going
153:14 to be http:// and then the name of my service, python-local-ai-agent,
153:24 port 8055. Let me get that right: 8055, and then /invoke-python-agent. I believe I
153:32 have this memorized. I think we are good there. So, going back, if I clear this
153:37 and run a docker ps -a, it is indeed called python-local-ai-agent. That is
153:42 the name of our service, so Open WebUI is able to connect to the agent directly
153:46 with this name, because we are deploying it in the same Docker network. And so I
153:51 think we are looking good. All right, so I'm going to go ahead and click on save.
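For reference, the endpoint the pipe targets can be assembled from the pieces above: the Compose service name, the port, and the invoke path. The curl line is only a sketch — the headers and request body depend on the agent's API, so treat those as assumptions:

```shell
# Build the in-network URL from its parts
service=python-local-ai-agent     # Docker Compose service name doubles as the hostname
port=8055                         # the port the agent container listens on
path=invoke-python-agent          # the agent's invoke endpoint
url="http://${service}:${port}/${path}"
echo "$url"                       # http://python-local-ai-agent:8055/invoke-python-agent

# From another container on the same Docker network you could then call it, e.g.:
#   curl -H "Authorization: Bearer $BEARER_TOKEN" "$url"   # auth/body shape are assumptions
```

The key point is that this hostname only resolves inside the shared Docker network — from outside the stack you would go through whatever port or reverse proxy you have exposed instead.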
153:56 All right, and then go back and start a new chat. And also, like I said, a lot
154:00 of times it helps just to refresh Open WebUI
154:03 completely. All right, there we go. And then now, instead of... Oh, I have to
154:07 actually enable. Let me go back to the admin panel. Functions, you have to make
154:11 sure this is ticked on so that we have the pipe enabled. Now, going back
154:15 here and I'll refresh as well. We've got our pipe selected. And now I can say
154:20 hello. And there we go. Super super fast. We got a response from our Python
154:26 agent. Take a look at that. And then also going into my database here, you
154:29 can see that I have all these messages in the n8n_chat_histories table. We'll
154:33 take a look at that. All right. And then we can also ask it to do web search. I
154:38 can say, like, what is the latest LLM from Anthropic, for example. So it has to
154:44 do a quick SearXNG search, leveraging that. The latest is Claude Opus 4. All
154:49 right. And man that was so fast. We have no network delays now because everything
154:53 is running on the same network and we have an absolute killer GPU. So this is
154:56 so cool. Also, one thing that I want to mention is sometimes depending on your
155:02 cloud provider, SearXNG will not start successfully. There's one thing you have
155:05 to do. It's just a really small tidbit. If you run into this issue where the
155:09 SearXNG container is constantly restarting, what you want to do is go to
155:14 your local AI package and then run the command chmod 755 searxng. That's the
155:22 SearXNG folder, and the SearXNG folder is responsible for storing the
155:26 configuration that we have for SearXNG by default. Sometimes it doesn't have
155:29 permission to write this file, and it needs to do so. So I'm going to update
155:33 the troubleshooting to include this. But yeah, just a small tidbit. And then you
155:37 can just go ahead and run the command to start everything again. Obviously,
155:41 you have to go back one directory, then you can run this and restart
155:45 everything. It's that easy to restart things and make changes take effect for your
155:49 package and then you'll be good to go. So yeah, we have everything working
155:54 here. So this is pretty much it for the master class. Now we have our local AI
155:58 package up and running with an agent and the network as well. We're communicating
156:02 to it within Open Web UI directly. There is so much that we have gotten through
156:07 now. So congratulations for making it this far. All right, I'm going to be totally
156:12 honest. This master class was very hard to make, but it was so worth it.
156:16 And I hope that you got a lot out of this. We really covered it all. All the
156:21 way from starting with what is local AI and why we should care about it to
156:25 deploying it on our machine, building agents, deploying it to the cloud, and
156:28 configuring everything with DNS. Like man, we basically did everything you
156:32 could possibly need to get the foundation laid out to build anything
156:36 that you want with local AI and local AI agents. And so the very last thing that
156:40 I want to cover here is just a couple of additional resources that I have for you
156:45 now that you know how local AI works and how to get it set up. You want to dive
156:48 into building more complex agents with it now. And so there's a few things that
156:52 I want to call out for you. So starting with my YouTube channel, I have a lot of
156:56 videos on my channel diving more specifically into building more complex
157:00 AI agents with local AI. And the main resource that I want to point you to
157:03 right now if you really want to go deeper into building agents with local
157:08 AI is the Ultimate n8n RAG AI Agent Template, local AI edition. And so this
157:13 is using the local AI package, and I dive really deep into RAG and local AI, which
157:17 was outside of the scope of this master class because that's more about building
157:21 agents versus setting up local AI. But this is a great video to dive into.
157:25 And then also, I've got to call out the Dynamus community again, because man, I
157:29 put so much effort into building local AI into a core part of this course here.
157:34 And so, like I said at the start of this master class, when I build the full
157:38 agent out throughout the AI agent mastery course, local AI is an option
157:42 the entire time and I show exactly how to set up everything for local AI using
157:47 the local AI package. Like, I really have this ingrained into everything in
157:51 Dynamus and in my YouTube channel. This local AI package is the core of
157:56 everything that I do with local AI. So great resources for you. With that, that
158:01 is everything that I have for this master class. So I know this is my third
158:05 time saying it, but congratulations if you made it this far. You now have what
158:08 it takes to really build anything that you want with local AI and you can use
158:12 these additional resources to go much further as well. So I hope to see you in
158:15 the Dynamus community. Let me know in the comments if you have any questions
158:19 on anything that I dove into here, because I know that it is a lot to
158:23 digest, but I'm trying my best to make it as digestible as I possibly can. So,
158:27 with that, if you appreciated this master class and you're looking forward
158:32 to more things local AI or AI agents, I'd really appreciate a like and a
158:35 subscribe. And with that, I will see you