This is by no means any form of detailed tutorial, but I did pick up a few things about locally hosted AI while I’ve been testing the last few days, so I thought I’d share some of what I’ve picked up.
I did upgrade my GPU card recently to 12 GB VRAM, mainly to better accommodate a game I sometimes play and to run AI functions on DaVinci Resolve Studio, which was struggling on my previous 6 GB VRAM card. But I realised my GPU is mostly idle, and seeing my home server has no decent GPU in it, I could maybe put my GPU to some better and more useful purpose. I also realised that many AI models could fit in even 8 GB or less of VRAM. I had recently upped my Google One subscription to the AI Plus option, and wanted to see how locally hosted AI compares with that. Not only that, but I’ve been testing out that Gemini option to see if it will replace my paid Canva subscription service.
TL;DR for all of this is the locally hosted AI is not quite up to the standard of Gemini AI Plus. Probably a 24 GB VRAM card would do a lot better as you can fit way bigger AI Models into it and even run more than one AI app at a time, but those cards are super pricey. As someone said recently, you either pay for the AI subscription service, or you pay for your own hardware!
I started out with paid Gemini AI Plus and was exploring NotebookLLM. NotebookLLM allows you to upload or insert your own sources into a project (notebook) and then run queries, create an audio podcast, infographics, etc based on just those sources without general hallucinations. I used it to research my last YouTube video and to create the infographics and the thumbnail for the video. So considering what I all got out of Gemini, yes I’ll probably drop to the free tier on Canva soon. I also found an excellent Linux app that runs locally called ‘rembg’, which removes backgrounds from my images. This is because Gemini does not actually edit existing images.
AnythingLLM and Open WebUI
So this all led me to start looking at locally hosted general AI (chat) tools. I ended up really tossing up between AnythingLLM and Open WebUI. I have not fully decided between them, although I’m leaning toward Open WebUI mainly because it has no Electron browser built in (you open its webpage from your existing browser), it can run in a Docker container, and it interfaces very well with a self-hosted backend Ollama service. AnythingLLM is nicely self-contained in an AppImage (whereas Open WebUI wants a Python install into a Python virtual environment), and it has some excellent built-in connectors for Gmail, Google Calendar, Outlook mail, SQL databases.
Both have folders to organise your chats into, and both have knowledge bases (like NotebookLLM) where you can upload your own source documents and scrape webpages (only AnythingLLM has a refresh button for webpages whereas Open WebUI wants you to remove and re-add webpages – this is pretty irritating). So both will allow you to work with just your own sources in a highly private environment (which no online cloud service will do), and to eliminate AI hallucination by focussing just on the sources you provide.
So apart from the front end graphical user interfaces, they need to connect to actual AI models. These can be anything from around 3 GB to 30 GB in size. The larger they are “generally” the more knowledge they have built in, but the point is, they need to load into your GPU card’s VRAM to be effective. Technically they could overflow into the desktop’s RAM, but then it slows down dramatically, or if you have a second GPU there is bridging software that would allow the AI model to work across more than one GPU card. Different AI models have different compression ratios and layering techniques as well, and these improve monthly or weekly, and there are many of them. So AI Models are better at general chat, some are better at coding, others are better for deep thinking, etc. So the really difficult choice is which AI model/s do you pick? I don’t have any easy answer for that.
The easier part for me was to decide what back-end AI model service to use. It seems Ollama is the one to go for. It can run as a backend service on Linux for me, and uses very few resources when idling. I could also set a sleep period so if I stopped using it for 5 mins, it would flush itself out the VRAM and go into idle mode. Ollama can handle many AI models, and you can also install many of them at the same time, and it will switch between what you want to use. I found I could run both AnythingLLM and Open WebUI, and both could connect to Ollama, as long as I did not run queries from both apps at exactly the same time. I adapted my Conky script to show exactly how much VRAM was actually in use on the GPU. Importantly too, when using Ollama and there are updates for a large AI model, Ollama can do delta updates to just download the changes.
Conky widegt showing VRAM in use on GPU
Right now I have the following two AI models installed on Ollama: mistral-nemo:latest using 7.1 GB for general chat, and deepseek-coder-v2:lite using 8.9 GB for deeper thinking and coding stuff. The reason I went with these sizes was to leave some overhead for my browser and various other desktop apps that also use GPU acceleration. Keeping it well below the 12 GB VRAM level also meant that there was less risk of a spill-over into the much slower desktop memory.
Whereas Open WebUI works more exclusively with Ollama, AnythingLLM can do the same, but it can also install its own models as well. It just makes more sense though to use Ollama, not only because it is more flexible to serve multiple applications, but it manages multiple AI Models better and more efficiently.
But you are also not just limited to using what is installed in Ollama. Both apps will allow you to connect to cloud based AI tools like ChatGPT, Gemini, etc. So I also added such a connection to the Gemini API to generate AI images. This does a truly great job but what I discovered was, the API connection does NOT use my existing Gemini paid subscription. That must be why, when you configure the API, you must link it to a billing account is Google’s AI Studio. You are charged separately for API usage. The text queries would be super cheap but image generation could cost around R1 per image (US$0.06 or so) and up to more if you use the Nano Banana Pro model. The results are superb, and interestingly they come back without any Gemini watermark on them. The featured image on this post was generated this way from inside Open WebUI. Something else I learnt was that the watermark is not the only way that AI images are recognised. The actual pixels carry identification codes too, but more on that later.
Why bother with offline tools?
So why bother using locally hosted LLMs for AI? Well privacy is often the reason, both for your source documents and what you are searching for and discussing. In some cases cloud based AI tools will also censor your queries, like maybe a country that bans discussions about abortions or explosives making, etc. Having all your discussions and outputs locally saved also means they are available to you 24/7 even if you leave one cloud provider for another. With locally hosted AI tool you can switch between online or offline models and keep re-using your stored queries and outputs.
Disasters and No Internet
A vitally import other reason for wanting to use locally hosted AI is for being able to use it without any Internet connectivity at all during for example a disaster scenario. Emergency responders or a disaster management centre often have volumes of PDF guides for what to do in different circumstances. Some documents are 50 plus pages long. They need to know how to respond quickly and correctly whilst under pressure (yes I know training is the key tool here) and an offline AI that already has all the training documents and manuals uploaded into it, is instantly available to answer questions about what to do for what scenario, what radio frequency to use during the day versus at night, etc.
That will work offline when there is no Google search or Office 365 available to access, and quickly present some bulleted steps to be followed by responders. Those documents will be secure and private, and AI hallucination is less likely as it answers only from the sources uploaded to it. Google’s NotebookLLM is also ideal for this, but you have to be online to use that.
A downside, by default though, is that the models you download have an end date for their training. For example, when I asked one of my models for today’s news headlines, it spat out headlines and the exchnage rate for a date in 2024 because that was its end date for training. Even activating web search still had it stuck in 2024. You can override this behaviour by force prompting it to override its end training date, but it will need to do a web search (or more accurately the chat will trigger a web search, and the model will interpret the results, and word it nicely for you).
This all goes to show though, that you as an AI user, do need to learn a lot about its capabilities, its limitations, and what sort of choices to make, so that you can use it more effectively. Yes, this is why many cloud based AI services could be easier and more up to date to use. Self-hosted AI tools don’t really compete directly with their cloud based counterparts, as they are not that up to date, and most of us don’t have the computer horsepower to install such massive models. Which is why we’ll probably see the rise of real AI Engineering as qualifications in future. AI is not the same as simple Google Search, and especially so if you are giving AI control to create, edit or delete documents or within any computer system.
Above is my Open WebUI interface. Yes, it does look a lot like ChatGPTs interface, and apparently this is intentional. But what you can also see is a link on the left pane to Notes. I find this quite useful to have a few notes to remind me about how to construct prompts, which models to use for what, and other snippets of useful information. I also have some Workspaces setup, each having their own source documents relevant to that project. The model in use is shown at the top, and I can easily switch to using other models there.
AnythingLLM User Interface
Above is AnythingLLM’s user interface which has a little less info on it. Open WebUI certainly has more configuration options available. Notable though are that AnythingLLM can default to having we search active if you have specified what search engine to use (I pointed it to my own self-hosted SearXNG search engine) and it also has an assistant mode that can be triggered by a key combination for quick use.
So let’s move onto AI image generation (because that is all I have learnt so far about the AI chat side).
Easy Diffusion for Image Generation
I started out with Easy Diffusion (the front end up that uses Stable Diffusion). It’s claim to fame is real easy use with one screen that has a few sliders on it and image modifiers that you just pick (like cartoon or sketch style, panoramic or cinematic photo style). It is quick and easy to install and use, but for that, you have less control over the output. Well it is nearly easy to install, but I needed to create a virtual Python environment, and do the Python installation. The reason was that Python 3.12 is the most stable for AI, whilst my EndeavourOS system’s Python is at version 3.14. Some nodes and functions dod not work well with the very latest versions of Python.
The Stable Diffusion family is the standard, most mature ecosystem, comparable to the “Llama” of the image world. There are several major iterations, namely Stable Diffusion 1.5 (SD 1.5) released in 2022 which has lower requirements (and
native resolution of 512×512 pixels), and then Stable Diffusion XL (SDXL 1.0) which is the current “modern standard” (Native resolution of 1024×1024). I did quickly realise when I switched to using the Juggernaut XL model (which is part of Stable Diffusion XL) that it was way better.
AI image generation can get extremely complex and understanding the terminology gets even more important as you move closer to generating literal Hollywood-style movie clips. But little need to worry about that with Easy Diffusion.
To get the AI models to use in Easy Diffusion (no, these models are image ones and don’t run in Ollama) I just downloaded them to the correct folder for Easy Diffusion to see. A key tip for both apps is that you only apply fixing of faces or upscaling of resolution, AFTER you’ve done the initial generation. Splitting the process into two steps allows for less VRAM usage. So I generally generate to 1K resolution and then upscale to 2K or 4K.
Easy Diffusion User Interface
The Easy Diffusion interface is show above. It can be as simple as filling in the test prompt and hitting Make Image button. What I’ve highlighted is where you choose which model to use. Here I’m using the Juggernaut XL model which does not need LoRA modifiers added. A LoRA (stands for Low-Rank Adaptation) is a mathematical technique to influence the large rmodel to do something. For example with the simpler AI model I used before, when I asked for a Suzuki Jimny, I got a rough appriximation of a 4×4 vehicle with garble dtext on the numberplate. By searchinga nd downloading a LoRA specifically for a Jimny, I got an accurate Jimny vehicle in my image when I used that LoRA’s Jimny_SIERRA trigger word in the prompt. Moving to the Juggernaut XL model though, I no longer needed the LoRA as Juggernaut XL knew a lot more. A good analogy of LoRA is that the big AI model is like a cruise ship, and a LoRA is like a tug that will guide or move the cruise ship accurately into position.
ComfyUI for Image Generation
But there are limitations with Easy Diffusion. If you want true control over the imaging process and want to be precise about what you create, whether an image or video, then you really need to move to a node-type interface. ComfyUI is such an interface and again, like Ollama, has become very well used. ComyUI has a Manager app inside it which will actually find and download the models you need, and even identify and find any missing nodes you need. Anothe rplus I discovered, was that I could point ComyUI to external AI models I already had. So instead of downloading the 8 GB of Juggernaut XL model all over again, I just pointed it to where I already had the model installed for Easy Diffision.
ComfyUI’s user interface can be seen above. Yes the nodes look complication, but you only need a few common one’s to get going. You can save the setup and just chnage your prompt and the image size nodes when you need to. Each node performs a different function but there generally is a node for which AI model to use, what the image dimensions are, what text prompts to use, one to convert the latent image data to pixels, and one to save the image. You’ll see too that inputs to a node are on the left, and outputs are on the right side of a node. Different colours for links, as well as labels, help guide which should connect to which.
A tip here is, if you have anyone’s image that was generated in ComfyUI, and you drag it onto an empty canvas in ComfyUI, it will instantly create all the nodes and text prompts used for that image. You can then just modify what you want to use from there.
What is complicated, is that each AI model has it’s own node requirements, so switching models usually means getting used to using different nodes. The common ones will stay the same usually, and the KSampler one is always the heart of the engine. The text prompt nodes will have a positive text prompt (things to include) and a negative text prompt node (things to specifically exclude).
The image above was generated using the identical text prompt that I used for Google Gemini (this post’s featured image at the top). It looks slightly different, but it is still pretty good considering this was 100% offline and I can generate 100 of these at zero cost and without any limitations. It took about 1 minute to generate and I always find the subsequent ones are much faster as the big AI model is already loaded into the GPU VRAM.
In case you are wondering, yes you can add connectors in AnythingLLM and ComfyUI (I think both) to connect to ComfyUI and generate an image from your offline AI chat tool. What you just need to be careful of, if you don’t have a 24 GB VRAM GPU, is not to have just finished an Ollama based query so that Ollama has released its use of the GPU.
Something that did irriate me a bit with the three Python based apps, was you need to start them from a terminal whilst being in the correct folder, and ensuring you use the correct Python version.
Yes, you could easily create a small shell script to run the three commands or so to do that, but that also could mean having terminal screens open (so that you can hit ctrl-c to stop Python) and it all looked a bit messy.
What made this all a lot easier and more transparent, was to create systemd services for each one. That allows passing th environment variables, being in the correct working folder, executing the correct Python, and hiding everything from the user. So now all that had to be run was ‘systemctl –user start open-webui.service‘ or ‘systemctl –user stop open-webui.service‘.
Once I had the systemd services set up (using of course AI to help me) I also programmed some of my Stream Deck’s buttons to not only start and stop each app, but also to show a green or red background to indicate if they were running or not.
Above is what my Stream Deck page looks like for my AI apps. A short press will start that app service up and it will turn green when running, and a long press will stop the service and the background of the button changes to red. The triggers to check the status will only run when this page is open on the Stream Deck. Actually I could also add a button to show the VRAM useage as well if I wanted to!
So yes, all in all very interesting and I’ve learnt a lot in the last few days, even if that means knowing how little I actually know of locally hosted AI. For now I’m keeping my paid cloud AI service, but I’ll continue to see how I can use the completely free locally hosted AI more. It is anyway a moving goalpost. I’m going to certainly expand more on how I can use this when offline, also for making use of my more sensitive documents that I do not want in the public cloud anywhere.
I’m also going to be looking at the creation of short 5 to 8 second video clips that I’d like to use in my YouTube videos, as well as animating old family still photos (like the paid Photomyne service does).
And there are some things that Google Gemini does amazingly well, like creating 30 minute dramatised audio poadcasts from NotebookLLM source documents. I’m actually planning to publish one of those as a YouTube video soon, and just add my own images to sync with the discussion.