Maybe the open source AI versions that can be installed separately and configured freely can become more useful than ChatGPT. Though I have no idea how much computing power that would require or how the open source projects currently compare to ChatGPT.
I apologize in advance for the length of this post, but I tried to condense it without excluding important info for those interested in the topic.
You can pretty easily run open-source AI (LLMs) on your personal computer these days, and it's simple enough for anyone to set up and play around with. It will run much faster if you have a good graphics card (NVIDIA or AMD), but it works just fine on a regular CPU and RAM, only slower.
TL;DR - here's the super basic version for those who just want to do stuff and don't want to read anything.
1) Download koboldcpp.exe from here:
Releases · LostRuins/koboldcpp
2) Download Tiger-Gemma-9B-v1-Q4_K_M.gguf from here:
bartowski/Tiger-Gemma-9B-v1-GGUF at main
Put both of them in the same folder.
3) Run koboldcpp.exe, click Browse in the window that comes up, point to the Tiger-Gemma-9B-v1-Q4_K_M.gguf file, and click "Launch"
This will open a new browser tab where you can talk to the model
4) Before talking to it - click Settings, change "Usage Mode" to "Instruct Mode" and change "Instruct Tag Preset" to "Gemma 2"
Try talking to it now.
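Quick note on why that settings step matters: instruct-tuned models expect their input wrapped in a specific prompt template. If memory serves, the "Gemma 2" preset wraps whatever you type roughly like this (double-check against the model card if the output looks off):

<start_of_turn>user
Your message here<end_of_turn>
<start_of_turn>model

If the template doesn't match what the model was trained on, it will still respond, but the quality usually drops noticeably.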
If you had problems loading the model because you didn't have enough RAM or something, try a smaller model:
Download gemma-2-2b-it-Q6_K.gguf from its model page on huggingface.co.
Once again, put it in the same folder as koboldcpp.exe, run the .exe, point to the model file, click Launch, and wait for the browser tab to open. Same settings as before. This one is small enough to work on a potato of a computer.
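Side note: you don't have to click through the launcher GUI every time. Koboldcpp also takes command-line flags (run it with --help to see them all; the exact flag names may shift between versions, so treat this as a sketch):

koboldcpp.exe --model Tiger-Gemma-9B-v1-Q4_K_M.gguf --contextsize 4096 --gpulayers 20

--gpulayers controls how many of the model's layers get offloaded to your GPU; leave it out (or set it to 0) for CPU-only.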
If you want to understand what the heck you're doing and why you're doing it - read on!
There are several ways people run LLMs on their personal machines, but I will focus on just one, arguably the most popular and simplest method: using GGUFs, which are quantized versions of the models. First, a quick primer on what that is and why people do it that way. LLMs require a LOT of memory and memory bandwidth to run. The typical way to run them is with Python and the transformers library, but the models are huge, and doing it that way often requires a machine beyond most people's budgets. So the open-source community came up with ways to "quantize" the models - think of it as lossy compression. LLMs are made of weights, and the community takes the original float-32 (a level of numerical precision) weights and reduces their precision to float-16, int-8, int-4, and even lower bit widths. Oddly enough, this only reduces the model's intelligence/accuracy very slightly (down to about int-4, after which it kinda tanks for smaller models), but it greatly reduces the hardware requirements to run the damn things. These days you can run a model on a Raspberry Pi, or your old laptop from 2008. It will be a smaller model, but it will run!
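If you want to see the basic idea of quantization in concrete terms, here's a tiny toy sketch in Python (numpy only). It's just an illustration of simple int8 rounding with a single scale factor; real GGUF schemes like Q4_K_M are fancier (block-wise, with per-block scales), but the principle is the same:

import numpy as np

# a pretend float-32 "weight matrix", like one layer of an LLM
w = np.random.randn(4, 4).astype(np.float32)

# quantize: map the floats to 8-bit integers plus one scale factor
scale = np.abs(w).max() / 127.0               # one scale for the whole block
w_int8 = np.round(w / scale).astype(np.int8)  # 4x smaller to store than float-32

# dequantize at inference time: close to the original, but not exact
w_restored = w_int8.astype(np.float32) * scale
print("max error:", np.abs(w - w_restored).max())

The stored weights shrink to a quarter of the size, and the small rounding error is the "dumbing down" people talk about.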
With that out of the way, here's how you can do it yourself with no technical knowledge required using completely open-source tools (and no installation or programming involved).
First you need a client that can run GGUF models. We'll grab the models themselves in a minute, but a really good and popular open-source client designed to run GGUF quantized models is this one:
LostRuins/koboldcpp on github.com - "A simple one-file way to run various GGML and GGUF models with a KoboldAI UI"
As of this post, 1.72 is the latest version (but always grab the latest available, and update it once in a while, as newly released models sometimes require an updated client to run).
For Windows, one of these 4 should probably work depending on your GPU and CPU:
koboldcpp_cu12.exe
koboldcpp.exe
koboldcpp_nocuda.exe
koboldcpp_oldcpu.exe
The instructions for which one to choose are at the above link as well, but here they are:
To use, download and run koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe, which is much smaller.
If you have an Nvidia GPU but an old CPU and koboldcpp.exe does not work, try koboldcpp_oldcpu.exe.
If you have a newer Nvidia GPU, you can use the CUDA 12 version, koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary instead (not an exe).
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork.
Ok you got the .exe file, now what?
Now let's talk models!
First of all, there is no such thing as a truly "open-source" LLM. They get called that, but they're really just open-weights. To be properly open-source, the training data would need to be made available, along with the code for the model architecture and a few other things like the fine-tuning data, so that anyone could reproduce the work and train the same model from scratch on the same data. The companies don't give us that; they just release the weights so we can use the models privately on our devices, but not re-create them ourselves.
Second, no one in the community trains these things from scratch anyway. They cost millions of dollars to train (usually tens or hundreds of millions, with the latest reportedly around a billion), and so far only a handful of companies have been able to do this well, and fewer still release the weights to the open-source community. The models come with licenses, like the MIT license, which allows any and all uses, including commercial. Some models have slightly more restrictive licenses, but the restrictions only apply to commercial use. None of that really matters for personal use, so don't worry about it.
Some models are from the US, others are made by Chinese companies, and there's a really good French company too. That's kinda it right now.
The US companies that release open-weights models to the community: Meta, Microsoft, Google.
The French company: Mistral
Chinese companies: Not sure about company names but the models start with Yi, DeepSeek, Qwen, and some others I forgot off the top of my head.
What the open-source community does is take these models and find ways to run them on personal hardware, by creating methods to "compress" them first and then building clients that can run the compressed versions (like the Koboldcpp client above, which runs the GGUF-format quantized versions). The community also fine-tunes these models. Fine-tuning isn't the same as training: it doesn't really add knowledge to the model. It's a technique that modifies some of the model's weight layers (usually only a few of the top layers, since continuing full training yourself is computationally prohibitive), and it mostly changes the "style" of the model's output. It also has the effect of coaxing out knowledge that is suppressed or buried in the base model. For example, a model can be fine-tuned to be much better at medical diagnosis than it is out of the box: the fine-tune helps it get in touch with the medical knowledge it already has better than the generalist tuning it shipped with, and guides the style of its answers in whatever way makes sense for the purpose.
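For a concrete picture of the "only touch a few layers" idea, here's a toy PyTorch sketch (my own illustration; real LLM fine-tunes these days mostly use LoRA adapters through libraries like peft rather than literally freezing layers like this):

import torch.nn as nn

# stand-in for a pretrained network
model = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),   # the "surface" layer we want to adapt
)

for p in model.parameters():
    p.requires_grad = False      # freeze the whole model
for p in model[-1].parameters():
    p.requires_grad = True       # unfreeze only the last layer

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"training {trainable} of {total} parameters")

Only a small fraction of the weights ever gets updated, which is why fine-tuning is affordable for hobbyists while full training isn't.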
Another very important reason for fine-tuning that the community does is to uncensor the models. The models coming from these big companies are all censored (because the companies don't want to be sued, and also because businesses who use these open-source models need them to behave when exposed to employees or clients, and never delve into "unwanted" topics even if the user asks them to).
Once a model is uncensored by the community, it generally won't refuse to answer anything or talk about any subject, be it violent, illegal, sexual, or whatever. It will gladly do medical diagnosis or legal advice, or teach you how to break into a car, etc. Very important note - un-censoring a model doesn't change its knowledge, it simply removes the layer of censorship that prevents it from accessing the knowledge it already has. So if the model thinks there are 10 genders because that's how it was trained, un-censoring won't magically make it objective or admit there are only 2. It just won't refuse to talk about any topic based on whatever training about the topic it has. Also, this should go without saying, but don't rely on the models for legal or medical or financial advice for obvious reasons - cuz they're often stupid and wrong.
One final thing before we get to the model downloads. GGUF files, as mentioned earlier, are compressed LLMs. At each Hugging Face link you will see a bunch of GGUF files; each file is the same full model, just compressed to a greater or lesser degree.
For example, here's a really good recent model (uncensored version):
(model page on huggingface.co)
Click on the "Files and Versions" tab and you will see all the files.
As the files go up in size, they are less compressed, but they will use more RAM (or video RAM, if using a GPU) and run slower. A larger file is also closer to the original model, so it is less likely to be "dumbed down" by the compression. The dumbing-down effect is minimal all the way down to about Q4_K_M, and gets more pronounced below that.
The size of the file is roughly how much RAM (or GPU VRAM) you will need; you'll need a bit more than that, but it's a good ballpark. Koboldcpp tries to detect a GPU and offload as much of the model to it as possible. It doesn't have to be all-GPU or all-CPU: Koboldcpp can offload however many layers fit in the GPU's video memory and keep the rest in regular RAM, so even a modest GPU helps speed things up.
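A rough back-of-the-envelope (my own estimate, not an official formula): a 9B-parameter model at Q4_K_M stores roughly 4 to 5 bits per weight, so 9 billion weights x ~4.5 bits / 8 ≈ 5 GB, which is about the file size you'll see for the Tiger-Gemma-9B quant above, and roughly the RAM/VRAM you should budget for it, plus a little extra for the context.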
Follow the TL;DR guide above to get it running.
There are a bunch of other models you can try, and the best way to figure out which ones is to check the benchmarking websites. They test the open models against each other in various subjects, like language, mathematics, coding, medical knowledge, reasoning ability, data analysis, etc.
Here are a few benchmarking websites that I personally think do a great job of testing these models:
dubesor.de
livebench.ai
two benchmark Spaces on huggingface.co
Some of the benchmark sites also include closed-source models like GPT-4o and Claude 3.5 Sonnet etc for comparison. The open-source models are quickly catching up in capability to the really big closed-source ones, and have the benefit of running on your device with no internet, completely privately, and often have fine-tunes that remove their censorship.
I've been keeping tabs on these things for a while now, even tried to make a side hustle implementing them for clients (didn't really pan out, business partner wasn't up to snuff), and I'd be happy to answer any questions or help with any technical issues getting them running on your computer. If you need model suggestions I can help with that as well.
A good sub-reddit for learning about this and following the latest info and releases is /r/LocalLlama.
I'm sure I forgot a few things, but this post can only get so big! Important considerations when experimenting with models: make sure you get the prompt template correct in the Koboldcpp settings, and set the context size. There are a bunch of other settings too, but I'm pretty sure you can read about them on the GitHub page I linked to (I think there's a FAQ in there), or just ask here if needed.