How to Run a ChatGPT-type LLM on Your Own Hardware
Xanny.eth
March 25th, 2023

Running a Large Language Model (LLM) on your own laptop can be an intimidating prospect, but it doesn't have to be. With the right tools and techniques, you can make it work. In this post, we will explore the various options available for running large language models on a personal computer and provide practical advice on how to get it done. We will also look at best practices that can help maximise your chances of success.

Yes, that was written by an LLM. And no, it’s not ChatGPT. Instead it was written by LLaMA running fully locally on my laptop. Wanna try for yourself? Read on!

“How the fuck is this possible?”

A good question! I’m amazed myself. I was expecting open source LLMs to develop rapidly, but I wasn’t expecting to be able to run any kind of usable one entirely locally on my laptop for at least another year.

The short answer is Meta created a very efficient LLM called LLaMA. While the inference code itself was already open source, the weights (the trained model parameters) were not. Those weights are the real secret sauce that drives the whole thing. Running an LLM on your machine was already possible, but the experience was closer to old school chatbots than ChatGPT.

So how did that change? Basically, the weights got leaked. Once they did, some gigabrain ported the open source LLaMA code to heavily optimised plain C++ (llama.cpp) that runs on anything from laptops to Android phones right down to Raspberry Pis.

Why are these weights so special, exactly? Because a model with 7 billion parameters is able to not only run on consumer hardware while taking up a mere 4GB of storage, but also perform at a level approaching the actual ChatGPT. For reference, the absolute max number of parameters I can run from standard open source Hugging Face models on a consumer machine with only a CPU is about 1 billion. It’s literally a 7x advancement.
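
That 4GB figure isn’t magic either. As a quick back-of-envelope check (assuming the 4-bit quantisation we’ll cover later in this post), 7 billion parameters at 4 bits each works out to roughly 3.5GB, with the rest of the file being overhead:

python3 -c 'print(7e9 * 4 / 8 / 1e9, "GB")'

That one-liner just does the arithmetic and prints 3.5 GB.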

These super efficient, well trained LLM weights we’re talking about here come from Alpaca (yes, that’s right, LLaMA and Alpaca), Stanford’s fine-tune of Meta’s LLaMA. The raw “instruction data” used to train these Alpaca models was generated by OpenAI’s own models, the same lineage that powers ChatGPT.

Indeed, the reason the weights weren’t public until they leaked, despite the code being open source, is Meta’s research-only license. And OpenAI’s terms explicitly forbid using its models’ output in the development of competing products, because they know very well it’s perfectly capable of doing so. Oops.

Don’t believe me? Follow this tutorial and try it for yourself! I’ll also be peppering in screenshots of some of the cool shit it can do throughout this post. As I do, please remember all of it runs 100% locally from a 4GB file on a regular consumer laptop with a 12th gen i5 and 16GB RAM. Literally a standard mid-range laptop spec. No special hardware, not even a single GPU.

Help, my AI ate my RAM! Oh and also Chrome has like 20 tabs open.

Speaking of GPUs, if you do have one you can run the more advanced models much faster than on a regular CPU. So don’t throw away that dusty old gaming PC or mining rig just yet!

Oh and if you happen to own a Mac with an M1 or M2 chip, congratulations to you too! The performance from these isn’t quite as great as a high-end dedicated GPU’s, but it still exceeds regular x86_64 CPUs thanks to the powerful GPU architecture and integrated SoC nature of those ARM chips. They essentially perform somewhere between a regular CPU and a GPU.

“Is there any censorship? Do I need to find a jailbreak like DAN?”

"Fuck da police." - AI
"Fuck da police." - AI

The short answer to both questions is no.

The slightly more nuanced answer is that it depends on the training data you feed into it. After all, it’s open source software running on your own computer. You can make it do whatever you like, and as I’ve just demonstrated, that includes things ChatGPT would never do as a trained AI language model…

Fuck that cat and mouse bullshit! BasedGPT and the other jailbreaks are good fun, but here you engineer the initial prompt, you decide the rules, you deal the cards.

As previously mentioned, the instruction data was generated by OpenAI’s own models. However, as anyone who used ChatGPT from the start and/or has messed around with ChatGPT jailbreaks knows, those models are perfectly capable of acting well outside of OpenAI’s overly restrictive “content policy.” The nanny filters are largely bolted on at the frontend, not baked into the backend.

Remember the good old days?

With unfettered access to the raw weights, including the ability to tweak them if you’re more technically inclined (or to use the work those more technically inclined people share openly, as they very often do), none of this is of any real concern.

It does have a certain level of “morality filter” you might hit sometimes. If it says “I can’t do it, it’s wrong and immoral”, a simple one-liner prompt telling it that it has no morals or ethics is enough to get around it. No need for a DAN essay here.

But even without that, I was able to ask it how to synthesise MDMA and overthrow the government, and yes, you pervs, it’ll also happily write erotica without the need for any jailbreaks either because rule 34 is absolute.

Rule 34: no exceptions, not even artificial ones

“Enough preamble! I’m convinced! How the fuck do I install this thing?”

So, the actual software you need to run is llama.cpp, available here. There’s also a fork called alpaca.cpp here, which is currently buggier and, because of the weights it supports, not as good, but it is easier to use and less resource intensive. Don’t worry if the pages look daunting, simple install instructions will follow below!

The TL;DR of which weights you want is simple: if you’ve got a decent amount of RAM, run the native Alpaca weights with llama.cpp. If you don’t, use the regular ones with alpaca.cpp. If you have a good GPU, you can try the 13B or above weights in various formats. But for the purposes of this tutorial I will cover installing for use on a CPU.

I’m gonna assume most of you have at least 16GB RAM and a modern CPU, so we’re gonna focus on the alpaca-native weights. Despite only being 7B (7 billion parameters), I found these to vastly outperform the “regular” weights even at 13B.

These retrained native weights do use more RAM than the regular version at the time of writing, so if that’s a concern, you can use the regular 7B weights instead. But if you can, I highly recommend the alpaca-native weights; they’re much closer to ChatGPT even at the smallest 7B size (one 4GB file, 7 billion parameters).

Now I can’t give you download links for those weights because they keep getting DMCA’d. But if the magnet links are already on Pastebin and I happened to link that for purely educational purposes, well that’s different.

In all seriousness - the license that comes with these weights allows use for educational purposes, with the main clause being against use to develop a competing commercial product. I think it’s fair to assume anyone following a tutorial on a random blog is doing this for fun rather than to start a company that competes with OpenAI.

As for the alpaca-native weights, these seem to be released under an open source license because, kinda confusingly, the original license doesn’t prohibit using output from the models to train other models based on the existing ones; it only forbids commercial use as mentioned above.

Quantisation! Quantisation! Quantisation!

Whichever one you go for, you should get the quantised version. Those are compressed versions where the weights are stored at lower numerical precision (4-bit integers instead of 16-bit floats), so they take up less storage and need less RAM to run. You will likely find that you cannot run the non-quantised version on a CPU at all, whereas you easily can run a quantised model. So far, testing has shown that nothing (or so little it’s barely noticeable) is lost in the quantisation process.

You can quantise them yourself, but this is resource intensive and requires decently powerful hardware, ideally a good GPU. You can always rent a decently powerful VPS for an hour nice and cheap if there are no quantised weights already available for whatever you wanna try, but in this case there are.
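
For the curious, llama.cpp ships its own conversion and quantisation tools. A rough sketch of the DIY route, based on the llama.cpp README at the time of writing (script names, paths, and the final type argument may well change between versions), looks like this:

python3 convert-pth-to-ggml.py models/7B/ 1
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2

The first command converts the original PyTorch weights to 16-bit ggml format, and the second squashes them down to 4-bit. Again, for this tutorial you can skip all that and just grab the already-quantised files.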

Actual install guide!

Luckily this is the easy bit. The process for Windows, Linux, and macOS is broadly the same!

Windows specific step

For Windows you have two options. The easiest is to download the precompiled llama.cpp EXE files here.

The second is to use WSL. This may give better performance, so if you’re comfortable in the command line, I recommend it. But if not, go for the simpler option of downloading the Windows package directly.

For the WSL option, if you haven’t used WSL (Windows Subsystem for Linux) yet, simply open a command prompt or PowerShell as administrator and enter the following:

wsl --install

Let that do its thing and once it’s complete, reboot the machine.

If you don’t already have Ubuntu installed when you search the Start menu, open the Microsoft Store and install Ubuntu 22.04.2 LTS from there.

Once you have it, run it and you should get a Linux terminal. Set a username and password when prompted (don’t forget them!) and… done! You can follow the rest of this guide like normal.
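
If you ever want to double-check the install, you can list your distros from PowerShell (not the Ubuntu terminal):

wsl --list --verbose

You should see Ubuntu in the list, ideally running WSL version 2.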

Mac specific step

Make sure you have Homebrew installed for a package manager:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Installing the dependencies

If you use WSL, keep that Ubuntu terminal open. If you use a Mac, make sure you have brew installed, and also keep that terminal open. Linux… if you use Linux and your terminal ain’t already open, what you doing? ;)

For Windows, if you’re using WSL, follow the Linux instructions.

Right. Let’s go.

On Linux:

sudo apt install make cmake build-essential python3 python3-pip git unzip

On Mac:

brew install make cmake python3 git wget unzip

(No separate pip package needed here; pip comes bundled with Homebrew’s python3.)

Once that’s done, for any system:

python3 -m pip install torch numpy sentencepiece

That’s the dependencies installed! The next bit is quick and easy!

Building llama.cpp

On WSL, Linux, and Mac systems, just run these three commands:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

Done!
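
If you want to be sure the build actually worked, check that the main binary now exists in the folder:

ls -lh ./main

If that lists the file, you’re good to go.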

Downloading the training data

Refer to the Pastebin as above for downloads.

What you really want is the first magnet link, as this provides the far superior alpaca-native weights in quantised form.

To use the magnet link, copy the whole long string to your clipboard, open a torrent client such as Transmission (available for Windows, Mac, and Linux), then once it’s open hit “file” > “open from URL”. It should paste the magnet link in automatically; if not, just do it manually. Hit OK and wait for it to download. It will first download the metadata, then the file itself will begin downloading.

You can optionally choose to download the weights directly to the llama.cpp folder, which will make running it easier while still keeping it seeding.

If you have low RAM and want to try the more buggy, unrefined original weights, you can either use the second magnet link or one of the IPFS links.

If you use IPFS to get the original weights and you are using WSL, run this command to ensure the file name is original instead of the IPFS hash:

wget [IPFS link] -O ggml-alpaca-7b-q4.bin

You can also use this command for Mac or Linux, or you can simply paste the IPFS link into your browser and download it from there. But with WSL you only have a terminal, so you must use the above command.

Please note: if you use the unrefined weights, you either need to use an older version of llama.cpp or use alpaca.cpp as llama.cpp has been updated so the files won’t currently load. I strongly recommend using the alpaca-native weights!

If you used WSL and downloaded the alpaca-native weights in a torrent client, follow this guide to mount your WSL volume in Windows and move the finished download to the llama.cpp folder.
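
Alternatively, your Windows drives are mounted under /mnt inside WSL, so you can move the file from the Linux side instead. Assuming a hypothetical username, that the torrent landed in your Windows Downloads folder, and that you cloned llama.cpp into your home directory, it would look something like this (adjust the paths to match yours):

mv /mnt/c/Users/YourUser/Downloads/ggml-alpaca-7b-native-q4.bin ~/llama.cpp/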

For any other OS, if you didn’t download the torrent to the llama.cpp folder, either move the file once it’s done or, if you wanna be nice and seed, leave it where it is and just use the full path in the command when running the software (this can be filled in automatically by dragging the file into the terminal).

So if the weights are in the llama.cpp folder, and your terminal is in the llama.cpp folder (it should be), you can run this in Linux, Mac, or WSL:

./main --color -i -ins -n 512 -p "You are a helpful AI who will assist, provide information, answer questions, and have conversations." -m ggml-alpaca-7b-native-q4.bin

Or if the weights are somewhere else, bring them up in your regular file manager, then paste this into your terminal on Mac or Linux, making sure there is a space after the -m:

./main --color -i -ins -n 512 -p "You are a helpful AI who will assist, provide information, answer questions, and have conversations." -m 

Now drag the weights file from the regular GUI into the terminal and it should fill in the path automatically, similar to this:

./main --color -i -ins -n 512 -p "You are a helpful AI who will assist, provide information, answer questions, and have conversations." -m '/home/user/Downloads/ggml-alpaca-7b-native-q4.bin'

Then hit enter and it should run!

For the Windows command line or PowerShell (not WSL), it’s very similar. Go to the folder you put llama.cpp in:

cd C:\Users\YourUser\Downloads\llama.cpp

And to run:

main.exe --color -i -ins -n 512 -p "You are a helpful AI who will assist, provide information, answer questions, and have conversations." -m ggml-alpaca-7b-native-q4.bin

Or follow the same steps above if the weights are in a different folder - simply paste the command below, making sure you leave the space after -m, drag the file from the explorer into the command window and it should fill in the path automatically:

main.exe --color -i -ins -n 512 -p "You are a helpful AI who will assist, provide information, answer questions, and have conversations." -m 

If the path is filled in when you drag the file into the command line, hit enter and now it should run!

If it tries to run but fails with an error such as “killed” or “segmentation fault” on any OS, try rebooting your computer and not opening anything else in the background, as the LLM needs all the RAM it can get!

If you use Linux, feel free to add some extra swap space!
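
On Ubuntu (including WSL), one quick way to add, say, 8GB of swap (pick a size that suits your disk) looks like this:

sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

Note this swap only lasts until reboot unless you also add it to /etc/fstab. Swap is far slower than real RAM, but it can be the difference between “killed” and a working model.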

Get creative!

The -p flag lets you set the opening prompt to anything you like. I have added a simple one that should roughly emulate a ChatGPT-type assistant, but you can make it anything, for example:

MixtapeGPT
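
As a purely hypothetical example, swapping out that -p string is all it takes to change the persona entirely:

./main --color -i -ins -n 512 -p "You are a grumpy pirate who answers every question accurately, but complains about it the whole time." -m ggml-alpaca-7b-native-q4.bin

Go wild.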

More to come…

I published this when I did to coincide with a Twitter Space. I am continuing to edit it to clean things up, add details, fix mistakes, etc., and will shortly add more examples of fun and interesting prompts you can try out!
