Creating A ChatGPT Clone That Runs On Your Laptop With Go by Sau Sheong
Writing a ChatGPT clone is pretty simple; it’s almost like the “Hello World” of LLM
applications. At the heart of it is simply a call to one of the LLM APIs offered by
OpenAI, Google, Anthropic and many others. But what if you don’t want to be tied to
any of these guys? After all, they are not exactly free.
Well, you can always use one of the many open source large language models that
are available. While they aren’t really open source licensed (even if they say they
are), you can use them without cost, some even for commercial purposes. Most, if
not all, open source models find their place on Hugging Face, and at a quick
look, the number of text generation models comes close to 19k! That’s a lot of
models, even though many of them are variants of the same model.
In particular, Meta’s Llama (Large Language Model Meta AI) is probably one of the
most popular ones.
https://blog.stackademic.com/creating-a-chatgpt-clone-that-runs-on-your-laptop-with-go-bf9d41f1cf88?gi=b1f3352882a3&source=login--------------------------… 1/20
1/26/24, 12:09 PM Creating a ChatGPT Clone that Runs on Your Laptop with Go | by Sau Sheong | Stackademic
Llama is a family of large language models (LLMs) that use the transformer
architecture and are trained on large amounts of data from various public sources. The
first release of Llama came out in February 2023, was trained with up to 1.4 trillion
tokens and had variants with 7, 13, 33 and 65 billion parameters.
When Llama was first released, it wasn’t licensed for commercial use and the
weights were only released through an application process for academic use.
However, within a week, the weights were leaked. The leaks were subsequently
stemmed by Meta, but the cat was already out of the bag. As Simon Willison wrote
on his blog:
That furious typing sound you can hear is thousands of hackers around the world starting
to dig in and figure out what life is like when you can run a GPT-3 class model on your
own hardware. (https://simonwillison.net/2023/Mar/11/llama/)
The second release, dubbed Llama-2, came out in July 2023. It was trained with 2
trillion tokens and had 7, 13 and 70 billion parameter variants. This release, however,
allowed commercial use, and the weights were readily available for everyone to
download.
Llama.cpp
Llama.cpp is an open source project by Georgi Gerganov, who impressively re-wrote
the Llama inference code in C++. The project also quantized the Llama model using
the GGML tensor library. While the original quantization used 4-bit integers, it has
since been updated to support 2 to 6-bit integer quantization.
Llama.cpp is a big deal because prior to that, running LLMs on laptops was almost
impossible. With llama.cpp and its GGML quantization, we can now run smaller
LLMs on laptops and even on iOS and Android devices.
Then use make to build llama.cpp by typing this in the command line.
$ LLAMA_METAL=1 make
Metal is a low-level library from Apple that allows software built with it to access
the GPU. Building llama.cpp with Metal allows it to use the GPU in Apple Silicon.
Once you have built llama.cpp you will get a number of executables, but the main
one that we’ll be using is called main (what else). To use llama.cpp you need to have
a quantized GGML model. There are a number of models provided by TheBloke on
HuggingFace. If you want to try something quickly you can try this one. Download
the file llama-2-7b-chat.ggmlv3.q4_K_S.bin and put it into the models directory.
In this example, I use the -m flag to set the model to run, which is one of the
llama-2-7b 4-bit models. I set the -ngl flag, which indicates the number of GPU layers
to use, to 38, the number of GPU cores in the M2 Max on my MacBook Pro. I also set the
temperature to 0 using the --temp flag, and finally gave the prompt “What is the
capital of France?”
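For readers who would rather launch the same command from Go, here is a minimal sketch using os/exec. The binary path ./main and the model path are assumptions based on the build and download steps above, and -p (llama.cpp's prompt flag) is not named in the text. The sketch only builds the command and prints its arguments.

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
)

// llamaCommand builds (but does not run) a llama.cpp invocation.
// bin and model are hypothetical paths; adjust them to your own setup.
func llamaCommand(bin, model string, gpuLayers int, temp float64, prompt string) *exec.Cmd {
	return exec.Command(bin,
		"-m", model, // model file to load
		"-ngl", strconv.Itoa(gpuLayers), // number of layers to offload to the GPU
		"--temp", strconv.FormatFloat(temp, 'f', -1, 64), // sampling temperature
		"-p", prompt, // prompt text (assumed flag, see the note above)
	)
}

func main() {
	cmd := llamaCommand("./main", "models/llama-2-7b-chat.ggmlv3.q4_K_S.bin",
		38, 0, "What is the capital of France?")
	// Print the argument list instead of executing it.
	fmt.Println(cmd.Args)
}
```

Calling cmd.Output() on the returned command would actually run the model and return what it prints.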
As you can see, the response returned in about 556 ms. The eval time tells us how
long it took to generate the response, which is around 19.3 ms per token or about
51.8 tokens per second. That’s pretty impressive!
Now that you can run llama.cpp let’s see why it runs.
Quantization
Quantization is the key to how llama.cpp works. Quantization is a process that
reduces the number of bits used to represent a number. LLMs are made out of neural
networks that are laid out in the transformer architecture. The neural networks have
nodes that have weights assigned to them.
These weights are basically just floating point numbers, so the idea behind
quantizing LLMs is to reduce the number of bits used to represent these weights.
Reducing the number of bits means that the weights have less precision. However, it
also means that they need less storage and less memory.
Quantization can be done during or after the training. Quantization after the
training is done is called post-training quantization (of course) or PTQ and this is
what we’re looking at in this article.
There are 2 popular PTQ methods currently — GPTQ and GGML. GPTQ is a method
proposed in the paper GPTQ: Accurate Post-Training Quantization for Generative Pre-
trained Transformers, and has been implemented in a few open source libraries.
GGML on the other hand, is a tensor library by Georgi Gerganov (who also wrote
llama.cpp) that focuses on CPU optimization, particularly for Apple M1 and M2
chips. My development machine is a Macbook Pro so I’m kind of biased towards
GGML and that’s what I’m going to discuss here.
The community, and TheBloke in particular, has been using both quantization methods
on a lot of LLMs. As of writing this article, TheBloke has quantized 773 different
models on HuggingFace!
How does quantization affect the Llama-2 models? I’ll briefly go through 3 areas:
1. File size
2. Memory
3. Response time
File size
As expected, quantization reduces the size of the model file. FP16 (16-bit
floating point) is what Meta provides when you download the model files from their
site.
The table below shows the file sizes for the FP16 model and some of
the GGML models quantized by TheBloke and hosted on HuggingFace. The 8-bit
quantization was the initial quantization effort in GGML, and the ones following it
use fewer bits.
As you can see, the sizes drop with the reduced number of bits.
Memory
How much memory do we need to use the quantized models? The table below
shows the memory required for the same set of models from TheBloke.
So how do we actually derive the memory requirements? It’s quite easy. The weights
are loaded into memory, and each parameter is stored as a number. For FP16, this is a
16-bit floating point number, which takes up 2 bytes of memory. For a 7 billion
parameter model, this means 7 x 2 billion bytes, which is 14 GB of memory.
In that case, you would expect that storing each parameter in an 8-bit integer needs
only 7 GB of memory, right? Well, not exactly, because the number of bits is
rounded up here. GGML uses different algorithms for its quantization and not
every parameter is exactly 8 bits, so the amount of memory is only about 7 GB.
Other than the FP16 column, the rest of the numbers are provided by TheBloke.
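The arithmetic above can be written as a small helper. This is a rough sketch that counts only the weights (1 GB taken as a billion bytes, as in the text); the function name is made up, and real GGML files differ a little because not every parameter uses the same number of bits.

```go
package main

import "fmt"

// weightMemoryGB estimates the memory needed just to hold the weights.
// paramsBillions is the parameter count in billions and bitsPerWeight is
// the average number of bits used to store one parameter.
func weightMemoryGB(paramsBillions, bitsPerWeight float64) float64 {
	return paramsBillions * bitsPerWeight / 8 // 8 bits per byte
}

func main() {
	fmt.Printf("7B FP16:  %.1f GB\n", weightMemoryGB(7, 16)) // 14.0 GB
	fmt.Printf("7B 8-bit: %.1f GB\n", weightMemoryGB(7, 8))  // 7.0 GB
}
```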
If you look at the numbers it seems pretty promising because I can run both 7B and
13B models on my laptop! How about the 2-bit quantized 70B model? Not really.
Remember, the computer needs memory for other things it runs as well, including
the OS! Of course, if you have more memory on your laptop (I only have 32GB) you
can try the 70B too.
Response time
Inference is about using the LLM to generate text, given our prompts. Naturally, we
consider LLMs that generate text faster to be better performing. To measure the
speed of text generation, I’ll be calculating the number of tokens generated per
second for the FP16 model as well as a 4-bit quantized model.
Since I’m running on my computer, batch size is 1. I have an M2 Max MacBook Pro.
The memory bandwidth of the M2 Max is 400 GB/s and the compute power of the
M2 Max for FP16 is 26.98 TFLOPS. With this information let’s calculate the latencies.
It’s clear from our calculations the speed of inference is bound by memory
bandwidth.
So what does 70 ms mean? This is the amount of time needed to generate 1
token. In other words, the Llama-2-7B FP16 model (which is the model originally
provided by Meta) generates 14.3 tokens per second.
How about the 4-bit quantized model? Following the same formula, this is (2 x 7 x
0.5 x 1)/400 = 17.5 ms per token or 57.1 tokens per second! This means reducing
the weights from 16 bits to 4 bits makes token generation about 4 times faster.
If you compare this with the result I showed earlier (19.3 ms per token), it validates
the calculations.
Following the same formulas, the calculated inference response time is 30.8 tokens
per second for the 13B model and 5.7 tokens per second for the 70B model. Realistically
I can’t run the 70B model anyway so that’s moot.
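The memory-bound estimate can be reproduced with a few lines of Go. This sketch follows the formula used above, (2 x parameters in billions x bytes per parameter x batch size) / bandwidth, with the 400 GB/s bandwidth and batch size of 1 from the text; the function names are made up for the example.

```go
package main

import "fmt"

const (
	bandwidthGBps = 400.0 // M2 Max memory bandwidth, from the text
	batchSize     = 1.0   // one request at a time on a laptop
)

// msPerToken returns the estimated milliseconds needed to generate one
// token when inference is bound by memory bandwidth.
func msPerToken(paramsBillions, bytesPerParam float64) float64 {
	seconds := 2 * paramsBillions * bytesPerParam * batchSize / bandwidthGBps
	return seconds * 1000
}

// tokensPerSecond converts a per-token latency into a generation rate.
func tokensPerSecond(ms float64) float64 {
	return 1000 / ms
}

func main() {
	models := []struct {
		name          string
		params, bytes float64
	}{
		{"7B FP16", 7, 2},
		{"7B 4-bit", 7, 0.5},
		{"13B 4-bit", 13, 0.5},
		{"70B 4-bit", 70, 0.5},
	}
	for _, m := range models {
		ms := msPerToken(m.params, m.bytes)
		fmt.Printf("%-10s %6.1f ms/token %5.1f tokens/s\n", m.name, ms, tokensPerSecond(ms))
	}
}
```

Running it gives 70 ms and 17.5 ms per token for the 7B FP16 and 4-bit models, and about 30.8 and 5.7 tokens per second for the 4-bit 13B and 70B models, matching the figures in the text.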
With the 13B model, we get around 26.8 tokens per second (or 37.4 ms per token),
which is not too far from our calculations of 30.8 tokens per second. This is certainly
not bad considering OpenAI’s GPT-3.5-turbo’s response time is 73 ms per token and
Azure OpenAI’s GPT-3.5-turbo’s response time is 34 ms!
Perplexity
We have established that the Llama-2-7B and Llama-2-13B quantized models run really
well on an M2 Max MacBook Pro, with response times that compare well with
calling OpenAI APIs. However, it doesn’t matter how good the response time is if the
model doesn’t have good accuracy.
We also know Llama-2 models are among the best open source models right now. They
even compare well with GPT-3.5. Here’s a table that shows the comparison metrics
for the HellaSwag and MMLU benchmarks between the various OpenAI models and
the Llama-2 models.
It would be great if we could get an LLM that is as good as GPT-3 running on our
desktop, but the question is, how do the FP16 models compare with the quantized
models? For that, we turn to perplexity as a measure to compare the different
quantized models.
Perplexity is one of the most common metrics for evaluating language models. It’s a
measurement of how well a model can predict the next word. The lower the
perplexity score, the better the model’s ability to predict the next word accurately.
Let’s compare the perplexities of the FP16 and the various quantized models of
Llama-2-7B. The llama.cpp team has produced perplexity benchmarks for the
various quantized models.
From this chart you can see that the Q6_K quantized model (6-bit) is very close to
the FP16 model. This goes for the 5-bit and even the 4-bit Q4_K_M model.
With this, we know the quantized Llama-2 models perform nearly as well as the FP16
models, which in turn are very close to the GPT-3 and GPT-3.5 models. It then makes
sense to use the 5-bit Q5_K_M 13B model on our desktop as it gives us
the best bang for the buck. Of course, this varies with your hardware.
Finally, we’ll get into the Go code and build ourselves a local chatbot!
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
	"strings"
	"text/template"

	"github.com/go-chi/chi"
	"github.com/go-chi/chi/middleware"
	"github.com/joho/godotenv"
	"github.com/tmc/langchaingo/llms/local"
)

// Configuration for the local LLM binary and its arguments.
var (
	wd          string
	bin         string
	model       string
	gpuLayers   string
	threads     string
	contextSize string
)

func main() {
	r := chi.NewRouter()
	r.Use(middleware.Logger)
	// Serve static assets from the ./static directory.
	r.Handle("/static/*", http.StripPrefix("/static",
		http.FileServer(http.Dir("./static"))))
	r.Get("/", index)
	r.Post("/run", run)
	log.Println("\033[93mMonsoon started. Press CTRL+C to quit.\033[0m")
	// ListenAndServe only returns on failure, so log the error and exit.
	log.Fatal(http.ListenAndServe(":"+os.Getenv("PORT"), r))
}

// index serves the chat page.
func index(w http.ResponseWriter, r *http.Request) {
	t, err := template.ParseFiles("static/index.html")
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	if err := t.Execute(w, nil); err != nil {
		log.Println("index:", err)
	}
}
The only difference is creating the LLM. Instead of creating an OpenAI LLM, I
create a local LLM by setting up the executable binary as well as the arguments to
use for the binary.
LOCAL_LLM_BIN=local/llamacpp
LOCAL_LLM_MODEL=local/llama-2-13b-chat.ggmlv3.q5_K_M.bin
LOCAL_LLM_NUM_GPU_LAYERS=38
LOCAL_LLM_NUM_CPU_CORES=12
LOCAL_LLM_CONTEXT=4096
PORT=1102
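As a rough sketch of how these settings could be read and turned into arguments for the binary, the example below parses them into a struct and builds a flag list. The Config type, the function names, and the mapping of each variable to a llama.cpp flag (-m, -ngl, -t, -c) are assumptions for illustration, not the article's actual code. A map stands in for the environment so the example runs on its own; in a real program you would pass os.Getenv.

```go
package main

import (
	"fmt"
	"strconv"
)

// Config holds the settings from the environment file (hypothetical type).
type Config struct {
	Bin       string
	Model     string
	GPULayers int
	Threads   int
	Context   int
}

// loadConfig reads the settings through getenv and validates the numbers.
func loadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{
		Bin:   getenv("LOCAL_LLM_BIN"),
		Model: getenv("LOCAL_LLM_MODEL"),
	}
	ints := []struct {
		key string
		dst *int
	}{
		{"LOCAL_LLM_NUM_GPU_LAYERS", &cfg.GPULayers},
		{"LOCAL_LLM_NUM_CPU_CORES", &cfg.Threads},
		{"LOCAL_LLM_CONTEXT", &cfg.Context},
	}
	for _, f := range ints {
		n, err := strconv.Atoi(getenv(f.key))
		if err != nil {
			return Config{}, fmt.Errorf("%s: %w", f.key, err)
		}
		*f.dst = n
	}
	return cfg, nil
}

// Args turns the config into llama.cpp flags (assumed mapping).
func (c Config) Args() []string {
	return []string{
		"-m", c.Model,
		"-ngl", strconv.Itoa(c.GPULayers),
		"-t", strconv.Itoa(c.Threads),
		"-c", strconv.Itoa(c.Context),
	}
}

func main() {
	env := map[string]string{
		"LOCAL_LLM_BIN":            "local/llamacpp",
		"LOCAL_LLM_MODEL":          "local/llama-2-13b-chat.ggmlv3.q5_K_M.bin",
		"LOCAL_LLM_NUM_GPU_LAYERS": "38",
		"LOCAL_LLM_NUM_CPU_CORES":  "12",
		"LOCAL_LLM_CONTEXT":        "4096",
	}
	cfg, err := loadConfig(func(k string) string { return env[k] })
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println(cfg.Bin, cfg.Args())
}
```

Parsing the numeric settings up front means a typo in the environment file is reported at startup rather than when the first prompt is sent.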
After building llama.cpp with Metal, you will get a ggml-metal.metal file, which you
will also need to include in the same directory the binary executable is in. You can
find the complete code as well as all other files (except the models, of course) here.