
Creating a ChatGPT Clone that Runs on Your Laptop with Go
Running Llama-2 on your laptop using llama.cpp and Go

Sau Sheong · Published in Stackademic · 17 min read · Aug 20, 2023

Writing a ChatGPT clone is pretty simple; it's almost the "Hello World" of LLM
applications. At the heart of it, you simply call one of the LLM APIs offered by
OpenAI, Google, Anthropic and many others. But what if you don't want to be tied to
any of these guys? After all, they are not exactly free.

Well, you can always use one of the many open source large language models that
are available. While they aren't really open source licensed (even if they say they
are), you can use them without cost, some even for commercial purposes. Most, if
not all, open source models find their way onto Hugging Face, and at a quick
glance, the number of text generation models comes to close to 19k! That's a lot of
models, even though many of them are variants of the same model.

In particular, Meta’s Llama (Large Language Model Meta AI) is probably one of the
most popular ones.

Llama is a family of large language models (LLMs) that use the transformer
architecture and are trained on large amounts of data from various public sources. The
first release of Llama came out in February 2023; it was trained on 1.4 trillion tokens
and had variants with 7, 13, 33 and 65 billion parameters.

When Llama was first released, it wasn't licensed for commercial use and the
weights were only released through an application process for academic use.
However, within a week, the weights were leaked. Meta subsequently tried to stem
the leaks, but the cat was already out of the bag. As Simon Willison wrote
on his blog:

That furious typing sound you can hear is thousands of hackers around the world starting
to dig in and figure out what life is like when you can run a GPT-3 class model on your
own hardware. (https://simonwillison.net/2023/Mar/11/llama/)

The second release, dubbed Llama-2, came out in July 2023. It was trained on 2
trillion tokens and had 7, 13 and 70 billion parameter variants. This release, however,
allowed commercial use, and the weights were readily available for everyone to
download.

These will be the models we will be exploring in this article.

Llama.cpp
Llama.cpp is an open source project by Georgi Gerganov, who impressively re-wrote
the Llama inference code in C++. The project also quantized the Llama model using
the GGML tensor library. While the original quantization used 4-bit integers, it has
since been updated to support 2 to 6-bit integer quantization.

Llama.cpp is a big deal because prior to that, running LLMs on laptops was almost
impossible. With llama.cpp and its GGML quantization, we can now run smaller
LLMs on laptops and even on iOS and Android devices.

Installing and setting up
Installing and setting up llama.cpp is quite simple. I'm just showing how it can be
done on macOS for machines with Apple Silicon (M1 or M2). For other
environments and systems, you can read more about it on the GitHub repository.

First clone the repository and get into the directory:

$ git clone https://github.com/ggerganov/llama.cpp.git
$ cd llama.cpp

Then use make to build llama.cpp by typing this in the command line.

$ LLAMA_METAL=1 make

Metal is a low-level library from Apple that allows software built with it to access
the GPU. Building llama.cpp with Metal allows it to use the GPU on Apple Silicon.

Once you have built llama.cpp you will get a number of executables, but the main
one we'll be using is called main (what else?). To use llama.cpp you need
a quantized GGML model. There are a number of models provided by TheBloke on
Hugging Face; if you want to try something quickly, you can try this one. Download
the file llama-2-7b-chat.ggmlv3.q4_K_S.bin and put it into the models directory.

Run llama.cpp from the command line with this command:

$ ./main -m models/llama-2-7b-chat.ggmlv3.q4_K_S.bin -ngl 38 --temp 0 \
  -p "What is the capital of France?"

In this example, I use the -m flag to set the model to run, which is one of the llama-2-7b
4-bit models. I set the -ngl flag, which indicates the number of GPU layers to use, to
38, the number of GPU cores in the M2 Max on my MacBook Pro. I also set the
temperature to 0 using the --temp flag, and finally gave the prompt "What is the
capital of France?"

This is the response.

main: build = 955 (86c3219)
main: seed = 1692453792
llama.cpp: loading model from models/llama-2-7b-chat.ggmlv3.q4_K_S.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_head_kv = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 14 (mostly Q4_K - Small)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: mem required = 3949.96 MB (+ 256.00 MB per state)
llama_new_context_with_model: kv self size = 256.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/sausheong/projects/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x153f08170
ggml_metal_init: loaded kernel_add_row 0x153f08790
ggml_metal_init: loaded kernel_mul 0x153f08cd0
ggml_metal_init: loaded kernel_mul_row 0x153f09320
ggml_metal_init: loaded kernel_scale 0x153f09860
ggml_metal_init: loaded kernel_silu 0x153f09da0
ggml_metal_init: loaded kernel_relu 0x153f0a2e0
ggml_metal_init: loaded kernel_gelu 0x153f0a820
ggml_metal_init: loaded kernel_soft_max 0x153f0aef0
ggml_metal_init: loaded kernel_diag_mask_inf 0x153f0b570
ggml_metal_init: loaded kernel_get_rows_f16 0x153f0bc10
ggml_metal_init: loaded kernel_get_rows_q4_0 0x153f0c420
ggml_metal_init: loaded kernel_get_rows_q4_1 0x153f0cac0
ggml_metal_init: loaded kernel_get_rows_q2_K 0x153f0d160
ggml_metal_init: loaded kernel_get_rows_q3_K 0x153f0d800
ggml_metal_init: loaded kernel_get_rows_q4_K 0x153f0dea0
ggml_metal_init: loaded kernel_get_rows_q5_K 0x153f0e540
ggml_metal_init: loaded kernel_get_rows_q6_K 0x153f0ebe0
ggml_metal_init: loaded kernel_rms_norm 0x153f0f2c0
ggml_metal_init: loaded kernel_norm 0x153f0fb00
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x153f103d0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x153f10ab0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x153f11190
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x153f119f0
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x153f120d0
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x153f127b0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x153f12e70
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x153f13730
ggml_metal_init: loaded kernel_rope 0x153f13c70
ggml_metal_init: loaded kernel_alibi_f32 0x153f147b0
ggml_metal_init: loaded kernel_cpy_f32_f16 0x153f15060
ggml_metal_init: loaded kernel_cpy_f32_f32 0x154804a00
ggml_metal_init: loaded kernel_cpy_f16_f16 0x1548052b0
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: max tensor size = 102.54 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 3648.31 MB,
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 10.17 MB,
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 258.00 MB,
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 132.00 MB,
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 160.00 MB,

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI


sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.0
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

What is the capital of France?

Answer: The capital of France is Paris. [end of text]

llama_print_timings: load time = 1128.08 ms
llama_print_timings: sample time = 8.00 ms / 12 runs ( 0.67 ms
llama_print_timings: prompt eval time = 335.08 ms / 8 tokens ( 41.89 ms
llama_print_timings: eval time = 212.47 ms / 11 runs ( 19.32 ms
llama_print_timings: total time = 556.06 ms
ggml_metal_free: deallocating

As you can see, the response returned in about 556 ms. The eval time tells us how
long it took to generate the response, which is around 19.3 ms per token, or about
51.8 tokens per second. That's pretty impressive!

This is especially impressive considering that OpenAI's GPT-3.5-turbo's response time is 73 ms
per token, Azure OpenAI's GPT-3.5-turbo's response time is 34 ms per token, and
OpenAI's GPT-4 is a whopping 196 ms per token!

Now that you can run llama.cpp, let's look at why it is able to run on a laptop at all.
Quantization
Quantization is the key to how llama.cpp works. Quantization is a process that
reduces the number of bits used to represent a number. LLMs are made up of neural
networks laid out in the transformer architecture, and these networks have
nodes with weights assigned to them.

These weights are basically just floating point numbers, so the idea behind
quantizing LLMs is to reduce the number of bits used to represent them.
Reducing the number of bits means the weights have less precision. However, it
also means they need less storage and less memory.

Quantization can be done during or after training. Quantization done after
training is called post-training quantization (of course), or PTQ, and this is
what we're looking at in this article.

There are two popular PTQ methods currently: GPTQ and GGML. GPTQ is a method
proposed in the paper GPTQ: Accurate Post-Training Quantization for Generative Pre-
trained Transformers, and has been implemented in a few open source libraries.
GGML, on the other hand, is a tensor library by Georgi Gerganov (who also wrote
llama.cpp) that focuses on CPU optimization, particularly for Apple M1 and M2
chips. My development machine is a MacBook Pro, so I'm kind of biased towards
GGML and that's what I'm going to discuss here.

The community, and TheBloke in particular, has been applying both quantization methods
to a lot of LLMs. As of writing this article, TheBloke has quantized 773 different
models on Hugging Face!

How does quantization affect the Llama-2 models? I'll briefly go through 3 areas:

1. File size

2. Memory

3. Response time

File size
As expected, quantization reduces the size of the model file. FP16 (16-bit
floating point) is what Meta provides when you download the model files from their
site.

The table below shows the file sizes for the FP16 model and some of
the GGML models quantized by TheBloke and hosted on Hugging Face. The 8-bit
quantization was the initial quantization effort in GGML, and the ones after it
use fewer bits.

As you can see, the sizes drop with the reduced number of bits.
Memory
How much memory do we need to use the quantized models? The table below
shows the memory required for the same set of models from TheBloke.

So how do we actually derive the memory requirements? It's quite easy. The weights
are loaded into memory, and each parameter is stored as a number. For FP16, this is a
16-bit floating point number, which takes up 2 bytes of memory. For a 7 billion
parameter model, this means 7 x 2 billion bytes of memory, which is 14 GB.

In that case, you would expect that storing each parameter as an 8-bit integer means
the model only needs 7 GB of memory, right? Well, not exactly. GGML uses different
algorithms for its quantization and not every parameter ends up at exactly 8 bits,
so the amount of memory is only roughly 7 GB.
Other than the FP16 column, the rest of the numbers are provided by TheBloke.
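
As a rough back-of-the-envelope check, here's how those estimates work out in Go. This ignores per-block scales, the KV cache and other runtime overhead, so the numbers are illustrative only.

package main

import "fmt"

// approxWeightMemoryGB estimates the memory needed just to hold the weights,
// ignoring per-block scales, the KV cache and other runtime overhead.
func approxWeightMemoryGB(params, bitsPerWeight float64) float64 {
	bytes := params * bitsPerWeight / 8
	return bytes / 1e9
}

func main() {
	for _, bits := range []float64{16, 8, 4} {
		fmt.Printf("7B model at %2.0f bits: ~%.1f GB\n",
			bits, approxWeightMemoryGB(7e9, bits))
	}
	// Prints roughly 14.0 GB, 7.0 GB and 3.5 GB respectively.
}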

If you look at the numbers, it seems pretty promising because I can run both the 7B and
13B models on my laptop! How about the 2-bit quantized 70B model? Not really.
Remember, the computer needs memory for everything else it runs as well, including
the OS! Of course, if you have more memory on your laptop (I only have 32 GB), you
can try the 70B too.

Response time
Inference is about using the LLM to generate text, given our prompts. Naturally, we
consider LLMs that generate text faster to be better performing. To measure the
speed of text generation, I’ll be calculating the number of tokens generated per
second for the FP16 model as well as a 4-bit quantized model.

Using the formula used by Finbarr Timbers, we get:

latency of model = max(latency compute, latency memory)

Since I'm running on my own computer, the batch size is 1. I have an M2 Max MacBook Pro.
The memory bandwidth of the M2 Max is 400 GB/s and its compute power
for FP16 is 26.98 TFLOPS. With this information, let's calculate the latencies.

The latency of memory follows this formula:

latency = (2 x no of parameters x no of bytes x batch size)/memory bandwidth

So for a 7B model in FP16 (2 bytes), this is (2 x 7 x 2 x 1)/400, which is 0.07 s, or 70 ms.

The latency of compute follows this formula:

latency = (2 x no of parameters)/ no of FLOPS

The calculated latency of compute is (2 x 7)/26,980, which is 0.52 ms.

It's clear from our calculations that the speed of inference is bound by memory
bandwidth.

So what does 70 ms mean? This is the amount of time needed to generate 1
token. In other words, the Llama-2-7B FP16 model (which is the model originally
provided by Meta) generates 14.3 tokens per second.

How about the 4-bit quantized model? Following the same formula, this is (2 x 7 x
0.5 x 1)/400 = 17.5 ms per token, or 57.1 tokens per second! This means reducing
from 16 bits to 4 bits cuts the time per token by a factor of 4. Nice!

If you compare this with the results I showed earlier (19.3 ms per token), it validates the
calculations.

Following the same formulas, the calculated inference speed is 30.8 tokens
per second for the 13B model and 5.7 tokens per second for the 70B model. Realistically,
I can't run the 70B model anyway, so that's moot.
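
Here's a small Go sketch of those back-of-the-envelope numbers, using the same memory-bound formula from above with the M2 Max's 400 GB/s bandwidth; it reproduces the 14.3, 57.1, 30.8 and 5.7 tokens-per-second figures.

package main

import "fmt"

// tokensPerSecond applies the memory-bound latency formula from the text:
// latency = (2 x parameters x bytes per parameter x batch size) / memory bandwidth
func tokensPerSecond(params, bytesPerParam, batch, bandwidthBytesPerSec float64) float64 {
	secondsPerToken := (2 * params * bytesPerParam * batch) / bandwidthBytesPerSec
	return 1 / secondsPerToken
}

func main() {
	const bandwidth = 400e9 // M2 Max memory bandwidth, 400 GB/s
	fmt.Printf("7B  FP16 : %.1f tokens/s\n", tokensPerSecond(7e9, 2, 1, bandwidth))
	fmt.Printf("7B  4-bit: %.1f tokens/s\n", tokensPerSecond(7e9, 0.5, 1, bandwidth))
	fmt.Printf("13B 4-bit: %.1f tokens/s\n", tokensPerSecond(13e9, 0.5, 1, bandwidth))
	fmt.Printf("70B 4-bit: %.1f tokens/s\n", tokensPerSecond(70e9, 0.5, 1, bandwidth))
}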

Let’s run the 13B model.

% ./main -m models/llama-2-13b-chat.ggmlv3.q4_K_S.bin -ngl 38 --temp 0 \
  -p "What is the capital of France?"
main: build = 955 (86c3219)
main: seed = 1692454637
llama.cpp: loading model from models/llama-2-13b-chat.ggmlv3.q4_K_S.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 11 (mostly Q3_K - Small)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: mem required = 5762.22 MB (+ 400.00 MB per state)
llama_new_context_with_model: kv self size = 400.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/sausheong/projects/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x122705670
ggml_metal_init: loaded kernel_add_row 0x122705db0
ggml_metal_init: loaded kernel_mul 0x1227062f0
ggml_metal_init: loaded kernel_mul_row 0x122706940
ggml_metal_init: loaded kernel_scale 0x122706e80
ggml_metal_init: loaded kernel_silu 0x1227073c0
ggml_metal_init: loaded kernel_relu 0x122707900
ggml_metal_init: loaded kernel_gelu 0x10d506840
ggml_metal_init: loaded kernel_soft_max 0x10d507030
ggml_metal_init: loaded kernel_diag_mask_inf 0x10d5076b0
ggml_metal_init: loaded kernel_get_rows_f16 0x10d507d50
ggml_metal_init: loaded kernel_get_rows_q4_0 0x10d508560
ggml_metal_init: loaded kernel_get_rows_q4_1 0x10d508c00
ggml_metal_init: loaded kernel_get_rows_q2_K 0x10d5092a0
ggml_metal_init: loaded kernel_get_rows_q3_K 0x10d509940
ggml_metal_init: loaded kernel_get_rows_q4_K 0x10d509fe0
ggml_metal_init: loaded kernel_get_rows_q5_K 0x10d50a680
ggml_metal_init: loaded kernel_get_rows_q6_K 0x10d50ad20
ggml_metal_init: loaded kernel_rms_norm 0x10d50b400
ggml_metal_init: loaded kernel_norm 0x10d50bc40
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x10d50c510
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x10d50cbf0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x10d50d2d0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x10d50db30
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x10d50e210
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x10d50e8f0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x10d50efb0
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x10d50f870
ggml_metal_init: loaded kernel_rope 0x10d50fdb0
ggml_metal_init: loaded kernel_alibi_f32 0x10d5108f0
ggml_metal_init: loaded kernel_cpy_f32_f16 0x10d5111a0
ggml_metal_init: loaded kernel_cpy_f32_f32 0x122708090
ggml_metal_init: loaded kernel_cpy_f16_f16 0x122708a60
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: max tensor size = 128.17 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 5396.56 MB,
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 12.17 MB,
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 402.00 MB,
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 162.00 MB,
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 192.00 MB,

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI


sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.0
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

What is the capital of France?


The capital of France is Paris. [end of text]

llama_print_timings: load time = 457.66 ms
llama_print_timings: sample time = 6.08 ms / 9 runs ( 0.68 ms
llama_print_timings: prompt eval time = 774.97 ms / 8 tokens ( 96.87 ms
llama_print_timings: eval time = 298.81 ms / 8 runs ( 37.35 ms
llama_print_timings: total time = 1080.49 ms
ggml_metal_free: deallocating

With the 13B model, we get around 26.8 tokens per second (or 37.4 ms per token),
which is not too far from our calculated 30.8 tokens per second. This is certainly
not bad considering OpenAI's GPT-3.5-turbo's response time is 73 ms per token and
Azure OpenAI's GPT-3.5-turbo's is 34 ms per token!
Perplexity
We have established that the quantized Llama-2-7B and Llama-2-13B models run really
well on an M2 Max MacBook Pro, with response times that are much better than
calling the OpenAI APIs. However, it doesn't matter how good the response time is if
the model doesn't have good accuracy.

We also know the Llama-2 models are among the best open source models right now. They
even compare well with GPT-3.5. Here's a table that shows the comparison metrics
for the HellaSwag and MMLU benchmarks between the various OpenAI models and
the Llama-2 models.

The HellaSwag benchmark is a test of common-sense inference that is easy for
humans, while MMLU is a benchmark that measures a text model's multitask
accuracy, covering fifty-seven tasks including elementary mathematics, computer
science, law, and more. As you can see, Llama-2-70B is very close to GPT-3.5 and
Llama-2-13B is slightly better than GPT-3!

It would be great if we could get an LLM that is as good as GPT-3 running on our
desktops, but the question is, how do the FP16 models compare with the quantized
models? For that, we turn to perplexity as a measure for comparing the different
quantized models.

Perplexity is one of the most common metrics for evaluating language models. It’s a
measurement of how well a model can predict the next word. The lower the
perplexity score, the better the model’s ability to predict the next word accurately.
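
Concretely, perplexity is the exponential of the average negative log-likelihood the model assigns to each token in a test set. Here's a minimal Go sketch of the calculation, using made-up per-token probabilities just for illustration.

package main

import (
	"fmt"
	"math"
)

// perplexity computes exp of the average negative log-likelihood,
// given the probability the model assigned to each actual next token.
func perplexity(tokenProbs []float64) float64 {
	var sumNegLog float64
	for _, p := range tokenProbs {
		sumNegLog += -math.Log(p)
	}
	return math.Exp(sumNegLog / float64(len(tokenProbs)))
}

func main() {
	// A model that is fairly confident about each next token...
	confident := []float64{0.6, 0.5, 0.7, 0.4}
	// ...versus one that is mostly guessing.
	guessing := []float64{0.1, 0.05, 0.2, 0.08}
	fmt.Printf("confident model perplexity: %.2f\n", perplexity(confident))
	fmt.Printf("guessing model perplexity:  %.2f\n", perplexity(guessing))
}

The more confident model ends up with a perplexity of around 1.9 versus roughly 10.6 for the guessing one, which is why lower is better.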

Let’s compare the perplexities of the FP16 and the various quantized models of the
Llama-2–7B. The llama.cpp team has produced perplexity benchmarks for the
various quantized models.

Here’s one for the Llama-2–7B models.

From this chart you can see that the Q6_K quantized model (6-bit) is very close to
the FP16 model. The same goes for the 5-bit and even the 4-bit Q4_K_M models.

The same goes for the 13B model.

With this, we know the quantized Llama-2 models perform nearly as well as the FP16
models, which in turn are very close to the GPT-3 and GPT-3.5 models. It then makes
sense to use the 5-bit Q5_K_M 13B model on our desktop, as it gives us
the best bang for the buck. Of course, this varies with your hardware.

Finally, we’ll get into the Go code and build ourselves a local chatbot!

Creating a local chatbot


Just as in the previous article, where I showed how to create a ChatGPT clone with
Go, I used the LangChain Go library called langchaingo to create this chatbot. In
fact, most of the code is the same.

Here’s the code.

package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
	"strings"
	"text/template"

	"github.com/go-chi/chi"
	"github.com/go-chi/chi/middleware"
	"github.com/joho/godotenv"
	"github.com/tmc/langchaingo/llms/local"
)

var (
	wd          string
	bin         string
	model       string
	gpuLayers   string
	threads     string
	contextSize string
)

// initialise to load environment variables from the .env file
func init() {
	err := godotenv.Load()
	if err != nil {
		log.Fatal("Error loading .env file")
	}
	wd, err = os.Getwd()
	if err != nil {
		log.Fatal("Error getting current directory")
	}
	bin = os.Getenv("LOCAL_LLM_BIN")
	model = os.Getenv("LOCAL_LLM_MODEL")
	gpuLayers = os.Getenv("LOCAL_LLM_NUM_GPU_LAYERS")
	threads = os.Getenv("LOCAL_LLM_NUM_CPU_CORES")
	contextSize = os.Getenv("LOCAL_LLM_CONTEXT")
}

func main() {
	r := chi.NewRouter()
	r.Use(middleware.Logger)
	r.Handle("/static/*", http.StripPrefix("/static",
		http.FileServer(http.Dir("./static"))))
	r.Get("/", index)
	r.Post("/run", run)
	log.Println("\033[93mMonsoon started. Press CTRL+C to quit.\033[0m")
	http.ListenAndServe(":"+os.Getenv("PORT"), r)
}

// index
func index(w http.ResponseWriter, r *http.Request) {
	t, _ := template.ParseFiles("static/index.html")
	t.Execute(w, nil)
}

// call the LLM and return the response
func run(w http.ResponseWriter, r *http.Request) {
	prompt := struct {
		Input string `json:"input"`
	}{}
	// decode JSON from client
	err := json.NewDecoder(r.Body).Decode(&prompt)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	// create the LLM
	bin := fmt.Sprintf("%s/%s", wd, bin)
	args := fmt.Sprintf("-m %s/%s -t %s --temp 0 -eps 1e-5 -c %s -ngl %s -p",
		wd, model, threads, contextSize, gpuLayers)

	llm, err := local.New(
		local.WithBin(bin),
		local.WithArgs(args),
	)
	if err != nil {
		log.Println("Cannot create local LLM:", err.Error())
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	completion, err := llm.Call(context.Background(), prompt.Input)
	if err != nil {
		log.Println("Cannot get completion:", err.Error())
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	// remove the question if it appears in the response
	completion = strings.ReplaceAll(completion, prompt.Input, "")
	response := struct {
		Input    string `json:"input"`
		Response string `json:"response"`
	}{
		Input:    prompt.Input,
		Response: completion,
	}
	json.NewEncoder(w).Encode(response)
}

The only difference is creating the LLM. Instead of creating an OpenAI LLM, I
create a local LLM by setting up the executable binary as well as the arguments to
use for the binary.

// create the LLM
bin := fmt.Sprintf("%s/%s", wd, bin)
args := fmt.Sprintf("-m %s/%s -t %s --temp 0 -eps 1e-5 -c %s -ngl %s -p",
	wd, model, threads, contextSize, gpuLayers)

llm, err := local.New(
	local.WithBin(bin),
	local.WithArgs(args),
)

Of course, I also need to set up the environment variable configuration. I
changed the binary executable name from main to llamacpp to make it obvious that
I'm using llama.cpp.

# You can download llama-2-7B models from https://huggingface.co/TheBloke

LOCAL_LLM_BIN=local/llamacpp
LOCAL_LLM_MODEL=local/llama-2-13b-chat.ggmlv3.q5_K_M.bin
LOCAL_LLM_NUM_GPU_LAYERS=38
LOCAL_LLM_NUM_CPU_CORES=12
LOCAL_LLM_CONTEXT=4096

PORT=1102

After building llama.cpp with Metal, you will get a ggml-metal.metal file, which you
will also need to include in the same directory the binary executable is in. You can
find the complete code as well as all other files (except the models, of course) here.

GitHub - sausheong/monsoon: Monsoon is a simple ChatGPT clone built with Go. It uses Llama-compatible LLMs, through llama.cpp. (github.com)
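
If you want to exercise the /run endpoint outside the browser, here's a minimal Go client sketch. It assumes Monsoon is running locally with the PORT=1102 setting above and simply POSTs the same JSON payload the web page sends.

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// the handler expects a JSON body with an "input" field
	payload := bytes.NewBufferString(`{"input": "What is the monsoon?"}`)
	resp, err := http.Post("http://localhost:1102/run", "application/json", payload)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	// the response echoes the input and carries the completion
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}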

Here are a few screenshots of Monsoon, the local Go chatbot.

When I asked Monsoon "What is the monsoon?", it gave a pretty comprehensive
answer in 29 seconds, which is quite amazing.

Enjoy playing around with it!

References
Here's more reading if you want to go deeper into the topics I covered in this article.

Large language models are having their Stable Diffusion moment (simonwillison.net)

LLaMA: Open and Efficient Foundation Language Models (arxiv.org)

How is LLaMa.cpp possible? (finbarr.ca)

k-quants by ikawrakow · Pull Request #1684 · ggerganov/llama.cpp (github.com)

Benchmarking LLMs and what is the best LLM? (msandbu.org)

LLaMA-2 Perplexities · ggerganov/llama.cpp · Discussion #2352 (github.com)

Thank you for reading until the end. Please consider following the writer and this
publication. Visit Stackademic to find out more about how we are democratizing free
programming education around the world.

Tags: Go, AI, Llama 2, ChatGPT, Langchain

Written by Sau Sheong

Writer for Stackademic. I write, code.
