A Look at Open-Source Alternatives To ChatGPT - TechTalks
Since its release in November 2022, ChatGPT has captured the imagination of the
world. People are using it for all kinds of tasks and applications. It has the
potential to change popular applications and create new ones.
But ChatGPT has also triggered an AI arms race between tech giants such as
Microsoft and Google. This has pushed the industry toward more competition and
less openness on large language models (LLMs). The source code, model
architecture, weights, and training data of these instruction-following LLMs are
not available to the public. Most of them are available either through commercial
APIs or black-box web applications.
Closed LLMs such as ChatGPT, Bard, and Claude have many advantages, including
ease of access to sophisticated technology. But they also pose limits to research
labs and scientists who want to study and better understand LLMs. They are also
inconvenient for companies and organizations that want to create and run their
own models.
Fortunately, in tandem with the race to create commercial LLMs, there is also a
community effort to create open-source models that match the performance of
state-of-the-art LLMs. These models can help improve research by sharing
results. They can also help prevent a few wealthy organizations from having too
much sway and power over the LLM market.
LLaMA
One of the most important open-source language models comes from FAIR,
Meta’s AI research lab. In February, FAIR released LLaMA, a family of LLMs that
come in four different sizes: 7, 13, 33, and 65 billion parameters. (ChatGPT is
based on the 175-billion-parameter InstructGPT model.)
LLaMA is not an instruction-following LLM like ChatGPT. But the idea behind the
smaller size of LLaMA is that smaller models pre-trained on more tokens are
easier to retrain and fine-tune for specific tasks and use cases. This has made it
possible for other researchers to fine-tune the model for ChatGPT-like
performance through techniques such as reinforcement learning from human
feedback (RLHF).
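To get a feel for why the smaller sizes are easier to work with, here is a rough back-of-the-envelope sketch of how much memory the weights alone would take. It assumes 16-bit weights (two bytes per parameter), which is an assumption of this example and not a figure from Meta. The function and constant names are made up for illustration.

```python
# Rough weight-memory estimate for the four LLaMA sizes named above.
# Assumption (not from the article): weights are stored as 16-bit floats,
# i.e. 2 bytes per parameter. Activations, optimizer state, and any
# framework overhead are ignored, so real requirements will be higher.

LLAMA_SIZES_BILLIONS = [7, 13, 33, 65]
BYTES_PER_PARAM_FP16 = 2


def weight_memory_gb(params_billions: float,
                     bytes_per_param: int = BYTES_PER_PARAM_FP16) -> float:
    """Return the approximate size of the weights in gigabytes (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bytes_per_param
    return total_bytes / 1e9


if __name__ == "__main__":
    for size in LLAMA_SIZES_BILLIONS:
        print(f"{size}B parameters: ~{weight_memory_gb(size):.0f} GB of weights")
```

Under that assumption, the 7-billion-parameter model needs about 14 GB for its weights, while the 65-billion-parameter model needs about 130 GB. Fine-tuning needs considerably more memory than this, so treat these numbers as a lower bound.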
Meta released the model under “a noncommercial license focused on research use
cases.” It grants access only to academic researchers, government-affiliated
organizations, civil society, and research labs, on a case-by-case basis.
Meta published a paper and a model card for LLaMA, and researchers can request
access to the trained models.
(The model was leaked online shortly after its release, which effectively made it
available to everyone.)
Alpaca
The Stanford researchers released the entire self-instruct data set, the details of
the data generation process, along with the code for generating the data and fine-
tuning the model. (Since Alpaca is based on LLaMA, you must obtain the original
model from Meta.)
However, the researchers stress that Alpaca “is intended only for academic
research and any commercial use is prohibited.” It was created from LLaMA,
which makes it subject to the same licensing rules as its base model. And since
the researchers used InstructGPT to generate the fine-tuning data, they are
subject to OpenAI’s terms of use, which prohibit developing models that compete
with OpenAI.
Vicuna
Researchers at UC Berkeley, Carnegie Mellon University, Stanford, and UC San
Diego released Vicuna, another instruction-following LLM based on LLaMA. Vicuna
comes in two sizes, 7 billion and 13 billion parameters.
The researchers fine-tuned Vicuna using the training code from Alpaca and
70,000 examples from ShareGPT, a website where users can share their
conversations with ChatGPT. They made some enhancements to the training
process to support longer conversation contexts. They also used the SkyPilot
machine learning workload manager to reduce the costs of training from $500 to
around $140.
Preliminary evaluations show that Vicuna outperforms LLaMA and Alpaca, and it
also comes very close to Bard and ChatGPT. The researchers released the model
weights along with a full framework to install, train, and run LLMs. There is also a
very interesting online demo where you can test and compare Vicuna with other
open-source instruction LLMs.
Dolly
In March, Databricks released Dolly, a fine-tuned version of EleutherAI’s GPT-J 6B.
The researchers were inspired by the work done by the teams behind LLaMA and
Alpaca. Training Dolly cost less than $30 and took 30 minutes on a single
machine.
The use of the EleutherAI base model removed the limitations Meta imposed on
LLaMA-derived LLMs. However, Databricks trained Dolly on the same data that the
Stanford Alpaca team had generated through ChatGPT. Therefore, the model still
couldn’t be used for commercial purposes due to the non-compete limits OpenAI
imposes on data generated by ChatGPT.
In April, the same team released Dolly 2.0, a 12-billion-parameter model based
on EleutherAI’s Pythia model. This time, Databricks fine-tuned the model on a
dataset of 15,000 instruction-following examples written entirely by
humans. They gathered the examples in an interesting, gamified process
involving 5,000 of Databricks’ own staff.
Databricks released the trained Dolly 2.0 model, which has none of the
limitations of the previous models and can be used for commercial purposes. They
also released the 15,000-example instruction-following corpus that they used to
fine-tune the Pythia model. Machine learning engineers can use this corpus to
fine-tune their own LLMs.
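As a rough idea of what using such a corpus involves, the sketch below turns one instruction-following record into a single training string. The field names (`instruction`, `context`, `response`), the section markers, and the sample record are all assumptions made for illustration, not Databricks’ actual training format, so check the corpus documentation before relying on them.

```python
from typing import Dict


def format_example(record: Dict[str, str]) -> str:
    """Join one instruction record into a single prompt-plus-answer string.

    The "### ..." markers are a hypothetical template chosen for this sketch.
    """
    instruction = record["instruction"].strip()
    context = record.get("context", "").strip()
    response = record["response"].strip()

    parts = [f"### Instruction:\n{instruction}"]
    if context:
        # Only include the context section when the record actually has one.
        parts.append(f"### Context:\n{context}")
    parts.append(f"### Response:\n{response}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Made-up record, shown only to demonstrate the formatting.
    sample = {
        "instruction": "Summarize the text in one sentence.",
        "context": "Dolly 2.0 was fine-tuned on examples written by humans.",
        "response": "Dolly 2.0 was trained on human-written examples.",
    }
    print(format_example(sample))
```

Strings produced this way would then be tokenized and fed to whatever fine-tuning code you use. The template matters less than applying the same one consistently during training and inference.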
Open Assistant
The Open Assistant team will open-source all their models, datasets,
development, data gathering, everything. It is a full, transparent, community
effort. All the people
involved in the project were volunteers, dedicated to open science. It is a
different vision of what is happening behind the walled gardens of big tech
companies.
The best way to learn about Open Assistant is to watch the entertaining videos of
its co-founder and team lead Yannic Kilcher, who has long been an outspoken
critic of the closed approach of organizations such as OpenAI.
Open Assistant has different versions based on LLaMA and Pythia. You can use the
Pythia version for commercial purposes. Most of the models can run on a single
GPU.
More than 13,000 volunteers from across the globe helped collect the examples
used to fine-tune the base models. The team will soon release all the data along
with a paper that explains the entire project. The trained models are available on
Hugging Face. The project’s GitHub page contains the full code for training the
model and the frontend to use the model.
The project also has a website where you can chat with Open Assistant and test
the model. And it has a task dashboard where you can contribute to the project
by creating prompts or labeling outputs.
LLaMA’s open-source models helped spur the movement. The Alpaca project
showed that creating instruction-tuned LLMs did not require huge efforts and
costs. This in turn inspired the Vicuna project, which further reduced the costs of
training and gathering data. Dolly took the efforts in a different direction, showing
the benefits of community-led data-gathering efforts to work around the non-
compete requirements of commercial models.
There are several other models that are worth mentioning, including UC
Berkeley’s Koala and llama.cpp, a C++ implementation of the LLaMA models that
can run on ARM processors. It will be interesting to see how the open-source
movement develops in the coming months and how it will affect the LLM market.
Ben Dickson
Ben is a software engineer and the founder of TechTalks. He writes about technology, business and politics.