
🧼 From GPU-poor to data-rich

data quality practices for LLM fine-tuning

PyCon Italia 2024


Can an individual or small company train an LLM?

Consumer GPUs: Colab and Runpod 🙏
https://www.anyscale.com/blog/num-every-llm-developer-should-know

OSS models: small and large
https://huggingface.co/models?pipeline_tag=text-generation&sort=trending

OSS training code: alternatives to RLHF
https://github.com/huggingface/trl/pull/1435

Synthetic data: AI Feedback
https://huggingface.co/datasets/argilla/OpenHermesPreferences
So, what do we need?

https://imgflip.com/memegenerator/What-Do-We-Want
LLM Training Phases

Pre-training

Learn how language works


Raw text sources
Quantity over Quality
Industry compute

https://huggingface.co/docs/transformers/en/tasks/masked_language_modeling
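The linked guide covers full training; as a tiny, hedged sketch of the objective itself, here is a query against a masked-language model (the model name is an arbitrary example):

```python
from transformers import pipeline

# distilroberta-base is an arbitrary small masked-language model.
fill_mask = pipeline("fill-mask", model="distilroberta-base")

# Pre-training = predicting missing tokens from raw text.
for pred in fill_mask("Pre-training teaches a model how <mask> works."):
    print(pred["token_str"], round(pred["score"], 3))
```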
LLM Training Phases

Pre-training → Instruction Tuning (SFT)

Replicate output behavior
Structured text sources
Quality over quantity
Consumer compute

<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s>
<|user|>
How many helicopters can a human eat in one sitting?</s>
<|assistant|>
Arg, scallywag, that is not possible!</s>
<|user|>
But what if they're really hungry?</s>
<|assistant|>
Humans can’t eat helicopters. Arrr!</s>
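A sketch of how such a transcript is produced from structured messages; zephyr-7b-beta is one model whose chat template uses exactly these <|system|>/<|user|>/<|assistant|> tokens:

```python
from transformers import AutoTokenizer

# zephyr-7b-beta's chat template renders the special-token format shown above.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system",
     "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user",
     "content": "How many helicopters can a human eat in one sitting?"},
]

# Render the structured messages into the model's expected SFT prompt format.
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```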
LLM Training Phases

Pre-training → Instruction Tuning (SFT) → Preference Tuning

Align with human preference
Chosen-rejected
Quality over quantity
Consumer compute

<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s>
<|user|>
How many helicopters can a human eat in one sitting?</s>

Two candidate responses, one chosen and one rejected 🤔:

<|assistant|>
Arg, scallywag, that is not possible!</s>

<|assistant|>
Humans cannot eat helicopters</s>
https://arxiv.org/pdf/2305.18290.pdf
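A minimal sketch of DPO-style preference tuning on a chosen/rejected pair with TRL; the model and hyperparameters are placeholders, and argument names vary across trl versions:

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# A toy chosen/rejected pair in the column format DPOTrainer expects.
pairs = Dataset.from_dict({
    "prompt": ["How many helicopters can a human eat in one sitting?"],
    "chosen": ["Arg, scallywag, that is not possible!"],
    "rejected": ["Humans cannot eat helicopters"],
})

model_id = "HuggingFaceH4/zephyr-7b-beta"  # placeholder; any causal LM works
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-out", beta=0.1),
    train_dataset=pairs,
    tokenizer=tokenizer,  # recent trl versions rename this to processing_class
)
trainer.train()
```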
LLM Training Phases

Pre-training: raw text sources · quantity over quality · learn how language works · industry compute
Instruction Tuning (SFT): structured text sources · quality over quantity · replicate output behavior · consumer compute
Preference Tuning: chosen-rejected · quality over quantity · align with human preference · consumer compute

Alignment methods?

https://argilla.io/blog/mantisnlp-rlhf-part-1/
Cool, data quality? So... now what?

🦙 Stanford Alpaca
🐑 Databricks Dolly
🤖 OpenBMB UltraFeedback
👨🏾‍🤝‍👨🏼 Argilla + Hugging Face: Data Is Better Together
🦙 Stanford Alpaca
The data
52K SFT
Synthetic
Self-Instruct
text-davinci-003
Fine-grained categories

https://crfm.stanford.edu/2023/03/13/alpaca.html
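For inspection, the 52K records can be loaded from the Hub; "tatsu-lab/alpaca" is a common community mirror (an assumption here), while the original JSON ships with the Stanford repo:

```python
from datasets import load_dataset

# "tatsu-lab/alpaca" is a widely used Hub mirror of the 52K Alpaca records.
alpaca = load_dataset("tatsu-lab/alpaca", split="train")
print(len(alpaca))                 # ~52K examples
print(alpaca[0]["instruction"])    # columns: instruction, input, output, text
```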
🦙 Stanford Alpaca
The problem
References (content)
“Hallucinates” (stock prices)
And a lot more... (toxicity)

https://crfm.stanford.edu/2023/03/13/alpaca.html
🦙 Stanford Alpaca
The Solution
SetFit: few-shot text classification (clean vs. dirty)
Explore your data and predictions

https://huggingface.co/argilla/alpaca-garbage-collector-multilingual
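A hedged sketch of the few-shot clean/dirty classifier idea with SetFit; the labels and training texts are made up, and SetFitTrainer is the pre-1.0 API name:

```python
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

# Tiny illustrative training set: 0 = clean, 1 = dirty (labels are made up).
train_ds = Dataset.from_dict({
    "text": [
        "Photosynthesis converts sunlight into chemical energy.",
        "As an AI language model, I cannot access real-time stock prices.",
    ],
    "label": [0, 1],
})

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-MiniLM-L3-v2")
trainer = SetFitTrainer(model=model, train_dataset=train_ds)
trainer.train()

# Predict on unseen records to surface likely-dirty examples for review.
print(model.predict(["The current price of AAPL is $123.45."]))
```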
🦙 Stanford Alpaca

https://arxiv.org/abs/2404.12365
🐑 Databricks Dolly
The data
15K SFT
Human
5K employees
InstructGPT categories

https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
🐑 Databricks Dolly
The problem
Wrong annotation guidelines
len(summary) > len(input)
Context copy-paste as response
Ref “[#]” and URLs

https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
🐑 Databricks Dolly
The Solution
text-descriptives
DeepL translations

https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
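A sketch of flagging suspect Dolly records, combining the len(summary) > len(input) heuristic from the previous slide with the textdescriptives library; the field names are illustrative and the spaCy model must be installed:

```python
import textdescriptives as td

record = {
    "context": "A long source passage that the annotator was asked to summarize ...",
    "response": "A summary that somehow ended up longer than the passage itself ...",
}

# Cheap heuristic from the slide: a "summary" longer than its input is suspect.
if len(record["response"]) > len(record["context"]):
    print("flag: summary longer than input")

# Surface-level statistics usable as filters (needs en_core_web_sm installed).
stats = td.extract_metrics(
    text=record["response"],
    spacy_model="en_core_web_sm",
    metrics=["descriptive_stats"],
)
print(stats["n_tokens"])
```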
🐑 Databricks Dolly

https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual
🤖 OpenBMB UltraFeedback
The data
Synthetic
64k prompts
256k completions
340k comparisons
Rating with GPT-4

https://github.com/OpenBMB/UltraFeedback
🤖 OpenBMB UltraFeedback
The problem
Data from benchmarks
A coding error (ratings of 1 parsed as 10)
Incomplete ratings
Ties in data

https://github.com/OpenBMB/UltraFeedback
🤖 OpenBMB UltraFeedback
The Solution
Average the per-criterion ratings
Zephyr + DPO (→ Notus)

https://github.com/OpenBMB/UltraFeedback
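A sketch of the averaging fix: rank completions by the mean of their per-criterion ratings instead of the buggy overall score, then take the best and worst as chosen/rejected; field names are illustrative:

```python
# Each completion carries per-criterion GPT-4 ratings (field names illustrative).
completions = [
    {"text": "...", "ratings": {"helpfulness": 4, "honesty": 5,
                                "truthfulness": 4, "instruction_following": 5}},
    {"text": "...", "ratings": {"helpfulness": 2, "honesty": 3,
                                "truthfulness": 1, "instruction_following": 2}},
]

def mean_rating(completion):
    ratings = completion["ratings"].values()
    return sum(ratings) / len(ratings)

# Rank by averaged criteria rather than the unreliable overall score.
ranked = sorted(completions, key=mean_rating, reverse=True)
chosen, rejected = ranked[0], ranked[-1]
```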
🤖 OpenBMB UltraFeedback

https://argilla.io/blog/notus7b
👨🏾‍🤝‍👨🏼 Data is better together!
The data
Synthetic and human
52K prompts

https://huggingface.co/datasets/DIBT/10k_prompts_ranked
👨🏾‍🤝‍👨🏼 Data is better together!
The problem
There is too little human eval: 🤖✏️ >> 🙋🏻‍♂️✏️
Human vs. synthetic data

https://huggingface.co/datasets/DIBT/10k_prompts_ranked
👨🏾‍🤝‍👨🏼 Data is better together!
The Solution
Community effort on HF
10K prompts
Mistral-large
10K responses
Zephyr + SPIN

https://huggingface.co/datasets/DIBT/10k_prompts_ranked
Argilla team: “Data quality and human
feedback are important!”
But, don’t take it just from us.

LIMA: Less Is More for Alignment

Data
1K SFT
Manual curation
High quality
Task diversity
Lacking in
Math
Coding

https://arxiv.org/abs/2305.11206
But, don’t take it just from us.
Deita: What Makes Good Data for Alignment?

Data
300K->6K SFT
10K DPO
LLM data evolution
Complexity: prompt
Quality: response
Diversity data filtering
Embeddings

https://arxiv.org/abs/2312.15685
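A sketch of the embedding-based diversity-filtering idea: drop a candidate if it is too similar to anything already kept. The encoder and threshold are illustrative, not Deita's exact recipe:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small illustrative encoder

candidates = [
    "Write a haiku about Python.",
    "Compose a short haiku on Python.",          # near-duplicate
    "Explain direct preference optimization.",
]

kept_texts, kept_embs, threshold = [], [], 0.9   # threshold is illustrative
for text in candidates:
    emb = model.encode(text, convert_to_tensor=True)
    # Keep only if not too similar to anything already selected.
    if all(util.cos_sim(emb, e).item() < threshold for e in kept_embs):
        kept_texts.append(text)
        kept_embs.append(emb)

print(kept_texts)
```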
But, don’t take it just from us.

Yi: Open Foundation Models by 01.AI

Data
3.1T Pre-training tokens
<10K SFT
Pre-training filters for
Basic heuristics
Quality classifiers
Diversity clusters
Deduplication
SFT
Complexity: prompt
Diversity sampling

https://arxiv.org/html/2403.04652v1
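A toy sketch of basic heuristic filtering plus exact deduplication; the thresholds are made up, and Yi's real pipeline is far more elaborate (quality classifiers, clustering, fuzzy dedup):

```python
import hashlib

corpus = ["Some raw web document ...", "Some raw web document ...", "ok"]

def passes_heuristics(doc: str) -> bool:
    """Toy versions of the 'basic heuristics' bullet (thresholds made up)."""
    words = doc.split()
    if len(words) < 3:                       # too short to be useful
        return False
    if len(set(words)) / len(words) < 0.3:   # highly repetitive
        return False
    return True

# Exact deduplication via normalized hashes.
seen, kept = set(), []
for doc in corpus:
    key = hashlib.md5(doc.strip().lower().encode()).hexdigest()
    if key not in seen and passes_heuristics(doc):
        seen.add(key)
        kept.append(doc)

print(kept)
```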
So, data quality and human feedback are important!
Some pointers to get you started.

Quality > Quantity, but how much?


SFT -> 6K-10K
DPO -> 3K-10K

The bare necessities


Get your hands dirty
Good annotation guides

Start simple
Embeddings
Classifiers
Text-descriptives
Topic modeling

End complex
LLMs as prompt engineers
LLMs as Judge
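A minimal LLM-as-judge sketch (the generic pattern, not Distilabel itself); the judge model, client setup, and rubric are assumptions:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any strong model works

JUDGE_PROMPT = (
    "Rate the response to the instruction on a 1-10 scale for helpfulness "
    "and honesty. Reply with only the number.\n\n"
    "Instruction: {instruction}\n\nResponse: {response}"
)

completion = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": JUDGE_PROMPT.format(
            instruction="How many helicopters can a human eat in one sitting?",
            response="Humans cannot eat helicopters.",
        ),
    }],
)
print(completion.choices[0].message.content)  # the judge's score
```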
More cool things

Distilabel: framework for synthetic data and AI feedback
DIBT: KTO-preference (+1/-1)
DIBT: Multilingual Prompt Evaluation Project (MPEP)
Questions, feedback and contacts

Me + slides: @davidbstein1957 · /in/davidberenstein1957
Argilla: @argilla_io · /company/argilla-io
Feedback form
