
🧼 From GPU-poor to data-rich

data quality practices for LLM fine-tuning

PyCon Italia 2024


Can an individual or small company train an LLM?

Consumer GPUs: Colab and Runpod 🙏
https://www.anyscale.com/blog/num-every-llm-developer-should-know

OSS models: small and large
https://huggingface.co/models?pipeline_tag=text-generation&sort=trending

OSS training code: alternatives to RLHF
https://github.com/huggingface/trl/pull/1435

Synthetic data: AI Feedback
https://huggingface.co/datasets/argilla/OpenHermesPreferences
So, what do we need?

https://imgflip.com/memegenerator/What-Do-We-Want
LLM Training Phases

Pre-training

Learn how language works


Raw text sources
Quantity over Quality
Industry compute

https://huggingface.co/docs/transformers/en/tasks/masked_language_modeling
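The linked guide covers full training; as a tiny, hedged sketch of the objective itself, here is a query against a masked-language model (the model name is an arbitrary example):

```python
from transformers import pipeline

# distilroberta-base is an arbitrary small masked-language model.
fill_mask = pipeline("fill-mask", model="distilroberta-base")

# Pre-training = predicting missing tokens from raw text.
for pred in fill_mask("Pre-training teaches a model how <mask> works."):
    print(pred["token_str"], round(pred["score"], 3))
```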
LLM Training Phases

Pre-training → Instruction Tuning (SFT)

Replicate output behavior
Structured text sources
Quality over quantity
Consumer compute

<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s>
<|user|>
How many helicopters can a human eat in one sitting?</s>
<|assistant|>
Arg, scallywag, that is not possible!</s>
<|user|>
But what if they're really hungry?</s>
<|assistant|>
Humans can’t eat helicopters. Arrr!</s>
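A sketch of how such a transcript is produced from structured messages; zephyr-7b-beta is one model whose chat template uses exactly these <|system|>/<|user|>/<|assistant|> tokens:

```python
from transformers import AutoTokenizer

# zephyr-7b-beta's chat template renders the special-token format shown above.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system",
     "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user",
     "content": "How many helicopters can a human eat in one sitting?"},
]

# Render the structured messages into the model's expected SFT prompt format.
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```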
LLM Training Phases

Pre-training → Instruction Tuning (SFT) → Preference Tuning

Align with human preference
Chosen-rejected
Quality over quantity
Consumer compute

<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s>
<|user|>
How many helicopters can a human eat in one sitting?</s>

Two candidate responses, one chosen and one rejected 🤔:

<|assistant|>
Arg, scallywag, that is not possible!</s>

<|assistant|>
Humans cannot eat helicopters</s>
https://arxiv.org/pdf/2305.18290.pdf
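A minimal sketch of DPO-style preference tuning on a chosen/rejected pair with TRL; the model and hyperparameters are placeholders, and argument names vary across trl versions:

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# A toy chosen/rejected pair in the column format DPOTrainer expects.
pairs = Dataset.from_dict({
    "prompt": ["How many helicopters can a human eat in one sitting?"],
    "chosen": ["Arg, scallywag, that is not possible!"],
    "rejected": ["Humans cannot eat helicopters"],
})

model_id = "HuggingFaceH4/zephyr-7b-beta"  # placeholder; any causal LM works
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-out", beta=0.1),
    train_dataset=pairs,
    tokenizer=tokenizer,  # recent trl versions rename this to processing_class
)
trainer.train()
```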
LLM Training Phases

Pre-training: raw text sources · quantity over quality · learn how language works · industry compute
Instruction Tuning (SFT): structured text sources · quality over quantity · replicate output behavior · consumer compute
Preference Tuning: chosen-rejected · quality over quantity · align with human preference · consumer compute

Alignment methods?

https://argilla.io/blog/mantisnlp-rlhf-part-1/
Cool, data quality? So... now what?

🦙 Stanford Alpaca
🐑 Databricks Dolly
🤖 OpenBMB UltraFeedback
👨🏾‍🤝‍👨🏼 Argilla + Hugging Face: Data Is Better Together
🦙 Stanford Alpaca
The data
52K SFT
Synthetic
Self-Instruct
text-davinci-003
Fine-grained categories

https://crfm.stanford.edu/2023/03/13/alpaca.html
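For inspection, the 52K records can be loaded from the Hub; "tatsu-lab/alpaca" is a common community mirror (an assumption here), while the original JSON ships with the Stanford repo:

```python
from datasets import load_dataset

# "tatsu-lab/alpaca" is a widely used Hub mirror of the 52K Alpaca records.
alpaca = load_dataset("tatsu-lab/alpaca", split="train")
print(len(alpaca))                 # ~52K examples
print(alpaca[0]["instruction"])    # columns: instruction, input, output, text
```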
🦙 Stanford Alpaca
The problem
References (content)
“Hallucinates” (stock prices)
And a lot more... (toxicity)

https://crfm.stanford.edu/2023/03/13/alpaca.html
🦙 Stanford Alpaca
The Solution
SetFit: few-shot text classification (clean vs. dirty)
Explore your data and predictions

https://huggingface.co/argilla/alpaca-garbage-collector-multilingual
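A hedged sketch of the few-shot clean/dirty classifier idea with SetFit; the labels and training texts are made up, and SetFitTrainer is the pre-1.0 API name:

```python
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

# Tiny illustrative training set: 0 = clean, 1 = dirty (labels are made up).
train_ds = Dataset.from_dict({
    "text": [
        "Photosynthesis converts sunlight into chemical energy.",
        "As an AI language model, I cannot access real-time stock prices.",
    ],
    "label": [0, 1],
})

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-MiniLM-L3-v2")
trainer = SetFitTrainer(model=model, train_dataset=train_ds)
trainer.train()

# Predict on unseen records to surface likely-dirty examples for review.
print(model.predict(["The current price of AAPL is $123.45."]))
```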
🦙 Stanford Alpaca

https://arxiv.org/abs/2404.12365
🐑 Databricks Dolly
The data
15K SFT
Human
5K employees
InstructGPT categories

https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
🐑 Databricks Dolly
The problem
Wrong annotation guidelines
len(summary) > len(input)
Context copy-paste as response
Ref “[#]” and URLs

https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
🐑 Databricks Dolly
The Solution
text-descriptives
DeepL translations

https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
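A sketch of flagging suspect Dolly records, combining the len(summary) > len(input) heuristic from the previous slide with the textdescriptives library; the field names are illustrative and the spaCy model must be installed:

```python
import textdescriptives as td

record = {
    "context": "A long source passage that the annotator was asked to summarize ...",
    "response": "A summary that somehow ended up longer than the passage itself ...",
}

# Cheap heuristic from the slide: a "summary" longer than its input is suspect.
if len(record["response"]) > len(record["context"]):
    print("flag: summary longer than input")

# Surface-level statistics usable as filters (needs en_core_web_sm installed).
stats = td.extract_metrics(
    text=record["response"],
    spacy_model="en_core_web_sm",
    metrics=["descriptive_stats"],
)
print(stats["n_tokens"])
```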
🐑 Databricks Dolly

https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual
🤖 OpenBMB UltraFeedback
The data
Synthetic
64k prompts
256k completions
340k comparisons
Rating with GPT-4

https://github.com/OpenBMB/UltraFeedback
🤖 OpenBMB UltraFeedback
The problem
Data from benchmarks
A coding error (ratings of 1 parsed as 10)
Incomplete ratings
Ties in data

https://github.com/OpenBMB/UltraFeedback
🤖 OpenBMB UltraFeedback
The Solution
Average the per-criterion ratings
Zephyr + DPO (→ Notus)

https://github.com/OpenBMB/UltraFeedback
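A sketch of the averaging fix: rank completions by the mean of their per-criterion ratings instead of the buggy overall score, then take the best and worst as chosen/rejected; field names are illustrative:

```python
# Each completion carries per-criterion GPT-4 ratings (field names illustrative).
completions = [
    {"text": "...", "ratings": {"helpfulness": 4, "honesty": 5,
                                "truthfulness": 4, "instruction_following": 5}},
    {"text": "...", "ratings": {"helpfulness": 2, "honesty": 3,
                                "truthfulness": 1, "instruction_following": 2}},
]

def mean_rating(completion):
    ratings = completion["ratings"].values()
    return sum(ratings) / len(ratings)

# Rank by averaged criteria rather than the unreliable overall score.
ranked = sorted(completions, key=mean_rating, reverse=True)
chosen, rejected = ranked[0], ranked[-1]
```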
🤖 OpenBMB UltraFeedback

https://argilla.io/blog/notus7b
👨🏾‍🤝‍👨🏼 Data is better together!
The data
Synthetic and human
52K prompts

https://huggingface.co/datasets/DIBT/10k_prompts_ranked
👨🏾‍🤝‍👨🏼 Data is better together!
The problem
There is too little human eval: 🤖✏️ >> 🙋🏻‍♂️✏️
Human vs. synthetic data

https://huggingface.co/datasets/DIBT/10k_prompts_ranked
👨🏾‍🤝‍👨🏼 Data is better together!
The Solution
Community effort on HF
10K prompts
Mistral-large
10K responses
Zephyr + SPIN

https://huggingface.co/datasets/DIBT/10k_prompts_ranked
Argilla team: “Data quality and human
feedback are important!”
But, don’t take it just from us.

LIMA: Less Is More for Alignment

Data
1K SFT
Manual curation
High quality
Task diversity
Lacking in
Math
Coding

https://arxiv.org/abs/2305.11206
But, don’t take it just from us.
Deita: What Makes Good Data for Alignment?

Data
300K->6K SFT
10K DPO
LLM data evolution
Complexity: prompt
Quality: response
Diversity data filtering
Embeddings

https://arxiv.org/abs/2312.15685
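A sketch of the embedding-based diversity-filtering idea: drop a candidate if it is too similar to anything already kept. The encoder and threshold are illustrative, not Deita's exact recipe:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small illustrative encoder

candidates = [
    "Write a haiku about Python.",
    "Compose a short haiku on Python.",          # near-duplicate
    "Explain direct preference optimization.",
]

kept_texts, kept_embs, threshold = [], [], 0.9   # threshold is illustrative
for text in candidates:
    emb = model.encode(text, convert_to_tensor=True)
    # Keep only if not too similar to anything already selected.
    if all(util.cos_sim(emb, e).item() < threshold for e in kept_embs):
        kept_texts.append(text)
        kept_embs.append(emb)

print(kept_texts)
```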
But, don’t take it just from us.

Yi: Open Foundation Models by 01.AI

Data
3.1T Pre-training tokens
<10K SFT
Pre-training filters for
Basic heuristics
Quality classifiers
Diversity clusters
Deduplication
SFT
Complexity: prompt
Diversity sampling

https://arxiv.org/html/2403.04652v1
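A toy sketch of basic heuristic filtering plus exact deduplication; the thresholds are made up, and Yi's real pipeline is far more elaborate (quality classifiers, clustering, fuzzy dedup):

```python
import hashlib

corpus = ["Some raw web document ...", "Some raw web document ...", "ok"]

def passes_heuristics(doc: str) -> bool:
    """Toy versions of the 'basic heuristics' bullet (thresholds made up)."""
    words = doc.split()
    if len(words) < 3:                       # too short to be useful
        return False
    if len(set(words)) / len(words) < 0.3:   # highly repetitive
        return False
    return True

# Exact deduplication via normalized hashes.
seen, kept = set(), []
for doc in corpus:
    key = hashlib.md5(doc.strip().lower().encode()).hexdigest()
    if key not in seen and passes_heuristics(doc):
        seen.add(key)
        kept.append(doc)

print(kept)
```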
So, data quality and human feedback are important!
Some pointers to get you started.

Quality > Quantity, but how much?


SFT -> 6K-10K
DPO -> 3K-10K

The bare necessities


Get your hands dirty
Good annotation guides

Start simple
Embeddings
Classifiers
Text-descriptives
Topic modeling

End complex
LLMs as prompt engineers
LLMs as Judge
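A minimal LLM-as-judge sketch (the generic pattern, not Distilabel itself); the judge model, client setup, and rubric are assumptions:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any strong model works

JUDGE_PROMPT = (
    "Rate the response to the instruction on a 1-10 scale for helpfulness "
    "and honesty. Reply with only the number.\n\n"
    "Instruction: {instruction}\n\nResponse: {response}"
)

completion = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": JUDGE_PROMPT.format(
            instruction="How many helicopters can a human eat in one sitting?",
            response="Humans cannot eat helicopters.",
        ),
    }],
)
print(completion.choices[0].message.content)  # the judge's score
```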
More cool things

Distilabel: framework for synthetic data and AI feedback
DIBT: KTO-preference (+1/-1)
DIBT: Multilingual Prompt Evaluation Project (MPEP)
Questions, feedback and contacts

Me + slides: @davidbstein1957 · /in/davidberenstein1957
Argilla: @argilla_io · /company/argilla-io
Feedback form
