Argilla
https://imgflip.com/memegenerator/What-Do-We-Want
LLM Training Phases
Pre-training
https://huggingface.co/docs/transformers/en/tasks/masked_language_modeling
LLM Training Phases
Replicate output behavior
Structured text sources
Quality over quantity
Consumer compute
<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s>
<|user|>
How many helicopters can a human eat in one sitting?</s>
<|assistant|>
Arg, scallywag, that is not possible!</s>
But what if they're really hungry?</s>
<|assistant|>
Humans can’t eat helicopters. Arrr!</s>
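The tagged layout on this slide can be produced from a plain list of messages. The following is a minimal sketch, assuming each turn is written as a role tag on its own line, followed by the text and a closing `</s>`; the helper name `render_chat` and its `add_generation_prompt` flag are hypothetical and not taken from any library.

```python
def render_chat(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts into a tagged chat string."""
    parts = []
    for message in messages:
        # Assumed layout: role tag on its own line, then the text, closed by </s>.
        parts.append(f"<|{message['role']}|>\n{message['content']}</s>")
    if add_generation_prompt:
        # A trailing assistant tag asks the model to write the next turn.
        parts.append("<|assistant|>")
    return "\n".join(parts)


messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    {"role": "assistant", "content": "Arg, scallywag, that is not possible!"},
]

# Full conversation, as it would appear in an SFT training example.
sft_text = render_chat(messages)
# Prompt only, ending with the tag that cues the assistant's reply.
prompt_text = render_chat(messages[:2], add_generation_prompt=True)
```

For SFT the full string, including the assistant turn, is the training text; at inference time only the prompt variant is passed to the model.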
LLM Training Phases
Align with human preference
Chosen-rejected
Quality over quantity
Consumer compute
<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s>
<|user|>
How many helicopters can a human eat in one sitting?</s>
<|assistant|>
Arg, scallywag, that is not possible!</s>
🤔
<|assistant|>
Humans cannot eat helicopters</s>
https://arxiv.org/pdf/2305.18290.pdf
LLM Training Phases
https://arxiv.org/pdf/2305.18290.pdf
LLM Training Phases
Align with human preference
Chosen-rejected
Quality over quantity
Consumer compute
<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s>
<|user|>
How many helicopters can a human eat in one sitting?</s>
<|assistant|>
Arg, scallywag, that is not possible!</s>
>
<|assistant|>
Humans cannot eat helicopters</s>
https://arxiv.org/pdf/2305.18290.pdf
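To make the chosen > rejected idea concrete, here is a minimal sketch of the per-pair loss from the linked DPO paper, written in plain Python. It assumes the summed log-probabilities of each response under the trained policy and under a frozen reference model are already available; the function name, the argument names, and the default `beta` are illustrative choices, not values from the slides.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # Implicit rewards: how much more the policy favours each response
    # than the frozen reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), computed as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical log-probabilities, for illustration only.
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # policy equals reference
loss_after_update = dpo_loss(-8.0, -14.0, -10.0, -12.0)  # policy now prefers "chosen"
```

When the policy equals the reference the margin is zero and the loss is log 2; the loss falls as the policy shifts probability towards the chosen response and away from the rejected one.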
LLM Training Phases
Alignment methods?
https://argilla.io/blog/mantisnlp-rlhf-part-1/
Cool, data quality? So... now what?
🦙 Stanford Alpaca
🐑 Databricks Dolly
🤖 OpenBMB UltraFeedback
👨🏾🤝👨🏼 Argilla + Hugging Face Data Is Better Together
🦙 Stanford Alpaca
The data
52K SFT
Synthetic
SelfInstruct
`text-davinci-003`
Fine-grained categories
https://crfm.stanford.edu/2023/03/13/alpaca.html
The problem
References (content)
“Hallucinates” (Stock prices)
And a lot more... (toxicity)
The Solution
SetFit: FewShot TextCat
Clean
Dirty
Explore your data predictions
https://huggingface.co/argilla/alpaca-garbage-collector-multilingual
🦙 Stanford Alpaca
https://arxiv.org/abs/2404.12365
🐑 Databricks Dolly
The data
15K SFT
Human
5K employees
InstructGPT categories
https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
The problem
Wrong annotation guidelines
len(summary) > len(input)
Context copy-paste as response
Ref “[#]” and URLs
The Solution
text-descriptives
DeepL translations
https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual
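The Dolly problems listed above lend themselves to simple rule-based checks. The sketch below is one possible way to flag them, not the curation pipeline that was actually used. It assumes records with `context`, `response`, and `category` fields, treats "[#]" as a bracketed reference number such as "[1]", and uses a hypothetical helper name, `dolly_issues`.

```python
import re

# Assumption: "[#]" on the slide means bracketed reference numbers such as "[1]".
REFERENCE_MARKER = re.compile(r"\[\d+\]")
URL = re.compile(r"https?://\S+")


def dolly_issues(record):
    """Return the names of the heuristic checks that a record fails."""
    context = record.get("context", "")
    response = record.get("response", "")
    text = context + " " + response
    issues = []
    # A summary should be shorter than the text it summarises.
    if record.get("category") == "summarization" and len(response) > len(context):
        issues.append("summary_longer_than_input")
    # The response should not be a verbatim copy of the context.
    if context and response.strip() == context.strip():
        issues.append("context_copied_as_response")
    if REFERENCE_MARKER.search(text):
        issues.append("reference_marker")
    if URL.search(text):
        issues.append("url")
    return issues


long_summary = {"context": "Short text.",
                "response": "A much longer summary than the text itself.",
                "category": "summarization"}
copied = {"context": "The sky is blue.[1]",
          "response": "The sky is blue.[1]",
          "category": "closed_qa"}
clean = {"context": "", "response": "Paris.", "category": "open_qa"}
```

Flagged records are candidates for human review rather than automatic deletion, which keeps the final call with an annotator.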
🤖 OpenBMB UltraFeedback
The data
Synthetic
64k prompts
256k completions
340k comparisons
Rating with GPT-4
The problem
Data from benchmarks
A coding error (1 => 10)
Incomplete ratings
Ties in data
The Solution
Average criteria
Zephyr + DPO
https://github.com/OpenBMB/UltraFeedback
🤖 OpenBMB UltraFeedback
https://argilla.io/blog/notus7b
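The "Average criteria" fix can be illustrated by turning per-criterion ratings into one chosen/rejected pair per prompt. This is a minimal sketch under stated assumptions: the four criterion names, the record layout, and the choice of the lowest-scoring completion as "rejected" are illustrative, and completions with missing ratings or tied averages are simply dropped.

```python
from statistics import mean

# Assumed criterion names; adapt them to the fields in your copy of the data.
CRITERIA = ("instruction_following", "honesty", "truthfulness", "helpfulness")


def average_rating(ratings):
    """Mean over all criteria, or None when any rating is missing."""
    values = [ratings.get(name) for name in CRITERIA]
    if any(value is None for value in values):
        return None  # incomplete rating: do not trust this completion
    return mean(values)


def to_preference_pair(prompt, completions):
    """Build one chosen/rejected pair, or None when there is no clear preference."""
    scored = []
    for completion in completions:
        score = average_rating(completion["ratings"])
        if score is not None:
            scored.append((score, completion["text"]))
    if len(scored) < 2:
        return None
    scored.sort(key=lambda item: item[0], reverse=True)
    best_score, best_text = scored[0]
    worst_score, worst_text = scored[-1]
    if best_score == worst_score:
        return None  # tie: no preference signal to learn from
    return {"prompt": prompt, "chosen": best_text, "rejected": worst_text}


example_completions = [
    {"text": "A", "ratings": dict.fromkeys(CRITERIA, 5)},
    {"text": "B", "ratings": {"instruction_following": 3, "honesty": 4,
                              "truthfulness": 3, "helpfulness": 2}},
    {"text": "C", "ratings": {"instruction_following": 1, "honesty": None,
                              "truthfulness": 1, "helpfulness": 1}},
]
pair = to_preference_pair("How many helicopters can a human eat?", example_completions)
```

Averaging the criteria avoids relying on a single overall score, so an error in that one field cannot flip the preference.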
👨🏾🤝👨🏼 Data is better together!
The data
Synthetic and human
52K prompts
The problem
There is too little human eval
Human vs Synthetic data
The Solution
Community effort on HF
10K prompts
Mistral-large
10K responses
Zephyr + SPIN
https://huggingface.co/datasets/DIBT/10k_prompts_ranked
Argilla team: “Data quality and human feedback are important!”
But, don’t take it just from us.
LIMA: Less Is More for Alignment
Data
1K SFT
Manual curation
High quality
Task diversity
Lacking in
Math
Coding
https://arxiv.org/abs/2305.11206
But, don’t take it just from us.
Deita: What Makes Good Data for Alignment?
Data
300K->6K SFT
10K DPO
LLM data evolution
Complexity: prompt
Quality: response
Diversity data filtering
Embeddings
https://arxiv.org/abs/2312.15685
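The combination of complexity, quality, and embedding-based diversity filtering can be sketched as a greedy selection. The code below is written in the spirit of that idea and is not the paper's implementation. It assumes each sample already carries a `complexity` score, a `quality` score, and an `embedding`; the similarity threshold and all names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_diverse(samples, budget, max_similarity=0.9):
    """Greedy, score-first selection that skips near-duplicates of kept samples."""
    # Rank by combined score: prompt complexity times response quality.
    ranked = sorted(samples, key=lambda s: s["complexity"] * s["quality"], reverse=True)
    selected = []
    for sample in ranked:
        if len(selected) >= budget:
            break
        # Keep a sample only if it is not too close to anything already kept.
        if all(cosine_similarity(sample["embedding"], kept["embedding"]) < max_similarity
               for kept in selected):
            selected.append(sample)
    return selected


# Toy two-dimensional embeddings, for illustration only.
samples = [
    {"id": "a", "complexity": 5, "quality": 5, "embedding": [1.0, 0.0]},
    {"id": "b", "complexity": 4, "quality": 5, "embedding": [0.99, 0.01]},
    {"id": "c", "complexity": 3, "quality": 3, "embedding": [0.0, 1.0]},
]
picked = [s["id"] for s in select_diverse(samples, budget=2)]
```

Here sample "b" scores higher than "c" but is nearly identical to "a", so the selection keeps "a" and "c".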
But, don’t take it just from us.
Yi: Open Foundation Models by 01.AI
Data
3.1T Pre-training tokens
<10K SFT
Pre-training filters for
Basic heuristics
Quality classifiers
Diversity clusters
Deduplication
SFT
Complexity: prompt
Diversity sampling
https://arxiv.org/html/2403.04652v1
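Two of the pre-training filters named above, basic heuristics and deduplication, are easy to prototype. The following is a minimal sketch, assuming exact deduplication on whitespace- and case-normalised text plus two hypothetical heuristics (a minimum word count and a maximum share of symbol characters); production pipelines typically add quality classifiers and fuzzy deduplication on top.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def passes_heuristics(text, min_words=5, max_symbol_ratio=0.3):
    """Reject very short documents and documents dominated by symbols."""
    if len(text.split()) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


def filter_corpus(documents):
    """Keep the first copy of each document that passes the heuristics."""
    seen = set()
    kept = []
    for doc in documents:
        if not passes_heuristics(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalisation
        seen.add(digest)
        kept.append(doc)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",
    "too short",
    "$$$ ### !!! ??? *** &&& %%%",
]
cleaned = filter_corpus(docs)
```

Only the first document survives: the second is a normalised duplicate, the third is too short, and the fourth is almost entirely symbols.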
So, data quality and human feedback are important!
Some pointers to get you started.
Start simple
Embeddings
Classifiers
Text-descriptives
Topic modeling
End complex
LLMs as prompt engineers
LLMs as Judge
More cool things
@davidbstein1957 @argilla_io
/in/davidberenstein1957 /company/argilla-io