✅ FineWeb 45TB Dataset: $500k GPU costs and Adult Content Improving LLM Quality

This week Hugging Face released what seems to be the largest open dataset (15 trillion tokens) created specifically for LLM training: FineWeb.

It is based on internet crawls from summer 2013 to winter 2024. The 15T-token size echoes the Llama 3 release a week earlier, which was also trained on a 15T-token dataset (versus 2T for the Llama 2 series). The leap in the amount of training data seems to be very effective: the new Llama models beat or matched much larger SOTA models across a wide range of evals. And today a dataset of the same caliber is available to everyone.

A few facts and quirks about the FineWeb release strike me as particularly interesting. HF's co-founder shared several of them in this LinkedIn post:

  1. HF spent the equivalent of half a million USD on GPU compute to process and distil 38 000TB of Common Crawl dumps into a 45TB dataset ready for LLM base model training.

    • $0.5M of GPU compute - the post mentions 120 000 hours of H100 compute time; based on these prices of about $4/h per H100, we get a ballpark of $480k.
    • 38 000TB of Common Crawl data is a ballpark figure, calculated by assuming one dump is 400TB (the Feb/March one is 424.7TB) and that there are 95 dumps in total.
  2. This compute was spent on evaluating various filtering options. To do this, small 1.8B-parameter models were trained on different FineWeb variants (28B or 350B tokens each) and evaluated. If a trained model scored better, the filtering technique was considered a success.

    • A total of 100 smaller and 15 larger models were trained in the process of filtering approach trials.

> We settled on 2 ways of training ablation models: in the first we trained a 1.8B parameters model on 28B tokens (about 5h on 64 H100) In the second we trained the same model for 350B tokens (about 2.5 days). Note that these larger ablations were trained on more tokens than GPT3 for instance

  3. The Common Crawl team started filtering adult content more strongly between 2022 and 2023, and this lowered the quality of LLMs trained on those crawls.

> Between 2022 and 2023 the "LLM quality" of Common Crawl dropped significantly as in "training a LLM on the crawls btw 2022-2023 will give you lower performances on a set of evals". What happened? it turns out the Common Crawl team has been filtering more strongly domains with adult content.
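The ballpark figures in the list above can be reproduced with a few lines of arithmetic. The sketch below is only a sanity check on the numbers quoted in this post. It assumes (my assumption, not something the announcement states) that the 100 smaller runs all used the 28B-token setup and the 15 larger runs all used the 350B-token setup, each on 64 H100s.

```python
# Back-of-the-envelope check of the figures quoted in the post.
# All inputs are ballpark values from the text, not official accounting.

H100_HOURS = 120_000        # H100 compute time mentioned in the post
USD_PER_H100_HOUR = 4       # assumed rental price per H100 hour
DUMP_COUNT = 95             # number of Common Crawl dumps
TB_PER_DUMP = 400           # assumed average dump size in TB
FINEWEB_TB = 45             # size of the resulting dataset in TB

# Item 1: compute cost and raw crawl size
compute_cost_usd = H100_HOURS * USD_PER_H100_HOUR
raw_crawl_tb = DUMP_COUNT * TB_PER_DUMP
reduction_factor = raw_crawl_tb / FINEWEB_TB

# Item 2: GPU-hours per ablation run, from the quoted wall-clock times
GPUS_PER_RUN = 64
small_run_gpu_hours = 5 * GPUS_PER_RUN          # about 5h on 64 H100
large_run_gpu_hours = 2.5 * 24 * GPUS_PER_RUN   # about 2.5 days on 64 H100

# Assumption: 100 runs of the small setup, 15 runs of the large one
total_ablation_gpu_hours = 100 * small_run_gpu_hours + 15 * large_run_gpu_hours

print(f"Compute cost:      ${compute_cost_usd:,}")
print(f"Raw crawl size:    {raw_crawl_tb:,} TB")
print(f"Reduction factor:  ~{reduction_factor:.0f}x")
print(f"Ablation compute:  {total_ablation_gpu_hours:,.0f} GPU-hours")
```

Under these assumptions the ablation runs add up to roughly 89 600 GPU-hours, the same order of magnitude as the 120 000 hours mentioned in the post.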

FineWeb is indeed a great contribution to the open-source community and a curious look at what preparing a training dataset for a base model might involve!

It's a pity that the largest consumer hard drive (WD Red Pro NAS Hard Drive) is just 24TB and costs $600 :)

