Less is more

A few weeks ago, we published our first dataset: the first 100k samples of FineWeb with generated data filtered out. Removing more than 16% of the samples left 83k, which we released on Hugging Face.

Then, we decided to test our intuition: does synthetic data generated in uncontrolled settings damage a model during training?

To do so, we used Qwen2.5-0.5B-Instruct from Alibaba Cloud and continued its pre-training on two corpora: the raw first 100k FineWeb samples, and our filtered UncovAI dataset.
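For readers who want to try a similar comparison, here is a minimal sketch of such a continued pre-training run using the Hugging Face transformers and datasets libraries. The hyperparameters are illustrative, and the UncovAI dataset repo id below is a hypothetical placeholder, not the published one; only the FineWeb repo id and the model id are real.

```python
# Minimal sketch: continued pre-training of Qwen2.5-0.5B-Instruct on a
# filtered FineWeb subset. Hyperparameters and the filtered-dataset repo
# id are illustrative assumptions, not the exact setup used in the post.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Baseline corpus: the raw first 100k FineWeb samples (streamed).
raw_100k = load_dataset(
    "HuggingFaceFW/fineweb", split="train", streaming=True
).take(100_000)

# Filtered corpus (~83k samples). Hypothetical repo id; swap in the
# actual dataset published on Hugging Face.
filtered = load_dataset("UncovAI/fineweb-100k-filtered", split="train")

def tokenize(batch):
    # Assumes the text lives in a "text" column, as in FineWeb.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = filtered.map(
    tokenize, batched=True, remove_columns=filtered.column_names
)

args = TrainingArguments(
    output_dir="qwen2.5-0.5b-continued",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=2e-5,
    bf16=True,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    # Standard causal-LM collator (mlm=False disables masked-LM labels).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Running the same script with `raw_100k` in place of `filtered` gives the baseline run, so the only variable between the two models is the training corpus.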

What did we find? The model trained on our dataset, despite seeing less data and no synthetic data, achieved better results on average and cut computation time by at least 10%.

Less is more: eliminating synthetic data generated in uncontrolled environments can improve your models and reduce costs 🚀

Let’s discuss how we can help you save time and money 💰