The Pile: Dream Training Dataset
The Pile is a large new dataset for training language models, built from the 22 component sources listed in the table below. It is like an all-you-can-eat buffet for those of us creating models in the GPT-3 space.

Component            Raw Size   Weight  Epochs  Effective Size  Mean Document Size
Pile-CC            227.12 GiB   18.11%     1.0      227.12 GiB            4.33 KiB
PubMed Central      90.27 GiB   14.40%     2.0      180.55 GiB           30.55 KiB
Books3             100.96 GiB   12.07%     1.5      151.44 GiB          538.36 KiB
OpenWebText2        62.77 GiB   10.01%     2.0      125.54 GiB            3.85 KiB
ArXiv               56.21 GiB    8.96%     2.0      112.42 GiB           46.61 KiB
Github              95.16 GiB    7.59%     1.0       95.16 GiB            5.25 KiB
FreeLaw             51.15 GiB    6.12%     1.5       76.73 GiB           15.06 KiB
StackExchange       32.20 GiB    5.13%     2.0       64.39 GiB            2.16 KiB
USPTO Backgrounds   22.90 GiB    3.65%     2.0       45.81 GiB            4.08 KiB
PubMed Abstracts    19.26 GiB    3.07%     2.0       38.53 GiB            1.30 KiB
Gutenberg (PG-19)   10.88 GiB    2.17%     2.5       27.19 GiB          398.73 KiB
OpenSubtitles       12.98 GiB    1.55%     1.5       19.47 GiB           30.48 KiB
Wikipedia (en)       6.38 GiB    1.53%     3.0       19.13 GiB            1.11 KiB
DM Mathematics       7.75 GiB    1.24%     2.0       15.49 GiB            8.00 KiB
Ubuntu IRC           5.52 GiB    0.88%     2.0       11.03 GiB          545.48 KiB
BookCorpus2          6.30 GiB    0.75%     1.5        9.45 GiB          369.87 KiB
EuroParl             4.59 GiB    0.73%     2.0        9.17 GiB           68.87 KiB
HackerNews           3.90 GiB    0.62%     2.0        7.80 GiB            4.92 KiB
YoutubeSubtitles     3.73 GiB    0.60%     2.0        7.47 GiB           22.55 KiB
PhilPapers           2.38 GiB    0.38%     2.0        4.76 GiB           73.37 KiB
NIH ExPorter         1.89 GiB    0.30%     2.0        3.79 GiB            2.11 KiB
Enron Emails         0.88 GiB    0.14%     2.0        1.76 GiB            1.78 KiB
Total                                               1254.20 GiB            5.91 KiB
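
The columns in the table appear to be related in a simple way: the effective size looks like the raw size multiplied by the number of epochs, and the weight looks like each component's share of the total effective size. The source does not spell this out, so treat it as an assumption that happens to match the published figures to within rounding. Here is a minimal Python sketch that recomputes both columns for a few rows. The function and variable names are made up for illustration, and the numbers are copied from the table.

```python
# Raw size (GiB) and epochs for a few components, copied from the table.
COMPONENTS = {
    "Pile-CC": (227.12, 1.0),
    "PubMed Central": (90.27, 2.0),
    "Books3": (100.96, 1.5),
    "Wikipedia (en)": (6.38, 3.0),
}

# Total effective size from the last row of the table.
TOTAL_EFFECTIVE_GIB = 1254.20


def effective_size(raw_gib, epochs):
    """Assumed relationship: effective size = raw size * epochs."""
    return raw_gib * epochs


def weight_percent(effective_gib, total_gib=TOTAL_EFFECTIVE_GIB):
    """Assumed relationship: weight = share of the total effective size."""
    return 100.0 * effective_gib / total_gib


def summarize(components=COMPONENTS):
    """Return {name: (effective size in GiB, weight in percent)}."""
    rows = {}
    for name, (raw_gib, epochs) in components.items():
        eff = effective_size(raw_gib, epochs)
        rows[name] = (round(eff, 2), round(weight_percent(eff), 2))
    return rows


if __name__ == "__main__":
    for name, (eff, weight) in summarize().items():
        print(f"{name:<16} {eff:>8.2f} GiB  {weight:>5.2f}%")
```

Because the table values are rounded to two decimals, the recomputed numbers can differ from the published ones in the last digit.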

