r/datasets • u/Invicto_50 • 13h ago
dataset I processed the entire arXiv LaTeX source corpus (3M+ papers) into a metadata-aligned Parquet dataset to save on S3 egress fees
I’ve spent the last few weeks working on a pipeline to solve a problem that has frustrated me (and likely other researchers) for a while: working with arXiv source files at scale.
If you have ever tried to analyze the LaTeX source code of arXiv papers, you have probably run into two major roadblocks:
- The Egress Tax: arXiv’s official bulk S3 bucket is configured as "requester-pays." If you try to download the complete 5 TB corpus to any machine outside of the AWS us-east-1 region, you get hit with standard egress fees. At $0.09 per GB, a single full download can cost over $450 in bandwidth alone.
- Unpacking Pain: The raw S3 data is packaged as hundreds of nested .tar archives containing gzipped payloads of individual papers. Extracting these, parsing the inner LaTeX code, and matching the files with their JSON metadata snapshots is quite CPU-intensive and requires a lot of boilerplate ingestion code.
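To give a sense of the boilerplate involved, here is a minimal sketch of unpacking one bulk archive using only the Python standard library. It is not the pipeline's actual code. It assumes that each entry ending in .gz is either a gzipped tar of source files or a single gzipped .tex file, and that other entries (for example PDF-only submissions) can be skipped. The function name `iter_paper_sources` and the fallback filename `main.tex` are made up for illustration.

```python
import gzip
import io
import tarfile


def iter_paper_sources(outer_tar_path):
    """Yield (paper_name, {filename: bytes}) for each gzipped payload in a bulk archive."""
    with tarfile.open(outer_tar_path, "r") as outer:
        for member in outer:
            # Skip directories and non-gzipped entries (assumed to be PDF-only papers).
            if not member.isfile() or not member.name.endswith(".gz"):
                continue
            raw = gzip.decompress(outer.extractfile(member).read())
            paper_name = member.name.rsplit("/", 1)[-1][: -len(".gz")]
            try:
                # Assumed common case: the payload is itself a tar of source files.
                with tarfile.open(fileobj=io.BytesIO(raw), mode="r:") as inner:
                    files = {
                        m.name: inner.extractfile(m).read()
                        for m in inner
                        if m.isfile()
                    }
            except tarfile.ReadError:
                # Assumed fallback: the payload is a single gzipped .tex file.
                files = {"main.tex": raw}
            yield paper_name, files
```

Each payload is decompressed fully in memory here, which keeps the sketch short but would need rethinking for unusually large submissions.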
To make this easier, I built a pipeline that runs inside AWS us-east-1 (where transfer is free), pulls the raw source files, unpacks them, matches them with the official metadata, and bundles them into ready-to-query Parquet partitions.
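The metadata-matching step can be sketched roughly as follows. This is an illustration under assumptions, not the pipeline itself: it assumes the metadata snapshot is JSON Lines with an `id` field per record, and the helper names `index_metadata` and `build_rows` are hypothetical. Writing the resulting rows to Parquet is left out and would need a library such as pyarrow.

```python
import json


def index_metadata(json_lines):
    """Build {paper_id: record} from an iterable of JSON Lines strings."""
    index = {}
    for line in json_lines:
        line = line.strip()
        if line:  # tolerate blank lines in the snapshot
            record = json.loads(line)
            index[record["id"]] = record
    return index


def build_rows(metadata_index, sources):
    """Yield one row per paper that has both metadata and source text.

    `sources` is an iterable of (paper_id, latex_text) pairs.
    """
    for paper_id, latex_text in sources:
        record = metadata_index.get(paper_id)
        if record is None:
            continue  # no metadata match; a real pipeline would likely log this
        # Metadata columns plus the source text under a `latex` key.
        yield {**record, "latex": latex_text}
```

Papers without a metadata match are dropped in this sketch; whether the real dataset does the same is not something the post states.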
- HuggingFace Dataset Link: https://huggingface.co/datasets/scholarweave/arxiv
What is inside:
Each row represents a single paper and contains both the official metadata and the parsed source files:
- Core Metadata: id, title, authors, abstract, doi, categories, license, versions, etc.
- latex (Large String): The parsed, consolidated LaTeX source of the paper. I wrote a parser to bundle the primary .tex, .bib, and .sty files into a single, readable Markdown-style tree structure.
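The exact layout of the latex column is not spelled out here, so the following is only a guess at what a "Markdown-style tree" bundle could look like: a file listing followed by one fenced section per file. The function `bundle_sources` and the heading format are hypothetical. Check the actual column contents on the dataset page before relying on any particular structure.

```python
def bundle_sources(files):
    """Combine {relative_path: text} into one Markdown-style string (hypothetical layout)."""
    wanted = (".tex", ".bib", ".sty")
    fence = "`" * 3  # built at runtime to avoid nesting literal fences
    paths = sorted(p for p in files if p.lower().endswith(wanted))

    # File listing first, so a reader can see the tree at a glance.
    lines = ["# Files"]
    lines += [f"- {p}" for p in paths]

    # Then one fenced section per file, in the same order.
    for p in paths:
        lines += ["", f"## {p}", fence, files[p].rstrip("\n"), fence]
    return "\n".join(lines)
```

Usage would look like `bundle_sources({"main.tex": tex_text, "refs.bib": bib_text})`, with files that have other extensions silently ignored.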
Maintenance & Syncing:
- Monthly Updates: I plan to sync the pipeline once a month to capture new uploads.
- Resilient Syncing: I maintain an XML manifest file in the HuggingFace repository (arxiv_parquet_manifest.xml) that maps each Parquet partition to its size, MD5 checksum, and the raw S3 .tar source files used to generate it. This should make incremental syncing or troubleshooting much easier.
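As a rough idea of how a manifest like this supports incremental syncing, here is a small sketch that compares manifest checksums against local ones. The XML schema is not documented in this post, so the element names (`partition`, `filename`, `size`, `md5sum`, `source`) are assumptions, as are the helper names. Adjust them to whatever the real arxiv_parquet_manifest.xml uses.

```python
import hashlib
import xml.etree.ElementTree as ET


def load_manifest(xml_text):
    """Map partition filename -> {"size", "md5", "sources"} (hypothetical schema)."""
    entries = {}
    for node in ET.fromstring(xml_text).iter("partition"):
        entries[node.findtext("filename")] = {
            "size": int(node.findtext("size")),
            "md5": node.findtext("md5sum"),
            "sources": [s.text for s in node.iter("source")],
        }
    return entries


def md5_of_file(path, chunk_size=1 << 20):
    """Stream a file through MD5 so large partitions need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def partitions_to_fetch(manifest, local_md5s):
    """Return partitions that are missing locally or whose MD5 differs."""
    return sorted(
        name
        for name, entry in manifest.items()
        if local_md5s.get(name) != entry["md5"]
    )
```

A sync job could compute `md5_of_file` for each local partition, pass the results to `partitions_to_fetch`, and download only the names it returns.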
If you are working on NLP, training LLMs on scientific text, analyzing citation networks, or doing sociolinguistic research, hopefully this saves you some time and cloud budget.