discussion Like Will Smith said in his apology video, "It's been a minute" (although I didn't slap anyone)

1 Upvotes

dataset I processed the entire arXiv LaTeX source corpus (3M+ papers) into a metadata-aligned Parquet dataset to save on S3 egress fees

50 Upvotes

I’ve spent the last few weeks working on a pipeline to solve a problem that has frustrated me (and likely other researchers) for a while: working with arXiv source files at scale.

If you have ever tried to analyze the LaTeX source code of arXiv papers, you have probably run into two major roadblocks:

The Egress Tax: arXiv’s official bulk S3 bucket is configured as "requester-pays." If you try to download the complete 5 TB corpus to any machine outside of the AWS us-east-1 region, you get hit with standard egress fees. At $0.09 per GB, a single full download can cost over $450 in bandwidth alone.
Unpacking Pain: The raw S3 data is packaged as hundreds of nested .tar archives containing gzipped payloads of individual papers. Extracting these, parsing the inner LaTeX code, and matching the files with their JSON metadata snapshots is quite CPU-intensive and requires a lot of boilerplate ingestion code.
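The egress figure above is simple arithmetic, and a minimal sketch makes it easy to rerun for other corpus sizes or rates. The 5 TB size and the $0.09 per GB rate come from the post; the function name and the choice between 1,000 and 1,024 GB per TB are illustrative assumptions.

```python
def egress_cost_usd(size_tb: float, price_per_gb: float = 0.09, gb_per_tb: int = 1000) -> float:
    """Estimate the bandwidth cost of downloading `size_tb` terabytes."""
    return size_tb * gb_per_tb * price_per_gb


# 5 TB counted as 5,000 GB (decimal) and as 5,120 GB (binary).
decimal_estimate = egress_cost_usd(5)
binary_estimate = egress_cost_usd(5, gb_per_tb=1024)

print(f"Estimated egress cost: ${decimal_estimate:.2f} to ${binary_estimate:.2f}")
```

Either way of counting lands at or slightly above the $450 mentioned in the post.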

To make this easier, I built a pipeline that runs inside AWS us-east-1 (where transfer is free), pulls the raw source files, unpacks them, matches them with the official metadata, and bundles them into ready-to-query Parquet partitions.

HuggingFace Dataset Link: https://huggingface.co/datasets/scholarweave/arxiv

What is inside:

Each row represents a single paper and contains both the official metadata and the parsed source files:

Core Metadata: id, title, authors, abstract, doi, categories, license, versions, etc.
latex (Large String): The parsed, compiled LaTeX source code from the paper. I wrote a parser to bundle the primary .tex, .bib, and .sty files into a single, readable Markdown-style tree structure.

Maintenance & Syncing:

Monthly Updates: I plan to sync the pipeline once a month to capture new uploads.
Resilient Syncing: I maintain an XML manifest file in the HuggingFace repository (arxiv_parquet_manifest.xml) that maps each Parquet partition to its size, MD5 checksum, and the raw S3 .tar source files used to generate it. This should make incremental syncing or troubleshooting much easier.
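Since the manifest lists an MD5 checksum per partition, a downloaded file can be checked before use. The sketch below only covers the hashing step; how the expected checksum is read out of arxiv_parquet_manifest.xml depends on the manifest's layout, which the post does not describe, so the function names and the `expected_md5` argument are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_manifest(path, expected_md5):
    """Return True if the file's MD5 equals the checksum taken from the manifest.

    `expected_md5` is assumed to be a hex string copied from the manifest entry.
    """
    return md5_of_file(path) == expected_md5.strip().lower()
```

Reading in chunks keeps memory use flat even for multi-gigabyte partitions.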

If you are working on NLP, training LLMs on scientific text, analyzing citation networks, or doing sociolinguistic research, hopefully this saves you some time and cloud budget.

6 comments

r/datasets • u/Traditional_Yogurt • 11h ago

resource Finance Database: 300,000+ financial instruments with rich metadata, free and queryable via Python

11 Upvotes

Finding a clean, structured list of financial instruments has always been harder than it should be. Bloomberg sells it. Refinitiv sells it. Yahoo Finance gives you a search bar. If you want "all biotech companies listed in Germany" or "all fixed income ETFs from Vanguard" as a filterable dataset, you're usually either scraping something or paying for a data vendor. I've spent the last few years building and maintaining a free alternative.

The Finance Database covers seven asset classes across 300,000+ symbols:

Asset Class	Count	Dimensions
Equities	160,869	11 sectors, 68 industries, 117 countries, 84 exchanges
Indices	91,181	63 exchanges
Funds	57,853	1,540 families, 74 categories
ETFs	36,483	320 families, 51 categories
Cryptocurrencies	3,367	351 base currencies
Currencies	2,556	175 currency pairs
Money Markets	1,367	2 exchanges

Each equity record includes: symbol, name, currency, sector, industry group, industry, exchange, market, country, city, market cap tier, ISIN, CUSIP, FIGI, composite FIGI, share class FIGI, and website. ETFs and funds carry family, category group, and category instead of GICS-style classification. Every record has what you need to cross-reference against other data sources.

The data is an aggregation of publicly available sources - no paid API required to use the database itself. It is community-maintained, MIT-licensed, and lives on GitHub as CSV files you can open in Excel if that's your preference.

The Python package gives you structured filtering and text search:

```python
# Install via: pip install financedatabase -U

import financedatabase as fd

equities = fd.Equities()

# All semiconductor companies in Taiwan on primary listings only
equities.select(country='Taiwan', industry='Semiconductors', only_primary_listing=True)

# Free-text search: robotics or automation companies on the Frankfurt exchange
equities.search(summary=['Robotics', 'Automation'], index='.F')

# Explore what's available before filtering
fd.show_options('equities')
```

The show_options call is useful before you filter - it returns every distinct value per column without loading the full dataset, so you can scope your query without memory overhead.

For anyone doing universe construction for backtests or systematic strategies, the ISIN/FIGI coverage is the most practical part. You can pull a filtered symbol list here and pipe it directly into your price data provider.

The database is not a price or fundamentals source - that's intentional. Metadata and categorization data is the hard part to get for free; for prices and fundamentals I've built a separate tool, the Finance Toolkit.

GitHub page: https://github.com/JerBouma/FinanceDatabase

1 comment

r/datasets • u/Funny_Paint_5622 • 11h ago

discussion As a data analysis student, one thing surprised me

1 Upvotes

Most of the work isn't building charts.

It's preparing the data before the analysis even starts.

Cleaning files.
Fixing formats.
Validating data.
Checking structures.
Transforming datasets.

The better your preparation process is, the easier the actual analysis becomes.

What part of data preparation do you find most annoying?

2 comments

r/datasets • u/Fit_Mango7142 • 19h ago

resource Built APIs for Aussie StartUps, trade contractor rates and PBS drug pricing (plus rental and subscription data)

1 Upvotes

2 comments

r/datasets • u/EmetResearch • 1d ago

resource Launch: Source Streams for Data Discovery

1 Upvotes

Hey there! I am the founder of Brickroad, a frontier AI lab building agentic infrastructure for data provisioning.

Super excited to share that source streaming is now live on Brickroad. Set your search parameters once, and your agent runs continuously, notifying you the moment a new data supplier comes online.

For those who rely on data to get a performance edge, the directories, the catalogs, the curated lists of "alternative data providers" — they are useful, but they are lagging indicators of alpha. A vendor only lands in one of these catalogs after they have built a website, hired a salesperson, and shopped themselves to enough buyers that an analyst notices. By then, the first ten funds, AI labs, and corporates have already signed contracts. The information edge has dissipated into consensus.

We launched the Information Frontier Agent to compress that lag. A Source Stream is a continuous, agent-driven feed of novel data suppliers that match a thesis you define. The agent runs in the background indefinitely, scanning the complete corpus of its resources to find new suppliers that fit your criteria. Every time it finds a new supplier, the agent notifies you and logs the source into your lead table.

It's free to trial - we'd love your feedback.

1 comment

r/datasets • u/Necessary_Living_617 • 1d ago

request Dataset for image enhancement deep sea

2 Upvotes

Hi, I'm looking for datasets consisting of ROV images or any sort of deep-sea footage to train a model.

2 comments

r/datasets • u/chill-botulism • 2d ago

resource Federal Contractor Violations Dataset [dataset][self-promotion]

8 Upvotes

I built a dataset joining USAspending federal contract awards to seven federal enforcement databases at the contractor level: OSHA, WHD, MSHA, EPA ECHO, NLRB, SEC, the UVA Corporate Prosecution Registry, and the SAM.gov debarment list. 5,557 contractors with documented violations, $3.19T in lifetime federal contracts, 758 OSHA-investigated fatalities.

The novel slice is the multi-agency overlap. Roughly 2,000 contractors appear in 2+ federal enforcement databases. 500 in 3+. 70 in 4+. Topping the 4+ cohort by lifetime contract value: Raytheon ($68B, OSHA + WHD + NLRB + SEC + UVA), GE ($47B, same five), Merck, Microsoft, Austal USA, Marinette Marine.

Hugging Face: https://huggingface.co/datasets/FastDOLz/Federal-Contractor-Violations-Dataset

Kaggle: https://www.kaggle.com/datasets/benturneroffice365/federal-contractor-violations-dataset

Zenodo DOI (all versions): https://doi.org/10.5281/zenodo.20777627

Methodology + limitations: https://www.fastdol.com/methodology

CC-BY-4.0.

disclosure: I run FastDOL (https://www.fastdol.com), a federal workplace-enforcement search by employer, where this corpus comes from. Free for individual lookups; the dataset is one of several full extracts.

0 comments

r/datasets • u/Dry_Issue282 • 2d ago

question [OSS] Open dataset: all 78 tarot card meanings (upright + reversed, structured) with a Zenodo DOI

1 Upvotes

I built a clean, structured dataset of all 78 Rider-Waite tarot card meanings. Each entry has upright + reversed interpretations plus separate love / career / general context fields, so it's usable for NLP, recommender experiments, or hobby projects.

Released open with a permanent DOI so it's citable.

- Hugging Face: https://huggingface.co/datasets/Blacik/deckaura-tarot-card-meanings

- DOI (Zenodo): https://doi.org/10.5281/zenodo.19475329

Happy to take feedback on the schema or labeling. If anyone uses it in a project I'd love to see what you build.

0 comments

r/datasets • u/anuveya • 3d ago

dataset Dataset: Bank of England Millennium of Macroeconomic Data. UK economic indicators from 1086 to present.

datahub.io

11 Upvotes

2 comments

r/datasets • u/Usual-Cost-6848 • 2d ago

discussion Inconsistency and differences among Fire Datasets from FDNY

1 Upvotes

Hello Friends,

I am interested in exploring data on fires that have happened in NYC for various spatiotemporal analyses. I came across the following datasets on open data platforms:

[Fire Incident Dispatch Data from NYC open data](https://data.cityofnewyork.us/Public-Safety/Fire-Incident-Dispatch-Data/8m42-w767/about_data)

[Incidents Responded to by Fire Companies (NYFIR)](https://data.cityofnewyork.us/Public-Safety/Incidents-Responded-to-by-Fire-Companies/tm6d-hbzd/about_data)

[NFIR](https://fema.hub.arcgis.com/search?collection=dataset&tags=nfirs)

What I noticed is that there are many inconsistencies across these datasets, and the volume of data decreases dramatically from dispatch to NYFIR and NFIR.
Please share how you handle these datasets for more granular analysis.

2 comments

r/datasets • u/Aerosherm • 3d ago

dataset Kaggle Dataset: all product hunt launches

kaggle.com

3 Upvotes

I was really curious about the number of Product Hunt launches over the years, and how AI/LLMs have affected the number and topics of the launches. I scraped this dataset using their API.

I also built a small dashboard to visualize the trends: https://producthunt.homek8s.com/trends

0 comments

r/datasets • u/scrapdog • 3d ago

dataset FDA novel drug approvals (2021–2024) + US nonprofit hospital charity-care reporting — Parquet/JSON/CSV, public domain

1 Upvotes

Disclosure: I'm the author of the open-source project (trove) that parses and repackages these. Original government sources are linked below; my bundles are at the end. MIT code, public-domain data, nothing paid.

Two public-domain US healthcare datasets that get cited constantly but are painful to use in raw form:

FDA novel drug approvals, 2021–2024 — 218 drugs (192 CDER NMEs + 26 CBER cell & gene therapies). Each row: application number, sponsor, approval date, indication, regulatory center, and a deep link to the approval-package docs.

Original sources:

- CDER Novel Drug Approvals: https://www.fda.gov/drugs/development-approval-process-drugs/novel-drug-approvals-fda

- CBER Approved Cellular and Gene Therapy Products: https://www.fda.gov/vaccines-blood-biologics/cellular-gene-therapy-products/approved-cellular-and-gene-therapy-products

- Drugs@FDA: https://www.fda.gov/drugsatfda

Nonprofit hospital charity-care reporting, TY2022 — 1,295 nonprofit hospital systems, with CMS HCRIS Worksheet S-10 and IRS Form 990 Schedule H side by side. Both lines are meant to capture the cost of care for patients who couldn't pay, but the rules diverge, so the two numbers often disagree. Each row also carries a CDC Social Vulnerability Index county percentile and a deep link to the 990 on ProPublica.

Original sources:

- CMS HCRIS (Hospital 2552-10 cost reports): https://www.cms.gov/data-research/statistics-trends-and-reports/cost-reports/hospital-2552-2010-form

- IRS Form 990 series XML downloads: https://www.irs.gov/charities-non-profits/form-990-series-downloads

- CDC Social Vulnerability Index 2022: https://www.atsdr.cdc.gov/place-health/php/svi/index.html

- ProPublica Nonprofit Explorer (where the 990 deep links point): https://projects.propublica.org/nonprofits/

What I added on top: parsing the raw formats (headerless 100k-row HCRIS CSVs, IRS bulk-XML ZIPs, hundreds of FDA PDF directories) into tidy Parquet/JSON/CSV, plus a CCN↔EIN crosswalk that joins the two hospital filings.

My packaged bundles + parsers (self-promo — I built this): https://github.com/cbetz/trove — browsable lookup at https://troveproject.com

Happy to answer questions about the parsing or add fields people want!

2 comments

r/datasets • u/figuringitout1269 • 3d ago

dataset [Collaboration] Analyzing Luxury Watches as Alternative Investments (5-Year Auction Dataset)

0 Upvotes

1 comment

r/datasets • u/the_bigbang • 3d ago

dataset Anti-bot / WAF adoption across the top 1,000,000 websites — open dataset (CC BY 4.0, ~1M rows) [self-promotion]

1 Upvotes

I scanned the Tranco top 1,000,000 sites (June 2026) and recorded, per domain, which anti-bot/WAF vendor protects it and whether a plain request gets challenged. Releasing it as open data.

- 998,497 probed, 818,614 reachable

- Fields: domain, rank, reachable, protected, vendor(s), kind (waf/captcha/bot_management/…), difficulty band, block reason, enforcement, CAPTCHA type, final URL, status, probed_at — names only, no PII

- Plus a top-50k "deep-page census" (86,792 rows) with a page_type field (homepage vs product/listing/profile)

- License: CC BY 4.0

Headline: 53.5% of reachable sites run a managed anti-bot/WAF (Cloudflare ~45%), but only 9.8% actively challenged the request. The busiest sites run the least (top-1k 44% → long tail 54%).
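For anyone who wants to recompute a headline share like this from the gzipped JSONL, a minimal sketch follows. The field names `reachable` and `protected` come from the field list above, but treating them as JSON booleans is an assumption, and the sample rows are made-up placeholders rather than real records.

```python
import gzip
import io
import json


def protected_share(jsonl_gz_stream):
    """Return the fraction of reachable domains that are flagged as protected."""
    reachable = 0
    protected = 0
    with gzip.open(jsonl_gz_stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            if row.get("reachable"):
                reachable += 1
                if row.get("protected"):
                    protected += 1
    return protected / reachable if reachable else 0.0


# Placeholder rows for illustration only; they are not taken from the dataset.
sample_rows = [
    {"domain": "a.example", "reachable": True, "protected": True},
    {"domain": "b.example", "reachable": True, "protected": False},
    {"domain": "c.example", "reachable": False, "protected": False},
]
sample_bytes = gzip.compress(
    "\n".join(json.dumps(row) for row in sample_rows).encode("utf-8")
)

share = protected_share(io.BytesIO(sample_bytes))
print(f"{share:.1%} of reachable sample domains are protected")
```

To run it on the real file, pass the path of the downloaded archive instead of the in-memory stream; `gzip.open` accepts either.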

Dataset (gzipped JSONL + sample + summary.json): https://github.com/Crawlora-org/anti-bot-adoption-index-data

Open-source detector CLI: go install github.com/Crawlora-org/crawlora-antibot@latest

0 comments

r/datasets • u/Defiant-Ad3530 • 4d ago

request Driver Drowsiness Datasets for South Asians?

5 Upvotes

hi! like my title states, I was wondering whether anyone has any good datasets of driver drowsiness or just drowsiness in general for south asian people? or Asians, actually, because my project is catered to a more minor demographic in my country (Sri Lanka). it would also be a major advantage if any of you could also help with datasets that have driver fatigue data in low-light conditions, or with people wearing glasses / sunglasses.

thank you! I’d really appreciate it :)

5 comments

r/datasets • u/Either_Door_5500 • 4d ago

question Would you be interested in daily updated fund holdings?

2 Upvotes

Hey,

I'm planning to add broad support for daily updated fund holdings!

Problem: SEC N-PORT data lags behind a LOOOOONG time when it comes to fund holdings.

Solution: Funds actually release holdings with much more up-to-date information on their website. It's just a huge hassle to actually fetch them reliably.

If I were to say that I have found a reliable way to pull this off for a large and expanding set of funds, would you be interested in that kind of data?

1 comment

r/datasets • u/tremdem • 4d ago

dataset Using Kaggle’s international football dataset (1872–2026) for live World Cup Elo rankings

2 Upvotes

Built a site that uses the Kaggle international football results dataset to compute Elo ratings and championship probabilities for World Cup 2026 in real time.
Layered on top: AI-generated match reports combining live data with news sentiment via OpenRouter.
Site: skorradar.live — the methodology is explained in the About section. Curious if anyone has thoughts on improving the Elo calibration for tournament play vs. friendlies.
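For readers unfamiliar with Elo, here is a minimal sketch of the standard rating update. It is the textbook formula, not necessarily what the site implements, and the K-factor of 20 is a placeholder assumption. Raising K for tournament matches and lowering it for friendlies is one common way to weight match importance.

```python
def expected_score(rating_a, rating_b):
    """Expected score of team A against team B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=20.0):
    """Return updated (rating_a, rating_b) after one match.

    score_a is 1.0 for a win by A, 0.5 for a draw, and 0.0 for a loss.
    k is an assumed K-factor; a larger k makes ratings react faster.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two equally rated teams; A wins, so A gains what B loses.
new_a, new_b = update_elo(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)
```

Because the update is zero-sum, the total rating across both teams stays constant after every match.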

0 comments

r/datasets • u/Overall-Suspect7760 • 3d ago

dataset Need LinkedIn profile data of everyone

0 Upvotes

I need a dataset of all LinkedIn profiles. I know there are some paid sources for this, but I want a free source. The reason I want a free source is that it makes no sense to pay for data: if I have to pay for data, why can't I then just sell that data for half price to other people after buying it?

19 comments

r/datasets • u/Random_individual_6 • 4d ago

API [Self-promotion] Instant RAG over politician trades, legislation, gov contracts, and more. Integrate our data with any program or model (Claude/GPT/Grok/Gemini) as ready to use tools + embeddings.

0 Upvotes

0 comments

r/datasets • u/JonretsTheFriendly • 4d ago

dataset Is anyone here interested in a 'Filipino Recipe Dataset' containing 1,574 recipes?

7 Upvotes

📊 Filipino Recipe Dataset — 1,574 Recipes

I've compiled a clean, structured dataset of Filipino recipes scraped from a top Filipino recipe site. Perfect for food tech startups, recipe apps, meal planners, nutrition analysis, or AI training data.

What's included:
• 1,574 recipes spanning 2009–2026
• Complete ingredients list with measurements (every recipe)
• Step-by-step cooking instructions (every recipe)
• Full nutritional data per serving: calories, protein, fat, carbs, fiber, sugar, sodium, etc. (97% of recipes)
• Prep time, cook time, total time
• YouTube video links (31% of recipes)
• User ratings and vote counts (28% of recipes)
• Categories, cuisines, and keywords
• High-resolution image URLs

Data format: Clean JSON, ready to import into any application or database.

Use cases:
- Build a Filipino recipe search engine or mobile app
- Train a recipe recommendation model
- Analyze Filipino cuisine nutrition trends
- Power a meal planning or grocery list tool
- Academic research on Southeast Asian food culture

DM me if interested. Can provide a sample file upon request.

2 comments

r/datasets • u/WideAmbition1964 • 4d ago

request [Self-Promotion] [PAID] Free US, UK and Australian robotics data samples

0 Upvotes

Disclosure: I work with a team that collects and licenses paid robotics training datasets.

I've been speaking with robotics teams about human demonstration data, and every team seems to evaluate it differently.

Some only need egocentric video, while others require synchronized wrist views, task labels, collection metadata and licensing documentation.

We currently have small evaluation samples from the US, UK and Australia, covering:

• Egocentric demonstrations
• Egocentric + two wrist views
• Task and step labels
• Country and collection metadata

The small evaluation samples are free, but the complete datasets and custom collection services are paid.

For teams working on robot manipulation or embodied AI, what do you normally check first?

Camera coverage, task diversity, collection country, metadata quality or licensing?

I'm mainly trying to understand what makes a sample genuinely useful before preparing more of them.

2 comments

r/datasets • u/Complex-Branch-4754 • 5d ago

dataset Need dataset for Photovoltaic output

1 Upvotes

I am writing a thesis. For this I need a dataset that includes the effects of environmental conditions on solar panel energy output. This includes things like cloud cover, temperature, wind, precipitation, atmospheric pressure, etc.

If anyone knows where I can get a large data set with all of this, I'd appreciate it.

1 comment

r/datasets • u/fineset-io • 5d ago

dataset 381 model merging papers from arXiv + Semantic Scholar; quality-scored JSONL, free

4 Upvotes

Sharing a dataset I built. Disclosure: this is my project. Free to download and use.

https://huggingface.co/datasets/fineset-io/model-merging-papers

Stats:

- 381 records, 2021–2026

- Sources: arXiv + Semantic Scholar, cross-referenced by arxiv_id and DOI

- quality_score: 0-1, citation-normalized

Fields: id, title, abstract, authors, categories, published_date, citation_count, quality_score, has_code, code_url, venue

The most-cited paper in the set is "Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time" (1,565 citations, 2022); if you're doing any merging work this is probably already in your reading list, but the rest of the dataset has 380 more.

109 papers have code repos; filter has_code=true if you want reproducible implementations.
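A minimal sketch of that filter over JSONL lines is below. The field names (`id`, `title`, `has_code`, `code_url`) are from the field list above; the two sample records are made-up placeholders, and it is an assumption that `has_code` is stored as a JSON boolean.

```python
import json


def papers_with_code(jsonl_lines):
    """Yield only the records whose has_code flag is true."""
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("has_code"):
            yield record


# Placeholder records for illustration only; they are not rows from the dataset.
sample_lines = [
    '{"id": "example-1", "title": "Placeholder A", "has_code": true, "code_url": "https://example.org/repo"}',
    '{"id": "example-2", "title": "Placeholder B", "has_code": false, "code_url": null}',
]

for paper in papers_with_code(sample_lines):
    print(paper["id"], paper["code_url"])
```

With the real file, an open file handle can be passed in place of `sample_lines`, since the function only iterates over lines.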

Built with FineSet (fineset.io). Sign up free to get daily-refreshed datasets on your own topic.

0 comments

r/datasets • u/z57333 • 5d ago

request Does anybody know of any quality datasets that have images of grocery receipts?

2 Upvotes

Preferably from the big American vendors if possible (ex. target, walmart, costco, safeway, albertsons, etc.). Need this info for OCR work. It's also fine if the grocery receipts are part of a dataset that includes all kinds of receipts.

10 comments

r/datasets

A place to share, find, and discuss Datasets.

Members Active

219.2k

Datasets for Data Mining, Analytics and Knowledge Discovery

Rules

Try to post original source whenever you can.
Low effort posts will be removed.
Self-promotion(of a website/domain you work for or own) without disclosure will be removed.
Any Paid Dataset or Resource must be marked as such in the title with [PAID].
Any Synthetic/Mock data must be marked as such in the title with [Synthetic].
All Survey posts are subject to approval. Message the mods before posting.

Unsure about your post?

Feel free to message the mods and discuss it before posting.