Support fetching from kaggle dataset #164

Randl · 2021-05-18T09:07:59Z

This pull request adds support for fetching metadata and PDFs from the Kaggle arXiv dataset, which avoids overloading the arXiv API and allows a faster bootstrap. gsutil ls gs://arxiv-dataset/arxiv/arxiv/pdf lists the PDFs available for download (>2 TB of data in total).

Additional changes handle subfolders in the PDF folder and speed up processing when the db contains a large number of entries.
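
As an illustration of the subfolder handling described above, here is a minimal sketch of collecting PDFs from a nested directory tree. It is not the code from this pull request; the function name find_pdfs and the directory layout are assumptions made for the example.

```python
from pathlib import Path


def find_pdfs(pdf_root):
    """Return every .pdf file under pdf_root, including nested subfolders.

    Hypothetical helper: the actual layout used by the PR may differ.
    """
    root = Path(pdf_root)
    # rglob descends into all subdirectories, so PDFs stored in
    # per-month folders are found as well as PDFs stored at the top level.
    return sorted(p for p in root.rglob("*.pdf") if p.is_file())
```

Calling find_pdfs on the local PDF folder would return the paths in a stable order, which keeps later processing steps reproducible between runs.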

Devetec · 2021-07-05T03:45:31Z

@Randl

Question: Does this support newer arxiv papers that are not in the Kaggle dataset?

Randl · 2021-07-05T04:02:38Z

The standard fetching is still available; you will only need to download the few hundred newest papers from the API.

lance10t · 2021-07-05T07:53:33Z

This is interesting, I didn't see this earlier and developed something on my own too. But what you have done is definitely a lot neater.

My only immediate concern while developing it was that the fields were slightly different and I needed some transformations. As much as possible, I fitted it into the db.p dictionary structure to minimise any downstream bugs.

I'm looking forward to the review and to seeing whether this can become part of the main branch. The rate limits and blocks on arXiv are quite painful.
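
The kind of field transformation mentioned above could look roughly like the sketch below. It is not code from either implementation. The input keys (id, title, abstract, categories, versions, authors_parsed) and the output keys (_rawid, _version, summary, tags and so on) are assumptions about the Kaggle metadata records and the db.p entries, and the function names are hypothetical.

```python
import json


def kaggle_record_to_entry(record):
    """Map one Kaggle-style metadata record to a db-style entry (assumed schema)."""
    versions = record.get("versions") or [{}]
    latest = versions[-1]
    # Version strings are assumed to look like "v1", "v2", ...
    version_num = int(latest.get("version", "v1").lstrip("v"))
    # authors_parsed is assumed to hold [last, first, suffix] triples.
    authors = [
        {"name": " ".join(part for part in (first, last) if part)}
        for last, first, *_ in record.get("authors_parsed", [])
    ]
    return {
        "_rawid": record["id"],
        "_version": version_num,
        # Collapse line breaks and repeated spaces inside the title.
        "title": " ".join(record.get("title", "").split()),
        "summary": record.get("abstract", "").strip(),
        "authors": authors,
        "tags": [{"term": c} for c in record.get("categories", "").split()],
        "published": versions[0].get("created", ""),
        "updated": latest.get("created", ""),
    }


def load_snapshot(lines):
    """Build a dict keyed by raw paper id from JSON-lines input."""
    db = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = kaggle_record_to_entry(json.loads(line))
        db[entry["_rawid"]] = entry
    return db
```

Because load_snapshot accepts any iterable of lines, an open file handle can be passed directly, so a large metadata file is processed one record at a time instead of being read into memory first.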

Randl · 2021-07-05T08:00:06Z

It seems that @karpathy is rewriting it (from scratch?): https://www.reddit.com/r/MachineLearning/comments/obne9p/d_is_arxivsanity_down_what_people_use_these_days/h3q422o
Hopefully this will be one of the features of the new version.

lance10t · 2021-07-05T08:01:39Z

This is a great initiative. One of the best projects I've come across that solves a practical need.

Support fetching from kaggle dataset

2130eb5
