Show HN: ArXiv-txt, LLM-friendly ArXiv papers

Just change arxiv.org to arxiv-txt.org in the URL to get the paper info in Markdown.

Example:

Original URL: https://arxiv.org/abs/1706.03762

Change to: https://arxiv-txt.org/abs/1706.03762

To fetch the raw text directly, use https://arxiv-txt.org/raw/abs/1706.03762. This will be particularly useful for APIs and agents.
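A minimal Python sketch of the URL swap and raw fetch described above. The function names `to_arxiv_txt_url` and `fetch_paper_text` are hypothetical helpers, not part of the service, and the UTF-8 decoding and 10-second timeout are assumptions; only the `arxiv-txt.org` host and the `/raw` path prefix come from the post.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def to_arxiv_txt_url(arxiv_url: str, raw: bool = False) -> str:
    """Rewrite an arxiv.org URL to its arxiv-txt.org equivalent.

    With raw=True, the path is prefixed with /raw, as in the post's example.
    """
    parts = urlsplit(arxiv_url)
    if parts.netloc not in ("arxiv.org", "www.arxiv.org"):
        raise ValueError(f"not an arxiv.org URL: {arxiv_url}")
    path = "/raw" + parts.path if raw else parts.path
    return urlunsplit(("https", "arxiv-txt.org", path, parts.query, parts.fragment))


def fetch_paper_text(arxiv_url: str, timeout: float = 10.0) -> str:
    """Download the raw text for a paper (assumes a UTF-8 response body)."""
    with urlopen(to_arxiv_txt_url(arxiv_url, raw=True), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Example usage (performs a network request):
# print(fetch_paper_text("https://arxiv.org/abs/1706.03762"))
```

Keeping the URL rewrite separate from the network call means an agent can build the link without fetching anything, and the rewrite can be checked offline.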

jerpint
4 months ago
arxiv-txt.org

westurner
·
4 months ago

If you train an LLM on only formally verified code, it should not be expected to generate formally verified code.

Similarly, if you train an LLM on only published ScholarlyArticles ['s abstracts], it should not be expected to generate publishable or true text.

Traceability for Retraction would be necessary to prevent lossy feedback.

owalerys
·
4 months ago

Really clean API design, I'm a fan!

lgas
·
4 months ago

It just extracts the abstracts?

jerpint
·
4 months ago

For now, yes: abstracts and other metadata.

rrekaf
·
4 months ago

Do you plan on adding descriptions of figures and tables?

jerpint
·
4 months ago

I will probably focus on getting the text out of the papers first. Figures might be a good next step after that.

sbpost
·
4 months ago

The example you give doesn't seem to work - the raw txt does not have authors.

jerpint
·
4 months ago

You're right, I hadn't noticed! I fixed it now, thanks for pointing it out.

jmartin2683
·
4 months ago

This would be awesome wrapped in an MCP server/tool call :)

jerpint
·
4 months ago

Whoa, I haven't played with MCP yet. Might be a good first project!

cchance
·
4 months ago

Was super excited that it was going to be the actual papers. Kinda cool, but just abstracts doesn't go very far. Good luck getting the full papers working; that's gonna be pretty cool once it works, and then you can feed it all into a vector DB XD