Introduction
State-of-the-art data extraction for LLMs
What does ReadFile do?
We extract data from documents so that large language models can consume it.
Why?
We set up ReadFile because we believe it really shouldn’t be so frustrating to get text out of a file.
We also believe that, in the future, language models are going to be reading a lot more documents than humans will. Language models don’t really care for pretty file formats like PDF and DOCX; they just want JSON. We’re going to give the machines what they want!
With ReadFile, you can just post files to an endpoint and get nice things back: your document, formatted as HTML, Markdown, plain text, and semantic chunks.
Because the interface to ReadFile is plain HTTP, you can use it anywhere: in any programming language, in any runtime, on the server, in the browser, wherever.
Oh, and it’s almost entirely written in Go and Rust, so it’s really lightweight and fast.
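To make the "post a file to an endpoint" idea concrete, here is a minimal sketch in Python using only the standard library. The URL, the `file` form field name, the Bearer-token header, and the JSON response are all assumptions made for illustration; check the API reference for the real endpoint and authentication scheme before relying on any of them.

```python
import json
import uuid
import urllib.request

# Hypothetical endpoint: a placeholder, not ReadFile's documented URL.
API_URL = "https://api.example.com/v1/process"


def build_upload_request(filename: str, content: bytes, api_key: str,
                         url: str = API_URL) -> urllib.request.Request:
    """Build a multipart/form-data POST carrying one file (nothing is sent yet)."""
    boundary = uuid.uuid4().hex
    body = b"\r\n".join([
        b"--" + boundary.encode(),
        # The form field name "file" is an assumption.
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        content,
        b"--" + boundary.encode() + b"--",
        b"",
    ])
    headers = {
        "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the response, assuming it is a JSON object."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a real endpoint and key in place, usage would look something like `send(build_upload_request("report.pdf", open("report.pdf", "rb").read(), api_key))`. Building the request separately from sending it keeps the example easy to inspect without touching the network, and the same two steps translate directly to any other language with an HTTP client.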
Setting up
You can start using ReadFile in production in under 60 seconds. Seriously. Time yourself; it will be fun.
Get an API Key
Get an API key and start using it right now.
Try it out
Use your API key to try ReadFile in our docs playground.
How do you compare to:
Unstructured
- The “high-res” strategy (the only one that’s worth using) is very slow.
- The serverless service throws errors stochastically (the same file will fail when you first try it, and somehow succeed when you try it again). It didn’t seem like a service we could rely on.
Google Document AI
- It’s very hard to “just use” any Google service. You need to set up a GCP account. Then set up a project. Then set up a “role”. Then grant the “role” the necessary (cryptically named) “permissions”. Then download a credentials.json file. Then authenticate with the gcloud CLI. Oh, you don’t have the gcloud CLI? You’d better install that too. OK, so now you finally have it set up and ready to use? Well, in order to use it you also need to set up GCS (Google’s S3) to store the output of the operations. Yeah, you’re going to have to pay for that too…
- At the time of writing, you can only process 50 Word documents concurrently.
- When using Google services, it helps if you derive some masochistic satisfaction from getting no support when things go wrong.
Reducto AI
- Holy good sweet baby Jesus Mary and Joseph have you seen those prices?
Amazon Textract
- Decent choice, and definitely better performance than Google Document AI, but it’s the exact same hassle to set up and get working reliably.