Matthew

AI sucks at understanding documents - How I fixed it (Student Dev)

I've been wanting to build a real public project for a while. As a student, I found that sending PDF or DOCX files to an AI often leads to poor context and hallucinations, not to mention the file upload limits and token costs. So I decided to build an API to fix that problem.

What it does

My project is called ParseFlow. It takes documents (PDF, DOCX, TXT) and converts them into readable JSON chunks that can be used for search indexing, chatbot context, and LLM pipelines.

By converting documents into these organized chunks, you are able to:

  1. Improve context and reduce hallucinations
  2. Reduce token usage compared to uploading PDF/DOCX
  3. Pick and choose what context to add
  4. Keep documents private - nothing is stored

How I built it

I wanted to use this project to learn how to build APIs and use the FastAPI framework. So I started reading the documentation, watching videos, and leaning heavily on Stack Overflow. After a lot of trial and error I was able to pull text and metadata from documents and save them in a JSON file.

From there I started building a simple system for chunking the text pulled from documents and organizing it by common themes and titles. This was definitely the most difficult part of the project. I needed to learn about semantic chunking and how to work with metadata. I had a lot of issues with edge cases, so I added an overlap parameter that lets chunks overlap by a custom number of characters.
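To make the overlap idea concrete, here's a minimal sketch of character-based chunking with an overlap. This is not ParseFlow's actual code: the function name `chunk_text` and the parameters `chunk_size` and `overlap` are made up for illustration, and it only does plain fixed-size splitting, not the theme and title grouping described above.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[dict]:
    """Split text into fixed-size character chunks that share `overlap` characters.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window moves forward each time
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append({"index": index, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With `chunk_size=4` and `overlap=1`, each chunk starts on the last character of the previous one, so a sentence that gets cut at a chunk boundary still shows up whole in at least one place if the overlap is big enough.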

Eventually I posted this project on RapidAPI and used Docusaurus to build a simple website with documentation on how to use the API.

I wanted to use only free tools, which is where the GitHub Student Pack came in super clutch to help me host and get a free domain. I think some devs might be curious about the tools I used to build this project, so here's an in-depth list of everything I used:

  • FastAPI as the API framework
  • Uvicorn for running the FastAPI app
  • pytest for tests (duh)
  • pdfplumber for reading PDFs
  • python-docx for reading DOCX files
  • ZipFile for building the output
  • Docusaurus for building the website
  • GitHub Pages for hosting the website
  • Render for the API backend (free tier, so it's slow until I upgrade)
  • .TECH domains for my domain
  • Cloudflare email routing and SSL certificate

I don't have any storage or databases, so every document is processed, sent back, and never saved.
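One way a ZIP output and a no-storage setup can fit together is to build the archive entirely in memory. The sketch below shows that general technique with Python's standard `zipfile` and `io` modules. It's an assumption about how this could be done, not ParseFlow's real implementation, and the function name `build_zip_in_memory` and the file names inside the archive are hypothetical.

```python
import io
import json
import zipfile


def build_zip_in_memory(chunks: list[dict], metadata: dict) -> bytes:
    """Pack chunks and metadata into a ZIP archive without writing to disk.

    Hypothetical helper; the archive member names are illustrative only.
    """
    buffer = io.BytesIO()  # lives in RAM and disappears when the request ends
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("chunks.json", json.dumps(chunks, indent=2))
        archive.writestr("metadata.json", json.dumps(metadata, indent=2))
    return buffer.getvalue()  # raw bytes that can be sent back in the response
```

Because the buffer is just a Python object, nothing is left behind on the server once the bytes have been returned.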

One thing I want to shout out is that I used a lot of open-source software and wouldn't have been able to build any of this without all the awesome developers maintaining the different libraries and tutorials on GitHub and freeCodeCamp. So if any of those devs are reading this, I want to thank you for making it possible to turn my idea into a real project.

Code example

Request

curl -X POST https://docflow.p.rapidapi.com/process \
  -H "x-rapidapi-key: YOUR_RAPIDAPI_KEY" \
  -H "x-rapidapi-host: docflow.p.rapidapi.com" \
  -F "file=@sample.pdf" \
  -F "mode=semantic-lite"

Response

{
  "chunks": [
    {"index": 0, "text": "Abstract..."},
    {"index": 1, "text": "Introduction..."}
  ],
  "metadata": {"length": 8421, "chunks": 12}
}
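Once you have a response like the one above, picking and choosing context is mostly a matter of filtering the `chunks` list. Here's a rough sketch that keeps only chunks mentioning a keyword and stops once a character budget is used up. The `build_context` function, its `keyword` and `max_chars` parameters, and the keyword-matching approach are all assumptions for illustration and are not part of the API.

```python
import json

# The sample response from above, pasted in as a string.
response_body = """
{
  "chunks": [
    {"index": 0, "text": "Abstract..."},
    {"index": 1, "text": "Introduction..."}
  ],
  "metadata": {"length": 8421, "chunks": 12}
}
"""


def build_context(response: dict, keyword: str, max_chars: int = 2000) -> str:
    """Join chunks that mention `keyword`, keeping their combined text within `max_chars`.

    Hypothetical helper; the blank-line separators are not counted in the budget.
    """
    selected = []
    used = 0
    for chunk in response["chunks"]:
        text = chunk["text"]
        if keyword.lower() not in text.lower():
            continue  # skip chunks that don't mention the keyword
        if used + len(text) > max_chars:
            break  # adding this chunk would go over the budget
        selected.append(text)
        used += len(text)
    return "\n\n".join(selected)


response = json.loads(response_body)
context = build_context(response, keyword="introduction")
```

The resulting `context` string can then be pasted into a prompt instead of the whole document. A real pipeline would probably swap the keyword match for embeddings or a search index.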

What's next

So that brings me to today. I'm pretty proud of the product I built, but as I'm still learning there are a lot of things to improve and polish. In the future I also want to add a 'smarter' chunking mode that can understand the context of a document to produce better-organized chunks. If you work on LLM or RAG pipelines, I would love to hear how this project could be useful and where it falls short.

As a student developer, my resources and personal knowledge are pretty limited, so any feedback, tips, or feature ideas are very much appreciated and can really help me build this into a real public project instead of just a hobby.

If you want to contact me, the best email is hello@parseflow.tech

If you want to try out the project or just see more about it, the Docusaurus site can be found here: https://docs.parseflow.tech/

Again, thank you for any suggestions or help. I'm just trying to build cool projects that solve problems I have and hopefully help at least one other person out there.
