AI sucks at understanding documents - How I fixed it (Student Dev)

#ai #beginners #showdev

I've been wanting to build a real public project for a while. As a student, I found that sending PDFs or DOCX files to an AI often leads to poor context and hallucinations, not to mention the file upload limits and token costs. So I decided to build an API to fix that problem.

What it does

My project is called ParseFlow. It takes documents (PDF, DOCX, TXT) and converts them into readable JSON chunks that can be used for search indexing, chatbot context, and LLM pipelines.

By converting documents into these organized chunks, you are able to:

Improve context and reduce hallucinations
Reduce token usage compared to uploading PDF/DOCX
Pick and choose what context to add
Keep documents private - nothing is stored

How I built it

I wanted to use this project to learn how to build APIs and use the FastAPI framework. So I started reading the documentation, watching videos, and leaning heavily on Stack Overflow. After a lot of trial and error, I was able to pull text and metadata from documents and save them in a JSON file.
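As a rough illustration of that extraction step (not ParseFlow's actual code), here is a minimal sketch for the PDF case. The function names build_record, extract_pdf and save_record and the JSON layout are made up for this example, and it assumes pdfplumber is installed.

```python
import json


def build_record(filename, pages, metadata):
    """Bundle extracted page text and metadata into a JSON-friendly dict."""
    text = "\n".join(pages)
    return {
        "filename": filename,
        "metadata": {**metadata, "pages": len(pages), "length": len(text)},
        "text": text,
    }


def extract_pdf(path):
    """Read a PDF with pdfplumber and return a record (hypothetical helper)."""
    # Imported lazily so build_record stays usable without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        # extract_text() can return None for pages with no text layer.
        pages = [page.extract_text() or "" for page in pdf.pages]
        # Metadata values are not always JSON-serializable, so stringify them.
        metadata = {key: str(value) for key, value in (pdf.metadata or {}).items()}
    return build_record(path, pages, metadata)


def save_record(record, out_path):
    """Write the record to disk as pretty-printed JSON."""
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False, indent=2)
```

Keeping build_record separate from the PDF reader means the same record shape could be reused for DOCX and TXT input.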

From there I started building a simple system for chunking the text pulled from documents and organizing it by common themes and titles. This was definitely the most difficult part of this project. I needed to learn about semantic chunking and how to work with metadata. I had a lot of issues with edge cases, so I added an overlap parameter that lets chunks overlap by a custom number of characters.
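To make the overlap idea concrete, here is a minimal sketch of plain fixed-size character chunking with overlap. It is not the semantic chunking ParseFlow does, and the function name chunk_text and its default values are assumptions for illustration only.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character chunks that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    # Each new chunk starts (chunk_size - overlap) characters after the last,
    # so the final `overlap` characters of one chunk open the next one.
    step = chunk_size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append({"index": index, "text": text[start:start + chunk_size]})
        # Stop once a chunk reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With chunk_size=4 and overlap=1, the string "abcdefghij" becomes "abcd", "defg", "ghij". The shared character at each boundary is what keeps a sentence from being cut cleanly in half between two chunks.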

Eventually I posted this project on RapidAPI and used Docusaurus to build a simple website with documentation on how to use the API.

I wanted to use only free tools, which is where the GitHub Student Pack came in super clutch to help me host and get a free domain. I think some devs might be curious about the tools I used to build this project, so here's an in-depth list of everything I used:

FastAPI as the API framework
Uvicorn for running the FastAPI app
Pytest for tests (duh)
pdfplumber for reading PDFs
python-docx for reading DOCX files
ZipFile for building the output
Docusaurus for building the website
GitHub Pages for hosting the website
Render for the API backend (free tier, so it's slow until I upgrade)
.TECH domains for my domain
Cloudflare email routing and SSL certificate

I don't have any storage or DBs so all documents are processed and sent back and never saved.
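Here is a small sketch of what "processed and sent back and never saved" can look like in Python: the output archive is built in memory with io.BytesIO and zipfile, so nothing touches the disk. The function name build_zip_in_memory and the file names chunks.json and metadata.json are assumptions for illustration; the real output layout may differ.

```python
import io
import json
import zipfile


def build_zip_in_memory(chunks, metadata):
    """Pack chunks and metadata into a ZIP archive held entirely in RAM."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        # writestr() adds a file from a string, with no temp file on disk.
        archive.writestr("chunks.json", json.dumps(chunks, indent=2))
        archive.writestr("metadata.json", json.dumps(metadata, indent=2))
    # The raw bytes can be returned directly as the HTTP response body.
    return buffer.getvalue()
```

Once the response is sent and the buffer goes out of scope, the document data is gone, which is the whole point of having no storage layer.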

One thing I want to shout out is that I used a lot of open-source software and wouldn't have been able to build any of this without all the awesome developers maintaining the different libraries and tutorials on GitHub and freeCodeCamp. So if any of those devs are reading this, I want to thank you for making it possible to turn my idea into a real project.

Code example

Request

curl -X POST https://docflow.p.rapidapi.com/process \
  -H "x-rapidapi-key: YOUR_RAPIDAPI_KEY" \
  -H "x-rapidapi-host: docflow.p.rapidapi.com" \
  -F "file=@sample.pdf" \
  -F "mode=semantic-lite"

Response

{
  "chunks": [
    {"index": 0, "text": "Abstract..."},
    {"index": 1, "text": "Introduction..."}
  ],
  "metadata": {"length": 8421, "chunks": 12}
}
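Once you have a response like this, picking which chunks to pass to an LLM is just a bit of Python. The sketch below is one simple way to do it with keyword matching; select_context, the keywords, and the max_chars budget are all hypothetical and not part of the API.

```python
import json

# The sample response from above, as it would arrive over HTTP.
response_body = """
{
  "chunks": [
    {"index": 0, "text": "Abstract..."},
    {"index": 1, "text": "Introduction..."}
  ],
  "metadata": {"length": 8421, "chunks": 12}
}
"""


def select_context(payload, keywords, max_chars=2000):
    """Return chunk texts mentioning any keyword, capped at max_chars in total."""
    selected, used = [], 0
    for chunk in payload["chunks"]:
        text = chunk["text"]
        # Case-insensitive keyword match; skip chunks that mention none.
        if not any(word.lower() in text.lower() for word in keywords):
            continue
        # Stop before the character budget would be exceeded.
        if used + len(text) > max_chars:
            break
        selected.append(text)
        used += len(text)
    return selected


payload = json.loads(response_body)
context = select_context(payload, keywords=["introduction"])
prompt = "Answer using only this context:\n\n" + "\n\n".join(context)
```

Only the matching chunks end up in the prompt, which is where the token savings come from compared to sending the whole file.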

What's next

So that brings me to today. I'm pretty proud of the product I built, but as I'm still learning, there are a lot of things to improve and polish. In the future I also want to add a smarter chunking mode that can understand the context of documents to make better-organized chunks. If you work on LLM or RAG pipelines, I would love to hear how this project could be useful and where it falls short.

As a student developer, my resources and personal knowledge are pretty limited, so any feedback, tips, or feature ideas are very appreciated and can really help me build this into a real public project instead of just a hobby.

If you want to contact me, the best email is hello@parseflow.tech

If you want to try out the project or just see more about it, the Docusaurus site can be found here: https://docs.parseflow.tech/

Again, thank you for any suggestions or help. I'm just trying to build cool projects that solve problems I have and hopefully help at least one other person out there.

DEV Community