Ask HN: Best LLM Stack for Q&A over Internal PDFs?
I'm looking to build an LLM-based chatbot that can answer questions using a set of internal PDF documents. Has anyone worked on a similar use case with good success? What approach and LLM stack did you use to solve this - RAG (Retrieval-Augmented Generation), fine-tuning, or embedding-based search?
RAG and embedding-based search are basically the same thing AFAIK - embedding search is the retrieval step of RAG.
My approach is to stuff as many documents as possible directly into the context. The context windows of frontier models are large enough for my use case of ~20-40 documents: 128K tokens for gpt-4o, 200K for o1/o3, and 1M for Gemini.
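A minimal sketch of the stuffing approach, assuming the OpenAI Python SDK and pypdf; the folder name, model, and prompts are placeholders:

    # pip install openai pypdf
    from pathlib import Path
    from openai import OpenAI
    from pypdf import PdfReader

    def pdf_text(path: Path) -> str:
        # Pull the plain text out of every page of one PDF.
        return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)

    docs = [pdf_text(p) for p in sorted(Path("internal_pdfs").glob("*.pdf"))]
    context = "\n\n---\n\n".join(docs)  # every document in one blob

    question = "Based on these documents, infer the industries where X technique may be useful."
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",  # any long-context model
        messages=[
            {"role": "system", "content": "Answer using only the provided documents."},
            {"role": "user", "content": f"Documents:\n{context}\n\nQuestion: {question}"},
        ],
    )
    print(resp.choices[0].message.content)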
When stuffing them all into one query isn't possible, split the documents across multiple queries and aggregate the answers.
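A crude map-reduce over batches, reusing client/docs/question from the sketch above; the batch size is a guess you'd tune so each batch fits the window:

    def ask(context: str, question: str) -> str:
        # One LLM call over a subset of the documents.
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Answer using only the provided documents."},
                {"role": "user", "content": f"Documents:\n{context}\n\nQuestion: {question}"},
            ],
        )
        return resp.choices[0].message.content

    batch_size = 10  # pick so each batch stays under the context window
    partials = [ask("\n\n---\n\n".join(docs[i:i + batch_size]), question)
                for i in range(0, len(docs), batch_size)]

    # Reduce step: merge the per-batch answers into a single answer.
    print(ask("\n\n---\n\n".join(partials),
              f"These are partial answers to '{question}'. Combine them into one final answer."))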
I've tried RAG, but matching query embeddings to chunk embeddings isn't as straightforward as it sounds: relevant content was missed even with my modest number of documents. Semantic matching on query embeddings is one level above dumb keyword matching but one level below asking the LLM directly.
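For context, the plain embedding-search setup I mean looks roughly like this (OpenAI embeddings plus cosine similarity; the chunk size and embedding model are arbitrary choices, and docs is the list of document strings from the sketch above):

    # pip install numpy
    import numpy as np

    def embed(texts: list[str]) -> np.ndarray:
        resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
        return np.array([d.embedding for d in resp.data])

    # Naive fixed-size chunking; overlap or sentence-aware splitting usually helps.
    chunks = [doc[i:i + 1500] for doc in docs for i in range(0, len(doc), 1500)]
    chunk_vecs = embed(chunks)

    def top_k(query: str, k: int = 5) -> list[str]:
        # Cosine similarity between the query embedding and every chunk embedding.
        q = embed([query])[0]
        sims = (chunk_vecs @ q) / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
        return [chunks[i] for i in np.argsort(-sims)[:k]]

    relevant_chunks = top_k(question)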
Direct LLM queries seem to perform best, especially when some intermediate reasoning is required (like "Based on these documents, infer the industries where X technique may be useful"). That isn't possible with simple embedding search unless some of the documents specifically use the umbrella word "industry" or a close synonym.
Embedding search can probably be improved - like generating a synthetic answer and matching that answer's embedding to chunk embeddings. But I haven't tried such techniques.
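That idea is roughly what HyDE (Hypothetical Document Embeddings) does. I haven't tested it, but a sketch on top of the top_k helper above would be:

    def hyde_top_k(question: str, k: int = 5) -> list[str]:
        # Step 1: let the LLM draft a hypothetical answer. Hallucination is fine here;
        # the draft is only used as a richer query for embedding search.
        draft = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": f"Write a short passage that answers: {question}"}],
        ).choices[0].message.content
        # Step 2: retrieve chunks similar to the draft answer rather than the raw question.
        return top_k(draft, k)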
LangChain was the OG for PDF RAG. You don't need fine-tuning or anything; it does embedding-based search right out of the box.
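Something like this. LangChain's API moves fast, so treat the imports as approximate for recent package versions; the file name and question are placeholders:

    # pip install langchain langchain-openai langchain-community langchain-text-splitters faiss-cpu pypdf
    from langchain_community.document_loaders import PyPDFLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from langchain_openai import OpenAIEmbeddings, ChatOpenAI
    from langchain_community.vectorstores import FAISS
    from langchain.chains import RetrievalQA

    # Load one PDF, split it into overlapping chunks, and index the chunks in FAISS.
    pages = PyPDFLoader("internal_doc.pdf").load()
    chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(pages)
    store = FAISS.from_documents(chunks, OpenAIEmbeddings())

    # Retrieval-augmented Q&A over the indexed chunks.
    qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(model="gpt-4o-mini"),
                                     retriever=store.as_retriever())
    print(qa.invoke({"query": "What does the policy say about remote work?"})["result"])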
Microsoft Copilot does this out of the box.
Just upload your documents to OneDrive, SharePoint, or a Teams site that you have access to and start asking questions.