# RAG Notes Limit context size Order preservation of chunks Retrieve the top k chunks based on similarity scores, but preserve the original order of the chunks as they appeared in the document. Unsorted https://levelup.gitconnected.com/testing-18-rag-techniques-to-find-the-best-094d166af27f https://www.jeremykun.com/2015/04/06/markov-chain-monte-carlo-without-all-the-bullshit/ https://medium.com/ai-exploration-journey/how-hirag-turns-data-chaos-into-structured-knowledge-magic-ai-innovations-and-insights-35-d637b9a58d80 https://arxiv.org/pdf/2506.00054 https://arxiv.org/abs/2511.09803 https://arxiv.org/abs/2508.10419 https://arxiv.org/abs/2511.08505 https://arxiv.org/abs/2511.08245 https://arxiv.org/abs/2510.20260 https://github.com/grantflow-ai/grantflow https://arxiv.org/abs/2507.05093 https://arxiv.org/abs/2507.02962 https://arxiv.org/abs/2507.05713 https://docs.kiln.tech/docs/evaluations/evaluate-rag-accuracy-q-and-a-evals https://arxiv.org/html/2507.06554v2#S5 https://arxiv.org/abs/2507.07426 https://arxiv.org/html/2507.04055v2#S4 https://arxiv.org/html/2511.07328v1 https://jxnl.co/writing/2025/09/11/the-rag-mistakes-that-are-killing-your-ai-skylar-payne/ https://huggingface.co/datasets/isaacus/open-australian-legal-corpus https://huggingface.co/blog/adlumal/lightning-fast-vector-search-for-legal-documents https://github.com/hhy-huang/HiRAG https://archive.is/W64q3 https://github.com/trustgraph-ai/trustgraph https://archive.is/G9rmp https://github.com/merveenoyan/smol-vision/blob/main/Image_Search_with_MetaCLIP2.ipynb https://huggingface.co/merve/smol-vision/blob/main/Image_Search_with_MetaCLIP2.ipynb Propositional Chunking https://deepwiki.com/realyinchen/RAG/2.3-proposition-chunking https://eclabs.ai/proposition-based-chunking https://chamomile.ai/reliable-rag-with-data-preprocessing/ https://discovery.graphsandnetworks.com/graphAI/rags.html Factual Analysis https://arxiv.org/html/2404.12065v1 https://github.com/farrukhrashid1997/Fathom https://aclanthology.org/2024.fever-1.5/ https://github.com/uhh-hcds/UHH-at-AVeriTeC https://aclanthology.org/2024.fever-1.6/ https://aclanthology.org/2024.fever-1.8/ https://arxiv.org/abs/2407.17023 https://arxiv.org/abs/2401.15498 https://aclanthology.org/2025.fever-1.20.pdf https://aclanthology.org/2025.fever-1.19/ https://arxiv.org/abs/2504.14175 https://arxiv.org/abs/2503.01670 https://arxiv.org/pdf/2305.13117 https://arxiv.org/html/2506.19607v1 https://huggingface.co/datasets/UCSC-IRKM/RAGuard https://arxiv.org/abs/2402.10735 https://arxiv.org/abs/2502.09083 https://fever.ai/task.html https://github.com/ssu-humane/HerO2 https://github.com/heruberuto/FEVER-8-Shared-Task/tree/main https://gitlab.kit.edu/utjwp/Fact-Checking_GoT_RAG_LLM https://github.com/Raldir/FEVER-8-Shared-Task https://medium.com/data-science-collective/comprehensive-guide-to-fine-tuning-llm-4a8fd4d0e0af https://apxml.com/courses/fine-tuning-adapting-large-language-models/chapter-6-evaluation-analysis-fine-tuned-models/assessing-factual-accuracy https://deepgram.com/learn/truthfulqa-llm-benchmark-guide https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation https://huggingface.co/FacebookAI/roberta-large-mnli https://huggingface.co/models?sort=modified&search=nli https://huggingface.co/cross-encoder/nli-deberta-v3-base https://huggingface.co/cross-encoder/nli-deberta-v3-large https://aclanthology.org/volumes/2023.fever-1/ https://aclanthology.org/volumes/2024.fever-1/ 
https://www.microsoft.com/en-us/research/blog/claimify-extracting-high-quality-claims-from-language-model-outputs/ https://www.microsoft.com/en-us/research/publication/towards-effective-extraction-and-evaluation-of-factual-claims/ https://arxiv.org/html/2312.06648v3#S3 https://raw.githubusercontent.com/NirDiamant/RAG_Techniques/refs/heads/main/all_rag_techniques/HyDe_Hypothetical_Document_Embedding.ipynb https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/HyPE_Hypothetical_Prompt_Embeddings.ipynb https://arxiv.org/pdf/2502.10855 https://huggingface.co/datasets/microsoft/claimify-dataset/viewer/default/train?row=0&views%5B%5D=train https://gist.github.com/deshwalmahesh/55ad890a8e4bac8155aaf4058f06cfdb https://arxiv.org/pdf/2406.19803 https://iopscience.iop.org/article/10.1088/2632-2153/ad7228 https://github.com/lamm-mit/GraphReasoning https://arxiv.org/abs/2409.13740 https://arxiv.org/html/2503.10150v2 https://kaitchup.substack.com/p/multimodal-rag-with-colpali-and-qwen2?utm_campaign=posts-open-in-app&triedRedirect=true https://bytevagabond.com/post/how-to-build-enterprise-ai-rag/ https://archive.is/dnSIi https://github.com/Future-House/paper-qa https://medium.com/@richardhightower/stop-the-hallucinations-hybrid-retrieval-with-bm25-pgvector-embedding-rerank-llm-rubric-rerank-895d8f7c7242 https://arxiv.org/html/2510.02657v2#S6 https://www.lesswrong.com/posts/bT8yyJHpK64v3nh2N/ai-safety-chatbot https://github.com/tbroadley/ai-safety-conversational-agent https://github.com/yichuan-w/LEANN?tab=readme-ov-file https://medium.com/@bravekjh/rag-mcp-supercharging-ai-agents-with-retrieval-augmented-generation-and-model-context-protocol-775ff5a7206f https://github.com/Aquiles-ai/Aquiles-RAG https://www.reddit.com/r/Rag/comments/1n98ua7/building_a_productiongrade_rag_on_a_900page/ https://integral-business-intelligence.github.io/archivist/ https://huggingface.co/blog/thebajajra/rexbert-encoders https://cloud.google.com/generative-ai-app-builder/docs/check-grounding#claim-level-score-response-examples https://universalrag.github.io/ https://engineeringblog.yelp.com/2025/02/search-query-understanding-with-LLMs.html https://github.com/facebookresearch/ReasonIR/tree/main https://huggingface.co/lightonai/Reason-ModernColBERT https://arxiv.org/pdf/2407.12883 https://arxiv.org/pdf/2505.22095 https://github.com/hzy312/knowledge-r1/blob/main/README_Search-R1.md https://arxiv.org/html/2503.21729v3 https://github.com/yale-nlp/MCTS-RAG https://www.tutorialspoint.com/basic-understanding-of-cure-algorithm https://cloud.google.com/generative-ai-app-builder/docs/check-grounding https://arxiv.org/abs/2508.13107 https://github.com/SciPhi-AI/R2R/ https://github.com/Aquiles-ai/Aquiles-RAG https://aclanthology.org/2024.acl-long.585/ https://dl.acm.org/doi/10.1145/3626772.3657853 https://aclanthology.org/2024.findings-emnlp.449/ https://aclanthology.org/2024.acl-long.540/ https://github.com/calubkk/RAAT https://github.com/microsoft/LMOps/tree/main/corag https://www.figma.com/slides/eq8UzeTluT3qLKOpzlLj8y/LlamaIndex-Talk--Document-Agents-?node-id=51004-814&t=7F9MAlkLnukNb3f6-0 https://ai.plainenglish.io/visual-grounding-rag-with-docling-9696d02054f2 https://lukew.com/ff/entry.asp?2106?ref=sidebar https://github.com/topoteretes/cognee https://arxiv.org/abs/2412.02592 https://github.com/Bessouat40/RAGLight https://github.com/Aquiles-ai/Aquiles-RAG https://pub.towardsai.net/demystifying-googles-data-gemma-f07a470c2a39?gi=1e0e15c7b54a 
https://github.com/CODE-AXION/rag-best-practices/tree/main?tab=readme-ov-file#prompt https://github.com/MinghoKwok/DeepSieve https://arxiv.org/abs/2508.13107 https://archive.is/15vEV#selection-224.0-332.0 https://github.com/BevinV/Interactive-Rag https://www.aryn.ai/ https://landing.ai/ https://github.com/OpenBMB/UltraRAG https://github.com/bRAGAI/bRAG-langchain https://docs.llamaindex.ai/en/stable/examples/memory/Mem0Memory/ https://github.com/run-llama/llama_parse/blob/main/examples/advanced_rag/dynamic_section_retrieval.ipynb https://m3docrag.github.io/ https://medium.com/knowledge-nexus-ai/introducing-lightrag-a-new-era-in-retrieval-augmented-generation-57c20d6081fa https://arxiv.org/abs/2409.10038 https://arxiv.org/html/2410.22349v1 https://mltechniques.com/2024/10/08/building-a-ranking-system-to-enhance-prompt-results/ https://openragmoe.github.io/ https://github.com/Ariya12138/CORAL ttps://arxiv.org/abs/2410.07176 https://arxiv.org/abs/2410.04343 https://github.com/Ancientshi/ERM4 https://github.com/QingFei1/LongRAG https://www.reddit.com/r/LangChain/comments/1dzfp48/agent_retrieval_how_we_almost_always_find_the/ https://www.reddit.com/r/LangChain/comments/1dtr49t/agent_rag_parallel_quotes_how_we_built_rag_on/ https://arxiv.org/abs/2505.23052 https://help.openai.com/en/articles/10169521-projects-in-chatgpt https://github.com/ag2ai/SimpleDoc https://help.openai.com/en/articles/10169521-projects-in-chatgpt https://medium.com/@evgeni.n.rusev/multi-agent-ai-with-hybrid-search-cutting-document-review-time-by-80-f7367a9b1361 https://research.google/blog/muvera-making-multi-vector-retrieval-as-fast-as-single-vector-search/ https://www.midswirl.com/blog/road-to-sqlite-vec-exploring-sqlite-as-a-rag-vector-database https://github.com/bytedance/Dolphin https://github.com/AppFlowy-IO/AppFlowy https://levelup.gitconnected.com/creating-the-best-rag-finder-pipeline-for-your-dataset-88062a6fa45e https://medium.com/data-science-collective/let-users-talk-to-your-databases-build-a-rag-powered-sql-assistant-with-streamlit-4fc9a2ee3960 https://www.reddit.com/r/LocalLLaMA/comments/1khjrtj/building_llm_workflows_some_observations/ https://abdullin.com/ilya/how-to-build-best-rag/ https://github.com/IlyaRice/RAG-Challenge-2 https://github.com/shredEngineer/Archive-Agent/tree/main/archive_agent/core https://arxiv.org/html/2504.01818v1 https://github.com/waetr/KET-RAG https://arxiv.org/abs/2502.09304 https://github.com/CornelliusYW/Multimodal-RAG-Implementation https://github.com/mixedbread-ai/maxsim-cpu https://arxiv.org/html/2506.23026v1#S4 https://arxiv.org/html/2504.15629v2#bib https://arxiv.org/abs/2504.15629v2 https://arxiv.org/abs/2505.17238v2 https://arxiv.org/abs/2404.19543v2 https://arxiv.org/html/2506.16035v2#S6 https://arxiv.org/abs/2507.02962v3 https://github.com/DavidZWZ/Awesome-RAG-Reasoning https://arxiv.org/abs/2507.12455v1 https://medium.com/data-science/rags-with-query-routing-5552e4e41c54 https://huggingface.co/vidore/colqwen-omni-v0.1 https://www.ragie.ai/blog/how-we-built-multimodal-rag-for-audio-and-video https://arxiv.org/abs/2507.12455v1 https://github.com/jolibrain/colette/tree/main https://zenn.dev/microsoft/articles/rag_textbook https://arxiv.org/abs/2507.09477 https://github.com/jolibrain/colette https://contextual.ai/blog/document-parser-for-rag/ https://docs.contextual.ai/api-reference/parse/parse-file https://arxiv.org/pdf/2506.00054 Factual Extraction RAG https://medium.com/@eliot64/bridging-legal-ai-and-trust-how-we-won-the-llm-x-law-hackathon-45081a8681d9 
https://github.com/pkargupta/claimspect Models https://huggingface.co/THUDM/GLM-Z1-32B-0414 https://huggingface.co/collections/THUDM/glm-4-0414-67f3cbcb34dd9d252707cb2e Agentic RAG https://arxiv.org/pdf/2501.09136 https://github.com/asinghcsu/AgenticRAG-Survey Graphs (GraphRAG / KnowledgeGraphs) GARLIC https://arxiv.org/pdf/2410.04790v1 101 https://www.byhand.ai/p/beginners-guide-to-graph-rag https://towardsdatascience.com/graph-rag-a-conceptual-introduction-41cd0d431375/ About https://github.com/zjukg/KG-LLM-Papers https://arxiv.org/pdf/2502.11371 https://generativeai.pub/graph-rag-has-awesome-potential-but-currently-has-serious-flaws-c052a8a3107e?gi=3852d6a64701 https://news.ycombinator.com/item?id=34605772 https://medium.com/@dickson.lukose/ontology-modelling-and-engineering-4df8b6b9f3a5 https://arxiv.org/html/2411.15671v1 https://ai.plainenglish.io/metagraphs-and-hypergraphs-for-complex-ai-agent-memory-and-rag-717f6f3589f5 https://generativeai.pub/knowledge-graph-extraction-visualization-with-local-llm-from-unstructured-text-a-history-example-94c63b366fed?gi=876173cfaae4 https://medium.com/@dickson.lukose/ontology-reasoning-imperative-for-intelligent-graphrag-part-1-2-0018265b987c https://medium.com/@ianormy/microsoft-graphrag-with-an-rdf-knowledge-graph-part-2-d8d291a39ed1 Creation of https://arxiv.org/abs/2412.03589 https://iopscience.iop.org/article/10.1088/2632-2153/ad7228/pdf https://arxiv.org/abs/2411.19539 Knowledge-Augmented Gen https://github.com/OpenSPG/KAG https://pub.towardsai.net/kag-graph-multimodal-rag-llm-agents-powerful-ai-reasoning-b3da38d31358 Implement https://towardsdatascience.com/how-to-build-a-graph-rag-app-b323fc33ba06/ https://blog.gopenai.com/llm-ontology-prompting-for-knowledge-graph-extraction-efdcdd0db3a1?gi=23252228d718 https://pub.towardsai.net/building-a-knowledge-graph-from-unstructured-text-data-a-step-by-step-guide-c14c926c2229 https://arxiv.org/abs/2408.04187 https://towardsdatascience.com/how-to-query-a-knowledge-graph-with-llms-using-grag-38bfac47a322/ https://github.com/zjunlp/OneKE https://ai.plainenglish.io/unified-knowledge-graph-model-rdf-rdf-vs-lpg-the-end-of-war-a7c14d6ac76f https://neuml.hashnode.dev/advanced-rag-with-graph-path-traversal https://www.youtube.com/watch?v=g6xBklAIrsA https://towardsdatascience.com/how-to-implement-graph-rag-using-knowledge-graphs-and-vector-databases-60bb69a22759 https://towardsdatascience.com/text-to-knowledge-graph-made-easy-with-graph-maker-f3f890c0dbe8 https://towardsdatascience.com/how-to-convert-any-text-into-a-graph-of-concepts-110844f22a1a/ https://towardsdatascience.com/building-a-knowledge-graph-from-scratch-using-llms-f6f677a17f07/ https://medium.com/thoughts-on-machine-learning/building-dynamic-knowledge-graphs-using-open-source-llms-06a870e1bc4f https://ai.plainenglish.io/modeling-ai-semantic-memory-with-knowledge-graphs-1ce06f683433 https://towardsdatascience.com/how-to-convert-any-text-into-a-graph-of-concepts-110844f22a1a/ https://arxiv.org/abs/2412.04119 https://medium.com/@infiniflowai/how-our-graphrag-reveals-the-hidden-relationships-of-jon-snow-and-the-mother-of-dragons-bd89084f64ec https://pub.towardsai.net/exploring-and-comparing-graph-based-rag-approaches-microsoft-graphrag-vs-neo4j-langchain-3837cd3dddef https://medium.com/@researchgraph/dynamic-knowledge-graphs-a-next-step-for-data-representation-c35a205a520a https://arxiv.org/abs/2501.02157 https://arxiv.org/abs/2412.03589 https://github.com/circlemind-ai/fast-graphrag https://github.com/gusye1234/nano-graphrag Models 
https://www.arxiv.org/abs/2502.13339 Retrieval https://generativeai.pub/advanced-rag-retrieval-strategies-using-knowledge-graphs-12c9ce54d2da Context Relevancy https://arxiv.org/abs/2404.10198 https://github.com/kevinwu23/StanfordClashEval ``` @Rock-star-007 from reddit: I added 3 nodes in my agent - query intent identifier, query rephraser , search_scope identifier. You try to identify what what the user is intending to ask. The rephrase the question as per the intent. You need to write prompts for rephrasing accordingly. Then identify what specific documents the user might be talking about. Then perform retrieval step with appropriate scope, then do generation. ``` Creating QA Knowledgebase based off chats; ``` For every chat conversation you generate an LLM summary with in two part (problem/solution) Then you you do an embedding of the problem part for your RAG. Next time a user explain a issue, you RAG it (each summaries) and you got a collection of similar case. ---------- dump chats out, extract timestamps and sender lines with a regex (which breaks every third export for reasons unknown). You want that in the metadata. split into chunks, toss each chunk at an LLM with a prompt like: “give me 3-4 questions someone might ask that are answered in this chunk, and the matching answers” so you’re generating Qs for the indexed content, not just summarizing. sometimes it hallucinated wild stuff, but it’s faster than hand-labeling batch QA pairs into a big ugly CSV, feed to vector DB ------- prepare an llm(system prompt, proper model and few shot prompts of human generated summaries are enough) for summarizing chats and make it output a json object in a schema, including prompt-response pairs(problem/question - fix/response) ``` Improvements https://arxiv.org/pdf/2501.07391 https://cobusgreyling.medium.com/four-levels-of-rag-research-from-microsoft-fdc54388f0ff https://arxiv.org/html/2412.00239v1#S6 https://towardsdatascience.com/advanced-retrieval-techniques-for-better-rags-c53e1b03c183/ https://towardsdatascience.com/5-proven-query-translation-techniques-to-boost-your-rag-performance-47db12efe971/ https://arxiv.org/abs/2412.19442 https://pub.towardsai.net/revisiting-chunking-in-the-rag-pipeline-9aab8b1fdbe7 Bias https://arxiv.org/pdf/2502.17390 Fine-tuning RAG Models https://arxiv.org/abs/2412.15563 CORAG https://arxiv.org/abs/2501.14342 PA-RAG https://arxiv.org/abs/2412.14510 RAFT https://arxiv.org/abs/2403.10131 Multi-Modal https://arxiv.org/pdf/2502.08826v2 RAG 'Styles' Adaptive https://arxiv.org/pdf/2505.22095 https://arxiv.org/abs/2503.21729 Agentic https://arxiv.org/pdf/2501.09136v1 https://arxiv.org/pdf/2412.17149 https://research.google/blog/chain-of-agents-large-language-models-collaborating-on-long-context-tasks/ MMOA-RAG https://arxiv.org/abs/2501.15228 Search-O1 https://github.com/sunnynexus/Search-o1 AssistRAG https://github.com/smallporridge/AssistRAG Cache-Augmented-Generation https://arxiv.org/abs/2412.15605 CRAG - Corrective RAG https://arxiv.org/abs/2401.15884 DeepRAG https://arxiv.org/abs/2502.01142 FACT - Multi-fact retrieval https://arxiv.org/abs/2410.21012 FlexRAG https://arxiv.org/pdf/2409.15699v1 GEAR https://arxiv.org/abs/2501.02772 HTMLRag - Retrieval works better with structured text like HTML vs unstructured plaintext https://github.com/plageon/HtmlRAG https://arxiv.org/abs/2411.02959v1 Knowledge-Agumented Retrieval https://medium.com/@samarrana407/why-knowledge-augmented-generation-kag-is-the-best-approach-to-rag-2e7820228087 Knowledge-Aware-Retrieval 
https://arxiv.org/abs/2410.13765 MAIN-rag https://arxiv.org/abs/2501.00332 MCTS https://github.com/yale-nlp/MCTS-RAG RankCOT https://arxiv.org/pdf/2502.17888 https://github.com/NEUIR/RankCoT RARE - Retrieval-Augmented Reasoning Enhancement for Large Language Models https://arxiv.org/abs/2412.02830 RaRE - RAG with in-contex examples(?) https://arxiv.org/abs/2410.20088 Speculative RAG https://research.google/blog/speculative-rag-enhancing-retrieval-augmented-generation-through-drafting/ ReACT https://arxiv.org/abs/2210.03629 Review-then-Refine https://arxiv.org/abs/2412.15101 TableRAG https://arxiv.org/pdf/2410.04739v1 TrustRAG https://arxiv.org/pdf/2501.00879 VLM https://github.com/HKUDS/LightRAG https://github.com/llm-lab-org/Multimodal-RAG-Survey https://lascari.ai/writing/2025/02/10/image-gen-tagging/ https://arxiv.org/abs/2502.08826 https://github.com/Lokesh-Chimakurthi/vision-rag https://github.com/AhmedAl93/multimodal-agentic-RAG https://github.com/tjmlabs/ColiVara https://huggingface.co/vidore/colqwen2-v0.1 https://colivara.com/ https://github.com/lesteroliver911/pdf-analyzer-with-page-citations https://pub.towardsai.net/multimodal-rag-unveiled-a-deep-dive-into-cutting-edge-advancements-0eeb514c3ac4 https://arxiv.org/abs/2502.20964 Ranking https://arxiv.org/pdf/2409.11598 Retrieval https://arxiv.org/abs/2412.03736 https://arxiv.org/abs/2409.14924 https://weaviate.io/blog/late-chunking https://softwaredoug.com/blog/2025/02/08/elasticsearch-hybrid-search https://arxiv.org/abs/2412.03736 https://softwaredoug.com/blog/2024/08/06/i-made-search-worse-elasticsearch https://arxiv.org/pdf/2501.00332 DRIFT https://www.microsoft.com/en-us/research/blog/introducing-drift-search-combining-global-and-local-search-methods-to-improve-quality-and-efficiency/ FunnelRAG https://arxiv.org/pdf/2410.10293v1 GoldenRetriever https://ai.plainenglish.io/a-deep-dive-into-golden-retriever-eea3396af3b4 Mixture-of-PageRanks https://arxiv.org/abs/2412.06078 https://www.zyphra.com/post/the-mixture-of-pageranks-retriever-for-long-context-pre-processing DRAGIN https://arxiv.org/abs/2403.10081 https://github.com/oneal2000/DRAGIN/tree/main MBA-RAG https://arxiv.org/abs/2412.01572 https://github.com/FUTUREEEEEE/MBA Vector-related https://www.louisbouchard.ai/indexing-methods/ ### Links - RAG 101 * https://www.youtube.com/watch?v=nc0BupOkrhI * https://arxiv.org/abs/2401.08406 * https://github.com/NirDiamant/RAG_Techniques?tab=readme-ov-file * https://github.com/jxnl/n-levels-of-rag * https://winder.ai/llm-architecture-rag-implementation-design-patterns/ * https://medium.com/@yufan1602/modular-rag-and-rag-flow-part-%E2%85%B0-e69b32dc13a3 https://news.ycombinator.com/item?id=42174829 - RAG 201 * https://medium.com/@cdg2718/why-your-rag-doesnt-work-9755726dd1e9 * https://www.cazton.com/blogs/technical/advanced-rag-techniques * https://medium.com/@krtarunsingh/advanced-rag-techniques-unlocking-the-next-level-040c205b95bc * https://pub.towardsai.net/advanced-rag-techniques-an-illustrated-overview-04d193d8fec6 * https://winder.ai/llm-architecture-rag-implementation-design-patterns/ * https://towardsdatascience.com/17-advanced-rag-techniques-to-turn-your-rag-app-prototype-into-a-production-ready-solution-5a048e36cdc8 * https://medium.com/@samarrana407/mastering-rag-advanced-methods-to-enhance-retrieval-augmented-generation-4b611f6ca99a * https://generativeai.pub/advanced-rag-retrieval-strategy-query-rewriting-a1dd61815ff0 * https://medium.com/@yufan1602/modular-rag-and-rag-flow-part-%E2%85%B0-e69b32dc13a3 * 
https://pub.towardsai.net/rag-architecture-advanced-rag-3fea83e0d189?gi=47c0b76dbee0 * https://towardsdatascience.com/3-advanced-document-retrieval-techniques-to-improve-rag-systems-0703a2375e1c - Articles * https://posts.specterops.io/summoning-ragnarok-with-your-nemesis-7c4f0577c93b?gi=7318858af6c3 * https://blog.demir.io/advanced-rag-implementing-advanced-techniques-to-enhance-retrieval-augmented-generation-systems-0e07301e46f4 * https://arxiv.org/abs/2312.10997 * https://jxnl.co/writing/2024/05/22/systematically-improving-your-rag/ * https://www.arcus.co/blog/rag-at-planet-scale * https://d-star.ai/embeddings-are-not-all-you-need - Architecture Design - https://medium.com/@yufan1602/modular-rag-and-rag-flow-part-ii-77b62bf8a5d3 - https://www.anyscale.com/blog/a-comprehensive-guide-for-building-rag-based-llm-applications-part-1 * https://github.com/ray-project/llm-applications - https://levelup.gitconnected.com/testing-18-rag-techniques-to-find-the-best-094d166af27f Papers - Rags to Riches - https://huggingface.co/papers/2406.12824 * LLMs will use foreign knowledge sooner than parametric information. - Lit Search * https://arxiv.org/pdf/2407.18940 * https://arxiv.org/abs/2407.18940 - Building * https://techcommunity.microsoft.com/t5/microsoft-developer-community/building-the-ultimate-nerdland-podcast-chatbot-with-rag-and-llm/ba-p/4175577 * https://medium.com/@LakshmiNarayana_U/advanced-rag-techniques-in-ai-retrieval-a-deep-dive-into-the-chroma-course-d8b06118cde3 * https://rito.hashnode.dev/building-a-multi-hop-qa-with-dspy-and-qdrant * https://blog.gopenai.com/advanced-retrieval-augmented-generation-rag-techniques-5abad385ac66?gi=09e684acab4d * https://www.youtube.com/watch?v=bNqSRNMgwhQ * https://www.youtube.com/watch?v=7h6uDsfD7bg * https://www.youtube.com/watch?v=Balro-DxFyk&list=PLwPYSl1MQp4FpIzn48ypesKYzLvUBQpPF&index=5 * https://github.com/jxnl/n-levels-of-rag * https://rito.hashnode.dev/building-a-multi-hop-qa-with-dspy-and-qdrant - Chunking * https://archive.is/h0oBZ * https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/ - Evals * https://github.com/Kain-90/RAG-Play - Multi-Modal RAG * https://docs.llamaindex.ai/en/v0.10.17/examples/multi_modal/multi_modal_pdf_tables.html * https://archive.is/oIhNp * https://arxiv.org/html/2407.01449v2 - Query Expansion * https://arxiv.org/abs/2305.03653 - Cross-Encoder Ranking * Deep Neural network that processes two input sequences together as a single input. Allows the model to directly compare and contrast the inputs, understanding their relationship in a more integrated and nuanced manner. * https://www.sbert.net/examples/applications/retrieve_rerank/README.html Existing https://github.com/superlinear-ai/raglite ### Aligning with the Money 1. Why is it needed in the first place? 2. Identify & Document the Context * What is the business objective? * What led to this objective being identified? * Why is this the most ideal solution? * What other solutions have been evaluated? 3. Identify the intended use patterns * What questions will it answer? * What answers and what kinds of answers are users expecting? 4. Identify the amount + type of data to be archived/referenced * Need to identify what methods of metadata creation and settings will be the most cost efficient in time complexity. * How will you be receiving the data? * Will you be receiving the data or is it on you to obtain it? 5. What does success look like and how will you know you've achieved it? * What are the key metrics/values to measure/track? 
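Several retrieval items further down (Hybrid & Filtered Vector Search, RAG-Fusion, Fusion Retrieval) boil down to the same move: run a keyword ranker and a vector ranker over the same corpus, then merge the two rankings with Reciprocal Rank Fusion. A minimal sketch of that move, assuming the `rank_bm25` and `sentence-transformers` packages and the `all-MiniLM-L6-v2` checkpoint (illustrative choices, not something these notes prescribe):

```python
# Hybrid retrieval sketch: BM25 keyword ranking + dense embedding ranking over the
# same corpus, merged with Reciprocal Rank Fusion (RRF). Library and model choices
# here are illustrative assumptions.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = [
    "RAPTOR builds a tree of recursive summaries for retrieval.",
    "BM25 is a sparse, keyword-based ranking function.",
    "Cross-encoders re-rank retrieved chunks for better precision.",
    "HNSW indexes give fast approximate nearest neighbor search.",
]

bm25 = BM25Okapi([d.lower().split() for d in docs])      # sparse ranker
model = SentenceTransformer("all-MiniLM-L6-v2")          # dense ranker
doc_emb = model.encode(docs, convert_to_tensor=True)


def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d))."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def hybrid_search(query: str, top_k: int = 3) -> list[str]:
    bm25_scores = bm25.get_scores(query.lower().split())
    bm25_ranking = sorted(range(len(docs)), key=lambda i: -bm25_scores[i])
    dense_scores = util.cos_sim(model.encode(query, convert_to_tensor=True), doc_emb)[0]
    dense_ranking = sorted(range(len(docs)), key=lambda i: -float(dense_scores[i]))
    return [docs[i] for i in rrf([bm25_ranking, dense_ranking])[:top_k]]


if __name__ == "__main__":
    print(hybrid_search("fast approximate vector index"))
```

RRF only needs ranks, not comparable scores, which is why it is the usual way to fuse BM25 and embedding results.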
* How are these connected to the 'Success State'? ### Building my RAG Solution - **Outline** * Modular architecture design - **Pre-Retrieval** * F - **Retrieval** * F - **Post-Retrieval** 1. Stuff 2. Query Strategy * Factual Strategy: Focuses on retrieving precise facts and figures. * Analytical Strategy: Aims for comprehensive coverage of a topic, exploring different aspects. * Opinion Strategy: Tries to gather diverse viewpoints on a subjective issue. * Contextual Strategy: Incorporates user-specific context to tailor the retrieval. - **Generation & Post-Generation** - Prompt Compression * https://github.com/microsoft/LLMLingua - **Citations** * Contextcite: https://github.com/MadryLab/context-cite ### RAG Process 1. Pre-Retrieval - Raw data creation / Preparation 1. Prepare data so that text-chunks are self-explanatory 2. **Retrieval** 1. **Chunk Optimization** - Naive - Fixed-size (in characters) Overlapping Sliding windows * `limitations include imprecise control over context size, the risk of cutting words or sentences, and a lack of semantic consideration. Suitable for exploratory analysis but not recommended for tasks requiring deep semantic understanding.` - Recursive Structure Aware Splitting * `A hybrid method combining fixed-size sliding window and structure-aware splitting. It attempts to balance fixed chunk sizes with linguistic boundaries, offering precise context control. Implementation complexity is higher, with a risk of variable chunk sizes. Effective for tasks requiring granularity and semantic integrity but not recommended for quick tasks or unclear structural divisions.` - Structure Aware Splitting (by sentence/paragraph) * ` Respecting linguistic boundaries preserves semantic integrity, but challenges arise with varying structural complexity. Effective for tasks requiring context and semantics, but unsuitable for texts lacking defined structural divisions.` - Context-Aware Splitting (Markdown/LaTeX/HTML) * `ensures content types are not mixed within chunks, maintaining integrity. Challenges include understanding specific syntax and unsuitability for unstructured documents. Useful for structured documents but not recommended for unstructured content.` - NLP Chunking: Tracking Topic Changes * `based on semantic understanding, dividing text into chunks by detecting significant shifts in topics. Ensures semantic consistency but demands advanced NLP techniques. Effective for tasks requiring semantic context and topic continuity but not suitable for high topic overlap or simple chunking tasks.` 2. **Enhancing Data Quality** - Abbreviations/technical terms/links * `To mitigate that issue, we can try to ingest that necessary additional context while processing the data, e.g. replace abbreviations with the full text by using an abbreviation translation table.` 3. **Meta-data** - You can add metadata to your vector data in all vector databases. Metadata can later help to (pre-)filter the entire vector database before we perform a vector search. 4. **Optimize Indexing Structure** * `Full Search vs. Approximate Nearest Neighbor, HNSW vs. IVFPQ` - Types of Data: 1. Text * Chunked and turned into vector embeddings 2. Images and Diagrams * Turn into vector embeddings using a multi-modal/vision model 3. Tables * Summarized with an LLM, descriptions embedded and used for indexing * After retrieval, table is used as is. 4. Code snippets * Chunked using ? * Turned into vector embeddings using an embedding model 1. 
        1. **Chunk Optimization**
            - Semantic splitter - optimize chunk size used for embedding
            - Small-to-Big
            - Sliding Window
            - Summary of chunks
            - Metadata Attachment
        2. **Multi-Representation Indexing**
            - Convert into compact retrieval units (i.e. summaries)
            1. Parent Document
            2. Dense X
        3. **Specialized Embeddings**
            1. Fine-tuned
            2. ColBERT
        4. **Hierarchical Indexing**
            - Tree of document summarization at various abstraction levels
            1. **RAPTOR** - Recursive Abstractive Processing for Tree-Organized Retrieval
                * https://arxiv.org/pdf/2401.18059
                * `RAPTOR is a novel tree-based retrieval system designed for recursively embedding, clustering, and summarizing text segments. It constructs a tree from the bottom up, offering varying levels of summarization. During inference, RAPTOR retrieves information from this tree, incorporating data from longer documents at various levels of abstraction.`
                * https://archive.is/Zgb13 - README
        5. **Knowledge Graphs / GraphRAG**
            - Use an LLM to construct a graph-based text index
                * https://arxiv.org/pdf/2404.16130
                * https://github.com/microsoft/graphrag
            - Occurs in two steps:
                1. Derives a knowledge graph from the source documents
                2. Generates community summaries for all closely connected entity groups
                * Given a query, each community summary contributes to a partial response. These partial responses are then aggregated to form the final global answer.
            - Workflow:
                1. Chunk source documents.
                2. Construct a knowledge graph by extracting entities and their relationships from each chunk.
                3. Simultaneously, Graph RAG employs a multi-stage iterative process. This process requires the LLM to determine if all entities have been extracted, similar to a binary classification problem.
                4. Element Instances → Element Summaries → Graph Communities → Community Summaries
                    * Graph RAG employs community detection algorithms to identify community structures within the graph, incorporating closely linked entities into the same community.
                    * `In this scenario, even if LLM fails to identify all variants of an entity consistently during extraction, community detection can help establish the connections between these variants. Once grouped into a community, it signifies that these variants refer to the same entity connotation, just with different expressions or synonyms. This is akin to entity disambiguation in the field of knowledge graph.`
                    * `After identifying the community, we can generate report-like summaries for each community within the Leiden hierarchy. These summaries are independently useful in understanding the global structure and semantics of the dataset. They can also be used to comprehend the corpus without any problems.`
                5. Community Summaries → Community Answers → Global Answer
        6. **HippoRAG**
            * https://github.com/OSU-NLP-Group/HippoRAG
            * https://arxiv.org/pdf/2405.14831
            * https://archive.is/Zgb13#selection-2093.24-2093.34
        7. **spRAG/dsRAG** - README
            * https://github.com/D-Star-AI/dsRAG
    5. **Choosing the right embedding model**
        * F.
    6. **Self query**
        * https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/self_query/
    7. **Hybrid & Filtered Vector Search**
        * Perform multiple search methods and combine results together
        1. Keyword Search (BM25) + Vector
        2. f
        3.
            * https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking
    8. **Query Construction**
        - Create a query to interact with a specific DB
        1. Text-to-SQL
            * Relational DBs
            * Rewrite a query into a SQL query
        2. Text-to-Cypher
            * Graph DBs
            * Rewrite a query into a Cypher query
        3.
Self-query Retriever * Vector DBs * Auto-generate metadata filters from query 9. **Query Translation** 1. Query Decomposition - Decompose or re-phrase the input question 1. Multi-Query * https://archive.is/5y4iI - Sub-Question Querying * `The core idea of the sub-question strategy is to generate and propose sub-questions related to the main question during the question-answering process to better understand and answer the main question. These sub-questions are usually more specific and can help the system to understand the main question more deeply, thereby improving retrieval accuracy and providing correct answers.` 1. First, the sub-question strategy generates multiple sub-questions from the user query using LLM (Large Language Model). 2. Then, each sub-question undergoes the RAG process to obtain its own answer (retrieval generation). 3. Finally, the answers to all sub-questions are merged to obtain the final answer. 4. Sub Question prompt: - https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/question_gen/llama-index-question-gen-openai/llama_index/question_gen/openai/base.py#L18-L45 ``` You are a world class state of the art agent. You have access to multiple tools, each representing a different data source or API. Each of the tools has a name and a description, formatted as a JSON dictionary. The keys of the dictionary are the names of the tools and the values are the \ descriptions. Your purpose is to help answer a complex user question by generating a list of sub \ questions that can be answered by the tools. These are the guidelines you consider when completing your task: * Be as specific as possible * The sub questions should be relevant to the user question * The sub questions should be answerable by the tools provided * You can generate multiple sub questions for each tool * Tools must be specified by their name, not their description * You don't need to use a tool if you don't think it's relevant Output the list of sub questions by calling the SubQuestionList function. ``` 2. Step-Back Prompting * http://arxiv.org/pdf/2310.06117 * `technique that guides LLM to extract advanced concepts and basic principles from specific instances through abstraction, using these concepts and principles to guide reasoning. This approach significantly improves LLM’s ability to follow the correct reasoning path to solve problems.` - Flow: 1. Take in a question - `Estella Leopold went to what school in Aug 1954 and Nov 1954?` 2. Create a(or multiple) stepback question - `What was Estella Leopold's education history?` 3. Answer Stepback answer 4. Perform reasoning using stepback question + answer to create final answer 3. RAG-Fusion - Combining multiple data sources in one RAG (Walking RAG?) - 3 parts: 1. Query Generation - Generate multiple sub-queries from the user’s input to capture diverse perspectives and fully understand the user’s intent. 2. Sub-query Retrieval - Retrieve relevant information for each sub-query from large datasets and repositories, ensuring comprehensive and in-depth search results. 3. Reciprocal Rank Fusion - Merge the retrieved documents using Reciprocal Rank Fusion (RRF) to combine their ranks, prioritizing the most relevant and comprehensive results. 2. Pseudo-Documents - Hypothetical documents 1. HyDE * https://arxiv.org/abs/2212.10496 10. **Query Enhancement / Rewriting** - Replacing Acronyms with full phrasing - Providing synonyms to industry terms - Literally just ask the LLM to do it. 11. **Query Extension** 12. **Query Expansion** * 1. 
Query Expansion with a generated answer * Paper: https://arxiv.org/abs/2212.10496 * `We use the LLM to generate an answer, before performing the similarity search. If it is a question that can only be answered using our internal knowledge, we indirectly ask the model to hallucinate, and use the hallucinated answer to search for content that is similar to the answer and not the user query itself.` * Given an input query, this method first instructs an LLM to provide a hypothetical answer, whatever its correctness. * Then, the query and the generated answer are combined in a prompt and sent to the retrieval system. - Implementations: - HyDE (Hypothetical Document Embeddings) - Rewrite-Retrieve-Read - Step-Back Prompting - Query2Doc - ITER-RETGEN - Others? 2. Query Expansion with multiple related questions * We ask the LLM to generate N questions related to the original query and then send them all to the retrieval system * 13. **Multiple System Prompts** * Generate multiple prompts, consolidate answer 14. **Query Routing** - Let LLM decide which datastore to use for information retrieval based on user's query 1. Logical Routing - Let LLM choose DB based on question 2. Semantic Routing - embed question and choose prompt based on similarity 15. **Response Summarization** - Using summaries of returned items 16. **Ranking*** 1. Re-Rank * https://div.beehiiv.com/p/advanced-rag-series-retrieval 2. RankGPT 3. RAG-Fusion 17. **Refinement** 1. CRAG * https://arxiv.org/pdf/2401.15884 * https://medium.com/@kbdhunga/corrective-rag-c-rag-and-control-flow-in-langgraph-d9edad7b5a2c * https://medium.com/@djangoist/how-to-create-accurate-llm-responses-on-large-code-repositories-presenting-cgrag-a-new-feature-of-e77c0ffe432d 18. **Active Retrieval** - re-retrieve and or retrieve from new data sources if retrieved documents are not relevant. 1. CRAG 3. **Post-Retrieval** 1. **Context Enrichment** 1. Sentence Window Retriever * `The text chunk with the highest similarity score represents the best-matching content found. Before sending the content to the LLM we add the k-sentences before and after the text chunk found. This makes sense since the information has a high probability to be connected to the middle part and maybe the piece of information in the middle text chunk is not complete.` 2. Auto-Merging Retriever * `The text chunk with the highest similarity score represents the best-matching content found. Before sending the content to the LLM we add each small text chunk's assigned “parent” chunks, which do not necessarily have to be the chunk before and after the text chunk found.` * We can build on top of that concept and set up a whole hierarchy like a decision tree with different levels of Parent Nodes, Child Nodes and Leaf Nodes. We could for example have 3 levels, with different chunk sizes - See https://docs.llamaindex.ai/en/stable/examples/retrievers/auto_merging_retriever/ 4. **Generation & Post-Generation** 1. **Self-Reflective RAG / Self-RAG** - Fine-tuned models/first paper on it: * https://arxiv.org/abs/2310.11511 * https://github.com/AkariAsai/self-rag - Articles * https://blog.langchain.dev/agentic-rag-with-langgraph/ - Info: * We can use outside systems to quantify the quality of retrieval items and generations, and if necessary, re-perform the query or retrieval with a modified input. 2. **Corrective RAG** - Key Pieces: 1. Retrieval Evaluator: * A lightweight retrieval evaluator is introduced to assess the relevance of retrieved documents. 
- It assigns a confidence score and triggers one of three actions: * Correct: If the document is relevant, refine it to extract key knowledge. * Incorrect: If the document is irrelevant, discard it and perform a web search for external knowledge. * Ambiguous: If the evaluator is uncertain, combine internal and external knowledge sources. 2. Decompose-then-Recompose Algorithm: * A process to refine retrieved documents by breaking them down into smaller knowledge strips, filtering irrelevant content, and recomposing important information. 3. Web Search for Corrections: * When incorrect retrieval occurs, the system leverages large-scale web search to find more diverse and accurate external knowledge.` 3. **Rewrite-Retrieve-Read (RRR)** * https://arxiv.org/pdf/2305.14283 4. **Choosing the appropriate/correct model** 5. **Agents** 6. **Evaluation** - Metrics: - Generation 1. Faithfulness - How factually accurate is the generated answer? 2. Answer Relevancy - How relevant is the generated answer to the question? - Retrieval 1. Context Precision 2. Context Recall - Others 1. Answer semantic Similarity 2. Answer correctness 1. Normalized Discounted Cumulative Gain (NDCG) * https://www.evidentlyai.com/ranking-metrics/ndcg-metric#:~:text=DCG%20measures%20the%20total%20item,ranking%20quality%20in%20the%20dataset. 2. Existing RAG Eval Frameworks * RAGAS - https://archive.is/I8f2w 3. LLM as a Judge * We generate an evaluation dataset -> Then define a so-called critique agent with suitable criteria we want to evaluate -> Set up a test pipeline that automatically evaluates the responses of the LLMs based on the defined criteria. 4. Usage Metrics * Nothing beats real-world data. 5. **Delivery** RAG-Fusion - Combining multiple data source in one RAG search JSON file store Vector indexing ### Chunking - https://github.com/D-Star-AI/dsRAG' - **Improvements/Ideas** * As part of chunk header summary, include where in the document this chunk is located, besides chunk #x, so instead this comes from the portion of hte document talking about XYZ in the greater context - Chunk Headers * The idea here is to add in higher-level context to the chunk by prepending a chunk header. This chunk header could be as simple as just the document title, or it could use a combination of document title, a concise document summary, and the full hierarchy of section and sub-section titles. - Chunks -> segments* * Large chunks provide better context to the LLM than small chunks, but they also make it harder to precisely retrieve specific pieces of information. Some queries (like simple factoid questions) are best handled by small chunks, while other queries (like higher-level questions) require very large chunks. * We break documents up into chunks with metadata at the head of each chunk to help categorize it to the document/align it with the greater context - **Semantic Sectioning** * Semantic sectioning uses an LLM to break a document into sections. It works by annotating the document with line numbers and then prompting an LLM to identify the starting and ending lines for each “semantically cohesive section.” These sections should be anywhere from a few paragraphs to a few pages long. The sections then get broken into smaller chunks if needed. The LLM is also prompted to generate descriptive titles for each section. These section titles get used in the contextual chunk headers created by AutoContext, which provides additional context to the ranking models (embeddings and reranker), enabling better retrieval. 1. 
Identify sections 2. Split sections into chunks 3. Add metadata header to each chunk * `Document: X` * `Section: X1` * Alt: `Concise parent document summary` * Other approaches/bits of info can help/experiment... - **AutoContext** * `AutoContext creates contextual chunk headers that contain document-level and section-level context, and prepends those chunk headers to the chunks prior to embedding them. This gives the embeddings a much more accurate and complete representation of the content and meaning of the text. In our testing, this feature leads to a dramatic improvement in retrieval quality. In addition to increasing the rate at which the correct information is retrieved, AutoContext also substantially reduces the rate at which irrelevant results show up in the search results. This reduces the rate at which the LLM misinterprets a piece of text in downstream chat and generation applications.` - **Relevant Segment Extraction** * Relevant Segment Extraction (RSE) is a query-time post-processing step that takes clusters of relevant chunks and intelligently combines them into longer sections of text that we call segments. These segments provide better context to the LLM than any individual chunk can. For simple factual questions, the answer is usually contained in a single chunk; but for more complex questions, the answer usually spans a longer section of text. The goal of RSE is to intelligently identify the section(s) of text that provide the most relevant information, without being constrained to fixed length chunks. - **Topic Aware Chunking by Sentence** * https://blog.gopenai.com/mastering-rag-chunking-techniques-for-enhanced-document-processing-8d5fd88f6b72?gi=2f39fdede29b ### Vector DBs - Indexing mechanisms * Locality-Sensitive Hashing (LSH) * Hierarchical Graph Structure * Inverted File Indexing * Product Quantization * Spatial Hashing * Tree-Based Indexing variations - Embedding algos * Word2Vec * GloVe * Ada * BERT * Instructor - Similarity Measurement Algos * Cosine similarity - measuring the cosine of two angles * Euclidean distance - measuring the distance between two points - Indexing and Searching Algos - Approximate Nearest Neighbor (ANN) * FAISS * Annoy * IVF * HNSW (Heirarchical Navigable small worlds) - Vector Similarity Search - `Inverted File (IVF)` - `indexes are used in vector similarity search to map the query vector to a smaller subset of the vector space, reducing the number of vectors compared to the query vector and speeding up Approximate Nearest Neighbor (ANN) search. IVF vectors are efficient and scalable, making them suitable for large-scale datasets. However, the results provided by IVF vectors are approximate, not exact, and creating an IVF index can be resource-intensive, especially for large datasets.` - `Hierarchical Navigable Small World (HNSW)` - `graphs are among the top-performing indexes for vector similarity search. HNSW is a robust algorithm that produces state-of-the-art performance with fast search speeds and excellent recall. It creates a multi-layered graph, where each layer represents a subset of the data, to quickly traverse these layers to find approximate nearest neighbors. HNSW vectors are versatile and suitable for a wide range of applications, including those that require high-dimensional data spaces. However, the parameters of the HNSW algorithm can be tricky to tune for optimal performance, and creating an HNSW index can also be resource intensive.` - **Vectorization Process** - Usually several stages: 1. 
Data Pre-processing * `The initial stage involves preparing the raw data. For text, this might include tokenization (breaking down text into words or phrases), removing stop words, and normalizing the text (like lowercasing). For images, preprocessing might involve resizing, normalization, or augmentation.` 2. Feature Extraction * `The system extracts features from the preprocessed data. In text, features could be the frequency of words or the context in which they appear. For images, features could be various visual elements like edges, textures, or color histograms.` 3. Embedding Generation * `Using algorithms like Word2Vec for text or CNNs for images, the extracted features are transformed into numerical vectors. These vectors capture the essential qualities of the data in a dense format, typically in a high-dimensional space.` 4. Dimensionality Reduction * `Sometimes, the generated vectors might be very high-dimensional, which can be computationally intensive to process. Techniques like PCA (Principal Component Analysis) or t-SNE (t-Distributed Stochastic Neighbor Embedding) are used to reduce the dimensionality while preserving as much of the significant information as possible.` 5. Normalization * `Finally, the vectors are often normalized to have a uniform length. This step ensures consistency across the dataset and is crucial for accurately measuring distances or similarities between vectors.` ### Semantic Re-Ranker * `enhances retrieval quality by re-ranking search results based on deep learning models, ensuring the most relevant results are prioritized.` - General Steps 1. Initial Retrieval: a query is processed, and a set of potentially relevant results is fetched. This set is usually larger and broader, encompassing a wide array of documents or data points that might be relevant to the query. 2. LLM / ML model used to identify relevance 3. Re-Ranking Process: In this stage, the retrieved results are fed into the deep learning model along with the query. The model assesses each result for its relevance, considering factors such as semantic similarity, context matching, and the query's intent. 4. Generating a Score: Each result is assigned a relevance score by the model. This scoring is based on how well the content of the result matches the query in terms of meaning, context, and intent. 5. Sorting Results: Based on the scores assigned, the results are then sorted in descending order of relevance. The top-scoring results are deemed most relevant to the query and are presented to the user. 6. Continuous Learning and Adaptation: Many Semantic Rankers are designed to learn and adapt over time. By analyzing user interactions with the search results (like which links are clicked), the Ranker can refine its scoring and sorting algorithms, enhancing its accuracy and relevance. - **Relevance Metrics** - List of: 1. Precision and Recall: These are fundamental metrics in information retrieval. Precision measures the proportion of retrieved documents that are relevant, while recall measures the proportion of relevant documents that were retrieved. High precision means that most of the retrieved items are relevant, and high recall means that most of the relevant items are retrieved. 2. F1 Score: The F1 Score is the harmonic mean of precision and recall. 
It provides a single metric that balances both precision and recall, useful in scenarios where it's important to find an equilibrium between finding as many relevant items as possible (recall) and ensuring that the retrieved items are mostly relevant (precision). 3. Normalized Discounted Cumulative Gain (NDCG): Particularly useful in scenarios where the order of results is important (like web search), NDCG takes into account the position of relevant documents in the result list. The more relevant documents appearing higher in the search results, the better the NDCG. 4. Mean Average Precision (MAP): MAP considers the order of retrieval and the precision at each rank in the result list. It’s especially useful in tasks where the order of retrieval is important but the user is likely to view only the top few results. ### Issues in RAG 1. Indexing - Issues: 1. Chunking 1. Relevance & Precision * `Properly chunked documents ensure that the retrieved information is highly relevant to the query. If the chunks are too large, they may contain a lot of irrelevant information, diluting the useful content. Conversely, if they are too small, they might miss the broader context, leading to accurate responses but not sufficiently comprehensive.` 2. Efficiency & Performance * `The size and structure of the chunks affect the efficiency of the retrieval process. Smaller chunks can be retrieved and processed more quickly, reducing the overall latency of the system. However, there is a balance to be struck, as too many small chunks can overwhelm the retrieval system and negatively impact performance.` 3. Quality of Generation * `The quality of the generated output heavily depends on the input retrieved. Well-chunked documents ensure that the generator has access to coherent and contextually rich information, which leads to more informative, coherent, and contextually appropriate responses.` 4. Scalability * `As the corpus size grows, chunking becomes even more critical. A well-thought-out chunking strategy ensures that the system can scale effectively, managing more documents without a significant drop in retrieval speed or quality.` 1. Incomplete Content Representation * `The semantic information of chunks is influenced by the segmentation method, resulting in the loss or submergence of important information within longer contexts.` 2. Inaccurate Chunk Similarity Search. * `As data volume increases, noise in retrieval grows, leading to frequent matching with erroneous data, making the retrieval system fragile and unreliable.` 3. Unclear Reference Trajectory. * `The retrieved chunks may originate from any document, devoid of citation trails, potentially resulting in the presence of chunks from multiple different documents that, despite being semantically similar, contain content on entirely different topics.` - Potential Solutions - Chunk Optimization - Sliding window * overlapping chunks - Small to Big * Retrieve small chunks then collect parent from meta data - Enhance data granularity - apply data cleaning techniques, like removing irrelevant information, confirming factual accuracy, updating outdated information, etc. - Adding metadata, such as dates, purposes, or chapters, for filtering purposes. - Structural Organization - Heirarchical Index * `In the hierarchical structure of documents, nodes are arranged in parent-child relationships, with chunks linked to them. 
Data summaries are stored at each node, aiding in the swift traversal of data and assisting the RAG system in determining which chunks to extract. This approach can also mitigate the illusion caused by block extraction issues.` - Methods for constructing index: 1. Structural awareness - paragraph and sentence segmentation in docs 2. Content Awareness - inherent structure in PDF, HTML, Latex 3. Semantic Awareness - Semantic recognition and segmentation of text based on NLP techniques, such as leveraging NLTK. 4. Knowledge Graphs 2. Pre-Retrieval - Issues: - Poorly worded queries - Language complexity and ambiguity - Potential Solutions: - Multi-Query - Expand original question into multiple - Sub-Query - `The process of sub-question planning represents the generation of the necessary sub-questions to contextualize and fully answer the original question when combined. ` - Chain-of-Verification(CoVe) - The expanded queries undergo validation by LLM to achieve the effect of reducing hallucinations. Validated expanded queries typically exhibit higher reliability. * https://arxiv.org/abs/2309.11495 - Query Transformation - Rewrite * The original queries are not always optimal for LLM retrieval, especially in real-world scenarios. Therefore, we can prompt LLM to rewrite the queries. - HyDE * `When responding to queries, LLM constructs hypothetical documents (assumed answers) instead of directly searching the query and its computed vectors in the vector database. It focuses on embedding similarity from answer to answer rather than seeking embedding similarity for the problem or query. In addition, it also includes Reverse HyDE, which focuses on retrieval from query to query.` * https://medium.aiplanet.com/advanced-rag-improving-retrieval-using-hypothetical-document-embeddings-hyde-1421a8ec075a?gi=b7fa45dc0f32&source=post_page-----e69b32dc13a3-------------------------------- - Reverse HyDE * - Step-back prompting * https://arxiv.org/abs/2310.06117 * https://cobusgreyling.medium.com/a-new-prompt-engineering-technique-has-been-introduced-called-step-back-prompting-b00e8954cacb - Query Routing * Based on varying queries, routing to distinct RAG pipeline,which is suitable for a versatile RAG system designed to accommodate diverse scenarios. - Metadata Router/Filter * `involves extracting keywords (entity) from the query, followed by filtering based on the keywords and metadata within the chunks to narrow down the search scope.` - Semantic Router * https://medium.com/ai-insights-cobet/beyond-basic-chatbots-how-semantic-router-is-changing-the-game-783dd959a32d - CoVe * https://sourajit16-02-93.medium.com/chain-of-verification-cove-understanding-implementation-e7338c7f4cb5 * https://www.domingosenise.com/artificial-intelligence/chain-of-verification-cove-an-approach-for-reducing-hallucinations-in-llm-outcomes.html - Multi-Query - SubQuery - Query Construction - Text-to-Cypher - Text-to-SQL * https://blog.langchain.dev/query-construction/?source=post_page-----e69b32dc13a3-------------------------------- 3. Retrieval - 3 Main considerations: 1. Retrieval Efficiency 2. Embedding Quality 3. Alignment of tasks, data and models - Sparse Retreiver * EX: BM25, TF-IDF - Dense Retriever * ColBERT * BGE/Cohere embedding/OpenAI-Ada-002 - Retriever Fine-tuning - SFT - LSR (LM-Supervised Retriever) - Reinforcement learning - Adapter * https://arxiv.org/pdf/2310.18347 * https://arxiv.org/abs/2305.17331 ` 4. Post-Retrieval - Primary Challenges: 1. Lost in the middle 2. Noise/anti-fact chunks 3. Context windows. 
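The first post-retrieval mitigation listed below is re-ranking (see also the Semantic Re-Ranker section above). A minimal model-based re-rank sketch, assuming the `sentence-transformers` package and the `cross-encoder/ms-marco-MiniLM-L-6-v2` checkpoint (illustrative assumptions, not prescribed by these notes):

```python
# Model-based re-rank sketch: score (query, chunk) pairs with a cross-encoder and
# keep the top_n highest-scoring chunks. The model name is an illustrative assumption.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
    """Return `chunks` reordered by cross-encoder relevance to `query`, best first."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]


if __name__ == "__main__":
    retrieved = [
        "HNSW is an approximate nearest neighbor index.",
        "The Leiden algorithm detects communities in graphs.",
        "Re-ranking with a cross-encoder improves precision of the final context.",
    ]
    print(rerank("how do I improve retrieval precision?", retrieved, top_n=2))
```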
- Potential Solutions - Re-Rank * Re-rank implementation: https://towardsdatascience.com/enhancing-rag-pipelines-in-haystack-45f14e2bc9f5 - Rule-based re-rank * According to certain rules, metrics are calculated to rerank chunks. * Some: Diversity; Relevance; MRR (Maximal Marginal Relevance, 1998) - Model based rerank * Utilize a language model to reorder the document chunks - Compression & Selection - LLMLingua * https://github.com/microsoft/LLMLingua * https://llmlingua.com/ - RECOMP * https://arxiv.org/pdf/2310.04408 - Selective Context * https://aclanthology.org/2023.emnlp-main.391.pdf - Tagging Filter * https://python.langchain.com/v0.1/docs/use_cases/tagging/ - LLM Critique 5. Generator * Utilize the LLM to generate answers based on the user’s query and the retrieved context information. - Finetuning * SFT * RL * Distillation - Dual FT * `In the RAG system, fine-tuning both the retriever and the generator simultaneously is a unique feature of the RAG system. It is important to note that the emphasis of system fine-tuning is on the coordination between the retriever and the generator. Fine-tuning the retriever and the generator separately separately belongs to the combination of the former two, rather than being part of Dual FT.` * https://arxiv.org/pdf/2310.01352 6. Orchestration * `Orchestration refers to the modules used to control the RAG process. RAG no longer follows a fixed process, and it involves making decisions at key points and dynamically selecting the next step based on the results.` - Scheduling * `The Judge module assesses critical point in the RAG process, determining the need to retrieve external document repositories, the satisfaction of the answer, and the necessity of further exploration. It is typically used in recursive, iterative, and adaptive retrieval.` - `Rule-base` * `The next course of action is determined based on predefined rules. Typically, the generated answers are scored, and then the decision to continue or stop is made based on whether the scores meet predefined thresholds. Common thresholds include confidence levels for tokens.` - `Prompt-base` * `LLM autonomously determines the next course of action. There are primarily two approaches to achieve this. The first involves prompting LLM to reflect or make judgments based on the conversation history, as seen in the ReACT framework. The benefit here is the elimination of the need for fine-tuning the model. However, the output format of the judgment depends on the LLM’s adherence to instructions.` * https://arxiv.org/pdf/2305.06983 - Tuning based * The second approach entails LLM generating specific tokens to trigger particular actions, a method that can be traced back to Toolformer and is applied in RAG, such as in Self-RAG. * https://arxiv.org/pdf/2310.11511 - Fusion * `This concept originates from RAG Fusion. As mentioned in the previous section on Query Expansion, the current RAG process is no longer a singular pipeline. It often requires the expansion of retrieval scope or diversity through multiple branches. Therefore, following the expansion to multiple branches, the Fusion module is relied upon to merge multiple answers.` - Possibility Ensemble * `The fusion method is based on the weighted values of different tokens generated from multiple beranches, leading to the comprehensive selection of the final output. 
- Semantic dissonance
  * `the discordance between your task's intended meaning, the RAG's understanding of it, and the underlying knowledge that's stored.`
- Poor explainability of embeddings
- Semantic search tends to be directionally correct but inherently fuzzy
  * Good for finding top-k results
- Significance of Dimensionality in Vector Embeddings
  * `The dimensionality of a vector, which is the length of the vector, plays a crucial role. Higher-dimensional vectors can capture more information and subtle nuances of the data, leading to more accurate models. However, higher dimensionality also increases computational complexity. Therefore, finding the right balance in vector dimensionality is key to efficient and effective model performance.`

### Potential Improvements when building

https://gist.github.com/Donavan/62e238aa0a40ca88191255a070e356a2

- **Chunking**
  - Relevance & Precision
  - Efficiency and Performance
  - Quality of Generation
  - Scalability
- **Embeddings**
  1. **Encoder Fine-Tuning**
     * `Despite the high efficiency of modern Transformer Encoders, fine-tuning can still yield modest improvements in retrieval quality, especially when tailored to specific domains.`
  2. **Ranker Fine-Tuning**
     * `Employing a cross-encoder for re-ranking can refine the selection of context, ensuring that only the most relevant text chunks are considered.`
  3. **LLM Fine-Tuning**
     * `The advent of LLM fine-tuning APIs allows for the adaptation of models to specific datasets or tasks, enhancing their effectiveness and accuracy in generating responses.`
- **Constructing the Search Index**
  1. **Vector store index**
  2. **Hierarchical Indices**
     * Two-tiered index: one for doc summaries, the other for detailed chunks
     * Filter through the summaries first, then search the matching chunks
  3. **Hypothetical Questions and HyDE approach** (see the sketch below)
     * A novel approach involves generating hypothetical questions for each text chunk. These questions are then vectorized and stored in place of the traditional chunk vectors in the index, which improves semantic alignment between user queries and stored data and can lead to more accurate retrieval. The HyDE method reverses this process: it generates hypothetical responses to queries and uses these as additional data points to refine search accuracy.
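A rough sketch of the hypothetical-questions indexing idea from item 3 above: generate questions per chunk, embed the questions instead of the chunk text, and map the best-matching question back to its chunk at query time. `generate_questions` is a stub standing in for an LLM call (e.g. "write 3 questions this passage answers"), and the model and data are placeholders.

```python
# Sketch: index hypothetical questions instead of raw chunks (assumes sentence-transformers).
import numpy as np
from sentence_transformers import SentenceTransformer

chunks = {
    "c1": "Our refund policy allows returns within 30 days of purchase.",
    "c2": "The API rate limit is 100 requests per minute per key.",
}

def generate_questions(chunk_id: str) -> list[str]:
    # Hypothetical stub; in practice this would be an LLM call over chunks[chunk_id].
    return {
        "c1": ["How long do I have to return an item?", "What is the refund window?"],
        "c2": ["How many API requests can I make per minute?"],
    }[chunk_id]

model = SentenceTransformer("all-MiniLM-L6-v2")
index = []  # list of (question_embedding, chunk_id)
for cid in chunks:
    for q in generate_questions(cid):
        index.append((model.encode(q, normalize_embeddings=True), cid))

def retrieve(query: str) -> str:
    q_emb = model.encode(query, normalize_embeddings=True)
    best = max(index, key=lambda pair: float(np.dot(pair[0], q_emb)))
    return chunks[best[1]]  # return the chunk whose hypothetical question best matches

print(retrieve("can I send a product back after two weeks?"))
```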
- **Context Enrichment**
  1. **Sentence-Window retrieval**
     * `This technique enhances search precision by embedding individual sentences and extending the search context to include neighboring sentences. This not only improves the relevance of the retrieved data but also provides the LLM with a richer context for generating responses.`
  2. **Auto-merging Retriever** (Parent Document Retriever)
     * `Similar to the Sentence Window Retrieval, this method focuses on granularity but extends the context more broadly. Documents are segmented into a hierarchy of chunks, and smaller, more relevant pieces are initially retrieved. If multiple small chunks relate to a larger segment, they are merged to form a comprehensive context, which is then presented to the LLM.`
  3. **Fusion Retrieval**
     * `The concept of fusion retrieval combines traditional keyword-based search methods, like TF-IDF or BM25, with modern vector-based search techniques. This hybrid approach, often implemented using algorithms like Reciprocal Rank Fusion (RRF), optimizes retrieval by integrating diverse similarity measures.`
- **Re-Ranking & Filtering**
  * `After the initial retrieval of results using any of the aforementioned sophisticated algorithms, the focus shifts to refining these results through various post-processing techniques.`
  * `Various systems enable the fine-tuning of retrieval outcomes based on similarity scores, keywords, metadata, or through re-ranking with additional models. These models could include an LLM, a sentence-transformer cross-encoder, or even external reranking services like Cohere. Moreover, filtering can also be adjusted based on metadata attributes, such as the recency of the data, ensuring that the most relevant and timely information is prioritized. This stage is critical as it prepares the retrieved data for the final step - feeding it into an LLM to generate the precise answer.`
  1. f
  2. f
- **Query Transformations**
  1. **(Sub-)Query Decomposition**
     * `For complex queries that are unlikely to yield direct comparisons or results from existing data (e.g., comparing GitHub stars between Langchain and LlamaIndex), an LLM can break down the query into simpler, more manageable sub-queries. Each sub-query can then be processed independently, with their results synthesized later to form a comprehensive response.`
     * Multi Query Retriever and Sub Question Query Engine
     - Step-back Prompting
       * `This method involves using an LLM to generate a broader or more general query from the original, complex query. The aim is to retrieve a higher-level context that can serve as a foundation for answering the more specific original query. The contexts from both the original and the generalized queries are then combined to enhance the final answer generation.`
     - Query Rewriting
       * https://archive.is/FCiaW
       * `Another technique involves using an LLM to reformulate the initial query to improve the retrieval process.`
  2. **Reference Citations**
     - Direct Source Mention
       * Require source IDs to be mentioned directly in the generated response.
     - Fuzzy Matching (see the sketch after this section)
       * Align portions of the response with their corresponding text chunks in the index.
     - Research:
       - Attribution Bench: https://osu-nlp-group.github.io/AttributionBench/
         * Fine-tuned T5 models outperform otherwise-SOTA models.
         * Complexity of questions and data are issues.
       - ContextCite: https://gradientscience.org/contextcite/
         * Hot shit?
         * https://gradientscience.org/contextcite-applications/
       - Metrics
         - "Enabling Large Language Models to Generate Text with Citations" paper
           * https://arxiv.org/abs/2305.14627
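As a toy version of the fuzzy-matching idea under Reference Citations, the sketch below aligns each answer sentence to its most similar source chunk using only the standard library; the 0.4 similarity threshold and the naive sentence splitting are arbitrary and would need tuning on real data.

```python
# Sketch: fuzzy-match answer sentences back to source chunks for citations (stdlib only).
from difflib import SequenceMatcher

chunks = {
    "[1]": "The Eiffel Tower was completed in 1889 for the World's Fair in Paris.",
    "[2]": "It stands roughly 330 metres tall including its antennas.",
}
answer = "The tower was finished in 1889 for the Paris World's Fair. It is about 330 metres tall."

def cite(sentence: str, threshold: float = 0.4) -> str:
    """Attach the ID of the most similar chunk, if similarity clears the threshold."""
    best_id, best_score = None, 0.0
    for cid, text in chunks.items():
        score = SequenceMatcher(None, sentence.lower(), text.lower()).ratio()
        if score > best_score:
            best_id, best_score = cid, score
    return f"{sentence} {best_id}" if best_score >= threshold else sentence

for sent in answer.split(". "):
    print(cite(sent.strip().rstrip(".") + "."))
```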
- **Chat Engine**
  1. ContextChatEngine
     * `A straightforward approach where the LLM retrieves context relevant to the user's query along with any previous chat history. This history is then used to inform the LLM's response, ensuring continuity and relevance in the dialogue.`
  2. CondensePlusContextMode
     * `A more advanced technique where each interaction's chat history and the last message are condensed into a new query. This refined query is used to retrieve relevant context, which, along with the original user message, is passed to the LLM for generating a response.`
- **Query Routing**
  * `Query routing involves strategic decision-making powered by an LLM to determine the most effective subsequent action based on the user's query. This could include decisions to summarize information, search specific data indices, or explore multiple routes to synthesize a comprehensive answer. Query routers are crucial for selecting the appropriate data source or index, especially in systems where data is stored across multiple platforms, such as vector stores, graph databases, or relational databases.`
  - Query Routers
    * F
- **Agents in RAG Systems**
  1. **Multi-Document Agent Scheme**
  2. **Walking RAG** - Multi-shot retrieval
     - Have the LLM ask for more information as needed, perform searches for that information, and loop back to asking the LLM whether there is now enough to answer.
     - Things necessary to facilitate this:
       * We need to extract partial information from retrieved pieces of source data, so we can learn as we go.
       * We need to find new places to look, informed by the source data as well as the question.
       * We need to retrieve information from those specific places.
     * Links:
       * https://olickel.com/retrieval-augmented-research-1-basics
       * https://olickel.com/retrieval-augmented-research-2-walking
       * https://olickel.com/retrieval-augmented-research-3-use-the-whole-brain
  3. F
- **Response Synthesizer**
  * `The simplest method might involve merely concatenating all relevant context with the query and processing it through an LLM. However, more nuanced approaches involve multiple LLM interactions to refine the context and enhance the quality of the final answer.`
  1. Iterative Refinement
     * `Breaking down the retrieved context into manageable chunks and sequentially refining the response through multiple LLM interactions.`
  2. Context Summarization
     * `Compressing the extensive retrieved context to fit within an LLM's prompt limitations.`
  3. Multi-Answer Generation
     * `Producing several responses from different context segments and then synthesizing these into a unified answer.`
- **Evaluating RAG Performance**
  - Semantic + Relevance Ranking
    - One example:
      * `rank = (cosine similarity) + (weight) x (relevance score)`
  - Embedding models need to be fine-tuned to your data for best results
    * `For your Q&A system built on support docs, you very well may find that question→question comparisons will materially improve performance as opposed to question→support doc. Pragmatically, you can ask ChatGPT to generate example questions for each support doc and have a human expert curate them. In essence you'd be pre-populating your own Stack Overflow.`
  - Can create semi-synthetic training data based on your documents
  - Want to take this "Stack Overflow" methodology one step further? (See the sketch after this list.)
    1. For each document, ask ChatGPT to generate a list of 100 questions it can answer
    2. These questions won't be perfect, so for each question you generate, compute cosine similarities with every other document
    3. Keep the questions that would rank the correct document #1 against every other document
    4. Identify the highest-quality questions by sorting on the margin between the cosine similarity of the correct document and that of the second-ranked document
    5. Send to a human for further curation
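A sketch of steps 2-4 of the "Stack Overflow" methodology above: score each generated question against every document, keep only questions whose source document ranks first, and sort by the margin over the runner-up. The documents, questions, and model are placeholders, assuming the `sentence-transformers` package.

```python
# Sketch: filter LLM-generated questions by whether their source doc ranks #1,
# then sort by the similarity margin over the second-ranked document.
import numpy as np
from sentence_transformers import SentenceTransformer

docs = ["How to reset your password ...", "Billing and refund policy ...", "API authentication guide ..."]
# (question, index of the doc it was generated from) - placeholders for LLM output
questions = [("How do I change a forgotten password?", 0),
             ("Can I get my money back after a charge?", 1),
             ("Which header carries the API key?", 2)]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = model.encode(docs, normalize_embeddings=True)

kept = []
for q, correct in questions:
    sims = doc_emb @ model.encode(q, normalize_embeddings=True)   # cosine vs every doc
    if int(np.argmax(sims)) == correct:                           # step 3: correct doc must rank #1
        margin = sims[correct] - np.partition(sims, -2)[-2]       # step 4: gap to second place
        kept.append((float(margin), q))

for margin, q in sorted(kept, reverse=True):                      # best-separated questions first
    print(f"{margin:.3f}  {q}")
```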
- **Balancing Precision vs Recall**
  - Ways to shift the balance:
    1. Threshold Tuning: Adjusting the threshold for deciding whether a document is relevant can shift the balance between precision and recall. Lowering the threshold may increase recall but decrease precision, and vice versa. (A small threshold-sweep sketch appears at the end of this section.)
    2. Query Expansion and Refinement: Enhancing the query with additional keywords (query expansion) can increase recall by retrieving a broader set of documents. Conversely, refining the query with more specific terms can improve precision.
    3. Relevance Feedback: Incorporating user feedback into the retrieval process can help refine the search results. Users' interactions with the results (clicks, time spent on a document, etc.) provide valuable signals for adjusting the balance between precision and recall.
    4. Use of Advanced Models: Employing more sophisticated models, such as deep neural networks, can improve both precision and recall. These models are better at understanding complex queries and documents, leading to more accurate retrieval.
    5. Customizing Based on Use Case: Different applications may require a different balance of precision and recall. For instance, in a legal document search, precision might matter more so that all retrieved documents are highly relevant; in a medical research scenario, recall might be prioritized so that no relevant studies are missed.
- **Prompt Complexity**
  1. Single fact retrieval
  2. Multi-fact retrieval
  3. Discontiguous multi-fact retrieval
  4. Simple analysis questions
  5. Complex analysis
  6. Research-level questions
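To illustrate the threshold-tuning lever (item 1 under Balancing Precision vs Recall above), here is a tiny sweep over a made-up set of similarity scores and relevance judgments; lowering the cutoff raises recall at the cost of precision.

```python
# Sketch: sweep a similarity threshold and watch precision/recall trade off.
# Scores and relevance labels are made up; in practice they come from your eval set.
scores   = [0.91, 0.84, 0.77, 0.70, 0.62, 0.55, 0.41, 0.30]   # retrieval similarity per candidate
relevant = [1,    1,    0,    1,    0,    1,    0,    0   ]   # ground-truth judgments

def precision_recall(threshold: float):
    retrieved = [r for s, r in zip(scores, relevant) if s >= threshold]
    tp = sum(retrieved)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / sum(relevant)
    return precision, recall

for t in (0.8, 0.6, 0.4):
    p, r = precision_recall(t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```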