# RAG Notes Limit context size Order preservation of chunks Retrieve the top k chunks based on similarity scores, but preserve the original order of the chunks as they appeared in the document. Unsorted https://levelup.gitconnected.com/testing-18-rag-techniques-to-find-the-best-094d166af27f https://www.jeremykun.com/2015/04/06/markov-chain-monte-carlo-without-all-the-bullshit/ https://medium.com/ai-exploration-journey/how-hirag-turns-data-chaos-into-structured-knowledge-magic-ai-innovations-and-insights-35-d637b9a58d80 https://arxiv.org/pdf/2506.00054 https://arxiv.org/abs/2511.09803 https://arxiv.org/abs/2508.10419 https://arxiv.org/abs/2511.08505 https://arxiv.org/abs/2511.08245 https://arxiv.org/abs/2510.20260 https://github.com/grantflow-ai/grantflow https://arxiv.org/abs/2507.05093 https://arxiv.org/abs/2507.02962 https://arxiv.org/abs/2507.05713 https://docs.kiln.tech/docs/evaluations/evaluate-rag-accuracy-q-and-a-evals https://arxiv.org/html/2507.06554v2#S5 https://arxiv.org/abs/2507.07426 https://arxiv.org/html/2507.04055v2#S4 https://arxiv.org/html/2511.07328v1 https://jxnl.co/writing/2025/09/11/the-rag-mistakes-that-are-killing-your-ai-skylar-payne/ https://huggingface.co/datasets/isaacus/open-australian-legal-corpus https://huggingface.co/blog/adlumal/lightning-fast-vector-search-for-legal-documents https://github.com/hhy-huang/HiRAG https://archive.is/W64q3 https://github.com/trustgraph-ai/trustgraph https://archive.is/G9rmp https://github.com/merveenoyan/smol-vision/blob/main/Image_Search_with_MetaCLIP2.ipynb https://huggingface.co/merve/smol-vision/blob/main/Image_Search_with_MetaCLIP2.ipynb Propositional Chunking https://deepwiki.com/realyinchen/RAG/2.3-proposition-chunking https://eclabs.ai/proposition-based-chunking https://chamomile.ai/reliable-rag-with-data-preprocessing/ https://discovery.graphsandnetworks.com/graphAI/rags.html Factual Analysis https://arxiv.org/html/2404.12065v1 https://github.com/farrukhrashid1997/Fathom https://aclanthology.org/2024.fever-1.5/ https://github.com/uhh-hcds/UHH-at-AVeriTeC https://aclanthology.org/2024.fever-1.6/ https://aclanthology.org/2024.fever-1.8/ https://arxiv.org/abs/2407.17023 https://arxiv.org/abs/2401.15498 https://aclanthology.org/2025.fever-1.20.pdf https://aclanthology.org/2025.fever-1.19/ https://arxiv.org/abs/2504.14175 https://arxiv.org/abs/2503.01670 https://arxiv.org/pdf/2305.13117 https://arxiv.org/html/2506.19607v1 https://huggingface.co/datasets/UCSC-IRKM/RAGuard https://arxiv.org/abs/2402.10735 https://arxiv.org/abs/2502.09083 https://fever.ai/task.html https://github.com/ssu-humane/HerO2 https://github.com/heruberuto/FEVER-8-Shared-Task/tree/main https://gitlab.kit.edu/utjwp/Fact-Checking_GoT_RAG_LLM https://github.com/Raldir/FEVER-8-Shared-Task https://medium.com/data-science-collective/comprehensive-guide-to-fine-tuning-llm-4a8fd4d0e0af https://apxml.com/courses/fine-tuning-adapting-large-language-models/chapter-6-evaluation-analysis-fine-tuned-models/assessing-factual-accuracy https://deepgram.com/learn/truthfulqa-llm-benchmark-guide https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation https://huggingface.co/FacebookAI/roberta-large-mnli https://huggingface.co/models?sort=modified&search=nli https://huggingface.co/cross-encoder/nli-deberta-v3-base https://huggingface.co/cross-encoder/nli-deberta-v3-large https://aclanthology.org/volumes/2023.fever-1/ https://aclanthology.org/volumes/2024.fever-1/ 
https://www.microsoft.com/en-us/research/blog/claimify-extracting-high-quality-claims-from-language-model-outputs/ https://www.microsoft.com/en-us/research/publication/towards-effective-extraction-and-evaluation-of-factual-claims/ https://arxiv.org/html/2312.06648v3#S3 https://raw.githubusercontent.com/NirDiamant/RAG_Techniques/refs/heads/main/all_rag_techniques/HyDe_Hypothetical_Document_Embedding.ipynb https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/HyPE_Hypothetical_Prompt_Embeddings.ipynb https://arxiv.org/pdf/2502.10855 https://huggingface.co/datasets/microsoft/claimify-dataset/viewer/default/train?row=0&views%5B%5D=train https://gist.github.com/deshwalmahesh/55ad890a8e4bac8155aaf4058f06cfdb https://arxiv.org/pdf/2406.19803 https://iopscience.iop.org/article/10.1088/2632-2153/ad7228 https://github.com/lamm-mit/GraphReasoning https://arxiv.org/abs/2409.13740 https://arxiv.org/html/2503.10150v2 https://kaitchup.substack.com/p/multimodal-rag-with-colpali-and-qwen2?utm_campaign=posts-open-in-app&triedRedirect=true https://bytevagabond.com/post/how-to-build-enterprise-ai-rag/ https://archive.is/dnSIi https://github.com/Future-House/paper-qa https://medium.com/@richardhightower/stop-the-hallucinations-hybrid-retrieval-with-bm25-pgvector-embedding-rerank-llm-rubric-rerank-895d8f7c7242 https://arxiv.org/html/2510.02657v2#S6 https://www.lesswrong.com/posts/bT8yyJHpK64v3nh2N/ai-safety-chatbot https://github.com/tbroadley/ai-safety-conversational-agent https://github.com/yichuan-w/LEANN?tab=readme-ov-file https://medium.com/@bravekjh/rag-mcp-supercharging-ai-agents-with-retrieval-augmented-generation-and-model-context-protocol-775ff5a7206f https://github.com/Aquiles-ai/Aquiles-RAG https://www.reddit.com/r/Rag/comments/1n98ua7/building_a_productiongrade_rag_on_a_900page/ https://integral-business-intelligence.github.io/archivist/ https://huggingface.co/blog/thebajajra/rexbert-encoders https://cloud.google.com/generative-ai-app-builder/docs/check-grounding#claim-level-score-response-examples https://universalrag.github.io/ https://engineeringblog.yelp.com/2025/02/search-query-understanding-with-LLMs.html https://github.com/facebookresearch/ReasonIR/tree/main https://huggingface.co/lightonai/Reason-ModernColBERT https://arxiv.org/pdf/2407.12883 https://arxiv.org/pdf/2505.22095 https://github.com/hzy312/knowledge-r1/blob/main/README_Search-R1.md https://arxiv.org/html/2503.21729v3 https://github.com/yale-nlp/MCTS-RAG https://www.tutorialspoint.com/basic-understanding-of-cure-algorithm https://cloud.google.com/generative-ai-app-builder/docs/check-grounding https://arxiv.org/abs/2508.13107 https://github.com/SciPhi-AI/R2R/ https://github.com/Aquiles-ai/Aquiles-RAG https://aclanthology.org/2024.acl-long.585/ https://dl.acm.org/doi/10.1145/3626772.3657853 https://aclanthology.org/2024.findings-emnlp.449/ https://aclanthology.org/2024.acl-long.540/ https://github.com/calubkk/RAAT https://github.com/microsoft/LMOps/tree/main/corag https://www.figma.com/slides/eq8UzeTluT3qLKOpzlLj8y/LlamaIndex-Talk--Document-Agents-?node-id=51004-814&t=7F9MAlkLnukNb3f6-0 https://ai.plainenglish.io/visual-grounding-rag-with-docling-9696d02054f2 https://lukew.com/ff/entry.asp?2106?ref=sidebar https://github.com/topoteretes/cognee https://arxiv.org/abs/2412.02592 https://github.com/Bessouat40/RAGLight https://github.com/Aquiles-ai/Aquiles-RAG https://pub.towardsai.net/demystifying-googles-data-gemma-f07a470c2a39?gi=1e0e15c7b54a 
https://github.com/CODE-AXION/rag-best-practices/tree/main?tab=readme-ov-file#prompt https://github.com/MinghoKwok/DeepSieve https://arxiv.org/abs/2508.13107 https://archive.is/15vEV#selection-224.0-332.0 https://github.com/BevinV/Interactive-Rag https://www.aryn.ai/ https://landing.ai/ https://github.com/OpenBMB/UltraRAG https://github.com/bRAGAI/bRAG-langchain https://docs.llamaindex.ai/en/stable/examples/memory/Mem0Memory/ https://github.com/run-llama/llama_parse/blob/main/examples/advanced_rag/dynamic_section_retrieval.ipynb https://m3docrag.github.io/ https://medium.com/knowledge-nexus-ai/introducing-lightrag-a-new-era-in-retrieval-augmented-generation-57c20d6081fa https://arxiv.org/abs/2409.10038 https://arxiv.org/html/2410.22349v1 https://mltechniques.com/2024/10/08/building-a-ranking-system-to-enhance-prompt-results/ https://openragmoe.github.io/ https://github.com/Ariya12138/CORAL ttps://arxiv.org/abs/2410.07176 https://arxiv.org/abs/2410.04343 https://github.com/Ancientshi/ERM4 https://github.com/QingFei1/LongRAG https://www.reddit.com/r/LangChain/comments/1dzfp48/agent_retrieval_how_we_almost_always_find_the/ https://www.reddit.com/r/LangChain/comments/1dtr49t/agent_rag_parallel_quotes_how_we_built_rag_on/ https://arxiv.org/abs/2505.23052 https://help.openai.com/en/articles/10169521-projects-in-chatgpt https://github.com/ag2ai/SimpleDoc https://help.openai.com/en/articles/10169521-projects-in-chatgpt https://medium.com/@evgeni.n.rusev/multi-agent-ai-with-hybrid-search-cutting-document-review-time-by-80-f7367a9b1361 https://research.google/blog/muvera-making-multi-vector-retrieval-as-fast-as-single-vector-search/ https://www.midswirl.com/blog/road-to-sqlite-vec-exploring-sqlite-as-a-rag-vector-database https://github.com/bytedance/Dolphin https://github.com/AppFlowy-IO/AppFlowy https://levelup.gitconnected.com/creating-the-best-rag-finder-pipeline-for-your-dataset-88062a6fa45e https://medium.com/data-science-collective/let-users-talk-to-your-databases-build-a-rag-powered-sql-assistant-with-streamlit-4fc9a2ee3960 https://www.reddit.com/r/LocalLLaMA/comments/1khjrtj/building_llm_workflows_some_observations/ https://abdullin.com/ilya/how-to-build-best-rag/ https://github.com/IlyaRice/RAG-Challenge-2 https://github.com/shredEngineer/Archive-Agent/tree/main/archive_agent/core https://arxiv.org/html/2504.01818v1 https://github.com/waetr/KET-RAG https://arxiv.org/abs/2502.09304 https://github.com/CornelliusYW/Multimodal-RAG-Implementation https://github.com/mixedbread-ai/maxsim-cpu https://arxiv.org/html/2506.23026v1#S4 https://arxiv.org/html/2504.15629v2#bib https://arxiv.org/abs/2504.15629v2 https://arxiv.org/abs/2505.17238v2 https://arxiv.org/abs/2404.19543v2 https://arxiv.org/html/2506.16035v2#S6 https://arxiv.org/abs/2507.02962v3 https://github.com/DavidZWZ/Awesome-RAG-Reasoning https://arxiv.org/abs/2507.12455v1 https://medium.com/data-science/rags-with-query-routing-5552e4e41c54 https://huggingface.co/vidore/colqwen-omni-v0.1 https://www.ragie.ai/blog/how-we-built-multimodal-rag-for-audio-and-video https://arxiv.org/abs/2507.12455v1 https://github.com/jolibrain/colette/tree/main https://zenn.dev/microsoft/articles/rag_textbook https://arxiv.org/abs/2507.09477 https://github.com/jolibrain/colette https://contextual.ai/blog/document-parser-for-rag/ https://docs.contextual.ai/api-reference/parse/parse-file https://arxiv.org/pdf/2506.00054 Factual Extraction RAG https://medium.com/@eliot64/bridging-legal-ai-and-trust-how-we-won-the-llm-x-law-hackathon-45081a8681d9 
https://github.com/pkargupta/claimspect Models https://huggingface.co/THUDM/GLM-Z1-32B-0414 https://huggingface.co/collections/THUDM/glm-4-0414-67f3cbcb34dd9d252707cb2e Agentic RAG https://arxiv.org/pdf/2501.09136 https://github.com/asinghcsu/AgenticRAG-Survey Graphs (GraphRAG / KnowledgeGraphs) GARLIC https://arxiv.org/pdf/2410.04790v1 101 https://www.byhand.ai/p/beginners-guide-to-graph-rag https://towardsdatascience.com/graph-rag-a-conceptual-introduction-41cd0d431375/ About https://github.com/zjukg/KG-LLM-Papers https://arxiv.org/pdf/2502.11371 https://generativeai.pub/graph-rag-has-awesome-potential-but-currently-has-serious-flaws-c052a8a3107e?gi=3852d6a64701 https://news.ycombinator.com/item?id=34605772 https://medium.com/@dickson.lukose/ontology-modelling-and-engineering-4df8b6b9f3a5 https://arxiv.org/html/2411.15671v1 https://ai.plainenglish.io/metagraphs-and-hypergraphs-for-complex-ai-agent-memory-and-rag-717f6f3589f5 https://generativeai.pub/knowledge-graph-extraction-visualization-with-local-llm-from-unstructured-text-a-history-example-94c63b366fed?gi=876173cfaae4 https://medium.com/@dickson.lukose/ontology-reasoning-imperative-for-intelligent-graphrag-part-1-2-0018265b987c https://medium.com/@ianormy/microsoft-graphrag-with-an-rdf-knowledge-graph-part-2-d8d291a39ed1 Creation of https://arxiv.org/abs/2412.03589 https://iopscience.iop.org/article/10.1088/2632-2153/ad7228/pdf https://arxiv.org/abs/2411.19539 Knowledge-Augmented Gen https://github.com/OpenSPG/KAG https://pub.towardsai.net/kag-graph-multimodal-rag-llm-agents-powerful-ai-reasoning-b3da38d31358 Implement https://towardsdatascience.com/how-to-build-a-graph-rag-app-b323fc33ba06/ https://blog.gopenai.com/llm-ontology-prompting-for-knowledge-graph-extraction-efdcdd0db3a1?gi=23252228d718 https://pub.towardsai.net/building-a-knowledge-graph-from-unstructured-text-data-a-step-by-step-guide-c14c926c2229 https://arxiv.org/abs/2408.04187 https://towardsdatascience.com/how-to-query-a-knowledge-graph-with-llms-using-grag-38bfac47a322/ https://github.com/zjunlp/OneKE https://ai.plainenglish.io/unified-knowledge-graph-model-rdf-rdf-vs-lpg-the-end-of-war-a7c14d6ac76f https://neuml.hashnode.dev/advanced-rag-with-graph-path-traversal https://www.youtube.com/watch?v=g6xBklAIrsA https://towardsdatascience.com/how-to-implement-graph-rag-using-knowledge-graphs-and-vector-databases-60bb69a22759 https://towardsdatascience.com/text-to-knowledge-graph-made-easy-with-graph-maker-f3f890c0dbe8 https://towardsdatascience.com/how-to-convert-any-text-into-a-graph-of-concepts-110844f22a1a/ https://towardsdatascience.com/building-a-knowledge-graph-from-scratch-using-llms-f6f677a17f07/ https://medium.com/thoughts-on-machine-learning/building-dynamic-knowledge-graphs-using-open-source-llms-06a870e1bc4f https://ai.plainenglish.io/modeling-ai-semantic-memory-with-knowledge-graphs-1ce06f683433 https://towardsdatascience.com/how-to-convert-any-text-into-a-graph-of-concepts-110844f22a1a/ https://arxiv.org/abs/2412.04119 https://medium.com/@infiniflowai/how-our-graphrag-reveals-the-hidden-relationships-of-jon-snow-and-the-mother-of-dragons-bd89084f64ec https://pub.towardsai.net/exploring-and-comparing-graph-based-rag-approaches-microsoft-graphrag-vs-neo4j-langchain-3837cd3dddef https://medium.com/@researchgraph/dynamic-knowledge-graphs-a-next-step-for-data-representation-c35a205a520a https://arxiv.org/abs/2501.02157 https://arxiv.org/abs/2412.03589 https://github.com/circlemind-ai/fast-graphrag https://github.com/gusye1234/nano-graphrag Models 
https://www.arxiv.org/abs/2502.13339 Retrieval https://generativeai.pub/advanced-rag-retrieval-strategies-using-knowledge-graphs-12c9ce54d2da Context Relevancy https://arxiv.org/abs/2404.10198 https://github.com/kevinwu23/StanfordClashEval ``` @Rock-star-007 from reddit: I added 3 nodes in my agent - query intent identifier, query rephraser , search_scope identifier. You try to identify what what the user is intending to ask. The rephrase the question as per the intent. You need to write prompts for rephrasing accordingly. Then identify what specific documents the user might be talking about. Then perform retrieval step with appropriate scope, then do generation. ``` Creating QA Knowledgebase based off chats; ``` For every chat conversation you generate an LLM summary with in two part (problem/solution) Then you you do an embedding of the problem part for your RAG. Next time a user explain a issue, you RAG it (each summaries) and you got a collection of similar case. ---------- dump chats out, extract timestamps and sender lines with a regex (which breaks every third export for reasons unknown). You want that in the metadata. split into chunks, toss each chunk at an LLM with a prompt like: “give me 3-4 questions someone might ask that are answered in this chunk, and the matching answers” so you’re generating Qs for the indexed content, not just summarizing. sometimes it hallucinated wild stuff, but it’s faster than hand-labeling batch QA pairs into a big ugly CSV, feed to vector DB ------- prepare an llm(system prompt, proper model and few shot prompts of human generated summaries are enough) for summarizing chats and make it output a json object in a schema, including prompt-response pairs(problem/question - fix/response) ``` Improvements https://arxiv.org/pdf/2501.07391 https://cobusgreyling.medium.com/four-levels-of-rag-research-from-microsoft-fdc54388f0ff https://arxiv.org/html/2412.00239v1#S6 https://towardsdatascience.com/advanced-retrieval-techniques-for-better-rags-c53e1b03c183/ https://towardsdatascience.com/5-proven-query-translation-techniques-to-boost-your-rag-performance-47db12efe971/ https://arxiv.org/abs/2412.19442 https://pub.towardsai.net/revisiting-chunking-in-the-rag-pipeline-9aab8b1fdbe7 Bias https://arxiv.org/pdf/2502.17390 Fine-tuning RAG Models https://arxiv.org/abs/2412.15563 CORAG https://arxiv.org/abs/2501.14342 PA-RAG https://arxiv.org/abs/2412.14510 RAFT https://arxiv.org/abs/2403.10131 Multi-Modal https://arxiv.org/pdf/2502.08826v2 RAG 'Styles' Adaptive https://arxiv.org/pdf/2505.22095 https://arxiv.org/abs/2503.21729 Agentic https://arxiv.org/pdf/2501.09136v1 https://arxiv.org/pdf/2412.17149 https://research.google/blog/chain-of-agents-large-language-models-collaborating-on-long-context-tasks/ MMOA-RAG https://arxiv.org/abs/2501.15228 Search-O1 https://github.com/sunnynexus/Search-o1 AssistRAG https://github.com/smallporridge/AssistRAG Cache-Augmented-Generation https://arxiv.org/abs/2412.15605 CRAG - Corrective RAG https://arxiv.org/abs/2401.15884 DeepRAG https://arxiv.org/abs/2502.01142 FACT - Multi-fact retrieval https://arxiv.org/abs/2410.21012 FlexRAG https://arxiv.org/pdf/2409.15699v1 GEAR https://arxiv.org/abs/2501.02772 HTMLRag - Retrieval works better with structured text like HTML vs unstructured plaintext https://github.com/plageon/HtmlRAG https://arxiv.org/abs/2411.02959v1 Knowledge-Agumented Retrieval https://medium.com/@samarrana407/why-knowledge-augmented-generation-kag-is-the-best-approach-to-rag-2e7820228087 Knowledge-Aware-Retrieval 
https://arxiv.org/abs/2410.13765 MAIN-rag https://arxiv.org/abs/2501.00332 MCTS https://github.com/yale-nlp/MCTS-RAG RankCOT https://arxiv.org/pdf/2502.17888 https://github.com/NEUIR/RankCoT RARE - Retrieval-Augmented Reasoning Enhancement for Large Language Models https://arxiv.org/abs/2412.02830 RaRE - RAG with in-contex examples(?) https://arxiv.org/abs/2410.20088 Speculative RAG https://research.google/blog/speculative-rag-enhancing-retrieval-augmented-generation-through-drafting/ ReACT https://arxiv.org/abs/2210.03629 Review-then-Refine https://arxiv.org/abs/2412.15101 TableRAG https://arxiv.org/pdf/2410.04739v1 TrustRAG https://arxiv.org/pdf/2501.00879 VLM https://github.com/HKUDS/LightRAG https://github.com/llm-lab-org/Multimodal-RAG-Survey https://lascari.ai/writing/2025/02/10/image-gen-tagging/ https://arxiv.org/abs/2502.08826 https://github.com/Lokesh-Chimakurthi/vision-rag https://github.com/AhmedAl93/multimodal-agentic-RAG https://github.com/tjmlabs/ColiVara https://huggingface.co/vidore/colqwen2-v0.1 https://colivara.com/ https://github.com/lesteroliver911/pdf-analyzer-with-page-citations https://pub.towardsai.net/multimodal-rag-unveiled-a-deep-dive-into-cutting-edge-advancements-0eeb514c3ac4 https://arxiv.org/abs/2502.20964 Ranking https://arxiv.org/pdf/2409.11598 Retrieval https://arxiv.org/abs/2412.03736 https://arxiv.org/abs/2409.14924 https://weaviate.io/blog/late-chunking https://softwaredoug.com/blog/2025/02/08/elasticsearch-hybrid-search https://arxiv.org/abs/2412.03736 https://softwaredoug.com/blog/2024/08/06/i-made-search-worse-elasticsearch https://arxiv.org/pdf/2501.00332 DRIFT https://www.microsoft.com/en-us/research/blog/introducing-drift-search-combining-global-and-local-search-methods-to-improve-quality-and-efficiency/ FunnelRAG https://arxiv.org/pdf/2410.10293v1 GoldenRetriever https://ai.plainenglish.io/a-deep-dive-into-golden-retriever-eea3396af3b4 Mixture-of-PageRanks https://arxiv.org/abs/2412.06078 https://www.zyphra.com/post/the-mixture-of-pageranks-retriever-for-long-context-pre-processing DRAGIN https://arxiv.org/abs/2403.10081 https://github.com/oneal2000/DRAGIN/tree/main MBA-RAG https://arxiv.org/abs/2412.01572 https://github.com/FUTUREEEEEE/MBA Vector-related https://www.louisbouchard.ai/indexing-methods/ ### Links - RAG 101 * https://www.youtube.com/watch?v=nc0BupOkrhI * https://arxiv.org/abs/2401.08406 * https://github.com/NirDiamant/RAG_Techniques?tab=readme-ov-file * https://github.com/jxnl/n-levels-of-rag * https://winder.ai/llm-architecture-rag-implementation-design-patterns/ * https://medium.com/@yufan1602/modular-rag-and-rag-flow-part-%E2%85%B0-e69b32dc13a3 https://news.ycombinator.com/item?id=42174829 - RAG 201 * https://medium.com/@cdg2718/why-your-rag-doesnt-work-9755726dd1e9 * https://www.cazton.com/blogs/technical/advanced-rag-techniques * https://medium.com/@krtarunsingh/advanced-rag-techniques-unlocking-the-next-level-040c205b95bc * https://pub.towardsai.net/advanced-rag-techniques-an-illustrated-overview-04d193d8fec6 * https://winder.ai/llm-architecture-rag-implementation-design-patterns/ * https://towardsdatascience.com/17-advanced-rag-techniques-to-turn-your-rag-app-prototype-into-a-production-ready-solution-5a048e36cdc8 * https://medium.com/@samarrana407/mastering-rag-advanced-methods-to-enhance-retrieval-augmented-generation-4b611f6ca99a * https://generativeai.pub/advanced-rag-retrieval-strategy-query-rewriting-a1dd61815ff0 * https://medium.com/@yufan1602/modular-rag-and-rag-flow-part-%E2%85%B0-e69b32dc13a3 * 
https://pub.towardsai.net/rag-architecture-advanced-rag-3fea83e0d189?gi=47c0b76dbee0 * https://towardsdatascience.com/3-advanced-document-retrieval-techniques-to-improve-rag-systems-0703a2375e1c - Articles * https://posts.specterops.io/summoning-ragnarok-with-your-nemesis-7c4f0577c93b?gi=7318858af6c3 * https://blog.demir.io/advanced-rag-implementing-advanced-techniques-to-enhance-retrieval-augmented-generation-systems-0e07301e46f4 * https://arxiv.org/abs/2312.10997 * https://jxnl.co/writing/2024/05/22/systematically-improving-your-rag/ * https://www.arcus.co/blog/rag-at-planet-scale * https://d-star.ai/embeddings-are-not-all-you-need - Architecture Design - https://medium.com/@yufan1602/modular-rag-and-rag-flow-part-ii-77b62bf8a5d3 - https://www.anyscale.com/blog/a-comprehensive-guide-for-building-rag-based-llm-applications-part-1 * https://github.com/ray-project/llm-applications - https://levelup.gitconnected.com/testing-18-rag-techniques-to-find-the-best-094d166af27f Papers - Rags to Riches - https://huggingface.co/papers/2406.12824 * LLMs will use foreign knowledge sooner than parametric information. - Lit Search * https://arxiv.org/pdf/2407.18940 * https://arxiv.org/abs/2407.18940 - Building * https://techcommunity.microsoft.com/t5/microsoft-developer-community/building-the-ultimate-nerdland-podcast-chatbot-with-rag-and-llm/ba-p/4175577 * https://medium.com/@LakshmiNarayana_U/advanced-rag-techniques-in-ai-retrieval-a-deep-dive-into-the-chroma-course-d8b06118cde3 * https://rito.hashnode.dev/building-a-multi-hop-qa-with-dspy-and-qdrant * https://blog.gopenai.com/advanced-retrieval-augmented-generation-rag-techniques-5abad385ac66?gi=09e684acab4d * https://www.youtube.com/watch?v=bNqSRNMgwhQ * https://www.youtube.com/watch?v=7h6uDsfD7bg * https://www.youtube.com/watch?v=Balro-DxFyk&list=PLwPYSl1MQp4FpIzn48ypesKYzLvUBQpPF&index=5 * https://github.com/jxnl/n-levels-of-rag * https://rito.hashnode.dev/building-a-multi-hop-qa-with-dspy-and-qdrant - Chunking * https://archive.is/h0oBZ * https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/ - Evals * https://github.com/Kain-90/RAG-Play - Multi-Modal RAG * https://docs.llamaindex.ai/en/v0.10.17/examples/multi_modal/multi_modal_pdf_tables.html * https://archive.is/oIhNp * https://arxiv.org/html/2407.01449v2 - Query Expansion * https://arxiv.org/abs/2305.03653 - Cross-Encoder Ranking * Deep Neural network that processes two input sequences together as a single input. Allows the model to directly compare and contrast the inputs, understanding their relationship in a more integrated and nuanced manner. * https://www.sbert.net/examples/applications/retrieve_rerank/README.html Existing https://github.com/superlinear-ai/raglite ### Aligning with the Money 1. Why is it needed in the first place? 2. Identify & Document the Context * What is the business objective? * What led to this objective being identified? * Why is this the most ideal solution? * What other solutions have been evaluated? 3. Identify the intended use patterns * What questions will it answer? * What answers and what kinds of answers are users expecting? 4. Identify the amount + type of data to be archived/referenced * Need to identify what methods of metadata creation and settings will be the most cost efficient in time complexity. * How will you be receiving the data? * Will you be receiving the data or is it on you to obtain it? 5. What does success look like and how will you know you've achieved it? * What are the key metrics/values to measure/track? 
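Several retrieval items further down (Hybrid & Filtered Vector Search, RAG-Fusion, Fusion Retrieval) boil down to the same move: run a keyword ranker and a vector ranker over the same corpus, then merge the two rankings with Reciprocal Rank Fusion. A minimal sketch of that move, assuming the `rank_bm25` and `sentence-transformers` packages and the `all-MiniLM-L6-v2` checkpoint (illustrative choices, not something these notes prescribe):

```python
# Hybrid retrieval sketch: BM25 keyword ranking + dense embedding ranking over the
# same corpus, merged with Reciprocal Rank Fusion (RRF). Library and model choices
# here are illustrative assumptions.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = [
    "RAPTOR builds a tree of recursive summaries for retrieval.",
    "BM25 is a sparse, keyword-based ranking function.",
    "Cross-encoders re-rank retrieved chunks for better precision.",
    "HNSW indexes give fast approximate nearest neighbor search.",
]

bm25 = BM25Okapi([d.lower().split() for d in docs])      # sparse ranker
model = SentenceTransformer("all-MiniLM-L6-v2")          # dense ranker
doc_emb = model.encode(docs, convert_to_tensor=True)


def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d))."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def hybrid_search(query: str, top_k: int = 3) -> list[str]:
    bm25_scores = bm25.get_scores(query.lower().split())
    bm25_ranking = sorted(range(len(docs)), key=lambda i: -bm25_scores[i])
    dense_scores = util.cos_sim(model.encode(query, convert_to_tensor=True), doc_emb)[0]
    dense_ranking = sorted(range(len(docs)), key=lambda i: -float(dense_scores[i]))
    return [docs[i] for i in rrf([bm25_ranking, dense_ranking])[:top_k]]


if __name__ == "__main__":
    print(hybrid_search("fast approximate vector index"))
```

RRF only needs ranks, not comparable scores, which is why it is the usual way to fuse BM25 and embedding results.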
* How are these connected to the 'Success State'? ### Building my RAG Solution - **Outline** * Modular architecture design - **Pre-Retrieval** * F - **Retrieval** * F - **Post-Retrieval** 1. Stuff 2. Query Strategy * Factual Strategy: Focuses on retrieving precise facts and figures. * Analytical Strategy: Aims for comprehensive coverage of a topic, exploring different aspects. * Opinion Strategy: Tries to gather diverse viewpoints on a subjective issue. * Contextual Strategy: Incorporates user-specific context to tailor the retrieval. - **Generation & Post-Generation** - Prompt Compression * https://github.com/microsoft/LLMLingua - **Citations** * Contextcite: https://github.com/MadryLab/context-cite ### RAG Process 1. Pre-Retrieval - Raw data creation / Preparation 1. Prepare data so that text-chunks are self-explanatory 2. **Retrieval** 1. **Chunk Optimization** - Naive - Fixed-size (in characters) Overlapping Sliding windows * `limitations include imprecise control over context size, the risk of cutting words or sentences, and a lack of semantic consideration. Suitable for exploratory analysis but not recommended for tasks requiring deep semantic understanding.` - Recursive Structure Aware Splitting * `A hybrid method combining fixed-size sliding window and structure-aware splitting. It attempts to balance fixed chunk sizes with linguistic boundaries, offering precise context control. Implementation complexity is higher, with a risk of variable chunk sizes. Effective for tasks requiring granularity and semantic integrity but not recommended for quick tasks or unclear structural divisions.` - Structure Aware Splitting (by sentence/paragraph) * ` Respecting linguistic boundaries preserves semantic integrity, but challenges arise with varying structural complexity. Effective for tasks requiring context and semantics, but unsuitable for texts lacking defined structural divisions.` - Context-Aware Splitting (Markdown/LaTeX/HTML) * `ensures content types are not mixed within chunks, maintaining integrity. Challenges include understanding specific syntax and unsuitability for unstructured documents. Useful for structured documents but not recommended for unstructured content.` - NLP Chunking: Tracking Topic Changes * `based on semantic understanding, dividing text into chunks by detecting significant shifts in topics. Ensures semantic consistency but demands advanced NLP techniques. Effective for tasks requiring semantic context and topic continuity but not suitable for high topic overlap or simple chunking tasks.` 2. **Enhancing Data Quality** - Abbreviations/technical terms/links * `To mitigate that issue, we can try to ingest that necessary additional context while processing the data, e.g. replace abbreviations with the full text by using an abbreviation translation table.` 3. **Meta-data** - You can add metadata to your vector data in all vector databases. Metadata can later help to (pre-)filter the entire vector database before we perform a vector search. 4. **Optimize Indexing Structure** * `Full Search vs. Approximate Nearest Neighbor, HNSW vs. IVFPQ` - Types of Data: 1. Text * Chunked and turned into vector embeddings 2. Images and Diagrams * Turn into vector embeddings using a multi-modal/vision model 3. Tables * Summarized with an LLM, descriptions embedded and used for indexing * After retrieval, table is used as is. 4. Code snippets * Chunked using ? * Turned into vector embeddings using an embedding model 1. 
        1. **Chunk Optimization**
            - Semantic splitter - optimize chunk size used for embedding
            - Small-to-Big
            - Sliding Window
            - Summary of chunks
            - Metadata Attachment
        2. **Multi-Representation Indexing**
            - Convert into compact retrieval units (i.e. summaries)
            1. Parent Document
            2. Dense X
        3. **Specialized Embeddings**
            1. Fine-tuned
            2. ColBERT
        4. **Hierarchical Indexing**
            - Tree of document summarization at various abstraction levels
            1. **RAPTOR** - Recursive Abstractive Processing for Tree-Organized Retrieval
                * https://arxiv.org/pdf/2401.18059
                * `RAPTOR is a novel tree-based retrieval system designed for recursively embedding, clustering, and summarizing text segments. It constructs a tree from the bottom up, offering varying levels of summarization. During inference, RAPTOR retrieves information from this tree, incorporating data from longer documents at various levels of abstraction.`
                * https://archive.is/Zgb13 - README
        5. **Knowledge Graphs / GraphRAG**
            - Use an LLM to construct a graph-based text index
                * https://arxiv.org/pdf/2404.16130
                * https://github.com/microsoft/graphrag
            - Occurs in two steps:
                1. Derives a knowledge graph from the source documents
                2. Generates community summaries for all closely connected entity groups
                * Given a query, each community summary contributes to a partial response. These partial responses are then aggregated to form the final global answer.
            - Workflow:
                1. Chunk source documents.
                2. Construct a knowledge graph by extracting entities and their relationships from each chunk.
                3. Simultaneously, Graph RAG employs a multi-stage iterative process. This process requires the LLM to determine if all entities have been extracted, similar to a binary classification problem.
                4. Element Instances → Element Summaries → Graph Communities → Community Summaries
                    * Graph RAG employs community detection algorithms to identify community structures within the graph, incorporating closely linked entities into the same community.
                    * `In this scenario, even if LLM fails to identify all variants of an entity consistently during extraction, community detection can help establish the connections between these variants. Once grouped into a community, it signifies that these variants refer to the same entity connotation, just with different expressions or synonyms. This is akin to entity disambiguation in the field of knowledge graph.`
                    * `After identifying the community, we can generate report-like summaries for each community within the Leiden hierarchy. These summaries are independently useful in understanding the global structure and semantics of the dataset. They can also be used to comprehend the corpus without any problems.`
                5. Community Summaries → Community Answers → Global Answer
        6. **HippoRAG**
            * https://github.com/OSU-NLP-Group/HippoRAG
            * https://arxiv.org/pdf/2405.14831
            * https://archive.is/Zgb13#selection-2093.24-2093.34
        7. **spRAG/dsRAG** - README
            * https://github.com/D-Star-AI/dsRAG
    5. **Choosing the right embedding model**
        * F.
    6. **Self query**
        * https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/self_query/
    7. **Hybrid & Filtered Vector Search**
        * Perform multiple search methods and combine results together
        1. Keyword Search (BM25) + Vector
        2. f
        3.
            * https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking
    8. **Query Construction**
        - Create a query to interact with a specific DB
        1. Text-to-SQL
            * Relational DBs
            * Rewrite a query into a SQL query
        2. Text-to-Cypher
            * Graph DBs
            * Rewrite a query into a Cypher query
        3.
Self-query Retriever * Vector DBs * Auto-generate metadata filters from query 9. **Query Translation** 1. Query Decomposition - Decompose or re-phrase the input question 1. Multi-Query * https://archive.is/5y4iI - Sub-Question Querying * `The core idea of the sub-question strategy is to generate and propose sub-questions related to the main question during the question-answering process to better understand and answer the main question. These sub-questions are usually more specific and can help the system to understand the main question more deeply, thereby improving retrieval accuracy and providing correct answers.` 1. First, the sub-question strategy generates multiple sub-questions from the user query using LLM (Large Language Model). 2. Then, each sub-question undergoes the RAG process to obtain its own answer (retrieval generation). 3. Finally, the answers to all sub-questions are merged to obtain the final answer. 4. Sub Question prompt: - https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/question_gen/llama-index-question-gen-openai/llama_index/question_gen/openai/base.py#L18-L45 ``` You are a world class state of the art agent. You have access to multiple tools, each representing a different data source or API. Each of the tools has a name and a description, formatted as a JSON dictionary. The keys of the dictionary are the names of the tools and the values are the \ descriptions. Your purpose is to help answer a complex user question by generating a list of sub \ questions that can be answered by the tools. These are the guidelines you consider when completing your task: * Be as specific as possible * The sub questions should be relevant to the user question * The sub questions should be answerable by the tools provided * You can generate multiple sub questions for each tool * Tools must be specified by their name, not their description * You don't need to use a tool if you don't think it's relevant Output the list of sub questions by calling the SubQuestionList function. ``` 2. Step-Back Prompting * http://arxiv.org/pdf/2310.06117 * `technique that guides LLM to extract advanced concepts and basic principles from specific instances through abstraction, using these concepts and principles to guide reasoning. This approach significantly improves LLM’s ability to follow the correct reasoning path to solve problems.` - Flow: 1. Take in a question - `Estella Leopold went to what school in Aug 1954 and Nov 1954?` 2. Create a(or multiple) stepback question - `What was Estella Leopold's education history?` 3. Answer Stepback answer 4. Perform reasoning using stepback question + answer to create final answer 3. RAG-Fusion - Combining multiple data sources in one RAG (Walking RAG?) - 3 parts: 1. Query Generation - Generate multiple sub-queries from the user’s input to capture diverse perspectives and fully understand the user’s intent. 2. Sub-query Retrieval - Retrieve relevant information for each sub-query from large datasets and repositories, ensuring comprehensive and in-depth search results. 3. Reciprocal Rank Fusion - Merge the retrieved documents using Reciprocal Rank Fusion (RRF) to combine their ranks, prioritizing the most relevant and comprehensive results. 2. Pseudo-Documents - Hypothetical documents 1. HyDE * https://arxiv.org/abs/2212.10496 10. **Query Enhancement / Rewriting** - Replacing Acronyms with full phrasing - Providing synonyms to industry terms - Literally just ask the LLM to do it. 11. **Query Extension** 12. **Query Expansion** * 1. 
Query Expansion with a generated answer * Paper: https://arxiv.org/abs/2212.10496 * `We use the LLM to generate an answer, before performing the similarity search. If it is a question that can only be answered using our internal knowledge, we indirectly ask the model to hallucinate, and use the hallucinated answer to search for content that is similar to the answer and not the user query itself.` * Given an input query, this method first instructs an LLM to provide a hypothetical answer, whatever its correctness. * Then, the query and the generated answer are combined in a prompt and sent to the retrieval system. - Implementations: - HyDE (Hypothetical Document Embeddings) - Rewrite-Retrieve-Read - Step-Back Prompting - Query2Doc - ITER-RETGEN - Others? 2. Query Expansion with multiple related questions * We ask the LLM to generate N questions related to the original query and then send them all to the retrieval system * 13. **Multiple System Prompts** * Generate multiple prompts, consolidate answer 14. **Query Routing** - Let LLM decide which datastore to use for information retrieval based on user's query 1. Logical Routing - Let LLM choose DB based on question 2. Semantic Routing - embed question and choose prompt based on similarity 15. **Response Summarization** - Using summaries of returned items 16. **Ranking*** 1. Re-Rank * https://div.beehiiv.com/p/advanced-rag-series-retrieval 2. RankGPT 3. RAG-Fusion 17. **Refinement** 1. CRAG * https://arxiv.org/pdf/2401.15884 * https://medium.com/@kbdhunga/corrective-rag-c-rag-and-control-flow-in-langgraph-d9edad7b5a2c * https://medium.com/@djangoist/how-to-create-accurate-llm-responses-on-large-code-repositories-presenting-cgrag-a-new-feature-of-e77c0ffe432d 18. **Active Retrieval** - re-retrieve and or retrieve from new data sources if retrieved documents are not relevant. 1. CRAG 3. **Post-Retrieval** 1. **Context Enrichment** 1. Sentence Window Retriever * `The text chunk with the highest similarity score represents the best-matching content found. Before sending the content to the LLM we add the k-sentences before and after the text chunk found. This makes sense since the information has a high probability to be connected to the middle part and maybe the piece of information in the middle text chunk is not complete.` 2. Auto-Merging Retriever * `The text chunk with the highest similarity score represents the best-matching content found. Before sending the content to the LLM we add each small text chunk's assigned “parent” chunks, which do not necessarily have to be the chunk before and after the text chunk found.` * We can build on top of that concept and set up a whole hierarchy like a decision tree with different levels of Parent Nodes, Child Nodes and Leaf Nodes. We could for example have 3 levels, with different chunk sizes - See https://docs.llamaindex.ai/en/stable/examples/retrievers/auto_merging_retriever/ 4. **Generation & Post-Generation** 1. **Self-Reflective RAG / Self-RAG** - Fine-tuned models/first paper on it: * https://arxiv.org/abs/2310.11511 * https://github.com/AkariAsai/self-rag - Articles * https://blog.langchain.dev/agentic-rag-with-langgraph/ - Info: * We can use outside systems to quantify the quality of retrieval items and generations, and if necessary, re-perform the query or retrieval with a modified input. 2. **Corrective RAG** - Key Pieces: 1. Retrieval Evaluator: * A lightweight retrieval evaluator is introduced to assess the relevance of retrieved documents. 
- It assigns a confidence score and triggers one of three actions: * Correct: If the document is relevant, refine it to extract key knowledge. * Incorrect: If the document is irrelevant, discard it and perform a web search for external knowledge. * Ambiguous: If the evaluator is uncertain, combine internal and external knowledge sources. 2. Decompose-then-Recompose Algorithm: * A process to refine retrieved documents by breaking them down into smaller knowledge strips, filtering irrelevant content, and recomposing important information. 3. Web Search for Corrections: * When incorrect retrieval occurs, the system leverages large-scale web search to find more diverse and accurate external knowledge.` 3. **Rewrite-Retrieve-Read (RRR)** * https://arxiv.org/pdf/2305.14283 4. **Choosing the appropriate/correct model** 5. **Agents** 6. **Evaluation** - Metrics: - Generation 1. Faithfulness - How factually accurate is the generated answer? 2. Answer Relevancy - How relevant is the generated answer to the question? - Retrieval 1. Context Precision 2. Context Recall - Others 1. Answer semantic Similarity 2. Answer correctness 1. Normalized Discounted Cumulative Gain (NDCG) * https://www.evidentlyai.com/ranking-metrics/ndcg-metric#:~:text=DCG%20measures%20the%20total%20item,ranking%20quality%20in%20the%20dataset. 2. Existing RAG Eval Frameworks * RAGAS - https://archive.is/I8f2w 3. LLM as a Judge * We generate an evaluation dataset -> Then define a so-called critique agent with suitable criteria we want to evaluate -> Set up a test pipeline that automatically evaluates the responses of the LLMs based on the defined criteria. 4. Usage Metrics * Nothing beats real-world data. 5. **Delivery** RAG-Fusion - Combining multiple data source in one RAG search JSON file store Vector indexing ### Chunking - https://github.com/D-Star-AI/dsRAG' - **Improvements/Ideas** * As part of chunk header summary, include where in the document this chunk is located, besides chunk #x, so instead this comes from the portion of hte document talking about XYZ in the greater context - Chunk Headers * The idea here is to add in higher-level context to the chunk by prepending a chunk header. This chunk header could be as simple as just the document title, or it could use a combination of document title, a concise document summary, and the full hierarchy of section and sub-section titles. - Chunks -> segments* * Large chunks provide better context to the LLM than small chunks, but they also make it harder to precisely retrieve specific pieces of information. Some queries (like simple factoid questions) are best handled by small chunks, while other queries (like higher-level questions) require very large chunks. * We break documents up into chunks with metadata at the head of each chunk to help categorize it to the document/align it with the greater context - **Semantic Sectioning** * Semantic sectioning uses an LLM to break a document into sections. It works by annotating the document with line numbers and then prompting an LLM to identify the starting and ending lines for each “semantically cohesive section.” These sections should be anywhere from a few paragraphs to a few pages long. The sections then get broken into smaller chunks if needed. The LLM is also prompted to generate descriptive titles for each section. These section titles get used in the contextual chunk headers created by AutoContext, which provides additional context to the ranking models (embeddings and reranker), enabling better retrieval. 1. 
Identify sections 2. Split sections into chunks 3. Add metadata header to each chunk * `Document: X` * `Section: X1` * Alt: `Concise parent document summary` * Other approaches/bits of info can help/experiment... - **AutoContext** * `AutoContext creates contextual chunk headers that contain document-level and section-level context, and prepends those chunk headers to the chunks prior to embedding them. This gives the embeddings a much more accurate and complete representation of the content and meaning of the text. In our testing, this feature leads to a dramatic improvement in retrieval quality. In addition to increasing the rate at which the correct information is retrieved, AutoContext also substantially reduces the rate at which irrelevant results show up in the search results. This reduces the rate at which the LLM misinterprets a piece of text in downstream chat and generation applications.` - **Relevant Segment Extraction** * Relevant Segment Extraction (RSE) is a query-time post-processing step that takes clusters of relevant chunks and intelligently combines them into longer sections of text that we call segments. These segments provide better context to the LLM than any individual chunk can. For simple factual questions, the answer is usually contained in a single chunk; but for more complex questions, the answer usually spans a longer section of text. The goal of RSE is to intelligently identify the section(s) of text that provide the most relevant information, without being constrained to fixed length chunks. - **Topic Aware Chunking by Sentence** * https://blog.gopenai.com/mastering-rag-chunking-techniques-for-enhanced-document-processing-8d5fd88f6b72?gi=2f39fdede29b ### Vector DBs - Indexing mechanisms * Locality-Sensitive Hashing (LSH) * Hierarchical Graph Structure * Inverted File Indexing * Product Quantization * Spatial Hashing * Tree-Based Indexing variations - Embedding algos * Word2Vec * GloVe * Ada * BERT * Instructor - Similarity Measurement Algos * Cosine similarity - measuring the cosine of two angles * Euclidean distance - measuring the distance between two points - Indexing and Searching Algos - Approximate Nearest Neighbor (ANN) * FAISS * Annoy * IVF * HNSW (Heirarchical Navigable small worlds) - Vector Similarity Search - `Inverted File (IVF)` - `indexes are used in vector similarity search to map the query vector to a smaller subset of the vector space, reducing the number of vectors compared to the query vector and speeding up Approximate Nearest Neighbor (ANN) search. IVF vectors are efficient and scalable, making them suitable for large-scale datasets. However, the results provided by IVF vectors are approximate, not exact, and creating an IVF index can be resource-intensive, especially for large datasets.` - `Hierarchical Navigable Small World (HNSW)` - `graphs are among the top-performing indexes for vector similarity search. HNSW is a robust algorithm that produces state-of-the-art performance with fast search speeds and excellent recall. It creates a multi-layered graph, where each layer represents a subset of the data, to quickly traverse these layers to find approximate nearest neighbors. HNSW vectors are versatile and suitable for a wide range of applications, including those that require high-dimensional data spaces. However, the parameters of the HNSW algorithm can be tricky to tune for optimal performance, and creating an HNSW index can also be resource intensive.` - **Vectorization Process** - Usually several stages: 1. 
Data Pre-processing * `The initial stage involves preparing the raw data. For text, this might include tokenization (breaking down text into words or phrases), removing stop words, and normalizing the text (like lowercasing). For images, preprocessing might involve resizing, normalization, or augmentation.` 2. Feature Extraction * `The system extracts features from the preprocessed data. In text, features could be the frequency of words or the context in which they appear. For images, features could be various visual elements like edges, textures, or color histograms.` 3. Embedding Generation * `Using algorithms like Word2Vec for text or CNNs for images, the extracted features are transformed into numerical vectors. These vectors capture the essential qualities of the data in a dense format, typically in a high-dimensional space.` 4. Dimensionality Reduction * `Sometimes, the generated vectors might be very high-dimensional, which can be computationally intensive to process. Techniques like PCA (Principal Component Analysis) or t-SNE (t-Distributed Stochastic Neighbor Embedding) are used to reduce the dimensionality while preserving as much of the significant information as possible.` 5. Normalization * `Finally, the vectors are often normalized to have a uniform length. This step ensures consistency across the dataset and is crucial for accurately measuring distances or similarities between vectors.` ### Semantic Re-Ranker * `enhances retrieval quality by re-ranking search results based on deep learning models, ensuring the most relevant results are prioritized.` - General Steps 1. Initial Retrieval: a query is processed, and a set of potentially relevant results is fetched. This set is usually larger and broader, encompassing a wide array of documents or data points that might be relevant to the query. 2. LLM / ML model used to identify relevance 3. Re-Ranking Process: In this stage, the retrieved results are fed into the deep learning model along with the query. The model assesses each result for its relevance, considering factors such as semantic similarity, context matching, and the query's intent. 4. Generating a Score: Each result is assigned a relevance score by the model. This scoring is based on how well the content of the result matches the query in terms of meaning, context, and intent. 5. Sorting Results: Based on the scores assigned, the results are then sorted in descending order of relevance. The top-scoring results are deemed most relevant to the query and are presented to the user. 6. Continuous Learning and Adaptation: Many Semantic Rankers are designed to learn and adapt over time. By analyzing user interactions with the search results (like which links are clicked), the Ranker can refine its scoring and sorting algorithms, enhancing its accuracy and relevance. - **Relevance Metrics** - List of: 1. Precision and Recall: These are fundamental metrics in information retrieval. Precision measures the proportion of retrieved documents that are relevant, while recall measures the proportion of relevant documents that were retrieved. High precision means that most of the retrieved items are relevant, and high recall means that most of the relevant items are retrieved. 2. F1 Score: The F1 Score is the harmonic mean of precision and recall. 
It provides a single metric that balances both precision and recall, useful in scenarios where it's important to find an equilibrium between finding as many relevant items as possible (recall) and ensuring that the retrieved items are mostly relevant (precision). 3. Normalized Discounted Cumulative Gain (NDCG): Particularly useful in scenarios where the order of results is important (like web search), NDCG takes into account the position of relevant documents in the result list. The more relevant documents appearing higher in the search results, the better the NDCG. 4. Mean Average Precision (MAP): MAP considers the order of retrieval and the precision at each rank in the result list. It’s especially useful in tasks where the order of retrieval is important but the user is likely to view only the top few results. ### Issues in RAG 1. Indexing - Issues: 1. Chunking 1. Relevance & Precision * `Properly chunked documents ensure that the retrieved information is highly relevant to the query. If the chunks are too large, they may contain a lot of irrelevant information, diluting the useful content. Conversely, if they are too small, they might miss the broader context, leading to accurate responses but not sufficiently comprehensive.` 2. Efficiency & Performance * `The size and structure of the chunks affect the efficiency of the retrieval process. Smaller chunks can be retrieved and processed more quickly, reducing the overall latency of the system. However, there is a balance to be struck, as too many small chunks can overwhelm the retrieval system and negatively impact performance.` 3. Quality of Generation * `The quality of the generated output heavily depends on the input retrieved. Well-chunked documents ensure that the generator has access to coherent and contextually rich information, which leads to more informative, coherent, and contextually appropriate responses.` 4. Scalability * `As the corpus size grows, chunking becomes even more critical. A well-thought-out chunking strategy ensures that the system can scale effectively, managing more documents without a significant drop in retrieval speed or quality.` 1. Incomplete Content Representation * `The semantic information of chunks is influenced by the segmentation method, resulting in the loss or submergence of important information within longer contexts.` 2. Inaccurate Chunk Similarity Search. * `As data volume increases, noise in retrieval grows, leading to frequent matching with erroneous data, making the retrieval system fragile and unreliable.` 3. Unclear Reference Trajectory. * `The retrieved chunks may originate from any document, devoid of citation trails, potentially resulting in the presence of chunks from multiple different documents that, despite being semantically similar, contain content on entirely different topics.` - Potential Solutions - Chunk Optimization - Sliding window * overlapping chunks - Small to Big * Retrieve small chunks then collect parent from meta data - Enhance data granularity - apply data cleaning techniques, like removing irrelevant information, confirming factual accuracy, updating outdated information, etc. - Adding metadata, such as dates, purposes, or chapters, for filtering purposes. - Structural Organization - Heirarchical Index * `In the hierarchical structure of documents, nodes are arranged in parent-child relationships, with chunks linked to them. 
Data summaries are stored at each node, aiding in the swift traversal of data and assisting the RAG system in determining which chunks to extract. This approach can also mitigate the illusion caused by block extraction issues.` - Methods for constructing index: 1. Structural awareness - paragraph and sentence segmentation in docs 2. Content Awareness - inherent structure in PDF, HTML, Latex 3. Semantic Awareness - Semantic recognition and segmentation of text based on NLP techniques, such as leveraging NLTK. 4. Knowledge Graphs 2. Pre-Retrieval - Issues: - Poorly worded queries - Language complexity and ambiguity - Potential Solutions: - Multi-Query - Expand original question into multiple - Sub-Query - `The process of sub-question planning represents the generation of the necessary sub-questions to contextualize and fully answer the original question when combined. ` - Chain-of-Verification(CoVe) - The expanded queries undergo validation by LLM to achieve the effect of reducing hallucinations. Validated expanded queries typically exhibit higher reliability. * https://arxiv.org/abs/2309.11495 - Query Transformation - Rewrite * The original queries are not always optimal for LLM retrieval, especially in real-world scenarios. Therefore, we can prompt LLM to rewrite the queries. - HyDE * `When responding to queries, LLM constructs hypothetical documents (assumed answers) instead of directly searching the query and its computed vectors in the vector database. It focuses on embedding similarity from answer to answer rather than seeking embedding similarity for the problem or query. In addition, it also includes Reverse HyDE, which focuses on retrieval from query to query.` * https://medium.aiplanet.com/advanced-rag-improving-retrieval-using-hypothetical-document-embeddings-hyde-1421a8ec075a?gi=b7fa45dc0f32&source=post_page-----e69b32dc13a3-------------------------------- - Reverse HyDE * - Step-back prompting * https://arxiv.org/abs/2310.06117 * https://cobusgreyling.medium.com/a-new-prompt-engineering-technique-has-been-introduced-called-step-back-prompting-b00e8954cacb - Query Routing * Based on varying queries, routing to distinct RAG pipeline,which is suitable for a versatile RAG system designed to accommodate diverse scenarios. - Metadata Router/Filter * `involves extracting keywords (entity) from the query, followed by filtering based on the keywords and metadata within the chunks to narrow down the search scope.` - Semantic Router * https://medium.com/ai-insights-cobet/beyond-basic-chatbots-how-semantic-router-is-changing-the-game-783dd959a32d - CoVe * https://sourajit16-02-93.medium.com/chain-of-verification-cove-understanding-implementation-e7338c7f4cb5 * https://www.domingosenise.com/artificial-intelligence/chain-of-verification-cove-an-approach-for-reducing-hallucinations-in-llm-outcomes.html - Multi-Query - SubQuery - Query Construction - Text-to-Cypher - Text-to-SQL * https://blog.langchain.dev/query-construction/?source=post_page-----e69b32dc13a3-------------------------------- 3. Retrieval - 3 Main considerations: 1. Retrieval Efficiency 2. Embedding Quality 3. Alignment of tasks, data and models - Sparse Retreiver * EX: BM25, TF-IDF - Dense Retriever * ColBERT * BGE/Cohere embedding/OpenAI-Ada-002 - Retriever Fine-tuning - SFT - LSR (LM-Supervised Retriever) - Reinforcement learning - Adapter * https://arxiv.org/pdf/2310.18347 * https://arxiv.org/abs/2305.17331 ` 4. Post-Retrieval - Primary Challenges: 1. Lost in the middle 2. Noise/anti-fact chunks 3. Context windows. 
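The first post-retrieval mitigation listed below is re-ranking (see also the Semantic Re-Ranker section above). A minimal model-based re-rank sketch, assuming the `sentence-transformers` package and the `cross-encoder/ms-marco-MiniLM-L-6-v2` checkpoint (illustrative assumptions, not prescribed by these notes):

```python
# Model-based re-rank sketch: score (query, chunk) pairs with a cross-encoder and
# keep the top_n highest-scoring chunks. The model name is an illustrative assumption.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
    """Return `chunks` reordered by cross-encoder relevance to `query`, best first."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]


if __name__ == "__main__":
    retrieved = [
        "HNSW is an approximate nearest neighbor index.",
        "The Leiden algorithm detects communities in graphs.",
        "Re-ranking with a cross-encoder improves precision of the final context.",
    ]
    print(rerank("how do I improve retrieval precision?", retrieved, top_n=2))
```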
- Potential Solutions - Re-Rank * Re-rank implementation: https://towardsdatascience.com/enhancing-rag-pipelines-in-haystack-45f14e2bc9f5 - Rule-based re-rank * According to certain rules, metrics are calculated to rerank chunks. * Some: Diversity; Relevance; MRR (Maximal Marginal Relevance, 1998) - Model based rerank * Utilize a language model to reorder the document chunks - Compression & Selection - LLMLingua * https://github.com/microsoft/LLMLingua * https://llmlingua.com/ - RECOMP * https://arxiv.org/pdf/2310.04408 - Selective Context * https://aclanthology.org/2023.emnlp-main.391.pdf - Tagging Filter * https://python.langchain.com/v0.1/docs/use_cases/tagging/ - LLM Critique 5. Generator * Utilize the LLM to generate answers based on the user’s query and the retrieved context information. - Finetuning * SFT * RL * Distillation - Dual FT * `In the RAG system, fine-tuning both the retriever and the generator simultaneously is a unique feature of the RAG system. It is important to note that the emphasis of system fine-tuning is on the coordination between the retriever and the generator. Fine-tuning the retriever and the generator separately separately belongs to the combination of the former two, rather than being part of Dual FT.` * https://arxiv.org/pdf/2310.01352 6. Orchestration * `Orchestration refers to the modules used to control the RAG process. RAG no longer follows a fixed process, and it involves making decisions at key points and dynamically selecting the next step based on the results.` - Scheduling * `The Judge module assesses critical point in the RAG process, determining the need to retrieve external document repositories, the satisfaction of the answer, and the necessity of further exploration. It is typically used in recursive, iterative, and adaptive retrieval.` - `Rule-base` * `The next course of action is determined based on predefined rules. Typically, the generated answers are scored, and then the decision to continue or stop is made based on whether the scores meet predefined thresholds. Common thresholds include confidence levels for tokens.` - `Prompt-base` * `LLM autonomously determines the next course of action. There are primarily two approaches to achieve this. The first involves prompting LLM to reflect or make judgments based on the conversation history, as seen in the ReACT framework. The benefit here is the elimination of the need for fine-tuning the model. However, the output format of the judgment depends on the LLM’s adherence to instructions.` * https://arxiv.org/pdf/2305.06983 - Tuning based * The second approach entails LLM generating specific tokens to trigger particular actions, a method that can be traced back to Toolformer and is applied in RAG, such as in Self-RAG. * https://arxiv.org/pdf/2310.11511 - Fusion * `This concept originates from RAG Fusion. As mentioned in the previous section on Query Expansion, the current RAG process is no longer a singular pipeline. It often requires the expansion of retrieval scope or diversity through multiple branches. Therefore, following the expansion to multiple branches, the Fusion module is relied upon to merge multiple answers.` - Possibility Ensemble * `The fusion method is based on the weighted values of different tokens generated from multiple beranches, leading to the comprehensive selection of the final output. 
- Semantic dissonance
  * `the discordance between your task's intended meaning, the RAG's understanding of it, and the underlying knowledge that's stored.`
- Poor explainability of embeddings
- Semantic search tends to be directionally correct but inherently fuzzy
  * Good for finding top-k results
- Significance of Dimensionality in Vector Embeddings
  * `The dimensionality of a vector, which is the length of the vector, plays a crucial role. Higher-dimensional vectors can capture more information and subtle nuances of the data, leading to more accurate models. However, higher dimensionality also increases computational complexity. Therefore, finding the right balance in vector dimensionality is key to efficient and effective model performance.`

### Potential Improvements when building

https://gist.github.com/Donavan/62e238aa0a40ca88191255a070e356a2

- **Chunking**
  - Relevance & Precision
  - Efficiency and Performance
  - Quality of Generation
  - Scalability
- **Embeddings**
  1. **Encoder Fine-Tuning**
     * `Despite the high efficiency of modern Transformer Encoders, fine-tuning can still yield modest improvements in retrieval quality, especially when tailored to specific domains.`
  2. **Ranker Fine-Tuning**
     * `Employing a cross-encoder for re-ranking can refine the selection of context, ensuring that only the most relevant text chunks are considered.`
  3. **LLM Fine-Tuning**
     * `The advent of LLM fine-tuning APIs allows for the adaptation of models to specific datasets or tasks, enhancing their effectiveness and accuracy in generating responses.`
- **Constructing the Search Index**
  1. **Vector store index**
  2. **Hierarchical Indices**
     * Two-tiered index: one for doc summaries, the other for detailed chunks
     * Filter through the summaries first, then search the matching chunks
  3. **Hypothetical Questions and HyDE approach** (see the sketch below)
     * A novel approach involves generating hypothetical questions for each text chunk. These questions are then vectorized and stored in place of the traditional chunk vectors in the index, which improves semantic alignment between user queries and stored data and can lead to more accurate retrieval. The HyDE method reverses this process: it generates hypothetical responses to queries and uses these as additional data points to refine search accuracy.
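A rough sketch of the hypothetical-questions indexing idea from item 3 above: generate questions per chunk, embed the questions instead of the chunk text, and map the best-matching question back to its chunk at query time. `generate_questions` is a stub standing in for an LLM call (e.g. "write 3 questions this passage answers"), and the model and data are placeholders.

```python
# Sketch: index hypothetical questions instead of raw chunks (assumes sentence-transformers).
import numpy as np
from sentence_transformers import SentenceTransformer

chunks = {
    "c1": "Our refund policy allows returns within 30 days of purchase.",
    "c2": "The API rate limit is 100 requests per minute per key.",
}

def generate_questions(chunk_id: str) -> list[str]:
    # Hypothetical stub; in practice this would be an LLM call over chunks[chunk_id].
    return {
        "c1": ["How long do I have to return an item?", "What is the refund window?"],
        "c2": ["How many API requests can I make per minute?"],
    }[chunk_id]

model = SentenceTransformer("all-MiniLM-L6-v2")
index = []  # list of (question_embedding, chunk_id)
for cid in chunks:
    for q in generate_questions(cid):
        index.append((model.encode(q, normalize_embeddings=True), cid))

def retrieve(query: str) -> str:
    q_emb = model.encode(query, normalize_embeddings=True)
    best = max(index, key=lambda pair: float(np.dot(pair[0], q_emb)))
    return chunks[best[1]]  # return the chunk whose hypothetical question best matches

print(retrieve("can I send a product back after two weeks?"))
```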
- **Context Enrichment**
  1. **Sentence-Window retrieval**
     * `This technique enhances search precision by embedding individual sentences and extending the search context to include neighboring sentences. This not only improves the relevance of the retrieved data but also provides the LLM with a richer context for generating responses.`
  2. **Auto-merging Retriever** (Parent Document Retriever)
     * `Similar to the Sentence Window Retrieval, this method focuses on granularity but extends the context more broadly. Documents are segmented into a hierarchy of chunks, and smaller, more relevant pieces are initially retrieved. If multiple small chunks relate to a larger segment, they are merged to form a comprehensive context, which is then presented to the LLM.`
  3. **Fusion Retrieval**
     * `The concept of fusion retrieval combines traditional keyword-based search methods, like TF-IDF or BM25, with modern vector-based search techniques. This hybrid approach, often implemented using algorithms like Reciprocal Rank Fusion (RRF), optimizes retrieval by integrating diverse similarity measures.`
- **Re-Ranking & Filtering**
  * `After the initial retrieval of results using any of the aforementioned sophisticated algorithms, the focus shifts to refining these results through various post-processing techniques.`
  * `Various systems enable the fine-tuning of retrieval outcomes based on similarity scores, keywords, metadata, or through re-ranking with additional models. These models could include an LLM, a sentence-transformer cross-encoder, or even external reranking services like Cohere. Moreover, filtering can also be adjusted based on metadata attributes, such as the recency of the data, ensuring that the most relevant and timely information is prioritized. This stage is critical as it prepares the retrieved data for the final step - feeding it into an LLM to generate the precise answer.`
  1. f
  2. f
- **Query Transformations**
  1. **(Sub-)Query Decomposition**
     * `For complex queries that are unlikely to yield direct comparisons or results from existing data (e.g., comparing GitHub stars between Langchain and LlamaIndex), an LLM can break down the query into simpler, more manageable sub-queries. Each sub-query can then be processed independently, with their results synthesized later to form a comprehensive response.`
     * Multi Query Retriever and Sub Question Query Engine
     - Step-back Prompting
       * `This method involves using an LLM to generate a broader or more general query from the original, complex query. The aim is to retrieve a higher-level context that can serve as a foundation for answering the more specific original query. The contexts from both the original and the generalized queries are then combined to enhance the final answer generation.`
     - Query Rewriting
       * https://archive.is/FCiaW
       * `Another technique involves using an LLM to reformulate the initial query to improve the retrieval process.`
  2. **Reference Citations**
     - Direct Source Mention
       * Require source IDs to be mentioned directly in the generated response.
     - Fuzzy Matching (see the sketch after this section)
       * Align portions of the response with their corresponding text chunks in the index.
     - Research:
       - Attribution Bench: https://osu-nlp-group.github.io/AttributionBench/
         * Fine-tuned T5 models outperform otherwise-SOTA models.
         * Complexity of questions and data are issues.
       - ContextCite: https://gradientscience.org/contextcite/
         * Hot shit?
         * https://gradientscience.org/contextcite-applications/
       - Metrics
         - "Enabling Large Language Models to Generate Text with Citations" paper
           * https://arxiv.org/abs/2305.14627
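As a toy version of the fuzzy-matching idea under Reference Citations, the sketch below aligns each answer sentence to its most similar source chunk using only the standard library; the 0.4 similarity threshold and the naive sentence splitting are arbitrary and would need tuning on real data.

```python
# Sketch: fuzzy-match answer sentences back to source chunks for citations (stdlib only).
from difflib import SequenceMatcher

chunks = {
    "[1]": "The Eiffel Tower was completed in 1889 for the World's Fair in Paris.",
    "[2]": "It stands roughly 330 metres tall including its antennas.",
}
answer = "The tower was finished in 1889 for the Paris World's Fair. It is about 330 metres tall."

def cite(sentence: str, threshold: float = 0.4) -> str:
    """Attach the ID of the most similar chunk, if similarity clears the threshold."""
    best_id, best_score = None, 0.0
    for cid, text in chunks.items():
        score = SequenceMatcher(None, sentence.lower(), text.lower()).ratio()
        if score > best_score:
            best_id, best_score = cid, score
    return f"{sentence} {best_id}" if best_score >= threshold else sentence

for sent in answer.split(". "):
    print(cite(sent.strip().rstrip(".") + "."))
```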
- **Chat Engine**
  1. ContextChatEngine
     * `A straightforward approach where the LLM retrieves context relevant to the user's query along with any previous chat history. This history is then used to inform the LLM's response, ensuring continuity and relevance in the dialogue.`
  2. CondensePlusContextMode
     * `A more advanced technique where each interaction's chat history and the last message are condensed into a new query. This refined query is used to retrieve relevant context, which, along with the original user message, is passed to the LLM for generating a response.`
- **Query Routing**
  * `Query routing involves strategic decision-making powered by an LLM to determine the most effective subsequent action based on the user's query. This could include decisions to summarize information, search specific data indices, or explore multiple routes to synthesize a comprehensive answer. Query routers are crucial for selecting the appropriate data source or index, especially in systems where data is stored across multiple platforms, such as vector stores, graph databases, or relational databases.`
  - Query Routers
    * F
- **Agents in RAG Systems**
  1. **Multi-Document Agent Scheme**
  2. **Walking RAG** - Multi-shot retrieval
     - Have the LLM ask for more information as needed, perform searches for that information, and loop back to asking the LLM whether there is now enough to answer.
     - Things necessary to facilitate this:
       * We need to extract partial information from retrieved pieces of source data, so we can learn as we go.
       * We need to find new places to look, informed by the source data as well as the question.
       * We need to retrieve information from those specific places.
     * Links:
       * https://olickel.com/retrieval-augmented-research-1-basics
       * https://olickel.com/retrieval-augmented-research-2-walking
       * https://olickel.com/retrieval-augmented-research-3-use-the-whole-brain
  3. F
- **Response Synthesizer**
  * `The simplest method might involve merely concatenating all relevant context with the query and processing it through an LLM. However, more nuanced approaches involve multiple LLM interactions to refine the context and enhance the quality of the final answer.`
  1. Iterative Refinement
     * `Breaking down the retrieved context into manageable chunks and sequentially refining the response through multiple LLM interactions.`
  2. Context Summarization
     * `Compressing the extensive retrieved context to fit within an LLM's prompt limitations.`
  3. Multi-Answer Generation
     * `Producing several responses from different context segments and then synthesizing these into a unified answer.`
- **Evaluating RAG Performance**
  - Semantic + Relevance Ranking
    - One example:
      * `rank = (cosine similarity) + (weight) x (relevance score)`
  - Embedding models need to be fine-tuned to your data for best results
    * `For your Q&A system built on support docs, you very well may find that question→question comparisons will materially improve performance as opposed to question→support doc. Pragmatically, you can ask ChatGPT to generate example questions for each support doc and have a human expert curate them. In essence you'd be pre-populating your own Stack Overflow.`
  - Can create semi-synthetic training data based on your documents
  - Want to take this "Stack Overflow" methodology one step further? (See the sketch after this list.)
    1. For each document, ask ChatGPT to generate a list of 100 questions it can answer
    2. These questions won't be perfect, so for each question you generate, compute cosine similarities with every other document
    3. Keep the questions that would rank the correct document #1 against every other document
    4. Identify the highest-quality questions by sorting on the margin between the cosine similarity of the correct document and that of the second-ranked document
    5. Send to a human for further curation
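A sketch of steps 2-4 of the "Stack Overflow" methodology above: score each generated question against every document, keep only questions whose source document ranks first, and sort by the margin over the runner-up. The documents, questions, and model are placeholders, assuming the `sentence-transformers` package.

```python
# Sketch: filter LLM-generated questions by whether their source doc ranks #1,
# then sort by the similarity margin over the second-ranked document.
import numpy as np
from sentence_transformers import SentenceTransformer

docs = ["How to reset your password ...", "Billing and refund policy ...", "API authentication guide ..."]
# (question, index of the doc it was generated from) - placeholders for LLM output
questions = [("How do I change a forgotten password?", 0),
             ("Can I get my money back after a charge?", 1),
             ("Which header carries the API key?", 2)]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = model.encode(docs, normalize_embeddings=True)

kept = []
for q, correct in questions:
    sims = doc_emb @ model.encode(q, normalize_embeddings=True)   # cosine vs every doc
    if int(np.argmax(sims)) == correct:                           # step 3: correct doc must rank #1
        margin = sims[correct] - np.partition(sims, -2)[-2]       # step 4: gap to second place
        kept.append((float(margin), q))

for margin, q in sorted(kept, reverse=True):                      # best-separated questions first
    print(f"{margin:.3f}  {q}")
```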
- **Balancing Precision vs Recall**
  - Ways to shift the balance:
    1. Threshold Tuning: Adjusting the threshold for deciding whether a document is relevant can shift the balance between precision and recall. Lowering the threshold may increase recall but decrease precision, and vice versa. (A small threshold-sweep sketch appears at the end of this section.)
    2. Query Expansion and Refinement: Enhancing the query with additional keywords (query expansion) can increase recall by retrieving a broader set of documents. Conversely, refining the query with more specific terms can improve precision.
    3. Relevance Feedback: Incorporating user feedback into the retrieval process can help refine the search results. Users' interactions with the results (clicks, time spent on a document, etc.) provide valuable signals for adjusting the balance between precision and recall.
    4. Use of Advanced Models: Employing more sophisticated models, such as deep neural networks, can improve both precision and recall. These models are better at understanding complex queries and documents, leading to more accurate retrieval.
    5. Customizing Based on Use Case: Different applications may require a different balance of precision and recall. For instance, in a legal document search, precision might matter more so that all retrieved documents are highly relevant; in a medical research scenario, recall might be prioritized so that no relevant studies are missed.
- **Prompt Complexity**
  1. Single fact retrieval
  2. Multi-fact retrieval
  3. Discontiguous multi-fact retrieval
  4. Simple analysis questions
  5. Complex analysis
  6. Research-level questions
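To illustrate the threshold-tuning lever (item 1 under Balancing Precision vs Recall above), here is a tiny sweep over a made-up set of similarity scores and relevance judgments; lowering the cutoff raises recall at the cost of precision.

```python
# Sketch: sweep a similarity threshold and watch precision/recall trade off.
# Scores and relevance labels are made up; in practice they come from your eval set.
scores   = [0.91, 0.84, 0.77, 0.70, 0.62, 0.55, 0.41, 0.30]   # retrieval similarity per candidate
relevant = [1,    1,    0,    1,    0,    1,    0,    0   ]   # ground-truth judgments

def precision_recall(threshold: float):
    retrieved = [r for s, r in zip(scores, relevant) if s >= threshold]
    tp = sum(retrieved)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / sum(relevant)
    return precision, recall

for t in (0.8, 0.6, 0.4):
    p, r = precision_recall(t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```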