---
name: llama-index
description: Comprehensive guide for building LLM applications with LlamaIndex, including data loaders, indexes, query engines, chat engines, vector stores, retrievers, agents, evaluation, streaming, and observability.
metadata:
  author: mte90
  version: 1.0.0
  tags:
    - llama-index
    - llm
    - rag
    - ai
    - python
    - vector-database
    - openai
    - agents
---

# LlamaIndex Development

Complete guide for building LLM applications with the LlamaIndex framework.

## Overview

LlamaIndex is a data framework for LLM applications, providing tools for data ingestion, indexing, and retrieval.

**Key Characteristics:**

- RAG (Retrieval Augmented Generation) support
- Multiple data connectors (300+)
- Various index types
- Query and chat engines
- Vector store integrations
- Agent framework

## Installation

### Setup

```bash
# Basic installation
pip install llama-index

# With OpenAI
pip install llama-index-llms-openai
pip install llama-index-embeddings-openai

# With vector stores
pip install llama-index-vector-stores-chroma
pip install llama-index-vector-stores-pinecone
pip install llama-index-vector-stores-qdrant

# With evaluation
pip install ragas
```

### Basic Configuration

```python
import os

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Set API key
os.environ["OPENAI_API_KEY"] = "your-api-key"

# Configure global settings
Settings.llm = OpenAI(model="gpt-4o", temperature=0.0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.chunk_size = 512
Settings.chunk_overlap = 50
```

## Quick Start

### Basic RAG Pipeline

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Load documents
documents = SimpleDirectoryReader("./data").load_data()

# Create index
index = VectorStoreIndex.from_documents(documents)

# Create query engine
query_engine = index.as_query_engine()

# Query
response = query_engine.query("What is the main topic?")
print(response)
```

### With Storage

```python
import chromadb

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

# Setup ChromaDB
db = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db.get_or_create_collection("my_collection")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Load and index
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=OpenAIEmbedding(),
)

# Persist
storage_context.persist()

# Load from disk
index = VectorStoreIndex.from_vector_store(
    vector_store,
    embed_model=OpenAIEmbedding(),
)
```

## Data Loading

### Document Loaders

```python
import os

from llama_index.core import SimpleDirectoryReader, Document

# Load from directory
documents = SimpleDirectoryReader(
    input_dir="./data",
    required_exts=[".pdf", ".txt", ".md"],
    exclude=["*.tmp"],
    recursive=True,
).load_data()

# Load specific files
documents = SimpleDirectoryReader(
    input_files=["./file1.pdf", "./file2.txt"]
).load_data()

# Create documents manually
documents = [
    Document(text="Content here", metadata={"source": "manual"}),
]

# With metadata extraction
def custom_metadata_func(file_path: str) -> dict:
    return {
        "file_path": file_path,
        "file_name": os.path.basename(file_path),
    }

documents = SimpleDirectoryReader(
    input_dir="./data",
    file_metadata=custom_metadata_func,
).load_data()
```
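### LlamaHub Readers

The 300+ connectors mentioned in the overview are shipped as separate reader packages on LlamaHub and follow the same `load_data()` pattern as `SimpleDirectoryReader`. A minimal sketch using the web page reader, assuming `pip install llama-index-readers-web` and a placeholder URL:

```python
from llama_index.readers.web import SimpleWebPageReader

# Requires: pip install llama-index-readers-web
# html_to_text strips markup before the pages become Documents
documents = SimpleWebPageReader(html_to_text=True).load_data(
    urls=["https://example.com/docs/page"]  # placeholder URL
)
```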
### Custom Data Connectors

```python
from typing import List

from llama_index.core import Document

class CustomDataReader:
    """Custom data loader."""

    def load_data(self, source: str) -> List[Document]:
        documents = []

        # Load from custom source
        # Example: API, database, etc.
        data = self._fetch_from_source(source)

        for item in data:
            doc = Document(
                text=item["content"],
                metadata={
                    "source": source,
                    "id": item["id"],
                    "timestamp": item["timestamp"],
                },
            )
            documents.append(doc)

        return documents

    def _fetch_from_source(self, source: str):
        # Placeholder: replace with a real API or database call
        return [
            {"content": "Example record", "id": "1", "timestamp": "2024-01-01"},
        ]

# Usage
reader = CustomDataReader()
documents = reader.load_data("api://endpoint")
```

## Node Parsing

### Chunking Strategies

```python
from llama_index.core.node_parser import (
    SentenceSplitter,
    TokenTextSplitter,
    SemanticSplitterNodeParser,
    HierarchicalNodeParser,
)
from llama_index.embeddings.openai import OpenAIEmbedding

# Sentence splitter (default)
splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20,
    paragraph_separator="\n\n",
)
nodes = splitter.get_nodes_from_documents(documents)

# Token splitter
token_splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
)

# Semantic splitter (uses embeddings)
semantic_splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=OpenAIEmbedding(),
)

# Hierarchical node parser
hierarchical_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128],  # Parent -> Child -> Grandchild
)
```

### Node Processing Pipeline

```python
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.extractors import (
    TitleExtractor,
    SummaryExtractor,
    KeywordExtractor,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

# Create pipeline
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=1024, chunk_overlap=20),
        TitleExtractor(),
        SummaryExtractor(),
        KeywordExtractor(),
        OpenAIEmbedding(),
    ],
)

# Run pipeline
nodes = pipeline.run(documents=documents)

# Access metadata
for node in nodes[:3]:
    print(f"Title: {node.metadata.get('document_title')}")
    print(f"Summary: {node.metadata.get('section_summary')}")
    print(f"Keywords: {node.metadata.get('excerpt_keywords')}")
```
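### Pipeline Caching

`IngestionPipeline` also caches transformation output, so re-running the pipeline over unchanged documents skips recomputation. A minimal sketch, hedged because the persistence helpers vary slightly across versions; `./pipeline_storage` is an arbitrary directory:

```python
from llama_index.core.ingestion import IngestionPipeline, IngestionCache
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=1024, chunk_overlap=20),
        OpenAIEmbedding(),
    ],
    cache=IngestionCache(),  # in-memory cache keyed by node content + transformation
)

nodes = pipeline.run(documents=documents)  # first run computes everything
nodes = pipeline.run(documents=documents)  # second run is served from the cache

# Persist the cache so a later process can reuse it
pipeline.persist("./pipeline_storage")
# ...and load it back in a new process
pipeline.load("./pipeline_storage")
```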
## Vector Stores

### ChromaDB

```python
import chromadb

from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex, StorageContext

# Persistent client
db = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db.get_or_create_collection("my_collection")

# Create vector store
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create index
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)

# Load existing
index = VectorStoreIndex.from_vector_store(vector_store)
```

### Pinecone

```python
import os

import pinecone

from llama_index.vector_stores.pinecone import PineconeVectorStore
from llama_index.core import VectorStoreIndex, StorageContext

# Initialize Pinecone (legacy pinecone-client < 3.0 API;
# pinecone >= 3.0 replaces init() with the Pinecone class)
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment=os.environ["PINECONE_ENV"],
)

# Create index if it does not exist (Pinecone index names allow
# only lowercase alphanumerics and hyphens)
if "my-index" not in pinecone.list_indexes():
    pinecone.create_index(
        "my-index",
        dimension=1536,
        metric="cosine",
    )

pinecone_index = pinecone.Index("my-index")
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create index
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)
```

### Qdrant

```python
from qdrant_client import QdrantClient

from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import VectorStoreIndex, StorageContext

# Initialize client
client = QdrantClient(host="localhost", port=6333)

# Create vector store
vector_store = QdrantVectorStore(
    collection_name="my_collection",
    client=client,
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)
```

## Advanced Retrievers

### Hybrid Search

```python
from llama_index.core import VectorStoreIndex, SimpleKeywordTableIndex
from llama_index.core.retrievers import VectorIndexRetriever, KeywordTableSimpleRetriever
from llama_index.core.schema import QueryBundle

class HybridRetriever:
    """Combine vector and keyword search."""

    def __init__(self, vector_index, keyword_index, mode="OR"):
        self.vector_retriever = VectorIndexRetriever(
            index=vector_index,
            similarity_top_k=10,
        )
        self.keyword_retriever = KeywordTableSimpleRetriever(
            index=keyword_index,
        )
        self.mode = mode

    def retrieve(self, query: str):
        query_bundle = QueryBundle(query_str=query)

        vector_nodes = self.vector_retriever.retrieve(query_bundle)
        keyword_nodes = self.keyword_retriever.retrieve(query_bundle)

        vector_ids = {n.node.node_id for n in vector_nodes}
        keyword_ids = {n.node.node_id for n in keyword_nodes}

        if self.mode == "AND":
            combined_ids = vector_ids.intersection(keyword_ids)
        else:
            combined_ids = vector_ids.union(keyword_ids)

        combined_dict = {n.node.node_id: n for n in vector_nodes}
        combined_dict.update({n.node.node_id: n for n in keyword_nodes})

        return [combined_dict[nid] for nid in combined_ids]

# Usage
vector_index = VectorStoreIndex.from_documents(documents)
keyword_index = SimpleKeywordTableIndex.from_documents(documents)

hybrid_retriever = HybridRetriever(vector_index, keyword_index)
nodes = hybrid_retriever.retrieve("search query")
```

### Query Fusion Retriever

```python
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import QueryFusionRetriever, VectorIndexRetriever
from llama_index.core.postprocessor import SentenceTransformerRerank

# Create retrievers
vector_retriever = VectorIndexRetriever(
    index=vector_index,
    similarity_top_k=20,
)

# Fusion retriever with query expansion
fusion_retriever = QueryFusionRetriever(
    retrievers=[vector_retriever],
    similarity_top_k=10,
    num_queries=3,  # Generate 3 query variations
    mode="reciprocal_rerank",
    use_async=True,
)

# Add reranker
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_n=5,
)

# Use in query engine
query_engine = RetrieverQueryEngine(
    retriever=fusion_retriever,
    node_postprocessors=[reranker],
)
```

### Auto-Merging Retriever

```python
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core.storage.docstore import SimpleDocumentStore

# Create hierarchical nodes
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128],
)
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

# Store the full hierarchy so parent nodes can be merged back in
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)
storage_context = StorageContext.from_defaults(docstore=docstore)

# Index only the leaf nodes
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

# Create auto-merging retriever
auto_merging_retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=6),
    storage_context=storage_context,
    verbose=True,
)

# Use
query_engine = RetrieverQueryEngine(retriever=auto_merging_retriever)
```
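### Metadata Filtering

Retrieval can also be constrained by document metadata before any ranking happens. A minimal sketch, assuming the `vector_index` from the hybrid search example and documents indexed with a `source` metadata key (as in the manual `Document` example earlier):

```python
from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter

# Only consider nodes whose metadata contains source == "manual"
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="source", value="manual")],
)

filtered_engine = vector_index.as_query_engine(
    similarity_top_k=5,
    filters=filters,
)
response = filtered_engine.query("What does the manually added document say?")
```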
## Rerankers

### SentenceTransformer Reranker

```python
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.core import VectorStoreIndex

# Create reranker
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_n=5,
)

# Use in query engine
query_engine = index.as_query_engine(
    similarity_top_k=20,              # Retrieve more
    node_postprocessors=[reranker],   # Then rerank
)

response = query_engine.query("Your query")
```

### Cohere Reranker

```python
import os

# Requires: pip install llama-index-postprocessor-cohere-rerank
from llama_index.postprocessor.cohere_rerank import CohereRerank

reranker = CohereRerank(
    api_key=os.environ["COHERE_API_KEY"],
    top_n=5,
    model="rerank-english-v3.0",
)

query_engine = index.as_query_engine(
    similarity_top_k=20,
    node_postprocessors=[reranker],
)
```

### LLM Reranker

```python
from llama_index.core.postprocessor import LLMRerank
from llama_index.llms.openai import OpenAI

reranker = LLMRerank(
    top_n=5,
    llm=OpenAI(model="gpt-4o", temperature=0.0),
)

query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[reranker],
)
```

## Query Engines

### Router Query Engine

```python
from llama_index.core import SummaryIndex, VectorStoreIndex
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.selectors import LLMSingleSelector

# Create multiple query engines
summary_index = SummaryIndex.from_documents(documents)
vector_index = VectorStoreIndex.from_documents(documents)
summary_engine = summary_index.as_query_engine()
vector_engine = vector_index.as_query_engine()

# Create tools
query_engine_tools = [
    QueryEngineTool(
        query_engine=summary_engine,
        metadata=ToolMetadata(
            name="summary_tool",
            description="Useful for summarizing documents",
        ),
    ),
    QueryEngineTool(
        query_engine=vector_engine,
        metadata=ToolMetadata(
            name="vector_tool",
            description="Useful for specific questions about documents",
        ),
    ),
]

# Create router
router_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=query_engine_tools,
    verbose=True,
)

response = router_engine.query("What is the main topic?")
```

### Sub-Question Query Engine

```python
from llama_index.core.query_engine import SubQuestionQueryEngine

# Create sub-question engine
sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    use_async=True,
    verbose=True,
)

# Automatically decomposes into sub-questions
response = sub_question_engine.query(
    "Compare the revenue growth of Company A and Company B"
)
```

### Multi-Step Query Engine

```python
from llama_index.core.query_engine import MultiStepQueryEngine
from llama_index.core.indices.query.query_transform.base import (
    StepDecomposeQueryTransform,
)
from llama_index.llms.openai import OpenAI

# LLM-driven decomposition of the question into sequential steps
step_decompose_transform = StepDecomposeQueryTransform(
    llm=OpenAI(model="gpt-4o"),
    verbose=True,
)

# Create multi-step engine
multi_step_engine = MultiStepQueryEngine(
    query_engine=vector_index.as_query_engine(),
    query_transform=step_decompose_transform,
    index_summary="Useful for answering questions about the indexed documents",
)

# Breaks down complex questions into intermediate queries
response = multi_step_engine.query(
    "What factors contributed to the market cap change?"
)
```
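### Response Modes

How a query engine turns retrieved nodes into an answer is controlled by its response mode. A brief illustration of the commonly documented modes, using the `vector_index` from the router example:

```python
# "compact" (the default): pack as many chunks per LLM call as fit
query_engine = vector_index.as_query_engine(response_mode="compact")

# "refine": answer with the first chunk, then refine with each following chunk
query_engine = vector_index.as_query_engine(response_mode="refine")

# "tree_summarize": recursively summarize chunks; good for summary questions
query_engine = vector_index.as_query_engine(response_mode="tree_summarize")

# "no_text": skip the LLM call and return only the retrieved source nodes
query_engine = vector_index.as_query_engine(response_mode="no_text")
response = query_engine.query("What is the main topic?")
print(response.source_nodes)
```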
## Agents

### ReAct Agent

```python
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

# Define tools
def add(x: int, y: int) -> int:
    """Add two numbers."""
    return x + y

def multiply(x: int, y: int) -> int:
    """Multiply two numbers."""
    return x * y

def search_knowledge(query: str) -> str:
    """Search the knowledge base."""
    response = query_engine.query(query)
    return str(response)

tools = [
    FunctionTool.from_defaults(add),
    FunctionTool.from_defaults(multiply),
    FunctionTool.from_defaults(search_knowledge),
]

# Create agent
agent = ReActAgent.from_tools(
    tools,
    llm=OpenAI(model="gpt-4o", temperature=0.0),
    max_iterations=10,
    verbose=True,
)

# Run
response = agent.chat("What is 20 + (5 * 3)?")
print(response)
```

### Function Calling Agent

```python
from llama_index.core.agent import FunctionCallingAgent

# Use for models with native function calling
agent = FunctionCallingAgent.from_tools(
    tools,
    llm=OpenAI(model="gpt-4o"),
    verbose=True,
)

response = agent.chat("Calculate and search")
```

### Custom Tools

```python
from typing import List, Optional

from pydantic import BaseModel, Field

from llama_index.core.tools import BaseTool, ToolMetadata, ToolOutput

# Pydantic input schema
class SearchInput(BaseModel):
    query: str = Field(description="Search query")
    limit: Optional[int] = Field(default=10, description="Max results")
    category: Optional[str] = Field(default=None, description="Filter category")

class CustomSearchTool(BaseTool):
    @property
    def metadata(self) -> ToolMetadata:
        return ToolMetadata(
            name="custom_search",
            description="Search with structured parameters",
            fn_schema=SearchInput,
        )

    def __call__(self, query: str, limit: int = 10, category: Optional[str] = None) -> ToolOutput:
        results = self._search(query, limit, category)
        return ToolOutput(
            content="\n".join(results),
            tool_name=self.metadata.name,
            raw_input={"query": query, "limit": limit, "category": category},
            raw_output=results,
        )

    def _search(self, query: str, limit: int, category: Optional[str]) -> List[str]:
        # Placeholder implementation
        return ["result1", "result2"]

# Usage
tool = CustomSearchTool()
agent = ReActAgent.from_tools([tool], llm=llm)
```

### Query Engine as Tool

```python
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Wrap query engine as tool
query_tool = QueryEngineTool(
    query_engine=index.as_query_engine(),
    metadata=ToolMetadata(
        name="knowledge_base",
        description="Search the knowledge base for information",
    ),
)

# Use in agent
agent = ReActAgent.from_tools(
    [query_tool, other_tool],
    llm=OpenAI(model="gpt-4o"),
)
```

## Streaming

### Streaming Responses

```python
from llama_index.core import VectorStoreIndex

# Enable streaming
query_engine = index.as_query_engine(streaming=True)

# Stream response
response = query_engine.query("What is the main topic?")

# Print tokens as they arrive
for token in response.response_gen:
    print(token, end="", flush=True)
```

### Async Streaming

```python
import asyncio

async def stream_query(query: str):
    streaming_response = await query_engine.aquery(query)

    content = ""
    async for token in streaming_response.async_response_gen():
        print(token, end="", flush=True)
        content += token

    return content

# Run
asyncio.run(stream_query("Your query"))
```

### Streaming Chat

```python
from llama_index.core.memory import ChatMemoryBuffer

# Create chat engine with memory
memory = ChatMemoryBuffer.from_defaults(token_limit=1500)

chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
    streaming=True,
)

# Stream chat
response = chat_engine.stream_chat("Tell me about topic X")
for token in response.response_gen:
    print(token, end="", flush=True)

# Continue conversation
response = chat_engine.stream_chat("Can you elaborate?")
for token in response.response_gen:
    print(token, end="", flush=True)
```
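### Streaming Agent Responses

The classic agent interface can stream its final answer as well. A minimal sketch, assuming the ReAct `agent` built in the Agents section; note that tokens only start flowing once the agent's reasoning loop reaches its final response, and the exact streaming surface can differ between agent implementations and versions:

```python
# stream_chat returns a streaming chat response whose response_gen
# yields the final answer token by token
streaming_response = agent.stream_chat("What does the knowledge base say about topic X?")

for token in streaming_response.response_gen:
    print(token, end="", flush=True)
```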
### FastAPI Streaming

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/stream")
async def stream_response(query: str):
    async def generate():
        streaming_response = await query_engine.aquery(query)
        async for token in streaming_response.async_response_gen():
            yield f"data: {token}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
    )
```

## Evaluation

### Faithfulness Evaluation

```python
from llama_index.core.evaluation import FaithfulnessEvaluator
from llama_index.llms.openai import OpenAI

# Create evaluator
llm = OpenAI(model="gpt-4o", temperature=0.0)
evaluator = FaithfulnessEvaluator(llm=llm)

# Evaluate response
query_engine = index.as_query_engine()
response = query_engine.query("What are the key points?")

eval_result = evaluator.evaluate_response(response=response)
print(f"Passing: {eval_result.passing}")
print(f"Feedback: {eval_result.feedback}")
```

### Relevancy Evaluation

```python
from llama_index.core.evaluation import RelevancyEvaluator

evaluator = RelevancyEvaluator(llm=llm)

eval_result = evaluator.evaluate_response(
    query="What is the main topic?",
    response=response,
)

print(f"Passing: {eval_result.passing}")
print(f"Score: {eval_result.score}")
```

### Ragas Integration

```python
# Ragas ships its own LlamaIndex integration (pip install ragas).
# The exact entry point and dataset format vary between ragas versions;
# this sketch follows the ragas.integrations.llama_index interface and
# should be checked against the installed version's documentation.
from ragas.integrations.llama_index import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Evaluation questions (and optional reference answers) for the query engine
eval_dataset = {
    "question": ["What is the main topic?"],
    "ground_truth": ["A short reference answer"],
}

result = evaluate(
    query_engine=query_engine,
    metrics=[faithfulness, answer_relevancy],
    dataset=eval_dataset,
)
print(result)
```

### Batch Evaluation

```python
from tqdm import tqdm

def batch_evaluate(queries, responses, evaluator):
    results = []

    for query, response in tqdm(zip(queries, responses), total=len(queries)):
        result = evaluator.evaluate_response(
            query=query,
            response=response,
        )
        results.append(result)

    passing_rate = sum(1 for r in results if r.passing) / len(results)
    scores = [r.score for r in results if r.score is not None]
    avg_score = sum(scores) / len(scores) if scores else 0.0

    return {
        "passing_rate": passing_rate,
        "average_score": avg_score,
        "results": results,
    }
```

## Observability

### Callbacks and Token Tracking

```python
import tiktoken

from llama_index.core import Settings
from llama_index.core.callbacks import (
    CallbackManager,
    LlamaDebugHandler,
    TokenCountingHandler,
)

# Create handlers
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-4o").encode,
)
debug_handler = LlamaDebugHandler(print_trace_on_end=True)

# Set callback manager
Settings.callback_manager = CallbackManager([token_counter, debug_handler])

# Run query
response = query_engine.query("Your query")

# Get token counts
print(f"Total LLM tokens: {token_counter.total_llm_token_count}")
print(f"Embedding tokens: {token_counter.total_embedding_token_count}")
```

### LlamaIndex Debugging

```python
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler

debug_handler = LlamaDebugHandler()
Settings.callback_manager = CallbackManager([debug_handler])

# Run query
response = query_engine.query("Your query")

# Get start/end event pairs recorded during the query
events = debug_handler.get_event_pairs()
for start_event, end_event in events:
    print(f"Event: {start_event.event_type}")
    print(f"Start: {start_event.time}  End: {end_event.time}")
```

### Global Handlers

LlamaIndex does not pick up LangChain's `LANGCHAIN_TRACING_V2` environment variables; tracing integrations are enabled explicitly with `set_global_handler`, and third-party handlers require their own integration packages.

```python
from llama_index.core import set_global_handler

# The built-in "simple" handler prints LLM prompts and responses to stdout.
# Other handler names (e.g. "arize_phoenix", "langfuse", "wandb") need the
# corresponding llama-index integration package installed.
set_global_handler("simple")

response = query_engine.query("Your query")
```
## Chat Engines

### Basic Chat

```python
from llama_index.core import VectorStoreIndex

# Create chat engine
chat_engine = index.as_chat_engine(
    chat_mode="condense_question",
    verbose=True,
)

# Chat
response = chat_engine.chat("Tell me about X")
print(response)

response = chat_engine.chat("Can you elaborate?")
print(response)
```

### Chat with Memory

```python
from llama_index.core.memory import ChatMemoryBuffer

# Create memory
memory = ChatMemoryBuffer.from_defaults(token_limit=2000)

chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
    system_prompt="You are a helpful assistant.",
    verbose=True,
)

# Chat maintains context
response = chat_engine.chat("What is X?")
response = chat_engine.chat("Give me more details")

# Reset memory
chat_engine.reset()
```

### Chat Modes

```python
# condense_question - Condenses chat history + question into a standalone query
chat_engine = index.as_chat_engine(chat_mode="condense_question")

# context - Uses retrieved context from the index
chat_engine = index.as_chat_engine(chat_mode="context")

# condense_plus_context - Both condense and context
chat_engine = index.as_chat_engine(chat_mode="condense_plus_context")

# react - ReAct agent mode
chat_engine = index.as_chat_engine(chat_mode="react", verbose=True)

# openai - OpenAI function calling
chat_engine = index.as_chat_engine(chat_mode="openai")
```

## Common Issues

### Chunk Size Problems

```python
# ❌ BAD: Too small chunks lose context
splitter = SentenceSplitter(chunk_size=64)

# ❌ BAD: Too large chunks dilute relevance
splitter = SentenceSplitter(chunk_size=8192)

# ✅ GOOD: Balanced chunk size
splitter = SentenceSplitter(
    chunk_size=512,    # For embeddings
    chunk_overlap=50,  # ~10% overlap
)
```

### Retrieval Quality

```python
# ❌ BAD: No reranking, few results
query_engine = index.as_query_engine(similarity_top_k=3)

# ✅ GOOD: Retrieve more, then rerank
query_engine = index.as_query_engine(
    similarity_top_k=20,
    node_postprocessors=[
        SentenceTransformerRerank(top_n=5),
    ],
)
```

### Memory Issues

```python
# ❌ BAD: Load all documents at once
documents = SimpleDirectoryReader("./huge_folder").load_data()

# ✅ GOOD: Process in batches, inserting into the index incrementally
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
index = VectorStoreIndex([])  # start empty

reader = SimpleDirectoryReader("./huge_folder", recursive=True)
for batch in reader.iter_data():  # yields documents one file at a time
    nodes = splitter.get_nodes_from_documents(batch)
    index.insert_nodes(nodes)
```

### Embedding Dimension Mismatch

```python
# ❌ BAD: Index created with a different embedding dimension
# Pinecone index: 1536 dimensions
# Using: 768-dimension embeddings

# ✅ GOOD: Match dimensions
embed_model = OpenAIEmbedding(model="text-embedding-3-small")  # 1536 dims
# Create the Pinecone index with dimension=1536 as well
```

## Best Practices

1. **Use appropriate chunk sizes** (512-1024 for most use cases)
2. **Always add overlap** (5-10% of chunk size)
3. **Use rerankers** for better retrieval quality
4. **Enable streaming** for better UX
5. **Use async** for parallel queries (see the sketch after this list)
6. **Implement caching** for repeated queries (see the sketch after this list)
7. **Monitor token usage** with callbacks
8. **Test with evaluation** before production
9. **Use hybrid search** for better recall
10. **Keep the context window** in mind for chat
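Items 5 and 6 have no example elsewhere in this guide. A minimal sketch of both, assuming the `query_engine` from the Quick Start; the cache here is a plain in-process dict rather than a LlamaIndex feature:

```python
import asyncio

# 5. Run independent queries concurrently with the async API
async def run_parallel(queries):
    responses = await asyncio.gather(*(query_engine.aquery(q) for q in queries))
    return {q: str(r) for q, r in zip(queries, responses)}

results = asyncio.run(run_parallel([
    "What is the main topic?",
    "Who are the key people mentioned?",
]))

# 6. Cache repeated queries in-process (swap for Redis/diskcache in production)
_query_cache = {}

def cached_query(question: str) -> str:
    if question not in _query_cache:
        _query_cache[question] = str(query_engine.query(question))
    return _query_cache[question]
```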
## Resources

- **Documentation:** https://docs.llamaindex.ai/
- **GitHub:** https://github.com/run-llama/llama_index
- **Examples:** https://github.com/run-llama/llama_index/tree/main/docs/examples
- **Discord:** https://discord.gg/dGcwcsnxhU
- **Blog:** https://blog.llamaindex.ai/

## Quick Reference

### Common Imports

```python
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Document,
    Settings,
    StorageContext,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
```

### Common Patterns

```python
# Basic RAG
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("query")

# With reranking
query_engine = index.as_query_engine(
    similarity_top_k=20,
    node_postprocessors=[reranker],
)

# Streaming
query_engine = index.as_query_engine(streaming=True)
for token in query_engine.query("query").response_gen:
    print(token)

# Chat
chat_engine = index.as_chat_engine()
response = chat_engine.chat("message")

# Agent
agent = ReActAgent.from_tools(tools, llm=llm)
response = agent.chat("message")
```