---
name: llms-generative-ai
description: LLMs, prompt engineering, RAG systems, LangChain, and AI application development
sasmp_version: "1.3.0"
bonded_agent: 06-ml-ai-engineer
bond_type: PRIMARY_BOND
skill_version: "2.0.0"
last_updated: "2025-01"
complexity: advanced
estimated_mastery_hours: 180
prerequisites: [python-programming, machine-learning]
unlocks: [deep-learning, mlops]
---

# LLMs & Generative AI

Production-grade LLM applications with prompt engineering, RAG systems, and modern AI development patterns.

## Quick Start

```python
# Production RAG system with LangChain (2024-2025)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Initialize components
llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Document processing (`raw_documents` is assumed to be loaded
# elsewhere, e.g. via a DocumentLoader)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
documents = text_splitter.split_documents(raw_documents)

# Vector store
vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximal Marginal Relevance
    search_kwargs={"k": 5, "fetch_k": 10}
)

# RAG chain
template = """Answer the question based only on the following context:

Context: {context}

Question: {question}

Answer thoughtfully and cite specific parts of the context."""

prompt = ChatPromptTemplate.from_template(template)

def format_docs(docs):
    """Collapse retrieved documents into a single context string."""
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Query
response = rag_chain.invoke("What are the key features?")
print(response)
```
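For interactive use you rarely want to wait for the full completion. LCEL chains also expose `stream()`; since the chain above ends with `StrOutputParser`, the chunks arrive as plain strings:

```python
# Stream the answer token by token instead of waiting for the full response
for chunk in rag_chain.stream("What are the key features?"):
    print(chunk, end="", flush=True)
```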
## Core Concepts

### 1. Prompt Engineering Patterns

```python
from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate

# System prompt design
system_prompt = """You are an expert data analyst assistant.

CAPABILITIES:
- Analyze data patterns and trends
- Generate SQL queries
- Explain statistical concepts

CONSTRAINTS:
- Only use information provided in the context
- Acknowledge uncertainty when relevant
- Format outputs in a clear, structured way

OUTPUT FORMAT:
- Start with a brief summary
- Use bullet points for key findings
- Include confidence level (high/medium/low)
"""

# Few-shot prompting
examples = [
    {"input": "What's the average order value?",
     "output": "```sql\nSELECT AVG(total_amount) AS avg_order_value\nFROM orders\nWHERE status = 'completed';\n```"},
    {"input": "Show top customers by revenue",
     "output": "```sql\nSELECT customer_id, SUM(total_amount) AS revenue\nFROM orders\nGROUP BY customer_id\nORDER BY revenue DESC\nLIMIT 10;\n```"},
]

example_prompt = ChatPromptTemplate.from_messages([
    ("human", "{input}"),
    ("ai", "{output}")
])

few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_prompt=example_prompt,
    examples=examples
)

# Compose system prompt + examples + user input into one chat prompt
final_prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    few_shot_prompt,
    ("human", "{input}")
])

# Chain-of-Thought prompting
cot_prompt = """Let's solve this step by step:

Question: {question}

Step 1: Identify the key components
Step 2: Break down the problem
Step 3: Apply relevant knowledge
Step 4: Synthesize the answer

Reasoning:"""

# Self-consistency (sample multiple reasoning paths, then aggregate)
# NOTE: diverse samples require temperature > 0 (the Quick Start llm uses 0)
import asyncio
from collections import Counter

def aggregate_responses(responses) -> str:
    """Placeholder aggregation: majority vote over the generated texts.
    Production systems usually parse out the final answer before voting."""
    texts = [r.content.strip() for r in responses]
    return Counter(texts).most_common(1)[0][0]

async def self_consistent_answer(question: str, n_samples: int = 5) -> str:
    responses = await asyncio.gather(*[
        llm.ainvoke(question) for _ in range(n_samples)
    ])
    return aggregate_responses(responses)
```

### 2. Advanced RAG Patterns

```python
from langchain.retrievers import ContextualCompressionRetriever, EnsembleRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_community.retrievers import BM25Retriever

# Hybrid search (dense + sparse)
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 5

chroma_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, chroma_retriever],
    weights=[0.4, 0.6]
)

# Contextual compression (strip irrelevant text from retrieved chunks)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=ensemble_retriever
)

# Parent document retriever (search small chunks, return larger parents for context)
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

store = InMemoryStore()
parent_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)

# Self-querying retriever (the LLM translates the question into a metadata filter)
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(name="source", description="Document source", type="string"),
    AttributeInfo(name="date", description="Creation date", type="date"),
    AttributeInfo(name="category", description="Document category", type="string"),
]

self_query_retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents="Technical documentation",
    metadata_field_info=metadata_field_info
)
```
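Hybrid retrieval pairs well with a reranking pass over the candidate set. A minimal sketch using a cross-encoder — this assumes the `sentence-transformers` package, and the model name is just one common public choice, not something the patterns above require:

```python
from sentence_transformers import CrossEncoder

# Cross-encoder scores (query, passage) pairs jointly; slower than
# embeddings, but much more precise for the final ordering
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs, top_n: int = 5):
    scores = reranker.predict([(query, d.page_content) for d in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

# Retrieve broadly, then keep only the best-scoring chunks
candidates = ensemble_retriever.invoke("What are the key features?")
top_docs = rerank("What are the key features?", candidates)
```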
### 3. Agents and Tool Use

```python
from langchain.agents import create_openai_functions_agent, AgentExecutor
from langchain.tools import Tool, StructuredTool
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.pydantic_v1 import BaseModel, Field
from typing import Optional

# Define tools with Pydantic schemas
class SQLQueryInput(BaseModel):
    query: str = Field(description="SQL query to execute")
    limit: Optional[int] = Field(default=100, description="Max rows to return")

def execute_sql(query: str, limit: int = 100) -> str:
    """Execute a SQL query against the database (`db` is an existing client)."""
    # Illustrative keyword check only -- real injection protection needs
    # parameterized queries and a read-only database role
    if any(kw in query.upper() for kw in ["DROP", "DELETE", "UPDATE", "INSERT"]):
        return "Error: Only SELECT queries allowed"
    result = db.execute(f"{query} LIMIT {limit}")
    return result.to_markdown()

sql_tool = StructuredTool.from_function(
    func=execute_sql,
    name="sql_executor",
    description="Execute SQL queries against the data warehouse",
    args_schema=SQLQueryInput
)

# Calculator tool
def calculate(expression: str) -> str:
    """Evaluate a mathematical expression."""
    try:
        # Restricted eval: no builtins, whitelisted names only
        allowed_names = {"abs": abs, "round": round, "sum": sum}
        return str(eval(expression, {"__builtins__": {}}, allowed_names))
    except Exception as e:
        return f"Error: {e}"

calc_tool = Tool.from_function(
    func=calculate,
    name="calculator",
    description="Evaluate mathematical expressions"
)

# Create agent (the agent prompt must include an `agent_scratchpad` placeholder)
agent_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful data analyst."),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])

tools = [sql_tool, calc_tool]
agent = create_openai_functions_agent(llm, tools, agent_prompt)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    max_iterations=5,
    early_stopping_method="force"  # return a canned stop message at the limit
)

result = agent_executor.invoke({"input": "What's the total revenue for Q4 2024?"})
```

### 4. Structured Output

```python
from langchain_core.pydantic_v1 import BaseModel, Field
from typing import List, Optional
from langchain.output_parsers import PydanticOutputParser

# Define output schema
class DataInsight(BaseModel):
    title: str = Field(description="Brief title of the insight")
    description: str = Field(description="Detailed explanation")
    confidence: float = Field(description="Confidence score 0-1")
    data_points: List[str] = Field(description="Supporting data points")
    recommendations: Optional[List[str]] = Field(description="Action items")

class AnalysisReport(BaseModel):
    summary: str = Field(description="Executive summary")
    insights: List[DataInsight] = Field(description="Key insights found")
    methodology: str = Field(description="Analysis approach used")

# Parser
parser = PydanticOutputParser(pydantic_object=AnalysisReport)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Analyze the data and provide structured insights."),
    ("human", "{input}\n\n{format_instructions}")
]).partial(format_instructions=parser.get_format_instructions())

chain = prompt | llm | parser
report: AnalysisReport = chain.invoke({"input": "Analyze Q4 sales trends"})

print(report.summary)
for insight in report.insights:
    print(f"- {insight.title}: {insight.confidence:.0%} confidence")
```
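With recent chat models you can often skip the text parser entirely: `langchain-openai`'s `ChatOpenAI` exposes `with_structured_output()`, which binds the schema through the model's native tool/JSON support. A minimal sketch against the same schema:

```python
# Alternative: let the model fill the Pydantic schema directly
structured_llm = llm.with_structured_output(AnalysisReport)
report = structured_llm.invoke("Analyze Q4 sales trends")
print(report.summary)
```

This tends to be more robust than prompt-embedded format instructions, since malformed JSON is handled at the API level rather than by string parsing.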
### 5. Evaluation and Monitoring

```python
from langchain.evaluation import load_evaluator
from langsmith import Client

# LangSmith for tracing and evaluation datasets
client = Client()

# Create evaluation dataset
examples = [
    {"input": "What is RAG?", "output": "Retrieval Augmented Generation..."},
    {"input": "How does chunking work?", "output": "Chunking splits documents..."},
]

dataset = client.create_dataset("rag-evaluation")
for ex in examples:
    client.create_example(
        inputs={"question": ex["input"]},
        outputs={"answer": ex["output"]},
        dataset_id=dataset.id
    )

# Built-in evaluators
faithfulness_evaluator = load_evaluator("labeled_criteria", criteria="correctness")
relevance_evaluator = load_evaluator("embedding_distance")

# Custom LLM-as-judge evaluator for RAG
def evaluate_rag_response(question: str, context: str, response: str) -> dict:
    """Evaluate RAG response quality."""
    # Faithfulness: is the response grounded in the context?
    faithfulness_prompt = f"""
    Context: {context}
    Response: {response}
    Is the response fully supported by the context? Score 1-5 and explain.
    """
    # Relevance: does the response answer the question?
    relevance_prompt = f"""
    Question: {question}
    Response: {response}
    Does the response adequately answer the question? Score 1-5 and explain.
    """
    # Get scores (parse_score extracts the numeric score -- see sketch below)
    faithfulness_score = llm.invoke(faithfulness_prompt)
    relevance_score = llm.invoke(relevance_prompt)
    return {
        "faithfulness": parse_score(faithfulness_score),
        "relevance": parse_score(relevance_score)
    }

# Production monitoring
from prometheus_client import Counter, Histogram

llm_requests = Counter("llm_requests_total", "Total LLM requests", ["model", "status"])
llm_latency = Histogram("llm_latency_seconds", "LLM request latency")
token_usage = Counter("llm_tokens_total", "Total tokens used", ["type"])

@llm_latency.time()
def monitored_llm_call(prompt: str) -> str:
    try:
        response = llm.invoke(prompt)
        llm_requests.labels(model="gpt-4", status="success").inc()
        token_usage.labels(type="input").inc(count_tokens(prompt))
        token_usage.labels(type="output").inc(count_tokens(response.content))
        return response.content
    except Exception:
        llm_requests.labels(model="gpt-4", status="error").inc()
        raise
```
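The snippets above reference `parse_score` and `count_tokens` without defining them; both are hypothetical helpers you supply yourself. One minimal way to write them, assuming OpenAI models and the `tiktoken` package:

```python
import re
import tiktoken

def count_tokens(text: str, model: str = "gpt-4") -> int:
    """Count tokens the way the target OpenAI model would."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def parse_score(message) -> int:
    """Pull the first 1-5 digit out of an LLM judge's reply (crude but workable)."""
    match = re.search(r"[1-5]", message.content)
    return int(match.group()) if match else 0
```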
## Tools & Technologies

| Tool | Purpose | Version (2025) |
|------|---------|----------------|
| **LangChain** | LLM application framework | 0.2+ |
| **LlamaIndex** | Data framework for LLMs | 0.10+ |
| **OpenAI API** | GPT-4, embeddings | Latest |
| **Anthropic API** | Claude models | Latest |
| **Chroma** | Vector database | 0.4+ |
| **Pinecone** | Managed vector DB | Latest |
| **LangSmith** | LLM observability | Latest |
| **Ollama** | Local LLM runtime | 0.1+ |
| **vLLM** | High-performance LLM serving | 0.3+ |

## Learning Path

### Phase 1: Foundations (Weeks 1-3)
```
Week 1: LLM concepts, tokenization, prompting basics
Week 2: OpenAI/Anthropic APIs, prompt engineering
Week 3: LangChain basics, chains, output parsers
```

### Phase 2: RAG Systems (Weeks 4-7)
```
Week 4: Embeddings, vector databases
Week 5: Document processing, chunking strategies
Week 6: Retrieval strategies (hybrid, reranking)
Week 7: Advanced RAG patterns
```

### Phase 3: Agents (Weeks 8-10)
```
Week 8: Tool calling, function calling
Week 9: Agent architectures, planning
Week 10: Multi-agent systems
```

### Phase 4: Production (Weeks 11-14)
```
Week 11: Evaluation frameworks
Week 12: Guardrails, safety
Week 13: Deployment, scaling
Week 14: Monitoring, optimization
```

## Troubleshooting Guide

### Common Failure Modes

| Issue | Symptoms | Root Cause | Fix |
|-------|----------|------------|-----|
| **Hallucination** | Incorrect facts | No grounding | Better RAG, fact-checking |
| **Context Overflow** | Truncated response | Too much context | Summarize, filter |
| **Poor Retrieval** | Irrelevant chunks | Bad embeddings/chunking | Tune chunk size, reranking |
| **Slow Response** | High latency | Large context, no cache | Streaming, caching (see below) |
| **Rate Limits** | 429 errors | Too many requests | Backoff, batch requests (see below) |
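For the last two rows, a minimal sketch of caching plus retry-with-backoff — this assumes the `tenacity` package for retries and LangChain's in-process LLM cache; swap in Redis-backed equivalents for multi-process deployments:

```python
from langchain.globals import set_llm_cache
from langchain_community.cache import InMemoryCache
from tenacity import retry, stop_after_attempt, wait_exponential

# Identical prompts are served from the cache instead of hitting the API
set_llm_cache(InMemoryCache())

@retry(stop=stop_after_attempt(5),
       wait=wait_exponential(multiplier=1, min=1, max=30))
def call_with_backoff(prompt: str):
    # Retried with exponential backoff on any exception, including 429s
    return llm.invoke(prompt)
```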
### Debug Checklist

```python
# 1. Check retrieval quality
#    (in LangChain 0.2+, retriever.invoke("query") is the preferred call;
#    "score" is only present if the retriever attaches one)
retrieved_docs = retriever.get_relevant_documents("test query")
for doc in retrieved_docs:
    print(f"Score: {doc.metadata.get('score')}")
    print(f"Content: {doc.page_content[:200]}...")

# 2. Validate the prompt renders as expected
print(prompt.format(context="test", question="test"))

# 3. Count tokens before sending
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
tokens = len(enc.encode(full_prompt))  # full_prompt: the rendered prompt string
print(f"Token count: {tokens}")

# 4. Test the LLM directly
response = llm.invoke("Simple test prompt")
print(response)

# 5. Check embeddings
embedding = embeddings.embed_query("test")
print(f"Embedding dim: {len(embedding)}")
```

## Unit Test Template

```python
import pytest
from unittest.mock import Mock, patch
from langchain_core.documents import Document
from your_rag_system import RAGPipeline, DocumentProcessor

class TestRAGPipeline:
    @pytest.fixture
    def mock_llm(self):
        llm = Mock()
        llm.invoke.return_value = "Mocked response"
        return llm

    @pytest.fixture
    def rag_pipeline(self, mock_llm):
        return RAGPipeline(llm=mock_llm)

    def test_retrieves_relevant_documents(self, rag_pipeline):
        query = "What is machine learning?"
        docs = rag_pipeline.retrieve(query)
        assert len(docs) > 0
        assert all("machine learning" in doc.page_content.lower() for doc in docs[:3])

    def test_generates_grounded_response(self, rag_pipeline, mock_llm):
        response = rag_pipeline.query("Test question")
        mock_llm.invoke.assert_called_once()
        assert response is not None

    def test_handles_empty_retrieval(self, rag_pipeline):
        with patch.object(rag_pipeline.retriever, 'get_relevant_documents', return_value=[]):
            response = rag_pipeline.query("Obscure question")
            assert "no information" in response.lower()

class TestDocumentProcessor:
    def test_chunks_documents_correctly(self):
        processor = DocumentProcessor(chunk_size=100, chunk_overlap=20)
        text = "A" * 250  # 250-character document
        chunks = processor.split(text)
        assert len(chunks) >= 2
        assert all(len(c) <= 100 for c in chunks)

    def test_preserves_metadata(self):
        processor = DocumentProcessor()
        doc = Document(page_content="Test", metadata={"source": "test.pdf"})
        chunks = processor.split_documents([doc])
        assert all(c.metadata["source"] == "test.pdf" for c in chunks)
```

## Best Practices

### Prompt Engineering

```python
# ✅ DO: Be specific and structured
prompt = """Task: Summarize the document.
Format: 3 bullet points
Constraints: Max 50 words per point
Tone: Professional"""

# ✅ DO: Include examples
# ✅ DO: Set a clear output format
# ✅ DO: Handle edge cases in the prompt

# ❌ DON'T: Write vague prompts
# ❌ DON'T: Assume the LLM knows your context
# ❌ DON'T: Trust LLM output without validation
```

### RAG Systems

```python
# ✅ DO: Tune chunk size for your domain
# ✅ DO: Use hybrid retrieval
# ✅ DO: Implement reranking
# ✅ DO: Add metadata filtering

# ❌ DON'T: Apply one-size-fits-all chunking
# ❌ DON'T: Skip evaluation
# ❌ DON'T: Ignore retrieval quality
```

## Resources

### Official Documentation
- [LangChain Docs](https://python.langchain.com/)
- [OpenAI Cookbook](https://cookbook.openai.com/)
- [Anthropic Docs](https://docs.anthropic.com/)

### Courses
- [DeepLearning.AI LangChain](https://www.deeplearning.ai/)
- [LlamaIndex Course](https://docs.llamaindex.ai/en/stable/)

### Research
- [RAG Survey Paper](https://arxiv.org/abs/2312.10997)
- [Prompt Engineering Guide](https://www.promptingguide.ai/)

## Next Skills

After mastering LLMs & Generative AI:
- → `deep-learning` - Understand transformer internals
- → `mlops` - Deploy LLM applications at scale
- → `big-data` - Process training data

---

**Skill Certification Checklist:**
- [ ] Can build production RAG systems
- [ ] Can implement effective prompt engineering
- [ ] Can create tool-using agents
- [ ] Can evaluate and monitor LLM applications
- [ ] Can optimize for latency and cost