--- name: document-chat-interface description: Build chat interfaces for querying documents using natural language. Extract information from PDFs, GitHub repositories, emails, and other sources. Use when creating interactive document Q&A systems, knowledge base chatbots, email search interfaces, or document exploration tools. --- # Document Chat Interface Build intelligent chat interfaces that allow users to query and interact with documents using natural language, transforming static documents into interactive knowledge sources. ## Overview A document chat interface combines three capabilities: 1. **Document Processing** - Extract and prepare documents 2. **Semantic Understanding** - Understand questions and find relevant content 3. **Conversational Interface** - Maintain context and provide natural responses ### Common Applications - **PDF Q&A**: Answer questions about research papers, reports, books - **Email Search**: Find information in email archives conversationally - **GitHub Explorer**: Ask questions about code repositories - **Knowledge Base**: Interactive access to company documentation - **Contract Review**: Query legal documents with natural language - **Research Assistant**: Explore academic papers interactively ## Architecture ``` Document Source ↓ Document Processor ├→ Extract text ├→ Process content └→ Generate embeddings ↓ Vector Database ↓ Chat Interface ← User Question ├→ Retrieve relevant content ├→ Maintain conversation history └→ Generate response ``` ## Core Components ### 1. Document Sources See [examples/document_processors.py](examples/document_processors.py) for implementations: #### PDF Documents - Extract text from PDF pages - Preserve document structure and metadata - Handle scanned PDFs with OCR (pytesseract) - Extract tables (pdfplumber) #### GitHub Repositories - Extract code files from repositories - Parse repository structure - Process multiple file types #### Email Archives - Extract email metadata (from, to, subject, date) - Parse email body content - Handle multiple mailbox formats #### Web Pages - Extract page text and structure - Preserve heading hierarchy - Extract links and navigation #### YouTube/Audio - Get transcripts from YouTube videos - Transcribe audio files - Handle multiple formats ### 2. Document Processing See [examples/text_processor.py](examples/text_processor.py) for implementations: #### Text Extraction & Cleaning - Remove extra whitespace and special characters - Smart text chunking with overlap - Intelligent sentence boundary detection #### Metadata Extraction - Extract title, author, date, language - Calculate word count and document statistics - Track document source and format #### Structure Preservation - Keep heading hierarchy in chunks - Preserve section context - Enable hierarchical retrieval ### 3. Chat Interface Design See [examples/conversation_manager.py](examples/conversation_manager.py) for implementations: #### Conversation Management - Maintain conversation history with size limits - Track message metadata (timestamps, roles) - Provide context for LLM integration - Clear history as needed #### Question Refinement - Expand implicit references in questions - Handle pronouns and context references - Improve question clarity with previous context #### Response Generation - Use document context for answering - Maintain conversation history in prompts - Provide source citations - Handle out-of-scope questions ### 4. User Experience Features #### Citation & Sources ```python def format_response_with_citations(response: str, sources: List[Dict]) -> str: """Add source citations to response""" formatted = response + "\n\n**Sources:**\n" for i, source in enumerate(sources, 1): formatted += f"[{i}] Page {source['page']} of {source['source']}\n" if 'excerpt' in source: formatted += f" \"{source['excerpt'][:100]}...\"\n" return formatted ``` #### Clarifying Questions ```python def generate_follow_up_questions(context: str, response: str) -> List[str]: """Suggest follow-up questions to user""" prompt = f""" Based on this Q&A, generate 3 relevant follow-up questions: Context: {context[:500]} Response: {response[:500]} """ follow_ups = llm.generate(prompt) return follow_ups ``` #### Error Handling ```python def handle_query_failure(question: str, error: Exception) -> str: """Handle when no relevant documents found""" if isinstance(error, NoRelevantDocuments): return ( "I couldn't find information about that in the documents. " "Try asking about different topics like: " + ", ".join(get_main_topics()) ) elif isinstance(error, ContextTooLarge): return ( "The answer requires too much context. " "Can you be more specific about what you'd like to know?" ) else: return f"I encountered an issue: {str(error)[:100]}" ``` ## Implementation Frameworks ### Using LangChain ```python from langchain.document_loaders import PDFLoader from langchain.text_splitter import CharacterTextSplitter from langchain.embeddings import OpenAIEmbeddings from langchain.vectorstores import Chroma from langchain.chat_models import ChatOpenAI from langchain.chains import ConversationalRetrievalChain # Load document loader = PDFLoader("document.pdf") documents = loader.load() # Split into chunks splitter = CharacterTextSplitter(chunk_size=1000) chunks = splitter.split_documents(documents) # Create embeddings embeddings = OpenAIEmbeddings() vectorstore = Chroma.from_documents(chunks, embeddings) # Create chat chain llm = ChatOpenAI(model="gpt-4") qa = ConversationalRetrievalChain.from_llm( llm=llm, retriever=vectorstore.as_retriever(), return_source_documents=True ) # Chat interface chat_history = [] while True: question = input("You: ") result = qa({"question": question, "chat_history": chat_history}) print(f"Assistant: {result['answer']}") chat_history.append((question, result['answer'])) ``` ### Using LlamaIndex ```python from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader, ChatMemoryBuffer from llama_index.llms import ChatMessage, MessageRole # Load documents documents = SimpleDirectoryReader("./docs").load_data() # Create index index = GPTVectorStoreIndex.from_documents(documents) # Create chat engine with memory chat_engine = index.as_chat_engine( memory=ChatMemoryBuffer.from_defaults(token_limit=3900), llm="gpt-4" ) # Chat loop while True: question = input("You: ") response = chat_engine.chat(question) print(f"Assistant: {response}") ``` ### Using RAG-Based Approach ```python from sentence_transformers import SentenceTransformer import faiss import numpy as np # Load and embed documents model = SentenceTransformer('all-MiniLM-L6-v2') documents = load_documents("document.pdf") embeddings = model.encode(documents) # Create FAISS index dimension = embeddings.shape[1] index = faiss.IndexFlatL2(dimension) index.add(np.array(embeddings).astype('float32')) # Chat function def chat(question): # Embed question q_embedding = model.encode(question) # Retrieve documents k = 5 distances, indices = index.search( np.array([q_embedding]).astype('float32'), k ) # Get relevant documents context = " ".join([documents[i] for i in indices[0]]) # Generate response response = llm.generate( f"Context: {context}\nQuestion: {question}\nAnswer:" ) return response ``` ## Best Practices ### Document Handling - ✓ Support multiple formats (PDF, TXT, docx, etc.) - ✓ Handle large documents efficiently - ✓ Preserve document structure - ✓ Extract metadata - ✓ Handle multiple languages - ✓ Implement OCR for scanned PDFs ### Conversation Quality - ✓ Maintain conversation context - ✓ Ask clarifying questions - ✓ Cite sources - ✓ Handle ambiguity - ✓ Suggest follow-up questions - ✓ Handle out-of-scope questions ### Performance - ✓ Optimize retrieval speed - ✓ Implement caching - ✓ Handle large document sets - ✓ Batch process documents - ✓ Monitor latency - ✓ Implement pagination ### User Experience - ✓ Clear response formatting - ✓ Ability to cite sources - ✓ Document browser/explorer - ✓ Search suggestions - ✓ Query history - ✓ Export conversations ## Common Challenges & Solutions ### Challenge: Irrelevant Answers **Solutions**: - Improve retrieval (more context, better embeddings) - Validate answer against context - Ask clarifying questions - Implement confidence scoring - Use hybrid search ### Challenge: Lost Context Across Turns **Solutions**: - Maintain conversation memory - Update retrieval based on history - Summarize long conversations - Re-weight previous queries ### Challenge: Handling Long Documents **Solutions**: - Hierarchical chunking - Summarize first - Question refinement - Multi-hop retrieval - Document navigation ### Challenge: Limited Context Window **Solutions**: - Compress retrieved context - Use document summarization - Hierarchical retrieval - Focus on most relevant sections - Iterative refinement ## Advanced Features ### Multi-Document Analysis ```python def compare_documents(question: str, documents: List[str]): """Analyze and compare across multiple documents""" results = [] for doc in documents: response = query_document(doc, question) results.append({ "document": doc.name, "answer": response }) # Compare and synthesize comparison = llm.generate( f"Compare these answers: {results}" ) return comparison ``` ### Interactive Document Exploration ```python class DocumentExplorer: def __init__(self, documents): self.documents = documents def browse_by_topic(self, topic): """Find documents by topic""" pass def get_related_documents(self, doc_id): """Find similar documents""" pass def get_key_terms(self, document): """Extract key terms and concepts""" pass ``` ## Resources ### Document Processing Libraries - PyPDF: PDF handling - python-docx: Word document handling - BeautifulSoup: Web scraping - youtube-transcript-api: YouTube transcripts ### Chat Frameworks - LangChain: Comprehensive framework - LlamaIndex: Document-focused - RAG libraries: Vector DB integration ## Implementation Checklist - [ ] Choose document source(s) to support - [ ] Implement document loading and processing - [ ] Set up vector database/embeddings - [ ] Build chat interface - [ ] Implement conversation management - [ ] Add source citation - [ ] Handle edge cases (large docs, OCR, etc.) - [ ] Implement error handling - [ ] Add performance monitoring - [ ] Test with real documents - [ ] Deploy and monitor ## Getting Started 1. **Start Simple**: Single PDF, basic chat 2. **Add Features**: Multi-document, conversation history 3. **Improve Quality**: Better chunking, retrieval 4. **Scale**: Support more formats, larger documents 5. **Polish**: UX improvements, error handling