---
name: gemini-embeddings
description: Generate text embeddings using Gemini Embedding API via scripts/. Use for creating vector representations of text, semantic search, similarity matching, clustering, and RAG applications. Triggers on "embeddings", "semantic search", "vector search", "text similarity", "RAG", "retrieval".
license: MIT
version: 1.0.0
keywords: embeddings, semantic search, vector, similarity, clustering, RAG, retrieval, cosine similarity, gemini-embedding-001
---

# Gemini Embeddings

Generate high-quality text embeddings for semantic search, similarity analysis, clustering, and RAG (Retrieval Augmented Generation) applications through executable scripts.

## When to Use This Skill

Use this skill when you need to:
- Find semantically similar documents or texts
- Build semantic search engines
- Implement RAG (Retrieval Augmented Generation)
- Cluster or group similar documents
- Calculate text similarity scores
- Power recommendation systems
- Enable semantic document retrieval
- Create vector databases for AI applications

## Available Scripts

### scripts/embed.py

**Purpose**: Generate embeddings and calculate similarity

**When to use**:
- Creating vector representations of text
- Comparing text similarity
- Building semantic search systems
- Implementing RAG pipelines
- Clustering documents

**Key parameters**:

| Parameter | Description | Example |
|-----------|-------------|---------|
| `texts` | Text(s) to embed (required) | `"Your text here"` |
| `--model`, `-m` | Embedding model | `gemini-embedding-001` |
| `--task`, `-t` | Task type | `SEMANTIC_SIMILARITY` |
| `--dim`, `-d` | Output dimensionality | `768`, `1536`, `3072` |
| `--similarity`, `-s` | Calculate pairwise similarity | Flag |
| `--json`, `-j` | Output as JSON | Flag |

**Output**: Embedding vectors or similarity scores

## Workflows

### Workflow 1: Single Text Embedding

```bash
python scripts/embed.py "What is the meaning of life?"
```

- Best for: Basic embedding generation
- Output: Vector with 3072 dimensions (default)
- Use when: Storing single document vectors

### Workflow 2: Semantic Search

```bash
# 1. Generate embedding for query
python scripts/embed.py "best practices for coding" --task RETRIEVAL_QUERY --json > query.json

# 2. Generate embeddings for documents (batch)
python scripts/embed.py "Coding best practices include version control" "Clean code is essential" --task RETRIEVAL_DOCUMENT --json > docs.json

# 3. Compare and find most similar (calculate similarity separately)
```

- Best for: Building search functionality
- Task types: `RETRIEVAL_QUERY`, `RETRIEVAL_DOCUMENT`
- Combines with: Similarity calculation for ranking

### Workflow 3: Text Similarity Comparison

```bash
python scripts/embed.py "What is the meaning of life?" "What is the purpose of existence?" "How do I bake a cake?" --similarity
```

- Best for: Comparing multiple texts, finding duplicates
- Output: Pairwise similarity scores (typically 0-1)
- Use when: You need to rank texts by similarity

### Workflow 4: Dimensionality Reduction for Efficiency

```bash
python scripts/embed.py "Text to embed" --dim 768
```

- Best for: Faster storage and comparison
- Options: `768`, `1536`, or `3072` (default)
- Trade-off: Fewer dimensions = lower accuracy but faster

### Workflow 5: Document Clustering

```bash
# 1. Generate embeddings for multiple documents
python scripts/embed.py "Machine learning is AI" "Deep learning is a subset" "Neural networks power AI" --task CLUSTERING --json > embeddings.json

# 2. Process embeddings with clustering algorithm (your code)
# Use scikit-learn, KMeans, etc.
```

- Best for: Grouping similar documents, topic discovery
- Task type: `CLUSTERING`
- Combines with: Clustering libraries (scikit-learn)

### Workflow 6: RAG Implementation

```bash
# 1. Create document embeddings (one-time setup)
python scripts/embed.py "Document 1 content" "Document 2 content" --task RETRIEVAL_DOCUMENT --dim 1536

# 2. For each query, find similar documents (use the same --dim as the documents)
python scripts/embed.py "User query here" --task RETRIEVAL_QUERY --dim 1536

# 3. Use retrieved documents in prompt to LLM (gemini-text)
python skills/gemini-text/scripts/generate.py "Context: [retrieved docs]. Answer: [user query]"
```

- Best for: Building knowledge-based AI systems
- Combines with: gemini-text for generation with context

### Workflow 7: JSON Output for API Integration

```bash
python scripts/embed.py "Text to process" --json
```

- Best for: API responses, database storage
- Output: JSON array of embedding vectors
- Use when: Programmatic processing required

### Workflow 8: Batch Document Processing

```bash
# 1. Create JSONL with documents
echo '{"text": "Document 1"}' > docs.jsonl
echo '{"text": "Document 2"}' >> docs.jsonl

# 2. Process with script or custom code
python3 << 'EOF'
import json

from google import genai
from google.genai import types

# Reads the API key from the GEMINI_API_KEY (or GOOGLE_API_KEY) environment variable
client = genai.Client()

# Load one document per JSONL line
texts = []
with open("docs.jsonl") as f:
    for line in f:
        texts.append(json.loads(line)["text"])

# task_type is passed through EmbedContentConfig, not as a direct keyword argument
response = client.models.embed_content(
    model="gemini-embedding-001",
    contents=texts,
    config=types.EmbedContentConfig(task_type="RETRIEVAL_DOCUMENT"),
)

embeddings = [e.values for e in response.embeddings]
print(f"Generated {len(embeddings)} embeddings")
EOF
```

- Best for: Large document collections
- Combines with: Vector databases (Pinecone, Weaviate)

## Parameters Reference

### Task Types

| Task Type | Best For | When to Use |
|-----------|----------|-------------|
| `SEMANTIC_SIMILARITY` | Comparing text similarity | General comparison tasks |
| `RETRIEVAL_DOCUMENT` | Embedding documents | Storing documents for retrieval |
| `RETRIEVAL_QUERY` | Embedding search queries | Finding similar documents |
| `CLASSIFICATION` | Text classification | Categorizing text |
| `CLUSTERING` | Grouping similar texts | Document clustering |

### Dimensionality Options

| Dimensions | Use Case | Trade-off |
|------------|----------|-----------|
| 768 | High-volume, real-time | Lower accuracy, faster |
| 1536 | Balanced performance | Good accuracy/speed balance |
| 3072 | Highest accuracy | Slower, more storage |

### Similarity Scores

| Score | Interpretation |
|-------|---------------|
| 0.8 - 1.0 | Very similar (likely duplicates) |
| 0.6 - 0.8 | Highly related (same topic) |
| 0.4 - 0.6 | Moderately related |
| 0.2 - 0.4 | Weakly related |
| 0.0 - 0.2 | Unrelated |

## Output Interpretation

### Embedding Vector

- Format: List of float values (768, 1536, or 3072)
- Range: Typically -1.0 to 1.0
- The API normalizes full 3072-dimension vectors; reduced-dimension vectors (768, 1536) should be re-normalized before comparison
- Can be stored in vector databases

### Similarity Output

```
Pairwise Similarity:
  'What is the meaning of life?...' <-> 'What is the purpose of existence?...': 0.8742
  'What is the meaning of life?...' <-> 'How do I bake a cake?...': 0.1234
```

- Higher scores = more similar
- Use a threshold (e.g., 0.7) for matching

### JSON Output

```json
[[0.123, -0.456, 0.789, ...], [0.234, -0.567, 0.890, ...]]
```

- Array of embedding vectors
- One per input text
- Ready for database storage

## Common Issues

### "google-genai not installed"

```bash
pip install google-genai numpy
```

### "numpy not installed" (for similarity)

```bash
pip install numpy
```

### "Invalid task type"

- Use available tasks: SEMANTIC_SIMILARITY, RETRIEVAL_DOCUMENT, RETRIEVAL_QUERY, CLASSIFICATION, CLUSTERING
- Check spelling (case-sensitive)
- Use the correct task for your use case

### "Invalid dimension"

- Options: 768, 1536, or 3072 only
- Check that the model supports the requested dimension
- Default to 3072 if unsure

### "No similarity calculated"

- Similarity comparison needs multiple texts
- Use the `--similarity` flag
- Check that at least 2 texts are provided

### "Embedding size mismatch"

- All embeddings must have the same dimensionality
- Use a consistent `--dim` parameter
- Recompute if dimensions differ

## Best Practices

### Task Selection

- **SEMANTIC_SIMILARITY**: General text comparison
- **RETRIEVAL_DOCUMENT**: Storing documents for search
- **RETRIEVAL_QUERY**: Querying for similar documents
- **CLASSIFICATION**: Categorization tasks
- **CLUSTERING**: Grouping similar content

### Dimensionality Choice

- **768**: Real-time applications, high volume
- **1536**: Balanced choice for most use cases
- **3072**: Maximum accuracy, offline processing

### Performance Optimization

- Use lower dimensions for speed
- Batch multiple texts in one request
- Cache embeddings for repeated queries
- Precompute document embeddings for search

### Storage Tips

- Use vector databases (Pinecone, Weaviate, Chroma)
- Normalize vectors for consistent comparison
- Store metadata with embeddings
- Index for fast retrieval

### RAG Implementation

- Precompute document embeddings
- Use RETRIEVAL_DOCUMENT for docs
- Use RETRIEVAL_QUERY for user questions
- Combine top results with gemini-text

### Similarity Thresholds

- **0.9+**: Exact duplicates or near-duplicates
- **0.7-0.9**: Same topic/subject
- **0.5-0.7**: Related concepts
- **<0.5**: Different topics

## Related Skills

- **gemini-text**: Generate text with retrieved context (RAG)
- **gemini-batch**: Process embeddings in bulk
- **gemini-files**: Upload documents for embedding
- **gemini-search**: Implement semantic search (if available)

## Quick Reference

```bash
# Basic embedding
python scripts/embed.py "Your text here"

# Semantic search
python scripts/embed.py "Query" --task RETRIEVAL_QUERY

# Document embedding
python scripts/embed.py "Document text" --task RETRIEVAL_DOCUMENT

# Similarity comparison
python scripts/embed.py "Text 1" "Text 2" "Text 3" --similarity

# Dimensionality reduction
python scripts/embed.py "Text" --dim 768

# JSON output
python scripts/embed.py "Text" --json
```

## Reference

- Get API key: https://aistudio.google.com/apikey
- Documentation: https://ai.google.dev/gemini-api/docs/embeddings
- Vector databases: Pinecone, Weaviate, Chroma, Qdrant
- Cosine similarity: Standard for embedding comparison
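
Step 3 of Workflow 2 leaves the similarity calculation to the reader. The following is a minimal sketch of how that ranking step might look, using only the standard library. The helper names `load_vectors`, `cosine_similarity`, and `rank_documents` are hypothetical and not part of `scripts/embed.py`, and the sketch assumes the saved files hold the JSON array of vectors shown under JSON Output.

```python
import json
import math


def load_vectors(path):
    """Load a JSON array of embedding vectors (assumed shape: [[...], [...]])."""
    with open(path) as f:
        return json.load(f)


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (document index, score) pairs for the top_k documents, best first."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional vectors stand in for real embeddings here.
toy_query = [1.0, 0.0, 0.0]
toy_docs = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ranking = rank_documents(toy_query, toy_docs, top_k=2)
print(ranking)
```

With real data, the query vector would come from `load_vectors("query.json")[0]` and the document vectors from `load_vectors("docs.json")`, provided both files were generated with `--json` and the same `--dim`. Because cosine similarity divides by both vector lengths, the ranking is unaffected by whether the stored vectors are unit-normalized.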