# Migration Guide: v1.x to v2.0

This guide covers migrating RAGStack deployments from v1.x (deployed after December 24, 2025) to v2.0.

## Breaking Changes in v2.0

### Architecture Changes

| Component | v1.x | v2.0 |
|-----------|------|------|
| **S3 Prefixes** | `output/` (docs), `images/` (images) | `content/` (unified) |
| **Data Sources** | 2 (TextDataSource + ImageDataSource) | 1 (DataSource) |
| **Env Vars** | `TEXT_DATA_SOURCE_ID`, `IMAGE_DATA_SOURCE_ID` | `DATA_SOURCE_ID` |
| **Image Metadata** | `IN_LINE_ATTRIBUTE` | `S3_LOCATION` (.metadata.json files) |
| **CloudFormation Outputs** | `DataSourceId`, `TextDataSourceId`, `ImageDataSourceId` | `DataSourceId` only |

### Why Migrate?

- **Simplified architecture**: Single data source reduces complexity
- **Unified content handling**: All content types use consistent metadata format
- **Improved metadata extraction**: New LLM-based metadata extraction for all content
- **Better filtering**: `content_type` field enables filtering by document/image/web_page

## Prerequisites

- Python 3.13+ with `boto3` installed
- AWS CLI configured with appropriate permissions
- Access to the deployed stack

## Migration Process

### Step 1: Run the Migration Script (Dry Run)

First, preview what changes will be made:

```bash
python scripts/migrate_v1_to_v2.py --stack-name --dry-run
```

This will show:

- Files that will be copied from `output/` and `images/` to `content/`
- DynamoDB tracking records that will be updated

### Step 2: Run the Actual Migration

Once satisfied with the dry run output:

```bash
python scripts/migrate_v1_to_v2.py --stack-name
```

The script will:

1. Copy all files from `output/` to `content/`
2. Copy all files from `images/` to `content/`
3. Update tracking table records with new S3 URIs

**Note:** The script is idempotent - it skips files that already exist in `content/` and records that are already updated.
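The idempotent behavior described in Step 2 can be illustrated with a minimal sketch. This is not the code from `migrate_v1_to_v2.py`; the helper names (`new_key`, `plan_copies`, `rewrite_uri`) and the example keys are hypothetical, and the sketch works on plain strings so it runs without AWS access. It assumes a file is skipped when its destination key already exists and a URI is left alone when it no longer points at an old prefix.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical helpers for illustration only; not taken from the real script.
OLD_PREFIXES = ("output/", "images/")
NEW_PREFIX = "content/"


def new_key(key: str) -> Optional[str]:
    """Map a v1.x key to its v2.0 key, or None if it is not under an old prefix."""
    for old in OLD_PREFIXES:
        if key.startswith(old):
            return NEW_PREFIX + key[len(old):]
    return None


def plan_copies(existing_keys: Set[str]) -> List[Tuple[str, str]]:
    """Return (source, destination) pairs, skipping destinations that already exist."""
    plan = []
    for key in sorted(existing_keys):
        dest = new_key(key)
        if dest is not None and dest not in existing_keys:
            plan.append((key, dest))
    return plan


def rewrite_uri(uri: str, bucket: str) -> str:
    """Point an s3:// URI at content/; URIs already updated are returned unchanged."""
    base = f"s3://{bucket}/"
    if not uri.startswith(base):
        return uri
    dest = new_key(uri[len(base):])
    return base + dest if dest is not None else uri


# Example bucket listing (assumed layout): one document was already copied.
keys = {
    "output/doc1/text.txt",
    "content/doc1/text.txt",
    "images/img1/caption.txt",
}
plan = plan_copies(keys)
```

Running `plan_copies` again on the bucket state after the planned copies yields an empty plan, which is what makes a second run of such a script safe.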
### Step 3: Deploy v2.0 Stack

Pull the latest code and deploy:

```bash
git pull origin main
sam build
sam deploy --stack-name
```

This updates:

- Lambda functions with new single data source logic
- Knowledge Base custom resource to create unified data source
- EventBridge rules to watch `content/` prefix

### Step 4: Trigger Reindex

Open the RAGStack dashboard and navigate to **Settings**:

1. Scroll to the **Knowledge Base Reindex** section
2. Click **Start Reindex**
3. Wait for the reindex to complete

The reindex will:

- Create a new Knowledge Base with the unified `content/` data source
- Re-extract metadata for all content using the new extraction system
- Ingest all documents, images, and scraped pages with fresh embeddings
- Delete the old Knowledge Base

### Step 5: Verify Migration

After reindex completes:

1. **Test chat**: Query the Knowledge Base and verify responses
2. **Check sources**: Ensure source attribution shows correct paths
3. **Test image search**: Verify image captions are searchable
4. **Test filters**: Use `content_type` filter to search specific content types

## Migration Script Details

### What the Script Does

```text
migrate_v1_to_v2.py
├── Get stack outputs (bucket name, table name)
├── Step 1: Copy output/* → content/*
├── Step 2: Copy images/* → content/*
└── Step 3: Update DynamoDB tracking records
    ├── output_s3_uri: output/ → content/
    ├── input_s3_uri: images/ → content/ (for images)
    └── caption_s3_uri: images/ → content/ (for images)
```

### What the Script Does NOT Do

- Does NOT delete old files (output/, images/ remain intact)
- Does NOT modify the Knowledge Base (reindex handles this)
- Does NOT generate metadata files (reindex handles this)

### Options

```bash
python scripts/migrate_v1_to_v2.py --help

Options:
  --stack-name     CloudFormation stack name (required)
  --region         AWS region (default: us-east-1)
  --dry-run        Preview changes without making them
  --verbose, -v    Enable debug logging
```

## Reindex Details

The reindex process handles all content types with type-specific logic:

| Type | Text Source | Metadata Extraction | Ingestion |
|------|-------------|---------------------|-----------|
| Documents | `output_s3_uri` | LLM extracts from text | 1 document |
| Images | `caption_s3_uri` | LLM extracts from caption | 2 documents (image + caption) |
| Scraped | `output_s3_uri` | Job-aware (see below) | 1 document |

### Job-Aware Scraped Content Reindex

Scraped content uses a special two-level metadata extraction:

1. **Job-level metadata**: Extracted from the **seed document** (first page scraped)
   - Applied to ALL pages in the scrape job
   - Provides semantic context (e.g., "AWS Lambda documentation")

2. **Page-level metadata**: Deterministic fields for each page
   - `source_url`, `source_domain`, `scraped_date`, `job_id`

**How it works:**

```text
Scraped Page → S3 metadata → job_id
                   ↓
           ScrapeJobs table → base_url
                   ↓
           Find seed document (source_url == base_url)
                   ↓
           Re-extract job metadata from seed (using NEW settings)
                   ↓
           Merge job metadata + page metadata
```

This ensures:

- All pages in a job share semantic metadata from the seed
- Metadata uses the NEW extraction settings (not preserved from original scrape)
- Job metadata is cached per-batch to avoid redundant LLM calls

### Common Metadata

All content gets:

- Fresh metadata extraction using configured LLM model
- `content_type` field for filtering ("document", "image", "web_page")
- Base metadata (document_id, filename, file_type)

## Rollback

If issues occur after migration:

1. **Before stack deploy**: Old files still exist in `output/` and `images/` - no rollback needed
2. **After stack deploy but before reindex**: Re-deploy old code version
3. **After reindex**: The old KB is deleted; you'd need to re-upload content

## Troubleshooting

### Migration Script Errors

**"Stack not found"**

- Verify stack name is correct
- Check you're using the right AWS region

**"Access Denied" errors**

- Ensure AWS credentials have S3 read/write and DynamoDB permissions
- Check the IAM user/role has access to the stack's resources

### Reindex Errors

**"Failed to read text"**

- Check the `output_s3_uri` or `caption_s3_uri` paths are correct
- Verify files were copied to `content/` prefix

**"Metadata extraction failed"**

- Check Bedrock model access (ensure your region supports the configured model)
- Review CloudWatch logs for the ReindexKB Lambda

### Post-Migration Issues

**Chat returns no results**

- Wait for reindex to fully complete
- Check Knowledge Base status in AWS console
- Verify data source has correct `content/` prefix

**Images not searchable**

- Ensure caption files exist at `content/{imageId}/caption.txt`
- Check the image's tracking record has `caption_s3_uri` field

### SAM Layer Caching Issue

**Symptoms:**

- Lambda functions fail with `No module named 'ragstack_common'` or `No module named 'crhelper'`
- CloudFormation update gets stuck on `CodeBuildRun` or `WCCodeBuildRun` custom resources
- Stack enters `UPDATE_ROLLBACK_FAILED` state

**Cause:** SAM uses content hashing to skip S3 uploads. After a reindex creates a new Knowledge Base, if you redeploy, SAM may reuse a stale/corrupted layer artifact from S3 instead of uploading the freshly built layer. The local build shows ~121MB but S3 only has ~120KB.

**Diagnosis:**

```bash
# Check local build size (should be ~121MB)
du -sh .aws-sam/build/RagstackCommonLayer/

# Check deployed layer size (should match, not 120KB)
aws lambda get-function-configuration --function-name -sync-status-checker \
  --query "Layers[0].CodeSize" --output text
```

**Fix:**

1. If stack is stuck in `UPDATE_IN_PROGRESS`, cancel and wait for rollback:

   ```bash
   aws cloudformation cancel-update-stack --stack-name --region us-east-1
   ```

2. If stack is in `UPDATE_ROLLBACK_FAILED`, continue rollback skipping failed resources:

   ```bash
   aws cloudformation continue-update-rollback --stack-name --region us-east-1 \
     --resources-to-skip CodeBuildRun WCCodeBuildRun BatchProcessorFunction \
     AppSyncResolverFunction ConfigurationResolverFunction
   ```

3. Once stack is in `UPDATE_ROLLBACK_COMPLETE`, clear caches and redeploy:

   ```bash
   # Delete SAM build cache
   rm -rf .aws-sam/

   # Delete stale S3 artifacts (keep UI source zips)
   aws s3 rm s3://-artifacts-/ --recursive --exclude "*.zip"

   # Fresh build and deploy
   sam build --parallel
   python publish.py --stack-name --admin-email
   ```

**Prevention:** After running reindex, always clear the SAM cache before redeploying:

```bash
rm -rf .aws-sam/
```

## Support

For issues:

- Check [TROUBLESHOOTING.md](./TROUBLESHOOTING.md)