# Hash Verification & Data Integrity ## Table of Contents - [Overview](#overview) - [How It Works](#how-it-works) - [Enforcement Modes](#enforcement-modes) - [Benefits](#benefits) - [Hash Format](#hash-format) - [Error Messages](#error-messages) - [Testing & Validation](#testing--validation) - [Compatibility](#compatibility) - [FAQ](#faq) - [Security Considerations](#security-considerations) ## Overview gitsqlite includes automatic hash verification using SHA-256 to ensure SQL files have not been modified between operations. This prevents corruption and data integrity issues that could occur from manual edits or file corruption. **Hash verification status by operation:** - ✅ `clean` - Generates hash and appends to SQL output - ✅ `smudge` - Verifies hash before restoring to binary (optional or enforced) - ❌ `diff` - No hash (diff output is for display/comparison only) ## How It Works ### During Clean Operation (Binary → SQL) When gitsqlite converts a binary SQLite database to SQL format: 1. The SQL content is streamed to the output 2. A SHA-256 hash is computed on-the-fly as the SQL is written 3. After all SQL content is written, a hash comment is appended as the last line: ```sql -- gitsqlite-hash: sha256:a1b2c3d4e5f6... ``` This happens for: - Main data output (stdout) - Schema files (when using `-schema` or `-schema-file` flags) **Note:** The `diff` command does **not** include hash verification since it's only used for viewing/comparing, not for round-tripping back to binary. ### During Smudge Operation (SQL → Binary) When gitsqlite converts SQL format back to binary SQLite: 1. The entire SQL input is read 2. The last line is checked for the hash comment 3. The hash is extracted and the content (without hash line) is hashed 4. If the computed hash matches the stored hash, the content is verified 5. The SQL (without hash line) is restored to binary SQLite 6. **Behavior depends on enforcement mode** (see [Enforcement Modes](#enforcement-modes)) This verification happens for: - Main data input (stdin) - Schema files (when using `-schema` or `-schema-file` flags) ## Enforcement Modes ### Optional Validation (Default) By default, hash validation is **optional**: - Hash is validated if present - Warnings are logged if hash is invalid or missing - **Operation continues** even with validation failures ```bash # Default behavior - validates but doesn't fail gitsqlite smudge < database.sql > database.db # Enable logging to see validation status gitsqlite -log smudge < database.sql > database.db # Check log file for hash validation warnings ``` **Use case**: Development and testing environments where flexibility is needed. ### Enforced Validation With the `-verify-hash` flag, validation is **enforced**: - Hash must be present and valid - **Operation fails** if hash is invalid or missing - No database is created from invalid SQL ```bash # Enforced mode - fails if hash is invalid/missing gitsqlite -verify-hash smudge < database.sql > database.db ``` **Use case**: Production environments or when working with critical data. ### Git Configuration Configure your Git filter to use enforced validation: ```bash # Enforce hash verification in Git workflow git config filter.gitsqlite.smudge "gitsqlite -verify-hash smudge" git config filter.gitsqlite.clean "gitsqlite clean" ``` ## Benefits ### 1. Prevents Manual Editing The hash ensures that SQL files generated by gitsqlite are not manually edited in an editor, which could lead to: - Syntax errors that break smudge operations - Data corruption - Loss of binary database integrity - Inconsistent data types or constraints ### 2. Detects File Corruption If a SQL file is corrupted during: - File system operations - Git operations - File transfers - Storage issues The hash will catch the corruption before it's restored to binary format. ### 3. Data Integrity Assurance The hash provides assurance that the SQL content has not been tampered with, intentionally or accidentally. In enforced mode, this prevents corrupted data from entering your database. ## Hash Format The hash line is always the last line of the SQL file and follows this format: ``` -- gitsqlite-hash: sha256:<64-character-hex-digest> ``` Example: ```sql PRAGMA foreign_keys=OFF; BEGIN TRANSACTION; CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT); INSERT INTO users VALUES(1,'Alice'); INSERT INTO users VALUES(2,'Bob'); COMMIT; -- gitsqlite-hash: sha256:a7f3e8c2d1b6904523f1e0c8a9d4b5c6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2 ``` ## Error Messages ### Enforced Mode Errors When using `-verify-hash`, you may see these errors: **Missing Hash:** ``` data hash verification failed: missing gitsqlite hash signature (expected last line to start with '-- gitsqlite-hash: sha256:') ``` **Hash Mismatch:** ``` data hash verification failed: hash verification failed: expected a7f3e8c2d1..., got b8e4f9d3e2... (file may have been modified) ``` **Schema File Hash Errors:** ``` schema hash verification failed: hash verification failed: expected a7f3e8c2d1..., got b8e4f9d3e2... (file may have been modified) ``` ### Optional Mode Warnings When not using `-verify-hash`, similar messages appear as warnings in the log file but don't prevent the operation from completing. ## Testing & Validation ### Manual Testing To manually test hash verification: ```bash # Create a test database sqlite3 test.db < test.sql # View the hash at the end tail -1 test.sql # Verify it works (enforced mode) gitsqlite -verify-hash smudge < test.sql > test2.db # Test hash verification failure - edit the SQL file sed -i "s/Alice/Charlie/" test.sql # This should fail with hash verification error (in enforced mode) gitsqlite -verify-hash smudge < test.sql > test3.db # In optional mode, this succeeds with a warning gitsqlite -log smudge < test.sql > test4.db # Check log file for warning message ``` ### Testing Optional vs Enforced Modes ```bash # Create valid SQL file sqlite3 test.db "CREATE TABLE test(id INTEGER); INSERT INTO test VALUES(1);" gitsqlite clean < test.db > valid.sql # Create invalid SQL file (corrupt hash) cat > invalid.sql << 'EOF' PRAGMA foreign_keys=OFF; BEGIN TRANSACTION; CREATE TABLE test(id INTEGER); INSERT INTO test VALUES(1); COMMIT; -- gitsqlite-hash: sha256:0000000000000000000000000000000000000000000000000000000000000000 EOF # Test 1: Valid hash with enforcement - should succeed gitsqlite -verify-hash smudge < valid.sql > test1.db && echo "✓ Success" # Test 2: Invalid hash with enforcement - should fail gitsqlite -verify-hash smudge < invalid.sql > test2.db 2>&1 || echo "✓ Correctly failed" # Test 3: Invalid hash without enforcement - should succeed with warning gitsqlite -log smudge < invalid.sql > test3.db && echo "✓ Success (warning logged)" # Test 4: Missing hash with enforcement - should fail head -n -1 valid.sql > no_hash.sql gitsqlite -verify-hash smudge < no_hash.sql > test4.db 2>&1 || echo "✓ Correctly failed" # Test 5: Missing hash without enforcement - should succeed with warning gitsqlite smudge < no_hash.sql > test5.db && echo "✓ Success (warning logged)" ``` ### Integration Testing Test the complete Git workflow: ```bash # Create test repository mkdir /tmp/hash-verification-test && cd /tmp/hash-verification-test git init git config user.name "Test" git config user.email "test@example.com" # Configure filters (with or without enforcement) echo '*.db filter=gitsqlite' > .gitattributes # Without enforcement (default) git config filter.gitsqlite.clean "gitsqlite clean" git config filter.gitsqlite.smudge "gitsqlite smudge" # OR with enforcement (recommended) # git config filter.gitsqlite.smudge "gitsqlite -verify-hash smudge" # Create and commit database sqlite3 app.db "CREATE TABLE logs(id INTEGER, msg TEXT); INSERT INTO logs VALUES(1,'start');" git add .gitattributes app.db git commit -m "Initial database" # VERIFY: Git stored SQL with hash git show HEAD:app.db | tail -1 # Expected: -- gitsqlite-hash: sha256: # Test checkout (smudge with verification) rm app.db git checkout HEAD -- app.db sqlite3 app.db "SELECT * FROM logs;" # Expected: 1|start # Cleanup cd - && rm -rf /tmp/hash-verification-test ``` ### Unit Tests Run the hash package unit tests: ```bash go test -v ./internal/hash/ ``` Expected tests: - `TestHashWriter` - Basic functionality - `TestHashWriterDeterministic` - Consistency check - `TestVerifyAndStripHash` - Successful verification - `TestVerifyAndStripHashInvalidHash` - Wrong hash detection - `TestVerifyAndStripHashMissingHash` - Missing hash detection - `TestVerifyAndStripHashModifiedContent` - Tampering detection - `TestVerifyAndStripHashEmptyInput` - Edge case - `TestRoundTrip` - End-to-end verification ## Compatibility ### Backward Compatibility ⚠️ **Important**: SQL files generated by older versions of gitsqlite (without hash) will work differently depending on mode: - **Optional mode (default)**: Files without hash work but generate warnings - **Enforced mode (`-verify-hash`)**: Files without hash will fail To migrate existing repositories to enforced mode: 1. Check out the repository with files in SQL format 2. Re-run clean operation to regenerate SQL with hashes: ```bash git add -f *.db git commit -m "Regenerate SQLite dumps with hash verification" ``` 3. Update Git filter configuration to use `-verify-hash` flag ### Future Versions The hash format is designed to be extensible. If a different hash algorithm is needed in the future, the prefix format allows for versioning: - Current: `-- gitsqlite-hash: sha256:...` - Future: `-- gitsqlite-hash: sha512:...` or `-- gitsqlite-hash-v2: ...` ## FAQ **Q: What if I want to manually edit the SQL?** A: Don't edit the SQL files directly. Instead, use SQLite tools to edit the binary database, then re-run clean to regenerate the SQL with a valid hash. **Q: Can I merge SQL files manually?** A: Merging SQL files manually is risky (see README caveats). If you must merge, after merging you'll need to restore to binary, then clean again to generate a valid hash. **Q: Does the hash slow down operations?** A: The performance impact is minimal - hash computation happens while streaming data. Only the smudge operation needs to read the full content into memory. **Q: What if I'm using schema/data separation?** A: Both the schema file and data file will have their own independent hashes. Both must verify successfully during smudge (in enforced mode). **Q: Can I verify a hash manually?** A: Yes, you can use standard tools: ```bash # Extract content without hash line head -n -1 file.sql | sha256sum # Compare with hash in last line tail -1 file.sql ``` **Q: Should I use enforced mode or optional mode?** A: Use optional mode (default) for development/testing. Use enforced mode (`-verify-hash`) for production or critical data where you want to guarantee integrity. **Q: Can I disable hash generation during clean?** A: No, hashes are always generated during clean. However, verification during smudge is optional by default and only enforced when using `-verify-hash`. ## Security Considerations ### Hash Algorithm - **SHA-256** is cryptographically secure and collision-resistant - Hashes are computed on the SQL content (excluding the hash line itself) - 64-character hexadecimal encoding provides 256 bits of security ### Threat Model - Hash verification prevents **accidental** modifications and corruption - **Not designed** to prevent malicious attacks where attacker has write access - An attacker with file system access could modify both content and hash - For tamper-evidence against malicious actors, use Git's commit signing (GPG) ### Best Practices 1. **Use enforced mode** (`-verify-hash`) for production databases 2. **Combine with Git security**: Enable signed commits and protected branches 3. **Log validation results**: Use `-log` flag to track validation status 4. **Review warnings**: In optional mode, periodically review logs for validation failures 5. **Test before deployment**: Validate hash verification in your CI/CD pipeline ### Technical Details - **Algorithm**: SHA-256 from Go's `crypto/sha256` package - **Encoding**: Lowercase hexadecimal (64 characters) - **Input**: All SQL content excluding the hash line itself - **Performance**: O(n) with streaming computation during clean - **Memory**: Smudge requires full file in memory for verification ## Implementation Key files: - `internal/hash/hash.go` - Core hash implementation - `internal/hash/hash_test.go` - Comprehensive unit tests - `internal/filters/clean.go` - Hash generation during clean - `internal/filters/smudge.go` - Hash verification during smudge - `internal/filters/diff.go` - **No hash** (diff is for viewing only)