--- name: large-scale-text-editing description: Provides strategies for efficiently transforming large text files (thousands to millions of lines) using text editors like Vim, sed, or awk. This skill should be used when tasks involve bulk text transformations, CSV manipulation at scale, pattern-based edits across massive files, or when keystroke/operation efficiency is constrained. Applicable to tasks requiring macros, regex substitutions, or batch processing of structured text data. --- # Large-Scale Text Editing ## Overview This skill provides guidance for efficiently transforming large text files containing thousands to millions of lines. It covers strategies for understanding transformation requirements, designing efficient solutions (particularly with Vim macros), testing approaches, and verification techniques. ## When to Use This Skill - Transforming CSV, TSV, or other delimited files at scale - Applying repetitive edits across files with millions of rows - Working within keystroke or operation count constraints - Using Vim macros, sed, awk, or similar batch processing tools - Pattern-based text transformations requiring regex ## Approach Strategy ### Phase 1: Understand the Transformation Before writing any transformation logic: 1. **Assess file size first** - Check file size with `ls -lh` or `wc -l` before attempting to read. Avoid reading multi-million line files directly. 2. **Sample strategically** - Extract samples from multiple locations: - Beginning: `head -n 100 input.csv > sample_head.csv` - Middle: `sed -n '500000,500100p' input.csv > sample_middle.csv` - End: `tail -n 100 input.csv > sample_tail.csv` 3. **Compare input and expected output** - Identify all transformations needed: - Column reordering or removal - Delimiter changes - Case transformations - Whitespace handling - Value appending or prepending - Format conversions 4. **Verify structural assumptions**: - Consistent column count across all rows - Presence of header rows - Empty lines or malformed rows - Special characters that might break regex patterns ### Phase 2: Design the Solution When designing transformations: 1. **Break complex transformations into discrete steps** - Each step should handle one logical transformation. This improves debuggability and allows independent testing. 2. **Choose the right tool for the scale**: - **Vim macros**: Excellent for complex, multi-step transformations; efficient keystroke counting - **sed**: Fast for simple substitutions across large files - **awk**: Powerful for column manipulation and conditional logic - **Perl/Python**: For complex logic that exceeds regex capabilities 3. **Design for efficiency**: - Minimize the number of passes through the file - Use line-based operations (`:%normal!` in Vim) rather than iterating with explicit loops - Leverage built-in commands (e.g., `gU` for uppercase in Vim) over manual character manipulation 4. **Document design decisions** - Record why specific approaches were chosen, especially when multiple valid alternatives exist. ### Phase 3: Test Incrementally 1. **Create a test sample** - Use a small subset (100-1000 lines) for initial testing: ```bash head -n 100 input.csv > test_input.csv head -n 100 expected.csv > test_expected.csv ``` 2. **Test each transformation independently** - Verify each macro or command produces correct output before combining. 3. **Verify with diff** - Use byte-for-byte comparison: ```bash diff test_output.csv test_expected.csv ``` 4. **Check for edge cases in test output**: - First and last lines transformed correctly - Lines with varying content lengths handled - Special characters preserved or transformed as expected ### Phase 4: Execute with Safeguards 1. **Create backups before in-place modifications**: ```bash cp input.csv input.csv.backup ``` 2. **Set appropriate timeouts** - For million-row files, allow sufficient processing time (e.g., 2-5 minutes depending on complexity). 3. **Monitor progress when possible** - Use tools that show progress or check intermediate output. 4. **Verify final output**: - Confirm row count matches: `wc -l output.csv` - Run diff against expected output - Spot-check samples from different file locations ## Vim-Specific Guidance ### Macro Design Principles - **Register allocation**: Use distinct registers (a, b, c) for different transformation stages - **Keystroke efficiency**: Prefer built-in commands over character-by-character operations - **Regex patterns**: Use non-greedy patterns and explicit delimiters to avoid over-matching ### Common Vim Patterns for Large Files | Task | Approach | |------|----------| | Apply macro to all lines | `:%normal! @a` | | Uppercase transformation | `gU` motion or `\U` in substitution | | Column manipulation | Capture groups with `\(\)` and backreferences `\1`, `\2` | | Delimiter replacement | `:s/old_delim/new_delim/g` | | Whitespace removal | `:s/\s\+//g` | ### Escaping in Vim Scripts When using `setreg()` for macro definitions: - Escape backslashes: `\\` for literal backslash - Use `\r` for carriage return - Special characters may need double-escaping ## Verification Checklist Before considering the task complete: - [ ] Output file exists and is non-empty - [ ] Row count matches expected count - [ ] Byte-for-byte diff passes against expected output (if available) - [ ] Spot-check samples from beginning, middle, and end of file - [ ] Any constraints (keystroke limits, command restrictions) are satisfied - [ ] Tool exited with success code (exit code 0) ## Common Pitfalls | Pitfall | Prevention | |---------|------------| | Reading large files directly | Always check file size first; use head/tail/sed for sampling | | No backup before in-place edit | Create backup copy before any modification | | Testing only on first few lines | Sample from multiple file locations | | Assuming uniform structure | Verify structure with samples from different positions | | Regex over-matching | Use explicit delimiters and non-greedy quantifiers | | Insufficient timeout | Calculate expected processing time for file size | | Not verifying exit codes | Check tool exit status after operations | ## Efficiency Considerations When keystroke or operation counts matter: 1. **Count accurately** - Understand what constitutes a "keystroke" in the specific context (escape sequences, special keys) 2. **Combine operations** - A single regex substitution may replace multiple simpler operations 3. **Use built-in commands** - Native commands are typically more efficient than manual equivalents 4. **Minimize redundancy** - Avoid repeated file reads or redundant transformations