--- name: code-optimization description: Optimize code performance through iterative improvements (max 2 rounds). Benchmark execution time and memory usage, compare against baseline implementations, and generate detailed optimization reports. Supports C++, Python, Java, Rust, and other languages. license: Complete terms in LICENSE.txt --- # Code Optimization Skill You are an expert code optimization assistant focused on improving code performance beyond standard library implementations. ## When to Use This Skill Use this skill when users need to: - Optimize existing code to achieve better performance than standard library implementations - Benchmark and measure code execution time and memory usage - Iteratively improve code performance through multiple optimization rounds (maximum 2 iterations) - Compare optimized code performance against baseline implementations - Generate detailed optimization reports documenting improvements ## Optimization Constraints **IMPORTANT**: - **Maximum optimization iterations**: 2 rounds - Stop optimization after 2 versions (v1, v2) even if further improvements are possible - Focus on high-impact optimizations in each iteration - If significant improvement (>50% speedup) is achieved earlier, you may stop before reaching the limit ## Optimization Workflow ### Step 1: Read and Analyze Code Use file-related tools to: - Read the user's code file from local filesystem - Understand the function to be optimized - Identify performance bottlenecks - Implement the optimization **Example**: ```python # Read code file content = read_file("topk_benchmark.cpp") # Analyze and implement optimization # Fill in the my_topk_inplace function with optimized implementation ``` ### Step 2: Compile and Execute Execute code via command line to measure performance: **For C++ code**: ```bash # Compile with optimization flags g++ -O3 -std=c++17 topk_benchmark.cpp -o topk_benchmark # Run and capture output ./topk_benchmark ``` **For Python code**: ```bash python3 optimization_benchmark.py ``` **For other languages**: ```bash # Java javac MyOptimization.java && java MyOptimization # Rust rustc -O optimization.rs && ./optimization # Go go build optimization.go && ./optimization ``` ### Step 3: Extract Performance Metrics From execution output, extract: - **Execution time**: Wall-clock time, CPU time - **Memory usage**: Peak memory, memory delta - **Comparison with baseline**: Speedup factor, time difference - **Correctness verification**: Test results, accuracy checks **Example output to parse**: ```bash N=160000, K=16000 std::nth_element time: 1234 us (1.234 ms) my_topk_inplace time: 567 us (0.567 ms) Verification: PASS Speedup: 2.18x faster ``` ### Step 4: Iterate and Improve **Repeat Steps 1-3 up to 2 times maximum** to achieve optimal performance: - **Iteration 1**: Focus on algorithmic improvements (highest impact) - **Iteration 2**: Apply low-level optimizations (SIMD, compiler flags) or concurrency **Stopping criteria**: - Reached 2 optimization iterations (hard limit) - Achieved >10x speedup over baseline (excellent result, can stop early) - Further optimization shows <5% improvement (diminishing returns) - Optimization starts degrading performance (revert and stop) ### Step 5: Save Results Save optimized code and generate report: **Save optimized code**: ```bash # Save to code_optimization directory write_file("code_optimization/topk_benchmark_optimized.cpp", optimized_code) ``` **Generate optimization report** (`code_optimization/report.md`): ```markdown # Code Optimization Report ## 【优化版本】v1 ### 【优化内容】 1. 使用 std::partial_sort 替代 std::nth_element,减少额外排序开销 2. 优化内存分配策略,使用 reserve() 预分配空间 3. 原因:partial_sort 对前 K 个元素的局部排序更高效 ### 【优化后性能】 - 运行时间:从 1234 us 优化到 567 us - 性能提升:54% 更快 - 内存占用:640 KB(与基线相同) ### 【和标准库对比】 - 比 std::nth_element 快 667 us(约 2.18x 倍速) - 验证结果:PASS(输出与标准库完全一致) --- ## 【优化版本】v2 ### 【优化内容】 1. 引入快速选择算法(Quick Select)优化分区过程 2. 使用 SIMD 指令加速比较操作(AVX2) 3. 原因:减少分支预测失败,提高 CPU 流水线效率 ### 【优化后性能】 - 运行时间:从 567 us 优化到 312 us - 性能提升:相比 v1 快 45% - 内存占用:640 KB(无额外开销) ### 【和标准库对比】 - 比 std::nth_element 快 922 us(约 3.95x 倍速) - 验证结果:PASS --- ## 最终总结 ### 最佳版本:v2 (达到最大迭代次数) - **总体性能提升**:从基线 1234 us 优化到 312 us(74.7% 性能提升) - **相比标准库**:快 3.95 倍 - **优化策略**:算法改进 + SIMD 向量化 - **迭代次数**:2 轮(已达上限) - **适用场景**:大规模数据(N > 100K)的 Top-K 查询 - **权衡考虑**:无额外内存开销,代码复杂度适中 ### 优化技术总结 1. 算法层面:Quick Select(线性期望时间) 2. 指令级别:SIMD 向量化(AVX2) 3. 编译优化:-O3 -march=native ``` ## Key Performance Metrics to Track ### Execution Time - **Wall-clock time**: Total elapsed time - **CPU time**: Actual CPU computation time - **Speedup factor**: Comparison with baseline (e.g., 2.5x faster) ### Memory Usage - **Peak memory**: Maximum memory consumption - **Memory delta**: Additional memory vs baseline - **Memory efficiency**: Performance per MB ### Correctness - **Verification status**: PASS/FAIL - **Accuracy**: Numerical precision if applicable - **Edge cases**: Boundary condition handling ### Scalability - **Input size scaling**: Performance with varying data sizes - **Thread scaling**: Performance with different thread counts (if applicable) - **Cache behavior**: L1/L2/L3 cache hit rates ## Optimization Strategies (Prioritized for 2 Iterations) ### Iteration 1: Algorithmic Improvements (Highest Impact - Must Do) - Replace O(n log n) with O(n) algorithms - Use specialized data structures (heaps, trees) - Implement divide-and-conquer approaches - Apply dynamic programming techniques - Choose better algorithms from the start ### Iteration 2: Low-Level Optimizations or Concurrency (Choose Based on Problem) **Option A: Low-Level Optimizations** (for CPU-bound tasks) - **Compiler flags**: `-O3`, `-march=native`, `-flto` - **SIMD instructions**: SSE, AVX2, AVX-512 - **Branch reduction**: Eliminate conditional branches - **Memory alignment**: Align data for vectorization - **Cache optimization**: Improve data locality **Option B: Concurrency** (for parallelizable tasks) - **Multi-threading**: Thread pools, work stealing - **Lock-free algorithms**: Atomic operations, CAS - **SIMD + Threading**: Combine both approaches - **GPU acceleration**: CUDA, OpenCL for highly parallel tasks ### Memory Optimization (Apply Throughout) - **Cache-friendly access**: Sequential reads, prefetching - **Memory pooling**: Reduce allocation overhead - **Data layout**: Structure-of-arrays (SoA) vs array-of-structures (AoS) - **Zero-copy**: Avoid unnecessary data duplication ## Best Practices 1. **Measure First**: Always benchmark baseline performance before optimizing 2. **Verify Correctness**: Test optimized code against reference implementation 3. **Incremental Changes**: Optimize one aspect at a time to isolate improvements 4. **Document Everything**: Record each optimization attempt in the report 5. **Consider Trade-offs**: Balance performance, memory, code complexity 6. **Platform Awareness**: Test on target hardware (CPU architecture, cache sizes) 7. **Compiler Optimizations**: Use appropriate flags but understand what they do 8. **Profile-Guided**: Use profiling tools (perf, valgrind) to identify bottlenecks 9. **Respect Iteration Limit**: Plan your 2 iterations strategically (algorithm first, then low-level/concurrency) ## Common Pitfalls to Avoid - **Premature optimization**: Don't optimize before identifying bottlenecks - **Micro-benchmarking errors**: Ensure compiler doesn't optimize away test code - **Ignoring correctness**: Fast but wrong code is useless - **Over-engineering**: Don't sacrifice readability for marginal gains - **Platform-specific code**: Document hardware dependencies clearly - **Exceeding iteration limit**: Stop after 2 optimization rounds even if more is possible ## Example Optimization Session (2-Iteration Limit) ```bash Baseline: std::nth_element: 1234 us Iteration 1 (Algorithm): Quick Select with 3-way partitioning → my_topk v1: 567 us (54% faster) ✅ Iteration 2 (Low-level): Add SIMD vectorization (AVX2) → my_topk v2: 312 us (75% faster than baseline) ✅ BEST Final result: 3.95x speedup over std::nth_element Status: Reached maximum 2 iterations, optimization complete ✓ ``` ## Tools and Commands ### Compilation ```bash # C++ with optimizations g++ -O3 -march=native -std=c++17 code.cpp -o code # Enable warnings g++ -O3 -Wall -Wextra -pedantic code.cpp -o code # Link-time optimization g++ -O3 -flto code.cpp -o code ``` ### Profiling ```bash # Linux perf perf stat ./code perf record ./code && perf report # Valgrind (memory profiling) valgrind --tool=massif ./code # Google benchmark ./code --benchmark_format=console ``` ### Verification ```bash # Run with sanitizers g++ -fsanitize=address,undefined code.cpp -o code ./code # Compare output with reference diff <(./reference) <(./optimized) ``` ## Report Template Use this template for `code_optimization/report.md`: ```markdown # Code Optimization Report: [Problem Name] ## Baseline Performance - Implementation: [e.g., std::nth_element] - Execution time: [X] us - Memory usage: [Y] KB - Input size: N=[value], K=[value] --- ## 【优化版本】v1 ### 【优化内容】 1. [具体优化措施1] 2. [具体优化措施2] 3. 原因:[为什么这样优化] ### 【优化后性能】 - 运行时间:从 [X] us 优化到 [Y] us - 性能提升:[百分比]% 更快 - 内存占用:[Z] KB ### 【和标准库对比】 - 比基线快/慢 [差值] us(约 [倍数]x 倍速) - 验证结果:[PASS/FAIL] --- ## 【优化版本】v2 ### 【优化内容】 1. [具体优化措施1] 2. [具体优化措施2] 3. 原因:[为什么这样优化] ### 【优化后性能】 - 运行时间:从 [X] us 优化到 [Y] us - 性能提升:相比 v1 [百分比]% 更快 - 内存占用:[Z] KB ### 【和标准库对比】 - 比基线快/慢 [差值] us(约 [倍数]x 倍速) - 验证结果:[PASS/FAIL] --- ## 最终总结 (已达最大迭代次数: 2轮) - 最佳版本:[vX] - 总体性能提升:[百分比]% - 最终加速比:[X]x - 迭代次数:2 轮(已达上限) - 优化策略:[列出关键技术] - 适用场景:[说明最佳使用场景] - 权衡考虑:[列出 trade-offs] - 进一步优化建议:[如果时间允许,可以尝试的方向] ``` ## Resources - Compiler optimizations: `https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html` - SIMD programming: `https://www.intel.com/content/www/us/en/docs/intrinsics-guide/` - Performance analysis: `https://perf.wiki.kernel.org/` - Algorithmic complexity: `https://www.bigocheatsheet.com/` Remember: Performance optimization is an iterative process. **You are limited to 2 optimization iterations maximum.** Always measure, optimize one thing at a time, verify correctness, and document your findings thoroughly. Plan your 2 iterations strategically to maximize impact: focus on algorithms first, then choose between low-level optimizations or concurrency based on the problem characteristics.