# SemRay - AI-Powered Semantic Analysis for IDA Pro

SemRay is an IDA Pro plugin that uses Google's Gemini models for semantic analysis of binary code. It suggests function names, multi-line comments, and variable renames based on each function's code and its surrounding call context.

## Features

- **Intelligent Function Naming**: Generate concise, descriptive function names that encode role and domain (e.g., `crc32_checksum`, `parse_http_header`)
- **Comprehensive Comments**: Automatically create detailed multi-line comments explaining function behavior
- **Variable Renaming**: Suggest meaningful names for local variables and function arguments
- **Context-Aware Analysis**: Analyzes callers, callees, and cross-references to understand function relationships
- **Flexible Analysis Modes**:
  - Analyze single functions
  - Analyze all functions in context
  - Analyze functions within N levels of call depth
- **Multiple Content Modes**: Choose between decompiled C code or raw assembly for LLM analysis
- **Optional CodeDumper Integration**: Enhanced context discovery with virtual calls, jump tables, and PTN provenance annotations
- **Interactive UI**: Review and selectively apply suggested changes through an intuitive tabbed interface

## Requirements

### Essential

- **IDA Pro 7.6+** with Python 3 and PyQt5 support
- **Hex-Rays Decompiler** (for decompilation mode and PTN analysis)
- **Python Libraries**:
  ```bash
  pip install google-genai pydantic
  ```
- **Google AI API Key**: Required for Gemini API access

## Installation

### 1. Plugin Installation

Copy the plugin directory to your IDA plugins folder:

```bash
# Linux
cp -r semray ~/.idapro/plugins/semray

# Windows
xcopy /E /I semray "C:\Users\YourName\AppData\Roaming\Hex-Rays\IDA Pro\plugins\semray"

# macOS
cp -r semray ~/Library/Application\ Support/Hex-Rays/IDA\ Pro/plugins/semray
```

Alternatively, you can place it directly in the IDA installation's plugins directory:

```bash
# Example for Linux
cp -r semray /opt/ida-pro/plugins/semray
```

### 2. Python Dependencies

Install required Python libraries in IDA's Python environment:

```bash
# If using system Python (ensure it matches IDA's Python version)
pip install google-genai pydantic

# If using IDA's bundled Python
/path/to/ida/python3 -m pip install google-genai pydantic
```

### 3. API Key Configuration

Set your Google AI API key as an environment variable:

```bash
# Linux/macOS - Add to ~/.bashrc or ~/.zshrc
export GOOGLE_API_KEY="your-api-key-here"

# Windows - Persist for the current user (takes effect in newly started processes)
setx GOOGLE_API_KEY "your-api-key-here"
```

To obtain a Google AI API key:
1. Visit [Google AI Studio](https://makersuite.google.com/app/apikey)
2. Sign in with your Google account
3. Create a new API key
4. Copy and set it as the `GOOGLE_API_KEY` environment variable
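
To confirm that the key is visible to Python (for example from IDA's Python console), a check along these lines can help. This is a minimal sketch: `read_api_key` is a hypothetical helper for illustration, not a function from SemRay.

```python
import os
from typing import Mapping, Optional


def read_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Google AI API key, or raise if it is missing or blank.

    `env` defaults to the process environment; passing a dict makes the
    function easy to exercise without touching real variables.
    """
    source = os.environ if env is None else env
    key = source.get("GOOGLE_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GOOGLE_API_KEY environment variable not set")
    return key


if __name__ == "__main__":
    try:
        key = read_api_key()
        # Print only the length so the key never ends up in a log.
        print(f"GOOGLE_API_KEY is set ({len(key)} characters)")
    except RuntimeError as exc:
        print(exc)
```

If the check fails inside IDA but succeeds in a terminal, IDA was most likely started before the variable was set.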

### 4. Verify Installation

Start IDA Pro and check the output window for:
```
Initializing SemRay (Google AI Semantic Analysis) plugin.
SemRay: CodeDumper integration enabled.  # (if CodeDumper is available)
SemRay (Google AI Semantic Analysis) initialized successfully.
```

## Usage

### Quick Start

1. **Navigate to a function** in IDA Pro (Disassembly or Pseudocode view)
2. **Right-click** to open the context menu
3. Select **SemRay Analysis** from the menu
4. Choose your analysis mode:
   - **Analyze CURRENT Func Only**: Analyzes only the selected function
   - **Analyze ALL Funcs in Context**: Analyzes the function plus all callers/callees in context
   - **Analyze Current + N Levels**: Analyzes functions within N depth levels

### Configuration Prompts

When you trigger an analysis, you'll be prompted for:

1. **Content Mode**: Choose between:
   - **Decompiled**: Uses Hex-Rays decompiled C pseudocode (recommended)
   - **Assembly**: Uses raw disassembly (useful when decompilation fails)

2. **Context Depths**:
   - **Caller Depth**: How many levels of calling functions to include (default: 1)
   - **Callee Depth**: How many levels of called functions to include (default: 1)
   - **Analysis Depth**: (Depth-limited mode only) How many function levels to analyze
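
The depth settings above can be pictured as a breadth-first walk over the call graph that stops after a fixed number of hops. The sketch below assumes a plain dictionary call graph with hypothetical function names; how SemRay itself walks the graph is not shown here.

```python
from collections import deque
from typing import Dict, Iterable, List

# Hypothetical call graph: function name -> names of the functions it calls.
CallGraph = Dict[str, Iterable[str]]


def within_depth(graph: CallGraph, start: str, max_depth: int) -> Dict[str, int]:
    """Return every function reachable from `start` in at most `max_depth`
    hops, mapped to its distance from `start`."""
    distances = {start: 0}
    pending = deque([start])
    while pending:
        current = pending.popleft()
        depth = distances[current]
        if depth == max_depth:
            continue  # Do not expand beyond the requested depth.
        for neighbor in graph.get(current, ()):
            if neighbor not in distances:
                distances[neighbor] = depth + 1
                pending.append(neighbor)
    return distances


def invert(graph: CallGraph) -> Dict[str, List[str]]:
    """Turn a callee map into a caller map so the same walk handles callers."""
    callers: Dict[str, List[str]] = {}
    for caller, callees in graph.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)
    return callers


callees = {"main": ["parse", "send"], "parse": ["crc32"], "send": [], "crc32": []}
callee_context = within_depth(callees, "main", 1)           # callee depth 1
caller_context = within_depth(invert(callees), "crc32", 2)  # caller depth 2
```

With a callee depth of 1, `main` pulls in `parse` and `send` but not `crc32`. Each extra level can multiply the amount of code sent to the model, which is why the defaults are small.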

### Analysis Workflow

1. **Context Collection**: The plugin gathers code, call graphs, cross-references, and literals
2. **LLM Processing**: Sends the context to Google's Gemini model for semantic analysis
3. **Results Display**: Opens a tabbed UI showing suggestions for each function
4. **Review & Apply**: 
   - Review each suggestion
   - Uncheck items you don't want to apply
   - Click **Apply Selected** to update your IDB
   - Click **Close** to dismiss without changes

### Results UI

The results window displays tabs for each analyzed function, showing:

- **Suggested Function Name**: With reasoning and evidence
- **Suggested Comment**: Multi-line documentation of function behavior
- **Variable Renames**: Original → New name mappings with explanations

Each suggestion has a checkbox; uncheck it to exclude that suggestion when applying.

### Batch Analysis

You can also analyze multiple functions at once:

1. Go to **Edit → Plugins → SemRay (Google AI Semantic Analysis)**
2. Enter comma-separated function names or addresses:
   ```
   sub_401000, parse_header, 0x402340
   ```
3. Choose content mode and context depths
4. Review and apply results
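
A comma-separated target list like the one in step 2 could be split into names and addresses as sketched below. `parse_targets` is a hypothetical helper and the `0x` prefix rule is an assumption; the plugin's own parsing may differ.

```python
from typing import List, Union

# A target is either a numeric address or a symbol name to resolve later.
Target = Union[int, str]


def parse_targets(text: str) -> List[Target]:
    """Split comma-separated input into addresses (int) and names (str)."""
    targets: List[Target] = []
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue  # Tolerate trailing commas and doubled separators.
        if token.lower().startswith("0x"):
            # Raises ValueError on malformed hex such as "0xZZ".
            targets.append(int(token, 16))
        else:
            targets.append(token)
    return targets


targets = parse_targets("sub_401000, parse_header, 0x402340")
```

Names would still have to be resolved to addresses inside IDA before analysis; that step is omitted here so the sketch runs outside IDA.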

## How It Works

### Context Building

SemRay builds rich context for the LLM by collecting:

1. **Code Content**:
   - Decompiled C pseudocode (via Hex-Rays)
   - Or raw disassembly with labels and addresses

2. **Call Graph**:
   - Direct calls
   - Indirect calls
   - Virtual calls (with CodeDumper)
   - Jump tables (with CodeDumper)
   - Tail calls

3. **Semantic Hints**:
   - String literals referenced in functions
   - Large constant values
   - Function prototypes/signatures

4. **PTN Annotations** (with CodeDumper):
   - Provenance tracking of data flows
   - Virtual table analysis
   - Enhanced cross-reference context

### LLM Processing

The plugin sends a carefully crafted prompt to Google's Gemini model that includes:

- **Persona**: "You are an expert reverse engineer"
- **Naming Contract**: Rules for meaningful, non-generic names
- **Call Graph**: Relationships between functions
- **Code Context**: All relevant source code
- **Schema Enforcement**: Structured JSON output via response schema
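
As an illustration of how these parts might be combined into one prompt, the sketch below concatenates a persona, a naming rule, a call graph, and the code. The section headings, the wording of the naming rule, and `build_prompt` itself are assumptions for illustration, not SemRay's actual prompt.

```python
from typing import Dict, List

PERSONA = "You are an expert reverse engineer."
# Hypothetical wording for the naming contract.
NAMING_RULE = "Names must encode role and domain; avoid generic names such as tmp or helper."


def build_prompt(call_graph: Dict[str, List[str]],
                 code_by_function: Dict[str, str]) -> str:
    """Assemble a plain-text prompt from the collected context."""
    lines = [PERSONA, "", "## Naming contract", NAMING_RULE, "", "## Call graph"]
    for caller in sorted(call_graph):
        callees = ", ".join(call_graph[caller]) or "(none)"
        lines.append(f"{caller} -> {callees}")
    lines += ["", "## Code"]
    for name in sorted(code_by_function):
        lines.append(f"### {name}")
        lines.append(code_by_function[name].rstrip())
    return "\n".join(lines)


prompt = build_prompt(
    {"main": ["parse"], "parse": []},
    {"main": "int main() { return parse(); }", "parse": "int parse() { return 0; }"},
)
```

Sorting the functions keeps the prompt stable between runs, which makes differences in the model's output easier to attribute to changes in the code rather than to ordering.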

The LLM analyzes the code holistically and provides:
- Function names that encode purpose and domain
- Detailed comments explaining behavior
- Evidence-based reasoning for each suggestion
- Variable renames that clarify intent
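
The structured response has to be checked before anything reaches the UI. SemRay uses Pydantic for this; the sketch below does a comparable check with only the standard library so it runs anywhere. The field names are hypothetical and do not reflect the plugin's real schema.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FunctionSuggestion:
    # Hypothetical fields; the plugin's real schema may differ.
    original_name: str
    suggested_name: str
    comment: str
    reasoning: str
    variable_renames: Dict[str, str] = field(default_factory=dict)


REQUIRED_TEXT_FIELDS = ("original_name", "suggested_name", "comment", "reasoning")


def parse_suggestions(raw: str) -> List[FunctionSuggestion]:
    """Parse a JSON array of suggestions, skipping malformed entries."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, list):
        return []
    results: List[FunctionSuggestion] = []
    for item in payload:
        if not isinstance(item, dict):
            continue
        if not all(isinstance(item.get(key), str) for key in REQUIRED_TEXT_FIELDS):
            continue
        renames = item.get("variable_renames") or {}
        if not isinstance(renames, dict):
            continue
        results.append(FunctionSuggestion(
            original_name=item["original_name"],
            suggested_name=item["suggested_name"],
            comment=item["comment"],
            reasoning=item["reasoning"],
            variable_renames={str(old): str(new) for old, new in renames.items()},
        ))
    return results
```

Dropping a malformed entry instead of failing the whole batch means one bad suggestion does not discard the rest of an analysis.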

### Name Validation

The plugin filters out generic/unhelpful names using regex patterns:
- Rejects: `var5`, `tmp`, `foo`, `bar`, `helper`, `unused`
- Only accepts: meaningful, descriptive identifiers
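
A filter of this kind can be written with two regular expressions, as sketched below. The patterns are assumptions chosen to reject the examples listed above; the plugin's actual patterns are not reproduced here.

```python
import re

# Hypothetical pattern for placeholder-style names.
GENERIC_NAME = re.compile(
    r"^(?:"
    r"(?:var|arg|tmp|temp|v|a)_?\d*"   # var5, tmp, v12, a1
    r"|foo|bar|baz|helper|unused"      # filler words
    r"|sub_[0-9A-Fa-f]+"               # IDA's auto-generated function names
    r")$",
    re.IGNORECASE,
)
# A plain C-style identifier.
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def is_meaningful_name(name: str) -> bool:
    """Accept valid identifiers that do not look like placeholders."""
    return bool(IDENTIFIER.match(name)) and not GENERIC_NAME.match(name)


rejected = [n for n in ["var5", "tmp", "foo", "bar", "helper", "unused"]
            if not is_meaningful_name(n)]
accepted = [n for n in ["crc32_checksum", "parse_http_header"]
            if is_meaningful_name(n)]
```

Anchoring the generic pattern at both ends matters: it rejects `tmp` while still accepting a name such as `tmp_file_path` that merely starts with a generic word.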

### IDB Updates

When you apply changes, the plugin:
1. Sets function comments (with word wrapping)
2. Renames functions (with collision checking)
3. Renames local variables (using Hex-Rays API)
4. Marks affected functions as dirty
5. Refreshes pseudocode views automatically
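
Two of these steps, word wrapping and collision checking, are plain string handling and can be sketched without IDA. Both helpers below are hypothetical, and the numeric-suffix strategy is an assumption; the plugin may handle a collision differently, for example by skipping the rename.

```python
import textwrap
from typing import Set


def wrap_comment(text: str, width: int = 80) -> str:
    """Wrap each line of a comment to `width` columns, keeping blank lines."""
    wrapped = [
        textwrap.fill(line, width=width) if line.strip() else ""
        for line in text.splitlines()
    ]
    return "\n".join(wrapped)


def unique_name(candidate: str, existing: Set[str]) -> str:
    """Return `candidate`, or `candidate_N` if the name is already taken."""
    if candidate not in existing:
        return candidate
    suffix = 2
    while f"{candidate}_{suffix}" in existing:
        suffix += 1
    return f"{candidate}_{suffix}"


name = unique_name("parse_header", {"parse_header", "parse_header_2"})
comment = wrap_comment("Parses the HTTP header block and validates each field. " * 3, width=40)
```

In the plugin the set of existing names would come from the IDB rather than from a literal set.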

## Configuration

### Model Selection

By default, SemRay uses `gemini-flash-latest` for speed and cost-efficiency. To change the model, edit `semray.py`:

```python
DEFAULT_GEMINI_MODEL = "gemini-flash-latest"  # or "gemini-pro-latest"
MODELS_TO_REGISTER = [DEFAULT_GEMINI_MODEL]
```

### Default Depths

Customize default analysis depths:

```python
DEFAULT_CONTEXT_CALLER_DEPTH = 1
DEFAULT_CONTEXT_CALLEE_DEPTH = 1
DEFAULT_ANALYSIS_DEPTH = 1
```

### Cross-Reference Types

Control which reference types are considered (when using CodeDumper):

```python
DEFAULT_XREF_TYPES = {
    'direct_call',
    'indirect_call',
    'data_ref',
    'immediate_ref',
    'tail_call_push_ret',
    'virtual_call',
    'jump_table',
}
```
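
To illustrate what such a set controls, the sketch below keeps only the references whose type is allowed. The `Xref` tuple and `filter_xrefs` are hypothetical stand-ins for whatever structures the plugin and CodeDumper actually exchange.

```python
from typing import Iterable, List, NamedTuple, Set


class Xref(NamedTuple):
    source: str  # referencing function
    target: str  # referenced function or data item
    kind: str    # one of the type names from the set above


def filter_xrefs(xrefs: Iterable[Xref], allowed: Set[str]) -> List[Xref]:
    """Keep only references whose kind is in the allowed set."""
    return [xref for xref in xrefs if xref.kind in allowed]


xrefs = [
    Xref("main", "parse_header", "direct_call"),
    Xref("main", "g_config", "data_ref"),
    Xref("dispatch", "handler_3", "jump_table"),
]
calls_only = filter_xrefs(xrefs, {"direct_call", "indirect_call"})
```

Narrowing the set shrinks the context sent to the model, at the cost of hiding relationships that are only visible through data references or jump tables.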

### Safety Settings

The plugin sets the blocking threshold to `BLOCK_NONE` for the four harm categories listed below, so that Google AI's content filtering does not reject reverse engineering content. Adjust in `semray.py` if needed:

```python
DEFAULT_SAFETY_SETTINGS = [
    types.SafetySetting(category='HARM_CATEGORY_HATE_SPEECH', threshold='BLOCK_NONE'),
    types.SafetySetting(category='HARM_CATEGORY_DANGEROUS_CONTENT', threshold='BLOCK_NONE'),
    types.SafetySetting(category='HARM_CATEGORY_HARASSMENT', threshold='BLOCK_NONE'),
    types.SafetySetting(category='HARM_CATEGORY_SEXUALLY_EXPLICIT', threshold='BLOCK_NONE'),
]
```

## CodeDumper Integration

SemRay can optionally use the CodeDumper plugin for enhanced capabilities:

### With CodeDumper

- Advanced virtual call resolution via v-table analysis
- Jump table detection and analysis
- Detailed cross-reference reasons
- PTN (Provenance Tracking Network) annotations showing data flow
- More comprehensive context discovery

### Without CodeDumper

- Falls back to standard IDA API functions
- Basic call graph analysis
- Direct and indirect call tracking
- Fully functional but with less contextual information

The plugin automatically detects CodeDumper and enables integration if available.

## Troubleshooting

### Plugin Not Loading

**Check IDA Output Window** for error messages:
- "PyQt5 not found": Install PyQt5 in IDA's Python environment
- "pydantic not found": Install pydantic (`pip install pydantic`)
- "google-genai not found": Install google-genai (`pip install google-genai`)
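
To see which of these packages is missing, a check like the following can be run in IDA's Python console. It is a minimal sketch: the helper names are hypothetical, and the mapping assumes the `google-genai` package is imported as `google.genai`.

```python
import importlib.util
from typing import Dict, List

# pip package name -> importable module name
REQUIRED_MODULES: Dict[str, str] = {
    "PyQt5": "PyQt5",
    "pydantic": "pydantic",
    "google-genai": "google.genai",
}


def is_importable(module_name: str) -> bool:
    """Return True if the module can be located without fully importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec raises when a parent package (e.g. "google") is missing.
        return False


def missing_packages(required: Dict[str, str] = REQUIRED_MODULES) -> List[str]:
    """List the pip package names whose modules cannot be found."""
    return [package for package, module in required.items()
            if not is_importable(module)]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

Running it from inside IDA matters, because IDA may use a different interpreter than the `pip` on your `PATH`.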

### API Key Issues

**"GOOGLE_API_KEY environment variable not set"**:
- Verify the environment variable is set: `echo $GOOGLE_API_KEY` (Linux/Mac) or `echo %GOOGLE_API_KEY%` (Windows)
- Restart IDA Pro after setting the variable
- Check for typos in the variable name

### Empty or Blocked Responses

**"Google AI response was empty or blocked"**:
- Check Google AI API quota/billing
- Review safety settings if content is being filtered
- Try a simpler function first to verify API connectivity

### Decompilation Failures

**"Decompilation FAILED"**:
- Ensure Hex-Rays decompiler is installed and licensed
- Try **Assembly** mode instead of **Decompiled** mode
- Some functions may not decompile due to code complexity

### Variable Rename Failures

**Variables not renamed**:
- Ensure the function can be decompiled
- Check that variable names match exactly (case-sensitive)
- Some variables may be compiler-generated and cannot be renamed

### Performance Issues

**Slow analysis**:
- Reduce caller/callee depth (try 1 or 2 instead of higher values)
- Analyze fewer functions at once
- Use "Analyze CURRENT Func Only" for individual functions
- Consider using `gemini-flash-latest` instead of `gemini-pro-latest`

## Best Practices

1. **Start Small**: Begin with single function analysis to verify setup and understand results
2. **Iterative Refinement**: Analyze high-level functions first, then drill down into details
3. **Context Balance**: More context improves accuracy but increases cost and time
   - Depth 1-2: Fast, good for focused analysis
   - Depth 3+: Slower, better for understanding complex relationships
4. **Review Carefully**: Always review suggestions before applying - AI can make mistakes
5. **Backup Your IDB**: Keep backups before applying large batch changes
6. **Use Decompiled Mode**: Generally provides better results than assembly
7. **Check Naming Contract**: Ensure suggested names follow your team's conventions

## Architecture

### Plugin Structure

```
semray/
├── semray.py              # Main plugin file
├── ida-plugin.json        # IDA plugin metadata
└── codedump/              # Optional CodeDumper integration
    ├── codedump.py        # Context discovery utilities
    ├── micro-analyzer.py  # Micro-architectural analysis
    └── ptn_utils.py       # PTN provenance tracking
```

### Key Components

1. **Configuration** (Lines 127-173): Constants, API settings, models
2. **Data Models** (Lines 186-219): Pydantic schemas for validation
3. **Context Builder** (Lines 337-496): Gathers code, call graphs, semantics
4. **Analysis Orchestrator** (Lines 501-647): Manages the analysis pipeline
5. **UI Components** (Lines 650-908): PyQt5 widgets for results display
6. **IDA Integration** (Lines 911-1207): Actions, hooks, plugin lifecycle

### Execution Flow

```
User Action (Right-click menu)
    ↓
CtxActionHandler.activate()
    ↓
async_call() orchestrates:
    ↓
1. build_context_material() [IDA main thread]
    ↓
2. Construct LLM prompt with context
    ↓
3. do_google_ai_analysis() [Background thread]
    ↓
4. Parse & validate JSON response
    ↓
5. do_show_ui() [UI thread]
    ↓
User reviews and clicks "Apply Selected"
    ↓
_perform_ida_updates() [IDA main thread]
    ↓
IDB updated, views refreshed
```

## License

This plugin is provided as-is for reverse engineering and security research purposes.

## Contributing

Contributions are welcome! Key areas for improvement:
- Support for additional LLM providers (OpenAI, Claude, etc.)
- Enhanced prompt engineering for better results
- Additional context extraction strategies
- UI/UX improvements
- Performance optimizations

## Changelog

### Current Version
- Initial release with Google AI (Gemini) integration
- Support for decompiled and assembly analysis modes
- Optional CodeDumper integration
- Interactive UI for reviewing suggestions
- Concurrent analysis prevention
- Comprehensive error handling and fallbacks

## Support

When investigating an issue, first check the output window in IDA Pro for diagnostic information and error messages, and include that output with any question or feature request.

## Credits

- Built on IDA Pro's powerful reverse engineering platform
- Leverages Google's Gemini AI for semantic understanding
- Integrates with CodeDumper plugin for enhanced context (optional)
- Uses Pydantic for robust data validation