---
name: python-programming
description: Python fundamentals, data structures, OOP, and data science libraries (Pandas, NumPy). Use when writing Python code, data manipulation, or algorithm implementation.
sasmp_version: "1.3.0"
bonded_agent: 01-python-data-science
bond_type: PRIMARY_BOND
---

# Python Programming for Data Science

Master Python from fundamentals to advanced data science applications.

## Quick Start

### Essential Libraries
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```

### Data Manipulation
```python
# Read data
df = pd.read_csv('data.csv')

# Explore
print(df.head())
print(df.info())
print(df.describe())

# Filter
df_filtered = df[df['age'] > 18]

# Group and aggregate
summary = df.groupby('category')['sales'].agg(['sum', 'mean', 'count'])

# Vectorized operations (FAST!)
df['new_col'] = df['col1'] * 2  # Instead of loops
```

## Core Concepts

### 1. Data Structures
- **Lists**: `[1, 2, 3]` - ordered, mutable
- **Dictionaries**: `{'key': 'value'}` - key-value pairs
- **Tuples**: `(1, 2, 3)` - immutable
- **Sets**: `{1, 2, 3}` - unique elements

### 2. List Comprehensions
```python
# Instead of loops
squares = [x**2 for x in range(10)]
filtered = [x for x in data if x > 0]
```

### 3. NumPy Arrays
```python
arr = np.array([1, 2, 3, 4, 5])
arr * 2  # [2, 4, 6, 8, 10]
arr.mean()  # 3.0
```

### 4. Pandas DataFrames
```python
df = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'age': [25, 30],
    'salary': [50000, 60000]
})
```

## Performance Tips

**Vectorization over Loops (10-100x faster)**:
```python
# Bad (slow)
result = []
for x in data:
    result.append(x * 2)

# Good (fast)
result = np.array(data) * 2
```

## Common Patterns

### Reading Files
```python
# CSV
df = pd.read_csv('file.csv')

# Excel
df = pd.read_excel('file.xlsx', sheet_name='Sheet1')

# JSON
df = pd.read_json('file.json')

# SQL
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql_query("SELECT * FROM table", conn)
```

### Missing Data
```python
df.dropna()  # Remove rows
df.fillna(0)  # Fill with value
df.fillna(df.mean())  # Fill with mean
```

### Merging Data
```python
# Join DataFrames
merged = pd.merge(df1, df2, on='id', how='left')

# Concatenate
combined = pd.concat([df1, df2], axis=0)
```

## Best Practices

1. Use vectorized operations
2. Optimize data types
3. Avoid loops when possible
4. Use built-in functions
5. Profile before optimizing

## Troubleshooting

### Common Issues

**Problem: MemoryError with large DataFrames**
```python
# Solution 1: Use chunking
for chunk in pd.read_csv('large.csv', chunksize=10000):
    process(chunk)

# Solution 2: Optimize dtypes
df['int_col'] = df['int_col'].astype('int32')  # Instead of int64
df['cat_col'] = df['cat_col'].astype('category')  # For repeated strings
```

**Problem: Slow DataFrame operations**
```python
# Debug: Profile your code
%timeit df.apply(func)  # Compare with vectorized

# Solution: Use vectorized operations
df['result'] = np.where(df['x'] > 0, df['x'] * 2, 0)  # Instead of apply
```

**Problem: Import errors**
```bash
# Solution: Check environment
pip list | grep pandas
pip install --upgrade pandas numpy

# Virtual environment best practice
python -m venv venv
source venv/bin/activate  # Linux/Mac
pip install -r requirements.txt
```

**Problem: Data type mismatches**
```python
# Debug: Check types
print(df.dtypes)

# Solution: Convert types explicitly
df['date'] = pd.to_datetime(df['date'])
df['price'] = pd.to_numeric(df['price'], errors='coerce')
```

### Debug Checklist
- [ ] Check Python and library versions
- [ ] Verify data types with `df.dtypes`
- [ ] Profile with `%timeit` before optimizing
- [ ] Use `df.info()` for memory usage
- [ ] Check for NaN values with `df.isna().sum()`