---
name: data-ai-ml-skill
description: Master machine learning, data engineering, AI engineering, MLOps, and prompt engineering. Build intelligent systems from data pipelines to production AI applications with LLMs, agents, and modern frameworks.
---

# Data, AI & ML Skill

A practical guide to building intelligent systems with data science, machine learning, and artificial intelligence, from data manipulation to production deployment.

## Quick Start

### Choose Your Path

```
Data    →  ML   →  Production
  ↓         ↓         ↓
Pandas     SQL     Models
NumPy      ETL     Deployment
```

### Get Started in 5 Steps

1. **Python Fundamentals** (2-3 weeks)
   - NumPy, Pandas basics
   - Data manipulation

2. **Statistics & Math** (4-6 weeks)
   - Probability, distributions
   - Hypothesis testing
   - Linear algebra basics

3. **Machine Learning Algorithms** (6-8 weeks)
   - Supervised learning
   - Unsupervised learning
   - Scikit-learn library

4. **Deep Learning** (8-12 weeks)
   - Neural networks
   - PyTorch or TensorFlow

5. **Production & Deployment** (ongoing)
   - MLOps practices
   - Model serving
   - Monitoring

---

## Data Fundamentals

### **NumPy for Numerical Computing**

```python
import numpy as np

# Array creation
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2], [3, 4]])
zeros = np.zeros((3, 3))
ones = np.ones(5)
range_arr = np.arange(0, 10, 2)

# Basic operations
arr + 5  # [6, 7, 8, 9, 10]
arr * 2  # [2, 4, 6, 8, 10]
np.sum(arr)  # 15
np.mean(arr)  # 3.0
np.std(arr)   # ~1.414 (population standard deviation, ddof=0)

# Indexing and slicing
arr[0]  # 1
arr[1:4]  # [2, 3, 4]
matrix[0, 1]  # 2

# Linear algebra
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
np.dot(A, B)  # Matrix multiplication
np.linalg.inv(A)  # Matrix inverse
```
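
Two details in the snippet above are easy to miss: `np.std` divides by `n` unless told otherwise, and `@` is the operator form of matrix multiplication for 2-D arrays. The sketch below reuses the same arrays; the variable names are illustrative only.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Population standard deviation divides by n (NumPy's default, ddof=0)
population_std = np.std(arr)
# Sample standard deviation divides by n - 1, which is what pandas uses by default
sample_std = np.std(arr, ddof=1)

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# For 2-D arrays, A @ B gives the same result as np.dot(A, B)
product = A @ B

# Multiplying a matrix by its inverse should give the identity (up to rounding)
identity_check = np.allclose(A @ np.linalg.inv(A), np.eye(2))
```

Use `ddof=1` when the array is a sample from a larger population rather than the whole population.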

### **Pandas for Data Analysis**

```python
import pandas as pd

# Create DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000]
})

# Selecting data
df['name']  # Column
df.loc[0]  # Row by label
df.iloc[0]  # Row by position

# Filtering
df[df['age'] > 25]  # Age greater than 25
df[(df['age'] > 25) & (df['salary'] > 55000)]

# Aggregation
df.groupby('age')['salary'].mean()
df.describe()  # Summary statistics

# Missing data
df.isnull()  # Boolean mask of missing values
df.fillna(0)  # Return a copy with nulls replaced by 0
df.dropna()  # Return a copy without rows that contain nulls

# Data transformation
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 60])
df['name_upper'] = df['name'].str.upper()

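# Merging (df1 and df2 are small illustrative frames sharing an 'id' column)
df1 = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'id': [1, 2], 'city': ['Paris', 'Oslo']})
merged = pd.merge(df1, df2, on='id')  # Join columns on matching 'id' values
combined = pd.concat([df1, df2])  # Stack rows; columns missing from a frame become NaN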
```
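
`pd.merge` defaults to an inner join, which silently drops rows that have no match. The sketch below uses two hypothetical frames, `customers` and `orders`, to show how `how="left"` and `indicator=True` expose the unmatched rows.

```python
import pandas as pd

# Hypothetical data: customer 2 and 3 have no orders, order id 4 has no customer
customers = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
orders = pd.DataFrame({"id": [1, 1, 4], "amount": [20.0, 35.0, 15.0]})

# Inner join (default): keeps only ids present in both frames
inner = pd.merge(customers, orders, on="id")

# Left join: keeps every customer; indicator=True adds a '_merge' column
left = pd.merge(customers, orders, on="id", how="left", indicator=True)

# Customers that found no matching order
unmatched = left[left["_merge"] == "left_only"]["name"].tolist()
```

Checking the `_merge` column after a join is a cheap way to catch key mismatches before they distort aggregates.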

### **Data Visualization**

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Assumes df contains the columns used below (year, sales, age, salary, category, department)

# Line plot
plt.plot(df['year'], df['sales'])
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()

# Scatter plot
plt.scatter(df['age'], df['salary'])

# Bar chart
df['category'].value_counts().plot(kind='bar')

# Histogram
plt.hist(df['age'], bins=10)

# Seaborn (higher level)
sns.scatterplot(x='age', y='salary', data=df, hue='department')
correlation_matrix = df.corr(numeric_only=True)  # Pairwise correlations of numeric columns
sns.heatmap(correlation_matrix, annot=True)

# Plotly (interactive)
import plotly.express as px
fig = px.scatter(df, x='age', y='salary', color='department')
fig.show()
```

---

## Machine Learning

### **Supervised Learning**

**Classification:**
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Load data
X = df[['feature1', 'feature2', 'feature3']]
y = df['target']  # 0 or 1

# Split: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
```
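
Accuracy from a single split can mislead, especially when classes are imbalanced. The following sketch adds stratified splitting, 5-fold cross-validation, and precision/recall. It assumes synthetic data from `make_classification` in place of the `df` used above, so the numbers it produces say nothing about a real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, imbalanced binary data (roughly 90% class 0) stands in for a real dataset
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=42
)

# stratify=y keeps the class ratio the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=50, random_state=42)

# 5-fold cross-validated F1 on the training set gives a spread, not a single number
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Precision and recall describe minority-class performance that accuracy hides
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
```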

**Regression:**
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
```

### **Unsupervised Learning**

**Clustering:**
```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Determine optimal clusters
inertias = []
for k in range(1, 10):
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(X)
    inertias.append(model.inertia_)

# Elbow method (plot and find elbow)
plt.plot(range(1, 10), inertias)
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()

# Train final model
model = KMeans(n_clusters=3, random_state=42)
clusters = model.fit_predict(X)
```
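
Reading an elbow off a plot is subjective. One complementary check is the silhouette score, which is higher when clusters are compact and well separated. The sketch below assumes synthetic blobs with three known centers, so the data and the range of `k` are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated centers (illustrative only)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.6,
    random_state=42,
)

# Silhouette needs at least 2 clusters, so start the search at k=2
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the k with the highest average silhouette
best_k = max(scores, key=scores.get)
```

On real data the two methods can disagree; treat both as hints and check whether the resulting clusters make sense for the problem.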

**Dimensionality Reduction:**
```python
from sklearn.decomposition import PCA

# Reduce to 2 dimensions
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(f"Explained variance: {pca.explained_variance_ratio_}")
```
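
Instead of fixing the number of components up front, a common approach is to keep as many as are needed to reach a target share of variance. The sketch below assumes the bundled Iris dataset and a 95% threshold, both chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA is scale-sensitive, so standardize the features first
X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# Fit with all components and accumulate the explained variance ratios
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)

# Passing a float in (0, 1) asks scikit-learn to make the same choice
pca_95 = PCA(n_components=0.95).fit(X_scaled)
```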

### **Feature Engineering**

```python
# Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler

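scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)  # Fit on the training data only
X_test_scaled = scaler.transform(X_test)  # Reuse the training statistics to avoid leakage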

# Encoding categorical variables
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

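# Label encoding: maps each class to an integer in sorted order.
# LabelEncoder is designed for target labels; for ordered input features, prefer OrdinalEncoder.
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category'])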

# One-hot encoding (nominal)
df_encoded = pd.get_dummies(df, columns=['category'])

# Feature selection
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=5)  # Keep the 5 features with the highest ANOVA F-scores
X_selected = selector.fit_transform(X, y)
```
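
The scaling and encoding steps above can be bundled with the model so that they are fitted only on training data and replayed identically at prediction time. A minimal sketch, assuming a hypothetical toy frame with two numeric columns and one categorical column:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: column names and values are illustrative only
df = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50, 55, 60],
    "salary": [50, 60, 70, 80, 90, 100, 110, 120],
    "department": ["eng", "ops", "eng", "ops", "eng", "ops", "eng", "ops"],
    "target": [0, 0, 0, 0, 1, 1, 1, 1],
})
X = df[["age", "salary", "department"]]
y = df["target"]

# Scale numeric columns, one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "salary"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["department"]),
])

# The pipeline fits preprocessing and model together in one call
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])
pipeline.fit(X, y)
predictions = pipeline.predict(X)

# 2 scaled numeric columns + 2 one-hot columns
n_transformed_features = pipeline.named_steps["preprocess"].transform(X).shape[1]
```

Because the pipeline is a single estimator, it can be passed directly to `cross_val_score` or saved with `joblib` without separate scaler and encoder files.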

---

## Deep Learning

### **Neural Networks with PyTorch**

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Define model
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(10, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.sigmoid(self.fc3(x))
        return x

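# Placeholder data: 256 samples, 10 features, binary float labels (swap in your own tensors)
X_tensor = torch.randn(256, 10)
y_tensor = torch.randint(0, 2, (256,)).float()  # BCELoss expects float targets
train_loader = DataLoader(TensorDataset(X_tensor, y_tensor), batch_size=32, shuffle=True)

# Initialize
model = SimpleNN()
optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.BCELoss()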

# Training loop
for epoch in range(100):
    for X_batch, y_batch in train_loader:
        # Forward pass
        predictions = model(X_batch)
        loss = loss_fn(predictions, y_batch.unsqueeze(1))

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")
```

### **Convolutional Neural Networks (CNN)**

```python
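import torch.nn as nn
import torch.nn.functional as F

class CNN(nn.Module):
    """Assumes 3-channel 224x224 inputs: two 2x2 poolings leave 56x56 feature maps."""
    def __init__(self):
        super().__init__()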
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(64 * 56 * 56, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 64 * 56 * 56)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
```
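
The `64 * 56 * 56` in the CNN above only holds for 224x224 inputs. The pure-Python sketch below recomputes the flattened size for other input sizes using the standard output-size formula for convolution and pooling layers (no dilation). The helper names are hypothetical.

```python
def conv_output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial size after a convolution or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(input_size: int = 224, out_channels: int = 64) -> int:
    """Input size of the first linear layer for the two conv + pool stages above."""
    size = input_size
    for _ in range(2):
        # 3x3 convolution with padding=1 keeps the spatial size
        size = conv_output_size(size, kernel=3, padding=1)
        # 2x2 max pooling with stride 2 halves it
        size = conv_output_size(size, kernel=2, stride=2)
    return out_channels * size * size


features_224 = flattened_features(224)  # matches 64 * 56 * 56
features_32 = flattened_features(32)    # e.g. CIFAR-sized images
```

If the input size changes, `fc1` and the `view` call must both use the recomputed value.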

---

## AI Engineering & LLMs

### **Working with Large Language Models**

**OpenAI API:**
```python
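import os
from openai import OpenAI

# Requires openai>=1.0; the older openai.ChatCompletion interface was removed in that release
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # Read the key from the environment, not from source code

response = client.chat.completions.create(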
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain machine learning in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)
```

**LangChain (LLM Framework):**
```python
# Requires the langchain-openai package; the older langchain.llms and LLMChain interfaces are deprecated
from langchain_openai import OpenAI
from langchain_core.prompts import PromptTemplate

llm = OpenAI(temperature=0.7)

template = """
You are an expert {expertise}.
{question}
"""

prompt = PromptTemplate(
    template=template,
    input_variables=["expertise", "question"]
)

# Pipe the prompt into the model, then invoke the chain with the template variables
chain = prompt | llm
result = chain.invoke({"expertise": "data scientist", "question": "What is feature engineering?"})
```

### **Prompt Engineering**

**Few-Shot Learning:**
```python
prompt = """
Classify the sentiment: positive, negative, or neutral.

Examples:
"I love this product!" → positive
"This is terrible." → negative
"It's okay." → neutral

Classify: "Best purchase ever!"
"""
```
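
Hard-coding examples in a string makes them awkward to swap or extend. A small helper can assemble the same prompt from a list of labeled examples. The function name and signature below are hypothetical, not part of any library.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot classification prompt from (text, label) pairs."""
    lines = [instruction, "", "Examples:"]
    # One line per labeled example, in the same format as the prompt above
    lines += [f'"{text}" → {label}' for text, label in examples]
    # The unlabeled item goes last so the model completes it
    lines += ["", f'Classify: "{query}"']
    return "\n".join(lines)


examples = [
    ("I love this product!", "positive"),
    ("This is terrible.", "negative"),
    ("It's okay.", "neutral"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment: positive, negative, or neutral.",
    examples,
    "Best purchase ever!",
)
```

Keeping examples as data also makes it straightforward to test how the number or order of examples affects results.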

**Chain of Thought:**
```python
prompt = """
Let's think step by step.

Question: If a train leaves a station at 2 PM going 60 mph, and another leaves the same station at 3 PM going 80 mph in the same direction, when does the second catch up?

Step 1: Set up equations
Step 2: Solve for time
Step 3: Verify answer
"""
```
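
When evaluating chain-of-thought output, it helps to have a reference answer computed independently. The sketch below solves the train question, assuming both trains leave the same station and travel the same route in the same direction. The function name is hypothetical.

```python
def catch_up_hours(lead_speed, chase_speed, head_start_hours):
    """Hours after the second departure until the faster train catches the slower one."""
    if chase_speed <= lead_speed:
        raise ValueError("The second train must be faster to catch up.")
    # Head start distance divided by the closing speed
    return lead_speed * head_start_hours / (chase_speed - lead_speed)


# First train: 60 mph with a 1 hour head start; second train: 80 mph
hours_after_3pm = catch_up_hours(60, 80, 1)

# Second train departs at 15:00 on a 24-hour clock
catch_up_clock = 15 + hours_after_3pm
```

Under these assumptions the first train is 60 miles ahead at 3 PM and the gap closes at 20 mph, so the trains meet 3 hours later, at 6 PM.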

### **Building AI Agents**

```python
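# Note: initialize_agent and langchain.llms are legacy LangChain interfaces;
# newer releases deprecate them, so pin your LangChain version or consult its migration guide.
from langchain.agents import initialize_agent, Tool
from langchain.agents import AgentType
from langchain.llms import OpenAI

def google_search(query: str) -> str:
    """Placeholder: connect this to a real search API before running the agent."""
    raise NotImplementedError

# Define tools
tools = [
    Tool(
        name="Calculator",
        # WARNING: eval executes arbitrary Python; replace it before handling untrusted input
        func=lambda x: str(eval(x)),
        description="Useful for math"
    ),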
    Tool(
        name="Search",
        func=google_search,
        description="Search the internet"
    )
]

agent = initialize_agent(
    tools,
    OpenAI(temperature=0),
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

result = agent.run("What is 45 * 3? Then search for the capital of France.")
```
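
The Calculator tool above passes model-generated text to `eval`, which can execute arbitrary code. One safer option is to parse the expression with the standard-library `ast` module and allow only arithmetic nodes. The `safe_calculate` function below is a hypothetical replacement, sketched under the assumption that only `+`, `-`, `*`, `/`, and parentheses are needed.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def safe_calculate(expression: str) -> str:
    """Evaluate a basic arithmetic expression without calling eval."""

    def evaluate(node):
        # Plain numbers (bool is excluded even though it subclasses int)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](evaluate(node.operand))
        # Names, calls, attribute access, and everything else are refused
        raise ValueError(f"Unsupported expression: {expression!r}")

    tree = ast.parse(expression, mode="eval")
    return str(evaluate(tree.body))
```

It can be dropped in as `func=safe_calculate` in the `Tool` definition. Malformed input raises `SyntaxError` and disallowed constructs raise `ValueError`, so the caller should catch both.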

---

## MLOps (Production)

### **Model Versioning with MLflow**

```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier

mlflow.set_experiment("iris-classification")

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 5)

    # Train model (hyperparameters match the values logged above)
    model = RandomForestClassifier(n_estimators=100, max_depth=5)
    model.fit(X_train, y_train)

    # Log metrics
    accuracy = model.score(X_test, y_test)
    mlflow.log_metric("accuracy", accuracy)

    # Log model
    mlflow.sklearn.log_model(model, "random_forest_model")

# Later, load the model (replace <run_id> with the ID of the run that logged it)
model = mlflow.sklearn.load_model("runs:/<run_id>/random_forest_model")
```

### **Model Serving with FastAPI**

```python
from fastapi import FastAPI
import joblib

app = FastAPI()
model = joblib.load('model.pkl')

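@app.post("/predict")
def predict(data: dict):  # Plain def: FastAPI runs it in a thread pool, so the blocking predict call does not stall the event loop
    features = [data['feature1'], data['feature2'], data['feature3']]
    prediction = model.predict([features])
    # Convert the NumPy scalar to a native Python type so it can be serialized to JSON
    return {"prediction": prediction[0].item()}

# Run (assuming this file is saved as app.py): uvicorn app:app --reload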
```

### **Monitoring Models**

```python
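# Detect data drift
# The package is imported as `evidently` (installed with `pip install evidently`).
# Dashboard and tabs belong to the legacy API; newer Evidently releases replace them with reports and presets.
from evidently.dashboard import Dashboard
from evidently.tabs import DataDriftTab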

dashboard = Dashboard(tabs=[DataDriftTab()])
dashboard.calculate(reference_data, current_data)
dashboard.save("data_drift_report.html")
```
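
Drift detection can also be sketched without a dashboard library. For a single numeric feature, a two-sample Kolmogorov-Smirnov test compares the reference and current distributions. The example below assumes synthetic data, a hypothetical `has_drifted` helper, and a 0.05 significance threshold chosen for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic feature values: training-time reference vs. two production batches
rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=1000)
current_same = rng.normal(loc=0.0, scale=1.0, size=1000)     # same distribution
current_shifted = rng.normal(loc=1.0, scale=1.0, size=1000)  # mean shifted by 1


def has_drifted(reference_values, current_values, alpha=0.05):
    """Flag drift when the KS test rejects 'same distribution' at level alpha."""
    result = ks_2samp(reference_values, current_values)
    return bool(result.pvalue < alpha)


drift_shifted = has_drifted(reference, current_shifted)
drift_same = has_drifted(reference, current_same)
```

With many features, testing each one at `alpha=0.05` produces false alarms by chance, so a correction for multiple comparisons or a stricter threshold is worth considering.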

---

## Learning Roadmap

- [ ] Python fundamentals (NumPy, Pandas)
- [ ] Statistics and probability
- [ ] Data visualization
- [ ] Machine learning basics (Scikit-learn)
- [ ] Supervised learning (classification, regression)
- [ ] Unsupervised learning (clustering, dimensionality reduction)
- [ ] Deep learning (PyTorch or TensorFlow)
- [ ] Build 2-3 ML projects
- [ ] Learn MLOps basics
- [ ] LLMs and prompt engineering
- [ ] Deploy a model to production
- [ ] Ready for an ML engineer role!

---

**Source**: https://roadmap.sh/machine-learning, https://roadmap.sh/ai-engineer, https://roadmap.sh/data-engineer