# Instagram Scraper for Patrick Henry High School

This automated Instagram scraper downloads and stores content from Patrick Henry High School's Instagram accounts in Supabase storage.

## 🎯 Features

- **Dual-Mode Operation**: A one-time initialization run that backfills the last 3 months, plus hourly incremental updates
- **Caption Storage**: Automatically extracts and saves post captions as text files
- **Smart Content Filtering**: Downloads posts, stories, and profile pictures (excludes reels)
- **Organized Storage**: Files are categorized into separate storage buckets
- **Automated Execution**: Runs every hour via GitHub Actions

## 📁 Storage Structure

The scraper organizes content into **4 storage buckets**:

```
instagram-posts/          # Images and videos from posts
├── username1/
│   ├── phhs_athletics_2025-01-20_14-30-45_GraphImage_B1a2C3d4.jpg
│   └── pathenry2026_2025-01-20_15-45-12_GraphVideo_C2b3D4e5.mp4
└── username2/...

instagram-stories/        # Story content (24-hour expiring)
├── username1/
│   ├── phhs_athletics_2025-01-20_16-20-10_StoryImage_D3c4E5f6.jpg
│   └── phhsmun_2025-01-20_17-10-30_StoryVideo_E4d5F6g7.mp4
└── username2/...

instagram-profile-pics/   # Profile pictures
├── username1/
│   └── phhs_athletics_2025-01-20_12-00-00_profile_pic.jpg
└── username2/...

instagram-captions/       # Post captions as text files
├── username1/
│   ├── phhs_athletics_2025-01-20_14-30-45_caption_B1a2C3d4.txt
│   └── pathenry2026_2025-01-20_15-45-12_caption_C2b3D4e5.txt
└── username2/...
```
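
As a rough illustration of how a file could be routed into this layout, here is a minimal sketch. The bucket names come from the tree above; the routing rules (matching on `_caption_`, `_profile_pic`, and `_Story` in the file name) and the helper names `bucket_for` and `object_path` are assumptions made for the example, not necessarily how `instagram_scraper.py` does it.

```python
from pathlib import PurePosixPath

# Bucket names are taken from the layout above; the keys are illustrative.
BUCKETS = {
    "caption": "instagram-captions",
    "profile_pic": "instagram-profile-pics",
    "story": "instagram-stories",
    "post": "instagram-posts",
}


def bucket_for(filename: str) -> str:
    """Pick a bucket by looking for type markers in the file name (assumed rule)."""
    if "_caption_" in filename:
        return BUCKETS["caption"]
    if "_profile_pic" in filename:
        return BUCKETS["profile_pic"]
    if "_Story" in filename:
        return BUCKETS["story"]
    # Anything else (GraphImage, GraphVideo, ...) is treated as a regular post.
    return BUCKETS["post"]


def object_path(account: str, filename: str) -> str:
    """Return '<bucket>/<account>/<filename>' for a scraped file."""
    return str(PurePosixPath(bucket_for(filename)) / account / filename)
```

For example, `object_path("phhs_athletics", "phhs_athletics_2025-01-20_12-00-00_profile_pic.jpg")` resolves to a path under `instagram-profile-pics/phhs_athletics/`.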

## 🔧 Setup Instructions

### 1. Create Supabase Storage Buckets

**Go to**: [Supabase Dashboard Storage](https://app.supabase.com/project/zofjzjdtqksqugahotcs/storage/buckets)

**Create these 4 buckets** (click "New bucket" for each):

1. **`instagram-posts`**
   - Name: `instagram-posts`
   - Public: ✅ (checked)
   - File size limit: 50MB
   - Allowed MIME types: Leave empty (all types)

2. **`instagram-stories`**
   - Name: `instagram-stories`
   - Public: ✅ (checked)
   - File size limit: 50MB
   - Allowed MIME types: Leave empty (all types)

3. **`instagram-profile-pics`**
   - Name: `instagram-profile-pics`
   - Public: ✅ (checked)
   - File size limit: 10MB
   - Allowed MIME types: Leave empty (all types)

4. **`instagram-captions`**
   - Name: `instagram-captions`
   - Public: ✅ (checked)
   - File size limit: 1MB
   - Allowed MIME types: `text/plain`

> **Important**: All buckets must be set to **Public** so the URLs are accessible.
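
The bucket settings above can be mirrored in code to catch oversized files before an upload is attempted. The sketch below is only an illustration: it assumes 1 MB means 1024 × 1024 bytes, and it builds public URLs using the usual Supabase pattern for public buckets. The helper names `fits_bucket` and `public_url` are hypothetical.

```python
# Size limits mirror the bucket settings above (in megabytes).
BUCKET_LIMITS_MB = {
    "instagram-posts": 50,
    "instagram-stories": 50,
    "instagram-profile-pics": 10,
    "instagram-captions": 1,
}


def fits_bucket(bucket: str, size_bytes: int) -> bool:
    """True if a file of size_bytes is within the bucket's configured limit.

    Assumes 1 MB = 1024 * 1024 bytes.
    """
    return size_bytes <= BUCKET_LIMITS_MB[bucket] * 1024 * 1024


def public_url(project_ref: str, bucket: str, path: str) -> str:
    """Build the public object URL for a file stored in a public bucket."""
    return (
        f"https://{project_ref}.supabase.co"
        f"/storage/v1/object/public/{bucket}/{path}"
    )
```

These URLs only resolve when the bucket is public, which is why the setting above matters.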

### 2. Database Setup

The scraper uses these existing tables:
- `schools` - Contains Patrick Henry High School info
- `usernames` - Contains all Instagram usernames to scrape

## 🚀 Usage

### Local Initialization (Run Once)

```bash
python instagram_scraper.py
```

**What happens:**
- Prompts you to choose a single account or all accounts (34 total)
- Downloads the last 3 months of content for the selected accounts
- Uploads everything to Supabase storage
- Saves captions as separate text files
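
The 3-month window can be pictured as a simple date cutoff. This is a minimal sketch, assuming "3 months" is approximated as 90 days and that items arrive newest first as `(timestamp, shortcode)` pairs. The function name `recent_items` is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from itertools import takewhile


def recent_items(items, now=None, days=90):
    """Return the (timestamp, shortcode) pairs that are newer than the cutoff.

    Assumes items are ordered newest first, so iteration can stop at the
    first item that is too old.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return list(takewhile(lambda item: item[0] >= cutoff, items))
```

Stopping at the first item that is too old keeps the initialization run from walking an account's entire history.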

### Automatic Hourly Updates

- GitHub Actions runs automatically every hour
- Downloads only new content (fast-update mode)
- Processes all accounts from the Supabase database
- No user interaction required
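
Fast-update mode follows the same idea as Instaloader's `--fast-update` option: stop at the first item that was already downloaded. The sketch below shows that logic in isolation. It assumes shortcodes are listed newest first and that previously stored shortcodes are available as a set. `new_items` is a hypothetical helper, not the scraper's actual code.

```python
def new_items(shortcodes, seen_shortcodes):
    """Collect shortcodes newest first until an already-stored one shows up."""
    fresh = []
    for shortcode in shortcodes:
        if shortcode in seen_shortcodes:
            # Everything after this point was fetched by an earlier run.
            break
        fresh.append(shortcode)
    return fresh
```

With hourly runs, most accounts have nothing new, so the loop usually exits on the first item.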

## 📋 What Gets Downloaded

### ✅ Included Content:
- **Posts**: Images and videos from regular posts
- **Stories**: Current stories (24-hour content)
- **Profile Pictures**: Current profile images
- **Captions**: Post captions saved as `.txt` files

### ❌ Excluded Content:
- **Reels**: Explicitly excluded from downloads
- **Comments**: Not downloaded
- **IGTV**: Not specifically targeted
- **Highlights**: Not included
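
The include/exclude rules above amount to an allow-list. Here is a minimal sketch using a hypothetical `Item` record. The real scraper's data model is not shown in this README, so the `kind` labels are assumptions.

```python
from dataclasses import dataclass


# Hypothetical record type, used only for this illustration.
@dataclass
class Item:
    shortcode: str
    kind: str  # e.g. "post", "story", "profile_pic", "reel", "highlight"


# Kinds listed under "Included Content" above.
INCLUDED_KINDS = {"post", "story", "profile_pic"}


def wanted(items):
    """Keep only items whose kind is on the allow-list."""
    return [item for item in items if item.kind in INCLUDED_KINDS]
```

An allow-list skips any content type that is not explicitly named, which matches the "not specifically targeted" wording for IGTV.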

## 🗂️ File Naming Convention

All files follow this pattern:
```
{username}_{date_utc}_{type}_{shortcode}.{extension}

Examples:
- phhs_athletics_2025-01-20_14-30-45_GraphImage_B1a2C3d4.jpg
- pathenry2026_2025-01-20_15-45-12_caption_C2b3D4e5.txt
- phhsmun_2025-01-20_16-20-10_StoryVideo_D3c4E5f6.mp4
```
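
A small helper makes the pattern concrete. This is a sketch only: `build_filename` is a hypothetical name, and it assumes the timestamp is a timezone-aware `datetime`.

```python
from datetime import datetime, timezone


def build_filename(username, taken_at, kind, shortcode, extension):
    """Format a name as {username}_{date_utc}_{type}_{shortcode}.{extension}.

    taken_at must be timezone-aware; it is converted to UTC before formatting.
    """
    date_utc = taken_at.astimezone(timezone.utc).strftime("%Y-%m-%d_%H-%M-%S")
    return f"{username}_{date_utc}_{kind}_{shortcode}.{extension}"
```

Calling it with `"phhs_athletics"`, a timestamp of 2025-01-20 14:30:45 UTC, `"GraphImage"`, `"B1a2C3d4"`, and `"jpg"` reproduces the first example above.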

## 🛡️ Error Handling

- **Individual Failures**: Script continues if one account fails
- **Storage Fallbacks**: Continues even if some uploads fail
- **Duplicate Prevention**: Fast-update mode prevents re-downloading
- **Comprehensive Logging**: Detailed logs for troubleshooting
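
The "continue on failure" behavior can be sketched as a loop that isolates each account in its own `try` block. The names `scrape_all` and `scrape_one` are hypothetical, and the logger name is an assumption.

```python
import logging

logger = logging.getLogger("instagram_scraper")


def scrape_all(usernames, scrape_one):
    """Run scrape_one for each account and keep going when one fails.

    Returns (succeeded, failed) lists of usernames.
    """
    succeeded, failed = [], []
    for username in usernames:
        try:
            scrape_one(username)
            succeeded.append(username)
        except Exception:
            # Log the full traceback, then move on to the next account.
            logger.exception("Scrape failed for %s", username)
            failed.append(username)
    return succeeded, failed
```

Returning the failed accounts makes it easy to summarize a run at the end of the log.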

## 📊 Monitoring

### Check GitHub Actions:
1. Go to the **Actions** tab in your repository
2. View the "scrape-instagram" workflow
3. Check logs for any errors or successful runs

### Check Storage Usage:
1. Go to [Supabase Storage](https://app.supabase.com/project/zofjzjdtqksqugahotcs/storage/buckets)
2. Click on each bucket to see uploaded files
3. Monitor storage usage and costs

## 🔄 Profile Changes

The system automatically handles:
- **Username Changes**: Instaloader detects and handles renames
- **Profile Pictures**: New profile pics are downloaded
- **Content Updates**: New posts and stories are captured

## 📈 Scaling

Currently configured for **34 Instagram accounts** from Patrick Henry High School. To add more accounts:

1. Add usernames to the `usernames` table in Supabase
2. The scraper will automatically include them in the next run
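
Because the account list lives in the database, picking up new accounts only requires re-reading the table on each run. Here is a minimal sketch, assuming a supabase-py client and a column named `username`. The column name is a guess at the schema, and `load_usernames` is a hypothetical helper.

```python
def load_usernames(client, column="username"):
    """Read every account name from the `usernames` table.

    `client` is expected to be a supabase-py client (from `create_client`).
    The column name is an assumption about the table schema.
    """
    response = client.table("usernames").select(column).execute()
    return [row[column] for row in response.data]
```

The function takes the client as an argument, so it can be tried out with a stub object before pointing it at the real project.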

## 🎮 Testing

### Test Locally:
```bash
python instagram_scraper.py
```

### Test GitHub Actions:
1. Go to **Actions** tab
2. Click "scrape-instagram" workflow
3. Click "Run workflow" button
4. Monitor the execution logs

---

## 📋 Account List

The scraper currently processes these Patrick Henry High School accounts:

- `pathenryasb`, `phhspatriotsvolleyball`, `pathenry2026`, `phhsmun`
- `pathenry2028`, `henrymodelun`, `patrick_henry_wrestling`, `phhs_athletics`
- `phhsboyslacrosse`, `henryyearbook`, `henryfootball1`, `phhscollege25`
- `phhs.youngwomeninmedicine`, `pathenrymascot`, `pathenry2025`, `phhswomensoccer`
- `phhs.girlsflag`, `phhsenviro`, `henry_powderpuff`, `phhsmarchingband`
- `patrickhenrywaterboys`, `phhsavid`, `phhs.blackboxtheaterco`, `phhscolorguard`
- `phhscheer`, `phhs.swear`, `patrickhenrypolomamis`, `phhsdanceteam`
- `phhs.mocktrial`, `phhsthriftstore`, `phhs.swim`, `linkcrewphhs`
- `phhs_theatre`, `phhs.patsplace`

Ready to capture your school's digital memories! 🎓📸