---
name: comp-scout-scrape
description: Scrape competition websites, extract structured data, and auto-persist to GitHub issues. Creates issues for new competitions, adds comments for duplicates.
---

# Competition Scraper

Scrape creative writing competitions from Australian aggregator sites and **automatically persist to GitHub**.

## What This Skill Does

1. Scrapes competitions.com.au and netrewards.com.au
2. Extracts structured data (dates, prompts, prizes)
3. **Checks for duplicates** against existing GitHub issues (by URL and title similarity)
4. Creates issues for **NEW** competitions only
5. Adds a comment to the existing issue when the same competition is found on another site
6. Skips competitions that are already tracked

**The scraper already filters out sponsored/lottery ads. Your job is to check for duplicates, then persist only new competitions.**

## What Counts as "New"

A competition is NEW if:
- Its URL is not found in any existing issue body (check the full body text, not just the primary URL field)
- AND its normalized title is at most 80% similar to every existing issue title

A competition is a DUPLICATE if:
- Its URL appears anywhere in an existing issue (body text, comments) → already tracked, skip
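- Its normalized title is >80% similar to an existing issue title and that issue already lists this aggregator site → likely the same competition, skip
- Its normalized title is >80% similar to an existing issue title but the competition comes from a different aggregator site → add a comment to the existing issue noting the alternate URL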

**Note:** An issue body may contain multiple URLs (one per aggregator site). When checking for duplicates, search the entire issue body for the scraped URL, not just a specific field.
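
The rules above can be sketched as a small classifier. This is a minimal illustration, not the skill's actual implementation: the similarity metric (`difflib.SequenceMatcher`) is an assumption, since the skill only states the 80% threshold, and the issue shape (`number`, `normalized_title`, `text` holding the body plus any comments) is hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional

SIMILARITY_THRESHOLD = 0.80  # the ">80%" rule; the metric itself is assumed


def title_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, a, b).ratio()


def classify(comp: dict, issues: list) -> tuple:
    """Return ("tracked" | "duplicate" | "new", matching issue number or None).

    Each issue is a hypothetical dict with "number", "normalized_title",
    and "text" (issue body plus comments, concatenated).
    """
    # 1. URL found anywhere in an issue's text: already tracked, skip.
    for issue in issues:
        if comp["url"] in issue["text"]:
            return "tracked", issue["number"]

    # 2. Similar title: same competition, possibly seen on another site.
    for issue in issues:
        score = title_similarity(comp["normalized_title"], issue["normalized_title"])
        if score > SIMILARITY_THRESHOLD:
            if comp["site"] in issue["text"]:
                return "tracked", issue["number"]  # same site, skip
            return "duplicate", issue["number"]  # other site, add a comment

    # 3. No URL or title match: new competition.
    matched: Optional[int] = None
    return "new", matched
```

A `"duplicate"` result maps to the "add comment" path and `"tracked"` to the skip path; only `"new"` competitions go on to detail fetching.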

## Word Limit Clarification

**"25WOL" is a category name, NOT a filter.** Competitions with 25, 50, or 100 word limits are all valid creative writing competitions - persist them all (if new).

## Prerequisites

```bash
pip install playwright
playwright install chromium
```

Also requires:
- `gh` CLI authenticated
- Target repository for competition data (not this skills repo)

## Workflow

### Step 1: Determine Target Repository

The target repo stores competition issues. Specify it explicitly or read it from the workspace config:

```bash
# From workspace config (if hiivmind-pulse-gh initialized)
TARGET_REPO=$(yq '.repositories[0].full_name // ""' .hiivmind/github/config.yaml 2>/dev/null)

# Or use default/specified
TARGET_REPO="${TARGET_REPO:-discreteds/competition-data}"
```

### Step 2: Scrape Listings

Run the scraper to get structured competition data:

```bash
python skills/comp-scout-scrape/scraper.py listings
```

**Output:**
```json
{
  "competitions": [
    {
      "url": "https://competitions.com.au/win-example/",
      "site": "competitions.com.au",
      "title": "Win a $500 Gift Card",
      "normalized_title": "500 gift card",
      "brand": "Example Brand",
      "prize_summary": "$500",
      "prize_value": 500,
      "closing_date": "2024-12-31"
    }
  ],
  "scrape_date": "2024-12-09",
  "errors": []
}
```

### Step 3: Check for Existing Issues

For each scraped competition, check if it already exists:

```bash
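# Get all competition issues, open and closed: auto-tagged issues are
# closed on creation and must still count as tracked. Comments are
# included because alternate-site URLs are recorded there.
gh issue list -R "$TARGET_REPO" \
  --label "competition" \
  --state all \
  --json number,title,body,comments \
  --limit 200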
```

**Match by:**
1. URL in issue body (exact match = definite duplicate)
2. Normalized title similarity (>80% = likely duplicate)

### Step 4: Fetch Details for New Competitions

For competitions not already tracked, get full details:

```bash
python skills/comp-scout-scrape/scraper.py detail "https://competitions.com.au/win-example/"
```

For multiple new competitions, use batch mode:

```bash
echo '{"urls": ["url1", "url2"]}' | python skills/comp-scout-scrape/scraper.py details-batch
```

### Step 4.5: Apply Auto-Tagging Rules (NOT Filtering)

**IMPORTANT: Auto-tagging is for LABELING issues, not for skipping/excluding competitions.**

Check competitions against user preferences from the data repo's CLAUDE.md to determine which labels to apply.

1. Fetch preferences:
```bash
gh api repos/$TARGET_REPO/contents/CLAUDE.md -H "Accept: application/vnd.github.raw" 2>/dev/null
```

2. Parse the Detection Keywords section for tagging rules

3. For each competition, check if title/prize matches any keywords:
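```python
def labels_for(competition: dict, tag_rules: dict[str, list[str]]) -> list[str]:
    """tag_rules maps a label (e.g. "for-kids") to its keywords from CLAUDE.md."""
    text = f"{competition['title']} {competition['prize_summary']}".lower()
    return [
        label
        for label, keywords in tag_rules.items()
        if any(keyword.lower() in text for keyword in keywords)
    ]
```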

4. **Every NEW competition is persisted as an issue, whether or not it is tagged.** Tagged competitions:
   - Get the relevant label applied (e.g., `for-kids`, `cruise`)
   - Are closed immediately with an explanation comment
   - Are STILL CREATED as issues (for record-keeping and potential review)

### Step 5: Auto-Persist Results

#### For New Competitions → Create Issue

```bash
gh issue create -R "$TARGET_REPO" \
  --title "$TITLE" \
  --label "competition" \
  --label "25wol" \
  --body "$(cat <<'EOF'
## Competition Details

**URL:** {url}
**Brand:** {brand}
**Prize:** {prize_summary}
**Word Limit:** {word_limit} words
**Closes:** {closing_date}
**Draw Date:** {draw_date}
**Winners Notified:** {notification_info}

## Prompt

> {prompt}

---
*Scraped from {site} on {scrape_date}*
EOF
)"
```

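Then set the milestone for the closing month. `gh issue create` prints the new issue's URL; capture it and extract `$ISSUE_NUMBER` the same way as in the filtered case below:
```bash
gh issue edit "$ISSUE_NUMBER" -R "$TARGET_REPO" --milestone "December 2024"
```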

#### For Duplicates → Add Comment

If the same competition was found on another site:

```bash
gh issue comment $EXISTING_ISSUE -R "$TARGET_REPO" --body "$(cat <<'EOF'
### Also found on {other_site}

**URL:** {url}
**Title on this site:** {title}
*Discovered: {date}*
EOF
)"
```

#### For Filtered Competitions → Create Issue + Close

If the competition matched an auto-tagging keyword from Step 4.5:

```bash
# Create the issue first (for record-keeping)
ISSUE_URL=$(gh issue create -R "$TARGET_REPO" \
  --title "$TITLE" \
  --label "competition" \
  --label "25wol" \
  --label "$FILTER_LABEL" \
  --body "...")

# Extract issue number
ISSUE_NUMBER=$(echo "$ISSUE_URL" | grep -oE '[0-9]+$')

# Close with explanation (unquoted EOF so $KEYWORD and $FILTER_RULE expand)
gh issue close "$ISSUE_NUMBER" -R "$TARGET_REPO" --comment "$(cat <<EOF
Auto-filtered: matches '$KEYWORD' in $FILTER_RULE preferences.

See CLAUDE.md in this repository for filter settings.
EOF
)"
```

### Step 6: Report Results

Present confirmation to user:

```
✅ Scrape complete!

**Created 3 new issues:**
- #42: Win a $500 Coles Gift Card (closes Dec 31)
- #43: Win a Trip to Bali (closes Jan 15)
- #44: Win a Year's Supply of Coffee (closes Dec 20)

**Auto-filtered 2 (created + closed):**
- #45: Win Lego Set (for-kids: matched "Lego")
- #46: Win P&O Cruise (cruise: matched "P&O")

**Found 2 duplicates (added as comments):**
- #38: Win Woolworths Gift Cards (also on netrewards.com.au)
- #39: Win Dreamworld Experience (also on netrewards.com.au)

**Skipped 7 already tracked**
```

**IMPORTANT:** Do NOT ask "Would you like me to analyze these?" at the end. When invoked by `comp-scout-daily`, the workflow will automatically invoke analyze/compose skills next. Report results and stop.

## Output Fields

### Listing Output

| Field | Type | Description |
|-------|------|-------------|
| url | string | Full URL to competition detail page |
| site | string | Source site (competitions.com.au or netrewards.com.au) |
| title | string | Competition title as displayed |
| normalized_title | string | Lowercase, prefixes stripped, for matching |
| brand | string | Sponsor/brand name (if available) |
| prize_summary | string | Prize description or value badge |
| prize_value | int/null | Numeric value in dollars |
| closing_date | string/null | YYYY-MM-DD format |

### Detail Output

All listing fields plus:

| Field | Type | Description |
|-------|------|-------------|
| prompt | string | The actual competition question/prompt |
| word_limit | int | Maximum words (default 25) |
| entry_method | string | How to submit entry |
| winner_notification | object/null | Notification details from JSON-LD |
| scraped_at | string | ISO timestamp of scrape |

### Winner Notification Object

| Field | Type | Description |
|-------|------|-------------|
| notification_text | string | Raw notification text |
| notification_date | string/null | Specific date if mentioned |
| notification_days | int/null | Days after close/draw |
| selection_text | string | How winners are selected |
| selection_date | string/null | When judging occurs |

## Title Normalization

Titles are normalized for deduplication:

1. Lowercase
2. Strip one leading prefix, matched case-insensitively and longest first: "Win 1 of ", "Win the ", "Win an ", "Win a ", "Win "
3. Remove punctuation
4. Collapse whitespace

**Example:**
```
Original: "Win a $500 Coles Gift Card"
Normalized: "500 coles gift card"
```
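
The four steps can be sketched in Python as follows. This is an illustrative reimplementation, not the code in `scraper.py`, which may differ in details such as how it treats non-ASCII characters; the function name `normalize_title` is an assumption.

```python
import re

# Longest prefixes first, so "win a " is tried before the bare "win ".
PREFIXES = ("win 1 of ", "win the ", "win an ", "win a ", "win ")


def normalize_title(title: str) -> str:
    """Lowercase, strip one leading prefix, drop punctuation, collapse spaces."""
    text = title.lower().strip()
    for prefix in PREFIXES:
        if text.startswith(prefix):
            text = text[len(prefix):]
            break  # strip at most one prefix
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation such as "$" or "'"
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace
```

For the example above, `normalize_title("Win a $500 Coles Gift Card")` returns `"500 coles gift card"`.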

## Example Session

```
User: Scrape competitions

Claude: I'll scrape competitions and persist new ones to GitHub.

[Runs: python skills/comp-scout-scrape/scraper.py listings]

Found 12 competitions from both sites.

[Runs: gh issue list -R discreteds/competition-data --label competition --json number,title,body]

Checking against 45 existing issues...
- 3 are new
- 2 are duplicates (same competition, different source)
- 7 already tracked

Fetching details for 3 new competitions...

[Creates issues and adds comments]

✅ Scrape complete!

**Created 3 new issues:**
- #46: Win a $500 Coles Gift Card (closes Dec 31)
  - Milestone: December 2024
- #47: Win a Trip to Bali (closes Jan 15)
  - Milestone: January 2025
- #48: Win a Year's Supply of Coffee (closes Dec 20)
  - Milestone: December 2024

**Added 2 duplicate comments:**
- #38: Also found on netrewards.com.au
- #39: Also found on netrewards.com.au

```

## CLI Commands Reference

```bash
# Scrape all listing pages
python skills/comp-scout-scrape/scraper.py listings

# Get full details for one competition
python skills/comp-scout-scrape/scraper.py detail "URL"

# Get full details for multiple competitions (batch mode)
echo '{"urls": ["url1", "url2"]}' | python skills/comp-scout-scrape/scraper.py details-batch

# Debug: just get URLs
python skills/comp-scout-scrape/scraper.py urls
```

### Batch Details Output

```json
{
  "details": [
    {
      "url": "...",
      "title": "...",
      "prompt": "Tell us in 25 words...",
      "word_limit": 25,
      ...
    }
  ],
  "scrape_date": "2024-12-09",
  "errors": []
}
```

## Persistence Details

This skill handles all GitHub persistence. The separate `comp-scout-persist` skill is **deprecated** - its functionality is merged here.

### Issue Creation Template

```markdown
## Competition Details

**URL:** {url}
**Brand:** {brand}
**Prize:** {prize_summary}
**Word Limit:** {word_limit} words
**Closes:** {closing_date}
**Draw Date:** {draw_date}
**Winners Notified:** {notification_info}

## Prompt

> {prompt}

---
*Scraped from {site} on {scrape_date}*
```
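
A minimal sketch of filling this template from a detail record follows. The `render_issue_body` helper and the "Not specified" placeholder for missing or null fields are assumptions; the template text itself is copied from above. Fields such as `draw_date` and `notification_info` are not part of the documented detail output, so they fall back to the placeholder unless the caller supplies them.

```python
ISSUE_TEMPLATE = """\
## Competition Details

**URL:** {url}
**Brand:** {brand}
**Prize:** {prize_summary}
**Word Limit:** {word_limit} words
**Closes:** {closing_date}
**Draw Date:** {draw_date}
**Winners Notified:** {notification_info}

## Prompt

> {prompt}

---
*Scraped from {site} on {scrape_date}*
"""


class _WithFallback(dict):
    """Dict that substitutes a placeholder for any missing template field."""

    def __missing__(self, key: str) -> str:
        return "Not specified"  # assumed wording, not from the skill


def render_issue_body(detail: dict, scrape_date: str) -> str:
    """Fill the issue template from a detail record (hypothetical helper)."""
    # Drop null fields so they fall through to the placeholder.
    values = _WithFallback({k: v for k, v in detail.items() if v is not None})
    values["scrape_date"] = scrape_date
    return ISSUE_TEMPLATE.format_map(values)
```

The resulting string can be passed to `gh issue create --body`.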

### Labels

| Label | Description | Auto-applied |
|-------|-------------|--------------|
| `competition` | All competition issues | Always |
| `25wol` | 25 words or less type | Always |
| `for-kids` | Auto-filtered (kids competitions) | When keyword matches |
| `cruise` | Auto-filtered (cruise competitions) | When keyword matches |
| `closing-soon` | Closes within 3 days | By separate check |
| `entry-drafted` | Entry has been composed | By comp-scout-compose |
| `entry-submitted` | Entry has been submitted | Manually |

### Milestones

Issues are assigned to milestones by closing date month:
- "December 2024"
- "January 2025"
- etc.

```bash
# Create milestone if needed (due_on is an ISO 8601 timestamp, e.g. 2024-12-31T23:59:59Z)
gh api "repos/$TARGET_REPO/milestones" \
  --method POST \
  --field title="$MONTH_YEAR" \
  --field due_on="$LAST_DAY_OF_MONTH"

# Assign to issue
gh issue edit $ISSUE_NUMBER -R "$TARGET_REPO" --milestone "$MONTH_YEAR"
```
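
The values for `$MONTH_YEAR` and `$LAST_DAY_OF_MONTH` can be derived from a competition's `closing_date`. The sketch below shows one way to do that; the `milestone_for` helper is hypothetical, and the end-of-day UTC time used for `due_on` is an assumption.

```python
import calendar
from datetime import date

MONTH_NAMES = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def milestone_for(closing_date: str) -> tuple:
    """Map a YYYY-MM-DD closing date to (milestone title, due_on timestamp)."""
    closes = date.fromisoformat(closing_date)
    # monthrange returns (weekday of the 1st, number of days in the month).
    last_day = calendar.monthrange(closes.year, closes.month)[1]
    title = f"{MONTH_NAMES[closes.month - 1]} {closes.year}"
    due_on = f"{closes.year:04d}-{closes.month:02d}-{last_day:02d}T23:59:59Z"
    return title, due_on
```

For example, a closing date of `2024-12-31` maps to the milestone title "December 2024", matching the naming shown above.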

### Duplicate Comment Template

```markdown
### Also found on {other_site}

**URL:** {url}
**Title on this site:** {title}
*Discovered: {date}*
```

### Filtered Issue Handling

When a competition matches filter keywords:
1. Issue is created (for record-keeping)
2. Filter label is applied (e.g., `for-kids`)
3. Issue is immediately closed with explanation

```bash
gh issue close $ISSUE_NUMBER -R "$TARGET_REPO" \
  --comment "Auto-filtered: matches '$KEYWORD' in $FILTER_RULE preferences."
```

## Integration

This skill is invoked by `comp-scout-daily` as the first step in the workflow.

After scraping, you can:
- Use **comp-scout-analyze** to generate entry strategies
- Use **comp-scout-compose** to write actual entries
- Both will auto-persist their results as comments on the issue