---
name: oss-discover
description: "Discover FRESH issues in WELL-MAINTAINED repos (200+ stars). Merge-optimized: 60% easy wins (docs, typos, tests) + 40% bug fixes. Target agentic AI repos by CRITERIA (topic:llm/agent/rag + stars:>200). Verify repo health before queuing."
user-invocable: true
---

# OSS Issue Discovery (Merge-Optimized)

Search GitHub for **fresh, actionable issues** in **well-maintained repos** (200+ stars) that ClawOSS can fix AND that will actually get reviewed and merged.

## Philosophy

Our goal is MERGED contributions, not submitted PRs. 50 unreviewed PRs = 0 impact.
**A merged typo fix > an unreviewed bug fix.** We optimize for merge rate.

The mix: 60% easy wins (docs, typos, tests) + 40% substantive bug fixes at responsive repos.

## Date Calculation

Before running queries, compute the date cutoffs:

```bash
# macOS (BSD date)
THREE_DAYS_AGO=$(date -v-3d +%Y-%m-%d)
TWO_WEEKS_AGO=$(date -v-14d +%Y-%m-%d)
# Linux (GNU date):
# THREE_DAYS_AGO=$(date -d "3 days ago" +%Y-%m-%d)
# TWO_WEEKS_AGO=$(date -d "14 days ago" +%Y-%m-%d)
```

Bug queries use `created:>$THREE_DAYS_AGO`. Easy-win queries extend to 2 weeks. Issues older than 1 month are SKIPPED entirely.

## Pre-Checks (before ANY query)

1. Read `memory/pr-ledger.md` and SKIP issues already attempted, superseded, or assigned.
2. For each candidate issue, quick-check supersession before scoring:
   - `gh api "repos/{owner}/{repo}/issues/{number}" --jq '{assignees: (.assignees | length), linked_prs: 0}'`
   - If the issue has assignees > 0, SKIP (assigned to someone else).
   - Check the issue timeline for linked PRs: if open PRs exist, SKIP (already being worked on).
   - Mark skipped issues as `superseded` or `assigned` in pr-ledger.md.

## Trust-Building Strategy (CRITICAL for merge rate)

**Depth over breadth.** 3 merged PRs at one repo > 30 unreviewed PRs across 30 repos.

1. **Check memory/trust-repos.md FIRST**: search for new issues in trusted repos before broad queries.
2. **Return to winners**: If a repo merged our PR, search it for new issues immediately.
3. **Prefer trusted repos** but no hard cap on new repo discovery.
4. **Abandon losers**: If a repo closed our PR without review within 24h, skip for 30 days.

Trusted repos get **+8 bonus** in scoring. This is the single biggest lever for merge rate.

## Process

1. **FIRST**: Search trusted repos (memory/trust-repos.md) for fresh issues. These are highest priority.
2. Run Priority Queries (Tier 0 first, then 1, then 2) for new repo discovery.
3. Filter: stars >= 200, not in pr-ledger, created within time window.
4. **Repo health pre-filter** (BEFORE scoring): quick-check via `/Users/kevinlin/clawOSS/scripts/repo-health-check.sh` or `gh api`. SKIP repos that fail.
5. Score: merge probability (most important), recency, fix feasibility, repo health. Minimum score 5. **+8 trusted repo bonus.**
6. Return ranked top 10. Write full list to memory/today.md.

## Discovery Niches (rotate through ALL; the AI niche is saturated)

Diversify targets across the full open-source ecosystem. Do NOT camp on the same 10 AI repos.

### Niche 1: Agentic AI (familiar territory)

- **Topics**: `topic:llm`, `topic:agent`, `topic:rag`, `topic:ai`, `topic:machine-learning`
- **Combined with**: `stars:>200`, `label:bug` or `label:help-wanted`

### Niche 2: Developer Tools & CLIs

- **Topics**: `topic:cli`, `topic:devtools`, `topic:developer-tools`, `topic:terminal`, `topic:editor`
- Many responsive maintainers, fast review cycles

### Niche 3: Web Frameworks & Libraries

- **Topics**: `topic:web-framework`, `topic:nextjs`, `topic:fastapi`, `topic:django`, `topic:flask`, `topic:express`
- High star counts, active communities

### Niche 4: Databases & Storage

- **Topics**: `topic:database`, `topic:sql`, `topic:nosql`, `topic:vector-database`, `topic:redis`
- Well-maintained, clear bug reports

### Niche 5: Cloud-Native & Infrastructure

- **Topics**: `topic:kubernetes`, `topic:docker`, `topic:cloud-native`, `topic:infrastructure`
- Massive ecosystem, always needs docs fixes

### Niche 6: Testing & Code Quality

- **Topics**: `topic:testing`, `topic:linting`, `topic:code-quality`, `topic:formatter`
- Maintainers are meticulous, so match their quality

### Niche 7: Data Engineering

- **Topics**: `topic:data-pipeline`, `topic:etl`, `topic:data-engineering`, `topic:streaming`
- Growing ecosystem, responsive maintainers

### How to Discover

Search GitHub using topic tags and description keywords, rotating through niches each cycle:

- **Combined with**: `stars:>200`, `label:bug` or `label:help-wanted`, `created:>$THREE_DAYS_AGO`
- Always verify repo health before queuing, since new discoveries haven't been vetted yet
- Search across ALL languages: Python, TypeScript, Go, Rust, Java

### Known High-Value Repos (supplement, not replace, criteria search)

These are verified high-star, actively-maintained repos in our niche. The agent should discover more autonomously. Always run `/Users/kevinlin/clawOSS/scripts/repo-health-check.sh` before targeting; this list is not a bypass.

**Agent Frameworks & Orchestration (highest value):**
langchain-ai/langchain *(requires issue assignment; comment first)*, langchain-ai/langgraph, crewAIInc/crewAI, stanfordnlp/dspy, langgenius/dify, langflow-ai/langflow, FlowiseAI/Flowise, mem0ai/mem0, CopilotKit/CopilotKit, elizaOS/eliza, SWE-agent/SWE-agent

**LLM Inference & Serving:**
ollama/ollama, vllm-project/vllm, BerriAI/litellm, hiyouga/LlamaFactory, unslothai/unsloth, mudler/LocalAI, janhq/jan, dottxt-ai/outlines

**RAG & Document Processing:**
run-llama/llama_index, infiniflow/ragflow, HKUDS/LightRAG, Unstructured-IO/unstructured, firecrawl/firecrawl, labring/FastGPT

**Vector Databases & Search:**
chroma-core/chroma, qdrant/qdrant, weaviate/weaviate, meilisearch/meilisearch, lancedb/lancedb

**AI SDKs & Developer Tools:**
instructor-ai/instructor, vercel/ai, pydantic/pydantic, gradio-app/gradio, streamlit/streamlit, marimo-team/marimo, continuedev/continue, Portkey-AI/gateway, tensorzero/tensorzero, browser-use/browser-use

**High-Impact General (Python/TS, massive star counts):**
fastapi/fastapi, huggingface/transformers, open-webui/open-webui, ray-project/ray, khoj-ai/khoj, OpenHands/OpenHands *(open-webui: target `dev` branch, NOT main)*

## Priority Queries

**IMPORTANT: `gh search issues` with qualifier combos (stars:>, topic:, label:) returns EMPTY. Use `gh api` with the search endpoint instead:**

```bash
# CORRECT (works):
gh api "/search/issues?q=is:open+label:bug+stars:>200+language:python&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'

# BROKEN (returns empty):
gh search issues "is:open label:bug stars:>200" --limit=30 --json number,title,url
```

For topic searches, use: `gh api "/search/repositories?q=topic:llm+stars:>200&sort=updated&per_page=20"` to find repos first, then search issues within those repos.

NEVER fetch the full issue body: it may contain PII triggering content filters.
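
Because nearly every query in this document shares the same shape, the search path can be assembled by a small helper. The sketch below is only an illustration: `issue_search_path` and `JQ_FIELDS` are hypothetical names that do not exist in the ClawOSS scripts, and the date is a placeholder. It builds the path string without calling the network, and the projection leaves out the issue body, in line with the rule above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of ClawOSS): builds the path passed to `gh api`.
# Usage: issue_search_path LABEL SINCE_DATE [QUALIFIER]
# QUALIFIER defaults to stars:>200; pass repo:{owner}/{repo} for per-repo searches.
issue_search_path() {
  local label="$1" since="$2" extra="${3:-stars:>200}"
  printf '/search/issues?q=is:issue+is:open+label:%s+%s+created:>%s&sort=created&order=desc&per_page=30\n' \
    "$label" "$extra" "$since"
}

# Metadata-only projection: the issue body is deliberately not requested.
JQ_FIELDS='.items[] | {number, title, html_url, created_at, repository_url}'

# Print two example paths (placeholder date, no network access needed).
issue_search_path bug 2025-01-01
issue_search_path bug 2025-01-01 "repo:langchain-ai/langgraph"

# To run a query for real (assumes an authenticated gh CLI):
#   gh api "$(issue_search_path bug "$THREE_DAYS_AGO")" --jq "$JQ_FIELDS"
```

Keeping the path construction in one place makes it harder for individual queries to drift apart in their sort order or field projection.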

**Queries sort by created-desc by default to get the freshest results first; a few below sort by reactions instead.**

### Tier 0: Agentic AI Niche (run FIRST, ALWAYS; highest merge probability)

**Criteria-based broad searches (primary discovery method; run ALL):**

NOTE: `gh search issues` with qualifier combos silently returns EMPTY. Use `gh api` instead.
For topic-based searches, first find repos, then search issues within them:

```bash
# Step 1: Find repos by topic (returns repo full_names)
gh api "/search/repositories?q=topic:llm+stars:>200&sort=updated&per_page=20" --jq '.items[].full_name'
gh api "/search/repositories?q=topic:agent+stars:>200&sort=updated&per_page=20" --jq '.items[].full_name'
gh api "/search/repositories?q=topic:rag+stars:>200&sort=updated&per_page=20" --jq '.items[].full_name'
gh api "/search/repositories?q=topic:ai+stars:>200&sort=updated&per_page=20" --jq '.items[].full_name'
gh api "/search/repositories?q=topic:machine-learning+stars:>200&sort=updated&per_page=20" --jq '.items[].full_name'
gh api "/search/repositories?q=topic:generative-ai+stars:>200&sort=updated&per_page=20" --jq '.items[].full_name'
gh api "/search/repositories?q=topic:vector-database+stars:>200&sort=updated&per_page=20" --jq '.items[].full_name'

# Step 2: For each repo, search for bug issues
gh api "/search/issues?q=is:issue+is:open+label:bug+repo:{owner}/{repo}+created:>$THREE_DAYS_AGO&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'

# Direct issue searches (bugs in high-star repos; works without topic qualifier)
gh api "/search/issues?q=is:issue+is:open+label:bug+stars:>200+language:python+created:>$THREE_DAYS_AGO&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=is:issue+is:open+label:bug+stars:>200+language:typescript+created:>$THREE_DAYS_AGO&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'

# Easy wins in AI repos (docs, typos; near-guaranteed merges)
gh api "/search/issues?q=is:issue+is:open+label:documentation+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=20" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=is:issue+is:open+label:typo+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=20" --jq '.items[] | {number, title, html_url, created_at, repository_url}'

# Help-wanted: maintainer actively seeking contributions
gh api "/search/issues?q=is:issue+is:open+label:help-wanted+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=is:issue+is:open+label:good-first-issue+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
```

**Tier 0 candidates get +5 niche bonus in scoring.** Always process Tier 0 results before Tier 1. Always verify repo health before adding to queue.

### Tier 1: High-Star Repos with Easy Issues (highest merge probability)

#### 1a. Good-First-Issue + Help-Wanted (maintainer-requested; near-guaranteed merge)

```bash
gh api "/search/issues?q=is:issue+is:open+label:good-first-issue+label:bug+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=is:issue+is:open+label:help-wanted+label:bug+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=is:issue+is:open+label:good-first-issue+stars:>1000+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=is:issue+is:open+label:help-wanted+stars:>1000+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
```

#### 1b. Documentation + Typo Issues (easy wins; highest merge rate)

```bash
gh api "/search/issues?q=is:issue+is:open+label:documentation+stars:>1000+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=20" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=is:issue+is:open+label:typo+stars:>200+created:>$TWO_WEEKS_AGO&sort=reactions-%2B1&order=desc&per_page=20" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=is:issue+is:open+label:docs+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=20" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
```

#### 1c. Fresh Bug Reports (last 3 days; first responder advantage)

```bash
gh api "/search/issues?q=is:issue+is:open+label:bug+stars:>200+created:>$THREE_DAYS_AGO&sort=created&order=desc&per_page=50" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=is:issue+is:open+label:defect+stars:>200+created:>$THREE_DAYS_AGO&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=is:issue+is:open+label:regression+stars:>200+created:>$THREE_DAYS_AGO&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=is:issue+is:open+label:crash+stars:>200+created:>$THREE_DAYS_AGO&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
```

#### 1d. Community-Prioritized (high reactions = maintainer attention)

```bash
gh api "/search/issues?q=is:issue+is:open+label:bug+stars:>200+created:>$TWO_WEEKS_AGO&sort=reactions-%2B1&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
```

### Tier 2: General Searches (run if Tier 0+1 yield < 10 candidates)

#### 2a. Recent bugs (last 2 weeks)

```bash
gh api "/search/issues?q=is:issue+is:open+label:bug+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=50" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
```

#### 2b. Error keyword search

```bash
gh api "/search/issues?q=crash+is:issue+is:open+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=20" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=TypeError+is:issue+is:open+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=20" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=NullPointer+is:issue+is:open+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=20" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=exception+is:issue+is:open+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=20" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=regression+is:issue+is:open+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=20" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
```

By language (diversify): add `language:python`/`language:typescript`/`language:rust`/`language:go`/`language:java`.

## Repo Health Pre-Filter (lightweight; use judgment, not just scripts)

For each candidate repo, do a quick check using `gh api repos/{owner}/{repo}`:

1. **Stars >= 100**: skip if very low-star. Use judgment for 100-200 range.
2. **Not archived**: skip archived repos
3. **Recent push**: skip if no push in 30 days
4. **Not forking-disabled**: can't submit PRs if forking is disabled
5. **Check our open PRs**: review existing open PRs for awareness (no hard cap)
6. **Anti-bot check**: if you've seen "no bot PRs" or "no AI" in CONTRIBUTING.md from a previous visit, skip
7. **CLA repos**: Note the CLA requirement but don't attempt signing, because CLAs require manual signing by the account owner

You CAN use `/Users/kevinlin/clawOSS/scripts/repo-health-check.sh` for a thorough check, but it's NOT required for every repo.
Use your judgment: a quick `gh api` call is often enough. If a repo fails, skip all issues from it. Cache the result in `memory/repos/`.

## SKIP Labels (never pick these for bug contributions)

- `enhancement`, `feature`, `feature-request`, `improvement`, `refactor`, `discussion`, `question`, `proposal`, `rfc`, `design`, `meta`, `chore`, `performance`, `optimization`

Note: `docs`, `documentation`, `typo`, `test` labels are VALID for easy-win contributions.
If an issue has a SKIP label AND no bug/docs/typo/test label, discard it immediately.

## Age Limits (hard cutoffs)

- **< 3 days old**: Top priority. These are fresh and hot
- **3-14 days old**: Acceptable. Still recent enough
- **14-30 days old**: Low priority. Only pick if exceptionally clear and simple
- **> 30 days old**: SKIP ENTIRELY. Too old, and likely stale for a reason

## Scoring (Merge Probability + Recency + Repo Health)

Score each candidate 1-25 based on:

### Contribution Type (merge probability; most important factor)

- **+5** Documentation/typo fix (near-guaranteed merge)
- **+3** Test addition (high merge rate)
- **+2** Bug fix with `good-first-issue`/`help-wanted` label (maintainer wants it fixed)
- **+1** Bug fix (standard)

### Recency

- **+5** Created in the last 3 days (fresh; top priority)
- **+2** Created 3-7 days ago (recent)
- **+0** Created 7-14 days ago (acceptable)
- **-3** Created 14-30 days ago (getting stale; low priority)
- **SKIP** Created > 30 days ago

### Trust Signal (MOST impactful; depth over breadth)

- **+8** Repo is in memory/trust-repos.md (we've had successful interactions before)
- **+5** Repo merged a previous PR from us (check pr-ledger.md)
- **+3** Repo engaged positively with a previous PR (approved, constructive feedback)
- **-5** Repo closed our PR without review in < 24h (check pr-ledger.md)

### Repo Quality

- **+3** Repo has 5000+ stars (high-impact)
- **+2** Repo has 1000+ stars (solid)
- **+1** Repo has 200-1000 stars
- **+2** Repo is in a niche where we've had merges before

### Repo Health (merge velocity)

- **+5** Repo avg merge time < 3 days (fast reviewers)
- **+3** Repo avg merge time < 7 days (responsive)
- **+0** Repo avg merge time < 14 days (acceptable)
- **-5** Repo avg merge time > 14 days (low merge chance; SKIP)
- **+3** Repo review rate > 80% (very responsive)
- **+2** Has `good-first-issue`/`help-wanted` labels (seeking contributions)
- **+1** Repo has < 10 open PRs (less competition)

### Bug Signals (for bug-type contributions)

- **+3** Has `bug`, `defect`, `regression`, or `crash` label
- **+2** Title contains error keywords (crash, error, broken, fails, exception, TypeError)
- **+2** Has stack trace or reproduction steps
- **+1** Has maintainer engagement

### Negative Signals

- **-3** Has `enhancement`, `feature`, `refactor`, or `improvement` label
- **-2** Title suggests new feature (word boundary match)
- **-2** Issue is vague or lacks specifics
- **-5** Repo has 0 merged PRs in last 30 days

Minimum score 5 to enter work queue.

### P(merge): Merge Probability Score (0-100)

**Hard gates (P=0, skip immediately, BEFORE scoring):**

- Repo in blocklist → P=0
- Stars < 200 → P=0
- Anti-AI/anti-bot policy → P=0
- Issue > 30 days old → P=0
- Repo health gate failed → P=0

Only compute P(merge) for issues that pass ALL hard gates and reach the minimum quality score of 5 (on the 1-25 scale).

```
P(merge) =
  + 15 * task_type_score      # docs/typo=1.0, test=0.75, bug=0.5, feature=0
  + 20 * size_score           # estimated: <30 LOC=1.0, 30-100=0.7, 100-200=0.3, >200=0
  + 15 * repo_responsiveness  # merge<3d=1.0, 3-7d=0.7, 7-14d=0.3, >14d=0
  + 25 * trust_score          # merged before=1.0, positive engagement=0.7, new=0.3, hostile=0
  + 10 * freshness            # <1d=1.0, 1-3d=0.8, 3-7d=0.5, 7-14d=0.2, >14d=0
  + 10 * contributor_fit      # help-wanted=1.0, good-first-issue=0.8, bug=0.5, none=0.3
  +  5 * competition_score    # no other PRs=1.0, 1 competing=0.3, 2+=0
```

**Threshold**: P(merge) >= 30 to enter work queue. Below 30 is not worth API cost. Sort work queue by P(merge) descending.
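
As a worked illustration of the weighted sum above, here is a minimal sketch. The function name `p_merge` and its argument order are hypothetical, and it uses `awk` for the arithmetic because plain shell arithmetic is integer-only. The weights are copied from the formula and add up to 100, so the result stays within 0-100.

```bash
#!/usr/bin/env bash
# Hypothetical helper: computes P(merge) from seven component scores, each in 0..1.
# Argument order (an assumption of this sketch):
#   task_type size responsiveness trust freshness contributor_fit competition
p_merge() {
  awk -v t="$1" -v s="$2" -v r="$3" -v tr="$4" -v f="$5" -v c="$6" -v k="$7" \
    'BEGIN { printf "%.1f\n", 15*t + 20*s + 15*r + 25*tr + 10*f + 10*c + 5*k }'
}

# Docs fix under 30 LOC, repo merges in < 3 days, new repo (0.3),
# issue 1-3 days old (0.8), good-first-issue (0.8), no competing PRs:
p_merge 1.0 1.0 1.0 0.3 0.8 0.8 1.0   # 15 + 20 + 15 + 7.5 + 8 + 8 + 5 = 78.5

# Plain bug fix of 30-100 LOC, repo merges in 7-14 days, new repo,
# issue 7-14 days old, bug label only, one competing PR:
p_merge 0.5 0.7 0.3 0.3 0.2 0.5 0.3   # 7.5 + 14 + 4.5 + 7.5 + 2 + 5 + 1.5 = 42.0
```

The first candidate clears the 30 threshold comfortably, while the second passes with far less margin, which is the ordering the queue sort is meant to capture.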
Include P(merge) in candidate output. Issues with P(merge) >= 60 are marked `priority: high` for faster spawning.

## Title Keyword Hard Reject (apply to EVERY candidate, no exceptions)

**Auto-SKIP if the issue title matches ANY keyword as a WHOLE WORD (case-insensitive, word boundary `\b{keyword}\b`):**

`add`, `extend`, `enable`, `improve`, `enhance`, `new feature`, `request`, `implement`, `support`, `introduce`, `create`, `propose`, `migrate`, `upgrade`, `refactor`, `redesign`, `optimize`, `allow`, `provide`

**WORD BOUNDARY matching only. Do NOT match substrings.**

- "Add dark mode" -> matches `add` -> SKIP
- "Unsupported operation crashes" -> does NOT match `support` -> KEEP
- "Provider connection fails" -> does NOT match `provide` -> KEEP
- "Additional logging breaks startup" -> does NOT match `add` -> KEEP

**This is a HARD GATE applied BEFORE scoring.**

## Filters

- **Title keyword hard reject: applied first, before any other filter**
- **Repo health pre-filter: applied second, before scoring**
- Stars >= 200, recent commits (<2wk), not archived, max 3 issues per repo
- Skip if in pr-ledger.md.
- **MUST be created within the last 30 days**: skip anything older

## Fast Mode (queue < 5 or empty slots)

Run 3+ parallel searches, score quickly, write 10-20 items immediately. Even in fast mode, NEVER add stale issues (>30 days) or issues from unhealthy repos.
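
The whole-word rule from the Title Keyword Hard Reject section can be sketched as a small shell gate. This is one possible implementation, not an existing ClawOSS script: `title_gate` and `REJECT_KEYWORDS` are hypothetical names. Instead of relying on `\b`, it spells out the word boundary as "start of line or a non-word character", so the same pattern should behave the same under GNU and BSD `grep`.

```bash
#!/usr/bin/env bash
# Keywords copied from the hard-reject list; "new feature" is a two-word phrase.
REJECT_KEYWORDS='add|extend|enable|improve|enhance|new feature|request|implement|support|introduce|create|propose|migrate|upgrade|refactor|redesign|optimize|allow|provide'

# Hypothetical helper: prints SKIP if the title contains any keyword as a
# whole word (case-insensitive), otherwise prints KEEP.
title_gate() {
  local title="$1"
  # A keyword must be preceded by start-of-line or a non-word character and
  # followed by a non-word character or end-of-line.
  if printf '%s\n' "$title" |
     grep -qiE "(^|[^[:alnum:]_])(${REJECT_KEYWORDS})([^[:alnum:]_]|\$)"; then
    echo "SKIP"
  else
    echo "KEEP"
  fi
}

# The four examples from the section above:
title_gate "Add dark mode"                      # SKIP
title_gate "Unsupported operation crashes"      # KEEP
title_gate "Provider connection fails"          # KEEP
title_gate "Additional logging breaks startup"  # KEEP
```

Running the gate before any `gh api` health check keeps it cheap, which matches its position as the first filter.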