---
name: tabstack-extractor
description: Extract structured data from websites using Tabstack API. Use when you need to scrape job listings, news articles, product pages, or any structured web content. Provides JSON schema-based extraction and clean markdown conversion. Requires TABSTACK_API_KEY environment variable.
---

# Tabstack Extractor

## Overview

This skill enables structured data extraction from websites using the Tabstack API. It's ideal for web scraping tasks where you need consistent, schema-based data extraction from job boards, news sites, product pages, or any structured content.

## Quick Start

### 1. Install Babashka (if needed)

```bash
# Option A: From GitHub (recommended for sharing)
curl -s https://raw.githubusercontent.com/babashka/babashka/master/install | bash

# Option B: From Nix
nix-shell -p babashka

# Option C: From Homebrew
brew install borkdude/brew/babashka
```

### 2. Set up API Key

**Option A: Environment variable (recommended)**

```bash
export TABSTACK_API_KEY="your_api_key_here"
```

**Option B: Configuration file**

```bash
mkdir -p ~/.config/tabstack
echo '{:api-key "your_api_key_here"}' > ~/.config/tabstack/config.edn
```

**Get an API key:** Sign up at [Tabstack Console](https://console.tabstack.ai/signup)

### 3. Test Connection

```bash
bb scripts/tabstack.clj test
```

### 4. Extract Markdown (Simple)

```bash
bb scripts/tabstack.clj markdown "https://example.com"
```

### 5. Extract JSON (Start Simple)

```bash
# Start with simple schema (fast, reliable)
bb scripts/tabstack.clj json "https://example.com" references/simple_article.json

# Try more complex schemas (may be slower)
bb scripts/tabstack.clj json "https://news.site" references/news_schema.json
```

### 6. Advanced Features

```bash
# Extract with retry logic (3 retries, 1s delay)
bb scripts/tabstack.clj json-retry "https://example.com" references/simple_article.json

# Extract with caching (24-hour cache)
bb scripts/tabstack.clj json-cache "https://example.com" references/simple_article.json

# Batch extract from URLs file
echo "https://example.com" > urls.txt
echo "https://example.org" >> urls.txt
bb scripts/tabstack.clj batch urls.txt references/simple_article.json
```

## Core Capabilities

### 1. Markdown Extraction

Extract clean, readable markdown from any webpage. Useful for content analysis, summarization, or archiving.

**When to use:** When you need the textual content of a page without the HTML clutter.

**Example use cases:**
- Extract article content for summarization
- Archive webpage content
- Analyze blog post content

### 2. JSON Schema Extraction

Extract structured data using JSON schemas. Define exactly what data you want and get it in a consistent format.

**When to use:** When scraping job listings, product pages, news articles, or any structured data.

**Example use cases:**
- Scrape job listings from BuiltIn/LinkedIn
- Extract product details from e-commerce sites
- Gather news articles with consistent metadata

### 3. Schema Templates

Pre-built schemas for common scraping tasks. See `references/` directory for templates.

**Available schemas:**
- Job listing schema (see `references/job_schema.json`)
- News article schema
- Product page schema
- Contact information schema

## Workflow: Job Scraping Example

Follow this workflow to scrape job listings:

1. **Identify target sites** - BuiltIn, LinkedIn, company career pages
2. **Choose or create schema** - Use `references/job_schema.json` or customize
3. **Test extraction** - Run a single page to verify schema works
4. **Scale up** - Process multiple URLs
5. **Store results** - Save to database or file

**Example job schema:**

```json
{
  "type": "object",
  "properties": {
    "title": {"type": "string"},
    "company": {"type": "string"},
    "location": {"type": "string"},
    "description": {"type": "string"},
    "salary": {"type": "string"},
    "apply_url": {"type": "string"},
    "posted_date": {"type": "string"},
    "requirements": {"type": "array", "items": {"type": "string"}}
  }
}
```

## Integration with Other Skills

### Combine with Web Search

1. Use `web_search` to find relevant URLs
2. Use Tabstack to extract structured data from those URLs
3. Store results in Datalevin (future skill)

### Combine with Browser Automation

1. Use `browser` tool to navigate complex sites
2. Extract page URLs
3. Use Tabstack for structured extraction

## Error Handling

Common issues and solutions:

1. **Authentication failed** - Check `TABSTACK_API_KEY` environment variable
2. **Invalid URL** - Ensure URL is accessible and correct
3. **Schema mismatch** - Adjust schema to match page structure
4. **Rate limiting** - Add delays between requests

## Resources

### scripts/

- `tabstack.clj` - **Main API wrapper in Babashka** (recommended, has retry logic, caching, batch processing)
- `tabstack_curl.sh` - Bash/curl fallback (simple, no dependencies)
- `tabstack_api.py` - Python API wrapper (requires requests module)

### references/

- `job_schema.json` - Template schema for job listings
- `api_reference.md` - Tabstack API documentation

## Best Practices

1. **Start small** - Test with single pages before scaling
2. **Respect robots.txt** - Check site scraping policies
3. **Add delays** - Avoid overwhelming target sites
4. **Validate schemas** - Test schemas on sample pages
5. **Handle errors gracefully** - Implement retry logic for failed requests

## Teaching Focus: How to Create Schemas

This skill is designed to teach agents how to use Tabstack API effectively. The key is learning to create appropriate JSON schemas for different websites.

### Learning Path

1. **Start Simple** - Use `references/simple_article.json` (4 basic fields)
2. **Test Extensively** - Try schemas on multiple page types
3. **Iterate** - Add fields based on what the page actually contains
4. **Optimize** - Remove unnecessary fields for speed

See [Schema Creation Guide](references/schema_guide.md) for detailed instructions and examples.

### Common Mistakes to Avoid

- **Over-complex schemas** - Start with 2-3 fields, not 20
- **Missing fields** - Don't require fields that don't exist on the page
- **No testing** - Always test with example.com first, then target sites
- **Ignoring timeouts** - Complex schemas take longer (45s timeout)

## Babashka Advantages

Using Babashka for this skill provides:

1. **Single binary** - Easy to share/install (GitHub releases, brew, nix)
2. **Fast startup** - No JVM warmup, ~50ms startup time
3. **Built-in HTTP client** - No external dependencies
4. **Clojure syntax** - Familiar to you (Wes), expressive
5. **Retry logic & caching** - Built into the skill
6. **Batch processing** - Parallel extraction for multiple URLs

## Example User Requests

**For this skill to trigger:**
- "Scrape job listings from Docker careers page"
- "Extract the main content from this article"
- "Get structured product data from this e-commerce page"
- "Pull all the news articles from this site"
- "Extract contact information from this company page"
- "Batch extract job listings from these 20 URLs"
- "Get cached results for this page (avoid API calls)"
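
## Sketch: Trimming a Schema to a Starter Set

The advice to start with 2-3 fields and add more only once extraction works can be sketched as a small helper that cuts a larger schema down to a starter subset. This is an illustrative sketch, not part of the skill's scripts: the function name `subset_schema` and the output file name `starter_schema.json` are hypothetical, and the schema is copied from the job schema example in this document.

```python
import copy
import json

# Job schema copied from the "Example job schema" in this document.
JOB_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "company": {"type": "string"},
        "location": {"type": "string"},
        "description": {"type": "string"},
        "salary": {"type": "string"},
        "apply_url": {"type": "string"},
        "posted_date": {"type": "string"},
        "requirements": {"type": "array", "items": {"type": "string"}},
    },
}


def subset_schema(schema, fields):
    """Return a copy of an object schema that keeps only the named properties.

    Hypothetical helper for illustration; it does not call the Tabstack API.
    """
    properties = schema.get("properties", {})
    missing = [name for name in fields if name not in properties]
    if missing:
        raise KeyError(f"fields not in schema: {missing}")

    trimmed = copy.deepcopy(schema)
    trimmed["properties"] = {name: trimmed["properties"][name] for name in fields}
    # If the source schema lists required fields, keep only those that survive.
    if "required" in trimmed:
        trimmed["required"] = [n for n in trimmed["required"] if n in fields]
    return trimmed


if __name__ == "__main__":
    starter = subset_schema(JOB_SCHEMA, ["title", "company", "location"])
    # Hypothetical output path; pass it as the schema argument to the json command.
    with open("starter_schema.json", "w", encoding="utf-8") as handle:
        json.dump(starter, handle, indent=2)
```

The resulting file can then be tried on a single page with `bb scripts/tabstack.clj json "https://example.com" starter_schema.json` before adding fields back one at a time.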