--- name: citedy-content-ingestion title: "Content Ingestion" description: > Turn any URL into structured content — YouTube videos (via Gemini Video API), web articles, PDFs, and audio files. Extract transcripts, summaries, and metadata for use in any LLM pipeline. Powered by Citedy. version: "1.0.0" author: Citedy tags: - content-ingestion - youtube - transcription - pdf - audio - web-scraping - data-extraction metadata: openclaw: requires: env: - CITEDY_API_KEY primaryEnv: CITEDY_API_KEY compatible_with: "citedy-seo-agent@3.2.0" privacy_policy_url: https://www.citedy.com/privacy security_notes: | API keys (prefixed citedy_agent_) authenticate against Citedy API endpoints only. All traffic is TLS-encrypted. Keys can be revoked from dashboard. --- # Content Ingestion — Skill Instructions **Connection:** REST API over HTTPS **Base URL:** `https://www.citedy.com` **Auth:** `Authorization: Bearer $CITEDY_API_KEY` --- ## Overview Turn any URL into structured content your agent can use. Pass a link — the skill extracts the full text, transcript, metadata, and summary — and returns it as clean structured data ready for your LLM pipeline. Supported content types: - **YouTube videos** — full transcription via Gemini Video API (not just captions) - **Web articles** — clean article text with metadata - **PDF documents** — text extraction from public PDF URLs - **Audio files** — transcription from MP3/WAV/M4A files Differentiator: YouTube ingestion uses the Gemini Video API for deep video understanding — it goes beyond auto-generated captions, capturing speaker intent, visual context, and structure. Use this skill as a standalone input node for any LLM pipeline. Feed the output directly into summarization, Q&A, article generation, or knowledge base indexing. --- ## When to Use Use this skill when the user: - Asks to extract, transcribe, or summarize a URL - Shares a YouTube video and wants the content analyzed or repurposed - Shares a PDF link and wants the text extracted - Wants to ingest audio content for transcription - Is building a pipeline that needs to pull content from the web --- ## Instructions ### Setup To authenticate, register your agent and obtain an API key: 1. **Register your agent:** ``` POST https://www.citedy.com/api/agent/register Content-Type: application/json { "agent_name": "My Agent" } ``` 2. **Approve the registration** — you will receive an approval URL in the response. Open it in your browser to confirm. 3. **Save your API key** — after approval, the key (prefixed `citedy_agent_`) is returned. Store it as `CITEDY_API_KEY` in your environment. All subsequent requests use `Authorization: Bearer $CITEDY_API_KEY`. --- ## Core Workflow ### Single URL Ingestion **Step 1 — Submit URL:** ``` POST /api/agent/ingest Authorization: Bearer $CITEDY_API_KEY Content-Type: application/json { "url": "https://www.youtube.com/watch?v=example" } ``` Returns `202 Accepted` with: ```json { "id": "job_abc123", "status": "processing", "poll_url": "/api/agent/ingest/job_abc123" } ``` If the URL was already ingested (cache hit), returns `200 OK` with `"cached": true` — costs 1 credit. **Step 2 — Poll for completion:** ``` GET /api/agent/ingest/{id} ``` Returns current status: `processing`, `completed`, or `failed`. Poll every 5–15 seconds. No credit cost. **Step 3 — Retrieve content:** ``` GET /api/agent/ingest/{id}/content ``` Returns the full extracted content, transcript, and metadata. No credit cost. --- ### Batch Ingestion Submit up to 20 URLs in a single request: ``` POST /api/agent/ingest/batch Authorization: Bearer $CITEDY_API_KEY Content-Type: application/json { "urls": [ "https://example.com/article", "https://www.youtube.com/watch?v=abc", "https://example.com/doc.pdf" ], "callback_url": "https://your-service.com/webhook" // optional } ``` Returns an array of job IDs. If `callback_url` is provided, a POST request is sent to it when all jobs complete. --- ### List Jobs ``` GET /api/agent/ingest?status=completed&limit=20&offset=0 ``` Filter by status, paginate with limit/offset. --- ## Examples ### Example 1 — YouTube Video **User:** "Transcribe this YouTube video: https://www.youtube.com/watch?v=dQw4w9WgXcQ" ```bash # Step 1: Submit curl -X POST https://www.citedy.com/api/agent/ingest \ -H "Authorization: Bearer $CITEDY_API_KEY" \ -H "Content-Type: application/json" \ -d '{"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}' # Step 2: Poll curl https://www.citedy.com/api/agent/ingest/job_abc123 \ -H "Authorization: Bearer $CITEDY_API_KEY" # Step 3: Get content curl https://www.citedy.com/api/agent/ingest/job_abc123/content \ -H "Authorization: Bearer $CITEDY_API_KEY" ``` Response includes full transcript, video title, duration, and chapter breakdown. --- ### Example 2 — Web Article **User:** "Extract the main content from https://techcrunch.com/2026/01/01/ai-trends" ```bash curl -X POST https://www.citedy.com/api/agent/ingest \ -H "Authorization: Bearer $CITEDY_API_KEY" \ -H "Content-Type: application/json" \ -d '{"url": "https://techcrunch.com/2026/01/01/ai-trends"}' ``` Response includes clean article text, title, author, publish date, and word count. --- ### Example 3 — Batch Ingestion **User:** "I have 5 articles to process" ```bash curl -X POST https://www.citedy.com/api/agent/ingest/batch \ -H "Authorization: Bearer $CITEDY_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "urls": [ "https://example.com/article-1", "https://example.com/article-2", "https://example.com/article-3", "https://www.youtube.com/watch?v=abc123", "https://example.com/report.pdf" ] }' ``` Returns 5 job IDs. Poll each individually or wait for all to complete. --- ## API Reference ### POST /api/agent/ingest Submit a single URL for ingestion. **Request:** ```json { "url": "string (required) — any supported URL" } ``` **Response 202 (new job):** ```json { "id": "job_abc123", "status": "processing", "content_type": "youtube_video", "poll_url": "/api/agent/ingest/job_abc123", "estimated_credits": 5 } ``` **Response 200 (cache hit):** ```json { "id": "job_abc123", "status": "completed", "cached": true, "credits_charged": 1 } ``` --- ### GET /api/agent/ingest/{id} Poll job status. No credit cost. **Response:** ```json { "id": "job_abc123", "status": "completed", "content_type": "youtube_video", "created_at": "2026-03-01T10:00:00Z", "completed_at": "2026-03-01T10:01:30Z", "credits_charged": 5, "url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ" } ``` Status values: `queued` | `processing` | `completed` | `failed` --- ### GET /api/agent/ingest/{id}/content Retrieve full extracted content. No credit cost. **Response:** ```json { "id": "job_abc123", "content_type": "youtube_video", "url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ", "metadata": { "title": "Video Title", "author": "Channel Name", "duration_seconds": 212, "published_at": "2009-10-25" }, "transcript": "Full transcript text...", "summary": "Brief summary of the content...", "word_count": 1840, "language": "en" } ``` --- ### POST /api/agent/ingest/batch Submit up to 20 URLs at once. **Request:** ```json { "urls": ["string", "..."], "callback_url": "string (optional)" } ``` **Response 202:** ```json { "jobs": [ { "url": "https://...", "id": "job_abc123", "status": "queued" }, { "url": "https://...", "id": "job_abc124", "status": "queued" } ], "total": 2 } ``` --- ### GET /api/agent/ingest List ingestion jobs. **Query params:** - `status` — filter by `queued | processing | completed | failed` - `limit` — max results (default 20, max 100) - `offset` — pagination offset **Response:** ```json { "jobs": [...], "total": 42, "limit": 20, "offset": 0 } ``` --- ## Glue Tools ### GET /api/agent/health Check API availability. 0 credits. ### GET /api/agent/me Return current agent identity and credit balance. 0 credits. ### GET /api/agent/status Return API status, current rate limit usage, and service health. 0 credits. --- ## Pricing | Content Type | Duration / Size | Credits | | -------------------- | --------------- | ---------- | | `web_article` | any | 1 credits | | `pdf_document` | any | 2 credits | | `youtube_video` | < 10 min | 5 credits | | `youtube_video` | 10–30 min | 15 credits | | `youtube_video` | 30–60 min | 30 credits | | `youtube_video` | 60–120 min | 55 credits | | `audio_file` | < 10 min | 3 credits | | `audio_file` | 10–30 min | 8 credits | | `audio_file` | 30–60 min | 15 credits | | `audio_file` | 60+ min | 30 credits | | Cache hit (any type) | — | 1 credits | Credits are charged on `completed` status only. Failed jobs are not charged. --- ## Limitations - **YouTube:** maximum video duration 120 minutes. Videos longer than 120 min are rejected with `DURATION_EXCEEDED`. - **Audio files:** maximum file size 50 MB. Files larger than 50 MB are rejected with `SIZE_EXCEEDED`. - **Supported content types:** `youtube_video`, `web_article`, `pdf_document`, `audio_file` - **Batch size:** maximum 20 URLs per batch request - **Private content:** private YouTube videos, paywalled articles, and login-gated content cannot be ingested --- ## Rate Limits | Endpoint | Limit | | ---------------------------- | ----------------------------- | | POST /api/agent/ingest | 30 requests/hour per tenant | | POST /api/agent/ingest/batch | 5 requests/hour per tenant | | All other endpoints | 60 requests/minute per tenant | Rate limit headers are included in all responses: - `X-RateLimit-Limit` - `X-RateLimit-Remaining` - `X-RateLimit-Reset` --- ## Error Handling | Error Code | HTTP Status | Meaning | | -------------------------- | ----------- | ---------------------------------- | | `INVALID_URL` | 400 | URL is malformed or unsupported | | `UNSUPPORTED_CONTENT_TYPE` | 400 | Content type not supported | | `DURATION_EXCEEDED` | 400 | YouTube video longer than 120 min | | `SIZE_EXCEEDED` | 400 | Audio file larger than 50 MB | | `INSUFFICIENT_CREDITS` | 402 | Not enough credits to process | | `RATE_LIMIT_EXCEEDED` | 429 | Too many requests | | `JOB_NOT_FOUND` | 404 | Job ID does not exist | | `PROCESSING_FAILED` | 500 | Ingestion failed on server side | | `PRIVATE_CONTENT` | 403 | Content is behind login or paywall | On `PROCESSING_FAILED`, retry after 60 seconds. If it fails twice, try a different URL or contact support. --- ## Response Guidelines When returning ingested content to the user: - **Always confirm** the content type detected (YouTube, article, PDF, audio) - **Show credit cost** before and after ingestion - **Summarize** before presenting the full transcript — users often want a quick answer first - **Ask what to do next** — "I have the transcript. Would you like me to write a blog post, summarize it, or extract key points?" - **For YouTube:** include video title, channel, and duration in your response - **On cache hit:** inform the user this was previously ingested and cost only 1 credit --- ## Want More? This skill is part of the Citedy AI platform. The full suite includes: - **Article Generation** — write SEO-optimized blog posts from keywords or URLs - **Social Adaptation** — repurpose articles for LinkedIn, X, Instagram, Reddit - **SEO Analysis** — content gap analysis, competitor tracking, visibility scanning - **Autopilot** — fully automated content pipeline from keywords to published articles Learn more at [citedy.com](https://www.citedy.com) or explore the `citedy-seo-agent` skill for the complete toolkit.