---
name: sf-to-llm-data-pipelines
description: "Use this skill when extracting Salesforce data for consumption by external LLMs or vector stores outside the Salesforce ecosystem — covering Bulk API v2 extraction patterns, PII scrubbing before transmission, chunking and embedding pipelines that run in external infrastructure, and non-Data-Cloud vector store ingestion (Pinecone, pgvector, Weaviate, OpenSearch). Trigger keywords: export Salesforce data to external LLM, Bulk API extract for embeddings, send Salesforce records to OpenAI, pipe Salesforce data to Pinecone, PII scrubbing before LLM, external vector store ingestion. NOT for Data Cloud vector search or grounding within Salesforce (use rag-patterns-in-salesforce), NOT for Data Cloud data model or harmonized schema design (use ai-ready-data-architecture), NOT for Agentforce agent creation or topic design (use agentforce-agent-creation), NOT for BYO LLM registration inside Salesforce (use model-builder-and-byollm)."
category: agentforce
salesforce-version: "Spring '25+"
well-architected-pillars:
  - Security
  - Performance
  - Reliability
  - Operational Excellence
triggers:
  - "How do I export Salesforce records to an external vector database for our own LLM pipeline?"
  - "We are building a RAG system outside Salesforce and need to pull data from our org via Bulk API — what is the right approach?"
  - "How do I scrub PII from Salesforce data before sending it to OpenAI or another external embedding model?"
  - "Our team wants to sync Salesforce Knowledge articles to Pinecone nightly — what extraction and chunking pipeline should we build?"
  - "We need incremental extraction of Salesforce data changes for an external ML pipeline using the API — what is the recommended pattern?"
tags:
  - bulk-api-v2
  - external-llm
  - pii-scrubbing
  - vector-store
  - data-extraction
  - embedding-pipeline
  - agentforce
  - integration
inputs:
  - "Target objects and fields to extract (object API names, field API names)"
  - "External LLM or vector store endpoint and credentials (e.g., OpenAI Embeddings API, Pinecone index)"
  - "PII classification for each extracted field (what must be scrubbed or pseudonymized)"
  - "Data volume estimate (record count, field count, average record size)"
  - "Extraction cadence: full load vs incremental delta extraction"
  - "Salesforce connected app credentials for Bulk API v2 OAuth flow"
outputs:
  - "Bulk API v2 query job configuration (SOQL, content type, operation parameters)"
  - "PII scrubbing specification documenting field-level treatment (omit, mask, pseudonymize)"
  - "Chunking and embedding pipeline design (chunk size, overlap, embedding model choice)"
  - "External vector store ingestion schema (document ID, chunk text, metadata fields)"
  - "Incremental extraction strategy using SystemModstamp or a CDC channel"
  - "Monitoring and error handling runbook for the extraction pipeline"
dependencies:
  - einstein-trust-layer
  - rag-patterns-in-salesforce
version: 1.0.0
author: Pranav Nagrecha
updated: 2026-04-06
---

# Salesforce-to-External-LLM Data Pipelines

This skill activates when a team needs to extract Salesforce data and route it through an external pipeline — outside the Salesforce Einstein platform — to feed an LLM or populate a vector store they own and operate. It covers the Bulk API v2 extraction layer, PII scrubbing requirements before data leaves the org boundary, chunking and embedding pipelines built on external infrastructure, and ingestion schemas for non-Data-Cloud vector stores.
This is the skill for teams that have chosen to build and run their own AI infrastructure rather than use Salesforce's native Data Cloud vector search.

---

## Before Starting

Gather this context before working on anything in this domain:

- **Confirm data residency and contractual constraints.** Salesforce data leaving the org boundary is subject to the customer's Salesforce Master Subscription Agreement (MSA), any applicable data processing addenda, and jurisdiction-specific regulations (GDPR, CCPA, HIPAA). Verify whether a Data Processing Agreement (DPA) is in place with the external LLM provider before designing the pipeline. This is not optional; it is a legal precondition.
- **Identify every field that contains PII, quasi-identifiers, or regulated data.** Common Salesforce fields that require scrubbing include `Email`, `Phone`, `MobilePhone`, `Name` on Contact and Lead, `SSN__c` or similar custom fields, and any field in a Health Cloud or Financial Services Cloud org that maps to PHI or MNPI. PII scrubbing must be applied as soon as the record is decoded from the Salesforce API response — scrubbing at the vector store layer is too late.
- **Determine the extraction pattern: full load or incremental delta.** Full loads are simpler to implement but become impractical above roughly 5 million records per object because of Bulk API v2 job duration limits. Incremental extraction requires a reliable high-watermark field (`SystemModstamp` or a custom `LastSyncedAt__c` field) and a separate strategy for detecting hard deletes.
- **Establish the volume and velocity envelope.** Bulk API v2 supports up to 100 million records per 24-hour rolling window per connected app. A single Bulk API v2 query job returns up to 100 MB of compressed CSV output per result batch. Large-volume orgs may need to partition jobs by date range or by a high-cardinality indexed field.

---

## Core Concepts

### 1. Bulk API v2 Query Jobs

Bulk API v2 is the correct API for extracting large volumes of Salesforce records for external pipelines. It is asynchronous: the caller submits a SOQL query, Salesforce processes it in the background, and the caller polls for completion before downloading result sets as paginated CSV batches. Key behaviors:

- **Query results are CSV only.** Unlike Bulk API 1.0, which also supported JSON and XML, Bulk API v2 returns query results exclusively as RFC 4180 CSV.
- **Result sets are paginated.** Each call to the results endpoint returns up to `maxRecords` rows (default 50,000). Callers must follow the `Sforce-Locator` response header to retrieve subsequent pages until it comes back as the literal value `null`.
- **Job state machine:** `UploadComplete` → `InProgress` → `JobComplete` (or `Failed` / `Aborted`). Poll the job status endpoint; do not assume completion based on elapsed time.
- **Query jobs do not consume the standard API request limit.** They count against Bulk API v2 daily limits, not the per-org 24-hour API call limit that REST and SOAP calls consume.
- **SOQL in Bulk API v2 query jobs cannot use relationship traversal in the SELECT clause** (e.g., `SELECT Account.Name FROM Contact` is not supported). Resolve parent fields either by querying parent objects separately and joining client-side, or by using formula fields that denormalize the parent value at the Salesforce schema layer.

The sketch after this section shows the full submit, poll, and download loop.

Source: [Bulk API 2.0 Developer Guide](https://developer.salesforce.com/docs/atlas.en-us.api_asynch.meta/api_asynch/asynch_api_intro.htm)
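Below is a minimal sketch of that lifecycle in Python, assuming a valid OAuth access token and My Domain instance URL are already in hand from the connected app flow; the object, fields, API version, and watermark value are illustrative.

```python
import time

import requests

INSTANCE = "https://yourorg.my.salesforce.com"  # assumption: your My Domain host
API = f"{INSTANCE}/services/data/v62.0"
HEADERS = {
    "Authorization": "Bearer <ACCESS_TOKEN>",  # from the connected app OAuth flow
    "Content-Type": "application/json",
}

# 1. Submit the query job. The SystemModstamp filter is the incremental
#    high watermark; a full load would simply omit the WHERE clause.
soql = (
    "SELECT Id, Subject, Description, SystemModstamp FROM Case "
    "WHERE SystemModstamp > 2026-04-01T00:00:00Z"
)
job = requests.post(
    f"{API}/jobs/query",
    headers=HEADERS,
    json={"operation": "query", "query": soql},
).json()
job_id = job["id"]

# 2. Poll the status endpoint until the state machine settles; never assume
#    completion from elapsed time.
while True:
    state = requests.get(f"{API}/jobs/query/{job_id}", headers=HEADERS).json()["state"]
    if state == "JobComplete":
        break
    if state in ("Failed", "Aborted"):
        raise RuntimeError(f"Bulk query job {job_id} ended as {state}")
    time.sleep(10)

# 3. Download paginated CSV results, following the Sforce-Locator header
#    until it comes back as the literal string "null".
csv_pages = []
locator = None
while locator != "null":
    params = {"maxRecords": 50000}
    if locator:
        params["locator"] = locator
    resp = requests.get(
        f"{API}/jobs/query/{job_id}/results",
        headers={"Authorization": HEADERS["Authorization"], "Accept": "text/csv"},
        params=params,
    )
    locator = resp.headers.get("Sforce-Locator", "null")
    csv_pages.append(resp.text)  # each page feeds the PII scrub step next
```

Persist the highest `SystemModstamp` seen in each run and use it as the lower bound of the next. Hard deletes still need a separate sweep (e.g., a `queryAll` job filtered on `IsDeleted = true`) because deleted records never appear in a modstamp-filtered query.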
### 2. PII Scrubbing Before Transmission

Once a Salesforce record is returned by the API, it exists in the calling process's memory. If the process transmits the record to an external LLM endpoint before removing PII, there is no further enforcement point — the data has left the org boundary. PII scrubbing must therefore be a synchronous step in the extraction pipeline, sitting between the Bulk API result decode and any outbound network call.

Scrubbing strategies by field type:

- **Direct identifiers (name, email, phone):** Omit from the SOQL query altogether if the LLM pipeline does not need them. If needed for deduplication, replace them with a deterministic pseudonym (e.g., HMAC-SHA256 of the original value keyed with a secret). Do not use an unkeyed hash such as MD5 — low-entropy values like emails and phone numbers can be recovered with rainbow tables or dictionary attacks; the secret key is what prevents reversal.
- **Quasi-identifiers (ZIP code, date of birth, job title):** Apply k-anonymity-style generalization where the field is needed semantically but the exact value is not (e.g., truncate ZIP codes to three digits, reduce dates of birth to birth year).
- **Free-text fields containing incidental PII (case descriptions, notes, Chatter posts):** Apply a named-entity recognition (NER) pass before chunking. Salesforce does not provide an out-of-platform NER service; implement this step in the extraction pipeline with an open-source library (e.g., spaCy's `en_core_web_sm`) or a privacy-preserving NER endpoint. A sketch of both treatments follows this section.

Source: [Salesforce Security Guide — Data Security](https://help.salesforce.com/s/articleView?id=sf.security_data_access.htm)
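A minimal sketch of the synchronous scrub step, assuming spaCy with the `en_core_web_sm` model is installed and that the HMAC key comes from a secrets manager; the field treatments shown are illustrative, not a complete scrubbing specification.

```python
import hashlib
import hmac

import spacy

nlp = spacy.load("en_core_web_sm")
SCRUB_KEY = b"load-me-from-a-secrets-manager"  # assumption: rotated, never hardcoded
PII_LABELS = {"PERSON", "GPE", "LOC", "ORG"}   # tune to your risk profile

def pseudonymize(value: str) -> str:
    # Deterministic keyed pseudonym: the same input always yields the same
    # token, but the value cannot be recovered by dictionary attack without
    # the key.
    return hmac.new(SCRUB_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def redact_free_text(text: str) -> str:
    # NER pass over free text; replace entities right-to-left so earlier
    # character offsets remain valid as the string changes length.
    doc = nlp(text)
    for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
        if ent.label_ in PII_LABELS:
            text = text[: ent.start_char] + f"[{ent.label_}]" + text[ent.end_char :]
    return text

def scrub_record(record: dict) -> dict:
    # Field-level treatment per the scrubbing spec: omit, pseudonymize, redact.
    return {
        "record_key": pseudonymize(record["Email"]),  # dedup key, never the raw email
        "description": redact_free_text(record.get("Description") or ""),
        # Name, Phone, and MobilePhone are omitted entirely: never extracted, never sent.
    }
```

Because the pseudonym is deterministic, the same email always maps to the same `record_key`, so downstream deduplication keeps working without the raw value ever leaving the pipeline.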
### 3. Chunking and Embedding for External Vector Stores

Text extracted from Salesforce must be chunked and embedded before it can be stored in an external vector store. The chunking strategy depends on the source field type:

- **Structured text (short text areas, picklists concatenated into a document):** Fixed-size chunking with 256–512 tokens per chunk and roughly 10% overlap. Small chunks improve precision for structured fact retrieval.
- **Long-form content (Knowledge article body, case description, rich text):** Recursive character text splitting on semantic boundaries (paragraph breaks, sentence ends) is preferred over fixed-size splitting. It preserves paragraph coherence and reduces mid-sentence chunk boundaries.
- **HTML content (Knowledge article `Body` field, rich text areas):** HTML must be stripped before chunking. The raw HTML of a Salesforce Knowledge article body contains markup such as `<p>` and `<br>` tags and inline styling that carries no semantic value for an embedding model; parse it to plain text before the splitter runs. A combined stripping-and-chunking sketch follows this list.
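A minimal sketch of the stripping and splitting steps, assuming BeautifulSoup is available; sizes are in characters as a rough proxy for the 256–512 token guidance (roughly four characters per token for English text), and chunks can run slightly over the cap once the overlap is prepended.

```python
from bs4 import BeautifulSoup

SEPARATORS = ("\n\n", ". ")  # paragraph breaks first, then sentence ends

def strip_html(html: str) -> str:
    # Collapse markup to plain text; the space separator keeps words from
    # fusing across tag boundaries.
    return BeautifulSoup(html, "html.parser").get_text(" ", strip=True)

def chunk(text: str, max_chars: int = 1800, overlap: int = 180, level: int = 0) -> list[str]:
    if len(text) <= max_chars:
        return [text]
    if level >= len(SEPARATORS):
        # Last resort: hard fixed-size slices with overlap.
        step = max_chars - overlap
        return [text[i : i + max_chars] for i in range(0, len(text), step)]
    sep = SEPARATORS[level]
    chunks: list[str] = []
    current = ""
    for part in [p for p in text.split(sep) if p]:
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(part) > max_chars:
            # A single part is still oversized: split it at the next boundary level.
            chunks.extend(chunk(part, max_chars, overlap, level + 1))
            current = ""
        else:
            current = part
    if current:
        chunks.append(current)
    if not chunks:  # pathological all-separator input
        return [text[:max_chars]]
    # Carry a trailing-character overlap into each following chunk.
    return [chunks[0]] + [
        chunks[i - 1][-overlap:] + " " + chunks[i] for i in range(1, len(chunks))
    ]
```

Pair each chunk with a stable document ID (for example, the record `Id` plus a chunk index) and the metadata fields your retrieval layer filters on before upserting into the vector store.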