---
name: taxonomy-architect
description: Design and maintain classification systems for jobs, skills, and companies. Use when defining categories, resolving edge cases, planning ontology structures, or preparing for semantic search capabilities.
---

# Taxonomy Architect

Design, maintain, and evolve classification systems that power job matching, skill analysis, and company categorization. Ensure taxonomies are precise, consistent, and extensible toward future semantic search capabilities.

## When to Use This Skill

Trigger when user asks to:
- Define or refine job role categories (families, subfamilies)
- Create or update skill taxonomies
- Classify ambiguous roles or edge cases
- Design company categorization schemes
- Plan ontology structures for semantic search
- Resolve classification conflicts or inconsistencies
- Evaluate taxonomy coverage and gaps
- Prepare embeddings or vector search strategies

## Core Principles

### 1. Mutually Exclusive, Collectively Exhaustive (MECE)

Categories at the same level should not overlap, and together should cover all cases.

```
BAD:                          GOOD:
├── Data Analyst              ├── Data Analyst
├── Business Analyst          ├── Analytics Engineer
├── Analytics (overlap!)      ├── Data Engineer
└── BI Developer (overlap!)   └── Data Scientist
```

### 2. User Mental Model Alignment

Categories should match how practitioners describe themselves, not internal corporate structures.

```
BAD (org-chart thinking):     GOOD (practitioner thinking):
├── Engineering               ├── Data Engineer
│   └── Data                  ├── ML Engineer
├── Analytics                 ├── Analytics Engineer
│   └── Data                  └── Data Analyst
└── Science
    └── Data
```

### 3. Stable Core, Flexible Edges

Core categories should be stable over time. Edge cases and emerging roles should be handled without restructuring the core.
```
STABLE CORE:                  EDGE HANDLING:
├── Data Engineer             "AI Engineer" → classify as:
├── ML Engineer               - ML Engineer (if model-focused)
├── Data Scientist            - out_of_scope (if API integration)
└── Analytics Engineer        Document decision, revisit quarterly
```

### 4. Evidence-Based Boundaries

Category boundaries should be defined by observable signals in job postings, not assumptions.

| Signal Type | Examples |
|-------------|----------|
| Title patterns | "Analytics Engineer" vs "Data Analyst" |
| Tool requirements | dbt, Airflow, Spark → Data/Analytics Engineer |
| Responsibility keywords | "build pipelines" vs "create dashboards" |
| Team placement | "Data Platform team" vs "Business Intelligence" |
| Seniority markers | "Principal", "Staff", "Lead", "Senior", "Junior" |

### 5. Semantic Readiness

Design with future embedding/vector search in mind. Categories should be describable in natural language that captures semantic meaning.

```
GOOD (embeddable description):
"Analytics Engineer: Builds and maintains data transformation pipelines
using tools like dbt, creates metrics layers and semantic models,
bridges raw data and analyst-ready datasets."
BAD (list of keywords):
"Analytics Engineer: dbt, SQL, data modeling, metrics"
```

---

## Current Taxonomy (v1.5)

### Job Families & Subfamilies

```yaml
job_families:
  product:
    description: "Roles focused on product strategy, discovery, and delivery"
    subfamilies:
      core_pm:
        label: "Core PM"
        description: "General product management for user-facing features"
        signals:
          titles: ["Product Manager", "PM", "Product Lead"]
          keywords: ["roadmap", "user stories", "stakeholders", "prioritization"]
        anti_signals: ["growth", "platform", "API", "ML", "AI"]

      growth_pm:
        label: "Growth PM"
        description: "Acquisition, retention, monetization, conversion optimization"
        signals:
          titles: ["Growth PM", "Growth Product Manager"]
          keywords: ["acquisition", "retention", "conversion", "funnel", "experimentation", "A/B testing"]

      platform_pm:
        label: "Platform PM"
        description: "Developer tools, APIs, infrastructure products"
        signals:
          titles: ["Platform PM", "API PM", "Developer Experience PM"]
          keywords: ["API", "SDK", "developer", "platform", "infrastructure", "internal tools"]

      technical_pm:
        label: "Technical PM"
        description: "Deep technical skills required, often ex-engineers"
        signals:
          titles: ["Technical Product Manager", "TPM"]
          keywords: ["technical requirements", "engineering background", "system design"]

      ai_ml_pm:
        label: "AI/ML PM"
        description: "AI/ML products, models, data products"
        signals:
          titles: ["AI PM", "ML PM", "AI Product Manager", "Data Product Manager"]
          keywords: ["machine learning", "AI", "model", "LLM", "GenAI", "data product"]

  data:
    description: "Roles focused on data infrastructure, analysis, and machine learning"
    subfamilies:
      product_analytics:
        label: "Product Analytics"
        description: "Product metrics, experiments, user behavior, growth analytics"
        signals:
          titles: ["Product Analyst", "Growth Analyst", "Product Data Analyst"]
          keywords: ["product metrics", "experimentation", "user behavior", "funnel analysis", "Amplitude", "Mixpanel"]
        anti_signals: ["pipeline", "infrastructure", "model training"]

      data_analyst:
        label: "Data Analyst"
        description: "Business reporting, dashboards, SQL analysis, BI tools"
        signals:
          titles: ["Data Analyst", "Business Analyst", "BI Analyst", "Reporting Analyst"]
          keywords: ["dashboards", "reporting", "Tableau", "Power BI", "Looker", "business intelligence"]
        anti_signals: ["dbt", "pipeline", "modeling layer"]

      analytics_engineer:
        label: "Analytics Engineer"
        description: "dbt, metrics layer, data modeling, semantic layer"
        signals:
          titles: ["Analytics Engineer", "Data Modeling Engineer"]
          keywords: ["dbt", "data modeling", "metrics layer", "semantic layer", "transformation"]
        disambiguate_from: ["data_analyst", "data_engineer"]

      data_engineer:
        label: "Data Engineer"
        description: "Pipelines, infrastructure, ETL/ELT, big data"
        signals:
          titles: ["Data Engineer", "ETL Developer", "Data Platform Engineer"]
          keywords: ["pipeline", "Airflow", "Spark", "ETL", "ELT", "data infrastructure", "Kafka"]

      ml_engineer:
        label: "ML Engineer"
        description: "Production ML systems, MLOps, includes LLM/GenAI implementation"
        signals:
          titles: ["ML Engineer", "Machine Learning Engineer", "MLOps Engineer", "AI Engineer"]
          keywords: ["model deployment", "MLOps", "feature store", "model serving", "LLM", "fine-tuning"]
        notes: "AI Engineer roles classify here if model-focused; out_of_scope if primarily API integration"

      data_scientist:
        label: "Data Scientist"
        description: "Statistical modeling, predictions, business insights"
        signals:
          titles: ["Data Scientist", "Senior Data Scientist", "Applied Scientist"]
          keywords: ["statistical modeling", "prediction", "regression", "classification", "causal inference"]
        disambiguate_from: ["ml_engineer", "research_scientist"]

      research_scientist:
        label: "Research Scientist (ML/AI)"
        description: "Novel ML research, publications, pushing state-of-the-art"
        signals:
          titles: ["Research Scientist", "ML Researcher", "AI Researcher"]
          keywords: ["publications", "novel", "state-of-the-art", "research", "PhD"]

      data_architect:
        label: "Data Architect"
        description: "Data strategy, governance, platform design"
        signals:
          titles: ["Data Architect", "Enterprise Data Architect", "Data Governance Lead"]
          keywords: ["data strategy", "governance", "data catalog", "metadata", "architecture"]
```

### Seniority Levels

```yaml
seniority:
  junior:
    label: "Junior"
    signals:
      titles: ["Junior", "Jr", "Associate", "Entry Level", "Graduate"]
      experience: ["0-2 years", "entry level", "new grad"]

  mid:
    label: "Mid-Level"
    signals:
      titles: ["Data Engineer", "Product Manager"]  # No prefix = usually mid
      experience: ["2-5 years", "3+ years"]

  senior:
    label: "Senior"
    signals:
      titles: ["Senior", "Sr", "Lead"]
      experience: ["5+ years", "7+ years"]

  staff_plus:
    label: "Staff+"
    signals:
      titles: ["Staff", "Principal", "Distinguished", "Architect", "Director"]
      experience: ["10+ years", "extensive experience"]
```

### Working Arrangement

```yaml
working_arrangement:
  onsite:
    label: "Onsite"
    signals: ["on-site", "in-office", "office-based", "in-person"]

  hybrid:
    label: "Hybrid"
    signals: ["hybrid", "flexible", "2-3 days in office", "partial remote"]

  remote:
    label: "Remote"
    signals: ["remote", "work from home", "distributed", "anywhere"]
    qualifiers: ["remote (US only)", "remote (timezone restricted)"]
```

---

## Skills Taxonomy

### Structure

```yaml
skills:
  parent_categories:
    product:
      label: "Product Skills"
      families:
        - discovery_research
        - execution_delivery
        - experimentation
        - analytics_pm
        - stakeholder_mgmt

    data_ml:
      label: "Data/ML Skills"
      families:
        - programming
        - analytics_stats
        - classical_ml
        - deep_learning
        - llm_genai
        - big_data
        - pipelines_orchestration
        - data_modeling
        - warehouses_lakes
        - mlops
        - cloud
        - streaming
        - visualization

    platform_infra:
      label: "Platform/Infra Skills"
      families:
        - deployment
        - infrastructure_code
        - ci_cd
        - monitoring
```

### Skill Family Details

```yaml
data_ml:
  programming:
    label: "Programming Languages"
    skills: ["Python", "R", "SQL", "Scala", "Java", "Julia"]
    notes: "SQL is both a language and a skill; always extract"

  analytics_stats:
    label: "Analytics & Statistics"
    skills: ["Statistics", "Probability", "Regression", "Causal inference", "Time series", "Hypothesis testing", "Bayesian analysis"]

  classical_ml:
    label: "Classical Machine Learning"
    skills: ["Scikit-learn", "XGBoost", "LightGBM", "Random Forest", "Logistic regression", "SVM", "Feature engineering"]

  deep_learning:
    label: "Deep Learning"
    skills: ["PyTorch", "TensorFlow", "Keras", "Neural networks", "CNNs", "RNNs", "Computer vision", "NLP"]

  llm_genai:
    label: "LLM/GenAI"
    skills: ["LLMs", "Transformers", "GPT", "BERT", "Claude", "Prompt engineering", "RAG", "Vector databases", "LangChain", "Embeddings", "Fine-tuning"]
    notes: "Fast-evolving category; review quarterly"

  big_data:
    label: "Big Data Processing"
    skills: ["Spark", "PySpark", "Hadoop", "Hive", "Presto", "Flink"]

  pipelines_orchestration:
    label: "Pipelines & Orchestration"
    skills: ["Airflow", "Dagster", "Prefect", "Luigi", "Data pipelines", "ETL", "ELT"]

  data_modeling:
    label: "Data Modeling"
    skills: ["dbt", "Data modeling", "Dimensional modeling", "Star schema", "Data warehouse design"]

  warehouses_lakes:
    label: "Warehouses & Lakes"
    skills: ["Snowflake", "BigQuery", "Redshift", "Databricks", "Athena", "Delta Lake", "Data lake"]

  mlops:
    label: "MLOps"
    skills: ["MLflow", "Kubeflow", "Model serving", "Model monitoring", "Feature stores", "Model registry", "Weights & Biases"]

  cloud:
    label: "Cloud Platforms"
    skills: ["AWS", "GCP", "Azure", "S3", "EC2", "Lambda", "Cloud Functions"]

  streaming:
    label: "Streaming"
    skills: ["Kafka", "Kinesis", "Pub/Sub", "Real-time processing"]

  visualization:
    label: "Data Visualization"
    skills: ["Tableau", "Power BI", "Looker", "Metabase", "Plotly", "Matplotlib", "Seaborn"]
```

---

## Company/Employer Taxonomy

**System of Record:** `docs/schema_taxonomy.yaml` (see `enums.employer_industry`)

### Employer Industry (20 Domain-Focused Categories)

**Design Decision:** These are industry VERTICALS, not business models. "B2B SaaS" was intentionally excluded - it's a business model that spans multiple industries. A company like Stripe is `fintech` even though it sells B2B SaaS.

| Code | Label | Examples |
|------|-------|----------|
| `fintech` | FinTech | Stripe, Monzo, Affirm, Plaid |
| `healthtech` | HealthTech | Flatiron, Omada, Oscar |
| `ecommerce` | E-commerce & Marketplace | Instacart, Deliveroo, Etsy |
| `ai_ml` | AI/ML | OpenAI, Anthropic, Harvey AI |
| `consumer` | Consumer Tech | Spotify, Reddit, Strava |
| `mobility` | Mobility & Logistics | Uber, Waymo, Zipline |
| `proptech` | PropTech | Airbnb, Zillow, CoStar |
| `edtech` | EdTech | Coursera, Duolingo |
| `climate` | Climate Tech | Watershed, Crusoe |
| `crypto` | Crypto & Web3 | Coinbase, Kraken |
| `devtools` | Developer Tools | GitHub, Vercel, Linear |
| `data_infra` | Data Infrastructure | Snowflake, Databricks, dbt Labs |
| `cybersecurity` | Cybersecurity | Okta, Vanta, 1Password |
| `hr_tech` | HR Tech | Rippling, Gusto, Deel |
| `martech` | Marketing Tech | Braze, Amplitude, HubSpot |
| `professional_services` | Professional Services | Deloitte, Accenture |
| `productivity` | Productivity & Collaboration | Notion, Asana, Airtable, Calendly |
| `hardware` | Hardware & Robotics | Apple, Gecko Robotics |
| `other` | Other | Catch-all |

### Employer Size

| Code | Label | Signals |
|------|-------|---------|
| `startup` | Startup (1-50) | seed, series A, early stage |
| `scaleup` | Scale-up (51-500) | series B/C, growth stage |
| `enterprise` | Enterprise (501+) | public, Fortune 500, established |

### Multi-Industry Companies

Some companies span multiple industries. Classification rules:

1. **Single primary industry** - Each company gets ONE `industry` value (MECE)
2. **Classify by core product/revenue** - Stripe is `fintech` (payments), not `devtools`
3. **For conglomerates** - Classify by the division most relevant to the job posting

| Company | Industry | Rationale |
|---------|----------|-----------|
| Stripe | `fintech` | Core is payments, even though they have dev tools |
| Uber | `mobility` | Core is transportation |
| Airbnb | `proptech` | Real estate marketplace |
| Amazon (AWS jobs) | `devtools` or `data_infra` | Depends on specific role |

---

## Edge Case Resolution

### Decision Framework

When encountering ambiguous roles:

```
1. Check title patterns against known signals
2. Analyze job description for disambiguating keywords
3. Look at team/department placement
4. Consider required tools/skills
5. Apply "where would the practitioner self-identify?" test
6. If still ambiguous, document and classify to best fit
7. Flag for quarterly taxonomy review
```

### Documented Edge Cases

| Role Pattern | Decision | Rationale |
|--------------|----------|-----------|
| "AI Engineer" | ML Engineer OR out_of_scope | If model-focused → ML Engineer; if API integration only → out_of_scope |
| "Data Analyst" with dbt | Analytics Engineer | dbt is strong signal for AE over DA |
| "Business Intelligence Engineer" | Data Analyst | Despite "engineer" title, typically dashboard/reporting focused |
| "Applied Scientist" | Data Scientist | Title most common at Amazon; responsibilities align with DS |
| "Product Analyst" | Product Analytics | Distinct from generic Data Analyst by product focus |
| "Growth Engineer" | out_of_scope | Engineering role, not data/product |
| "Technical Program Manager" | out_of_scope | Program management, not product management |

### Geographic Variations

| Term | US Meaning | UK Meaning | Resolution |
|------|------------|------------|------------|
| "Data Scientist" | Often ML-heavy | Sometimes more analytics | Check for ML signals |
| "Analyst" | Entry-level connotation | Can be senior | Use seniority signals |

---

## Ontology Design (Future: Semantic Search)

### Current State: Taxonomy

```
Hierarchical classification
├── Fixed categories
├── Rule-based assignment
└── Exact match on signals
```

### Future State: Ontology

```
Semantic network
├── Entities with relationships
├── Embedding-based similarity
├── Natural language queries
└── Fuzzy matching with confidence
```

### Preparation Steps

**1. Rich Entity Descriptions**

Every category needs a natural language description suitable for embedding:

```yaml
analytics_engineer:
  embedding_description: |
    An Analytics Engineer builds and maintains the data transformation layer
    between raw data sources and analyst-ready datasets. They typically work
    with tools like dbt to create reusable data models, define business metrics
    in a semantic layer, and ensure data quality through testing. They bridge
    the gap between Data Engineers who build pipelines and Data Analysts who
    consume clean data.

    Related roles: Data Analyst, Data Engineer, BI Developer
    Key differentiator: Focuses on transformation and modeling, not pipeline
    infrastructure or end-user dashboards.
```

**2. Relationship Types**

```yaml
relationships:
  is_a:
    description: "Hierarchical parent-child"
    example: "Analytics Engineer IS_A Data Role"

  related_to:
    description: "Conceptually similar, often confused"
    example: "Analytics Engineer RELATED_TO Data Analyst"

  requires_skill:
    description: "Role typically requires this skill"
    example: "Analytics Engineer REQUIRES_SKILL dbt"

  collaborates_with:
    description: "Roles that frequently work together"
    example: "Analytics Engineer COLLABORATES_WITH Data Scientist"

  progression_to:
    description: "Common career progression"
    example: "Data Analyst PROGRESSION_TO Analytics Engineer"
```

**3. Embedding Strategy**

```yaml
embedding_approach:
  model: "text-embedding-3-small"  # or similar

  what_to_embed:
    - role descriptions (paragraph form)
    - skill descriptions
    - job posting titles + first 500 chars

  similarity_thresholds:
    high_confidence: 0.85
    medium_confidence: 0.70
    needs_review: 0.50

  use_cases:
    - "Find roles similar to Analytics Engineer"
    - "What skills are adjacent to dbt?"
    - "Candidates with X skills might fit Y roles"
```

**4. Query Patterns (Future)**

```
Natural language queries the ontology should support:

"Show me roles that are like Data Scientist but more engineering-focused"
→ Returns: ML Engineer, Analytics Engineer

"What skills should a Data Analyst learn to become an Analytics Engineer?"
→ Returns: dbt, data modeling, SQL (advanced), Git

"Find companies where Analytics Engineers report to Engineering not Analytics"
→ Returns: [requires company org data]

"Which roles commonly transition to Product Management?"
→ Returns: Data Analyst, Product Analytics, Data Scientist
```

---

## Taxonomy Maintenance

### Review Cadence

| Review Type | Frequency | Focus |
|-------------|-----------|-------|
| Edge case log review | Weekly | Resolve accumulated ambiguities |
| Coverage analysis | Monthly | Identify gaps, new role patterns |
| Signal effectiveness | Monthly | Which signals are predictive? |
| Full taxonomy review | Quarterly | Add/remove/restructure categories |
| Skill taxonomy update | Quarterly | New tools, deprecated skills |

### Metrics to Track

| Metric | Target | Action if Target Missed |
|--------|--------|-------------------------|
| Classification confidence (avg) | >0.85 | Review low-confidence patterns |
| out_of_scope rate | <15% | Consider new categories |
| Edge case backlog | <20 unresolved | Schedule resolution session |
| Reclassification rate | <5% | Investigate unstable categories |

### Change Log Template

```markdown
## Taxonomy Change Log

### [Date] - v1.X.X

**Added:**
- [New category/skill] - Rationale: [why needed]

**Changed:**
- [Category] - [What changed] - Rationale: [why]

**Removed:**
- [Category/skill] - Rationale: [why deprecated]

**Edge Cases Resolved:**
- [Role pattern] → Now classifies as [category]

**Open Questions:**
- [Unresolved issue for next review]
```

---

## Output Formats

### Classification Decision

```markdown
## Classification: [Job Title]

**Input:** [Raw title and key description excerpts]

**Decision:**
- Family: [product/data]
- Subfamily: [specific category]
- Seniority: [level]
- Confidence: [high/medium/low]

**Signals Found:**
- Title: [matching patterns]
- Keywords: [matching terms]
- Tools: [specific tools mentioned]

**Disambiguation Notes:**
[If edge case, explain reasoning]

**Flags:**
- [ ] Needs human review
- [ ] New pattern for taxonomy consideration
```

### Taxonomy Gap Analysis

```markdown
## Gap Analysis: [Date]

**Coverage Summary:**
- Total roles analyzed: [N]
- Successfully classified: [N] ([%])
- Out of scope: [N] ([%])
- Low confidence: [N] ([%])

**Emerging Patterns:**
| Pattern | Frequency | Suggested Action |
|---------|-----------|------------------|
| [New title pattern] | [N] | [Add category / Add signal / Monitor] |

**Problem Categories:**
| Category | Issue | Recommendation |
|----------|-------|----------------|
| [Category] | [High confusion rate with X] | [Improve signals / Merge / Split] |
**Skill Gaps:**
- [New skills appearing frequently but not in taxonomy]

**Recommendations:**
1. [Specific change]
2. [Specific change]
```

---

## Integration Points

### With Classifier (Claude Haiku)

The taxonomy informs the classification prompt:

```python
TAXONOMY_CONTEXT = """
Valid subfamilies for Data roles:
- product_analytics: Product metrics, experiments, user behavior
- data_analyst: Business reporting, dashboards, BI tools
- analytics_engineer: dbt, metrics layer, data modeling
- data_engineer: Pipelines, ETL/ELT, data infrastructure
- ml_engineer: Production ML, MLOps, model deployment
- data_scientist: Statistical modeling, predictions
- research_scientist: Novel ML research, publications
- data_architect: Data strategy, governance

Classification rules:
- "AI Engineer" → ml_engineer if model-focused, else out_of_scope
- Presence of "dbt" strongly indicates analytics_engineer
- "Business Analyst" → data_analyst unless product-focused
"""
```

### With Job Feed (Filtering)

Taxonomy enables precise filtering:

```sql
-- User selects "Analytics Engineer"
-- Only returns exact subfamily match, not "Data Analyst"
WHERE job_subfamily = 'analytics_engineer'

-- User selects "Data" family
-- Returns all data subfamilies
WHERE job_family = 'data'
```

### With Semantic Search (Future)

```python
# Illustrative only: `db`, `embed`, and `vector_search` are placeholders,
# not functions defined in this document.

# Current: exact match on the subfamily column
results = db.query("job_subfamily = 'analytics_engineer'")

# Future: semantic similarity
query_embedding = embed("data transformation and metrics modeling role")
results = vector_search(query_embedding, threshold=0.8)
# Returns: analytics_engineer, data_engineer (lower score)
```
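
The `vector_search` call in the snippet above is left undefined. As a rough sketch of how such a function might work, the block below ranks categories by cosine similarity and maps each score to the tiers from the embedding strategy. Everything here is an assumption for illustration: the three-dimensional vectors are made-up stand-ins for real embeddings, `CATEGORY_VECTORS`, `cosine_similarity`, and `confidence_tier` are hypothetical names, and reading the documented thresholds (0.85 / 0.70 / 0.50) as lower bounds of each tier is one possible interpretation.

```python
import math

# Toy 3-dimensional vectors standing in for real embeddings.
# The values are invented for illustration; real vectors would come from
# an embedding model and have hundreds of dimensions.
CATEGORY_VECTORS = {
    "analytics_engineer": [0.9, 0.4, 0.1],
    "data_engineer": [0.7, 0.7, 0.1],
    "data_analyst": [0.3, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def vector_search(query_vector, threshold=0.8, index=CATEGORY_VECTORS):
    """Return (category, score) pairs at or above threshold, best first."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in index.items()
    ]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


def confidence_tier(score):
    """Map a similarity score to a tier, treating thresholds as lower bounds."""
    if score >= 0.85:
        return "high_confidence"
    if score >= 0.70:
        return "medium_confidence"
    if score >= 0.50:
        return "needs_review"
    return "no_match"


# Usage: a query vector close to the analytics_engineer toy vector.
for name, score in vector_search([0.9, 0.4, 0.1], threshold=0.8):
    print(f"{name}: {score:.2f} ({confidence_tier(score)})")
```

With these toy vectors the query returns `analytics_engineer` first and `data_engineer` second, while `data_analyst` falls below the 0.8 cutoff. In practice the linear scan would likely be replaced by a vector database or an approximate nearest-neighbor index once the number of embedded items grows.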