--- name: data-profiler description: Profile datasets to understand schema, quality, and characteristics. Use when analyzing data files (CSV, JSON, Parquet), discovering dataset properties, assessing data quality, or when user mentions data profiling, schema detection, data analysis, or quality metrics. Provides basic and intermediate profiling including distributions, uniqueness, and pattern detection. version: 1.0.0 --- # Data Profiler Skill ## Overview This skill provides comprehensive data profiling capabilities at Basic and Intermediate levels. It analyzes datasets to extract schema information, statistical summaries, data quality metrics, distributions, uniqueness characteristics, and pattern detection. ## Profiling Levels ### Basic Level - Row count and column count - Column names and data types - Null/missing value counts and percentages - Min, max, mean, median for numeric columns - String length statistics for text columns ### Intermediate Level - **Distributions**: Value frequencies, histograms, percentiles - **Uniqueness**: Distinct counts, cardinality, duplicate detection - **Pattern Detection**: Date/time formats, regex patterns, data format identification - **Quality Metrics**: Completeness scores, consistency indicators ## Supported Formats - CSV (Comma-Separated Values) - JSON (JSON Lines and standard JSON) - Parquet (Apache Parquet columnar format) ## When to Use - User requests data profiling or data analysis - Need to understand dataset structure and quality - Discovering schema for unknown datasets - Assessing data quality before ingestion - Identifying data issues (nulls, duplicates, anomalies) - Creating dataset specifications or documentation ## Core Profiling Functions ### 1. Schema Detection ```python import pandas as pd def detect_schema(df: pd.DataFrame) -> dict: """ Detect schema information from DataFrame. Returns: Dictionary with column names, types, nullable status """ schema = { 'columns': [] } for col in df.columns: col_info = { 'name': col, 'type': str(df[col].dtype), 'nullable': df[col].isnull().any(), 'null_count': int(df[col].isnull().sum()), 'null_percentage': float(df[col].isnull().sum() / len(df) * 100) } # Infer semantic type if pd.api.types.is_numeric_dtype(df[col]): col_info['semantic_type'] = 'numeric' elif pd.api.types.is_datetime64_dtype(df[col]): col_info['semantic_type'] = 'datetime' elif pd.api.types.is_bool_dtype(df[col]): col_info['semantic_type'] = 'boolean' else: col_info['semantic_type'] = 'string' schema['columns'].append(col_info) return schema ``` ### 2. Basic Statistics ```python def get_basic_statistics(df: pd.DataFrame) -> dict: """ Get basic statistical summary. Returns: Dictionary with row count, column count, and column stats """ stats = { 'row_count': len(df), 'column_count': len(df.columns), 'columns': {} } for col in df.columns: col_stats = {} if pd.api.types.is_numeric_dtype(df[col]): col_stats = { 'min': float(df[col].min()) if not df[col].isna().all() else None, 'max': float(df[col].max()) if not df[col].isna().all() else None, 'mean': float(df[col].mean()) if not df[col].isna().all() else None, 'median': float(df[col].median()) if not df[col].isna().all() else None, 'std': float(df[col].std()) if not df[col].isna().all() else None } elif pd.api.types.is_string_dtype(df[col]) or df[col].dtype == object: non_null = df[col].dropna() if len(non_null) > 0: col_stats = { 'min_length': int(non_null.astype(str).str.len().min()), 'max_length': int(non_null.astype(str).str.len().max()), 'avg_length': float(non_null.astype(str).str.len().mean()) } stats['columns'][col] = col_stats return stats ``` ### 3. Distribution Analysis ```python def analyze_distributions(df: pd.DataFrame, max_unique: int = 50) -> dict: """ Analyze value distributions for each column. Args: df: DataFrame to analyze max_unique: Max unique values to show frequencies for Returns: Dictionary with distribution info per column """ distributions = {} for col in df.columns: dist_info = {} # Value counts value_counts = df[col].value_counts() unique_count = len(value_counts) dist_info['unique_count'] = unique_count dist_info['unique_percentage'] = (unique_count / len(df)) * 100 # Show top values if not too many unique if unique_count <= max_unique: dist_info['value_frequencies'] = { str(k): int(v) for k, v in value_counts.head(20).items() } # Percentiles for numeric columns if pd.api.types.is_numeric_dtype(df[col]): percentiles = df[col].quantile([0.25, 0.50, 0.75, 0.90, 0.95, 0.99]) dist_info['percentiles'] = { f'p{int(p*100)}': float(v) for p, v in percentiles.items() } distributions[col] = dist_info return distributions ``` ### 4. Uniqueness Analysis ```python def analyze_uniqueness(df: pd.DataFrame) -> dict: """ Analyze uniqueness characteristics of columns. Returns: Dictionary with uniqueness metrics per column """ uniqueness = {} for col in df.columns: unique_info = {} total_count = len(df[col]) non_null_count = df[col].notna().sum() unique_count = df[col].nunique() duplicate_count = total_count - unique_count unique_info['distinct_count'] = unique_count unique_info['duplicate_count'] = duplicate_count unique_info['uniqueness_ratio'] = unique_count / total_count if total_count > 0 else 0 unique_info['is_unique_key'] = (unique_count == non_null_count) unique_info['has_duplicates'] = (duplicate_count > 0) # Find duplicated values (top 10) if duplicate_count > 0: duplicates = df[col].value_counts() duplicates = duplicates[duplicates > 1].head(10) unique_info['top_duplicates'] = { str(k): int(v) for k, v in duplicates.items() } uniqueness[col] = unique_info return uniqueness ``` ### 5. Pattern Detection ```python import re def detect_patterns(df: pd.DataFrame) -> dict: """ Detect common patterns in string columns. Returns: Dictionary with detected patterns per column """ patterns = {} # Common regex patterns pattern_regexes = { 'email': r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', 'phone_us': r'^\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}$', 'url': r'^https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b', 'ipv4': r'^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$', 'date_iso': r'^\d{4}-\d{2}-\d{2}', 'date_us': r'^\d{1,2}\/\d{1,2}\/\d{2,4}$', 'uuid': r'^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}$', 'numeric': r'^\d+(\.\d+)?$', 'alphanumeric': r'^[a-zA-Z0-9]+$' } for col in df.columns: if pd.api.types.is_string_dtype(df[col]) or df[col].dtype == object: col_patterns = {} non_null = df[col].dropna().astype(str) if len(non_null) == 0: continue # Test each pattern for pattern_name, pattern_regex in pattern_regexes.items(): matches = non_null.str.match(pattern_regex, flags=re.IGNORECASE) match_count = matches.sum() match_percentage = (match_count / len(non_null)) * 100 if match_percentage > 50: # If >50% match, consider it a pattern col_patterns[pattern_name] = { 'match_count': int(match_count), 'match_percentage': float(match_percentage) } if col_patterns: patterns[col] = col_patterns return patterns ``` ### 6. Data Quality Metrics ```python def calculate_quality_metrics(df: pd.DataFrame) -> dict: """ Calculate overall data quality metrics. Returns: Dictionary with quality scores and metrics """ quality = { 'overall_completeness': 0.0, 'columns': {} } total_cells = len(df) * len(df.columns) non_null_cells = df.notna().sum().sum() quality['overall_completeness'] = (non_null_cells / total_cells) * 100 if total_cells > 0 else 0 for col in df.columns: col_quality = {} # Completeness total_count = len(df[col]) non_null_count = df[col].notna().sum() col_quality['completeness'] = (non_null_count / total_count) * 100 if total_count > 0 else 0 # Consistency (e.g., no mixed types) if pd.api.types.is_numeric_dtype(df[col]): col_quality['type_consistency'] = 100.0 # All numeric elif pd.api.types.is_string_dtype(df[col]) or df[col].dtype == object: # Check if all non-null values are same type non_null = df[col].dropna() if len(non_null) > 0: types = non_null.apply(type) col_quality['type_consistency'] = (types.value_counts().max() / len(non_null)) * 100 quality['columns'][col] = col_quality return quality ``` ## Complete Profiling Example ```python import pandas as pd from typing import Dict, Any def profile_dataset(df: pd.DataFrame) -> Dict[str, Any]: """ Complete dataset profiling with all metrics. Args: df: pandas DataFrame to profile Returns: Comprehensive profiling results dictionary """ profile = { 'metadata': { 'row_count': len(df), 'column_count': len(df.columns), 'memory_usage_mb': df.memory_usage(deep=True).sum() / (1024 * 1024) }, 'schema': detect_schema(df), 'statistics': get_basic_statistics(df), 'distributions': analyze_distributions(df), 'uniqueness': analyze_uniqueness(df), 'patterns': detect_patterns(df), 'quality': calculate_quality_metrics(df) } return profile # Usage example if __name__ == "__main__": # Load dataset df = pd.read_csv("data.csv") # Profile profile_results = profile_dataset(df) # Print summary print(f"Dataset Profile:") print(f" Rows: {profile_results['metadata']['row_count']}") print(f" Columns: {profile_results['metadata']['column_count']}") print(f" Completeness: {profile_results['quality']['overall_completeness']:.2f}%") ``` ## Best Practices 1. **Sample Large Datasets**: For datasets >1M rows, profile a representative sample 2. **Handle Missing Data**: Check for nulls before calculating statistics 3. **Type Inference**: Use semantic types, not just Python/pandas types 4. **Memory Management**: Be cautious with very wide datasets (many columns) 5. **Percentiles**: Use percentiles to understand distribution better than mean 6. **Pattern Detection**: Test patterns on non-null values only ## Output Format for Personas When using this skill, output results in a structured format: ```yaml dataset_profile: metadata: row_count: 10000 column_count: 15 memory_usage_mb: 2.5 schema: columns: - name: customer_id type: string nullable: false null_count: 0 semantic_type: string - name: revenue type: float64 nullable: true null_count: 250 null_percentage: 2.5 semantic_type: numeric quality: overall_completeness: 95.5 columns: customer_id: completeness: 100.0 type_consistency: 100.0 revenue: completeness: 97.5 type_consistency: 100.0 recommendations: - "customer_id has 98% uniqueness - suitable as primary key" - "revenue has 2.5% nulls - consider default value or filtering" - "date column matches ISO 8601 format - ready for datetime conversion" ``` ## Integration with Other Skills - **azure-blob-storage**: Load data from Azure before profiling - **databricks-query**: Profile query results from Unity Catalog - Use profiling results to inform **Data Ingestion Agent** about transformations needed