---
name: moai-formats-data
description: Data format specialist covering TOON encoding, JSON/YAML optimization, serialization patterns, and data validation for modern applications
version: 1.0.0
category: library
tags:
  - formats
  - data
  - toon
  - serialization
  - validation
  - optimization
updated: 2025-11-30
status: active
author: MoAI-ADK Team
---

# Data Format Specialist

## Quick Reference (30 seconds)

Advanced data format management: comprehensive data handling covering TOON encoding, JSON/YAML optimization, serialization patterns, and data validation for performance-critical applications.

Core Capabilities:
- TOON Encoding: 40-60% token reduction vs JSON for LLM communication
- JSON/YAML Optimization: Efficient serialization and parsing patterns
- Data Validation: Schema validation, type checking, error handling
- Format Conversion: Seamless transformation between data formats
- Performance: Optimized data structures and caching strategies
- Schema Management: Dynamic schema generation and evolution

When to Use:
- Optimizing data transmission to LLMs within token budgets
- High-performance serialization/deserialization
- Schema validation and data integrity
- Format conversion and data transformation
- Large dataset processing and optimization

Quick Start:

```python
# TOON encoding (40-60% token reduction)
from moai_formats_data import TOONEncoder

encoder = TOONEncoder()
compressed = encoder.encode({"user": "John", "age": 30})
original = encoder.decode(compressed)

# Fast JSON processing
from moai_formats_data import JSONOptimizer

optimizer = JSONOptimizer()
fast_json = optimizer.serialize_fast(large_dataset)

# Data validation
from moai_formats_data import DataValidator

validator = DataValidator()
schema = validator.create_schema({"name": {"type": "string", "required": True}})
result = validator.validate({"name": "John"}, schema)
```

---

## Implementation Guide (5 minutes)

### Core Concepts

TOON (Token-Optimized Object Notation):
- Custom binary-compatible format optimized for LLM token usage
- Type markers: `#` (numbers), `!` (booleans), `@` (timestamps), `~` (null) (see the sketch at the end of this section)
- 40-60% size reduction vs JSON for typical data structures
- Lossless round-trip encoding/decoding

Performance Optimization:
- Ultra-fast JSON processing with orjson (2-5x faster than standard json)
- Streaming processing for large datasets using ijson
- Intelligent caching with LRU eviction and memory management
- Schema compression and validation optimization

Data Validation:
- Type-safe validation with custom rules and patterns
- Schema evolution and migration support
- Cross-field validation and dependency checking
- Performance-optimized batch validation
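To make the marker scheme concrete, here is a minimal, illustrative sketch of how such markers could tag values in a flat record. This is not the actual `TOONEncoder` wire format (see `modules/toon-encoding.md` for that); the `encode_flat` helper and the `|`/`=` separators are assumptions made purely for illustration.

```python
from datetime import datetime

# Illustrative only: tag values in a flat dict with the markers described above.
def encode_flat(record: dict) -> str:
    parts = []
    for key, value in record.items():
        if value is None:
            parts.append(f"{key}~")                    # ~ marks null
        elif isinstance(value, bool):                   # check bool before int
            parts.append(f"{key}!{int(value)}")         # ! marks booleans
        elif isinstance(value, (int, float)):
            parts.append(f"{key}#{value}")              # # marks numbers
        elif isinstance(value, datetime):
            parts.append(f"{key}@{value.isoformat()}")  # @ marks timestamps
        else:
            parts.append(f"{key}={value}")              # strings carry no marker here
    return "|".join(parts)

print(encode_flat({"id": 123, "active": True, "deleted_at": None}))
# id#123|active!1|deleted_at~
```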
### Basic Implementation

```python
from moai_formats_data import TOONEncoder, JSONOptimizer, DataValidator
from datetime import datetime

# 1. TOON Encoding for LLM optimization
encoder = TOONEncoder()
data = {
    "user": {"id": 123, "name": "John", "active": True, "created": datetime.now()},
    "permissions": ["read", "write", "admin"]
}

# Encode for transmission, then round-trip decode
toon_data = encoder.encode(data)
original_data = encoder.decode(toon_data)

# 2. Fast JSON Processing
optimizer = JSONOptimizer()

# Ultra-fast serialization
json_bytes = optimizer.serialize_fast(data)
parsed_data = optimizer.deserialize_fast(json_bytes)

# Schema compression for repeated validation
schema = {"type": "object", "properties": {"name": {"type": "string"}}}
compressed_schema = optimizer.compress_schema(schema)

# 3. Data Validation
validator = DataValidator()

# Create validation schema
user_schema = validator.create_schema({
    "username": {"type": "string", "required": True, "min_length": 3},
    "email": {"type": "email", "required": True},
    "age": {"type": "integer", "required": False, "min_value": 13}
})

# Validate data
user_data = {"username": "john_doe", "email": "john@example.com", "age": 30}
result = validator.validate(user_data, user_schema)

if result['valid']:
    print("Data is valid!")
    sanitized = result['sanitized_data']
else:
    print("Validation errors:", result['errors'])
```

### Common Use Cases

API Response Optimization:

```python
from typing import Dict

from moai_formats_data import TOONEncoder

# Optimize API responses for LLM consumption
def optimize_api_response(data: Dict) -> str:
    encoder = TOONEncoder()
    return encoder.encode(data)

# Parse optimized responses
def parse_optimized_response(toon_data: str) -> Dict:
    encoder = TOONEncoder()
    return encoder.decode(toon_data)
```

Configuration Management:

```python
# Fast YAML configuration loading
from moai_formats_data import YAMLOptimizer

yaml_optimizer = YAMLOptimizer()
config = yaml_optimizer.load_fast("config.yaml")

# Merge multiple configurations
merged = yaml_optimizer.merge_configs(base_config, env_config, user_config)
```

Large Dataset Processing:

```python
# Stream processing for large JSON files
from moai_formats_data import StreamProcessor

processor = StreamProcessor(chunk_size=8192)

# Process the file item by item without loading it all into memory
def process_item(item):
    print(f"Processing: {item['id']}")

processor.process_json_stream("large_dataset.json", process_item)
```

---

## Advanced Features (10+ minutes)

### Advanced TOON Features

Custom Type Handlers:

```python
import uuid
from decimal import Decimal

from moai_formats_data import TOONEncoder

# Extend the TOON encoder with custom types
class CustomTOONEncoder(TOONEncoder):
    def _encode_value(self, value):
        # Handle UUID objects
        if isinstance(value, uuid.UUID):
            return f'${value.hex}'
        # Handle Decimal objects
        if isinstance(value, Decimal):
            return f'&{value}'
        return super()._encode_value(value)

    def _parse_value(self, s):
        # Parse custom UUIDs ('$' + 32 hex characters)
        if s.startswith('$') and len(s) == 33:
            return uuid.UUID(s[1:])
        # Parse custom Decimals
        if s.startswith('&'):
            return Decimal(s[1:])
        return super()._parse_value(s)
```

Streaming TOON Processing:

```python
from typing import Dict, List

from moai_formats_data import TOONEncoder

# Process TOON data in streaming mode
def stream_toon_data(data_generator):
    encoder = TOONEncoder()
    for data in data_generator:
        yield encoder.encode(data)

# Batch TOON processing
def batch_encode_toon(data_list: List[Dict], batch_size: int = 1000):
    encoder = TOONEncoder()
    results = []
    for i in range(0, len(data_list), batch_size):
        batch = data_list[i:i + batch_size]
        encoded_batch = [encoder.encode(item) for item in batch]
        results.extend(encoded_batch)
    return results
```

### Advanced Validation Patterns

Cross-Field Validation:

```python
from typing import Dict

from moai_formats_data import DataValidator

# Validate relationships between fields
class CrossFieldValidator:
    def __init__(self):
        self.base_validator = DataValidator()

    def validate_user_data(self, data: Dict) -> Dict:
        # Base validation
        schema = self.base_validator.create_schema({
            "password": {"type": "string", "required": True, "min_length": 8},
            "confirm_password": {"type": "string", "required": True},
            "email": {"type": "email", "required": True}
        })
        result = self.base_validator.validate(data, schema)

        # Cross-field validation
        if data.get("password") != data.get("confirm_password"):
            result['errors']['password_mismatch'] = "Passwords do not match"
            result['valid'] = False

        return result
```
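A quick usage sketch for the `CrossFieldValidator` defined above. The input keys mirror the schema it builds; the values are placeholders.

```python
validator = CrossFieldValidator()

signup = {
    "password": "correct-horse-battery",
    "confirm_password": "correct-horse-battery",
    "email": "user@example.com",
}

result = validator.validate_user_data(signup)
if result["valid"]:
    print("signup accepted")
else:
    print("signup rejected:", result["errors"])
```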
Schema Evolution:

```python
# Handle schema changes over time
from typing import Dict

from moai_formats_data import SchemaEvolution

evolution = SchemaEvolution()

# Define schema versions
v1_schema = {"name": {"type": "string"}, "age": {"type": "integer"}}
v2_schema = {"full_name": {"type": "string"}, "age": {"type": "integer"}, "email": {"type": "email"}}

# Register schemas
evolution.register_schema("v1", v1_schema)
evolution.register_schema("v2", v2_schema)

# Add migration function
def migrate_v1_to_v2(data: Dict) -> Dict:
    return {
        "full_name": data["name"],
        "age": data["age"],
        "email": None  # New field in v2; v1 records have no value for it
    }

evolution.add_migration("v1", "v2", migrate_v1_to_v2)

# Migrate data
old_data = {"name": "John Doe", "age": 30}
new_data = evolution.migrate_data(old_data, "v1", "v2")
```

### Performance Optimization

Intelligent Caching:

```python
import time
from typing import Dict

from moai_formats_data import SmartCache

# Create cache with memory constraints
cache = SmartCache(max_memory_mb=50, max_items=10000)

@cache.cache_result(ttl=1800)  # 30 minutes
def expensive_data_processing(data: Dict) -> Dict:
    # Simulate expensive computation
    time.sleep(0.1)
    return {"processed": True, "data": data}

# Cache statistics
print(cache.get_stats())

# Cache warming
def warm_common_data():
    common_queries = [
        {"type": "user", "id": 1},
        {"type": "user", "id": 2},
        {"type": "config", "key": "app"}
    ]
    for query in common_queries:
        expensive_data_processing(query)

warm_common_data()
```

Batch Processing Optimization:

```python
from typing import Dict, List

from moai_formats_data import DataValidator

# Optimized batch validation
def validate_batch_optimized(data_list: List[Dict], schema: Dict) -> List[Dict]:
    validator = DataValidator()

    # Pre-compile patterns for performance
    validator._compile_schema_patterns(schema)

    # Process in batches for memory efficiency
    batch_size = 1000
    results = []
    for i in range(0, len(data_list), batch_size):
        batch = data_list[i:i + batch_size]
        batch_results = [validator.validate(data, schema) for data in batch]
        results.extend(batch_results)

    return results
```

### Integration Patterns

LLM Integration:

```python
from typing import Dict

from moai_formats_data import TOONEncoder

# Prepare data for LLM consumption
def prepare_for_llm(data: Dict, max_tokens: int = 2000) -> str:
    encoder = TOONEncoder()
    toon_data = encoder.encode(data)

    # Rough whitespace-based token estimate
    estimated_tokens = len(toon_data.split())
    if estimated_tokens > max_tokens:
        # Apply a data reduction strategy
        reduced_data = reduce_data_complexity(data, max_tokens)
        toon_data = encoder.encode(reduced_data)

    return toon_data

def reduce_data_complexity(data: Dict, max_tokens: int) -> Dict:
    """Reduce data complexity to fit the token budget."""
    # Keep only priority fields first
    priority_fields = ["id", "name", "email", "status"]
    reduced = {k: v for k, v in data.items() if k in priority_fields}

    # Further reduction if needed
    encoder = TOONEncoder()
    while len(encoder.encode(reduced).split()) > max_tokens:
        # Drop fields until the estimate fits
        if len(reduced) <= 1:
            break
        reduced.popitem()

    return reduced
```

Database Integration:

```python
from typing import Dict, List

from moai_formats_data import JSONOptimizer, StreamProcessor

# Optimize database query results with format conversion
def optimize_db_response(db_data: List[Dict]) -> Dict:
    # Convert database results to an optimized format
    optimizer = JSONOptimizer()

    # Compress and cache the schema
    common_schema = {"type": "object", "properties": {"id": {"type": "integer"}}}
    compressed_schema = optimizer.compress_schema(common_schema)

    # Process in batches
    processor = StreamProcessor()
    processed_data = []

    for item in db_data:
        # Apply validation and transformation
        processed_item = transform_db_item(item)
        processed_data.append(processed_item)

    return {
        "data": processed_data,
        "count": len(processed_data),
        "schema": compressed_schema
    }
```
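The database example above calls a `transform_db_item` helper it never defines. A minimal sketch of what such a per-row transform might look like follows; the field names and defaults are hypothetical and depend on the actual table schema.

```python
def transform_db_item(item: dict) -> dict:
    """Illustrative per-row transform; adapt field names to the real schema."""
    return {
        "id": int(item["id"]),
        "name": item.get("name", ""),            # keep only fields the API exposes
        "status": item.get("status", "unknown"),
    }
```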
---

## Works Well With

- moai-domain-backend - Backend data serialization and API responses
- moai-domain-database - Database data format optimization
- moai-integration-mcp - MCP data serialization and transmission
- moai-docs-generation - Documentation data formatting
- moai-foundation-core - Core data architecture principles

---

## Module References

Core Implementation Modules:
- [`modules/toon-encoding.md`](./modules/toon-encoding.md) - TOON encoding implementation and examples
- [`modules/json-optimization.md`](./modules/json-optimization.md) - High-performance JSON/YAML processing
- [`modules/data-validation.md`](./modules/data-validation.md) - Advanced validation and schema management
- [`modules/caching-performance.md`](./modules/caching-performance.md) - Caching strategies and performance optimization

---

## Usage Examples

### CLI Usage

```bash
# Encode data to TOON format
moai-formats encode-toon --input data.json --output data.toon

# Validate data against schema
moai-formats validate --schema schema.json --data data.json

# Convert between formats
moai-formats convert --input data.json --output data.yaml --format yaml

# Optimize JSON structure
moai-formats optimize-json --input large-data.json --output optimized.json
```

### Python API

```python
from moai_formats_data import TOONEncoder, DataValidator, JSONOptimizer

# TOON encoding
encoder = TOONEncoder()
toon_data = encoder.encode({"user": "John", "age": 30})
original_data = encoder.decode(toon_data)

# Data validation
validator = DataValidator()
schema = validator.create_schema({
    "name": {"type": "string", "required": True, "min_length": 2},
    "email": {"type": "email", "required": True}
})
result = validator.validate({"name": "John", "email": "john@example.com"}, schema)

# JSON optimization
optimizer = JSONOptimizer()
fast_json = optimizer.serialize_fast(large_dataset)
parsed_data = optimizer.deserialize_fast(fast_json)
```

---

## Technology Stack

Core Libraries:
- orjson: Ultra-fast JSON parsing and serialization
- PyYAML: YAML processing with C-based loaders
- ijson: Streaming JSON parser for large files
- python-dateutil: Advanced datetime parsing
- regex: Advanced regular expression support

Performance Tools:
- lru_cache: Built-in memoization
- pickle: Object serialization
- hashlib: Hash generation for caching
- functools: Function decorators and utilities

Validation Libraries:
- jsonschema: JSON Schema validation
- cerberus: Lightweight data validation
- marshmallow: Object serialization/deserialization
- pydantic: Data validation using Python type hints

---

Status: Production Ready
Last Updated: 2025-11-30
Maintained by: MoAI-ADK Data Team