---
name: google-cloud-configs
description: Google Cloud Platform configuration templates for BigQuery ML and Vertex AI training with authentication setup, GPU/TPU configs, and cost estimation tools. Use when setting up GCP ML training, configuring BigQuery ML models, deploying Vertex AI training jobs, estimating GCP costs, configuring cloud authentication, selecting GPUs/TPUs for training, or when user mentions BigQuery ML, Vertex AI, GCP training, cloud ML setup, TPU training, or Google Cloud costs.
allowed-tools: Bash, Read, Write, Edit
---

Use when:
- Setting up BigQuery ML for SQL-based machine learning
- Configuring Vertex AI custom training jobs
- Setting up GCP authentication for ML workflows
- Selecting appropriate GPU/TPU configurations
- Estimating costs for GCP ML training
- Deploying models to Vertex AI endpoints
- Configuring distributed training on GCP
- Optimizing cost vs performance for cloud ML

## Platform Overview

### BigQuery ML

**What it is**: SQL-based machine learning directly in BigQuery

**Best for**:
- Quick ML prototypes using existing data warehouse data
- Classification, regression, forecasting on structured data
- Users familiar with SQL but not Python/ML frameworks
- Large-scale batch predictions

**Available Models**:
- Linear/Logistic Regression
- XGBoost (BOOSTED_TREE)
- Deep Neural Networks (DNN)
- AutoML Tables
- Imported TensorFlow and ONNX models (PyTorch models via ONNX export)

**Pricing**:
- Based on data processed (same as BigQuery queries)
- $5 per TB processed for analysis
- AutoML: $19.32/hour for training

### Vertex AI Training

**What it is**: Fully managed ML training platform

**Best for**:
- Custom PyTorch/TensorFlow training
- Large-scale distributed training
- GPU/TPU-accelerated workloads
- Production ML pipelines

**Available Compute**:
- **CPUs**: n1-standard, n1-highmem, n1-highcpu
- **GPUs**: NVIDIA T4, P4, V100, P100, A100, L4
- **TPUs**: v2, v3, v4, v5e (8 cores to 512 cores)

**Pricing**:
- CPU: $0.05-0.30/hour depending on machine type
- GPU T4: $0.35/hour
- GPU A100: $3.67/hour (40GB) or $4.95/hour (80GB)
- TPU v3: $8.00/hour (8 cores)
- TPU v4: $11.00/hour (8 cores)

## GPU/TPU Selection Guide

### GPU Selection (Vertex AI)

**T4 (16GB VRAM)**:
- Use case: Inference, light training, small models
- Cost: $0.35/hour
- Good for: BERT-base, small CNNs, inference serving

**V100 (16GB VRAM)**:
- Use case: Mid-size training, mixed precision training
- Cost: $2.48/hour
- Good for: ResNet training, medium transformers

**A100 (40GB/80GB VRAM)**:
- Use case: Large model training, distributed training
- Cost: $3.67/hour (40GB), $4.95/hour (80GB)
- Good for: GPT-style models, large vision models, multi-GPU training

**L4 (24GB VRAM)**:
- Use case: Modern alternative to T4, better performance
- Cost: $0.66/hour
- Good for: Mid-size models, efficient inference

### TPU Selection (Vertex AI)

**TPU v2 (8 cores)**:
- Use case: TensorFlow/JAX training, matrix operations
- Cost: $4.50/hour
- Memory: 8GB per core (64GB total)
- Good for: Legacy TensorFlow models

**TPU v3 (8 cores)**:
- Use case: Standard TPU training
- Cost: $8.00/hour
- Memory: 16GB per core (128GB total)
- Good for: BERT, T5, image classification

**TPU v4 (8 cores)**:
- Use case: Latest generation, best performance
- Cost: $11.00/hour
- Memory: 32GB per core (256GB total)
- Good for: Large language models, cutting-edge research

**TPU v5e (8 cores)**:
- Use case: Cost-optimized TPU
- Cost: $2.50/hour
- Good for: Development, training at scale on budget

**Multi-node TPU Pods**:
- v3-32: 32 cores, $32/hour
- v3-128: 128 cores, $128/hour
- v4-128: 128 cores, $176/hour
- Use for: Massive distributed training (GPT-3 scale)

## Usage

### Set Up BigQuery ML Environment

```bash
bash scripts/setup-bigquery-ml.sh
```

**Prompts for**:
- GCP Project ID
- BigQuery dataset name
- Service account credentials
- Default model type preference

**Creates**:
- `bigquery_config.json` - Project configuration
- `.bigqueryrc` - CLI configuration
- Example training SQL in `examples/`

### Set Up Vertex AI Training Environment

```bash
bash scripts/setup-vertex-ai.sh
```

**Prompts for**:
- GCP Project ID
- Region (us-central1, europe-west4, etc.)
- Service account credentials
- Default machine type
- GPU/TPU preference

**Creates**:
- `vertex_config.yaml` - Training job configuration
- `vertex_requirements.txt` - Python dependencies
- Training script template

### Configure GCP Authentication

```bash
bash scripts/configure-auth.sh
```

**Prompts for**:
- Authentication method (service account, user account, workload identity)
- Service account key path (if applicable)
- IAM roles needed

**Creates**:
- `.gcp_auth_config` - Authentication configuration
- Sets `GOOGLE_APPLICATION_CREDENTIALS` environment variable
- Validates permissions

**Required IAM Roles**:
- BigQuery ML: `roles/bigquery.dataEditor`, `roles/bigquery.jobUser`
- Vertex AI: `roles/aiplatform.user`, `roles/storage.objectAdmin`
- Both: `roles/serviceusage.serviceUsageConsumer`

### Estimate GCP Training Costs

```bash
bash scripts/estimate-gcp-cost.sh
```

**Interactive prompts**:
- Platform: BigQuery ML or Vertex AI
- If BigQuery ML: Data size to process
- If Vertex AI:
  - Machine type (CPU/GPU/TPU)
  - Number of machines
  - Training duration estimate
  - Storage requirements

**Output**:
- Estimated compute cost
- Storage cost
- Data transfer cost (if applicable)
- Total estimated cost
- Cost comparison with other GCP options

## Templates

### BigQuery ML Training Template (`templates/bigquery_ml_training.sql`)

SQL template for creating and training models:
- Model creation syntax
- Feature engineering examples
- Training options (L1/L2 regularization, learning rate, etc.)
- Evaluation queries
- Prediction queries

**Supported model types**:
- LINEAR_REG, LOGISTIC_REG
- BOOSTED_TREE_CLASSIFIER, BOOSTED_TREE_REGRESSOR
- DNN_CLASSIFIER, DNN_REGRESSOR
- AUTOML_CLASSIFIER, AUTOML_REGRESSOR

### Vertex AI Training Job Template (`templates/vertex_training_job.py`)

Python template for custom training:
- Training loop structure
- Distributed training setup (PyTorch DDP)
- Checkpointing and model saving
- Metrics logging to Vertex AI
- Hyperparameter tuning integration

**Includes**:
- Single GPU training
- Multi-GPU training (DataParallel, DistributedDataParallel)
- TPU training with PyTorch/XLA
- Cloud Storage integration

### GPU Configuration Template (`templates/vertex_gpu_config.yaml`)

YAML configuration for GPU training jobs:
- Machine type selection
- GPU type and count
- Disk configuration
- Network configuration
- Environment variables

**Presets included**:
- Single T4 (budget)
- Single A100 (standard)
- 4x A100 (distributed)
- 8x A100 (large-scale)

### TPU Configuration Template (`templates/vertex_tpu_config.yaml`)

YAML configuration for TPU training jobs:
- TPU type and topology
- TPU version selection
- JAX/TensorFlow runtime
- XLA compilation flags

**Presets included**:
- v3-8 (single TPU)
- v4-32 (TPU pod slice)
- v5e-8 (cost-optimized)

### GCP Authentication Template (`templates/gcp_auth.json`)

Service account configuration template:
- Project ID
- Service account email
- Key file path
- Required scopes
- IAM role assignments

**Security notes**:
- Uses placeholders only (never real keys)
- Documents how to create service accounts
- Includes `.gitignore` protection

## Examples

### BigQuery ML Regression Example (`examples/bigquery-regression-example.sql`)

Complete example:
- Dataset: NYC taxi trip data
- Task: Predict trip duration
- Model: BOOSTED_TREE_REGRESSOR
- Includes feature engineering, training, evaluation

**Demonstrates**:
- CREATE MODEL syntax
- TRANSFORM clause for feature engineering
- Model evaluation
- Batch predictions

### Vertex AI PyTorch Training Example (`examples/vertex-pytorch-training.py`)

Complete training script:
- Dataset: IMDB sentiment analysis
- Model: DistilBERT fine-tuning
- Training: Single GPU
- Logging: Vertex AI experiments

**Demonstrates**:
- Loading data from GCS
- Training loop with mixed precision
- Checkpointing to GCS
- Metrics logging
- Model export to Vertex AI

### Vertex AI Distributed Training Example (`examples/vertex-distributed-training.py`)

Multi-GPU training example:
- Dataset: ImageNet subset
- Model: ResNet-50
- Training: 4x A100 with DDP
- Scaling: Linear scaling rule

**Demonstrates**:
- PyTorch DistributedDataParallel
- Gradient accumulation
- Learning rate scaling
- Synchronized batch norm
- Multi-node coordination

### Hugging Face Fine-tuning on Vertex AI (`examples/vertex-huggingface-finetuning.py`)

Production fine-tuning template:
- Dataset: Custom text classification
- Model: BERT/RoBERTa/DeBERTa
- Training: Hugging Face Trainer API
- Deployment: Vertex AI endpoint

**Demonstrates**:
- Hugging Face Trainer integration
- Hyperparameter tuning with Vertex AI
- Model versioning
- Endpoint deployment
- Online predictions

## Cost Optimization Tips

### BigQuery ML

**Reduce data processed**:
- Use partitioned tables
- Filter data in the WHERE clause before training
- Use table sampling for experimentation
- Cache intermediate results

**Use appropriate model types**:
- Start with LINEAR_REG/LOGISTIC_REG (cheapest)
- Use BOOSTED_TREE for better accuracy at moderate cost
- Reserve AutoML for when simpler models fail

**Optimize queries**:
- Avoid SELECT * (specify columns)
- Use clustering on filter columns
- Materialize views for repeated training

### Vertex AI

**Machine type selection**:
- Start with CPU for prototyping
- Use T4 for small models (cheapest GPU)
- Use A100 only for large models that need it
- Consider TPU v5e for TensorFlow/JAX (very cost-effective)

**Training optimization**:
- Use preemptible instances (60-70% cheaper, can be interrupted)
- Enable automatic checkpoint/resume for preemptible jobs
- Use mixed precision training (FP16/BF16) for faster training
- Profile to eliminate CPU bottlenecks

**Storage optimization**:
- Store datasets in Cloud Storage (cheaper than persistent disk)
- Use Filestore only if a POSIX filesystem is needed
- Clean up old model artifacts
- Use lifecycle policies to archive old data

**Multi-GPU efficiency**:
- Ensure near-linear scaling before adding more GPUs
- Profile inter-GPU communication
- Use gradient accumulation instead of larger batch sizes
- Consider 2x smaller GPUs instead of 1x larger GPU (often same cost, better availability)

## Integration with ML Training Plugin

This skill integrates with other ml-training components:
- **training-patterns**: Provides GCP configs for generated training scripts
- **cost-calculator**: Uses GCP pricing data for budget planning
- **monitoring-dashboard**: Integrates with Vertex AI TensorBoard
- **validation-scripts**: Validates GCP credentials and permissions
- **integration-helpers**: Deploys trained models to Vertex AI endpoints

## Common Workflows

### Workflow 1: Quick BigQuery ML Prototype

1. Run `bash scripts/setup-bigquery-ml.sh`
2. Copy `templates/bigquery_ml_training.sql` to your project
3. Modify SQL for your dataset and features
4. Run training query in BigQuery console
5. Evaluate with built-in ML.EVALUATE()
6. Export predictions with ML.PREDICT()

**Time**: 30 minutes setup + training time

**Cost**: $5 per TB of data processed

### Workflow 2: Custom PyTorch Training on Vertex AI

1. Run `bash scripts/configure-auth.sh`
2. Run `bash scripts/setup-vertex-ai.sh`
3. Copy `templates/vertex_training_job.py`
4. Customize training loop for your model
5. Copy `templates/vertex_gpu_config.yaml`
6. Submit job: `gcloud ai custom-jobs create ...`
7. Monitor in Vertex AI console

**Time**: 1 hour setup + training time

**Cost**: Depends on GPU/TPU selection

### Workflow 3: Large-Scale Distributed Training

1. Set up Vertex AI (Workflow 2)
2. Copy `examples/vertex-distributed-training.py`
3. Adapt for your model architecture
4. Test locally with 1 GPU
5. Test with 2 GPUs to verify scaling
6. Scale to 4-8 GPUs for full training
7. Use preemptible instances with checkpointing

**Time**: 2-4 hours setup + training time

**Cost**: $15-60/hour depending on GPU count

## Troubleshooting

### BigQuery ML Issues

**"Insufficient permissions"**:
- Verify `roles/bigquery.dataEditor` and `roles/bigquery.jobUser`
- Check dataset-level permissions
- Ensure billing is enabled

**"Model training failed"**:
- Check for NULL values in features
- Verify data types match model expectations
- Review feature engineering TRANSFORM clause
- Check for sufficient training data

### Vertex AI Issues

**"Service account lacks permissions"**:
- Verify `roles/aiplatform.user`
- Add `roles/storage.objectAdmin` for GCS access
- Check project-level IAM policies

**"GPU/TPU quota exceeded"**:
- Request quota increase in GCP console
- Use a different region with availability
- Start with smaller GPU/TPU configuration
- Use preemptible instances (separate quota)

**"Training job crashes"**:
- Check for CUDA OOM (reduce batch size)
- Verify dependencies in requirements.txt
- Review logs in Cloud Logging
- Test locally before submitting to Vertex AI

## Security Best Practices

### Credentials Management

**DO**:
- ✅ Use service accounts with minimal permissions
- ✅ Store credentials in Secret Manager
- ✅ Use Workload Identity for GKE deployments
- ✅ Rotate service account keys regularly
- ✅ Add `.gitignore` for `*.json` key files

**DON'T**:
- ❌ Hardcode credentials in code
- ❌ Commit service account keys to git
- ❌ Use overly permissive roles (e.g., Owner)
- ❌ Share service account keys across projects
- ❌ Use personal credentials for production

### IAM Best Practices

- Use separate service accounts for training vs serving
- Grant roles at resource level, not project level, when possible
- Use Workload Identity Federation instead of keys when possible
- Enable Cloud Audit Logs for ML API usage
- Review IAM permissions quarterly

## Performance Benchmarks

### BigQuery ML vs Vertex AI

**BigQuery ML**:
- Best for: Structured data, SQL users, quick prototypes
- Training time: Minutes to hours (depends on data size)
- Scalability: Automatic (serverless)
- Cost: $5/TB processed

**Vertex AI Custom Training**:
- Best for: Deep learning, custom architectures, GPU/TPU workloads
- Training time: Hours to days (configurable hardware)
- Scalability: Manual (choose machine type)
- Cost: $0.35-20/hour depending on hardware

**Rule of thumb**:
- Use BigQuery ML for tabular data with < 100M rows
- Use Vertex AI for images, text, audio, or custom models
- Use Vertex AI for models requiring GPU/TPU acceleration

## Additional Resources

- **GCP ML Documentation**: https://cloud.google.com/vertex-ai/docs
- **BigQuery ML Reference**: https://cloud.google.com/bigquery-ml/docs
- **Pricing Calculator**: https://cloud.google.com/products/calculator
- **TPU Best Practices**: https://cloud.google.com/tpu/docs/best-practices
- **Vertex AI Samples**: https://github.com/GoogleCloudPlatform/vertex-ai-samples
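
## Worked Example: Rough Cost Arithmetic

The per-hour rates in the GPU/TPU Selection Guide and the 60-70% preemptible discount mentioned under Cost Optimization Tips can be combined into a back-of-the-envelope estimate. The sketch below is illustrative only: the names `HOURLY_RATE_USD`, `vertex_compute_cost`, and `bigquery_ml_cost` are hypothetical, it is not the logic of `scripts/estimate-gcp-cost.sh`, and it uses the figures quoted in this document rather than live GCP pricing. It covers accelerator time only and ignores host VM, storage, and data transfer charges.

```python
"""Rough cost arithmetic using the rates quoted in this document (not live pricing)."""

# Hourly accelerator rates in USD, copied from the GPU/TPU Selection Guide.
# Dictionary keys are hypothetical labels chosen for this sketch.
HOURLY_RATE_USD = {
    "T4": 0.35,
    "L4": 0.66,
    "V100": 2.48,
    "A100_40GB": 3.67,
    "A100_80GB": 4.95,
    "TPU_V2_8": 4.50,
    "TPU_V3_8": 8.00,
    "TPU_V4_8": 11.00,
    "TPU_V5E_8": 2.50,
}

# "$5 per TB processed" from the BigQuery ML pricing notes.
BIGQUERY_USD_PER_TB = 5.00


def vertex_compute_cost(accelerator, count, hours, preemptible_discount=0.0):
    """Return accelerator cost in USD for `count` devices running `hours` hours.

    `preemptible_discount` is a fraction (0.6 means 60% cheaper); the document
    quotes 60-70% for preemptible instances.
    """
    if accelerator not in HOURLY_RATE_USD:
        raise KeyError(f"no rate listed for {accelerator!r}")
    if not 0.0 <= preemptible_discount < 1.0:
        raise ValueError("preemptible_discount must be in [0, 1)")
    rate = HOURLY_RATE_USD[accelerator]
    return rate * count * hours * (1.0 - preemptible_discount)


def bigquery_ml_cost(tb_processed):
    """Return the on-demand cost in USD for processing `tb_processed` terabytes."""
    return BIGQUERY_USD_PER_TB * tb_processed


if __name__ == "__main__":
    on_demand = vertex_compute_cost("A100_40GB", count=4, hours=10)
    preemptible = vertex_compute_cost(
        "A100_40GB", count=4, hours=10, preemptible_discount=0.60
    )
    print(f"4x A100 40GB, 10 h, on-demand:   ${on_demand:.2f}")
    print(f"4x A100 40GB, 10 h, preemptible: ${preemptible:.2f}")
    print(f"BigQuery ML, 2.5 TB processed:   ${bigquery_ml_cost(2.5):.2f}")
```

With the quoted rates, four A100 40GB GPUs for 10 hours come to about $146.80 on demand, which matches the lower end of the "$15-60/hour" range given for Workflow 3. Treat the result as a planning figure and confirm it with the Pricing Calculator before budgeting.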