---
name: google-cloud-configs
description: Google Cloud Platform configuration templates for BigQuery ML and Vertex AI training with authentication setup, GPU/TPU configs, and cost estimation tools. Use when setting up GCP ML training, configuring BigQuery ML models, deploying Vertex AI training jobs, estimating GCP costs, configuring cloud authentication, selecting GPUs/TPUs for training, or when user mentions BigQuery ML, Vertex AI, GCP training, cloud ML setup, TPU training, or Google Cloud costs.
allowed-tools: Bash, Read, Write, Edit
---

Use when:
- Setting up BigQuery ML for SQL-based machine learning
- Configuring Vertex AI custom training jobs
- Setting up GCP authentication for ML workflows
- Selecting appropriate GPU/TPU configurations
- Estimating costs for GCP ML training
- Deploying models to Vertex AI endpoints
- Configuring distributed training on GCP
- Optimizing cost vs performance for cloud ML

## Platform Overview

### BigQuery ML

**What it is**: SQL-based machine learning directly in BigQuery
**Best for**:
- Quick ML prototypes using existing data warehouse data
- Classification, regression, forecasting on structured data
- Users familiar with SQL but not Python/ML frameworks
- Large-scale batch predictions

**Available Models**:
- Linear/Logistic Regression
- XGBoost (BOOSTED_TREE)
- Deep Neural Networks (DNN)
- AutoML Tables
- TensorFlow/PyTorch imported models

**Pricing**:
- Based on data processed (same as BigQuery queries)
- $5 per TB processed for analysis
- AutoML: $19.32/hour for training

### Vertex AI Training

**What it is**: Fully managed ML training platform
**Best for**:
- Custom PyTorch/TensorFlow training
- Large-scale distributed training
- GPU/TPU-accelerated workloads
- Production ML pipelines

**Available Compute**:
- **CPUs**: n1-standard, n1-highmem, n1-highcpu
- **GPUs**: NVIDIA T4, P4, V100, P100, A100, L4
- **TPUs**: v2, v3, v4, v5e (8 cores to 512 cores)

**Pricing**:
- CPU: $0.05-0.30/hour depending on machine type
- GPU T4: $0.35/hour
- GPU A100: $3.67/hour (40GB) or $4.95/hour (80GB)
- TPU v3: $8.00/hour (8 cores)
- TPU v4: $11.00/hour (8 cores)

## GPU/TPU Selection Guide

### GPU Selection (Vertex AI)

**T4 (16GB VRAM)**:
- Use case: Inference, light training, small models
- Cost: $0.35/hour
- Good for: BERT-base, small CNNs, inference serving

**V100 (16GB VRAM)**:
- Use case: Mid-size training, mixed precision training
- Cost: $2.48/hour
- Good for: ResNet training, medium transformers

**A100 (40GB/80GB VRAM)**:
- Use case: Large model training, distributed training
- Cost: $3.67/hour (40GB), $4.95/hour (80GB)
- Good for: GPT-style models, large vision models, multi-GPU training

**L4 (24GB VRAM)**:
- Use case: Modern alternative to T4, better performance
- Cost: $0.66/hour
- Good for: Mid-size models, efficient inference

### TPU Selection (Vertex AI)

**TPU v2 (8 cores)**:
- Use case: TensorFlow/JAX training, matrix operations
- Cost: $4.50/hour
- Memory: 8GB per core (64GB total)
- Good for: Legacy TensorFlow models

**TPU v3 (8 cores)**:
- Use case: Standard TPU training
- Cost: $8.00/hour
- Memory: 16GB per core (128GB total)
- Good for: BERT, T5, image classification

**TPU v4 (8 cores)**:
- Use case: Latest generation, best performance
- Cost: $11.00/hour
- Memory: 32GB per core (256GB total)
- Good for: Large language models, cutting-edge research

**TPU v5e (8 cores)**:
- Use case: Cost-optimized TPU
- Cost: $2.50/hour
- Good for: Development, training at scale on budget

**Multi-node TPU Pods**:
- v3-32: 32 cores, $32/hour
- v3-128: 128 cores, $128/hour
- v4-128: 128 cores, $176/hour
- Use for: Massive distributed training (GPT-3 scale)

## Usage

### Setup BigQuery ML Environment

```bash
bash scripts/setup-bigquery-ml.sh
```

**Prompts for**:
- GCP Project ID
- BigQuery dataset name
- Service account credentials
- Default model type preference

**Creates**:
- `bigquery_config.json` - Project configuration
- `.bigqueryrc` - CLI configuration
- Example training SQL in examples/

### Setup Vertex AI Training Environment

```bash
bash scripts/setup-vertex-ai.sh
```

**Prompts for**:
- GCP Project ID
- Region (us-central1, europe-west4, etc.)
- Service account credentials
- Default machine type
- GPU/TPU preference

**Creates**:
- `vertex_config.yaml` - Training job configuration
- `vertex_requirements.txt` - Python dependencies
- Training script template

### Configure GCP Authentication

```bash
bash scripts/configure-auth.sh
```

**Prompts for**:
- Authentication method (service account, user account, workload identity)
- Service account key path (if applicable)
- IAM roles needed

**Creates**:
- `.gcp_auth_config` - Authentication configuration
- Sets GOOGLE_APPLICATION_CREDENTIALS environment variable
- Validates permissions

**Required IAM Roles**:
- BigQuery ML: `roles/bigquery.dataEditor`, `roles/bigquery.jobUser`
- Vertex AI: `roles/aiplatform.user`, `roles/storage.objectAdmin`
- Both: `roles/serviceusage.serviceUsageConsumer`

### Estimate GCP Training Costs

```bash
bash scripts/estimate-gcp-cost.sh
```

**Interactive prompts**:
- Platform: BigQuery ML or Vertex AI
- If BigQuery ML: Data size to process
- If Vertex AI:
  - Machine type (CPU/GPU/TPU)
  - Number of machines
  - Training duration estimate
  - Storage requirements

**Output**:
- Estimated compute cost
- Storage cost
- Data transfer cost (if applicable)
- Total estimated cost
- Cost comparison with other GCP options

## Templates

### BigQuery ML Training Template (`templates/bigquery_ml_training.sql`)

SQL template for creating and training models:
- Model creation syntax
- Feature engineering examples
- Training options (L1/L2 reg, learning rate, etc.)
- Evaluation queries
- Prediction queries

**Supported model types**:
- LINEAR_REG, LOGISTIC_REG
- BOOSTED_TREE_CLASSIFIER, BOOSTED_TREE_REGRESSOR
- DNN_CLASSIFIER, DNN_REGRESSOR
- AUTOML_CLASSIFIER, AUTOML_REGRESSOR

### Vertex AI Training Job Template (`templates/vertex_training_job.py`)

Python template for custom training:
- Training loop structure
- Distributed training setup (PyTorch DDP)
- Checkpointing and model saving
- Metrics logging to Vertex AI
- Hyperparameter tuning integration

**Includes**:
- Single GPU training
- Multi-GPU training (DataParallel, DistributedDataParallel)
- TPU training with PyTorch/XLA
- Cloud Storage integration

### GPU Configuration Template (`templates/vertex_gpu_config.yaml`)

YAML configuration for GPU training jobs:
- Machine type selection
- GPU type and count
- Disk configuration
- Network configuration
- Environment variables

**Presets included**:
- Single T4 (budget)
- Single A100 (standard)
- 4x A100 (distributed)
- 8x A100 (large-scale)

### TPU Configuration Template (`templates/vertex_tpu_config.yaml`)

YAML configuration for TPU training jobs:
- TPU type and topology
- TPU version selection
- JAX/TensorFlow runtime
- XLA compilation flags

**Presets included**:
- v3-8 (single TPU)
- v4-32 (TPU pod slice)
- v5e-8 (cost-optimized)

### GCP Authentication Template (`templates/gcp_auth.json`)

Service account configuration template:
- Project ID
- Service account email
- Key file path
- Required scopes
- IAM role assignments

**Security notes**:
- Uses placeholders only (never real keys)
- Documents how to create service accounts
- Includes `.gitignore` protection

## Examples

### BigQuery ML Regression Example (`examples/bigquery-regression-example.sql`)

Complete example:
- Dataset: NYC taxi trip data
- Task: Predict trip duration
- Model: BOOSTED_TREE_REGRESSOR
- Includes feature engineering, training, evaluation

**Demonstrates**:
- CREATE MODEL syntax
- TRANSFORM clause for feature engineering
- MODEL evaluation
- Batch predictions

### Vertex AI PyTorch Training Example (`examples/vertex-pytorch-training.py`)

Complete training script:
- Dataset: IMDB sentiment analysis
- Model: DistilBERT fine-tuning
- Training: Single GPU
- Logging: Vertex AI experiments

**Demonstrates**:
- Loading data from GCS
- Training loop with mixed precision
- Checkpointing to GCS
- Metrics logging
- Model export to Vertex AI

### Vertex AI Distributed Training Example (`examples/vertex-distributed-training.py`)

Multi-GPU training example:
- Dataset: ImageNet subset
- Model: ResNet-50
- Training: 4x A100 with DDP
- Scaling: Linear scaling rule

**Demonstrates**:
- PyTorch DistributedDataParallel
- Gradient accumulation
- Learning rate scaling
- Synchronized batch norm
- Multi-node coordination

### Hugging Face Fine-tuning on Vertex AI (`examples/vertex-huggingface-finetuning.py`)

Production fine-tuning template:
- Dataset: Custom text classification
- Model: BERT/RoBERTa/DeBERTa
- Training: Hugging Face Trainer API
- Deployment: Vertex AI endpoint

**Demonstrates**:
- Hugging Face Trainer integration
- Hyperparameter tuning with Vertex AI
- Model versioning
- Endpoint deployment
- Online predictions

## Cost Optimization Tips

### BigQuery ML

**Reduce data processed**:
- Use partitioned tables
- Filter data in WHERE clause before training
- Use table sampling for experimentation
- Cache intermediate results

**Use appropriate model types**:
- Start with LINEAR_REG/LOGISTIC_REG (cheapest)
- Use BOOSTED_TREE for better accuracy at moderate cost
- Reserve AutoML for when simpler models fail

**Optimize queries**:
- Avoid SELECT * (specify columns)
- Use clustering on filter columns
- Materialize views for repeated training

### Vertex AI

**Machine type selection**:
- Start with CPU for prototyping
- Use T4 for small models (cheapest GPU)
- Use A100 only for large models that need it
- Consider TPU v5e for TensorFlow/JAX (very cost-effective)

**Training optimization**:
- Use preemptible instances (60-70% cheaper, can be interrupted)
- Enable automatic checkpoint/resume for preemptible
- Use mixed precision training (FP16/BF16) for faster training
- Profile to eliminate CPU bottlenecks

**Storage optimization**:
- Store datasets in Cloud Storage (cheaper than persistent disk)
- Use Filestore only if needed for POSIX filesystem
- Clean up old model artifacts
- Use lifecycle policies to archive old data

**Multi-GPU efficiency**:
- Ensure near-linear scaling before adding more GPUs
- Profile inter-GPU communication
- Use gradient accumulation instead of larger batch sizes
- Consider 2x GPUs instead of 1x larger GPU (often same cost, better availability)

## Integration with ML Training Plugin

This skill integrates with other ml-training components:

- **training-patterns**: Provides GCP configs for generated training scripts
- **cost-calculator**: Uses GCP pricing data for budget planning
- **monitoring-dashboard**: Integrates with Vertex AI TensorBoard
- **validation-scripts**: Validates GCP credentials and permissions
- **integration-helpers**: Deploys trained models to Vertex AI endpoints

## Common Workflows

### Workflow 1: Quick BigQuery ML Prototype

1. Run `bash scripts/setup-bigquery-ml.sh`
2. Copy `templates/bigquery_ml_training.sql` to your project
3. Modify SQL for your dataset and features
4. Run training query in BigQuery console
5. Evaluate with built-in ML.EVALUATE()
6. Export predictions with ML.PREDICT()

**Time**: 30 minutes setup + training time
**Cost**: $5 per TB of data processed

### Workflow 2: Custom PyTorch Training on Vertex AI

1. Run `bash scripts/configure-auth.sh`
2. Run `bash scripts/setup-vertex-ai.sh`
3. Copy `templates/vertex_training_job.py`
4. Customize training loop for your model
5. Copy `templates/vertex_gpu_config.yaml`
6. Submit job: `gcloud ai custom-jobs create ...`
7. Monitor in Vertex AI console

**Time**: 1 hour setup + training time
**Cost**: Depends on GPU/TPU selection

### Workflow 3: Large-Scale Distributed Training

1. Setup Vertex AI (workflow 2)
2. Copy `examples/vertex-distributed-training.py`
3. Adapt for your model architecture
4. Test locally with 1 GPU
5. Test with 2 GPUs to verify scaling
6. Scale to 4-8 GPUs for full training
7. Use preemptible instances with checkpointing

**Time**: 2-4 hours setup + training time
**Cost**: $15-60/hour depending on GPU count

## Troubleshooting

### BigQuery ML Issues

**"Insufficient permissions"**:
- Verify `roles/bigquery.dataEditor` and `roles/bigquery.jobUser`
- Check dataset-level permissions
- Ensure billing is enabled

**"Model training failed"**:
- Check for NULL values in features
- Verify data types match model expectations
- Review feature engineering TRANSFORM clause
- Check for sufficient training data

### Vertex AI Issues

**"Service account lacks permissions"**:
- Verify `roles/aiplatform.user`
- Add `roles/storage.objectAdmin` for GCS access
- Check project-level IAM policies

**"GPU/TPU quota exceeded"**:
- Request quota increase in GCP console
- Use different region with availability
- Start with smaller GPU/TPU configuration
- Use preemptible instances (separate quota)

**"Training job crashes"**:
- Check for CUDA OOM (reduce batch size)
- Verify dependencies in requirements.txt
- Review logs in Cloud Logging
- Test locally before submitting to Vertex

## Security Best Practices

### Credentials Management

**DO**:
- ✅ Use service accounts with minimal permissions
- ✅ Store credentials in Secret Manager
- ✅ Use Workload Identity for GKE deployments
- ✅ Rotate service account keys regularly
- ✅ Add `.gitignore` for `*.json` key files

**DON'T**:
- ❌ Hardcode credentials in code
- ❌ Commit service account keys to git
- ❌ Use overly permissive roles (e.g., Owner)
- ❌ Share service account keys across projects
- ❌ Use personal credentials for production

### IAM Best Practices

- Use separate service accounts for training vs serving
- Grant roles at resource level, not project level when possible
- Use Workload Identity Federation instead of keys when possible
- Enable Cloud Audit Logs for ML API usage
- Review IAM permissions quarterly

## Performance Benchmarks

### BigQuery ML vs Vertex AI

**BigQuery ML**:
- Best for: Structured data, SQL users, quick prototypes
- Training time: Minutes to hours (depends on data size)
- Scalability: Automatic (serverless)
- Cost: $5/TB processed

**Vertex AI Custom Training**:
- Best for: Deep learning, custom architectures, GPU/TPU workloads
- Training time: Hours to days (configurable hardware)
- Scalability: Manual (choose machine type)
- Cost: $0.35-20/hour depending on hardware

**Rule of thumb**:
- Use BigQuery ML for tabular data with < 100M rows
- Use Vertex AI for images, text, audio, or custom models
- Use Vertex AI for models requiring GPU/TPU acceleration

## Additional Resources

- **GCP ML Documentation**: https://cloud.google.com/vertex-ai/docs
- **BigQuery ML Reference**: https://cloud.google.com/bigquery-ml/docs
- **Pricing Calculator**: https://cloud.google.com/products/calculator
- **TPU Best Practices**: https://cloud.google.com/tpu/docs/best-practices
- **Vertex AI Samples**: https://github.com/GoogleCloudPlatform/vertex-ai-samples