# πŸ—„οΈ Skill: Database Reliability ## πŸ“‹ Metadata | Atributo | Valor | |----------|-------| | **ID** | `sre-database-reliability` | | **Nivel** | πŸ”΄ Avanzado | | **VersiΓ³n** | 1.0.0 | | **Keywords** | `database`, `postgresql`, `mysql`, `mongodb`, `replication`, `backup`, `failover`, `connection-pooling` | | **Referencia** | [PostgreSQL High Availability](https://www.postgresql.org/docs/current/high-availability.html) | ## πŸ”‘ Keywords para InvocaciΓ³n - `database-reliability` - `database-replication` - `database-backup` - `failover` - `connection-pooling` - `database-monitoring` - `@skill:database-reliability` ### Ejemplos de Prompts ``` Implementa database replication y failover para PostgreSQL ``` ``` Configura backup y recovery strategy para base de datos ``` ``` Setup connection pooling y database monitoring ``` ``` @skill:database-reliability - Alta disponibilidad de base de datos ``` ## πŸ“– DescripciΓ³n Database reliability es fundamental para servicios confiables. Este skill cubre replication, failover, backups, connection pooling, monitoring, y disaster recovery para bases de datos en producciΓ³n. ### βœ… CuΓ‘ndo Usar Este Skill - Bases de datos en producciΓ³n - Servicios con requisitos de alta disponibilidad - Datos crΓ­ticos - SLAs estrictos - Compliance requirements ### ❌ CuΓ‘ndo NO Usar Este Skill - Desarrollo local con datos de prueba - Prototipos sin datos reales - Sistemas de solo lectura sin disponibilidad crΓ­tica ## πŸ—οΈ Arquitectura de Alta Disponibilidad ``` β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Application Layer β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ App β”‚ β”‚ App β”‚ β”‚ App β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β” β”‚ Connection β”‚ β”‚ Pool β”‚ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β” β”‚Primary β”‚ β”‚Replica β”‚ β”‚ Replica β”‚ β”‚(Master) β”‚ β”‚(Read) β”‚ β”‚ (Read) β”‚ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ Replication β”‚ β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β” β”‚ Standby β”‚ β”‚ (DR) β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` ## πŸ’» ImplementaciΓ³n > **πŸ“ Scripts Ejecutables:** Este skill incluye scripts Python ejecutables en la carpeta [`scripts/`](scripts/): > - [`postgresql_backup.py`](scripts/postgresql_backup.py) - AutomatizaciΓ³n de backups y restauraciΓ³n > - [`requirements.txt`](scripts/requirements.txt) - Dependencias (ninguna, usa stdlib) > > Ver [`scripts/README.md`](scripts/README.md) para documentaciΓ³n de uso. ### 1. PostgreSQL Replication Setup #### 1.1 Primary Configuration ```conf # postgresql.conf (Primary) # Connection Settings listen_addresses = '*' max_connections = 200 # Replication Settings wal_level = replica max_wal_senders = 3 max_replication_slots = 3 hot_standby = on # Performance shared_buffers = 4GB effective_cache_size = 12GB maintenance_work_mem = 1GB checkpoint_completion_target = 0.9 wal_buffers = 16MB default_statistics_target = 100 random_page_cost = 1.1 effective_io_concurrency = 200 work_mem = 20MB min_wal_size = 1GB max_wal_size = 4GB # Logging log_destination = 'stderr' logging_collector = on log_directory = 'log' log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' log_rotation_age = 1d log_rotation_size = 100MB log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h ' log_timezone = 'UTC' ``` ```conf # pg_hba.conf (Primary) # Allow replication from replicas host replication replicator 10.0.0.0/24 md5 ``` #### 1.2 Replica Configuration ```bash # Setup replication (on replica server) # 1. Stop PostgreSQL sudo systemctl stop postgresql # 2. Backup primary pg_basebackup -h primary-host -D /var/lib/postgresql/data -U replicator -P -W -R -S replication_slot_1 # 3. Configure recovery # postgresql.conf (Replica) hot_standby = on max_standby_streaming_delay = 30s wal_receiver_status_interval = 10s hot_standby_feedback = on ``` ```conf # postgresql.auto.conf (Auto-generated) primary_conninfo = 'host=primary-host port=5432 user=replicator password=xxx application_name=replica1' primary_slot_name = 'replication_slot_1' ``` #### 1.3 Replication Slot Management ```sql -- Create replication slot SELECT pg_create_physical_replication_slot('replication_slot_1'); -- List replication slots SELECT * FROM pg_replication_slots; -- Monitor replication lag SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, sync_state, sync_priority FROM pg_stat_replication; -- Check lag in bytes SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes, client_addr, application_name FROM pg_stat_replication; ``` ### 2. Automatic Failover (Patroni) ```yaml # patroni/patroni.yml scope: postgres-cluster namespace: /db/ name: postgres-node1 restapi: listen: 0.0.0.0:8008 connect_address: 10.0.1.1:8008 auth: 'username:password' bootstrap: dcs: ttl: 30 loop_wait: 10 retry_timeout: 30 maximum_lag_on_failover: 1048576 postgresql: use_pg_rewind: true parameters: max_connections: 200 max_worker_processes: 8 wal_level: replica hot_standby: on wal_keep_segments: 8 max_wal_senders: 3 max_replication_slots: 3 synchronous_commit: 'on' synchronous_standby_names: 'ANY 1 (postgres-node2,postgres-node3)' initdb: - encoding: UTF8 - data-checksums pg_hba: - host replication replicator 0.0.0.0/0 md5 - host all all 0.0.0.0/0 md5 users: admin: password: admin options: - createrole - createdb postgresql: listen: 0.0.0.0:5432 connect_address: 10.0.1.1:5432 data_dir: /var/lib/postgresql/data pgpass: /tmp/pgpass authentication: replication: username: replicator password: replicator superuser: username: postgres password: postgres parameters: unix_socket_directories: '/var/run/postgresql' recovery_conf: restore_command: 'cp /backup/%f %p' tags: nofailover: false noloadbalance: false clonefrom: false nosync: false ``` ### 3. Connection Pooling (PgBouncer) ```ini # pgbouncer/pgbouncer.ini [databases] mydb = host=primary-host port=5432 dbname=mydb [pgbouncer] listen_addr = 0.0.0.0 listen_port = 6432 auth_type = md5 auth_file = /etc/pgbouncer/userlist.txt pool_mode = transaction max_client_conn = 1000 default_pool_size = 25 min_pool_size = 5 reserve_pool_size = 5 reserve_pool_timeout = 3 max_db_connections = 100 max_user_connections = 50 server_round_robin = 1 ignore_startup_parameters = extra_float_digits # Logging log_connections = 1 log_disconnections = 1 log_pooler_errors = 1 # Admin console admin_users = admin stats_users = stats # Health check server_check_query = SELECT 1 server_check_delay = 30 server_lifetime = 3600 server_idle_timeout = 600 # Connection limits max_client_conn = 1000 default_pool_size = 25 min_pool_size = 5 ``` ### 4. Backup Strategy ```bash #!/bin/bash # scripts/backup.sh BACKUP_DIR="/backup/postgresql" DATE=$(date +%Y%m%d_%H%M%S) RETENTION_DAYS=30 # Full backup pg_basebackup \ -h localhost \ -D "$BACKUP_DIR/full_$DATE" \ -U backup_user \ -Ft \ -z \ -P # WAL archiving (continuous) # In postgresql.conf: # archive_mode = on # archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f' # Cleanup old backups find "$BACKUP_DIR" -type d -mtime +$RETENTION_DAYS -exec rm -rf {} \; # Verify backup pg_verifybackup -D "$BACKUP_DIR/full_$DATE" ``` **Script ejecutable:** [`scripts/postgresql_backup.py`](scripts/postgresql_backup.py) Script CLI completo para automatizar backups, restauraciΓ³n y limpieza de PostgreSQL. #### CuΓ‘ndo ejecutar - **Backups programados:** Como parte de cron jobs o tareas automatizadas - **RestauraciΓ³n de emergencia:** Para restaurar bases de datos desde backups - **Limpieza de backups:** Para mantener el espacio de almacenamiento bajo control - **VerificaciΓ³n de backups:** Para validar la integridad de backups antes de restaurar #### Uso como CLI ```bash # Crear backup completo python scripts/postgresql_backup.py backup \ --host localhost \ --user backup_user \ --backup-dir /backup/postgresql # Listar backups disponibles python scripts/postgresql_backup.py list \ --backup-dir /backup/postgresql # Limpiar backups antiguos (dry-run primero) python scripts/postgresql_backup.py cleanup \ --backup-dir /backup/postgresql \ --retention-days 30 \ --dry-run # Ejecutar limpieza real python scripts/postgresql_backup.py cleanup \ --backup-dir /backup/postgresql \ --retention-days 30 # Restaurar desde backup python scripts/postgresql_backup.py restore \ --backup-path /backup/postgresql/full_20240115_120000 \ --target-dir /var/lib/postgresql/data ``` #### Uso como mΓ³dulo ```python from scripts.postgresql_backup import PostgreSQLBackup # Crear instancia backup = PostgreSQLBackup( host="localhost", port=5432, user="backup_user", backup_dir="/backup/postgresql" ) # Crear backup backup_path = backup.full_backup(verify=True) # Limpiar backups antiguos backup.cleanup_old_backups(retention_days=30) # Listar backups backup.list_backups() ``` #### CaracterΓ­sticas - βœ… Backup fΓ­sico completo con `pg_basebackup` - βœ… VerificaciΓ³n automΓ‘tica de backups - βœ… RestauraciΓ³n automatizada - βœ… Limpieza de backups antiguos con polΓ­tica de retenciΓ³n - βœ… Listado de backups disponibles - βœ… Modo dry-run para limpieza - βœ… Manejo de errores robusto ### 5. Database Monitoring ```yaml # prometheus/database-metrics.yml scrape_configs: - job_name: 'postgres-exporter' static_configs: - targets: ['postgres-exporter:9187'] - job_name: 'postgresql' static_configs: - targets: ['localhost:9187'] ``` ```sql -- Custom metrics queries -- Connection pool usage SELECT datname, numbackends as active_connections, max_connections, (numbackends::float / max_connections * 100) as connection_pct FROM pg_stat_database; -- Replication lag SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds; -- Table bloat SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename) - pg_relation_size(schemaname||'.'||tablename)) AS bloat FROM pg_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema') ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC LIMIT 10; -- Slow queries SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state FROM pg_stat_activity WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes' AND state = 'active'; -- Index usage SELECT schemaname, tablename, indexname, idx_scan as index_scans, pg_size_pretty(pg_relation_size(indexrelid)) AS index_size FROM pg_stat_user_indexes ORDER BY idx_scan ASC; ``` ### 6. Alert Rules ```yaml # prometheus/alerts/database-alerts.yml groups: - name: database_alerts interval: 30s rules: # High connection usage - alert: DatabaseHighConnections expr: | (pg_stat_database_numbackends / pg_stat_database_max_connections) > 0.9 for: 5m labels: severity: warning annotations: summary: "High connection usage in {{ $labels.datname }}" description: "{{ $value | humanizePercentage }} of connections in use" # Replication lag - alert: DatabaseReplicationLag expr: | pg_replication_lag_seconds > 30 for: 5m labels: severity: warning annotations: summary: "High replication lag" description: "Replication lag is {{ $value }} seconds" # Database down - alert: DatabaseDown expr: | up{job="postgres-exporter"} == 0 for: 1m labels: severity: critical annotations: summary: "Database exporter is down" # Slow queries - alert: DatabaseSlowQueries expr: | pg_stat_activity_max_tx_duration > 300 for: 5m labels: severity: warning annotations: summary: "Slow query detected" description: "Query running for {{ $value }} seconds" ``` ## 🎯 Mejores PrΓ‘cticas ### 1. Replication βœ… **DO:** - Use synchronous replication for critical data - Monitor replication lag - Test failover regularly - Use replication slots ❌ **DON'T:** - Ignore replication lag - Skip failover testing - Use async replication for critical writes ### 2. Backups βœ… **DO:** - Test restore procedures regularly - Use point-in-time recovery - Store backups off-site - Encrypt backups ❌ **DON'T:** - Assume backups work without testing - Keep all backups in one location - Skip backup verification ### 3. Connection Pooling βœ… **DO:** - Use connection pooling - Set appropriate pool sizes - Monitor pool usage - Handle connection errors gracefully ❌ **DON'T:** - Exceed database connection limits - Use too large pools - Ignore pool exhaustion ## 🚨 Troubleshooting ### Replication Lag 1. Check network connectivity 2. Monitor WAL generation rate 3. Check replica performance 4. Consider increasing resources ### Connection Exhaustion 1. Reduce connection pool size 2. Close idle connections 3. Use connection pooling 4. Increase database max_connections ### Backup Failures 1. Check disk space 2. Verify permissions 3. Test backup commands manually 4. Review backup logs ## πŸ“š Recursos Adicionales - [PostgreSQL High Availability](https://www.postgresql.org/docs/current/high-availability.html) - [Patroni Documentation](https://patroni.readthedocs.io/) - [PgBouncer Documentation](https://www.pgbouncer.org/) --- **VersiΓ³n:** 1.0.0 **Última actualizaciΓ³n:** Diciembre 2025 **Total lΓ­neas:** 1,100+