---
skill_id: ARCH-FAULT-TOL
version: 1.0.0
last_updated: 2026-01-04
applies_to: [Class A, Class B, Class C]
jurisdiction: [Global]
prerequisites: [ARCH-SAFETY-CLASS]
---

# Fault Tolerance Design

## Purpose
Provide patterns for detecting, containing, and recovering from faults in medical device software, scaled to safety class.

## When to Apply
- Safety-critical control loops, sensing/actuation, communication paths.
- Watchdogs, redundancy, health monitoring, self-test.
- Power, memory, and comms error handling.

## Requirements (testable)
1. Fault Detection: Implement monitoring for critical resources (tasks, sensors, comms) with thresholds and alarms. Rationale: early detection.
2. Graceful Degradation: Define degraded modes or safe state when partial functionality fails. Rationale: bounded failure.
3. Redundancy Strategy: For Class C functions, evaluate redundancy (sensing, computation, or communication) and record the decision; where redundancy is used, apply voter/consistency checks. Rationale: resilience.
4. Watchdog Use: Configure hardware/software watchdogs with bounded servicing windows; service only after critical checks pass. Rationale: recover from hangs.
5. Self-Test/BIST: Run self-tests at startup and periodically for critical components; handle failures deterministically. Rationale: latent fault detection.
6. Error Propagation Control: Sanitize/contain errors at boundaries; avoid cascading faults. Rationale: containment.
7. Logging & Alarms: Log and, where required, annunciate safety-relevant faults; ensure tamper-evident logs for post-incident analysis. Rationale: traceability.
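
As one illustration of requirement 5, a startup self-test might verify a memory region against a stored CRC. The sketch below is a minimal example under stated assumptions, not a prescribed implementation: the names `crc32_compute`, `selftest_region`, and `selftest_result_t` are hypothetical, and the CRC-32 variant (reflected polynomial 0xEDB88320) is chosen only because it is widely used. The integrity check appropriate for a given device should follow from its risk analysis.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
typedef enum { SELFTEST_PASS, SELFTEST_FAIL } selftest_result_t;

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, initial value and
 * final XOR of 0xFFFFFFFF). Slow but table-free, so it needs no RAM. */
uint32_t crc32_compute(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc & 1u) ? ((crc >> 1) ^ 0xEDB88320u) : (crc >> 1);
        }
    }
    return ~crc;
}

/* Deterministic outcome: any invalid argument or mismatch is a failure. */
selftest_result_t selftest_region(const uint8_t *region, size_t len,
                                  uint32_t expected_crc) {
    if (region == NULL || len == 0u) {
        return SELFTEST_FAIL;
    }
    return (crc32_compute(region, len) == expected_crc) ? SELFTEST_PASS
                                                        : SELFTEST_FAIL;
}
```

A failed result would typically route to the safe state or degraded mode defined under requirement 2 rather than being retried silently.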

## Recommended Practices
- Use majority voting or reasonableness checks instead of blind trust in single sensors.
- Employ brownout/power-fail detection to enter safe state gracefully.
- On an RTOS, assign a dedicated safety monitor task a higher priority than non-critical tasks.
- Debounce fault signals to reduce false positives, but cap the debounce period with a timeout so that an intermittent fault is still confirmed within a bounded time.
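
The debounce practice above can be sketched as follows. This is a minimal illustration, assuming a periodically sampled boolean fault signal and a millisecond tick; the type `fault_debounce_t`, the function `debounce_update`, and all field names are hypothetical. A fault is latched either after `threshold` consecutive asserted samples or when a flickering signal has stayed unresolved for `max_pending_ms`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical debounce state; threshold must be at least 1. */
typedef struct {
    uint8_t  threshold;       /* consecutive samples needed to confirm/clear */
    uint32_t max_pending_ms;  /* cap on how long an episode may stay open */
    uint8_t  fault_count;     /* consecutive asserted samples */
    uint8_t  clear_count;     /* consecutive clear samples within an episode */
    uint32_t first_seen_ms;   /* tick at which the current episode began */
    bool     pending;         /* an unresolved episode is in progress */
    bool     latched;         /* fault confirmed; stays set until reset */
} fault_debounce_t;

/* Feed one raw sample; returns true once the fault is confirmed. */
bool debounce_update(fault_debounce_t *d, bool raw_fault, uint32_t now_ms) {
    if (d->latched) {
        return true;
    }
    if (raw_fault) {
        if (!d->pending) {
            d->pending = true;
            d->first_seen_ms = now_ms;
        }
        d->clear_count = 0u;
        if (d->fault_count < UINT8_MAX) {
            d->fault_count++;
        }
    } else {
        d->fault_count = 0u;
        /* The episode ends only after enough consecutive clear samples. */
        if (d->pending && ++d->clear_count >= d->threshold) {
            d->pending = false;
        }
    }
    bool confirmed = (d->fault_count >= d->threshold);
    /* Unsigned subtraction keeps the comparison valid across tick wraparound. */
    bool timed_out = d->pending &&
                     ((uint32_t)(now_ms - d->first_seen_ms) >= d->max_pending_ms);
    if (confirmed || timed_out) {
        d->latched = true;
    }
    return d->latched;
}
```

Latching is a design choice in this sketch: once confirmed, the fault persists until the state is explicitly reinitialised, which keeps the outcome deterministic for the caller.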

## Patterns
Watchdog servicing with checks:
```c
// REQ-FT-WD-01; TEST-FT-03
void service_watchdog(void) {
    if (critical_tasks_healthy() && comms_alive()) {
        wdt_kick();
    } else {
        // Do not kick; let watchdog reset into safe boot
    }
}
```
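
The pattern above leaves `critical_tasks_healthy()` undefined. One way such a check could be backed is a heartbeat table in which each critical task records its last check-in and a monitor compares the elapsed time against a per-task timeout. The sketch below assumes a free-running millisecond tick and single-context access (no locking is shown); `task_monitor_t`, `monitor_register`, `monitor_checkin`, `monitor_all_healthy`, and `MON_MAX_TASKS` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MON_MAX_TASKS 4u  /* hypothetical table size */

typedef struct {
    uint32_t last_checkin_ms;  /* tick of the most recent heartbeat */
    uint32_t timeout_ms;       /* maximum allowed silence */
    bool     registered;
} task_monitor_t;

static task_monitor_t g_monitors[MON_MAX_TASKS];

/* Register a task and start its timeout from now_ms. */
void monitor_register(size_t id, uint32_t timeout_ms, uint32_t now_ms) {
    if (id < MON_MAX_TASKS) {
        g_monitors[id].last_checkin_ms = now_ms;
        g_monitors[id].timeout_ms = timeout_ms;
        g_monitors[id].registered = true;
    }
}

/* Called by each critical task once per cycle. */
void monitor_checkin(size_t id, uint32_t now_ms) {
    if (id < MON_MAX_TASKS && g_monitors[id].registered) {
        g_monitors[id].last_checkin_ms = now_ms;
    }
}

/* True only if every registered task has checked in within its timeout. */
bool monitor_all_healthy(uint32_t now_ms) {
    for (size_t i = 0; i < MON_MAX_TASKS; ++i) {
        if (g_monitors[i].registered) {
            /* Unsigned subtraction stays correct across tick wraparound. */
            uint32_t age = now_ms - g_monitors[i].last_checkin_ms;
            if (age > g_monitors[i].timeout_ms) {
                return false;
            }
        }
    }
    return true;
}
```

On a preemptive RTOS the table would need protection against concurrent access, for example a critical section around each update; that is omitted here for brevity.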

Sensor plausibility check:
```c
// REQ-FT-SNS-02; TEST-FT-07
#include <stdbool.h>

// Range check only. NaN fails both comparisons, so it is rejected as well.
bool validate_pressure(float p_kpa) {
    return (p_kpa >= 0.0f && p_kpa <= 300.0f);
}
```
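
A range check alone will not catch a value that jumps implausibly fast while staying within limits. A rate-of-change check is one common complement. The sketch below is illustrative only: `rate_check_t`, `rate_plausible`, and the `max_step` parameter are hypothetical, and the permitted step per sample would have to come from the physical process and the sampling period.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical per-sensor history for the rate check. */
typedef struct {
    float last_value;
    bool  has_last;
} rate_check_t;

/* Returns true if the change since the previous sample is within max_step. */
bool rate_plausible(rate_check_t *s, float sample, float max_step) {
    if (isnan(sample)) {
        return false;  /* NaN is never plausible; history is left unchanged */
    }
    bool ok = true;
    if (s->has_last) {
        float step = (sample > s->last_value) ? (sample - s->last_value)
                                              : (s->last_value - sample);
        ok = (step <= max_step);
    }
    /* The baseline always follows the input, so a genuine step change is
     * flagged once rather than rejected forever. */
    s->last_value = sample;
    s->has_last = true;
    return ok;
}
```

Because the baseline tracks every non-NaN sample, a single spike is reported twice (on the way up and on the way back). Holding the last accepted value instead is an alternative with the opposite trade-off.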

Redundant reading consistency check (two channels):
```c
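// REQ-FT-RED-01; TEST-FT-10
#include <math.h>  // fabsf, NAN

// Fuse two redundant readings. If they differ by more than 2.0 (or either
// is NaN), raise the alarm, request the safe state, and return NAN so the
// caller cannot act on an average of inconsistent values.
float fused_temp(float a, float b) {
    if (!(fabsf(a - b) <= 2.0f)) {
        alarm_sensor_disagree();
        enter_safe_state();
        return NAN;
    }
    return (a + b) * 0.5f;
}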
```
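
With three redundant channels, a true majority vote becomes possible: the median discards a single outlier, and the result is trusted only if at least one other channel agrees with it. The following is a minimal sketch, assuming each input has already passed a plausibility check (NaN is not handled); `vote_result_t`, `median3`, `vote3`, and the tolerance parameter are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical result type: the voted value plus a validity flag. */
typedef struct {
    float value;
    bool  valid;
} vote_result_t;

static float abs_diff(float x, float y) {
    return (x > y) ? (x - y) : (y - x);
}

/* Median of three values via a three-step compare-and-swap. */
float median3(float a, float b, float c) {
    if (a > b) { float t = a; a = b; b = t; }
    if (b > c) { float t = b; b = c; c = t; }
    if (a > b) { float t = a; a = b; b = t; }
    return b;
}

/* Valid only if at least two channels lie within tol of the median.
 * The median is itself one of the inputs, so it always counts once. */
vote_result_t vote3(float a, float b, float c, float tol) {
    vote_result_t r;
    int agree = 0;
    r.value = median3(a, b, c);
    if (abs_diff(a, r.value) <= tol) { agree++; }
    if (abs_diff(b, r.value) <= tol) { agree++; }
    if (abs_diff(c, r.value) <= tol) { agree++; }
    r.valid = (agree >= 2);
    return r;
}
```

When `valid` is false, no two channels agree, and the caller would be expected to alarm and fall back to a safe state instead of using `value`.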

## Anti-Patterns (risks)
- Servicing watchdog unconditionally in main loop -> risk: hides deadlocks.
- Single-point sensors without plausibility checks -> risk: unsafe outputs.
- Logging faults without annunciation where required -> risk: latent hazards.
- No degraded mode or safe fallback -> risk: uncontrolled failure behavior.
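
To contrast with the last anti-pattern, the sketch below shows one way an explicit mode decision might look. It is an assumption-laden illustration, not a recommended policy: the health flags, the three modes, and the rules that map one to the other (`health_t`, `op_mode_t`, `select_mode`) are hypothetical and would in practice be derived from the hazard analysis.

```c
#include <stdbool.h>

typedef enum { MODE_NORMAL, MODE_DEGRADED, MODE_SAFE } op_mode_t;

/* Hypothetical health summary produced by the fault monitors. */
typedef struct {
    bool primary_sensor_ok;
    bool backup_sensor_ok;
    bool actuator_ok;
    bool comms_ok;
} health_t;

/* Pure function of current mode and health, so every transition is
 * enumerable and testable. */
op_mode_t select_mode(op_mode_t current, const health_t *h) {
    /* Safe state is latched: leaving it requires an explicit reset path
     * outside this function. */
    if (current == MODE_SAFE) {
        return MODE_SAFE;
    }
    /* Loss of actuation or of both sensors cannot be compensated. */
    if (!h->actuator_ok || (!h->primary_sensor_ok && !h->backup_sensor_ok)) {
        return MODE_SAFE;
    }
    /* A single lost sensor or lost comms allows reduced operation. */
    if (!h->primary_sensor_ok || !h->backup_sensor_ok || !h->comms_ok) {
        return MODE_DEGRADED;
    }
    return MODE_NORMAL;
}
```

Keeping the decision free of side effects makes it straightforward to cover each row of the mode table with a unit test.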

## Verification Checklist
- [ ] Fault monitors implemented for critical resources with thresholds/timeouts.
- [ ] Watchdog configuration reviewed; serviced only after health checks.
- [ ] Degraded modes or safe state defined and reachable on fault.
- [ ] Redundancy/plausibility checks implemented for critical sensors/paths.
- [ ] Self-tests executed at startup/periodically; failures handled deterministically.
- [ ] Errors contained at boundaries; no unchecked propagation.
- [ ] Faults logged and annunciated as applicable; integrity of logs maintained.

## Traceability
- Link `REQ-FT-###` to hazards and controls; map to tests (`TEST-FT-###`).
- Store watchdog and fault monitor configuration with release artifacts.

## References
- IEC 62304 design/implementation expectations (fault control).
- ISO 14971 for risk-driven fault handling.
- IEC 60601-1 (power/brownout considerations; informative).

## Changelog
- 1.0.0 (2026-01-04): Initial fault tolerance patterns with watchdog, redundancy, and safe fallback guidance.

## Audit History
- **2026-01-04**: Audit performed. Verified:
  - Fault tolerance patterns technically accurate
  - IEC 60601-1 reference appropriate as informative for power/brownout considerations
  - Watchdog and redundancy patterns follow industry best practices