Disaster Recovery
Design Principles
Recivr's calculation engine is designed for resilience:
- Stateless computation — the API performs no database writes during fee calculation, eliminating write-path failures as a latency or availability risk
- Multi-zone deployment — the service runs across geographically separated availability zones with independent infrastructure
- Automatic failover — if the primary zone returns errors or becomes unreachable, traffic is rerouted to the secondary zone within seconds
- Edge-level request archival — every inbound request is durably stored at the network edge before processing begins, ensuring a complete audit trail even if downstream services are temporarily unavailable
Failover Behavior
All traffic enters through a global edge layer that handles routing and failover:
- Requests are forwarded to the primary zone
- If the primary returns a server error or times out, the request is automatically retried on the secondary zone
- If all zones are unavailable, the API returns 503 Service Unavailable with a Retry-After header
Failover is transparent to the client — no configuration change is needed. The response includes headers indicating which zone served the request.
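The routing behavior above can be sketched in a few lines. This is an illustrative model, not the actual edge implementation: the zone handlers, the 30-second Retry-After value, and the `edge_route` function are all assumptions for the sketch.

```python
RETRY_AFTER_SECONDS = 30  # illustrative value; the real header may differ

def edge_route(request, zones):
    """Try each zone in order; a zone handler returns (status, body) or raises.

    Mirrors the documented behavior: forward to the primary, retry the
    secondary on a server error or timeout, and return 503 with a
    Retry-After header only when every zone has failed.
    """
    for call_zone in zones:
        try:
            status, body = call_zone(request)
            if status < 500:          # success or client error: return as-is
                return status, body, {}
        except TimeoutError:
            pass                      # treat a timeout like a server error
    # every zone failed: surface 503 with a Retry-After hint
    return 503, "Service Unavailable", {"Retry-After": str(RETRY_AFTER_SECONDS)}

# Simulated zones: primary errors, secondary answers
primary = lambda req: (500, "internal error")
secondary = lambda req: (200, {"fee": 1.25})

status, body, headers = edge_route({"amount": 100}, [primary, secondary])
```

Because the retry happens inside the edge layer, the caller only ever sees the final result, which is why no client-side configuration is needed.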
Direct Fallback Endpoint
In the unlikely event that the global edge routing layer itself is impaired, a direct fallback endpoint is available:
https://api-do.recivr.com/v1/calculate

This endpoint uses independent DNS infrastructure and connects directly to the compute layer, bypassing the primary routing. Same API, same authentication — use only as an emergency fallback.
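A client can wire in the fallback with a simple try-then-fall-back wrapper. A minimal sketch, with two assumptions: the primary URL shown is hypothetical (the source only gives the fallback URL), and the transport is injected as a `send` callable so the sketch runs without a network.

```python
PRIMARY_URL = "https://api.recivr.com/v1/calculate"   # assumed primary URL, for illustration
FALLBACK_URL = "https://api-do.recivr.com/v1/calculate"

def calculate_with_fallback(payload, send):
    """send(url, payload) performs the HTTP call; injected so it can be stubbed.

    Try the primary endpoint first; if the edge routing layer itself is
    unreachable, retry against the direct fallback endpoint.
    """
    try:
        return send(PRIMARY_URL, payload)
    except ConnectionError:
        # Edge layer impaired: same API and same authentication apply
        # on the direct endpoint, so only the URL changes.
        return send(FALLBACK_URL, payload)

# Stub transport simulating an impaired edge layer
def stub_send(url, payload):
    if url == PRIMARY_URL:
        raise ConnectionError("edge routing unreachable")
    return {"url": url, "fee": 0.99}

result = calculate_with_fallback({"amount": 50}, stub_send)
```

Keeping the fallback behind the same request/response shape means the wrapper is the only place the second URL ever appears.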
Data Durability
After fee calculation, results are published to a durable message stream. A dedicated worker consumes from the stream and persists data to the analytics database in batches. This ensures:
- The API never blocks on database writes
- Transaction data is preserved even if the analytics database is temporarily unavailable
- Status lifecycle transitions are tracked with a full audit trail
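The stream-then-batch pattern can be sketched with an in-memory queue standing in for the durable stream. The function name, batch size, and requeue-on-failure strategy are illustrative assumptions; the point is that a failed database write puts the batch back rather than dropping it.

```python
from collections import deque

def drain_in_batches(stream, persist, batch_size=100):
    """Consume calculation results from `stream` and persist them in batches.

    `persist` is the analytics-database write. If it raises, the batch is
    pushed back onto the front of the stream in its original order, giving
    at-least-once delivery while the database is unavailable.
    """
    while stream:
        batch = [stream.popleft() for _ in range(min(batch_size, len(stream)))]
        try:
            persist(batch)
        except Exception:
            stream.extendleft(reversed(batch))  # requeue in original order
            break                               # stop and retry later

# The "stream" holds results already published by the API
stream = deque({"txn": i, "fee": i * 0.01} for i in range(5))
written = []
drain_in_batches(stream, written.extend, batch_size=2)
```

Because the API only publishes to the stream and never calls `persist` itself, a slow or down analytics database cannot add latency to the calculation path.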
Audit Trail
Every raw request is archived at the network edge with:
- 7-year retention for regulatory compliance
- Replay capability — historical requests can be re-processed if fee rules change retroactively
- Zero latency impact — archival happens asynchronously at the edge
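Replay is conceptually just re-running the archived raw requests through the current fee rule. A minimal sketch, where the archive contents, the `replay` helper, and the rate values are all hypothetical:

```python
def replay(archived_requests, fee_rule):
    """Re-process archived raw requests under a (possibly updated) fee rule."""
    return [{"request": req, "fee": fee_rule(req)} for req in archived_requests]

# Hypothetical archived requests and a retroactive rate change
archive = [{"amount": 100}, {"amount": 250}]
new_rule = lambda req: round(req["amount"] * 0.015, 2)  # updated rate

recomputed = replay(archive, new_rule)
```

Archiving the raw request, rather than the computed result, is what makes this possible: the inputs survive unchanged even when the rules do not.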
Preventive Measures
- Automated health monitoring with alerting at multiple severity levels
- Proactive scaling — capacity is provisioned ahead of projected load
- Hardened infrastructure — all systems follow security best practices including least-privilege access, encrypted transit, and regular patching
- Incident response — on-call engineering with defined escalation procedures and a detection-to-response time under 5 minutes
Recovery Objectives
| Metric | Target |
|---|---|
| RTO (Recovery Time Objective) | < 30 seconds (automatic failover) |
| RPO (Recovery Point Objective) | 0 — no data loss (edge archival + stream buffering) |
