Disaster Recovery
Design Principles
Recivr's calculation engine is designed for resilience:
- Stateless computation — the API performs no database writes during fee calculation, eliminating write-path failures as a latency or availability risk
- Multi-zone deployment — the service runs across geographically separated availability zones with independent infrastructure
- Automatic failover — if the primary zone returns errors or becomes unreachable, traffic is rerouted to the secondary zone within seconds
- Edge-level request archival — every inbound request is durably stored at the network edge before processing begins, ensuring a complete audit trail even if downstream services are temporarily unavailable
Failover Behavior
All traffic enters through a global edge layer that handles routing and failover:
- Requests are forwarded to the primary zone
- If the primary returns a server error or times out, the request is automatically retried on the secondary zone
- If all zones are unavailable, the API returns 503 Service Unavailable with a Retry-After header
Failover is transparent to the client — no configuration change is needed. The response includes headers indicating which zone served the request.
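The routing behavior above can be sketched in a few lines. This is an illustrative model, not the actual edge implementation: the zone handlers, the 30-second Retry-After value, and the `edge_route` function are all assumptions for the sketch.

```python
RETRY_AFTER_SECONDS = 30  # illustrative value; the real header may differ

def edge_route(request, zones):
    """Try each zone in order; a zone handler returns (status, body) or raises.

    Mirrors the documented behavior: forward to the primary, retry the
    secondary on a server error or timeout, and return 503 with a
    Retry-After header only when every zone has failed.
    """
    for call_zone in zones:
        try:
            status, body = call_zone(request)
            if status < 500:          # success or client error: return as-is
                return status, body, {}
        except TimeoutError:
            pass                      # treat a timeout like a server error
    # every zone failed: surface 503 with a Retry-After hint
    return 503, "Service Unavailable", {"Retry-After": str(RETRY_AFTER_SECONDS)}

# Simulated zones: primary errors, secondary answers
primary = lambda req: (500, "internal error")
secondary = lambda req: (200, {"fee": 1.25})

status, body, headers = edge_route({"amount": 100}, [primary, secondary])
```

Because the retry happens inside the edge layer, the caller only ever sees the final result, which is why no client-side configuration is needed.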
Direct Fallback Endpoint
In the unlikely event that the global edge routing layer itself is impaired, a direct fallback endpoint is available:
https://api-do.recivr.com/v1/calculate

This endpoint uses independent DNS infrastructure and connects directly to the compute layer, bypassing the primary routing. Same API, same authentication — use only as an emergency fallback.
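A client can wire in the fallback with a simple try-then-fall-back wrapper. A minimal sketch, with two assumptions: the primary URL shown is hypothetical (the source only gives the fallback URL), and the transport is injected as a `send` callable so the sketch runs without a network.

```python
PRIMARY_URL = "https://api.recivr.com/v1/calculate"   # assumed primary URL, for illustration
FALLBACK_URL = "https://api-do.recivr.com/v1/calculate"

def calculate_with_fallback(payload, send):
    """send(url, payload) performs the HTTP call; injected so it can be stubbed.

    Try the primary endpoint first; if the edge routing layer itself is
    unreachable, retry against the direct fallback endpoint.
    """
    try:
        return send(PRIMARY_URL, payload)
    except ConnectionError:
        # Edge layer impaired: same API and same authentication apply
        # on the direct endpoint, so only the URL changes.
        return send(FALLBACK_URL, payload)

# Stub transport simulating an impaired edge layer
def stub_send(url, payload):
    if url == PRIMARY_URL:
        raise ConnectionError("edge routing unreachable")
    return {"url": url, "fee": 0.99}

result = calculate_with_fallback({"amount": 50}, stub_send)
```

Keeping the fallback behind the same request/response shape means the wrapper is the only place the second URL ever appears.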
Data Durability
After fee calculation, results are published to a durable message stream. A dedicated worker consumes from the stream and persists data to the analytics database in batches. This ensures:
- The API never blocks on database writes
- Transaction data is preserved even if the analytics database is temporarily unavailable
- Status lifecycle transitions are tracked with a full audit trail
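The stream-then-batch pattern can be sketched with an in-memory queue standing in for the durable stream. The function name, batch size, and requeue-on-failure strategy are illustrative assumptions; the point is that a failed database write puts the batch back rather than dropping it.

```python
from collections import deque

def drain_in_batches(stream, persist, batch_size=100):
    """Consume calculation results from `stream` and persist them in batches.

    `persist` is the analytics-database write. If it raises, the batch is
    pushed back onto the front of the stream in its original order, giving
    at-least-once delivery while the database is unavailable.
    """
    while stream:
        batch = [stream.popleft() for _ in range(min(batch_size, len(stream)))]
        try:
            persist(batch)
        except Exception:
            stream.extendleft(reversed(batch))  # requeue in original order
            break                               # stop and retry later

# The "stream" holds results already published by the API
stream = deque({"txn": i, "fee": i * 0.01} for i in range(5))
written = []
drain_in_batches(stream, written.extend, batch_size=2)
```

Because the API only publishes to the stream and never calls `persist` itself, a slow or down analytics database cannot add latency to the calculation path.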
Audit Trail
Every raw request is archived at the network edge with:
- 7-year retention for regulatory compliance
- Replay capability — historical requests can be re-processed if fee rules change retroactively
- Zero latency impact — archival happens asynchronously at the edge
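Replay is conceptually just re-running the archived raw requests through the current fee rule. A minimal sketch, where the archive contents, the `replay` helper, and the rate values are all hypothetical:

```python
def replay(archived_requests, fee_rule):
    """Re-process archived raw requests under a (possibly updated) fee rule."""
    return [{"request": req, "fee": fee_rule(req)} for req in archived_requests]

# Hypothetical archived requests and a retroactive rate change
archive = [{"amount": 100}, {"amount": 250}]
new_rule = lambda req: round(req["amount"] * 0.015, 2)  # updated rate

recomputed = replay(archive, new_rule)
```

Archiving the raw request, rather than the computed result, is what makes this possible: the inputs survive unchanged even when the rules do not.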
Preventive Measures
- Automated health monitoring with alerting at multiple severity levels
- Proactive scaling — capacity is provisioned ahead of projected load
- Hardened infrastructure — all systems follow security best practices including least-privilege access, encrypted transit, and regular patching
- Incident response — on-call engineering with defined escalation procedures and a detection-to-response time under 5 minutes
Recovery Objectives
| Metric | Target |
|---|---|
| RTO (Recovery Time Objective) | < 30 seconds (automatic failover) |
| RPO (Recovery Point Objective) | 0 — no data loss (edge archival + stream buffering) |
