Building a Fault-Tolerant Order Processing System
Design a system that processes orders reliably even when components fail, using saga pattern and event sourcing.
Problem Statement
Design a fault-tolerant order processing system that can handle failures gracefully while maintaining data consistency across multiple services.
Context: E-commerce platform handling 10,000+ orders/day with 99.9% uptime requirement.
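To make these targets concrete, the sketch below converts them into an average arrival rate and a downtime budget. It assumes orders arrive evenly over the day (real traffic will peak well above the average) and that availability is measured over calendar windows. The constant and function names are illustrative only.

```javascript
// Convert the stated targets into concrete budgets.
const ORDERS_PER_DAY = 10_000; // from the context line above
const AVAILABILITY = 0.999;    // 99.9% uptime requirement

const SECONDS_PER_DAY = 24 * 60 * 60;

// Average arrival rate, assuming an even spread over the day.
const avgOrdersPerSecond = ORDERS_PER_DAY / SECONDS_PER_DAY;

// Allowed downtime, in minutes, for a window of the given length in days.
function downtimeBudgetMinutes(days, availability = AVAILABILITY) {
  return days * 24 * 60 * (1 - availability);
}

console.log(avgOrdersPerSecond.toFixed(3));                // 0.116 orders/s
console.log(downtimeBudgetMinutes(30).toFixed(1));         // 43.2 minutes per 30 days
console.log((downtimeBudgetMinutes(365) / 60).toFixed(2)); // 8.76 hours per year
```

The average load is small; the harder constraint is the roughly 43 minutes of downtime allowed per month, which is why the rest of the design focuses on failure handling rather than raw throughput.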
Architecture Overview
The system uses a saga pattern with event sourcing to ensure reliable order processing.
Saga Flow
Step-by-step execution of the order processing saga
Data Model
Core entities and their relationships with detailed schema definitions
Orders Table Schema
CREATE TABLE orders (
    order_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES users(user_id),
    status VARCHAR(50) NOT NULL DEFAULT 'pending',
    total_amount DECIMAL(10,2) NOT NULL,
    currency VARCHAR(3) NOT NULL DEFAULT 'USD',
    shipping_address JSONB NOT NULL,
    billing_address JSONB NOT NULL,
    order_items JSONB NOT NULL,
    metadata JSONB,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    expires_at TIMESTAMP WITH TIME ZONE,
    CONSTRAINT valid_status CHECK (status IN ('pending', 'processing', 'confirmed', 'cancelled', 'failed'))
);
-- Indexes for performance
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_status ON orders(status);
CREATE INDEX idx_orders_created_at ON orders(created_at);
CREATE INDEX idx_orders_status_created ON orders(status, created_at);
Events Table Schema
CREATE TABLE events (
    event_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_id UUID NOT NULL,
    aggregate_type VARCHAR(100) NOT NULL,
    event_type VARCHAR(100) NOT NULL,
    event_data JSONB NOT NULL,
    event_version INTEGER NOT NULL DEFAULT 1,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    correlation_id UUID,
    causation_id UUID,
    metadata JSONB
);
-- Indexes for event sourcing queries
CREATE INDEX idx_events_aggregate ON events(aggregate_id, aggregate_type);
CREATE INDEX idx_events_type ON events(event_type);
CREATE INDEX idx_events_created_at ON events(created_at);
CREATE INDEX idx_events_correlation ON events(correlation_id);
Key Decisions & Trade-offs
Analysis of architectural decisions and their implications
Event Sourcing
Pros: Full audit trail, temporal queries, replay capability
Cons: Eventual consistency, complexity, storage overhead
Chosen because: Compliance requirements and need for complete audit trail
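The replay and temporal-query benefits can be illustrated with a small fold over an order's events. This is a minimal sketch, not the system's actual projection code: the event shape (`type`, `createdAt`, `data`) and the `order_confirmed` and `order_cancelled` event types are assumptions, as are the function names.

```javascript
// Initial state before any event has been applied.
const emptyOrder = { orderId: null, status: null, paymentId: null, reservations: [] };

// Apply one event to the current state and return the next state.
// 'order_confirmed' and 'order_cancelled' are assumed event types.
function applyEvent(state, event) {
  switch (event.type) {
    case 'order_created':
      return { ...state, orderId: event.data.orderId, status: 'pending' };
    case 'payment_reserved':
      return { ...state, status: 'processing', paymentId: event.data.paymentId };
    case 'inventory_reserved':
      return { ...state, reservations: [...state.reservations, event.data.reservationId] };
    case 'order_confirmed':
      return { ...state, status: 'confirmed' };
    case 'order_cancelled':
      return { ...state, status: 'cancelled' };
    default:
      return state; // ignore unknown types so older readers tolerate newer events
  }
}

// Replay: rebuild current state from the full event history.
function replay(events) {
  return events.reduce(applyEvent, emptyOrder);
}

// Temporal query: state as it was at a given instant (ISO 8601 string).
function replayUntil(events, isoTime) {
  const cutoff = Date.parse(isoTime);
  return replay(events.filter((e) => Date.parse(e.createdAt) <= cutoff));
}

const sampleEvents = [
  { type: 'order_created', createdAt: '2024-01-15T10:20:00Z', data: { orderId: 'o-1' } },
  { type: 'payment_reserved', createdAt: '2024-01-15T10:21:00Z', data: { orderId: 'o-1', paymentId: 'p-1' } },
  { type: 'inventory_reserved', createdAt: '2024-01-15T10:22:00Z', data: { orderId: 'o-1', reservationId: 'r-1' } },
  { type: 'order_confirmed', createdAt: '2024-01-15T10:23:00Z', data: { orderId: 'o-1' } },
];

console.log(replay(sampleEvents).status);                              // confirmed
console.log(replayUntil(sampleEvents, '2024-01-15T10:21:30Z').status); // processing
```

Because state is derived rather than stored, the same events can be replayed into a new read model later; the cost is that every reader must handle every historical event version.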
Saga Pattern
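Pros: Coordinates a transaction that spans several services without distributed locks or two-phase commit
Cons: Complex compensation logic, no isolation between concurrent sagas, and compensations that can themselves fail and retry indefinitely unless retries are bounded
Chosen because: Business requirement that an order either completes in full or is rolled back through compensation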
Implementation Guide
Step-by-step implementation approach with code examples
Set up Event Store
Configure event sourcing infrastructure with proper event schemas and versioning
// Event Store Configuration
// EventStore is assumed to be an application-level wrapper; it is not defined in this guide.
const eventStore = new EventStore({
  connection: {
    host: process.env.EVENT_STORE_HOST,
    // Environment variables are strings, so convert the port explicitly
    port: Number(process.env.EVENT_STORE_PORT),
    database: 'events'
  },
  // Each event type declares a schema version and the fields its payload must carry
  schema: {
    events: {
      order_created: {
        version: '1.0',
        fields: ['orderId', 'userId', 'items', 'totalAmount', 'timestamp']
      },
      payment_reserved: {
        version: '1.0',
        fields: ['orderId', 'paymentId', 'amount', 'status', 'timestamp']
      },
      inventory_reserved: {
        version: '1.0',
        fields: ['orderId', 'productId', 'quantity', 'reservationId', 'timestamp']
      }
    }
  }
});
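One way to use the per-event field lists is to validate payloads before they are appended. The sketch below is an assumption about how that check could look; `eventSchemas` and `validateEvent` are hypothetical names, and the registry simply mirrors the `schema.events` entries from the configuration.

```javascript
// Hypothetical registry mirroring the schema.events configuration.
const eventSchemas = {
  order_created: {
    version: '1.0',
    fields: ['orderId', 'userId', 'items', 'totalAmount', 'timestamp'],
  },
  payment_reserved: {
    version: '1.0',
    fields: ['orderId', 'paymentId', 'amount', 'status', 'timestamp'],
  },
  inventory_reserved: {
    version: '1.0',
    fields: ['orderId', 'productId', 'quantity', 'reservationId', 'timestamp'],
  },
};

// Check that a payload carries every field its event type declares.
// Returns { valid, errors } instead of throwing so callers decide how to react.
function validateEvent(type, payload, schemas = eventSchemas) {
  const schema = schemas[type];
  if (!schema) {
    return { valid: false, errors: [`unknown event type: ${type}`] };
  }
  if (payload === null || typeof payload !== 'object') {
    return { valid: false, errors: ['payload must be an object'] };
  }
  const errors = schema.fields
    .filter((field) => payload[field] === undefined)
    .map((field) => `missing field: ${field}`);
  return { valid: errors.length === 0, errors };
}

console.log(validateEvent('order_created', { orderId: 'o-1' }).errors);
```

Rejecting malformed events at write time matters more in an event-sourced system than elsewhere, because a bad event stays in the log and every future replay has to cope with it.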
Implement Saga Coordinator
Create saga orchestration logic with compensation handling and retry mechanisms
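// Saga Coordinator Implementation
const { randomUUID } = require('node:crypto');

class OrderSagaCoordinator {
  async executeOrderSaga(orderData) {
    const sagaId = randomUUID();
    const saga = new Saga(sagaId, 'order_processing');
    try {
      // Step 1: Create Order
      const order = await this.createOrder(orderData);
      saga.addStep('order_created', order.id);
      // Step 2: Reserve Payment
      const payment = await this.reservePayment(order.id, orderData.payment);
      saga.addStep('payment_reserved', payment.id);
      // Step 3: Reserve Inventory
      const inventory = await this.reserveInventory(order.id, orderData.items);
      saga.addStep('inventory_reserved', inventory.id);
      // Step 4: Confirm Order
      await this.confirmOrder(order.id);
      saga.complete();
      return { success: true, orderId: order.id, sagaId };
    } catch (error) {
      // Undo whatever completed, then surface the original failure
      const failures = await this.compensate(saga);
      if (failures.length > 0) {
        // Some steps could not be undone: report both errors for manual intervention
        throw new AggregateError([error, ...failures], 'COMPENSATION_FAILED');
      }
      throw error;
    }
  }

  // Undo completed steps in reverse order. One failing compensation must not
  // block the remaining ones, so failures are collected and returned.
  async compensate(saga) {
    const steps = saga.getCompletedSteps();
    const failures = [];
    for (let i = steps.length - 1; i >= 0; i--) {
      try {
        await this.executeCompensation(steps[i]);
      } catch (compensationError) {
        failures.push(compensationError);
      }
    }
    return failures;
  }
}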
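The coordinator calls `executeCompensation` without showing it. A minimal sketch is a lookup from each forward step to the action that undoes it. Everything here is an assumption made for illustration: the step shape (`name`, `resourceId`), the three service clients, and their method names.

```javascript
// Build an executeCompensation function from injected service clients.
// The clients and their methods (cancelOrder, releasePayment,
// releaseReservation) are hypothetical placeholders.
function createCompensator({ orders, payments, inventory }) {
  // One undo action per forward step of the order saga.
  const compensations = {
    order_created: (orderId) => orders.cancelOrder(orderId),
    payment_reserved: (paymentId) => payments.releasePayment(paymentId),
    inventory_reserved: (reservationId) => inventory.releaseReservation(reservationId),
  };

  return async function executeCompensation(step) {
    const undo = compensations[step.name];
    if (!undo) {
      // Failing loudly is safer than silently skipping an unknown step.
      throw new Error(`No compensation registered for step "${step.name}"`);
    }
    await undo(step.resourceId);
  };
}
```

Each undo action should be idempotent, since a coordinator that crashes mid-compensation will likely run the same step again after restart.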
Add Monitoring & Alerting
Implement comprehensive monitoring for saga execution, failures, and performance metrics
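// Monitoring Configuration
const { Counter, Histogram } = require('prom-client');

const monitoring = {
  metrics: {
    sagaExecutionTime: new Histogram({
      name: 'saga_execution_time_seconds',
      help: 'Time taken to execute complete saga',
      buckets: [0.1, 0.5, 1, 2, 5, 10]
    }),
    // Counters only ever increase; rates are derived at query time
    sagaSuccessRate: new Counter({
      name: 'saga_success_total',
      help: 'Total successful saga executions'
    }),
    sagaFailureRate: new Counter({
      name: 'saga_failure_total',
      help: 'Total failed saga executions'
    })
  },
  // Alert conditions are PromQL expressions over the metrics above.
  // The 5m window is an assumed default; tune it to the traffic volume.
  alerts: {
    highFailureRate: {
      // Failures as a share of all finished sagas, above 5%
      condition: 'sum(rate(saga_failure_total[5m])) / (sum(rate(saga_success_total[5m])) + sum(rate(saga_failure_total[5m]))) > 0.05',
      severity: 'critical',
      message: 'Saga failure rate is above threshold'
    },
    slowExecution: {
      // 95th percentile execution time above 5s
      condition: 'histogram_quantile(0.95, sum by (le) (rate(saga_execution_time_seconds_bucket[5m]))) > 5',
      severity: 'warning',
      message: 'Saga execution time is degrading'
    }
  }
};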
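The metrics only help if every saga run updates them. The wrapper below is one possible way to do that, sketched under assumptions: `withSagaMetrics` is a hypothetical helper, and its `metrics` argument is duck-typed (`duration.observe(seconds)`, `successes.inc()`, `failures.inc()`), which a prom-client `Histogram` and two `Counter` objects would satisfy. The clock is injectable so the helper can be tested without real delays.

```javascript
// Run a saga and record its duration and outcome.
// metrics: { duration: { observe }, successes: { inc }, failures: { inc } }
// now: returns a monotonic timestamp in nanoseconds as a BigInt.
async function withSagaMetrics(metrics, runSaga, now = () => process.hrtime.bigint()) {
  const start = now();
  try {
    const result = await runSaga();
    metrics.successes.inc();
    return result;
  } catch (error) {
    metrics.failures.inc();
    throw error; // the caller still sees the original failure
  } finally {
    // Record duration for successes and failures alike.
    const seconds = Number(now() - start) / 1e9;
    metrics.duration.observe(seconds);
  }
}
```

Recording the duration in `finally` keeps slow failures visible in the latency histogram; timing only the successful runs would hide timeouts, which are often the slowest executions.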
API Design & Endpoints
RESTful API design with OpenAPI specifications and error handling
Create Order Endpoint
POST /api/v1/orders
Content-Type: application/json
Authorization: Bearer {token}
{
  "user_id": "uuid",
  "items": [
    {
      "product_id": "uuid",
      "quantity": 2,
      "unit_price": 29.99
    }
  ],
  "shipping_address": {
    "street": "123 Main St",
    "city": "New York",
    "state": "NY",
    "zip": "10001",
    "country": "US"
  },
  "payment_method": {
    "type": "credit_card",
    "token": "pm_1234567890"
  }
}
// Response
{
  "order_id": "uuid",
  "status": "processing",
  "estimated_completion": "2024-01-15T10:30:00Z",
  "saga_id": "uuid",
  "links": {
    "self": "/api/v1/orders/{order_id}",
    "status": "/api/v1/orders/{order_id}/status"
  }
}
Order Status Endpoint
GET /api/v1/orders/{order_id}/status
Authorization: Bearer {token}
// Response
{
  "order_id": "uuid",
  "status": "processing",
  "current_step": "inventory_reserved",
  "completed_steps": [
    "order_created",
    "payment_reserved"
  ],
  "pending_steps": [
    "inventory_reserved",
    "shipping_created"
  ],
  "estimated_completion": "2024-01-15T10:30:00Z",
  "last_updated": "2024-01-15T10:25:00Z"
}
Error Handling
// Error Response Format
{
  "error": {
    "code": "SAGA_FAILED",
    "message": "Order processing failed during payment reservation",
    "details": {
      "step": "payment_reserved",
      "reason": "Insufficient funds",
      "saga_id": "uuid",
      "compensation_required": true
    },
    "timestamp": "2024-01-15T10:25:00Z",
    "request_id": "uuid"
  }
}
// Common Error Codes
const ERROR_CODES = {
  INVALID_ORDER_DATA: 'INVALID_ORDER_DATA',
  PAYMENT_FAILED: 'PAYMENT_FAILED',
  INVENTORY_UNAVAILABLE: 'INVENTORY_UNAVAILABLE',
  SAGA_FAILED: 'SAGA_FAILED',
  SAGA_TIMEOUT: 'SAGA_TIMEOUT',
  COMPENSATION_FAILED: 'COMPENSATION_FAILED'
};
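A small helper can keep every error response in the format shown above. The sketch below is illustrative only: `buildErrorResponse` is a hypothetical name, and the mapping from error codes to HTTP status codes is an assumption, since the source does not specify one.

```javascript
const { randomUUID } = require('node:crypto');

// Assumed mapping from error code to HTTP status; adjust to the API's conventions.
const HTTP_STATUS_BY_CODE = {
  INVALID_ORDER_DATA: 400,
  PAYMENT_FAILED: 402,
  INVENTORY_UNAVAILABLE: 409,
  SAGA_TIMEOUT: 504,
  COMPENSATION_FAILED: 500,
};

// Build the HTTP status and JSON body for a failed request.
// requestId and now are injectable so responses are reproducible in tests.
function buildErrorResponse(code, message, details = {}, options = {}) {
  const { requestId = randomUUID(), now = new Date() } = options;
  return {
    status: HTTP_STATUS_BY_CODE[code] ?? 500, // unknown codes fall back to 500
    body: {
      error: {
        code,
        message,
        details,
        timestamp: now.toISOString(),
        request_id: requestId,
      },
    },
  };
}
```

Note that `toISOString()` includes milliseconds (for example `2024-01-15T10:25:00.000Z`), which is still valid ISO 8601 but slightly more verbose than the sample response.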
Operational Considerations
Key operational aspects with implementation details and best practices
Monitoring & Observability
- Saga execution time (P50, P95, P99)
- Compensation rate and success rate
- Event processing latency
- Service health checks and dependencies
- Database connection pool status
- Message queue depth and throughput
// Prometheus Metrics
const { Counter, Histogram } = require('prom-client');

const metrics = {
  // Duration per saga type and outcome
  saga_duration: new Histogram({
    name: 'saga_duration_seconds',
    help: 'Saga execution duration',
    labelNames: ['saga_type', 'status'],
    buckets: [0.1, 0.5, 1, 2, 5, 10, 30]
  }),
  // Keep 'reason' to a small fixed set of values to avoid high label cardinality
  compensation_rate: new Counter({
    name: 'compensation_total',
    help: 'Total compensation operations',
    labelNames: ['saga_type', 'step', 'reason']
  })
};
Failure Handling & Resilience
- Exponential backoff retry logic
- Circuit breaker pattern implementation
- Dead letter queues for failed events
- Manual intervention procedures
- Graceful degradation strategies
- Rollback and compensation strategies
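// Retry Logic with Exponential Backoff
class RetryHandler {
  // maxRetries is the total number of attempts, including the first one
  async executeWithRetry(operation, maxRetries = 3) {
    for (let attempt = 1; attempt <= maxRetries; attempt++) {
      try {
        return await operation();
      } catch (error) {
        // Out of attempts: surface the last error to the caller
        if (attempt === maxRetries) throw error;
        // Delay doubles per attempt: 2s, 4s, 8s, ...
        // With the default of 3 attempts only the 2s and 4s waits occur
        const delay = Math.pow(2, attempt) * 1000;
        await new Promise(resolve => setTimeout(resolve, delay));
      }
    }
  }
}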
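The circuit breaker item in the list above complements retries: once a dependency is clearly down, further calls are rejected immediately instead of waiting through more backoff cycles. The class below is a minimal sketch under stated assumptions; the name `CircuitBreaker`, its options, and the default thresholds are illustrative, and it does not limit concurrent trial calls in the half-open state.

```javascript
class CircuitBreaker {
  constructor({ failureThreshold = 5, resetTimeoutMs = 30000, now = () => Date.now() } = {}) {
    this.failureThreshold = failureThreshold; // consecutive failures before opening
    this.resetTimeoutMs = resetTimeoutMs;     // how long to stay open before a trial call
    this.now = now;                           // injectable clock for testing
    this.state = 'closed';
    this.failures = 0;
    this.openedAt = 0;
  }

  async execute(operation) {
    if (this.state === 'open') {
      if (this.now() - this.openedAt < this.resetTimeoutMs) {
        // Fail fast without touching the dependency.
        throw new Error('Circuit open: call rejected');
      }
      this.state = 'half-open'; // timeout elapsed: allow a trial call
    }
    try {
      const result = await operation();
      // Any success closes the circuit and clears the failure count.
      this.state = 'closed';
      this.failures = 0;
      return result;
    } catch (error) {
      this.failures += 1;
      // A failed trial call, or too many consecutive failures, opens the circuit.
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      throw error;
    }
  }
}
```

A typical arrangement is one breaker per downstream service (payment, inventory), with the retry handler wrapped inside the breaker so that exhausted retries count as a single failure.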
Deployment & Infrastructure
Container Orchestration
- Kubernetes deployment with health checks
- Horizontal pod autoscaling (HPA)
- Resource limits and requests
- Rolling updates with zero downtime
Database Management
- Connection pooling configuration
- Read replicas for analytics
- Automated backups and point-in-time recovery
- Performance tuning and query optimization
Results & Metrics
Expected outcomes and success metrics