Building a Fault-Tolerant Order Processing System

Design a system that processes orders reliably even when components fail, using the saga pattern and event sourcing.

15 min read

Alex Chen

2024-01-15

distributed-systems

resilience

event-sourcing

Problem Statement

Design a fault-tolerant order processing system that can handle failures gracefully while maintaining data consistency across multiple services.

Context: E-commerce platform handling 10,000+ orders/day with a 99.9% uptime requirement.

Architecture Overview

The system uses the saga pattern with event sourcing to ensure reliable order processing.

Order processing flow with saga pattern and event sourcing

Saga Flow

Step-by-step execution of the order processing saga

Saga execution with compensation handling

Data Model

Core entities and their relationships with detailed schema definitions

Entity relationships for order processing

Orders Table Schema

CREATE TABLE orders (
  order_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id UUID NOT NULL REFERENCES users(user_id),
  status VARCHAR(50) NOT NULL DEFAULT 'pending',
  total_amount DECIMAL(10,2) NOT NULL,
  currency VARCHAR(3) NOT NULL DEFAULT 'USD',
  shipping_address JSONB NOT NULL,
  billing_address JSONB NOT NULL,
  order_items JSONB NOT NULL,
  metadata JSONB,
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
  updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
  expires_at TIMESTAMP WITH TIME ZONE,
  CONSTRAINT valid_status CHECK (status IN ('pending', 'processing', 'confirmed', 'cancelled', 'failed'))
);

-- Indexes for performance
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_status ON orders(status);
CREATE INDEX idx_orders_created_at ON orders(created_at);
CREATE INDEX idx_orders_status_created ON orders(status, created_at);

Events Table Schema

CREATE TABLE events (
  event_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  aggregate_id UUID NOT NULL,
  aggregate_type VARCHAR(100) NOT NULL,
  event_type VARCHAR(100) NOT NULL,
  event_data JSONB NOT NULL,
  event_version INTEGER NOT NULL DEFAULT 1,
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
  correlation_id UUID,
  causation_id UUID,
  metadata JSONB
);

-- Indexes for event sourcing queries
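CREATE INDEX idx_events_aggregate ON events(aggregate_id, aggregate_type);
-- One row per (aggregate, version): concurrent writers conflict on append instead of silently overwriting history
CREATE UNIQUE INDEX idx_events_aggregate_version ON events(aggregate_id, event_version);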
CREATE INDEX idx_events_type ON events(event_type);
CREATE INDEX idx_events_created_at ON events(created_at);
CREATE INDEX idx_events_correlation ON events(correlation_id);
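
Current order state is derived from the rows in this table rather than stored directly. A minimal sketch of that replay, assuming the events for one `aggregate_id` have already been loaded and that the event types match the saga steps in this blueprint (`order_confirmed` and `order_cancelled` are hypothetical names, not defined elsewhere in this document):

```javascript
// Hypothetical reducer: folds one aggregate's events into the current order state.
// 'order_confirmed' and 'order_cancelled' are assumed event names for illustration.
function applyEvent(state, event) {
  switch (event.event_type) {
    case 'order_created':
      return { ...event.event_data, status: 'pending', version: event.event_version };
    case 'payment_reserved':
    case 'inventory_reserved':
      return { ...state, status: 'processing', version: event.event_version };
    case 'order_confirmed':
      return { ...state, status: 'confirmed', version: event.event_version };
    case 'order_cancelled':
      return { ...state, status: 'cancelled', version: event.event_version };
    default:
      // Unknown event types are skipped so older readers tolerate newer events.
      return state;
  }
}

function replayOrder(events) {
  // Sort defensively by version; rows would normally be queried in this order.
  const ordered = [...events].sort((a, b) => a.event_version - b.event_version);
  return ordered.reduce((state, event) => applyEvent(state, event), null);
}
```

Replaying an empty event list returns `null`, which a caller can treat as "order does not exist".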

Key Decisions & Trade-offs

Analysis of architectural decisions and their implications

Event Sourcing

Pros: Full audit trail, temporal queries, replay capability

Cons: Eventual consistency, complexity, storage overhead

Chosen because: Compliance requirements and need for complete audit trail
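
As a rough illustration of the "temporal queries" benefit, the sketch below answers "what was the last thing that happened to this order as of a given time?" from an in-memory list of event rows. The function name is hypothetical; the field names (`created_at`, `event_version`, `event_type`) follow the events table schema.

```javascript
// Returns the type of the latest event recorded at or before `asOf`,
// or null if the aggregate had no events yet at that time.
function lastEventAsOf(events, asOf) {
  const cutoff = new Date(asOf).getTime();
  const visible = events
    .filter((event) => new Date(event.created_at).getTime() <= cutoff)
    .sort((a, b) => a.event_version - b.event_version);
  return visible.length > 0 ? visible[visible.length - 1].event_type : null;
}
```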

Saga Pattern

Pros: Handles distributed transactions, maintains consistency

Cons: Complex compensation logic, potential for endless retry loops when a compensation keeps failing

Chosen because: Business requirement that an order either completes across all services or is fully compensated
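
One way to keep compensation logic manageable is to declare, for each forward step, the action that undoes it, and to compute the plan from that table. The sketch below is an assumption about how this could look; the action names (`cancel_order`, `release_payment`, `release_inventory`) are hypothetical.

```javascript
// Hypothetical mapping from each forward step to the action that undoes it.
const COMPENSATIONS = {
  order_created: 'cancel_order',
  payment_reserved: 'release_payment',
  inventory_reserved: 'release_inventory',
};

// Builds the compensation plan for the completed steps, newest first.
// Steps listed in `alreadyCompensated` are skipped, so the plan can be
// recomputed after a crash without undoing the same step twice.
function compensationPlan(completedSteps, alreadyCompensated = new Set()) {
  return [...completedSteps]
    .reverse()
    .filter((step) => !alreadyCompensated.has(step))
    .map((step) => {
      if (!Object.prototype.hasOwnProperty.call(COMPENSATIONS, step)) {
        throw new Error(`No compensation defined for step: ${step}`);
      }
      return { step, action: COMPENSATIONS[step] };
    });
}
```

Tracking which steps were already compensated is what bounds the retries: a step that has been undone never re-enters the plan.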

Implementation Guide

Step-by-step implementation approach with code examples

Set up Event Store

Configure event sourcing infrastructure with proper event schemas and versioning

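// Event Store Configuration
// EventStore stands in for whichever event store client the project uses.
const eventStore = new EventStore({
  connection: {
    host: process.env.EVENT_STORE_HOST,
    // Environment variables are strings, so convert the port to a number.
    port: Number(process.env.EVENT_STORE_PORT),
    database: 'events'
  },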
  schema: {
    events: {
      order_created: {
        version: '1.0',
        fields: ['orderId', 'userId', 'items', 'totalAmount', 'timestamp']
      },
      payment_reserved: {
        version: '1.0',
        fields: ['orderId', 'paymentId', 'amount', 'status', 'timestamp']
      },
      inventory_reserved: {
        version: '1.0',
        fields: ['orderId', 'productId', 'quantity', 'reservationId', 'timestamp']
      }
    }
  }
});
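
The field lists in this configuration can also be checked before an event is appended. A minimal sketch, assuming a standalone helper (`missingFields` is a hypothetical name, not part of any event store client):

```javascript
// Field lists copied from the event store configuration in this section.
const EVENT_FIELDS = {
  order_created: ['orderId', 'userId', 'items', 'totalAmount', 'timestamp'],
  payment_reserved: ['orderId', 'paymentId', 'amount', 'status', 'timestamp'],
  inventory_reserved: ['orderId', 'productId', 'quantity', 'reservationId', 'timestamp'],
};

// Returns the required fields that are missing from `payload`.
// Throws for event types that have no registered schema.
function missingFields(eventType, payload) {
  if (!Object.prototype.hasOwnProperty.call(EVENT_FIELDS, eventType)) {
    throw new Error(`Unknown event type: ${eventType}`);
  }
  return EVENT_FIELDS[eventType].filter((field) => payload[field] === undefined);
}
```

An empty result means the payload carries every declared field; rejecting anything else at write time keeps malformed events out of the log, where they cannot be edited later.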

Implement Saga Coordinator

Create saga orchestration logic with compensation handling and retry mechanisms

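// Saga Coordinator Implementation
// crypto.randomUUID is built into Node.js. Saga and the step methods
// (createOrder, reservePayment, reserveInventory, confirmOrder,
// executeCompensation) are assumed to be defined elsewhere in the service.
const { randomUUID: uuid } = require('crypto');

class OrderSagaCoordinator {
  async executeOrderSaga(orderData) {
    const sagaId = uuid();
    const saga = new Saga(sagaId, 'order_processing');

    try {
      // Step 1: Create Order
      const order = await this.createOrder(orderData);
      saga.addStep('order_created', order.id);

      // Step 2: Reserve Payment
      const payment = await this.reservePayment(order.id, orderData.payment);
      saga.addStep('payment_reserved', payment.id);

      // Step 3: Reserve Inventory
      const inventory = await this.reserveInventory(order.id, orderData.items);
      saga.addStep('inventory_reserved', inventory.id);

      // Step 4: Confirm Order
      await this.confirmOrder(order.id);
      saga.complete();

      return { success: true, orderId: order.id };
    } catch (error) {
      // Undo completed steps, then rethrow the original failure.
      const failedCompensations = await this.compensate(saga);
      if (failedCompensations.length > 0) {
        // Surface steps that could not be undone so they can be handled manually.
        error.failedCompensations = failedCompensations;
      }
      throw error;
    }
  }

  // Runs compensations in reverse order of completion and returns the
  // steps whose compensation failed.
  async compensate(saga) {
    const steps = saga.getCompletedSteps();
    const failed = [];

    for (let i = steps.length - 1; i >= 0; i--) {
      try {
        await this.executeCompensation(steps[i]);
      } catch (compensationError) {
        // Continue so one failed compensation does not block the others.
        failed.push({ step: steps[i], error: compensationError });
      }
    }
    return failed;
  }
}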
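
The coordinator relies on a `Saga` object that this blueprint does not define. The following is a minimal in-memory sketch of what such a class might look like, covering only the three methods the coordinator calls; a production version would presumably persist each step so that compensation can resume after a restart.

```javascript
// Hypothetical in-memory Saga. State is lost on restart, so this is only
// suitable for illustrating the coordinator's contract.
class Saga {
  constructor(id, type) {
    this.id = id;
    this.type = type;
    this.status = 'running';
    this.steps = [];
  }

  // Records a completed forward step and the resource it produced.
  addStep(name, resourceId) {
    if (this.status !== 'running') {
      throw new Error(`Cannot add a step to a ${this.status} saga`);
    }
    this.steps.push({ name, resourceId, completedAt: new Date().toISOString() });
  }

  // Returns a copy so callers cannot mutate the saga's internal list.
  getCompletedSteps() {
    return [...this.steps];
  }

  complete() {
    this.status = 'completed';
  }
}
```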

Add Monitoring & Alerting

Implement comprehensive monitoring for saga execution, failures, and performance metrics

// Monitoring Configuration (assumes the prom-client package)
const { Histogram, Counter } = require('prom-client');

const monitoring = {
  metrics: {
    sagaExecutionTime: new Histogram({
      name: 'saga_execution_time_seconds',
      help: 'Time taken to execute complete saga',
      buckets: [0.1, 0.5, 1, 2, 5, 10]
    }),
    sagaSuccessRate: new Counter({
      name: 'saga_success_total',
      help: 'Total successful saga executions'
    }),
    sagaFailureRate: new Counter({
      name: 'saga_failure_total',
      help: 'Total failed saga executions'
    })
  },
  
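  // Conditions are written as PromQL over the counters and histogram above.
  alerts: {
    highFailureRate: {
      // More than 5% of sagas failing over the last 5 minutes
      condition: 'sum(rate(saga_failure_total[5m])) / (sum(rate(saga_failure_total[5m])) + sum(rate(saga_success_total[5m]))) > 0.05',
      severity: 'critical',
      message: 'Saga failure rate is above threshold'
    },
    slowExecution: {
      // 95th percentile above 5 seconds
      condition: 'histogram_quantile(0.95, sum by (le) (rate(saga_execution_time_seconds_bucket[5m]))) > 5',
      severity: 'warning',
      message: 'Saga execution time is degrading'
    }
  }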
};
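
These metrics only help if every saga run updates them. A minimal sketch of a wrapper that does so, assuming `metrics` exposes objects with `observe(seconds)` and `inc()` methods under the same keys as the configuration above; `withSagaMetrics` is a hypothetical helper name:

```javascript
// Runs `runSaga`, counts the outcome, and records the duration in seconds.
// The wrapper only relies on observe() and inc(), so any metrics objects
// with those methods can be passed in.
async function withSagaMetrics(metrics, runSaga) {
  const startedAt = process.hrtime.bigint();
  try {
    const result = await runSaga();
    metrics.sagaSuccessRate.inc();
    return result;
  } catch (error) {
    metrics.sagaFailureRate.inc();
    throw error;
  } finally {
    // Duration is recorded for both successful and failed runs.
    const elapsedNs = process.hrtime.bigint() - startedAt;
    metrics.sagaExecutionTime.observe(Number(elapsedNs) / 1e9);
  }
}
```

A call site might look like `withSagaMetrics(monitoring.metrics, () => coordinator.executeOrderSaga(orderData))`.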

API Design & Endpoints

RESTful API design with example requests, responses, and error handling

Create Order Endpoint

POST /api/v1/orders
Content-Type: application/json
Authorization: Bearer {token}

{
  "user_id": "uuid",
  "items": [
    {
      "product_id": "uuid",
      "quantity": 2,
      "unit_price": 29.99
    }
  ],
  "shipping_address": {
    "street": "123 Main St",
    "city": "New York",
    "state": "NY",
    "zip": "10001",
    "country": "US"
  },
  "payment_method": {
    "type": "credit_card",
    "token": "pm_1234567890"
  }
}

// Response
{
  "order_id": "uuid",
  "status": "processing",
  "estimated_completion": "2024-01-15T10:30:00Z",
  "saga_id": "uuid",
  "links": {
    "self": "/api/v1/orders/{order_id}",
    "status": "/api/v1/orders/{order_id}/status"
  }
}
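
Before a saga is started, the request body would normally be validated so that malformed orders fail fast with `INVALID_ORDER_DATA`. The sketch below shows one possible set of checks against the request shape above. The specific rules and both function names are assumptions, not part of the blueprint.

```javascript
// Returns a list of problems; an empty list means the body looks valid.
function validateCreateOrder(body) {
  if (body === null || typeof body !== 'object') {
    return ['request body must be a JSON object'];
  }
  const problems = [];
  if (typeof body.user_id !== 'string' || body.user_id === '') {
    problems.push('user_id is required');
  }
  if (!Array.isArray(body.items) || body.items.length === 0) {
    problems.push('items must be a non-empty array');
  } else {
    body.items.forEach((item, index) => {
      if (!Number.isInteger(item?.quantity) || item.quantity < 1) {
        problems.push(`items[${index}].quantity must be a positive integer`);
      }
      if (!Number.isFinite(item?.unit_price) || item.unit_price < 0) {
        problems.push(`items[${index}].unit_price must be a non-negative number`);
      }
    });
  }
  if (!body.shipping_address) problems.push('shipping_address is required');
  if (!body.payment_method?.token) problems.push('payment_method.token is required');
  return problems;
}

// Sums line totals in integer cents to avoid floating-point drift.
function totalInCents(items) {
  return items.reduce(
    (sum, item) => sum + Math.round(item.unit_price * 100) * item.quantity,
    0
  );
}
```

Working in integer cents matters here because the request carries prices as JSON numbers, while the `orders` table stores `total_amount` as an exact `DECIMAL(10,2)`.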

Order Status Endpoint

GET /api/v1/orders/{order_id}/status
Authorization: Bearer {token}

// Response
{
  "order_id": "uuid",
  "status": "processing",
  "current_step": "inventory_reserved",
  "completed_steps": [
    "order_created",
    "payment_reserved"
  ],
  "pending_steps": [
    "inventory_reserved",
    "shipping_created"
  ],
  "estimated_completion": "2024-01-15T10:30:00Z",
  "last_updated": "2024-01-15T10:25:00Z"
}
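
A status payload like this can be derived from the list of steps a saga has completed. A minimal sketch, assuming the step names and their order are the ones that appear in the example response; `buildStatusResponse` is a hypothetical helper:

```javascript
// Ordered saga steps, taken from the example status response.
const SAGA_STEPS = ['order_created', 'payment_reserved', 'inventory_reserved', 'shipping_created'];

// Derives the step-related fields of the status payload from completed steps.
function buildStatusResponse(orderId, completedSteps) {
  const completed = SAGA_STEPS.filter((step) => completedSteps.includes(step));
  const pending = SAGA_STEPS.filter((step) => !completedSteps.includes(step));
  return {
    order_id: orderId,
    // The order is only confirmed once no steps remain.
    status: pending.length === 0 ? 'confirmed' : 'processing',
    current_step: pending.length > 0 ? pending[0] : null,
    completed_steps: completed,
    pending_steps: pending,
  };
}
```

Failed and cancelled orders are left out of this sketch; they would need the saga's own status in addition to the step list.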

Error Handling

// Error Response Format
{
  "error": {
    "code": "PAYMENT_FAILED",
    "message": "Order processing failed during payment reservation",
    "details": {
      "step": "payment_reserved",
      "reason": "Insufficient funds",
      "saga_id": "uuid",
      "compensation_required": true
    },
    "timestamp": "2024-01-15T10:25:00Z",
    "request_id": "uuid"
  }
}

// Common Error Codes
const ERROR_CODES = {
  INVALID_ORDER_DATA: 'INVALID_ORDER_DATA',
  PAYMENT_FAILED: 'PAYMENT_FAILED',
  INVENTORY_UNAVAILABLE: 'INVENTORY_UNAVAILABLE',
  SAGA_TIMEOUT: 'SAGA_TIMEOUT',
  COMPENSATION_FAILED: 'COMPENSATION_FAILED'
};
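
To keep error payloads consistent across endpoints, the response can be assembled in one place. The sketch below builds the error format shown above and pairs each code with an HTTP status. The status mapping is an assumption for illustration, since the blueprint does not specify one.

```javascript
// Assumed mapping from error code to HTTP status; not specified by the blueprint.
const HTTP_STATUS_BY_CODE = {
  INVALID_ORDER_DATA: 400,
  PAYMENT_FAILED: 402,
  INVENTORY_UNAVAILABLE: 409,
  SAGA_TIMEOUT: 504,
  COMPENSATION_FAILED: 500,
};

// Builds the error body in the documented format plus the status to send.
function buildErrorResponse(code, message, details, requestId, now = new Date()) {
  if (!Object.prototype.hasOwnProperty.call(HTTP_STATUS_BY_CODE, code)) {
    throw new Error(`Unknown error code: ${code}`);
  }
  return {
    httpStatus: HTTP_STATUS_BY_CODE[code],
    body: {
      error: {
        code,
        message,
        details,
        timestamp: now.toISOString(),
        request_id: requestId,
      },
    },
  };
}
```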

Operational Considerations

Key operational aspects with implementation details and best practices

Monitoring & Observability

• Saga execution time (P50, P95, P99)
• Compensation rate and success rate
• Event processing latency
• Service health checks and dependencies
• Database connection pool status
• Message queue depth and throughput

// Prometheus Metrics (assumes the prom-client package)
const { Histogram, Counter } = require('prom-client');

const metrics = {
  saga_duration: new Histogram({
    name: 'saga_duration_seconds',
    help: 'Saga execution duration',
    labelNames: ['saga_type', 'status'],
    buckets: [0.1, 0.5, 1, 2, 5, 10, 30]
  }),
  compensation_rate: new Counter({
    name: 'compensation_total',
    help: 'Total compensation operations',
    labelNames: ['saga_type', 'step', 'reason']
  })
};

Failure Handling & Resilience

• Exponential backoff retry logic
• Circuit breaker pattern implementation
• Dead letter queues for failed events
• Manual intervention procedures
• Graceful degradation strategies
• Rollback and compensation strategies

// Retry Logic with Exponential Backoff
class RetryHandler {
  async executeWithRetry(operation, maxRetries = 3) {
    for (let attempt = 1; attempt <= maxRetries; attempt++) {
      try {
        return await operation();
      } catch (error) {
        if (attempt === maxRetries) throw error;
        
        const delay = Math.pow(2, attempt) * 1000; // 2s, then 4s; the final attempt rethrows without waiting
        await new Promise(resolve => setTimeout(resolve, delay));
      }
    }
  }
}
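
Retries alone can pile load onto a dependency that is already failing, which is where the circuit breaker from the list above comes in. The following is a minimal sketch of the pattern, not a hardened implementation; the class shape, option names, and defaults are assumptions.

```javascript
// Minimal circuit breaker: opens after `failureThreshold` consecutive failures,
// rejects calls while open, and lets calls through again once `resetTimeoutMs`
// has elapsed. A failure during that trial period reopens the circuit.
class CircuitBreaker {
  constructor({ failureThreshold = 5, resetTimeoutMs = 30000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now; // injectable clock, useful in tests
    this.failures = 0;
    this.openedAt = null; // null means the circuit is closed
  }

  get state() {
    if (this.openedAt === null) return 'closed';
    return this.now() - this.openedAt >= this.resetTimeoutMs ? 'half-open' : 'open';
  }

  async call(operation) {
    if (this.state === 'open') {
      throw new Error('Circuit open: call rejected without invoking the operation');
    }
    try {
      const result = await operation();
      // Any success closes the circuit and clears the failure count.
      this.failures = 0;
      this.openedAt = null;
      return result;
    } catch (error) {
      this.failures += 1;
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.openedAt = this.now(); // open (or reopen) and restart the cool-down
      }
      throw error;
    }
  }
}
```

In the saga, a breaker like this would typically wrap each downstream call (payment, inventory) so that a struggling service is given time to recover while orders fail fast and compensate.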

Deployment & Infrastructure

Container Orchestration

• Kubernetes deployment with health checks
• Horizontal pod autoscaling (HPA)
• Resource limits and requests
• Rolling updates with zero downtime

Database Management

• Connection pooling configuration
• Read replicas for analytics
• Automated backups and point-in-time recovery
• Performance tuning and query optimization

Results & Metrics

Expected outcomes and success metrics

• Uptime: 99.9%
• Order Processing Time: <500ms
• Order Failure Rate: 0.1%
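
To make these targets concrete, the short calculation below turns them into budgets, using the 10,000 orders/day figure from the problem context. The 30-day window is an assumption chosen for illustration.

```javascript
// Inputs taken from the Context and Results sections of this blueprint.
const ordersPerDay = 10000;
const uptimeTarget = 0.999;
const failureRate = 0.001;

const minutesPer30Days = 30 * 24 * 60; // 43,200 minutes
const downtimeBudgetMinutes = minutesPer30Days * (1 - uptimeTarget);
const failedOrdersPerDay = ordersPerDay * failureRate;
const averageOrdersPerSecond = ordersPerDay / (24 * 60 * 60);

console.log(downtimeBudgetMinutes.toFixed(1));  // 43.2 minutes of downtime per 30 days
console.log(failedOrdersPerDay.toFixed(0));     // 10 failed orders per day
console.log(averageOrdersPerSecond.toFixed(3)); // 0.116 orders per second on average
```

The average rate is low, so capacity planning for this workload is more likely driven by peak traffic and downstream latency than by sustained throughput.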
