AI-Proxy-Worker/docs/Monitoring.en.md
2025-08-17 19:59:10 +08:00

Monitoring Guide

🌍 Language: 🇺🇸 English | 🇨🇳 中文

Learn how to monitor your AI Proxy Worker deployment, track performance, and troubleshoot issues using Cloudflare's built-in monitoring tools.

📊 Cloudflare Dashboard Monitoring

Worker Analytics

Access real-time metrics in your Cloudflare dashboard:

  1. Navigate to Workers & Pages
  2. Select your AI Proxy Worker
  3. View Analytics tab

Key Metrics to Monitor

Request Metrics

  • Requests per second - Traffic volume
  • Success rate - Percentage of successful requests
  • Error rate - Failed requests requiring attention
  • Response time - Average latency

Resource Usage

  • CPU usage - Worker execution time
  • Memory usage - Memory consumption per request
  • Duration - Request processing time

Error Analysis

  • 4xx errors - Client-side issues (authentication, validation)
  • 5xx errors - Server-side problems (upstream API issues)
  • Timeout errors - Requests exceeding time limits
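The three categories above can be separated in code with a small helper; this is a sketch (the function name is illustrative), and the ranges follow standard HTTP status semantics. Timeouts usually surface as a thrown `AbortError` rather than a status code, so they are handled by the caller's `catch` block instead:

```javascript
// Map an upstream response status to the error categories above.
function classifyStatus(status) {
  if (status >= 200 && status < 400) return 'ok';
  if (status >= 400 && status < 500) return 'client_error';   // auth, validation
  if (status >= 500) return 'upstream_error';                 // upstream API issues
  return 'unknown';
}
```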

🔍 Log Analysis

Accessing Logs

# View real-time logs
wrangler tail

# Show only failed invocations
wrangler tail --status error

# Filter log lines by level (tail events nest log entries under .logs[])
wrangler tail --format json | jq 'select(.logs[]?.level == "error")'

# Save logs to file
wrangler tail --format json > worker-logs.json

Log Levels and Meanings

INFO Level

console.log('Request received:', {
  method: request.method,
  url: request.url,
  timestamp: new Date().toISOString()
});

WARN Level

console.warn('Rate limit approaching:', {
  clientId: 'user123',
  requestCount: 95,
  limit: 100
});

ERROR Level

console.error('API request failed:', {
  error: error.message,
  statusCode: response.status,
  timestamp: new Date().toISOString()
});

📈 Performance Monitoring

Response Time Tracking

Monitor these key performance indicators:

// Custom timing logs
const start = Date.now();
const response = await fetch(upstreamAPI);
const duration = Date.now() - start;

console.log('Upstream API timing:', {
  duration: duration,
  endpoint: 'deepseek-api',
  status: response.status
});

Target latencies:

  • Chat requests: < 2 seconds
  • Streaming responses: first token < 1 second
  • Health checks: < 500ms

Performance Optimization Monitoring

Track these metrics to identify optimization opportunities:

  1. Cold start frequency - Worker initialization time
  2. Memory usage patterns - Identify memory leaks
  3. CPU utilization - Optimize heavy computations
  4. Network latency - Upstream API response times
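Cold starts (item 1) and upstream latency (item 4) can both be captured with a module-scope flag and a timing wrapper. A minimal sketch, assuming you route upstream calls through a helper (`isColdStart` and `timedFetch` are illustrative names, not part of the worker):

```javascript
// A fresh isolate initializes module scope once, so the first request
// after a cold start observes isColdStart === true.
let isColdStart = true;

async function timedFetch(url, init) {
  const start = Date.now();
  const response = await fetch(url, init);
  console.log('Upstream timing:', {
    url,
    duration: Date.now() - start,
    status: response.status,
    coldStart: isColdStart
  });
  isColdStart = false;
  return response;
}
```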

🚨 Alert Configuration

Cloudflare Alerts

Set up alerts for critical issues:

Error Rate Alert

Alert Type: Worker Error Rate
Threshold: > 5% error rate
Time Period: 5 minutes
Notification: Email, Webhook

Response Time Alert

Alert Type: Worker Response Time
Threshold: > 3 seconds average
Time Period: 5 minutes
Notification: Email, Slack

Request Volume Alert

Alert Type: Request Volume
Threshold: > 1000 requests/minute
Time Period: 1 minute
Notification: Email

Custom Alert Implementation

// In your worker code
const ALERT_THRESHOLDS = {
  ERROR_RATE: 0.05,  // 5%
  RESPONSE_TIME: 3000,  // 3 seconds
  REQUEST_RATE: 1000   // 1000/minute
};

async function checkAlerts(metrics) {
  if (metrics.errorRate > ALERT_THRESHOLDS.ERROR_RATE) {
    await sendAlert('High error rate detected', metrics);
  }
}
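`sendAlert` is left undefined in the snippet above; one plausible implementation posts the alert to a webhook configured through an environment binding (the `ALERT_WEBHOOK_URL` name and the extra `env` parameter are assumptions, not part of the original code):

```javascript
// Post an alert payload to a configured webhook (e.g. a Slack incoming webhook).
// ALERT_WEBHOOK_URL is an assumed environment binding; adjust to your setup.
async function sendAlert(message, metrics, env) {
  if (!env.ALERT_WEBHOOK_URL) return;
  await fetch(env.ALERT_WEBHOOK_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      message,
      metrics,
      timestamp: new Date().toISOString()
    })
  });
}
```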

📋 Health Checks

Endpoint Monitoring

Create a health check endpoint:

// Add to your worker
if (url.pathname === '/health') {
  const healthStatus = await checkSystemHealth();
  return new Response(JSON.stringify(healthStatus), {
    headers: { 'Content-Type': 'application/json' }
  });
}

async function checkSystemHealth() {
  return {
    status: 'healthy',
    timestamp: new Date().toISOString(),
    version: '1.0.0',
    upstreamAPIs: {
      deepseek: await checkDeepSeekAPI()
    }
  };
}
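`checkDeepSeekAPI` is not defined above; a sketch that probes the upstream API with a timeout, reporting reachability without failing the whole health check (the probe URL and the 5-second budget are assumptions):

```javascript
// Probe the upstream API; a down upstream yields reachable: false
// rather than an exception bubbling out of the health check.
async function checkDeepSeekAPI() {
  try {
    const response = await fetch('https://api.deepseek.com/models', {
      signal: AbortSignal.timeout(5000)  // assumed 5s budget
    });
    return { reachable: true, status: response.status };
  } catch (err) {
    return { reachable: false, error: err.message };
  }
}
```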

External Monitoring Services

Integrate with external monitoring tools:

Uptime Robot

# Monitor endpoint
https://your-worker.workers.dev/health

# Check every 5 minutes
# Alert on 3 consecutive failures

Pingdom

# HTTP check configuration
URL: https://your-worker.workers.dev/health
Interval: 1 minute
Timeout: 30 seconds

🔧 Debugging Tools

Debug Mode

Enable detailed logging for troubleshooting:

const DEBUG = env.DEBUG_MODE === 'true';

if (DEBUG) {
  // Caution: headers can carry API keys - redact Authorization before logging
  console.log('Debug: Request details:', {
    headers: Object.fromEntries(request.headers),
    body: await request.clone().text(),
    timestamp: new Date().toISOString()
  });
}

Request Tracing

Track requests through the system:

function generateTraceId() {
  // crypto.randomUUID() is available in Workers and avoids the collision
  // risk of Math.random()-based IDs
  return crypto.randomUUID();
}

const traceId = generateTraceId();
const start = Date.now();
console.log('Request started:', { traceId, url: request.url });

// Pass traceId through all function calls
const result = await processRequest(request, { traceId });

console.log('Request completed:', { traceId, duration: Date.now() - start });

📊 Custom Metrics

Business Metrics

Track application-specific metrics:

// Track model usage
const modelUsage = {
  'deepseek-chat': 0,
  'deepseek-reasoner': 0
};

// Track user activity
const userMetrics = {
  activeUsers: new Set(),
  totalRequests: 0,
  streamingRequests: 0
};

// Workers are not long-lived processes, so a global-scope setInterval
// won't fire reliably, and module-scope counters are per-isolate and
// reset on cold starts. Emit metrics from a Cron Trigger instead:
export default {
  async scheduled(event, env, ctx) {
    console.log('Business metrics:', {
      modelUsage,
      userMetrics: {
        activeUsers: userMetrics.activeUsers.size,
        totalRequests: userMetrics.totalRequests,
        streamingRequests: userMetrics.streamingRequests
      }
    });
  }
};

Cost Tracking

Monitor usage costs:

// Track request costs
const costTracking = {
  totalRequests: 0,
  cpuTime: 0,
  bandwidthUsed: 0
};

// Calculate estimated costs
function calculateCosts(metrics) {
  const workerCost = metrics.totalRequests * 0.0000005; // $0.50 per million
  const cpuCost = metrics.cpuTime * 0.000002; // $2 per million CPU seconds
  
  return {
    workerCost,
    cpuCost,
    totalCost: workerCost + cpuCost
  };
}
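A quick sanity check of `calculateCosts` with round numbers (the rates are copied from the snippet above; verify them against current Cloudflare pricing before relying on the estimate):

```javascript
// Reusing the rates from calculateCosts above:
function calculateCosts(metrics) {
  const workerCost = metrics.totalRequests * 0.0000005; // $0.50 per million requests
  const cpuCost = metrics.cpuTime * 0.000002;           // $2 per million CPU seconds
  return { workerCost, cpuCost, totalCost: workerCost + cpuCost };
}

// 1M requests and 500k CPU-seconds: $0.50 + $1.00 = $1.50
const costs = calculateCosts({ totalRequests: 1_000_000, cpuTime: 500_000 });
console.log(costs.totalCost.toFixed(2)); // prints "1.50"
```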

🔍 Log Analysis Best Practices

Structured Logging

Use consistent log formats:

function logEvent(level, event, data) {
  const logEntry = {
    level,
    event,
    timestamp: new Date().toISOString(),
    workerId: env.WORKER_ID || 'unknown',
    ...data
  };
  
  console[level](JSON.stringify(logEntry));
}

// Usage
logEvent('info', 'request_received', { method, url });
logEvent('error', 'api_error', { error: err.message, statusCode });

Log Retention

Understand Cloudflare's log retention:

  • Real-time logs (wrangler tail) - Streamed live only, not stored
  • Analytics data - Retained for up to 30 days
  • Custom logging - Use external services for long-term storage

External Log Aggregation

Send logs to external services:

// Fire-and-forget: invoke via ctx.waitUntil(sendToLogService(entry))
// so shipping logs doesn't delay the response
async function sendToLogService(logData) {
  if (env.LOG_SERVICE_URL) {
    await fetch(env.LOG_SERVICE_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(logData)
    });
  }
}

📱 Monitoring Dashboard

Creating Custom Dashboards

Use tools like Grafana or Datadog:

// Send metrics to external service
async function sendMetrics(metrics) {
  if (env.METRICS_ENDPOINT) {
    await fetch(env.METRICS_ENDPOINT, {
      method: 'POST',
      headers: { 
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${env.METRICS_API_KEY}`
      },
      body: JSON.stringify({
        service: 'ai-proxy-worker',
        timestamp: Date.now(),
        metrics
      })
    });
  }
}

Key Dashboard Widgets

  1. Request Volume - Line chart showing requests over time
  2. Error Rate - Percentage gauge with threshold alerts
  3. Response Time - Histogram showing latency distribution
  4. Model Usage - Pie chart showing model usage breakdown
  5. Geographic Distribution - Map showing request origins
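The error-rate and latency widgets need aggregated numbers rather than raw events. A minimal aggregation over request samples, assuming each sample has the shape `{ status, duration }` (that shape and the nearest-rank p95 formula are choices for this sketch):

```javascript
// Aggregate raw request samples into the numbers the widgets plot.
function aggregate(samples) {
  const errors = samples.filter(s => s.status >= 400).length;
  const sorted = samples.map(s => s.duration).sort((a, b) => a - b);
  // Nearest-rank 95th percentile
  const p95 = sorted.length ? sorted[Math.ceil(sorted.length * 0.95) - 1] : 0;
  return {
    requestCount: samples.length,
    errorRate: samples.length ? errors / samples.length : 0,
    p95LatencyMs: p95
  };
}
```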

🚨 Incident Response

Incident Detection

Automated monitoring should detect:

  • High error rates (>5%)
  • Slow response times (>3s average)
  • Service unavailability
  • Unusual traffic patterns

Response Procedures

  1. Immediate: Check Cloudflare status page
  2. Investigate: Review recent deployments and logs
  3. Mitigate: Roll back if necessary
  4. Communicate: Update status page and notify users
  5. Resolve: Fix root cause
  6. Post-mortem: Document lessons learned

Emergency Contacts

Maintain an escalation list:

  • Primary: On-call engineer
  • Secondary: Team lead
  • Escalation: Infrastructure team

Effective monitoring ensures reliable service 📊

Regular monitoring helps you maintain high availability and quickly resolve issues.