AI-Proxy-Worker/docs/Monitoring.en.md

# Monitoring Guide

<div align="center">

**🌍 Language / 语言**

[🇺🇸 English](./Monitoring.en.md) | [🇨🇳 中文](./Monitoring.md)

</div>

Learn how to monitor your AI Proxy Worker deployment, track performance, and troubleshoot issues using Cloudflare's built-in monitoring tools.

## 📊 Cloudflare Dashboard Monitoring

### Worker Analytics
Access real-time metrics in your Cloudflare dashboard:

1. **Navigate to Workers & Pages**
2. **Select your AI Proxy Worker**
3. **View Analytics tab**

### Key Metrics to Monitor

#### Request Metrics
- **Requests per second** - Traffic volume
- **Success rate** - Percentage of successful requests
- **Error rate** - Failed requests requiring attention
- **Response time** - Average latency

#### Resource Usage
- **CPU usage** - Worker execution time
- **Memory usage** - Memory consumption per request
- **Duration** - Request processing time

#### Error Analysis
- **4xx errors** - Client-side issues (authentication, validation)
- **5xx errors** - Server-side problems (upstream API issues)
- **Timeout errors** - Requests exceeding time limits

## 🔍 Log Analysis

### Accessing Logs
```bash
# View real-time logs
wrangler tail

# Filter by specific log level
wrangler tail --format json | jq 'select(.level == "error")'

# Save logs to file
wrangler tail --format json > worker-logs.json
```

### Log Levels and Meanings

#### INFO Level
```javascript
console.log('Request received:', {
  method: request.method,
  url: request.url,
  timestamp: new Date().toISOString()
});
```

#### WARN Level
```javascript
console.warn('Rate limit approaching:', {
  clientId: 'user123',
  requestCount: 95,
  limit: 100
});
```

#### ERROR Level
```javascript
console.error('API request failed:', {
  error: error.message,
  statusCode: response.status,
  timestamp: new Date().toISOString()
});
```

## 📈 Performance Monitoring

### Response Time Tracking
Monitor these key performance indicators:

```javascript
// Custom timing logs
const start = Date.now();
const response = await fetch(upstreamAPI);
const duration = Date.now() - start;

console.log('Upstream API timing:', {
  duration: duration,
  endpoint: 'deepseek-api',
  status: response.status
});
```

### Recommended Response Time Targets
- **Chat requests**: < 2 seconds
- **Streaming responses**: First token < 1 second
- **Health checks**: < 500ms

### Performance Optimization Monitoring
Track these metrics to identify optimization opportunities:

1. **Cold start frequency** - Worker initialization time
2. **Memory usage patterns** - Identify memory leaks
3. **CPU utilization** - Optimize heavy computations
4. **Network latency** - Upstream API response times

## 🚨 Alert Configuration

### Cloudflare Alerts
Set up alerts for critical issues:

#### Error Rate Alert
```yaml
Alert Type: Worker Error Rate
Threshold: > 5% error rate
Time Period: 5 minutes
Notification: Email, Webhook
```

#### Response Time Alert
```yaml
Alert Type: Worker Response Time
Threshold: > 3 seconds average
Time Period: 5 minutes
Notification: Email, Slack
```

#### Request Volume Alert
```yaml
Alert Type: Request Volume
Threshold: > 1000 requests/minute
Time Period: 1 minute
Notification: Email
```

### Custom Alert Implementation
```javascript
// In your worker code
const ALERT_THRESHOLDS = {
  ERROR_RATE: 0.05,  // 5%
  RESPONSE_TIME: 3000,  // 3 seconds
  REQUEST_RATE: 1000   // 1000/minute
};

async function checkAlerts(metrics) {
  if (metrics.errorRate > ALERT_THRESHOLDS.ERROR_RATE) {
    await sendAlert('High error rate detected', metrics);
  }
}
```

## 📋 Health Checks

### Endpoint Monitoring
Create a health check endpoint:

```javascript
// Add to your worker
if (url.pathname === '/health') {
  const healthStatus = await checkSystemHealth();
  return new Response(JSON.stringify(healthStatus), {
    headers: { 'Content-Type': 'application/json' }
  });
}

async function checkSystemHealth() {
  return {
    status: 'healthy',
    timestamp: new Date().toISOString(),
    version: '1.0.0',
    upstreamAPIs: {
      deepseek: await checkDeepSeekAPI()
    }
  };
}
```

### External Monitoring Services
Integrate with external monitoring tools:

#### Uptime Robot
```bash
# Monitor endpoint
https://your-worker.workers.dev/health

# Check every 5 minutes
# Alert on 3 consecutive failures
```

#### Pingdom
```bash
# HTTP check configuration
URL: https://your-worker.workers.dev/health
Interval: 1 minute
Timeout: 30 seconds
```

## 🔧 Debugging Tools

### Debug Mode
Enable detailed logging for troubleshooting:

```javascript
const DEBUG = env.DEBUG_MODE === 'true';

if (DEBUG) {
  console.log('Debug: Request details:', {
    headers: Object.fromEntries(request.headers),
    body: await request.clone().text(),
    timestamp: new Date().toISOString()
  });
}
```

### Request Tracing
Track requests through the system:

```javascript
function generateTraceId() {
  return Math.random().toString(36).substring(2, 15);
}

const traceId = generateTraceId();
console.log('Request started:', { traceId, url: request.url });

// Pass traceId through all function calls
const result = await processRequest(request, { traceId });

console.log('Request completed:', { traceId, duration });
```

## 📊 Custom Metrics

### Business Metrics
Track application-specific metrics:

```javascript
// Track model usage
const modelUsage = {
  'deepseek-chat': 0,
  'deepseek-reasoner': 0
};

// Track user activity
const userMetrics = {
  activeUsers: new Set(),
  totalRequests: 0,
  streamingRequests: 0
};

// Log metrics periodically
setInterval(() => {
  console.log('Business metrics:', {
    modelUsage,
    userMetrics: {
      activeUsers: userMetrics.activeUsers.size,
      totalRequests: userMetrics.totalRequests,
      streamingRequests: userMetrics.streamingRequests
    }
  });
}, 60000); // Every minute
```

### Cost Tracking
Monitor usage costs:

```javascript
// Track request costs
const costTracking = {
  totalRequests: 0,
  cpuTime: 0,
  bandwidthUsed: 0
};

// Calculate estimated costs
function calculateCosts(metrics) {
  const workerCost = metrics.totalRequests * 0.0000005; // $0.50 per million
  const cpuCost = metrics.cpuTime * 0.000002; // $2 per million CPU seconds

  return {
    workerCost,
    cpuCost,
    totalCost: workerCost + cpuCost
  };
}
```

## 🔍 Log Analysis Best Practices

### Structured Logging
Use consistent log formats:

```javascript
function logEvent(level, event, data) {
  const logEntry = {
    level,
    event,
    timestamp: new Date().toISOString(),
    workerId: env.WORKER_ID || 'unknown',
    ...data
  };

  console[level](JSON.stringify(logEntry));
}

// Usage
logEvent('info', 'request_received', { method, url });
logEvent('error', 'api_error', { error: err.message, statusCode });
```

### Log Retention
Understand Cloudflare's log retention:
- **Real-time logs**: Available during development
- **Analytics data**: Retained for 30 days
- **Custom logging**: Use external services for long-term storage

### External Log Aggregation
Send logs to external services:

```javascript
async function sendToLogService(logData) {
  if (env.LOG_SERVICE_URL) {
    await fetch(env.LOG_SERVICE_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(logData)
    });
  }
}
```

## 📱 Monitoring Dashboard

### Creating Custom Dashboards
Use tools like Grafana or Datadog:

```javascript
// Send metrics to external service
async function sendMetrics(metrics) {
  if (env.METRICS_ENDPOINT) {
    await fetch(env.METRICS_ENDPOINT, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${env.METRICS_API_KEY}`
      },
      body: JSON.stringify({
        service: 'ai-proxy-worker',
        timestamp: Date.now(),
        metrics
      })
    });
  }
}
```

### Key Dashboard Widgets
1. **Request Volume** - Line chart showing requests over time
2. **Error Rate** - Percentage gauge with threshold alerts
3. **Response Time** - Histogram showing latency distribution
4. **Model Usage** - Pie chart showing model usage breakdown
5. **Geographic Distribution** - Map showing request origins

## 🚨 Incident Response

### Incident Detection
Automated monitoring should detect:
- High error rates (>5%)
- Slow response times (>3s average)
- Service unavailability
- Unusual traffic patterns

### Response Procedures
1. **Immediate**: Check Cloudflare status page
2. **Investigate**: Review recent deployments and logs
3. **Mitigate**: Roll back if necessary
4. **Communicate**: Update status page and notify users
5. **Resolve**: Fix root cause
6. **Post-mortem**: Document lessons learned

### Emergency Contacts
Maintain an escalation list:
- Primary: On-call engineer
- Secondary: Team lead
- Escalation: Infrastructure team

---

**Effective monitoring ensures reliable service** 📊

Regular monitoring helps you maintain high availability and quickly resolve issues.