# Monitoring Guide
<div align="center">
**🌍 Language / 语言**
[🇺🇸 English](./Monitoring.en.md) | [🇨🇳 中文](./Monitoring.md)
</div>
Learn how to monitor your AI Proxy Worker deployment, track performance, and troubleshoot issues using Cloudflare's built-in monitoring tools.
## 📊 Cloudflare Dashboard Monitoring
### Worker Analytics
Access real-time metrics in your Cloudflare dashboard:
1. **Navigate to Workers & Pages**
2. **Select your AI Proxy Worker**
3. **View Analytics tab**
### Key Metrics to Monitor
#### Request Metrics
- **Requests per second** - Traffic volume
- **Success rate** - Percentage of successful requests
- **Error rate** - Failed requests requiring attention
- **Response time** - Average latency
#### Resource Usage
- **CPU usage** - Worker execution time
- **Memory usage** - Memory consumption per request
- **Duration** - Request processing time
#### Error Analysis
- **4xx errors** - Client-side issues (authentication, validation)
- **5xx errors** - Server-side problems (upstream API issues)
- **Timeout errors** - Requests exceeding time limits
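The same 4xx/5xx split can be cross-checked from inside the Worker by keeping a small counter next to the dashboard view. A minimal sketch (the helper and variable names are illustrative, not part of the existing worker code):
```javascript
// Bucket status codes so dashboard error rates can be cross-checked in logs
const statusBuckets = { '2xx': 0, '4xx': 0, '5xx': 0 };

function recordStatus(status) {
  if (status >= 500) statusBuckets['5xx']++;
  else if (status >= 400) statusBuckets['4xx']++;
  else statusBuckets['2xx']++;
  console.log('Status buckets:', statusBuckets);
}
```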
## 🔍 Log Analysis
### Accessing Logs
```bash
# View real-time logs
wrangler tail
# Filter for events that contain error-level log lines
# (wrangler's JSON events nest log lines under a .logs array)
wrangler tail --format json | jq 'select(.logs[]?.level == "error")'
# Save logs to file
wrangler tail --format json > worker-logs.json
```
### Log Levels and Meanings
#### INFO Level
```javascript
console.log('Request received:', {
  method: request.method,
  url: request.url,
  timestamp: new Date().toISOString()
});
```
#### WARN Level
```javascript
console.warn('Rate limit approaching:', {
  clientId: 'user123',
  requestCount: 95,
  limit: 100
});
```
#### ERROR Level
```javascript
console.error('API request failed:', {
  error: error.message,
  statusCode: response.status,
  timestamp: new Date().toISOString()
});
```
## 📈 Performance Monitoring
### Response Time Tracking
Monitor these key performance indicators:
```javascript
// Custom timing logs
const start = Date.now();
const response = await fetch(upstreamAPI);
const duration = Date.now() - start;
console.log('Upstream API timing:', {
  duration: duration,
  endpoint: 'deepseek-api',
  status: response.status
});
```
### Recommended Response Time Targets
- **Chat requests**: < 2 seconds
- **Streaming responses**: First token < 1 second
- **Health checks**: < 500ms
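These targets can also be enforced in code by logging a warning whenever a request overshoots. A minimal sketch (the threshold values mirror the targets above; the helper name is illustrative):
```javascript
// Target latencies in milliseconds, mirroring the list above
const RESPONSE_TIME_TARGETS = {
  chat: 2000,
  firstToken: 1000,
  health: 500
};

function checkLatency(kind, durationMs) {
  const target = RESPONSE_TIME_TARGETS[kind];
  if (target !== undefined && durationMs > target) {
    console.warn('Latency target exceeded:', { kind, durationMs, target });
  }
}
```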
### Performance Optimization Monitoring
Track these metrics to identify optimization opportunities:
1. **Cold start frequency** - Worker initialization time
2. **Memory usage patterns** - Identify memory leaks
3. **CPU utilization** - Optimize heavy computations
4. **Network latency** - Upstream API response times
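Cold starts can be observed with a module-scope flag, since module state survives between requests only while the isolate stays warm. A minimal sketch:
```javascript
// true only for the first request handled by a fresh isolate
let isColdStart = true;

function recordColdStart() {
  if (isColdStart) {
    console.log('Cold start detected:', { timestamp: new Date().toISOString() });
    isColdStart = false;
  }
}
```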
## 🚨 Alert Configuration
### Cloudflare Alerts
Set up alerts for critical issues:
#### Error Rate Alert
```yaml
Alert Type: Worker Error Rate
Threshold: > 5% error rate
Time Period: 5 minutes
Notification: Email, Webhook
```
#### Response Time Alert
```yaml
Alert Type: Worker Response Time
Threshold: > 3 seconds average
Time Period: 5 minutes
Notification: Email, Slack
```
#### Request Volume Alert
```yaml
Alert Type: Request Volume
Threshold: > 1000 requests/minute
Time Period: 1 minute
Notification: Email
```
### Custom Alert Implementation
```javascript
// In your worker code
const ALERT_THRESHOLDS = {
  ERROR_RATE: 0.05,      // 5%
  RESPONSE_TIME: 3000,   // 3 seconds
  REQUEST_RATE: 1000     // 1000 requests/minute
};

async function checkAlerts(metrics) {
  if (metrics.errorRate > ALERT_THRESHOLDS.ERROR_RATE) {
    await sendAlert('High error rate detected', metrics);
  }
}
```
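`sendAlert()` is not defined above; one simple way to implement it is a webhook POST. A minimal sketch, assuming a hypothetical `ALERT_WEBHOOK_URL` environment variable (e.g. a Slack or generic incoming webhook) and, as elsewhere in this guide, `env` in scope:
```javascript
// Minimal webhook-based alert sender (ALERT_WEBHOOK_URL is an assumed env var)
async function sendAlert(message, details) {
  if (!env.ALERT_WEBHOOK_URL) return;
  await fetch(env.ALERT_WEBHOOK_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message, details, timestamp: new Date().toISOString() })
  });
}
```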
## 📋 Health Checks
### Endpoint Monitoring
Create a health check endpoint:
```javascript
// Add to your worker
if (url.pathname === '/health') {
  const healthStatus = await checkSystemHealth();
  return new Response(JSON.stringify(healthStatus), {
    headers: { 'Content-Type': 'application/json' }
  });
}

async function checkSystemHealth() {
  return {
    status: 'healthy',
    timestamp: new Date().toISOString(),
    version: '1.0.0',
    upstreamAPIs: {
      deepseek: await checkDeepSeekAPI()
    }
  };
}
```
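`checkDeepSeekAPI()` is referenced but not shown; a lightweight probe could call the upstream model list endpoint. A sketch, assuming DeepSeek's OpenAI-compatible `/models` route and a `DEEPSEEK_API_KEY` secret:
```javascript
// Lightweight upstream probe; treats any 2xx response as healthy
async function checkDeepSeekAPI() {
  try {
    const res = await fetch('https://api.deepseek.com/models', {
      headers: { 'Authorization': `Bearer ${env.DEEPSEEK_API_KEY}` }
    });
    return res.ok ? 'healthy' : `unhealthy (${res.status})`;
  } catch (err) {
    return `unreachable (${err.message})`;
  }
}
```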
### External Monitoring Services
Integrate with external monitoring tools:
#### Uptime Robot
```yaml
URL: https://your-worker.workers.dev/health
Interval: 5 minutes
Alert: After 3 consecutive failures
```
#### Pingdom
```yaml
# HTTP check configuration
URL: https://your-worker.workers.dev/health
Interval: 1 minute
Timeout: 30 seconds
```
## 🔧 Debugging Tools
### Debug Mode
Enable detailed logging for troubleshooting:
```javascript
const DEBUG = env.DEBUG_MODE === 'true';

if (DEBUG) {
  console.log('Debug: Request details:', {
    headers: Object.fromEntries(request.headers),
    body: await request.clone().text(),
    timestamp: new Date().toISOString()
  });
}
```
### Request Tracing
Track requests through the system:
```javascript
function generateTraceId() {
  return Math.random().toString(36).substring(2, 15);
}

const start = Date.now();
const traceId = generateTraceId();
console.log('Request started:', { traceId, url: request.url });

// Pass traceId through all function calls
const result = await processRequest(request, { traceId });
console.log('Request completed:', { traceId, duration: Date.now() - start });
```
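In Workers, `crypto.randomUUID()` is also available and gives collision-resistant IDs. The trace ID can additionally be forwarded to the upstream call (the `upstreamAPI` variable from the timing example above) so upstream logs can be correlated; the header name here is just a convention:
```javascript
// Collision-resistant trace ID via the Web Crypto API (available in Workers)
const traceId = crypto.randomUUID();

// Forward the ID upstream so both sides log the same identifier
const headers = new Headers(request.headers);
headers.set('X-Trace-Id', traceId);
const response = await fetch(upstreamAPI, { method: request.method, headers, body: request.body });
```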
## 📊 Custom Metrics
### Business Metrics
Track application-specific metrics:
```javascript
// Track model usage (module-scope counters are per-isolate and reset on eviction)
const modelUsage = {
  'deepseek-chat': 0,
  'deepseek-reasoner': 0
};

// Track user activity
const userMetrics = {
  activeUsers: new Set(),
  totalRequests: 0,
  streamingRequests: 0
};

// Log metrics periodically. Note: Workers cannot start timers at module scope,
// so flush from the request path or a Cron Trigger (see the sketch below)
// rather than with setInterval.
function logBusinessMetrics() {
  console.log('Business metrics:', {
    modelUsage,
    userMetrics: {
      activeUsers: userMetrics.activeUsers.size,
      totalRequests: userMetrics.totalRequests,
      streamingRequests: userMetrics.streamingRequests
    }
  });
}
```
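Because isolates are recycled and background timers are not available at module scope, a common pattern is to flush these counters from the request path. A minimal sketch using the `logBusinessMetrics()` helper above; `handleRequest()` stands in for your existing request handler:
```javascript
export default {
  async fetch(request, env, ctx) {
    userMetrics.totalRequests++;

    const response = await handleRequest(request, env); // hypothetical existing handler

    // Flush a snapshot every 100 requests instead of relying on timers
    if (userMetrics.totalRequests % 100 === 0) {
      logBusinessMetrics();
    }
    return response;
  }
};
```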
### Cost Tracking
Monitor usage costs:
```javascript
// Track request costs
const costTracking = {
  totalRequests: 0,
  cpuTime: 0,
  bandwidthUsed: 0
};

// Calculate estimated costs (example rates; check current Cloudflare Workers pricing)
function calculateCosts(metrics) {
  const workerCost = metrics.totalRequests * 0.0000005; // $0.50 per million requests
  const cpuCost = metrics.cpuTime * 0.000002; // $2 per million CPU-seconds
  return {
    workerCost,
    cpuCost,
    totalCost: workerCost + cpuCost
  };
}
```
## 🔍 Log Analysis Best Practices
### Structured Logging
Use consistent log formats:
```javascript
function logEvent(level, event, data) {
  const logEntry = {
    level,
    event,
    timestamp: new Date().toISOString(),
    workerId: env.WORKER_ID || 'unknown',
    ...data
  };
  console[level](JSON.stringify(logEntry));
}

// Usage
logEvent('info', 'request_received', { method, url });
logEvent('error', 'api_error', { error: err.message, statusCode });
```
### Log Retention
Understand Cloudflare's log retention:
- **Real-time logs** (`wrangler tail`): Streamed live only; not stored
- **Analytics data**: Retained for 30 days
- **Custom logging**: Use external services for long-term storage
### External Log Aggregation
Send logs to external services:
```javascript
async function sendToLogService(logData) {
  if (env.LOG_SERVICE_URL) {
    await fetch(env.LOG_SERVICE_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(logData)
    });
  }
}
```
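To avoid adding latency for the client, ship the log entry after the response has been returned. A minimal sketch using `ctx.waitUntil()` in a module-syntax fetch handler; `handleRequest()` is again a stand-in for your existing handler:
```javascript
export default {
  async fetch(request, env, ctx) {
    const response = await handleRequest(request, env); // hypothetical existing handler

    // Send the log in the background; the response is not delayed
    ctx.waitUntil(sendToLogService({ url: request.url, status: response.status }));
    return response;
  }
};
```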
## 📱 Monitoring Dashboard
### Creating Custom Dashboards
Use tools like Grafana or Datadog:
```javascript
// Send metrics to external service
async function sendMetrics(metrics) {
  if (env.METRICS_ENDPOINT) {
    await fetch(env.METRICS_ENDPOINT, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${env.METRICS_API_KEY}`
      },
      body: JSON.stringify({
        service: 'ai-proxy-worker',
        timestamp: Date.now(),
        metrics
      })
    });
  }
}
```
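For regular snapshots (e.g. every minute), a Cron Trigger is a better fit than in-process timers. A minimal sketch, assuming a `crons` entry under `[triggers]` in `wrangler.toml` and a hypothetical `collectMetrics()` helper:
```javascript
export default {
  // Runs on the schedule defined in wrangler.toml, e.g. crons = ["* * * * *"]
  async scheduled(event, env, ctx) {
    // Hypothetical helper: load aggregated counters (e.g. from KV), since
    // in-memory state from other isolates is not visible here
    const metrics = await collectMetrics(env);
    ctx.waitUntil(sendMetrics(metrics));
  }
};
```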
### Key Dashboard Widgets
1. **Request Volume** - Line chart showing requests over time
2. **Error Rate** - Percentage gauge with threshold alerts
3. **Response Time** - Histogram showing latency distribution
4. **Model Usage** - Pie chart showing model usage breakdown
5. **Geographic Distribution** - Map showing request origins
## 🚨 Incident Response
### Incident Detection
Automated monitoring should detect:
- High error rates (>5%)
- Slow response times (>3s average)
- Service unavailability
- Unusual traffic patterns
### Response Procedures
1. **Immediate**: Check Cloudflare status page
2. **Investigate**: Review recent deployments and logs
3. **Mitigate**: Roll back if necessary
4. **Communicate**: Update status page and notify users
5. **Resolve**: Fix root cause
6. **Post-mortem**: Document lessons learned
### Emergency Contacts
Maintain an escalation list:
- Primary: On-call engineer
- Secondary: Team lead
- Escalation: Infrastructure team
---
**Effective monitoring ensures reliable service** 📊
Regular monitoring helps you maintain high availability and quickly resolve issues.