Observability: Logging, Metrics & Tracing

Chapter 1: Flying Blind - Why Observability Matters

When you run code on your laptop, you can see the terminal. You can debug it. You are in control.

When you deploy code to the cloud, it runs on a machine you cannot touch, in a building you cannot visit. If it breaks at 3 AM, how do you know?

✈️ Real World: The Cockpit Analogy
Imagine flying a plane at night. You can't see the ground. You rely entirely on your instruments (Altitude, Speed, Fuel).

Observability is building that dashboard. Without it, you are flying blind.

Chapter 2: The Three Pillars

To understand a system, you need three types of data.

2.1 Logs - "The Detailed Story"

A log is a record of a specific event. "User X clicked Button Y."

❌ Bad Log: Error in file. (Useless)
❌ Bad Log: User logged in. (Which user?)

✅ Good Log:

{
  "level": "ERROR",
  "message": "Payment failed",
  "userId": "usr_123",
  "orderId": "ord_555",
  "reason": "Insufficient Funds"
}

Rule: Always use Structured Logging (JSON). A computer can query JSON by field ("Show me all errors for User 123"). Free-text sentences are much harder to search reliably.
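A minimal sketch of how the good log above could be produced with Python's standard logging module. The JsonFormatter class, the build_logger helper, and the "context" key are illustrative names, not part of any library; production setups often use a dedicated JSON logging package instead.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} become an attribute
        # on the record; merge them so they turn into searchable keys.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(name: str = "payments") -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


logger = build_logger()
logger.error(
    "Payment failed",
    extra={"context": {
        "userId": "usr_123",
        "orderId": "ord_555",
        "reason": "Insufficient Funds",
    }},
)
```

Because every field is a key, a log search tool can filter on userId or orderId directly instead of pattern-matching sentences.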

2.2 Metrics - "The Vital Signs"

Logs are expensive to store (text takes space). Metrics are cheap numbers.

🚗 Real World: Speedometer
You don't write down "I am going 60mph" on a notepad every second. You just glance at the needle.

Common Metric Types:

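Counter: "Total Requests: 10,500" (Only goes up; resets to zero when the process restarts).
Gauge: "CPU Usage: 45%" (Goes up and down).
Histogram: "Latency: 95% of requests took < 200ms" (Counts values into buckets, so you can derive percentiles).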
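To make the three types concrete, here is a toy, standard-library-only sketch. The class names mirror the terms above, but this is not the API of any real metrics library, and the bucket bounds and sample latencies are made up for illustration.

```python
import bisect


class Counter:
    """Monotonic count: only increases while the process is running."""

    def __init__(self) -> None:
        self.value = 0

    def inc(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a counter can only go up")
        self.value += amount


class Gauge:
    """Point-in-time value that can move in either direction."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations into buckets keyed by upper bound."""

    def __init__(self, upper_bounds: list[float]) -> None:
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot for observations above the largest bound.
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0

    def observe(self, value: float) -> None:
        # bisect_left finds the first bucket whose bound is >= value.
        self.counts[bisect.bisect_left(self.upper_bounds, value)] += 1
        self.total += 1

    def fraction_at_or_below(self, bound: float) -> float:
        """Share of observations that fell in buckets up to `bound`."""
        idx = self.upper_bounds.index(bound)
        return sum(self.counts[: idx + 1]) / self.total


requests_total = Counter()
cpu_percent = Gauge()
latency_ms = Histogram([50, 100, 200, 500])  # hypothetical bucket bounds

for ms in [12, 48, 95, 130, 180, 190, 150, 60, 75, 420]:
    requests_total.inc()
    latency_ms.observe(ms)
cpu_percent.set(45.0)

print(requests_total.value)                  # 10
print(latency_ms.fraction_at_or_below(200))  # 0.9
```

Note that the histogram stores only a handful of bucket counts, not the ten raw values, which is why metrics stay cheap as traffic grows.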

2.3 Tracing - "The Journey"

In modern systems (Microservices), one click might hit 10 different servers. If the click is slow, WHICH server is slow?

📦 Real World: Package Tracking
You buy a shirt. It travels:

Warehouse → Truck → Sort Facility → Truck → Your House.

If it is late, you check the Tracking ID to see exactly where it got stuck (e.g., "Held at Sort Facility").

Trace ID (Correlation ID): A unique ID (e.g., GUID) generated at the start of a request. Every service passes it to the next service. You can search your logs for this ID to see the entire path.
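The propagation idea can be sketched with two pretend services that call each other in-process. Everything here is an assumption for illustration: the service functions, the log_lines list standing in for a log store, and the X-Correlation-ID header name (real HTTP header lookups are also case-insensitive, which a plain dict is not).

```python
import uuid

TRACE_HEADER = "X-Correlation-ID"  # assumed header name


def ensure_trace_id(headers: dict[str, str]) -> str:
    """Reuse the caller's trace ID, or start a new trace at the edge."""
    trace_id = headers.get(TRACE_HEADER)
    if not trace_id:
        trace_id = str(uuid.uuid4())
    return trace_id


def outgoing_headers(trace_id: str) -> dict[str, str]:
    """Headers to attach to every downstream call."""
    return {TRACE_HEADER: trace_id}


log_lines: list[dict[str, str]] = []  # stand-in for a real log store


def inventory_service(headers: dict[str, str]) -> None:
    trace_id = ensure_trace_id(headers)
    log_lines.append({"service": "inventory", "traceId": trace_id})


def checkout_service(headers: dict[str, str]) -> str:
    trace_id = ensure_trace_id(headers)
    log_lines.append({"service": "checkout", "traceId": trace_id})
    inventory_service(outgoing_headers(trace_id))  # pass the ID along
    return trace_id


trace_id = checkout_service({})  # no incoming ID: a new trace starts here
path = [line["service"] for line in log_lines if line["traceId"] == trace_id]
print(path)  # ['checkout', 'inventory']
```

Filtering the logs by one ID reconstructs the request's path, just like looking up a package by its tracking number.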

Chapter 3: Health Checks - Are You Alive?

Your load balancer or container orchestrator (e.g., Kubernetes) needs to know if your server is healthy. It finds out by polling probe endpoints:

Liveness Probe (/livez): "Are you running?"
If No: Restart the container. (If it keeps failing after every restart, you get a crash loop.)
Readiness Probe (/readyz): "Can you accept traffic?"
If No: Don't send users here yet. (e.g., Still starting up, loading cache).
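A rough sketch of the two probes using only Python's built-in http.server. The state dictionary, the probe_status helper, and port 8080 are assumptions for illustration; a real service would normally expose these routes through its web framework.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical process state; a real service would flip "ready" once
# startup work (cache warm-up, database connections) has finished.
state = {"ready": False}


def probe_status(path: str) -> int:
    """Map a probe path to an HTTP status code."""
    if path == "/livez":
        return 200  # the process is up and able to answer
    if path == "/readyz":
        return 200 if state["ready"] else 503  # 503: keep traffic away
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        self.send_response(probe_status(self.path))
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Blocking; call this from your entry point."""
    HTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

Keeping the two answers separate matters: a server that is still loading its cache should report "alive but not ready", so it is left out of rotation without being restarted.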

Chapter 4: Alerting - Preventing Alert Fatigue

You cannot stare at a dashboard 24/7. You need alarms.

Good Alerts vs Bad Alerts

❌ Bad Alert (Noise): "CPU is at 80%".
Why bad? Maybe the server works fine at 80%. If the user is happy, don't wake me up.
❌ Bad Alert (Spam): "Server restarted" (100 times/hour).
Why bad? You will eventually ignore it, just like the boy who cried wolf.
✅ Good Alert (Symptom-based): "Error Rate > 5%" or "Login Page latency > 5 seconds".
Why good? The user is suffering. Wake up and fix it!
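One way to sketch a symptom-based rule in code: fire only when the error rate stays above 5% for several consecutive measurement windows, which also cuts down on one-off spam. The ErrorRateAlert class and the three-window requirement are assumptions chosen for illustration, not a standard.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when the error rate stays above a threshold for N windows."""

    def __init__(self, threshold: float = 0.05, sustained_windows: int = 3) -> None:
        self.threshold = threshold
        self.recent: deque[bool] = deque(maxlen=sustained_windows)

    def record_window(self, total: int, failed: int) -> bool:
        rate = failed / total if total else 0.0
        self.recent.append(rate > self.threshold)
        # Fire only once the buffer is full and every window breached.
        return len(self.recent) == self.recent.maxlen and all(self.recent)


alert = ErrorRateAlert()
windows = [(1000, 10), (1000, 80), (1000, 90), (1000, 70)]  # (total, failed)
print([alert.record_window(t, f) for t, f in windows])
# [False, False, False, True]
```

The first window is healthy (1% errors), so the alert waits until three breaching windows in a row have been seen before paging anyone.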

Chapter 5: Strategies - How to Look at Data

The RED Method (For Services)

For every API you build, measure:

R - Rate: Traffic (Requests per second).
E - Errors: Failed requests per second.
D - Duration: Latency (how long it takes).
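As an illustration, the three RED numbers can be computed from a list of request records for one time window. The Request type, the choice to count HTTP 5xx responses as errors, and the nearest-rank 95th percentile for Duration are all assumptions of this sketch.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def red_summary(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Compute Rate, Errors, and Duration for one time window."""
    failed = [r for r in requests if r.status >= 500]
    durations = sorted(r.duration_ms for r in requests)
    if durations:
        # Nearest-rank 95th percentile.
        rank = math.ceil(len(durations) * 95 / 100)
        p95 = durations[rank - 1]
    else:
        p95 = 0.0
    return {
        "rate_per_s": len(requests) / window_seconds,
        "errors_per_s": len(failed) / window_seconds,
        "p95_ms": p95,
    }


# 20 requests in a 10-second window; the two slowest ones failed.
sample = [Request(duration_ms=10.0 * i, status=500 if i > 18 else 200)
          for i in range(1, 21)]
print(red_summary(sample, window_seconds=10))
# {'rate_per_s': 2.0, 'errors_per_s': 0.2, 'p95_ms': 190.0}
```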

The USE Method (For Hardware)

For every resource (CPU, Disk, RAM):

U - Utilization: How busy? (e.g., 90% usage).
S - Saturation: Is work queuing up? (Disk queue length).
E - Errors: Hardware errors?
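The same three questions can be written down as a small report per resource. This is only a sketch: ResourceSample and its fields are hypothetical, and the sample numbers are invented rather than read from a real machine.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    name: str
    busy: float        # amount in use (e.g., busy seconds, bytes used)
    capacity: float    # maximum available, in the same unit as busy
    queue_length: int  # work waiting for the resource
    error_count: int   # errors reported by the resource


def use_report(sample: ResourceSample) -> dict[str, object]:
    """Answer the U, S, and E questions for one resource."""
    return {
        "resource": sample.name,
        "utilization": sample.busy / sample.capacity,
        "saturated": sample.queue_length > 0,  # work is queuing up
        "errors": sample.error_count,
    }


disk = ResourceSample("disk", busy=9.0, capacity=10.0, queue_length=4, error_count=0)
print(use_report(disk))
# {'resource': 'disk', 'utilization': 0.9, 'saturated': True, 'errors': 0}
```

High utilization alone is not necessarily a problem; it is the combination with a non-empty queue that signals the resource has become a bottleneck.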

Chapter 6: Summary Checklist

Observability Best Practices:

[ ] Log as JSON. Never plain text.
[ ] No PII/Secrets. Never log passwords, credit card numbers, or API keys.
[ ] Use Trace IDs. Pass an X-Correlation-ID header to every downstream API.
[ ] Put the RED metrics (Rate, Errors, Duration) on your dashboard. They overlap with Google's four "Golden Signals" (latency, traffic, errors, saturation).
[ ] Alert on Symptoms. Only wake up engineers if users are impacted.
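For the "No PII/Secrets" item, one possible safeguard is to mask sensitive fields before an event is ever written. This is a sketch under assumptions: the field names in SENSITIVE_KEYS are hypothetical, and key-based masking will not catch secrets embedded inside free-text messages.

```python
SENSITIVE_KEYS = {"password", "cardNumber", "apiKey"}  # hypothetical field names


def redact(event: dict) -> dict:
    """Return a copy of a log event with sensitive values masked."""
    clean = {}
    for key, value in event.items():
        if key in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)  # handle nested objects
        else:
            clean[key] = value
    return clean


event = {
    "message": "Login attempt",
    "userId": "usr_123",
    "password": "hunter2",
    "payment": {"cardNumber": "4111111111111111", "amount": 25},
}
print(redact(event))
```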

Quick Review

Observability is the ability to understand a system's behavior in production from telemetry (logs, metrics, traces), allowing us to detect, diagnose, and localize failures and latency.

✅ The three pillars

Logs: detailed events (good for debugging specific cases).
Metrics: cheap numbers over time (good for dashboards and alerts).
Traces: end-to-end request journey across services (find the slow hop).

✅ Correlation is the superpower

Trace/Correlation ID: one ID travels through every service so you can connect logs and spans.

✅ Health checks and alerting

Liveness: “should I restart it?”
Readiness: “should I send traffic to it?”
Alerts: trigger on user-visible symptoms (error rate, latency), not raw noise (CPU alone).

✅ Mental models for dashboards

RED: Rate, Errors, Duration (service viewpoint).
USE: Utilization, Saturation, Errors (resource viewpoint).