Problem
When multiple Coder instances are deployed behind a load balancer (e.g., Kubernetes Service) pointing at the same database without a Premium license, users experience intermittent failures. This happens because:
- Each Coder instance registers itself as a replica in the database
- After the grace period (default 1 minute), the entitlements check detects multiple replicas without an HA entitlement
- An error is recorded: "You have multiple replicas but high availability is an Enterprise feature. You will be unable to connect to workspaces."
- However, the `/healthz` endpoint still returns `200 OK`
- The load balancer continues routing traffic to all nodes, but only one can properly serve workspace connections
The result is that ~50% of requests fail unpredictably with two replicas; with N replicas, only one node can serve workspace connections, so roughly (N-1)/N of those attempts land on a node that cannot.
Current Behavior
The `/healthz` endpoint (`coderd/coderd.go:909`) unconditionally returns "OK":

```go
r.Get("/healthz", func(w http.ResponseWriter, _ *http.Request) { _, _ = w.Write([]byte("OK")) })
```

The entitlement error is detected and stored (`enterprise/coderd/license/license.go:431-449`), but it is only exposed via:
- The authenticated `/api/v2/entitlements` endpoint
- Warning headers on authenticated responses
Neither of these can be used for Kubernetes readiness probes.
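For illustration, here is roughly what it takes to read those errors today through the public `codersdk` client; the deployment URL and token env var below are placeholders, and holding a session token is exactly what a Kubernetes probe cannot reasonably do:

```go
package main

import (
	"context"
	"fmt"
	"net/url"
	"os"

	"github.com/coder/coder/v2/codersdk"
)

func main() {
	serverURL, err := url.Parse("https://coder.example.com") // placeholder
	if err != nil {
		panic(err)
	}
	client := codersdk.New(serverURL)
	// A probe would need this token provisioned and rotated for it.
	client.SetSessionToken(os.Getenv("CODER_SESSION_TOKEN"))

	ent, err := client.Entitlements(context.Background())
	if err != nil {
		fmt.Println("entitlements query failed:", err)
		return
	}
	// The multi-replica/HA message surfaces here as a plain string.
	for _, msg := range ent.Errors {
		fmt.Println("entitlement error:", msg)
	}
}
```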
Proposed Solution
Add a `/readyz` endpoint (unauthenticated) that returns:
- `200 OK` when the node is fully operational
- `503 Service Unavailable` when the node has critical issues that should exclude it from load balancing
The readiness check should verify:
- Database connectivity - Can the node reach the database?
- No blocking entitlement errors - Specifically, the error raised when multiple replicas run without an HA license
This follows Kubernetes conventions where:
- `/healthz` (liveness) = "Is the process alive?" → restart if failing
- `/readyz` (readiness) = "Can this instance serve traffic?" → remove from load balancer if failing
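To make the liveness/readiness split concrete, an HTTP readiness probe boils down to an unauthenticated GET; a rough Go equivalent of what the kubelet does (address and port are placeholders):

```go
package main

import (
	"fmt"
	"net/http"
)

// isReady mirrors the kubelet's HTTP probe semantics: an unauthenticated
// GET where any status in [200, 400) counts as success.
func isReady(baseURL string) bool {
	resp, err := http.Get(baseURL + "/readyz")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode >= 200 && resp.StatusCode < 400
}

func main() {
	// A pod failing readiness is removed from Service endpoints but not
	// restarted; a failing liveness probe triggers a container restart.
	fmt.Println("ready:", isReady("http://127.0.0.1:3000"))
}
```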
Implementation Notes
Key code areas:
- Entitlements tracking: `coderd/entitlements/entitlements.go` - Add a method like `HasBlockingErrors() bool` that checks for errors that should make the node unready (a sketch follows this list)
- New endpoint: `coderd/coderd.go` - Add the `/readyz` route
- Error detection: The replica error is already generated in `enterprise/coderd/license/license.go` in the `Errors` slice
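A minimal sketch of that method, assuming the entitlements set guards a `codersdk.Entitlements` value behind a mutex; the struct fields and the substring match below are illustrative stand-ins, not the real internals:

```go
package entitlements

import (
	"strings"
	"sync"

	"github.com/coder/coder/v2/codersdk"
)

// Set is a simplified stand-in for the existing entitlements.Set.
type Set struct {
	mu           sync.RWMutex
	entitlements codersdk.Entitlements
}

// HasBlockingErrors reports whether any recorded entitlement error should
// mark this node unready. Substring matching is a placeholder; a real
// implementation would prefer a structured error code over string
// comparison.
func (s *Set) HasBlockingErrors() bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	for _, msg := range s.entitlements.Errors {
		if strings.Contains(msg, "high availability") {
			return true
		}
	}
	return false
}
```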
Example implementation sketch:
r.Get("/readyz", func(w http.ResponseWriter, r *http.Request) {
ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
defer cancel()
// Check database connectivity
if _, err := api.Database.Ping(ctx); err != nil {
w.WriteHeader(http.StatusServiceUnavailable)
_, _ = w.Write([]byte("database unreachable"))
return
}
// Check for blocking entitlement errors
if api.Entitlements.HasBlockingErrors() {
w.WriteHeader(http.StatusServiceUnavailable)
_, _ = w.Write([]byte("entitlement error"))
return
}
_, _ = w.Write([]byte("OK"))
})Alternatives Considered
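To show the contract a probe or load balancer would observe, a self-contained toy (not coderd code) that serves the same 503-until-healthy behavior:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// blocked is a toy stand-in; in coderd it would be derived from the
// database ping and Entitlements.HasBlockingErrors().
var blocked = true

func readyz(w http.ResponseWriter, _ *http.Request) {
	if blocked {
		w.WriteHeader(http.StatusServiceUnavailable)
		_, _ = w.Write([]byte("entitlement error"))
		return
	}
	_, _ = w.Write([]byte("OK"))
}

func main() {
	srv := httptest.NewServer(http.HandlerFunc(readyz))
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.StatusCode) // 503 while blocked; 200 once the error clears
}
```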
Alternatives Considered
- Modify `/healthz` directly - Rejected because changing liveness probe behavior could cause restart loops instead of just removing the node from the load balancer
- Require authentication on the readiness probe - Rejected because Kubernetes probes typically run without application-level auth, and managing secrets for probes adds operational complexity
- External monitoring of `/api/v2/entitlements` - Works, but requires additional infrastructure (a sidecar or an external health checker with credentials)
Additional Context
This issue particularly affects:
- Kubernetes deployments using `replicas > 1` without realizing HA requires a Premium license
- Blue-green or rolling deployments where multiple pods temporarily coexist
- Development/staging environments that mirror production topology without licenses
The current workaround is to manually ensure only one replica runs, but this defeats the purpose of high availability and is error-prone.