Add readiness probe endpoint that detects unlicensed multi-replica state #21255

@blinkagent

Problem

When multiple Coder instances are deployed behind a load balancer (e.g., Kubernetes Service) pointing at the same database without a Premium license, users experience intermittent failures. This happens because:

  1. Each Coder instance registers itself as a replica in the database
  2. After the grace period (default 1 minute), the entitlements check detects multiple replicas without HA entitlement
  3. An error is recorded: "You have multiple replicas but high availability is an Enterprise feature. You will be unable to connect to workspaces."
  4. However, the /healthz endpoint still returns 200 OK
  5. The load balancer continues routing traffic to all nodes, but only one can properly serve workspace connections

The result is that, with two replicas, roughly 50% of requests fail unpredictably, and the failure ratio grows as more replicas are added.
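As a rough illustration of how the failure ratio scales, the sketch below computes the expected share of failing requests. It assumes the load balancer spreads requests evenly and that exactly one replica can serve workspace connections. `failureFraction` is a hypothetical helper, not part of Coder.

```go
package main

import "fmt"

// failureFraction estimates the share of requests that land on a replica
// unable to serve workspace connections. Assumptions: requests are spread
// evenly across replicas, and exactly one replica works.
func failureFraction(replicas int) float64 {
	if replicas <= 1 {
		return 0
	}
	return float64(replicas-1) / float64(replicas)
}

func main() {
	for _, n := range []int{1, 2, 3, 4} {
		fmt.Printf("replicas=%d expected failure share=%.2f\n", n, failureFraction(n))
	}
}
```

Real load balancers with session affinity or weighted routing would shift these numbers, so treat them as an estimate only.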

Current Behavior

The /healthz endpoint (coderd/coderd.go:909) unconditionally returns "OK":

r.Get("/healthz", func(w http.ResponseWriter, _ *http.Request) { _, _ = w.Write([]byte("OK")) })

The entitlement error is already detected and stored (enterprise/coderd/license/license.go:431-449), but it is only exposed via:

  • The authenticated /api/v2/entitlements endpoint
  • Warning headers on authenticated responses

Neither of these can be used for Kubernetes readiness probes.

Proposed Solution

Add a /readyz endpoint (unauthenticated) that returns:

  • 200 OK when the node is fully operational
  • 503 Service Unavailable when the node has critical issues that should exclude it from load balancing

The readiness check should verify:

  1. Database connectivity - Can the node reach the database?
  2. No blocking entitlement errors - Specifically, the multi-replica without HA license error

This follows Kubernetes conventions where:

  • /healthz (liveness) = "Is the process alive?" → restart if failing
  • /readyz (readiness) = "Can this instance serve traffic?" → remove from load balancer if failing

Implementation Notes

Key code areas:

  • Entitlements tracking: coderd/entitlements/entitlements.go - Add a method such as HasBlockingErrors() bool that reports whether any recorded error should make the node unready
  • New endpoint: coderd/coderd.go - Add /readyz route
  • Error detection: The replica error is already generated in enterprise/coderd/license/license.go in the Errors slice
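One possible shape for the proposed HasBlockingErrors() method is sketched below. The `Set` type, its fields, `SetErrors`, and the substring list are all hypothetical stand-ins; the real entitlements type in coderd/entitlements may be structured differently. Matching on message text is an assumption made to keep the sketch short, and a typed error code would likely be more robust.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Set is a hypothetical stand-in for the entitlements set.
type Set struct {
	mu     sync.RWMutex
	errors []string
}

// blockingErrorFragments lists message fragments that mark an entitlement
// error as blocking. Matching on text is an assumption of this sketch.
var blockingErrorFragments = []string{
	"multiple replicas but high availability",
}

// SetErrors replaces the recorded entitlement errors with a copy of errs.
func (s *Set) SetErrors(errs []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.errors = append([]string(nil), errs...)
}

// HasBlockingErrors reports whether any recorded error should make the
// node unready.
func (s *Set) HasBlockingErrors() bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	for _, e := range s.errors {
		for _, frag := range blockingErrorFragments {
			if strings.Contains(e, frag) {
				return true
			}
		}
	}
	return false
}

func main() {
	s := &Set{}
	s.SetErrors([]string{
		"You have multiple replicas but high availability is an Enterprise feature. You will be unable to connect to workspaces.",
	})
	fmt.Println("blocking:", s.HasBlockingErrors())
}
```

Keeping the blocking list explicit means other entitlement errors do not take a node out of rotation unless they are deliberately added.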

Example implementation sketch:

r.Get("/readyz", func(w http.ResponseWriter, r *http.Request) {
    // Bound the probe so a slow database cannot hang the readiness check.
    ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
    defer cancel()

    // Check database connectivity
    if _, err := api.Database.Ping(ctx); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        _, _ = w.Write([]byte("database unreachable"))
        return
    }

    // Check for blocking entitlement errors
    // (HasBlockingErrors is the new method proposed above; it does not exist yet)
    if api.Entitlements.HasBlockingErrors() {
        w.WriteHeader(http.StatusServiceUnavailable)
        _, _ = w.Write([]byte("entitlement error"))
        return
    }

    // Writing the body without WriteHeader sends an implicit 200 OK.
    _, _ = w.Write([]byte("OK"))
})

Alternatives Considered

  1. Modify /healthz directly - Rejected because failing the liveness probe makes Kubernetes restart the pod, which could cause restart loops instead of simply removing the node from the load balancer

  2. Require authentication on readiness probe - Rejected because Kubernetes probes typically run without application-level auth, and managing secrets for probes adds operational complexity

  3. External monitoring of /api/v2/entitlements - Works but requires additional infrastructure (sidecar, external health checker with credentials)
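For comparison, alternative 3 might look roughly like the external checker below. This is a sketch under stated assumptions: that the entitlements response is JSON with an `errors` string array, and that the session token is sent in a `Coder-Session-Token` header. `checkEntitlements` and `newFakeServer` are hypothetical helpers, and the fake server stands in for a real deployment.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// entitlementsResponse decodes only the field this check needs.
// The "errors" field name is an assumption about the response shape.
type entitlementsResponse struct {
	Errors []string `json:"errors"`
}

// checkEntitlements fetches /api/v2/entitlements from baseURL and returns
// the reported entitlement errors.
func checkEntitlements(client *http.Client, baseURL, token string) ([]string, error) {
	req, err := http.NewRequest(http.MethodGet, baseURL+"/api/v2/entitlements", nil)
	if err != nil {
		return nil, err
	}
	// Assumed authentication header for the session token.
	req.Header.Set("Coder-Session-Token", token)

	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	var body entitlementsResponse
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return nil, err
	}
	return body.Errors, nil
}

// newFakeServer simulates the authenticated endpoint for local testing.
func newFakeServer(errs []string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Coder-Session-Token") == "" {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		_ = json.NewEncoder(w).Encode(entitlementsResponse{Errors: errs})
	}))
}

func main() {
	srv := newFakeServer([]string{"You have multiple replicas but high availability is an Enterprise feature."})
	defer srv.Close()

	errs, err := checkEntitlements(srv.Client(), srv.URL, "example-token")
	fmt.Println(errs, err)
}
```

The credential this checker has to carry is the operational cost noted above, and it is what an unauthenticated /readyz endpoint would avoid.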

Additional Context

This issue particularly affects:

  • Kubernetes deployments using replicas > 1 without realizing HA requires Premium
  • Blue-green or rolling deployments where multiple pods temporarily coexist
  • Development/staging environments that mirror production topology without licenses

The current workaround is to manually ensure only one replica runs, but this defeats the purpose of high availability and is error-prone.
