Conversation

@jeremyeder (Collaborator)

Summary

Fixes production frontend pod OOMKills by increasing memory limits from 512Mi to 2Gi and adding resource limits to the oauth-proxy sidecar.

Problem

Frontend pods in vteam-uat are exhibiting a sawtooth memory pattern: memory climbs to 500-550Mi, triggers an OOMKill, the pod restarts, and the cycle repeats every few hours.

Evidence:

  • OAuth proxy logs show repeated "unexpected EOF" and "connection refused" errors
  • Memory graph shows characteristic sawtooth pattern (see screenshot below)
  • Pattern matches previous LiteLLM issue (commit b43bf39)

[Screenshot: frontend memory sawtooth pattern]

Root Cause

  1. Insufficient memory limits: 512Mi too tight for Next.js + React Query + WebSocket sessions + OAuth proxy
  2. No OAuth proxy limits: Sidecar consuming memory without constraints
  3. No Node.js heap control: Currently relies on default behavior

Solution

Minimal production-only fix (1 file, 18 lines changed):

Frontend Container

  • Memory request: 256Mi → 512Mi (matches observed peak)
  • Memory limit: 512Mi → 2Gi (4x increase, matches LiteLLM precedent)
  • CPU request: 100m → 200m
  • CPU limit: 500m → 1000m

OAuth Proxy Container (NEW)

  • Memory request: 256Mi
  • Memory limit: 512Mi (conservative - prevents auth outage)
  • CPU request: 50m
  • CPU limit: 200m
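Taken together, the values above imply a patch shaped roughly like the following. This is a sketch, not the actual file contents: the file path and container names come from the Files Changed and review sections of this PR, and everything else from the bullet values.

```yaml
# Sketch of components/manifests/overlays/production/frontend-oauth-deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  template:
    spec:
      containers:
      - name: frontend
        resources:
          requests:
            memory: 512Mi
            cpu: 200m
          limits:
            memory: 2Gi
            cpu: 1000m
      - name: oauth-proxy
        resources:
          requests:
            memory: 256Mi
            cpu: 50m
          limits:
            memory: 512Mi
            cpu: 200m
```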

Rationale:

  • 2Gi limit provides 3.6-4x safety margin over observed peak
  • Matches LiteLLM precedent (commit b43bf39: 512Mi → 2Gi)
  • Production-only scope (other environments not affected)

Deferred Work (YAGNI)

Not included in this PR - add later if needed:

  • NODE_OPTIONS heap tuning (Node.js 20+ auto-tunes for containers)
  • Health probe tuning (symptom, not cause)
  • Other environments (no evidence of issues)
  • Monitoring/alerting updates (validate limits first)

Test Plan

Pre-Deployment:

  • YAML syntax validated with kustomize build
  • Measure current oauth-proxy memory usage
  • Backup current deployment config

Post-Deployment:

# Deploy
kubectl apply -k components/manifests/overlays/production -n vteam-uat
kubectl rollout status deployment/frontend -n vteam-uat

# Monitor (24-48 hours)
# Note: `kubectl top` has no --watch flag, and event field selectors do not
# support wildcards like frontend-*, so use `watch` and `grep` instead
watch kubectl top pods -n vteam-uat -l app=frontend --containers
kubectl get events -n vteam-uat | grep -i oomkill

Success Criteria:

  • ✅ Memory stabilizes below 1.5Gi
  • ✅ No sawtooth pattern (no restarts)
  • ✅ No OOMKilled events
  • ✅ OAuth proxy usage < 400Mi
  • ✅ Zero "unexpected EOF" errors in logs
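The oauth-proxy criterion (< 400Mi) can be checked mechanically from `kubectl top` output. A minimal sketch, where `parse_mi` and the sample line are illustrative stand-ins, not real cluster output:

```shell
# Extract the numeric Mi value from the MEMORY column of
# `kubectl top pods --containers` output (columns: POD NAME CPU MEMORY)
parse_mi() { echo "$1" | awk '{gsub(/Mi/, "", $4); print $4}'; }

# Synthetic sample line standing in for live output
line="frontend-7d9f8c oauth-proxy 12m 310Mi"
usage=$(parse_mi "$line")

if [ "$usage" -lt 400 ]; then
  echo "oauth-proxy within budget: ${usage}Mi"
else
  echo "oauth-proxy over budget: ${usage}Mi"
fi
```

In practice, `line` would come from `kubectl top pods -n vteam-uat -l app=frontend --containers | grep oauth-proxy`.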

Rollback Plan:

kubectl rollout undo deployment/frontend -n vteam-uat

Agent Review

This plan was reviewed by:

  • Code Architect: Validated 2Gi limit (not 4Gi), OAuth proxy sizing, deployment strategy
  • Code Simplifier: Reduced from 5 environments → production only, removed unnecessary complexity

Files Changed

  • components/manifests/overlays/production/frontend-oauth-deployment-patch.yaml (18 lines added)
  • docs/screenshots/frontend-memory-sawtooth.png (183KB - visual evidence)

🤖 Generated with Claude Code

- Frontend memory: 512Mi → 2Gi limit, 256Mi → 512Mi request
- OAuth proxy limits: 256Mi request, 512Mi limit (NEW - prevent unbounded growth)

Fixes sawtooth memory pattern causing pod restarts and connection errors.
Matches LiteLLM precedent (commit b43bf39: 512Mi → 2Gi, 4x increase).

Production only - other environments not affected by OOMKills.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@jeremyeder (Collaborator, Author)

[Screenshot 2026-01-10 at 23:46:28]

@github-actions (Contributor)

github-actions bot commented Jan 11, 2026

Claude Code Review

Summary

This PR addresses production frontend OOMKills by increasing memory limits from 512Mi to 2Gi and adding resource limits to the oauth-proxy sidecar. The changes are scoped to production-only deployment via Kustomize patch, following precedent from LiteLLM fix (commit b43bf39). The memory graph clearly shows the sawtooth pattern indicative of OOMKills, and the proposed fix is well-justified.

Issues by Severity

🚫 Blocker Issues

None - This PR is ready to merge.

🔴 Critical Issues

None

🟡 Major Issues

1. Kustomize Strategic Merge Behavior - Container List Replacement Risk

Location: components/manifests/overlays/production/frontend-oauth-deployment-patch.yaml:11-20

Issue: Kustomize's strategic merge for container arrays uses name as the merge key. When you specify a container with name: frontend, it should merge with the base container if the patch strategy is interpreted correctly. However, the current patch lists the frontend container before the oauth-proxy container.

Potential Problem: Some Kustomize versions or configurations might replace the entire container list instead of merging individual containers by name. The ordering (frontend first, oauth-proxy second) suggests intent to merge, but this isn't guaranteed without explicit merge directives.

Evidence from base:

# base/frontend-deployment.yaml:19-40
containers:
- name: frontend
  imagePullPolicy: Always
  image: quay.io/ambient_code/vteam_frontend:latest
  # ... existing config with 512Mi limit

Recommended Fix:
Add explicit merge strategy directive to ensure container-level merging:

spec:
  template:
    spec:
      containers:
      - name: frontend
        $patch: merge  # Explicit merge directive
        resources:
          requests:
            memory: 512Mi
            cpu: 200m
          limits:
            memory: 2Gi
            cpu: 1000m

OR verify that the current approach works as expected by testing the kustomize build output:

kustomize build components/manifests/overlays/production | grep -A 50 "kind: Deployment" | grep -A 40 "name: frontend"

Risk Level: Medium - The patch likely works (similar patterns exist in local-dev/frontend-deployment-patch.yaml), but lack of explicit merge directive leaves room for ambiguity.

🔵 Minor Issues

1. Missing NODE_OPTIONS Heap Tuning Discussion

Location: PR description - Deferred Work section

Observation: The PR correctly defers NODE_OPTIONS heap tuning, citing Node.js 20+ auto-tuning. This is pragmatic (YAGNI principle), but worth documenting the decision basis.

Recommendation: Consider adding a comment in the YAML explaining why NODE_OPTIONS isn't set:

      - name: frontend
        # No NODE_OPTIONS --max-old-space-size tuning:
        # Node.js 20+ auto-tunes heap based on container limits (2Gi)
        # Will add explicit tuning if 2Gi proves insufficient
        resources:

Impact: Very low - documentation improvement only.

2. OAuth Proxy Memory Limits Conservative but Unvalidated

Location: frontend-oauth-deployment-patch.yaml:60-66

Issue: OAuth proxy is allocated 256Mi request / 512Mi limit without empirical measurement:

resources:
  requests:
    memory: 256Mi
    cpu: 50m
  limits:
    memory: 512Mi
    cpu: 200m

Recommendation: The PR test plan includes "Measure current oauth-proxy memory usage" as unchecked. This should be done before deployment to validate the 512Mi limit is appropriate.

Suggested validation:

# Check current oauth-proxy usage (if deployed)
kubectl top pods -n vteam-uat -l app=frontend --containers | grep oauth-proxy

Risk: Low - 512Mi is likely sufficient for oauth-proxy, but empirical validation strengthens confidence.

3. File Location - Screenshot in docs/screenshots/

Location: docs/screenshots/frontend-memory-sawtooth.png

Observation: Screenshot location is appropriate and follows documentation standards (per CLAUDE.md). No issue here - just noting for completeness.

Note: The screenshot clearly shows:

  • Green line (frontend) exhibits sawtooth pattern
  • Memory climbs to ~500-550Mi then drops sharply
  • Blue line (minio) shows stable, healthy memory profile
  • Pattern repeats every few hours (Jan 7-10)

This is excellent evidence supporting the fix. ✅

Positive Highlights

1. Excellent Problem Analysis ⭐

  • Clear evidence with memory graph showing OOMKill sawtooth pattern
  • Root cause analysis identifies insufficient limits AND oauth-proxy missing limits
  • Follows established precedent (LiteLLM commit b43bf39)

2. Appropriate Scope - Production Only 🎯

  • Kustomize overlay correctly targets production environment only
  • Avoids unnecessary changes to dev/e2e/minikube environments
  • Follows YAGNI principle (no premature optimization)

3. Well-Documented Test Plan 📋

  • Clear success criteria (memory < 1.5Gi, no sawtooth, no OOMKills)
  • Monitoring plan (24-48 hours with kubectl top/events)
  • Rollback plan documented

4. Consistent with LiteLLM Precedent 🔄

  • Both fixes: 512Mi → 2Gi (4x increase)
  • Both: CPU 500m → 1000m
  • Both: Requests also increased for stability
  • Pattern proven successful in production

5. Agent Review Process ✅

  • Code Architect validated sizing
  • Code Simplifier enforced YAGNI (reduced scope)
  • Multi-agent review adds confidence

Recommendations

Priority 1: Validate Kustomize Merge Behavior

Before merging, verify the patch correctly merges containers (not replaces):

# Generate final YAML and verify frontend container has BOTH:
# 1. Base config (image, env vars, probes from base/frontend-deployment.yaml)
# 2. Patched config (new resource limits)
# 3. oauth-proxy container is added (not replaced)

kustomize build components/manifests/overlays/production > /tmp/prod-manifest.yaml

# Check frontend container
grep -A 60 'name: frontend' /tmp/prod-manifest.yaml | grep -E '(image:|memory:|BACKEND_URL|livenessProbe)'

# Expected output should show:
# - image: quay.io/ambient_code/vteam_frontend:latest (from base)
# - memory: 2Gi (from patch)
# - BACKEND_URL env var (from base)
# - livenessProbe (from base)

# Check oauth-proxy container exists
grep -A 30 'name: oauth-proxy' /tmp/prod-manifest.yaml

If the frontend container is missing base config (image, env, probes), add $patch: merge directive.

Priority 2: Measure OAuth Proxy Usage Before Deployment

# If oauth-proxy is currently deployed (unlikely), measure actual usage
kubectl top pods -n vteam-uat -l app=frontend --containers 2>/dev/null | grep oauth-proxy || echo "Not currently deployed"

# After deployment, validate 512Mi is sufficient:
# Success if oauth-proxy stays < 400Mi (80% of limit)

Priority 3: Monitor and Document Results

After deployment:

  1. Monitor for 48 hours as planned
  2. Document actual memory usage in follow-up comment or docs
  3. If memory stabilizes < 1.5Gi, consider this precedent for future Next.js deployments

Priority 4: Consider Future Optimization (Post-Fix)

Once the immediate OOMKill issue is resolved and you have 48 hours of data:

  • Analyze actual memory usage patterns
  • Determine if 2Gi can be reduced (e.g., to 1.5Gi) for better resource utilization
  • Document findings in docs/decisions.md or ADR

Final Assessment

Verdict: APPROVE with minor validation recommended before merge

This PR demonstrates excellent problem-solving:

  • Evidence-based diagnosis (memory graph)
  • Appropriate fix following proven precedent
  • Scoped to affected environment only
  • Well-documented with test/rollback plans

The only concern is ensuring Kustomize merges containers correctly (not replaces). This is likely working as intended (similar patterns exist), but validation provides certainty.

Confidence Level: 95% - The fix will resolve OOMKills. The 5% uncertainty is around Kustomize merge behavior, easily verified with kustomize build check.


Review completed following:

  • ✅ CLAUDE.md project standards
  • ✅ Backend development context (K8s patterns)
  • ✅ Security standards (no secrets, appropriate RBAC)
  • ✅ Error handling patterns
  • ✅ Architecture consistency

No security, authentication, or RBAC concerns - This is infrastructure configuration only.

🤖 Review by Claude Code (Sonnet 4.5)



What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.
