Conversation

@jeremyeder (Collaborator)

Summary

Fixes production frontend pod OOMKills by increasing memory limits from 512Mi to 2Gi and adding resource limits to the oauth-proxy sidecar.

Problem

Frontend pods in vteam-uat are exhibiting a sawtooth memory pattern: memory climbs to 500-550Mi, triggers an OOMKill, the pod restarts, and the cycle repeats every few hours.

Evidence:

  • OAuth proxy logs show repeated "unexpected EOF" and "connection refused" errors
  • Memory graph shows characteristic sawtooth pattern (see screenshot below)
  • Pattern matches previous LiteLLM issue (commit b43bf39)

[Screenshot: frontend memory sawtooth pattern]

Root Cause

  1. Insufficient memory limits: 512Mi too tight for Next.js + React Query + WebSocket sessions + OAuth proxy
  2. No OAuth proxy limits: Sidecar consuming memory without constraints
  3. No Node.js heap control: Currently relies on default behavior

Solution

Minimal production-only fix (1 file, 18 lines changed):

Frontend Container

  • Memory request: 256Mi → 512Mi (matches observed peak)
  • Memory limit: 512Mi → 2Gi (4x increase, matches LiteLLM precedent)
  • CPU request: 100m → 200m
  • CPU limit: 500m → 1000m

OAuth Proxy Container (NEW)

  • Memory request: 256Mi
  • Memory limit: 512Mi (conservative - prevents auth outage)
  • CPU request: 50m
  • CPU limit: 200m
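Taken together, the values above imply a patch shaped roughly like the following. This is a sketch, not the actual file contents: the file path and container names come from the Files Changed and review sections of this PR, and everything else from the bullet values.

```yaml
# Sketch of components/manifests/overlays/production/frontend-oauth-deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  template:
    spec:
      containers:
      - name: frontend
        resources:
          requests:
            memory: 512Mi
            cpu: 200m
          limits:
            memory: 2Gi
            cpu: 1000m
      - name: oauth-proxy
        resources:
          requests:
            memory: 256Mi
            cpu: 50m
          limits:
            memory: 512Mi
            cpu: 200m
```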

Rationale:

  • 2Gi limit provides 3.6-4x safety margin over observed peak
  • Matches LiteLLM precedent (commit b43bf39: 512Mi → 2Gi)
  • Production-only scope (other environments not affected)

Deferred Work (YAGNI)

Not included in this PR - add later if needed:

  • NODE_OPTIONS heap tuning (Node.js 20+ auto-tunes for containers)
  • Health probe tuning (symptom, not cause)
  • Other environments (no evidence of issues)
  • Monitoring/alerting updates (validate limits first)

Test Plan

Pre-Deployment:

  • YAML syntax validated with kustomize build
  • Measure current oauth-proxy memory usage
  • Backup current deployment config

Post-Deployment:

# Deploy
kubectl apply -k components/manifests/overlays/production -n vteam-uat
kubectl rollout status deployment/frontend -n vteam-uat

# Monitor (24-48 hours)
# Note: `kubectl top` has no --watch flag, and event field selectors do not
# support wildcards like frontend-*, so use `watch` and `grep` instead
watch kubectl top pods -n vteam-uat -l app=frontend --containers
kubectl get events -n vteam-uat | grep -i oomkill

Success Criteria:

  • ✅ Memory stabilizes below 1.5Gi
  • ✅ No sawtooth pattern (no restarts)
  • ✅ No OOMKilled events
  • ✅ OAuth proxy usage < 400Mi
  • ✅ Zero "unexpected EOF" errors in logs
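The oauth-proxy criterion (< 400Mi) can be checked mechanically from `kubectl top` output. A minimal sketch, where `parse_mi` and the sample line are illustrative stand-ins, not real cluster output:

```shell
# Extract the numeric Mi value from the MEMORY column of
# `kubectl top pods --containers` output (columns: POD NAME CPU MEMORY)
parse_mi() { echo "$1" | awk '{gsub(/Mi/, "", $4); print $4}'; }

# Synthetic sample line standing in for live output
line="frontend-7d9f8c oauth-proxy 12m 310Mi"
usage=$(parse_mi "$line")

if [ "$usage" -lt 400 ]; then
  echo "oauth-proxy within budget: ${usage}Mi"
else
  echo "oauth-proxy over budget: ${usage}Mi"
fi
```

In practice, `line` would come from `kubectl top pods -n vteam-uat -l app=frontend --containers | grep oauth-proxy`.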

Rollback Plan:

kubectl rollout undo deployment/frontend -n vteam-uat

Agent Review

This plan was reviewed by:

  • Code Architect: Validated 2Gi limit (not 4Gi), OAuth proxy sizing, deployment strategy
  • Code Simplifier: Reduced from 5 environments → production only, removed unnecessary complexity

Files Changed

  • components/manifests/overlays/production/frontend-oauth-deployment-patch.yaml (18 lines added)
  • docs/screenshots/frontend-memory-sawtooth.png (183KB - visual evidence)

🤖 Generated with Claude Code

- Frontend memory: 512Mi → 2Gi limit, 256Mi → 512Mi request
- OAuth proxy limits: 256Mi request, 512Mi limit (NEW - prevent unbounded growth)

Fixes sawtooth memory pattern causing pod restarts and connection errors.
Matches LiteLLM precedent (commit b43bf39: 512Mi → 2Gi, 4x increase).

Production only - other environments not affected by OOMKills.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@jeremyeder (Collaborator, Author)

[Screenshot 2026-01-10 at 23:46:28]

@github-actions (Contributor)

github-actions bot commented Jan 11, 2026

Claude Code Review

Summary

This PR addresses production frontend OOMKills by increasing memory limits from 512Mi to 2Gi and adding resource limits to the oauth-proxy sidecar. The changes are scoped to production-only deployment via Kustomize patch, following precedent from LiteLLM fix (commit b43bf39). The memory graph clearly shows the sawtooth pattern indicative of OOMKills, and the proposed fix is well-justified.

Issues by Severity

🚫 Blocker Issues

None - This PR is ready to merge.

🔴 Critical Issues

None

🟡 Major Issues

1. Kustomize Strategic Merge Behavior - Container List Replacement Risk

Location: components/manifests/overlays/production/frontend-oauth-deployment-patch.yaml:11-20

Issue: Kustomize's strategic merge for container arrays uses name as the merge key. When you specify a container with name: frontend, it should merge with the base container if the patch strategy is interpreted correctly. However, the current patch lists the frontend container before the oauth-proxy container.

Potential Problem: Some Kustomize versions or configurations might replace the entire container list instead of merging individual containers by name. The ordering (frontend first, oauth-proxy second) suggests intent to merge, but this isn't guaranteed without explicit merge directives.

Evidence from base:

# base/frontend-deployment.yaml:19-40
containers:
- name: frontend
  imagePullPolicy: Always
  image: quay.io/ambient_code/vteam_frontend:latest
  # ... existing config with 512Mi limit

Recommended Fix:
Add explicit merge strategy directive to ensure container-level merging:

spec:
  template:
    spec:
      containers:
      - name: frontend
        $patch: merge  # Explicit merge directive
        resources:
          requests:
            memory: 512Mi
            cpu: 200m
          limits:
            memory: 2Gi
            cpu: 1000m

OR verify that the current approach works as expected by testing the kustomize build output:

kustomize build components/manifests/overlays/production | grep -A 50 "kind: Deployment" | grep -A 40 "name: frontend"

Risk Level: Medium - The patch likely works (similar patterns exist in local-dev/frontend-deployment-patch.yaml), but lack of explicit merge directive leaves room for ambiguity.

🔵 Minor Issues

1. Missing NODE_OPTIONS Heap Tuning Discussion

Location: PR description - Deferred Work section

Observation: The PR correctly defers NODE_OPTIONS heap tuning, citing Node.js 20+ auto-tuning. This is pragmatic (YAGNI principle), but worth documenting the decision basis.

Recommendation: Consider adding a comment in the YAML explaining why NODE_OPTIONS isn't set:

      - name: frontend
        # No NODE_OPTIONS --max-old-space-size tuning:
        # Node.js 20+ auto-tunes heap based on container limits (2Gi)
        # Will add explicit tuning if 2Gi proves insufficient
        resources:

Impact: Very low - documentation improvement only.

2. OAuth Proxy Memory Limits Conservative but Unvalidated

Location: frontend-oauth-deployment-patch.yaml:60-66

Issue: OAuth proxy is allocated 256Mi request / 512Mi limit without empirical measurement:

resources:
  requests:
    memory: 256Mi
    cpu: 50m
  limits:
    memory: 512Mi
    cpu: 200m

Recommendation: The PR test plan includes "Measure current oauth-proxy memory usage" as unchecked. This should be done before deployment to validate the 512Mi limit is appropriate.

Suggested validation:

# Check current oauth-proxy usage (if deployed)
kubectl top pods -n vteam-uat -l app=frontend --containers | grep oauth-proxy

Risk: Low - 512Mi is likely sufficient for oauth-proxy, but empirical validation strengthens confidence.

3. File Location - Screenshot in docs/screenshots/

Location: docs/screenshots/frontend-memory-sawtooth.png

Observation: Screenshot location is appropriate and follows documentation standards (per CLAUDE.md). No issue here - just noting for completeness.

Note: The screenshot clearly shows:

  • Green line (frontend) exhibits sawtooth pattern
  • Memory climbs to ~500-550Mi then drops sharply
  • Blue line (minio) shows stable, healthy memory profile
  • Pattern repeats every few hours (Jan 7-10)

This is excellent evidence supporting the fix. ✅

Positive Highlights

1. Excellent Problem Analysis ⭐

  • Clear evidence with memory graph showing OOMKill sawtooth pattern
  • Root cause analysis identifies insufficient limits AND oauth-proxy missing limits
  • Follows established precedent (LiteLLM commit b43bf39)

2. Appropriate Scope - Production Only 🎯

  • Kustomize overlay correctly targets production environment only
  • Avoids unnecessary changes to dev/e2e/minikube environments
  • Follows YAGNI principle (no premature optimization)

3. Well-Documented Test Plan 📋

  • Clear success criteria (memory < 1.5Gi, no sawtooth, no OOMKills)
  • Monitoring plan (24-48 hours with kubectl top/events)
  • Rollback plan documented

4. Consistent with LiteLLM Precedent 🔄

  • Both fixes: 512Mi → 2Gi (4x increase)
  • Both: CPU 500m → 1000m
  • Both: Requests also increased for stability
  • Pattern proven successful in production

5. Agent Review Process ✅

  • Code Architect validated sizing
  • Code Simplifier enforced YAGNI (reduced scope)
  • Multi-agent review adds confidence

Recommendations

Priority 1: Validate Kustomize Merge Behavior

Before merging, verify the patch correctly merges containers (not replaces):

# Generate final YAML and verify frontend container has BOTH:
# 1. Base config (image, env vars, probes from base/frontend-deployment.yaml)
# 2. Patched config (new resource limits)
# 3. oauth-proxy container is added (not replaced)

kustomize build components/manifests/overlays/production > /tmp/prod-manifest.yaml

# Check frontend container
grep -A 60 'name: frontend' /tmp/prod-manifest.yaml | grep -E '(image:|memory:|BACKEND_URL|livenessProbe)'

# Expected output should show:
# - image: quay.io/ambient_code/vteam_frontend:latest (from base)
# - memory: 2Gi (from patch)
# - BACKEND_URL env var (from base)
# - livenessProbe (from base)

# Check oauth-proxy container exists
grep -A 30 'name: oauth-proxy' /tmp/prod-manifest.yaml

If the frontend container is missing base config (image, env, probes), add $patch: merge directive.

Priority 2: Measure OAuth Proxy Usage Before Deployment

# If oauth-proxy is currently deployed (unlikely), measure actual usage
kubectl top pods -n vteam-uat -l app=frontend --containers 2>/dev/null | grep oauth-proxy || echo "Not currently deployed"

# After deployment, validate 512Mi is sufficient:
# Success if oauth-proxy stays < 400Mi (80% of limit)

Priority 3: Monitor and Document Results

After deployment:

  1. Monitor for 48 hours as planned
  2. Document actual memory usage in follow-up comment or docs
  3. If memory stabilizes < 1.5Gi, consider this precedent for future Next.js deployments

Priority 4: Consider Future Optimization (Post-Fix)

Once the immediate OOMKill issue is resolved and you have 48 hours of data:

  • Analyze actual memory usage patterns
  • Determine if 2Gi can be reduced (e.g., to 1.5Gi) for better resource utilization
  • Document findings in docs/decisions.md or ADR

Final Assessment

Verdict: APPROVE with minor validation recommended before merge

This PR demonstrates excellent problem-solving:

  • Evidence-based diagnosis (memory graph)
  • Appropriate fix following proven precedent
  • Scoped to affected environment only
  • Well-documented with test/rollback plans

The only concern is ensuring Kustomize merges containers correctly (not replaces). This is likely working as intended (similar patterns exist), but validation provides certainty.

Confidence Level: 95% - The fix will resolve OOMKills. The 5% uncertainty is around Kustomize merge behavior, easily verified with kustomize build check.


Review completed following:

  • ✅ CLAUDE.md project standards
  • ✅ Backend development context (K8s patterns)
  • ✅ Security standards (no secrets, appropriate RBAC)
  • ✅ Error handling patterns
  • ✅ Architecture consistency

No security, authentication, or RBAC concerns - This is infrastructure configuration only.

🤖 Review by Claude Code (Sonnet 4.5)



What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.
