fix: increase production frontend memory limits to prevent OOMKills #504
base: main
Conversation
- Frontend memory: 512Mi → 2Gi limit, 256Mi → 512Mi request
- OAuth proxy limits: 256Mi request, 512Mi limit (NEW - prevents unbounded growth)

Fixes the sawtooth memory pattern causing pod restarts and connection errors. Matches the LiteLLM precedent (commit b43bf39: 512Mi → 2Gi, a 4x increase). Production only - other environments are not affected by OOMKills.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Claude Code Review

Summary

This PR addresses production frontend OOMKills by increasing memory limits from 512Mi to 2Gi and adding resource limits to the oauth-proxy sidecar. The changes are scoped to the production deployment only via a Kustomize patch, following the precedent of the LiteLLM fix (commit b43bf39). The memory graph clearly shows the sawtooth pattern indicative of OOMKills, and the proposed fix is well justified.

Issues by Severity

🚫 Blocker Issues

None - this PR is ready to merge.

🔴 Critical Issues

None

🟡 Major Issues

1. Kustomize Strategic Merge Behavior - Container List Replacement Risk

Location: `components/manifests/overlays/production/frontend-oauth-deployment-patch.yaml`

Issue: Kustomize's strategic merge for container arrays uses `name` as the merge key.

Potential Problem: Some Kustomize versions or configurations might replace the entire container list instead of merging individual containers by name. The ordering (frontend first, oauth-proxy second) suggests the intent is to merge, but this isn't guaranteed without explicit merge directives.

Evidence from base:

```yaml
# base/frontend-deployment.yaml:19-40
containers:
- name: frontend
  imagePullPolicy: Always
  image: quay.io/ambient_code/vteam_frontend:latest
  # ... existing config with 512Mi limit
```

Recommended Fix:

```yaml
spec:
  template:
    spec:
      containers:
      - name: frontend
        $patch: merge  # Explicit merge directive
        resources:
          requests:
            memory: 512Mi
            cpu: 200m
          limits:
            memory: 2Gi
            cpu: 1000m
```

OR verify that the current approach works as expected by testing the kustomize build output:

```bash
kustomize build components/manifests/overlays/production | grep -A 50 "kind: Deployment" | grep -A 40 "name: frontend"
```

Risk Level: Medium - the patch likely works (similar patterns already exist in the repository), but verification is recommended before merge.

🔵 Minor Issues

1. Missing NODE_OPTIONS Heap Tuning Discussion

Location: PR description - Deferred Work section

Observation: The PR correctly defers NODE_OPTIONS heap tuning, citing Node.js 20+ auto-tuning. This is pragmatic (YAGNI principle), but worth documenting the decision basis.

Recommendation: Consider adding a comment in the YAML explaining why NODE_OPTIONS isn't set:

```yaml
- name: frontend
  # No NODE_OPTIONS --max-old-space-size tuning:
  # Node.js 20+ auto-tunes heap based on container limits (2Gi)
  # Will add explicit tuning if 2Gi proves insufficient
  resources:
```

Impact: Very low - documentation improvement only.

2. OAuth Proxy Memory Limits Conservative but Unvalidated

Location: `components/manifests/overlays/production/frontend-oauth-deployment-patch.yaml`

Issue: The OAuth proxy is allocated a 256Mi request / 512Mi limit without empirical measurement:

```yaml
resources:
  requests:
    memory: 256Mi
    cpu: 50m
  limits:
    memory: 512Mi
    cpu: 200m
```

Recommendation: The PR test plan includes "Measure current oauth-proxy memory usage" as unchecked. This should be done before deployment to validate that the 512Mi limit is appropriate.

Suggested validation:

```bash
# Check current oauth-proxy usage (if deployed)
kubectl top pods -n vteam-uat -l app=frontend --containers | grep oauth-proxy
```

Risk: Low - 512Mi is likely sufficient for oauth-proxy, but empirical validation strengthens confidence.

3. File Location - Screenshot in docs/screenshots/

Location: `docs/screenshots/frontend-memory-sawtooth.png`

Observation: The screenshot location is appropriate and follows documentation standards (per CLAUDE.md). No issue here - just noting it for completeness.

Note: The screenshot clearly shows the sawtooth memory pattern described in this PR. This is excellent evidence supporting the fix.

✅ Positive Highlights

1. Excellent Problem Analysis ⭐
2. Appropriate Scope - Production Only 🎯
3. Well-Documented Test Plan 📋
4. Consistent with LiteLLM Precedent 🔄
5. Agent Review Process ✅
Recommendations

Priority 1: Validate Kustomize Merge Behavior

Before merging, verify the patch correctly merges containers (rather than replacing them):

```bash
# Generate final YAML and verify frontend container has BOTH:
# 1. Base config (image, env vars, probes from base/frontend-deployment.yaml)
# 2. Patched config (new resource limits)
# 3. oauth-proxy container is added (not replaced)
kustomize build components/manifests/overlays/production > /tmp/prod-manifest.yaml

# Check frontend container
grep -A 60 'name: frontend' /tmp/prod-manifest.yaml | grep -E '(image:|memory:|BACKEND_URL|livenessProbe)'

# Expected output should show:
# - image: quay.io/ambient_code/vteam_frontend:latest (from base)
# - memory: 2Gi (from patch)
# - BACKEND_URL env var (from base)
# - livenessProbe (from base)

# Check oauth-proxy container exists
grep -A 30 'name: oauth-proxy' /tmp/prod-manifest.yaml
```

If the frontend container is missing base config (image, env, probes), add the explicit `$patch: merge` directive shown above.

Priority 2: Measure OAuth Proxy Usage Before Deployment

```bash
# If oauth-proxy is currently deployed (unlikely), measure actual usage
kubectl top pods -n vteam-uat -l app=frontend --containers 2>/dev/null | grep oauth-proxy || echo "Not currently deployed"

# After deployment, validate 512Mi is sufficient:
# Success if oauth-proxy stays < 400Mi (80% of limit)
```

Priority 3: Monitor and Document Results

After deployment:
Priority 4: Consider Future Optimization (Post-Fix)

Once the immediate OOMKill issue is resolved and you have 48 hours of data:
Final Assessment

Verdict: APPROVE with minor validation recommended before merge ✅

This PR demonstrates excellent problem-solving:
The only concern is ensuring Kustomize merges containers correctly rather than replacing them. This is likely working as intended (similar patterns exist), but validation provides certainty.

Confidence Level: 95% - the fix will resolve the OOMKills. The 5% uncertainty is around Kustomize merge behavior, which is easily verified with a kustomize build check.

Review completed following:
No security, authentication, or RBAC concerns - this is infrastructure configuration only.

🤖 Review by Claude Code (Sonnet 4.5)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:
Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

Summary
Fixes production frontend pod OOMKills by increasing memory limits from 512Mi to 2Gi and adding resource limits to the oauth-proxy sidecar.
Problem
Frontend pods in vteam-uat are exhibiting a sawtooth memory pattern: memory climbs to 500-550Mi, the container is OOMKilled, the pod restarts, and the cycle repeats every few hours.
Evidence:
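In addition to the memory graph screenshot, the pattern can be cross-checked from the CLI. A minimal sketch, assuming metrics-server is available and the pods carry the `app=frontend` label used elsewhere in this PR:

```bash
# Climbing restart counts every few hours are the first hint of repeated OOMKills
kubectl get pods -n vteam-uat -l app=frontend

# Per-container memory should show frontend approaching its current 512Mi limit
kubectl top pods -n vteam-uat -l app=frontend --containers

# The last terminated state confirms OOMKilled as the restart reason
kubectl get pods -n vteam-uat -l app=frontend -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
```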
Root Cause
Solution
Minimal production-only fix (1 file, 18 lines changed):
Frontend Container
OAuth Proxy Container (NEW)
Rationale:
Deferred Work (YAGNI)
Not included in this PR - add later if needed:
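One deferred item called out in the review is explicit NODE_OPTIONS heap tuning. If the 2Gi limit later proves insufficient, a hedged sketch of how it could be added (the deployment name `frontend` and the imperative `kubectl set env` approach are assumptions; in practice the env var would more likely be added to the production overlay patch so it survives the next apply):

```bash
# Hypothetical follow-up only - not part of this PR.
# Caps V8's old-space heap at ~1.5Gi inside the 2Gi container limit.
# Deployment name and namespace are assumptions about this environment.
kubectl set env deployment/frontend -n vteam-uat \
  NODE_OPTIONS="--max-old-space-size=1536"
```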
Test Plan
Pre-Deployment:
Post-Deployment:
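A sketch of the kind of checks these items imply (the deployment name `frontend` is an assumption; the namespace and label selector follow the review's examples):

```bash
# Confirm the rendered limits actually landed on the running deployment
kubectl get deployment frontend -n vteam-uat -o jsonpath='{.spec.template.spec.containers[*].resources}'

# Restart counts should stop climbing once the new limits are in place
kubectl get pods -n vteam-uat -l app=frontend

# Memory should plateau well below the new 2Gi limit instead of sawtoothing
kubectl top pods -n vteam-uat -l app=frontend --containers
```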
Success Criteria:
Rollback Plan:
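Because the change is a single overlay patch, rollback is a one-commit revert. A sketch, assuming the production overlay is applied directly with kubectl/kustomize (the deploy mechanism is an assumption; with a GitOps controller the revert commit alone would be enough):

```bash
# Revert the patch commit (placeholder SHA) and re-apply the production overlay
git revert <patch-commit-sha>
kubectl apply -k components/manifests/overlays/production
```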
Agent Review
This plan was reviewed by:
Files Changed
- `components/manifests/overlays/production/frontend-oauth-deployment-patch.yaml` (18 lines added)
- `docs/screenshots/frontend-memory-sawtooth.png` (183KB - visual evidence)

🤖 Generated with Claude Code