| Action | Details | Status | Cost |
|---|---|---|---|
| Redis Cache | New prod cluster (portal-prod-redis, 2× cache.t4g.medium). Caching enabled on Applicants (120s), Jobs (60s), Users (120s), Interviews (60s) | Active | $96/mo |
| RDS Proxy | portal-prod-proxy with MYSQL_NATIVE_PASSWORD. Connection pooling for all traffic | Active | $30/mo |
| Read Replica | portal-prod-2-1-read (db.r6g.xlarge) | Created | $285/mo |
| gp3 Storage | Migrated from io1 (1,000 IOPS) to gp3 (3,000 IOPS baseline). Cost savings + 3x performance | Done | -$65/mo |
| Worker Fix | ASG min=2, threads=4, VisibilityTimeout 300s (was 12 hours). Queue dropped from 19K to 9K | Active | $60/mo |
| Health Check | Added /health endpoint. Switched ELB from TCP:80 to HTTP:80/health | Active | $0 |
| CloudWatch Alarms | 5 alarms: CPU>80%, connections>50, memory<2GB, queue>100, IOPS>800 | Active | $0 |
Additional monthly cost: ~$406 (including gp3 savings)
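The health-check change above (ELB switched from TCP:80 to HTTP:80/health) depends on the app exposing a `/health` endpoint. A minimal stdlib sketch of such an endpoint is below; the real portal presumably implements it inside its Flask app, and the `{"status": "ok"}` response body is an assumption, not the actual payload.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Minimal handler exposing /health for an ELB HTTP health check."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # silence per-request logging

# To try locally (the EB-managed server handles this in production):
#   HTTPServer(("", 8080), HealthHandler).serve_forever()
```

An HTTP health check lets the ELB distinguish "process is up but erroring" from "healthy", which a bare TCP:80 check cannot.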
| Bug | Impact | Fix |
|---|---|---|
| Missing company_interview_round | 500 on every applicant load + SendGrid events | ALTER TABLE on job_applicant + job_applicant_history |
| JobStatus enum case mismatch | 500 on job creation + SendSummary | UPDATE 5,206 rows + changed enum values to lowercase |
| get_next_profile IndexError | 500 every 10 seconds from Chrome extension | Added empty result check + fixed utcnow() call |
| .env deployed to production | SKIP_AUTH_IN_DEV=True on prod — auth fallback to inactive user | .ebignore + ENV_TYPE check + EB env vars |
| Auth masking DB errors as 401 | All DB errors returned as 401 instead of 500 | Retry with rollback in auth handler |
| Dirty DB sessions | PendingRollbackError cascading to all requests | teardown_appcontext with rollback + session.remove |
| Cache on auth endpoints | 401 responses cached in Redis and served to everyone | Switched to memoize on DB query function, not HTTP response |
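The last fix in the table — caching the DB query result rather than the HTTP response — can be illustrated with a minimal TTL memoizer. This is a stdlib sketch of the pattern only; the real code reportedly uses a caching library's memoize on the query function, and `load_applicants` is a hypothetical name.

```python
import functools
import time

def ttl_memoize(ttl_seconds):
    """Cache a function's *return value* for ttl_seconds.

    Exceptions are never cached, so a transient DB error (or a 401
    built from one) cannot be served to later callers -- the failure
    mode seen when whole HTTP responses were cached in Redis.
    """
    def decorator(fn):
        cache = {}

        @functools.wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = cache.get(args)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]
            value = fn(*args)          # exceptions propagate, nothing cached
            cache[args] = (now, value)
            return value
        return wrapper
    return decorator

# Hypothetical query function: the cache sits below the auth layer,
# so authentication still runs on every request.
@ttl_memoize(ttl_seconds=120)
def load_applicants(job_id):
    return ["applicant-row"]  # stand-in for the real DB query
```

Because only the query result is memoized, two users hitting the same endpoint share the data but never each other's auth outcome.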
| Metric | Before (22/3) | After (24/3) | Improvement |
|---|---|---|---|
| RDS CPU (peak) | 99.7% | 99% (peak), 20-30% (off-peak) | Stable, no flapping |
| DB Connections | 96 max | 62 (proxy pooling) | -35% |
| Read IOPS | 1,004 (ceiling!) | 6-180 | -88% |
| Write Latency | 465ms | 1-2ms | -99.5% |
| P50 Latency | 3,528ms | 44ms | -98.7% |
| P95 Latency | 15,164ms | 161ms | -98.9% |
| 5xx Error Rate | 33-60% | 0% | Zero errors |
| Health Status | Flapping Red/Severe | Green/Ok | Stable |
| Worker Queue | 19,162 stuck | 9,158 (draining) | 10K+ processed |
| EB Warnings/hour | 16-26 | 0 | Zero |
[Chart: paired bars per metric — left bar = Before (22/3), right bar = After (24/3)]
| Resource | Name | Type |
|---|---|---|
| ElastiCache Redis (Prod) | portal-prod-redis | cache.t4g.medium × 2 |
| RDS Proxy | portal-prod-proxy | MYSQL_NATIVE_PASSWORD |
| RDS Read Replica | portal-prod-2-1-read | db.r6g.xlarge |
| SNS Topic | candi-prod-alerts | Email notifications |
| CloudWatch Alarms | candi-rds-cpu-high + 4 more | 5 alarms |
| Secrets Manager | rds-proxy/portal-prod-db-credentials-v2 | DB credentials |
| IAM Role | rds-proxy-portal-prod-role | Proxy access |
| Issue | Severity | Description |
|---|---|---|
| CPU 99% at Peak | Critical | Heavy queries (8+ JOINs) on cache miss. Requires Tier 3 query optimization |
| SKIP_AUTH_IN_DEV on Prod | Security | Auth fallback allows access with an invalid token. Must be removed |
| Candidate Search | Medium | search-test backend 500: KeyError 'titleEmbeddingWeight'. Pre-existing bug |
| Large Pipelines | Medium | 30K+ candidates — data returns but frontend doesn't render. Vue rendering issue |
| Chrome Extension 401 | Low | No automatic token refresh. Users must re-login manually |
**CPU 99% at Peak (Critical)**

**What Happened**
On March 22 we identified the RDS CPU pinned at 99.7% consistently. The system could not serve requests — P50 latency of 3.5 seconds, 60% 5xx errors. All requests hit the database directly, with no caching layer.

**Root Cause**
The main Applicants query includes 11+ LEFT JOINs and 5 CTEs (Common Table Expressions) — pulling data from job_applicant, job, company, user, interview, linkedin_profile, and more. Every pipeline page load fires this query. Additionally, the Chrome extension sends a get_next_profile request every 10 seconds, also hitting the DB directly. Without a cache, every request — even a repeated one — hammers the database.

**What We Did**
Deployed Redis cache with per-endpoint TTLs (Applicants 120s, Jobs 60s, Users 120s, Interviews 60s). This dropped IOPS from 1,004 to 6-180 and latency from 3.5s to 44ms. However, on every cache miss the CPU spikes back to 99%.
**What's Needed (Tier 3) — Action Plan**

**A. Break apart the main query (query_to_api).** The query in job_applicant.py has 11+ LEFT JOINs and 5 CTEs, each a complex query on its own: blacklist checks (two nested CTEs against italent_blacklist_view), placement checks with DATE_ADD calculations, active-candidate detection, and "sent_to_manager" dates. It needs to be split into 3-4 separate queries and assembled in Python.

**B. Eliminate N+1 queries.** Currently each applicant individually triggers Interview.query_to_api() (7 JOINs), Placement.query_to_api() (6 JOINs), JobApplicantHistory.query_to_api() (4 JOINs), CampaignApplicantMail.query_to_api() (11 JOINs), and Candidate.get_status() (4 JOINs + CTEs). For a list of 200 applicants, that's 1,000+ heavy queries per page load. These must be converted to batch queries using WHERE id IN (...).

**C. Add missing indexes.** Critical missing indexes:
- job_applicant_history(job_applicant_id, time_created DESC) — composite index for history queries
- campaign_applicant_mail(parent_id) — full scan on self-join for reply detection
- campaign_applicant_mail(applicant_id, campaign_id) — composite index for mail queries
- linkedin_profile(employer_id) — heavy join in every blacklist query
- contract_company_blacklist(company_id) — join with no index

**D. Route reads to the Read Replica.** The Read Replica (portal-prod-2-1-read) already exists but is not in use. All GET queries (applicants, interviews, placements, mail) should read from the replica, offloading ~80% of the primary's read traffic.

**E. DB-level pagination.** Currently the API returns all applicants at once (up to 30K+, a 6.4MB response). LIMIT/OFFSET must be added at the SQL level to return 100-200 applicants per page.

**F. Materialize expensive views.** italent_blacklist_view and time_to_hire_view are evaluated on every query. They should be converted to materialized tables refreshed every few minutes by a scheduled job, instead of being recalculated on every request.

Without these changes, the Redis cache is just a band-aid — every cache miss pushes the system back to 99% CPU.
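The N+1 conversion in step B can be sketched as follows: one batched `WHERE id IN (...)` query per relation, indexed by applicant in Python. Table and column names here are illustrative (the real code goes through SQLAlchemy `query_to_api` helpers), and sqlite3 stands in for MySQL.

```python
import sqlite3
from collections import defaultdict

def fetch_interviews_batched(conn, applicant_ids):
    """One query for all applicants instead of one query each (N+1)."""
    if not applicant_ids:
        return {}
    placeholders = ",".join("?" * len(applicant_ids))
    rows = conn.execute(
        f"SELECT job_applicant_id, id, status FROM interview "
        f"WHERE job_applicant_id IN ({placeholders})",
        list(applicant_ids),
    ).fetchall()
    # Group rows by applicant so callers can do a dict lookup per row
    # instead of issuing a per-applicant query.
    by_applicant = defaultdict(list)
    for applicant_id, interview_id, status in rows:
        by_applicant[applicant_id].append({"id": interview_id, "status": status})
    return dict(by_applicant)
```

Applied to all five per-applicant calls listed above, a 200-applicant page drops from 1,000+ queries to roughly 5.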
**SKIP_AUTH_IN_DEV on Prod (Security):** auth fallback allows access with an invalid token. Must be removed.
**What Happened**
While diagnosing a 401 error on job creation, we discovered that a local .env file with SKIP_AUTH_IN_DEV=True had been deployed to production before we ever touched the system. The load_dotenv() call in app.py loaded it — even in prod. This was a pre-existing condition in the codebase as handed to us.

**Root Cause**
In auth.py, when JWT authentication fails (e.g. due to a dirty DB session), the code checks whether SKIP_AUTH_IN_DEV=True. If so, instead of returning 401, it falls back to dev mode and binds the request to a fallback user. In production, the first user in the DB was an inactive user, causing a 401 on every POST request. This security vulnerability existed in the system before we began working on it.

**What We Did**
Added .env to .ebignore to prevent deploying the file. Set DEV_USER_EMAIL=sivan@italent.co.il in EB env vars so the fallback uses an active user. SKIP_AUTH_IN_DEV=true is still active in EB env vars as a temporary band-aid to avoid breaking the system while we address the root cause.

**Risk**
Any request that fails authentication (expired token, invalid token, DB error) gets full system access under sivan@italent.co.il. This is a severe security risk that existed before we came in. SKIP_AUTH_IN_DEV must be removed from EB env vars and the root cause addressed — why POST requests fail authentication.
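One way to make a stray SKIP_AUTH_IN_DEV=True inert in production is to gate the dev fallback on the deployment environment as well. The sketch below is an assumption about how the ENV_TYPE check mentioned above could work, not the actual auth.py logic; the variable names come from the report.

```python
import os

def dev_auth_bypass_allowed():
    """Dev auth fallback requires BOTH conditions: running in a dev
    environment AND the explicit opt-in flag. A leaked
    SKIP_AUTH_IN_DEV=True in production (the bug described above)
    then has no effect.
    """
    # Default to "prod" so a missing ENV_TYPE fails closed.
    is_dev = os.environ.get("ENV_TYPE", "prod").lower() == "dev"
    opted_in = os.environ.get("SKIP_AUTH_IN_DEV", "").lower() == "true"
    return is_dev and opted_in
```

The fail-closed default matters: if ENV_TYPE is ever missing from a new environment, the bypass stays off.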
**Candidate Search (Medium):** search-test backend returns 500: KeyError 'titleEmbeddingWeight'. Pre-existing bug.
**What Happened**
When a user tries to search candidates via the Candidate Search UI, the system returns an error and displays no results.

**Root Cause**
The portal backend sends search requests to search-test.eu-west-2.elasticbeanstalk.com (a separate EB environment). The search server returns 500 Internal Server Error with KeyError: 'titleEmbeddingWeight' — a bug in the search backend code, which expects a parameter that is never sent. The bug predates our changes and is unrelated to the infrastructure work.

**What's Needed**
Fix the search backend code — add a default value for titleEmbeddingWeight or ensure the parameter is sent from the portal. This is a separate repo (the search-test EB environment).
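The "default value" half of that fix is a one-line change on the search backend. This sketch assumes the request parameters arrive as a dict; the parameter name comes from the error, but the 0.5 default and the `read_search_weights` helper are illustrative guesses, not the search-test code.

```python
# Default weights applied when the portal omits a parameter.
# The 0.5 value is a placeholder -- the real default belongs to
# whoever owns the search-test ranking logic.
DEFAULT_WEIGHTS = {"titleEmbeddingWeight": 0.5}

def read_search_weights(params):
    """Read ranking weights with defaults, instead of params[key],
    which raises KeyError: 'titleEmbeddingWeight' when the portal
    omits the field (the 500 described above)."""
    return {
        key: params.get(key, default)
        for key, default in DEFAULT_WEIGHTS.items()
    }
```

`dict.get` with a default turns the hard 500 into a defined fallback behavior while the portal side is updated to send the field.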
**Large Pipelines (Medium):** 30K+ candidates — data returns but the frontend doesn't render it. Vue rendering issue.
**What Happened**
Clients with large pipelines (30,000+ candidates) see a "No Data" message in the UI despite the data existing.

**Root Cause**
The API returns all data successfully — we verified the response contains 6.4MB of JSON with all candidates. The problem is in the frontend (Vue.js): the browser can't render tens of thousands of rows in the DOM. The Vue component likely attempts to create a DOM element for every candidate at once, causing JavaScript to hang or time out.

**What's Needed**
Two changes: (1) backend pagination — return a maximum of 100-200 candidates per page with an offset or cursor; (2) frontend virtual scrolling — render only the rows visible on screen (vue-virtual-scroller or similar). A 6.4MB response is also unnecessary load on the network and cache.
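The backend half of that fix — LIMIT/OFFSET at the SQL level — looks roughly like this. Table and column names are illustrative, sqlite3 stands in for MySQL, and the real endpoint would also return a total count or next-page cursor for the UI.

```python
import sqlite3

PAGE_SIZE = 200  # upper end of the 100-200 range suggested above

def fetch_applicants_page(conn, page):
    """Return one page of applicants via SQL-level LIMIT/OFFSET,
    so a 30K-row pipeline never becomes a single 6.4MB response."""
    return conn.execute(
        "SELECT id, name FROM job_applicant ORDER BY id LIMIT ? OFFSET ?",
        (PAGE_SIZE, page * PAGE_SIZE),
    ).fetchall()
```

A stable ORDER BY is required for OFFSET paging to be deterministic; for very deep pages a keyset cursor (`WHERE id > ?`) avoids the cost of large offsets.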
**Chrome Extension 401 (Low):** no automatic token refresh; users must re-login manually.
**What Happened**
Chrome extension users report getting 401 errors after some time and must log out and back in to continue working.

**Root Cause**
The Auth0 access token expires (default: 24 hours). The Chrome extension does not implement a token refresh flow — it never uses the refresh token to obtain a new access token automatically. Once the token expires, every request gets a 401.

**What's Needed**
Add refresh logic to the Chrome extension: before sending a request, check whether the token is near expiration (the exp claim) and, if so, request a new token from Auth0 using the refresh token. This is extension-side work, not backend.
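The "near expiration" check reads the exp claim from the JWT payload. The extension itself is JavaScript, but the logic is the same; this Python sketch shows it with stdlib only. The 5-minute leeway is an assumed threshold, and note this reads the claim without verifying the signature — fine here, since the client is only inspecting its own token to decide when to refresh.

```python
import base64
import json
import time

def token_needs_refresh(jwt_token, leeway_seconds=300):
    """Return True if the JWT's exp claim is within leeway_seconds of
    expiring (or already expired), signalling the client to request a
    new access token from Auth0 via the refresh token."""
    payload_b64 = jwt_token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims["exp"] - time.time() < leeway_seconds
```

Calling this before each request (and refreshing when it returns True) removes the manual log-out/log-in cycle described above.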