| Action | Details | Status | Cost |
|---|---|---|---|
| Redis Cache | New prod cluster (portal-prod-redis, 2× cache.t4g.medium). Caching enabled on Applicants (120s), Jobs (60s), Users (120s), Interviews (60s) | Active | $96/mo |
| RDS Proxy | portal-prod-proxy with MYSQL_NATIVE_PASSWORD. Connection pooling for all traffic | Active | $30/mo |
| Read Replica | portal-prod-2-1-read (db.r6g.xlarge) | Created | $285/mo |
| gp3 Storage | Migrated from io1 (1,000 IOPS) to gp3 (3,000 IOPS baseline). Cost savings + 3x performance | Done | -$65/mo |
| Worker Fix | ASG min=2, threads=4, VisibilityTimeout 300s (was 12 hours). Queue dropped from 19K to 9K | Active | $60/mo |
| Health Check | Added /health endpoint. Switched ELB from TCP:80 to HTTP:80/health | Active | $0 |
| CloudWatch Alarms | 5 alarms: CPU>80%, connections>50, memory<2GB, queue>100, IOPS>800 | Active | $0 |
Additional monthly cost: ~$406 (including gp3 savings)
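The health-check change above (ELB switched from TCP:80 to HTTP:80/health) depends on the app exposing a `/health` endpoint. A minimal stdlib sketch of such an endpoint is below; the real portal presumably implements it inside its Flask app, and the `{"status": "ok"}` response body is an assumption, not the actual payload.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Minimal handler exposing /health for an ELB HTTP health check."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # silence per-request logging

# To try locally (the EB-managed server handles this in production):
#   HTTPServer(("", 8080), HealthHandler).serve_forever()
```

An HTTP health check lets the ELB distinguish "process is up but erroring" from "healthy", which a bare TCP:80 check cannot.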
| Bug | Impact | Fix |
|---|---|---|
| Missing company_interview_round | 500 on every applicant load + SendGrid events | ALTER TABLE on job_applicant + job_applicant_history |
| JobStatus enum case mismatch | 500 on job creation + SendSummary | UPDATE 5,206 rows + changed enum values to lowercase |
| get_next_profile IndexError | 500 every 10 seconds from Chrome extension | Added empty result check + fixed utcnow() call |
| .env deployed to production | SKIP_AUTH_IN_DEV=True on prod — auth fallback to inactive user | .ebignore + ENV_TYPE check + EB env vars |
| Auth masking DB errors as 401 | All DB errors returned as 401 instead of 500 | Retry with rollback in auth handler |
| Dirty DB sessions | PendingRollbackError cascading to all requests | teardown_appcontext with rollback + session.remove |
| Cache on auth endpoints | 401 responses cached in Redis and served to everyone | Switched to memoize on DB query function, not HTTP response |
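The last fix in the table — caching the DB query result rather than the HTTP response — can be illustrated with a minimal TTL memoizer. This is a stdlib sketch of the pattern only; the real code reportedly uses a caching library's memoize on the query function, and `load_applicants` is a hypothetical name.

```python
import functools
import time

def ttl_memoize(ttl_seconds):
    """Cache a function's *return value* for ttl_seconds.

    Exceptions are never cached, so a transient DB error (or a 401
    built from one) cannot be served to later callers -- the failure
    mode seen when whole HTTP responses were cached in Redis.
    """
    def decorator(fn):
        cache = {}

        @functools.wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = cache.get(args)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]
            value = fn(*args)          # exceptions propagate, nothing cached
            cache[args] = (now, value)
            return value
        return wrapper
    return decorator

# Hypothetical query function: the cache sits below the auth layer,
# so authentication still runs on every request.
@ttl_memoize(ttl_seconds=120)
def load_applicants(job_id):
    return ["applicant-row"]  # stand-in for the real DB query
```

Because only the query result is memoized, two users hitting the same endpoint share the data but never each other's auth outcome.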
| Metric | Before (22/3) | After (24/3) | Improvement |
|---|---|---|---|
| RDS CPU (peak) | 99.7% | 99% (peak), 20-30% (off-peak) | Stable, no flapping |
| DB Connections | 96 max | 62 (proxy pooling) | -35% |
| Read IOPS | 1,004 (ceiling!) | 6-180 | -88% |
| Write Latency | 465ms | 1-2ms | -99.5% |
| P50 Latency | 3,528ms | 44ms | -98.7% |
| P95 Latency | 15,164ms | 161ms | -98.9% |
| 5xx Error Rate | 33-60% | 0% | Zero errors |
| Health Status | Flapping Red/Severe | Green/Ok | Stable |
| Worker Queue | 19,162 stuck | 9,158 (draining) | 10K+ processed |
| EB Warnings/hour | 16-26 | 0 | Zero |
[Chart: paired bars per metric — left bar = Before (22/3), right bar = After (24/3)]
| Resource | Name | Type |
|---|---|---|
| ElastiCache Redis (Prod) | portal-prod-redis | cache.t4g.medium × 2 |
| RDS Proxy | portal-prod-proxy | MYSQL_NATIVE_PASSWORD |
| RDS Read Replica | portal-prod-2-1-read | db.r6g.xlarge |
| SNS Topic | candi-prod-alerts | Email notifications |
| CloudWatch Alarms | candi-rds-cpu-high + 4 more | 5 alarms |
| Secrets Manager | rds-proxy/portal-prod-db-credentials-v2 | DB credentials |
| IAM Role | rds-proxy-portal-prod-role | Proxy access |
| Issue | Severity | Description |
|---|---|---|
| CPU 99% at Peak | Critical | Heavy queries (8+ JOINs) on cache miss. Requires Tier 3 query optimization |
| SKIP_AUTH_IN_DEV on Prod | Security | Auth fallback allows access with an invalid token. Must be removed |
| Candidate Search | Medium | search-test backend 500: KeyError 'titleEmbeddingWeight'. Pre-existing bug |
| Large Pipelines | Medium | 30K+ candidates — data returns but frontend doesn't render. Vue rendering issue |
| Chrome Extension 401 | Low | No automatic token refresh. Users must re-login manually |
**CPU 99% at Peak (Critical)**

**What Happened**
On March 22 we identified the RDS CPU pinned at 99.7% consistently. The system could not serve requests — P50 latency of 3.5 seconds, 60% 5xx errors. All requests hit the database directly, with no caching layer.

**Root Cause**
The main Applicants query includes 11+ LEFT JOINs and 5 CTEs (Common Table Expressions) — pulling data from job_applicant, job, company, user, interview, linkedin_profile, and more. Every pipeline page load fires this query. Additionally, the Chrome extension sends a get_next_profile request every 10 seconds, also hitting the DB directly. Without a cache, every request — even a repeated one — hammers the database.

**What We Did**
Deployed Redis cache with per-endpoint TTLs (Applicants 120s, Jobs 60s, Users 120s, Interviews 60s). This dropped IOPS from 1,004 to 6-180 and latency from 3.5s to 44ms. However, on every cache miss the CPU spikes back to 99%.
**What's Needed (Tier 3) — Action Plan**

**A. Break apart the main query (query_to_api).** The query in job_applicant.py has 11+ LEFT JOINs and 5 CTEs, each a complex query on its own: blacklist checks (two nested CTEs against italent_blacklist_view), placement checks with DATE_ADD calculations, active-candidate detection, and "sent_to_manager" dates. It needs to be split into 3-4 separate queries and assembled in Python.

**B. Eliminate N+1 queries.** Currently each applicant individually triggers Interview.query_to_api() (7 JOINs), Placement.query_to_api() (6 JOINs), JobApplicantHistory.query_to_api() (4 JOINs), CampaignApplicantMail.query_to_api() (11 JOINs), and Candidate.get_status() (4 JOINs + CTEs). For a list of 200 applicants, that's 1,000+ heavy queries per page load. These must be converted to batch queries using WHERE id IN (...).

**C. Add missing indexes.** Critical missing indexes:
- job_applicant_history(job_applicant_id, time_created DESC) — composite index for history queries
- campaign_applicant_mail(parent_id) — full scan on self-join for reply detection
- campaign_applicant_mail(applicant_id, campaign_id) — composite index for mail queries
- linkedin_profile(employer_id) — heavy join in every blacklist query
- contract_company_blacklist(company_id) — join with no index

**D. Route reads to the Read Replica.** The Read Replica (portal-prod-2-1-read) already exists but is not in use. All GET queries (applicants, interviews, placements, mail) should read from the replica, offloading ~80% of the primary's read traffic.

**E. DB-level pagination.** Currently the API returns all applicants at once (up to 30K+, a 6.4MB response). LIMIT/OFFSET must be added at the SQL level to return 100-200 applicants per page.

**F. Materialize expensive views.** italent_blacklist_view and time_to_hire_view are evaluated on every query. They should be converted to materialized tables refreshed every few minutes by a scheduled job, instead of being recalculated on every request.

Without these changes, the Redis cache is just a band-aid — every cache miss pushes the system back to 99% CPU.
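The N+1 conversion in step B can be sketched as follows: one batched `WHERE id IN (...)` query per relation, indexed by applicant in Python. Table and column names here are illustrative (the real code goes through SQLAlchemy `query_to_api` helpers), and sqlite3 stands in for MySQL.

```python
import sqlite3
from collections import defaultdict

def fetch_interviews_batched(conn, applicant_ids):
    """One query for all applicants instead of one query each (N+1)."""
    if not applicant_ids:
        return {}
    placeholders = ",".join("?" * len(applicant_ids))
    rows = conn.execute(
        f"SELECT job_applicant_id, id, status FROM interview "
        f"WHERE job_applicant_id IN ({placeholders})",
        list(applicant_ids),
    ).fetchall()
    # Group rows by applicant so callers can do a dict lookup per row
    # instead of issuing a per-applicant query.
    by_applicant = defaultdict(list)
    for applicant_id, interview_id, status in rows:
        by_applicant[applicant_id].append({"id": interview_id, "status": status})
    return dict(by_applicant)
```

Applied to all five per-applicant calls listed above, a 200-applicant page drops from 1,000+ queries to roughly 5.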
**SKIP_AUTH_IN_DEV on Prod (Security):** auth fallback allows access with an invalid token. Must be removed.
**What Happened**
While diagnosing a 401 error on job creation, we discovered that a local .env file with SKIP_AUTH_IN_DEV=True had been deployed to production before we ever touched the system. The load_dotenv() call in app.py loaded it — even in prod. This was a pre-existing condition in the codebase as handed to us.

**Root Cause**
In auth.py, when JWT authentication fails (e.g. due to a dirty DB session), the code checks whether SKIP_AUTH_IN_DEV=True. If so, instead of returning 401, it falls back to dev mode and binds the request to a fallback user. In production, the first user in the DB was an inactive user, causing a 401 on every POST request. This security vulnerability existed in the system before we began working on it.

**What We Did**
Added .env to .ebignore to prevent deploying the file. Set DEV_USER_EMAIL=sivan@italent.co.il in EB env vars so the fallback uses an active user. SKIP_AUTH_IN_DEV=true is still active in EB env vars as a temporary band-aid to avoid breaking the system while we address the root cause.

**Risk**
Any request that fails authentication (expired token, invalid token, DB error) gets full system access under sivan@italent.co.il. This is a severe security risk that existed before we came in. SKIP_AUTH_IN_DEV must be removed from EB env vars and the root cause addressed — why POST requests fail authentication.
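One way to make a stray SKIP_AUTH_IN_DEV=True inert in production is to gate the dev fallback on the deployment environment as well. The sketch below is an assumption about how the ENV_TYPE check mentioned above could work, not the actual auth.py logic; the variable names come from the report.

```python
import os

def dev_auth_bypass_allowed():
    """Dev auth fallback requires BOTH conditions: running in a dev
    environment AND the explicit opt-in flag. A leaked
    SKIP_AUTH_IN_DEV=True in production (the bug described above)
    then has no effect.
    """
    # Default to "prod" so a missing ENV_TYPE fails closed.
    is_dev = os.environ.get("ENV_TYPE", "prod").lower() == "dev"
    opted_in = os.environ.get("SKIP_AUTH_IN_DEV", "").lower() == "true"
    return is_dev and opted_in
```

The fail-closed default matters: if ENV_TYPE is ever missing from a new environment, the bypass stays off.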
**Candidate Search (Medium):** search-test backend returns 500: KeyError 'titleEmbeddingWeight'. Pre-existing bug.
**What Happened**
When a user tries to search candidates via the Candidate Search UI, the system returns an error and displays no results.

**Root Cause**
The portal backend sends search requests to search-test.eu-west-2.elasticbeanstalk.com (a separate EB environment). The search server returns 500 Internal Server Error with KeyError: 'titleEmbeddingWeight' — a bug in the search backend code, which expects a parameter that is never sent. The bug predates our changes and is unrelated to the infrastructure work.

**What's Needed**
Fix the search backend code — add a default value for titleEmbeddingWeight or ensure the parameter is sent from the portal. This is a separate repo (the search-test EB environment).
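The "default value" half of that fix is a one-line change on the search backend. This sketch assumes the request parameters arrive as a dict; the parameter name comes from the error, but the 0.5 default and the `read_search_weights` helper are illustrative guesses, not the search-test code.

```python
# Default weights applied when the portal omits a parameter.
# The 0.5 value is a placeholder -- the real default belongs to
# whoever owns the search-test ranking logic.
DEFAULT_WEIGHTS = {"titleEmbeddingWeight": 0.5}

def read_search_weights(params):
    """Read ranking weights with defaults, instead of params[key],
    which raises KeyError: 'titleEmbeddingWeight' when the portal
    omits the field (the 500 described above)."""
    return {
        key: params.get(key, default)
        for key, default in DEFAULT_WEIGHTS.items()
    }
```

`dict.get` with a default turns the hard 500 into a defined fallback behavior while the portal side is updated to send the field.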
**Large Pipelines (Medium):** 30K+ candidates — data returns but the frontend doesn't render it. Vue rendering issue.
**What Happened**
Clients with large pipelines (30,000+ candidates) see a "No Data" message in the UI despite the data existing.

**Root Cause**
The API returns all data successfully — we verified the response contains 6.4MB of JSON with all candidates. The problem is in the frontend (Vue.js): the browser can't render tens of thousands of rows in the DOM. The Vue component likely attempts to create a DOM element for every candidate at once, causing JavaScript to hang or time out.

**What's Needed**
Two changes: (1) backend pagination — return a maximum of 100-200 candidates per page with an offset or cursor; (2) frontend virtual scrolling — render only the rows visible on screen (vue-virtual-scroller or similar). A 6.4MB response is also unnecessary load on the network and cache.
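The backend half of that fix — LIMIT/OFFSET at the SQL level — looks roughly like this. Table and column names are illustrative, sqlite3 stands in for MySQL, and the real endpoint would also return a total count or next-page cursor for the UI.

```python
import sqlite3

PAGE_SIZE = 200  # upper end of the 100-200 range suggested above

def fetch_applicants_page(conn, page):
    """Return one page of applicants via SQL-level LIMIT/OFFSET,
    so a 30K-row pipeline never becomes a single 6.4MB response."""
    return conn.execute(
        "SELECT id, name FROM job_applicant ORDER BY id LIMIT ? OFFSET ?",
        (PAGE_SIZE, page * PAGE_SIZE),
    ).fetchall()
```

A stable ORDER BY is required for OFFSET paging to be deterministic; for very deep pages a keyset cursor (`WHERE id > ?`) avoids the cost of large offsets.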
**Chrome Extension 401 (Low):** no automatic token refresh; users must re-login manually.
**What Happened**
Chrome extension users report getting 401 errors after some time and must log out and back in to continue working.

**Root Cause**
The Auth0 access token expires (default: 24 hours). The Chrome extension does not implement a token refresh flow — it never uses the refresh token to obtain a new access token automatically. Once the token expires, every request gets a 401.

**What's Needed**
Add refresh logic to the Chrome extension: before sending a request, check whether the token is near expiration (the exp claim) and, if so, request a new token from Auth0 using the refresh token. This is extension-side work, not backend.
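The "near expiration" check reads the exp claim from the JWT payload. The extension itself is JavaScript, but the logic is the same; this Python sketch shows it with stdlib only. The 5-minute leeway is an assumed threshold, and note this reads the claim without verifying the signature — fine here, since the client is only inspecting its own token to decide when to refresh.

```python
import base64
import json
import time

def token_needs_refresh(jwt_token, leeway_seconds=300):
    """Return True if the JWT's exp claim is within leeway_seconds of
    expiring (or already expired), signalling the client to request a
    new access token from Auth0 via the refresh token."""
    payload_b64 = jwt_token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims["exp"] - time.time() < leeway_seconds
```

Calling this before each request (and refreshing when it returns True) removes the manual log-out/log-in cycle described above.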