Nuvo

Who am I?

My name is Harta, and I joined Nuvo in January. Previously, I was a Staff Engineer at Meta on the Facebook Monetization organization. At Nuvo, I lead a number of our infrastructure efforts. Lately, I’ve been especially excited about scaling our task queue infrastructure and platformizing how Nuvo integrates with customer ERP systems.

Context

When a startup experiences an outage, small teams like ours are often stretched thin, scouring endless lines of code for bugs and hoping to land on an effective fix.

In late January, Nuvo experienced a couple of outages. We used AI as a debugging partner to narrow the search space, validate hypotheses, and connect signals across Datadog, Postgres, and recent code changes.

The result: issues that could have taken days to find and weeks to resolve were identified and fixed quickly, with higher confidence and less thrash. The incidents also reinforced that AI can play a role not only in incident response, but as an automatic check during review when we ship changes that may impact performance.

Incident #1: Postgres CPU usage spiked to 100%

On January 9, we increased polling on our supplier dashboard based on feedback from key customers who rely on it for real-time operations. The change increased database requests across all suppliers. CPU utilization rose, but it didn’t immediately cause an outage.

On January 21, we shipped another supplier dashboard change, which added load on top of the already-elevated baseline.

On January 22, Postgres CPU hit 100% during peak load. API latency spiked and requests started timing out, causing a partial outage for about 30 minutes.

As a short-term mitigation, we increased the database’s CPU and RAM (we’re on Render, so this was a quick change), which brought load back under control.

Investigation

Emily Liu, an infrastructure engineer, used Claude to build a script that surfaced our top queries, correlated them with CPU trends, and highlighted query patterns that had increased week over week. This made it easy to compare query behavior before and during the outage.

Early on, we confirmed this wasn’t a broad database slowdown—CPU usage was highly concentrated in a handful of query patterns. That reframed the investigation from, “What is causing the CPU spike?” to a more specific question: were we looking at

a regression,
a product change that inadvertently increased load, or
a query shape that had always been expensive but had finally hit a tipping point as traffic grew?

Using the script, Emily pinpointed the day query load stepped up. Claude then reviewed the commits from that day to identify the likely trigger. She traced it to a set of changes that substantially increased polling from the supplier dashboard. In other words, a product change exposed a structurally expensive query.

Before AI, building a script like this would have taken significant time and care—easily several days of engineering work—and would have been a staple tool on an infrastructure team. With AI, she generated it in minutes and moved much faster toward a root cause.

Similarly, reviewing a large range of historical commits to root cause an outage is usually time-consuming (and tedious). AI handled it in seconds. End-to-end, the time to identify the root cause dropped from a day or two to roughly an hour or two.

Mapping the fix

Once Emily identified the root cause, the next step was addressing it. We had two options:

Reduce polling: would improve database load, but at the cost of a worse real-time experience.
Optimize the queries: would preserve the product experience, but carried a higher risk of subtle correctness bugs. Highly optimized query patterns are easy to get wrong when hand-crafting them, and getting them right can take time.

We chose the second option. These queries power a core workflow in our product, and the increased polling had been added in response to customer complaints.

Additionally, our dataset had grown to roughly 700K+ conversation events and 265K+ applications, and the query pattern no longer scaled. Even if we reduced polling, the underlying issue would remain: as the dataset grew, the query would continue to degrade.

Optimizing the query

Emily’s script surfaced two closely related query patterns that were dominating CPU:

A customer applications query with heavy table joins (~21.74% of total DB time)
A query that was doing full table scans across ~700K rows instead of using indexes (~14.97% of total DB time)

Together, these two query shapes accounted for roughly 37% of overall DB CPU work—more than enough to explain the symptoms we were seeing.

The root cause was structural. The original query used DISTINCT ON over the entire conversation_event table to compute the latest event per application, and it used correlated EXISTS subqueries to check alert state. As a result, Postgres did work proportional to the full events table (~700K rows), regardless of how many applications the caller actually needed. Under production concurrency, that translated directly into the CPU spikes we observed.

To fix this, we replaced the DISTINCT ON and correlated subqueries with LATERAL joins—effectively flipping the execution model. Instead of materializing “latest event per app” globally and filtering afterward, the rewritten query scopes each lookup to a specific application row. That makes the cost proportional to the supplier’s application count, not the size of the events table.

Before deploying, we validated semantic equivalence using stratified sampling across supplier size buckets. We ran both query variants side-by-side against production data and confirmed zero mismatches in sort order, alert flags, and edge cases. EXPLAIN (ANALYZE, BUFFERS) output confirmed the expected reduction in buffer hits and execution time across all buckets, with the largest gains on high-application-count suppliers. AI was leveraged to write the equivalence script as well as rapidly verify the outputs matched semantically.

The result was >10× faster performance for large application counts, sub-millisecond latency for most queries, and a meaningful reduction in overall database pressure during peak traffic, eliminating the retry-amplification loop that had been cascading into timeouts.

In this process, AI accelerated:

Hypothesis generation and validation regarding the outage.
Analysis of 50+ queries across the codebase.
Cross-referencing of GitHub changes with database performance impacts.

Overall, we were able to complete a comprehensive analysis of the issue in less than a day instead of spending several days manually scripting. As a team committed to building and shipping products, we take advantage of AI’s scaling effects to support and preempt infrastructure issues as they come up.

Monitoring

This incident also pushed us to sharpen our monitoring and paging strategy. Bryan, a product engineer on our AI team, drove two key follow-ups:

Expand monitoring coverage to include all critical production events.
Onboard PagerDuty so on-call engineers are paged for critical events, rather than relying on Slack escalations.

His approach leaned heavily on AI. He used Claude to map our production services as represented in Terraform. With that service map, he worked with Claude to identify the monitors we needed for each service. And because Claude had access to our Datadog MCP, it could see what monitors already existed and avoid proposing redundant alerts.

He took a T-shaped approach: broad baseline coverage across the system, paired with deeper monitoring in areas where issues had historically occurred:

Postgres (which stores all application data)
Redis
Task queue workers
The GraphQL application service

In addition, he added baseline health checks across the rest of our infrastructure, including an SFTP service we run for customers, various proxies, and our staging environment.

For our database, he went particularly deep. He grouped alerts into three categories and assigned priorities accordingly:

Active outages or severe risk: high CPU or memory usage, TID wraparound risk, or critically high disk utilization.
Signals of an impending problem: high numbers of active connections at the PgBouncer or database layers, query queueing, elevated lock timeouts, and unusual or sustained spikes in query rates.
Sources of long-term technical debt: idle-in-transaction sessions, long-running queries, and deadlocks (and other issues that quietly accumulate load over time).

Finally, leveraging its MCP integrations, Claude built the monitors directly in Datadog and wired them to Slack or PagerDuty depending on severity. Claude was able to complete this in minutes, saving us significant manual setup work.

Incident #2: A Terraform change accidentally downsized the database

A few days later, on January 26, a Terraform configuration error unintentionally resized our database back down to its previous CPU and RAM allocation, effectively undoing the headroom we’d added while rolling out the query optimization. The impact was immediate: under production traffic, CPU climbed straight back to 100%.

Fortunately, because our alerting had improved after Incident #1, we were paged right away.

Investigation

This time, the issue was clearer and the fix was simpler: we needed to bump the database size back up. AI still played a role afterward, helping us tighten our Terraform process so the same class of mistake would be less likely to recur.

Prevention

To prevent this kind of infra regression going forward, we invested in better processes and tooling around Terraform. One of our TLMs, Brian, took the lead.

Before these outages, we often ran terraform plan + terraform apply locally and opened PRs afterward. With a growing team and more frequent infrastructure changes (including changes that contributed to this database incident), we wanted to harden the workflow.

Brian chose Atlantis, an open-source, GitHub-integrated tool that centralizes Terraform workflows and runs plan/apply via PRs. That solved several problems:

Better visibility into infra changes (no more “hidden” local terraform apply)
Required PR approvals before applying changes to core environments
PR-level locking to prevent two PRs from modifying the same Terraform project concurrently
Apply-time checks to ensure the plan is still up to date
A hook point for our code review bots to vet plans for unintended side effects

As a side effect, Atlantis also let us tighten what individual engineers can directly access, and it ensured all infra changes flow through a PR process.

We evaluated a few hosting options and ultimately went with a dedicated GCP project. We ran Atlantis on Cloud Run, with all instances sharing the same NFS and using Redis for PR-level locks.

Without AI-assisted development, it’s unlikely we would have prioritized this improvement right now since customer requests were already consuming significant engineering bandwidth. Historically, we would have taken a simpler approach (e.g., requiring engineers to attach the Terraform plan in PRs and be more careful). Because AI enabled rapid development, we were able to harden our infrastructure process without slowing down the roadmap.

Conclusion

These incidents left us with a clear understanding of where AI can be the most effective partner in building. It compressed the most time-consuming parts of the response so that the Engineering team could move quickly from identifying symptoms to implementing a verified fix without sacrificing correctness. On the database side, the outcome was a query plan that scales with what the endpoint needs. On the infrastructure side, the outcome was a Terraform workflow that makes accidental regressions harder to ship. Moving forward, the team is investing in both: performance work that reduces load at the source, and operational rigor that keeps the system stable as we scale.

I love how AI tools have made us dramatically faster and more productive. If you’re excited about applying these new tools to hard problems, come join us.

Retrospective: How We Leveraged AI To Resolve Critical Infrastructure Incidents

Who am I?

Context

Incident #1: Postgres CPU usage spiked to 100%

Investigation

Mapping the fix

Optimizing the query

Monitoring

Incident #2: A Terraform change accidentally downsized the database

Investigation

Prevention

Conclusion

Heading

Heading