
How we cut a 3-hour AWS observability investigation down to 11 minutes

Sarankumar S

May 12, 2026

We were standing up a new proof-of-concept environment for an internal AI workload: an EC2 instance running a FastAPI backend, instrumented with CloudWatch for metrics and log delivery. Standard setup. The kind of thing an engineer should be able to provision, validate, and walk away from in under an hour.

Except it went quiet, leaving us with an observability blind spot

At 15:27 UTC, the CloudWatch Agent on the instance stopped publishing metrics and never recovered. The FastAPI application's log group existed in CloudWatch but contained zero data: no streams, no bytes, nothing. From the outside, the instance looked healthy: EC2 hypervisor metrics were normal and SSM heartbeats continued. But our observability layer had silently collapsed.

The old way to debug this would have been a 2–3 hour exercise: SSH into the instance, tail log files, check systemd service statuses, audit SSM parameter history, cross-reference CloudWatch event timestamps manually. Instead, we let AWS DevOps Agent run the investigation.

The instance looked fine from the outside. The observability layer had silently collapsed from the inside. That gap is exactly what makes this class of failure expensive to debug manually.
AWS DevOps Agent Topology view

The topology view above shows how the agent mapped our environment: three logical nodes (build infrastructure, code execution engine, and data storage), all initially classified as Unknown, since the resources were provisioned via Terraform without the tag conventions the agent uses for auto-discovery. Despite this, the agent proceeded with the investigation using direct AWS service API calls.

What the Agent found

AWS DevOps Agent began investigating the moment the CloudWatch alarm fired. No human opened a dashboard. No one typed a query. The agent worked through the signal chain autonomously, and within 11 minutes it had produced a complete root cause analysis with two confirmed findings and two clearly documented data gaps.

Here is the investigation timeline as it played out:

  • 14:47:58 (EVENT): CloudWatch Agent configured on ai-devops-agent-poc from SSM Parameter Store. Metrics collection begins.
  • 14:48:06 (EVENT): SSM parameters deleted, just 8 seconds after the configuration was applied.
  • 14:48:20 (EVENT): SSM parameter deletion confirmed complete. The config now lives only on the instance filesystem.
  • 14:48–15:27 (DATA): CloudWatch Agent publishes system metrics (CPU, memory, disk, network, processes) every 60 seconds. Everything appears normal.
  • 15:15–15:25 (EVENT): 225 MB of network downloads: package installation in progress on the instance.
  • 15:27 UTC (EVENT): CloudWatch Agent stops publishing metrics and never recovers. SSM heartbeat and EC2 hypervisor metrics continue normally; the instance is still running.
  • +10m53s (FINDING): Root cause identified: SSM parameters were deleted 8 seconds after the CWAgent config was applied. On an attempted config refresh at ~15:27, the agent failed to re-read its configuration, causing a crash or failed state.
  • +11m01s (FINDING): Second finding: the FastAPI application was never successfully started. Log group 'fastapi-app-log' has zero streams and zero bytes. The security group only exposes port 80; FastAPI/Uvicorn defaults to port 8000.
  • +11m18s (GAP): Data gap logged: OS-level logs inaccessible. EC2 GetConsoleOutput and SSM GetCommandInvocation are not authorized for the investigation role.
  • +11m26s (GAP): Second data gap: SSM command invocation output unreadable. The agent explicitly documents what evidence would resolve the remaining uncertainty.

AWS DevOps Agent did not guess around the IAM gaps. It documented them explicitly, naming which APIs were unauthorized, what data those APIs would have provided, and what evidence would be needed to confirm the remaining hypotheses. This is the correct behaviour for a production investigation tool.

The two root causes

Root cause 1 - The 8-second configuration race condition

This issue is subtle and easy to miss without the agent's timestamp correlation.

The CloudWatch Agent was configured at 14:47:58 using an SSM parameter that specified which log files to collect and which CloudWatch log groups to write to. That configuration was applied successfully; the agent began publishing metrics 10 seconds later at 14:48.

The problem: the SSM parameter was deleted at 14:48:06, just 8 seconds after it was used to configure the agent.

CloudWatch Agent periodically attempts to refresh its configuration from the SSM parameter path it was initialized with. When it attempted that refresh at approximately 15:27 UTC (likely triggered by the package installation activity or a scheduled refresh cycle), the parameter no longer existed. The refresh failed, the agent crashed or entered a failed state, and metrics collection stopped permanently.

This explains everything: why metrics published cleanly from 14:48 to 15:27, why the stop was abrupt with no degradation curve, and why the FastAPI log group received zero data despite the log group itself existing. The log collection configuration was lost when the agent failed.
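The correlation the agent performed can be sketched as a simple timestamp join. A minimal sketch, assuming the read and delete events have already been pulled from a source like CloudTrail; the parameter path here is hypothetical, and only the timestamps mirror the incident:

```python
from datetime import datetime, timedelta

def find_config_races(reads, deletes, window=timedelta(minutes=5)):
    """Flag SSM parameters deleted shortly after being read.

    `reads` and `deletes` are lists of (parameter_name, timestamp) tuples,
    e.g. assembled from CloudTrail GetParameter / DeleteParameter events.
    Returns (parameter_name, lag) pairs for suspicious deletions.
    """
    races = []
    for name, read_at in reads:
        for del_name, deleted_at in deletes:
            lag = deleted_at - read_at
            if del_name == name and timedelta(0) <= lag <= window:
                races.append((name, lag))
    return races

# Timestamps mirror the incident; the parameter path is a placeholder.
reads = [("/cwagent/config", datetime(2026, 5, 12, 14, 47, 58))]
deletes = [("/cwagent/config", datetime(2026, 5, 12, 14, 48, 6))]

print(find_config_races(reads, deletes))
# [('/cwagent/config', datetime.timedelta(seconds=8))]
```

Any parameter this flags while a long-running agent still depends on it is a configuration race waiting to happen, which is exactly what root cause 1 was.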

THE TIMING EVIDENCE 

CWAgent published metrics from 14:48 to 15:27 - exactly 39 minutes. Network downloads peaked at 15:15–15:25 (225 MB of package installs). The agent refresh coincided with installation completing. The SSM param was gone. The config refresh failed. Metrics stopped. 

Root cause 2 - FastAPI was never running

The FastAPI application's CloudWatch log group, 'fastapi-app-log', was created at the same time as the instance launch but received zero data throughout the entire lifecycle of this investigation.

The agent identified two independent signals pointing to the application never starting:

  • The security group only exposes ports 22 and 80 (SSH + HTTP). FastAPI running via Uvicorn defaults to port 8000. If the application started and bound to 8000, it would be unreachable from outside, and crucially, no traffic would reach it to generate logs.
  • The process count on the instance increased modestly, from ~115 to ~125 processes at 15:20, consistent with installation processes spinning up, not with a running application server, which would show a more significant and sustained jump.

Combined with the 225 MB download at 15:15–15:25, the pattern is clear: packages were being installed but the application either never started or failed immediately on startup, before producing any log output.
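The port mismatch itself reduces to a one-line check. A minimal sketch; the ingress port set and constant names are illustrative, with 8000 being Uvicorn's documented default bind port:

```python
# Ingress ports as provisioned on the incident's security group (illustrative).
SG_INGRESS_PORTS = {22, 80}
# FastAPI served via `uvicorn main:app` binds to 8000 unless told otherwise.
UVICORN_DEFAULT_PORT = 8000

def app_reachable(ingress_ports, app_port):
    """True only if the security group admits external traffic to the app's port."""
    return app_port in ingress_ports

# Even if FastAPI had started cleanly, no external request could reach it,
# so the log group would still have received zero bytes of access logs.
print(app_reachable(SG_INGRESS_PORTS, UVICORN_DEFAULT_PORT))  # False
```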

AWS DevOps Agent root cause summary and mitigation plan generation

The screenshot above shows the investigation summary as the agent presented it - two numbered impact findings with full evidence chains, timestamped to the second, with a one-click 'Generate mitigation plan' prompt ready to run.

Where the Agent reached its limits and why that matters

The agent did not pretend it had complete information. Two data gaps were explicitly logged, both caused by IAM permissions on the investigation role that did not cover OS-level access.

Gap 1 - No OS-level visibility

  • EC2 GetConsoleOutput not authorized: would have shown instance boot logs and kernel messages
  • SSM GetCommandInvocation not authorized: would have shown the CWAgent status check output from the command run at 14:48:39
  • No system logs (syslog, dmesg, journald) shipped to CloudWatch; they exist only on the local filesystem
  • CloudWatch Agent's own log files at /opt/aws/amazon-cloudwatch-agent/logs/ are inaccessible without SSH or SSM access

These gaps mean we cannot definitively confirm whether the CWAgent crashed due to OOM or a configuration error, or was manually stopped. The SSM parameter deletion timing is strong circumstantial evidence, but the agent was correct to label it as a finding requiring additional evidence rather than a confirmed root cause.

Gap 2 - SSM history gone

The SSM parameter that held the CloudWatch Agent configuration was deleted within 20 seconds of being used. Its content, specifically which log file paths it was monitoring, is permanently gone. Without that data, we cannot confirm whether the log collection configuration pointed to the correct FastAPI log output path in the first place.

The agent's documented gaps are as valuable as its findings. They are an explicit checklist of what IAM permissions to add and what data to ship before the next investigation, which is a better outcome than a human investigator silently skipping over the same blind spots.

Time saved, costs reduced: the numbers prove it

This is not a hypothetical MTTR comparison. Here is what the manual investigation of this incident would have required:

| Investigation step | Manual (est.) | With AWS DevOps Agent |
| --- | --- | --- |
| CloudWatch alarm → investigation start | 5–10 min (human ack + login) | 0 min (autonomous) |
| Cross-referencing metrics + deploy timeline | 30–45 min | Included in 11-min run |
| Identifying SSM deletion as config root cause | 45–90 min (requires CloudTrail audit) | 10m53s |
| Documenting FastAPI port mismatch | 15–30 min (SSH + process check) | 11m01s |
| Logging IAM gaps as evidence checklist | Rarely done; usually skipped | Auto-documented |
| Producing mitigation plan | 10–20 min (writing runbook) | One click |
| TOTAL | ~2–3.5 hours | < 12 minutes |

On-call engineer rate: approximately $120–180/hour fully loaded. Manual investigation of this incident: 2–3.5 hours minimum. Cost of that investigation: $240–$630 in engineer time, for a single mid-severity incident.
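The quoted cost range is straightforward to reproduce from the figures above:

```python
# Cost of a manual investigation: hours spent times fully loaded hourly rate.
hours_low, hours_high = 2.0, 3.5   # manual investigation time, hours
rate_low, rate_high = 120, 180     # fully loaded engineer rate, $/hour

print(f"${hours_low * rate_low:.0f}-${hours_high * rate_high:.0f}")  # $240-$630
```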

That math compounds fast across a team running multiple services. The agent does not replace the engineer who validates the hypothesis and executes the fix, but it eliminates the context-assembly phase entirely, which is 70–80% of incident time for this class of failure.

Three things we're fixing based on this

1.  IAM permissions for the investigation role

The agent documented exactly which APIs it could not call: EC2 GetConsoleOutput and SSM GetCommandInvocation. Both are read-only. Both are safe to add. We are extending the investigation role to include these permissions so the agent has OS-level visibility on the next run.
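A minimal policy statement covering the two read-only calls the agent flagged might look like the following sketch; in practice, scope `Resource` down to the PoC instance and the relevant SSM commands rather than `*`:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "InvestigationOsLevelReadOnly",
      "Effect": "Allow",
      "Action": [
        "ec2:GetConsoleOutput",
        "ssm:GetCommandInvocation"
      ],
      "Resource": "*"
    }
  ]
}
```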

2.  System logs to CloudWatch

The CloudWatch Agent's own log files and the instance's system logs (syslog, journald) need to ship to CloudWatch, not just sit on the filesystem. This is a one-line addition to the CWAgent configuration. Without it, the agent, and any human investigator, is blind to instance-level failures.
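As a sketch, the additions to the `logs` section of the CWAgent configuration could look like this; the log group names are placeholders we chose, the agent-log path is the agent's default on Linux, and the syslog path would be `/var/log/messages` on Amazon Linux rather than `/var/log/syslog`:

```json
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log",
            "log_group_name": "cwagent-self-logs",
            "log_stream_name": "{instance_id}"
          },
          {
            "file_path": "/var/log/syslog",
            "log_group_name": "instance-syslog",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}
```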

3.  SSM parameter lifecycle management

Parameters used to configure long-running agents should not be deleted after use. Either retain them, or use AWS Secrets Manager with versioning, or bake the configuration into the AMI or user data directly. The 8-second deletion window that caused this entire incident is an infrastructure hygiene issue, not a tooling limitation.

The incident itself was minor, but the lessons were significant. The value was in what the investigation surfaced: three concrete infrastructure hygiene gaps that would have made future incidents harder to debug than this one.

Our take on AWS DevOps Agent after this

We went into this evaluation skeptical. 'AI SRE' is a phrase that gets thrown around loosely and often means 'a dashboard with a chatbot attached.' AWS DevOps Agent is not that.

What it does, correlating metrics, log metadata, deployment events, and SSM history across a timestamp chain, is exactly the kind of join operation that takes a human engineer 45 minutes and takes the agent seconds. The SSM parameter deletion timing correlation is a good example: a human investigator would need to think to look at CloudTrail for parameter deletions, know the CWAgent refresh interval, and manually join those two signals. The agent did it as part of its standard investigation flow.

The IAM gaps it surfaced are better documentation than most postmortems we have written manually. The explicit 'evidence needed' notes in the gap records are a checklist for what to fix before the next incident, something that rarely makes it into postmortem action items when humans write them under pressure.

WHAT WE RECOMMEND 

If you are AWS-native and CloudWatch is already your primary observability layer: run a 30-day parallel evaluation against your existing on-call. Connect your CI/CD. Enable the investigation role with at least GetConsoleOutput and SSM read permissions. The topology quality determines investigation depth, so fix your resource tagging first. The MTTR reduction is real, and it shows up in the first investigation.

What we are watching: autonomous remediation with proper guardrails, deeper compliance audit trail support, and how investigation quality holds on novel failure modes that don't match historical patterns.