AWS X DevOps Keynote · 2026
P0 Incident · On-call: You

Your Agent Built It. Who the 🤬 Runs It?

Soon, you'll be on call for code you never read.
Your job is no longer to know every line.
It's to manage the system that runs it.

Ran Tavory
The Evolution of Coding Agents 2021 → 2026

  • AUTOCOMPLETE ERA: Predictive assistance
  • AI IDE ERA: Conversational editing
  • AGENT ERA: Autonomous execution
  • MULTI-AGENT ERA: Orchestrated systems

  • 2021 · GitHub Copilot · Single functions, boilerplate, ~10–50 LOC · sec – min
  • 2023 · Cursor · Multi-file edits, refactors, ~100–500 LOC · minutes
  • 2023 · Aider · Repo patching, iterative debug, Git-aware CLI · tens of min
  • 2024 · Windsurf · Autonomous nav, cross-repo, hour-scale · multi-hour
  • 2025 · Claude Code · Repo-wide flows, large refactors, day-scale exec · hours – days
  • 2026 · Codex · Long-running tasks, planning + exec, tool orchestration · multi-day
  • 2026 · Cursor 3 · Coordinated agents, persistent memory, multi-day projects · multi-day

Task horizon growth

  • 30 sec: token prediction
  • 5 min: conv. editing
  • 1 hour: agentic loops
  • 1 workday: repo-wide tasks
  • multi-day: supervised execution

Coding agents, by how long a task they can complete.

[Chart: length of tasks AIs can do, by model release year. x-axis: 2019 to 2026; y-axis: 0 secs to ~10 hrs, with marks at 1 hour, 3 hrs and 1 workday. Models shown: GPT-2, GPT-3, GPT-3.5, GPT-4, GPT-4o, o1, o3, GPT-5, GPT-5.2, Opus 4.6. Data: METR Time Horizons 1.1 (metr.org) · CC-BY AIDigest]

Source: METR time-horizons study. Exponential growth, no sign of slowing.

The longer the task, the lower your intimacy with the output.

A new Moore's Law for AI agents.

Werner Vogels: You build it, you run it

Werner Vogels · Amazon · 2006

  • Ownership and quality
  • Reduced friction
  • Cultural shift

You didn't build it.

Your agent built it.

So who the 🤬 runs it?

2

Three forces making production harder.

  • Force 1 - Volume
  • Force 2 - Breadth
  • Force 3 - Lost intimacy
Force 1 - Volume

More code. More surface area for failure.

  • 1 billion commits on GitHub in 2025, 25% YoY growth
  • 26.9% of production code is AI-authored (early 2026; n=4.2M devs)
  • Honest range: 27–42% depending on "generated" vs "assisted"
  • Quality is slipping as volume grows: copy/pasted lines 8.3% → 12.3% (GitClear 2025, n=211M lines)

[Figure: GitHub uptime calendar, March 2026 · 99.52% uptime, with several incident days]

"In six months AI would be writing 90 percent of code"

Dario Amodei · Anthropic CEO

GitClear 2025 · n=211M lines · GitHub Octoverse 2025
Force 2 - Breadth

The definition of "developer" expanded.

63%
of vibe coders are non-developers
Vercel, 2026
  • PMs, designers, marketers shipping working software via v0
  • Traditional developers now ship code in domains they've never worked in: MCP, auth, infra glue
  • "Vibe coding has redefined who can code - most vibe coders aren't developers at all." - Vercel

AI would replace developers?

The definition of 'developer' expanded to anyone who can prompt.

Force 2 - Breadth: Security

When non-specialists ship code, attack surface grows.

  • 30 CVEs filed against MCP servers in 60 days (Jan–Feb 2026)
  • ~⅓ of 2,614 surveyed MCP implementations susceptible to command injection
  • Named production Remote Code Execution bugs in LangFlow, LiteLLM, GPT Researcher, Flowise, Agent Zero
OX Security advisory Apr 2026 · dev.to MCP vulnerability tracker

The gap between "easy to ship" and "safe to ship" is enormous.

Force 3 - Lost Intimacy

The hunch is gone.

YOU WROTE IT
🚨 2:17 am - null ref in auth
  • auth middleware: touched Tuesday ("probably here")
  • token service: wrote in March
  • user model: Noa owns this
  • db layer: refactored Oct · PR#482
✓ fixed in 18 min
You wrote or reviewed every line.

AGENT WROTE IT
🚨 2:17 am - null ref in auth
  • auth middleware: generated Apr 3 ("no idea" ?)
  • token service: never read
  • user model: never read
  • db layer: never read
⟳ starts from zero
You've never read any of this.

"When code breaks, you are not reconstructing intent. You are reverse-engineering a decision process that was never explained and no longer exists."

- David Monnerat

The codebase is accumulating decisions
that nobody made.

The Pivot

You can fight bad outputs. You can't fight the volume curve.

  • Ranting about "AI slop" doesn't reduce slop; the volume ships either way
  • The choice: whether you have the operational muscle to run code you didn't write

Where do we go from here?

3

Managing without reading.

  • You've managed code you didn't write before; every tech lead has
  • Trust, but verify → Trust, but make it prove it
  • Your job: defining what good behaviour looks like, detecting deviation

Your job

1. Define what good behaviour looks like

2. Detect deviation

What agents need

1. Goal

2. A way to verify their work


You've done this before. The verbs just change their object.

You did this (humans) | You'll do this (agents)
Hire engineer | Spin up agent
Set expectations & 1:1s | Define Goals, SLOs & instrument
Code reviews | Behavioural reviews via traces / metrics
Performance evaluations | SLO compliance reports

manage by outcome

Stop reviewing code.

Start reviewing observable behavior.


Scenario: agent ships a payment feature.

What the agent ships

✅ tests pass (47/47)
✅ no lint errors
✅ coverage 89%
✅ ready to merge

What you demand instead

→ Add metrics: payment.retry.attempts, payment.retry.exhausted
→ Run synthetic failure in staging
→ Show me the trace + metric values
→ Verify dashboard in prod yourself
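
Speaker-note sketch: one way the first two demands might look in code. It counts every attempt and every exhausted retry budget, then runs a synthetic failure. Only the two metric names come from the slide; the `metrics` counter, `UpstreamTimeout` and `charge_with_retry` are hypothetical names for illustration, and a real service would use its own metrics client.

```python
from collections import Counter

# Hypothetical in-memory metric sink, standing in for a real metrics client.
metrics = Counter()


class UpstreamTimeout(Exception):
    """Stand-in for a failure reported by the payment provider."""


def charge_with_retry(charge, budget=3):
    """Call `charge` up to `budget` times, counting every attempt.

    Returns True on success, False once the retry budget is exhausted.
    """
    for _ in range(budget):
        metrics["payment.retry.attempts"] += 1
        try:
            charge()
            return True
        except UpstreamTimeout:
            continue
    metrics["payment.retry.exhausted"] += 1
    return False


def always_times_out():
    """Synthetic failure: the upstream never answers."""
    raise UpstreamTimeout()


ok = charge_with_retry(always_times_out, budget=3)
print(ok, dict(metrics))
```

The point of the synthetic failure is that the metric values become evidence: the agent can show the counters moved, instead of asserting that the retry path works.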

Scenario: auth endpoint.

p99 latency < 300ms under expected load
error rate < 0.1%, no single type > 50% of failures
token refresh success rate > 99.5%
≤ 5 failed logins per IP/min before rate-limit kicks in
100% of auth events emit structured log: user-id + outcome

These rules aren't in the code. They're in the contract. Enforced via observability, not line-by-line review.
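
Speaker-note sketch: the contract above could be checked mechanically against measurements from one evaluation window. The thresholds are the ones on the slide; the `AuthWindow` fields and the `slo_violations` helper are assumptions made for illustration. The rate-limit rule is left out because it describes a behaviour to test rather than a ratio to measure.

```python
from dataclasses import dataclass


@dataclass
class AuthWindow:
    """Measurements for one evaluation window (hypothetical field names)."""
    p99_latency_ms: float
    error_rate: float               # fraction of requests that failed
    top_error_type_share: float     # largest single error type / all failures
    token_refresh_success: float    # fraction of refreshes that succeeded
    structured_log_coverage: float  # fraction of auth events with user-id + outcome


def slo_violations(w: AuthWindow) -> list[str]:
    """Return the contract clauses that this window breaks."""
    checks = {
        "p99 latency < 300ms": w.p99_latency_ms < 300,
        "error rate < 0.1%": w.error_rate < 0.001,
        "no single error type > 50% of failures": w.top_error_type_share <= 0.5,
        "token refresh success > 99.5%": w.token_refresh_success > 0.995,
        "100% of auth events logged": w.structured_log_coverage == 1.0,
    }
    return [name for name, ok in checks.items() if not ok]


healthy = AuthWindow(
    p99_latency_ms=240.0, error_rate=0.0004, top_error_type_share=0.3,
    token_refresh_success=0.998, structured_log_coverage=1.0,
)
slow = AuthWindow(
    p99_latency_ms=410.0, error_rate=0.0004, top_error_type_share=0.3,
    token_refresh_success=0.998, structured_log_coverage=1.0,
)
print(slo_violations(healthy))
print(slo_violations(slow))
```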

4

Four shifts.

  1. Observability-first deployment ← the test
  2. SLOs as the production contract
  3. Policy-as-code guardrails ← the assertions
  4. AI-generated investigation reports
Shift 1 of 4

Observability-first.

  • Instrument before you ship: it's the deploy gate, the unit test of production
  • "Agent-observable code" = structured logs, semantic metric labels, trace span attributes that explain intent

Not intent

ERROR: failed

Intent

payment.retry.exhausted
user=42 reason=upstream_timeout
attempt=4 budget=3

Code is agent-observable if a person (or an agent) can reason about its behaviour without reading the source. That means structured logs with semantic event names, metrics with documented units and labels, traces with attributes that explain intent.
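
Speaker-note sketch: a structured log line with a semantic event name might be produced like this. The event and field names are taken from the "Intent" example above; the `log_event` helper is a hypothetical name, and JSON is just one possible encoding.

```python
import json


def log_event(event: str, **fields) -> str:
    """Render one structured log line: a semantic event name plus key/value context."""
    return json.dumps({"event": event, **fields}, sort_keys=True)


line = log_event(
    "payment.retry.exhausted",
    user=42,
    reason="upstream_timeout",
    attempt=4,
    budget=3,
)
print(line)
```

A reader, human or agent, can tell from this one line what happened and why, without opening the source: the attempt count exceeded the budget because the upstream timed out.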

Shift 2 of 4

SLOs as the contract.

  • The SLO defines what "working" means; it's how the agent verifies its own work
  • Pre-deploy: agent must prove the SLO holds under load, not just that "tests pass"
  • Post-deploy: SLO compliance overrides every other signal, including the test suite
"Production is where the rigor goes." - Charity Majors

SLOs turn your intent from a vague sense of "working" into a measurable contract.

As a manager you set OKRs for your employees.
As an agent manager you set SLOs.

Shift 3 of 4

Policy-as-code guardrails.

  • The deploy gate is code: version-controlled, enforced by CI
  • Common gates: latency budget, error rate, dependency health, security scan, IaC compliance
  • Agent-written code passes through the same gate as human code, automatically

Canonical tool: OPA (Open Policy Agent). CNCF graduated, used at Netflix, Goldman Sachs, Google Cloud.
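
Speaker-note sketch: the shape of a deploy gate, written in plain Python as a stand-in. This is not OPA or Rego; it only illustrates the idea that the policy is data under version control and the gate is a function CI can call. The `POLICY` keys, thresholds and `evaluate_gate` are hypothetical.

```python
# Hypothetical policy: thresholds live in version control next to the pipeline.
POLICY = {
    "max_p99_latency_ms": 300,
    "max_error_rate": 0.001,
    "require_security_scan": True,
}


def evaluate_gate(candidate: dict, policy: dict = POLICY) -> list[str]:
    """Return the reasons to block a deploy; an empty list means the gate passes."""
    denials = []
    if candidate["p99_latency_ms"] > policy["max_p99_latency_ms"]:
        denials.append("latency budget exceeded")
    if candidate["error_rate"] > policy["max_error_rate"]:
        denials.append("error rate above limit")
    if policy["require_security_scan"] and not candidate["security_scan_passed"]:
        denials.append("security scan missing or failed")
    return denials


# The gate never asks who wrote the change, so agent and human code are treated alike.
agent_change = {"p99_latency_ms": 340, "error_rate": 0.0002, "security_scan_passed": True}
human_change = {"p99_latency_ms": 180, "error_rate": 0.0002, "security_scan_passed": True}
print(evaluate_gate(agent_change))
print(evaluate_gate(human_change))
```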

What's next

Emerging Policy-as-Prompt research and tools

Tooling exists; ecosystem is early

Shift 4 of 4

Start from the investigation report.

Old: tail logs

$ tail -f /var/log/app.log | grep ERROR
[03:11:58] ERROR req_id=a4f2 status=500
[03:11:59] ERROR req_id=a4f3 status=500
[03:12:00] ERROR req_id=a4f4 status=500
[03:12:01] ERROR req_id=a4f5 status=500
[03:12:01] WARN db_query slow=true ms=4823
[03:12:02] ERROR req_id=a4f6 status=500
[03:12:02] ERROR req_id=a4f7 status=500
^C

You are the investigation engine.

New: AI investigation report

// AI Investigation Report - 03:12 UTC
Latency spike at 02:47:03 UTC → deploy cart-service@7f3a91
N+1 query in OrderHistoryService.getRecentOrders()
Fix: prefetch_related('items') line 47
Confidence: HIGH · [trace] [deploy] [git blame]

You evaluate the hypothesis.

  • Real today: PagerDuty SRE Agent (Oct 2025), incident.io (Jul 2025), Azure AI SRE (May 2025 preview)

Use your AI bullshit meter

If the report is wrong, ask three questions:

  1. Is my instrumentation right?
  2. Is my analysis right?
  3. Did the agent have enough context?
Shift 5 of 4

The new on-call skill.

Old

✕ Read the code. - Force 1: volume is exponential
✕ Find the bug. - Force 2: breadth spans stacks you don't own
✕ Know every service. - Force 3: lost intimacy, code you never wrote

Operational metrics matter more, not less: MTTR, MTBF, SLO compliance, error budget burn rate.
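
Speaker-note sketch: error budget burn rate, using the common definition of observed error rate divided by the error rate the SLO allows. The request counts below are made-up inputs, and the 99.9% target is an assumption chosen to match the "error rate < 0.1%" rule from the auth scenario.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Error budget burn rate: observed error rate / allowed error rate.

    1.0 means the budget is being spent exactly as fast as the SLO permits;
    higher values mean the budget runs out before the SLO period ends.
    """
    if total == 0:
        return 0.0
    allowed = 1.0 - slo_target
    return (failed / total) / allowed


# 20 failures in 10,000 requests is a 0.2% error rate against a 0.1% allowance.
rate = burn_rate(failed=20, total=10_000, slo_target=0.999)
print(round(rate, 2))
```

A burn rate of about 2 means the budget is being consumed roughly twice as fast as the SLO allows, which is a signal you can act on without reading a line of the service's code.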

Things to be good at

  1. Reading distributed traces
  2. Recognising failure-mode patterns across stacks you didn't write
  3. Validating AI hypotheses
  4. Knowing when to escalate vs. when to trust the report
5

The reframe.

"You build it, you run it."

⇕

"You ship agents, you own their outcome."

Accountability hasn't moved. The keystrokes have.

Call to Action

Three questions before every agent-written deploy.

  1. What does good behaviour look like?
     Have you defined the SLOs?

  2. How will I know if it's misbehaving?
     Is the instrumentation in place?

  3. Who bears the responsibility?
     Is responsibility clear?

"If you can't answer those three, you're not ready to ship, regardless of who wrote the code."


Your job isn't writing/reading code. It's writing/reading systems.

Ran Tavory · @rantav

Thank you.