AWS X DevOps Keynote · 2026

P0 Incident · On-call: You

Your Agent Built It. Who the 🤬 Runs It?

Soon, you'll be on call for code you never read.
Your job is no longer to know every line.
It's to manage the system that runs it.

Ran Tavory


The Evolution of Coding Agents 2021 → 2026

AUTOCOMPLETE ERA: Predictive assistance

AI IDE ERA: Conversational editing

AGENT ERA: Autonomous execution

MULTI-AGENT ERA: Orchestrated systems

2021 · GitHub Copilot · Single functions, boilerplate, ~10–50 LOC · sec – min

2023 · Cursor · Multi-file edits, refactors, ~100–500 LOC · minutes

2023 · Aider · Repo patching, iterative debug, Git-aware CLI · tens of min

2024 · Windsurf · Autonomous nav, cross-repo, hour-scale · multi-hour

2025 · Claude Code · Repo-wide flows, large refactors, day-scale exec · hours – days

2026 · Codex · Long-running tasks, planning + exec, tool orchestration · multi-day

2026 · Cursor 3 · Coordinated agents, persistent memory, multi-day projects · multi-day

Task horizon growth

30 sec: token prediction

5 min: conversational editing

1 hour: agentic loops

1 workday: repo-wide tasks

multi-day: supervised execution

whoami

Research Engineering @ TII
LLMs, RAG, Agents

04 / 24

Coding agents, by how long a task they can complete.

Source: METR time-horizons study — exponential growth, no sign of slowing.

The longer the task, the lower your intimacy with the output.

A new Moore's Law for AI agents.

05 / 24

Werner Vogels · Amazon · 2006

Ownership and quality
Reduced friction
Cultural shift

You didn't build it.

Your agent built it.

So who the 🤬 runs it?


06 / 24

Three forces making production harder.

Force 1 — Volume
Force 2 — Breadth
Force 3 — Lost intimacy

Force 1 — Volume 07 / 24

More code. More surface area for failure.

1 billion commits on GitHub in 2025 — 25% YoY growth
26.9% of production code is AI-authored (early 2026; n=4.2M devs)
Honest range: 27–42% depending on "generated" vs "assisted"
Quality is slipping as volume grows: copy/pasted lines rose from 8.3% to 12.3% (GitClear 2025, n=211M lines)

GitHub uptime calendar March 2026 — 99.52% with several incident days

GitHub · March 2026 · 99.52% uptime
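
To put 99.52% in perspective, here is a minimal sketch of the downtime arithmetic. It assumes the figure covers all 31 days of March; the slide does not state the measurement window, so treat the result as an estimate.

```python
# Convert a monthly uptime percentage into implied downtime.
# Assumption: the 99.52% figure covers the full 31 days of March 2026.
HOURS_IN_MARCH = 31 * 24  # 744 hours


def downtime_hours(uptime_percent: float, window_hours: float) -> float:
    """Return the hours of downtime implied by an uptime percentage."""
    return (1 - uptime_percent / 100) * window_hours


hours = downtime_hours(99.52, HOURS_IN_MARCH)
print(f"{hours:.2f} hours (~{hours * 60:.0f} minutes) of downtime")
```

Under that assumption, 99.52% works out to roughly three and a half hours of downtime in the month.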

"In six months AI would be writing 90 percent of code"

Dario Amodei · Anthropic CEO

GitClear 2025 · n=211M lines · GitHub Octoverse 2025

Force 2 — Breadth 08 / 24

The definition of "developer" expanded.

63% of vibe coders are non-developers (Vercel, 2026)

PMs, designers, marketers shipping working software via v0
Traditional developers now ship code in domains they've never worked in: MCP, auth, infra glue
"Vibe coding has redefined who can code — most vibe coders aren't developers at all." — Vercel

Would AI replace developers?

The definition of 'developer' expanded to anyone who can prompt.

Force 2 — Breadth: Security 09 / 24

When non-specialists ship code, attack surface grows.

30 CVEs filed against MCP servers in 60 days (Jan–Feb 2026)
~⅓ of 2,614 surveyed MCP implementations susceptible to command injection
Named production Remote Code Execution bugs in LangFlow, LiteLLM, GPT Researcher, Flowise, Agent Zero

OX Security advisory Apr 2026 · dev.to MCP vulnerability tracker

The gap between "easy to ship" and "safe to ship" is enormous.

Force 3 — Lost Intimacy 10 / 24

The hunch is gone.

YOU WROTE IT

🚨 2:17 am — null ref in auth

auth middleware · touched Tuesday
"probably here"
token service · wrote in March
user model · Noa owns this
db layer · refactored Oct · PR#482

✓ fixed in 18 min

you wrote or reviewed every line

AGENT WROTE IT

🚨 2:17 am — null ref in auth

auth middleware · generated Apr 3
"no idea"
?
token service · never read
user model · never read
db layer · never read

⟳ starts from zero

you've never read any of this

The codebase is accumulating decisions that nobody made.

The Pivot 11 / 24

You can fight bad outputs. You can't fight the volume curve.

Ranting about "AI slop" doesn't reduce slop — the volume ships either way
The choice: whether you have the operational muscle to run code you didn't write

Where do we go from here?


12 / 24

Managing without reading.

You've managed code you didn't write before — every tech lead has
Trust, but verify → Trust, but make it prove it

Your job

1. Define goal

2. Detect deviation

What agents need

1. Goal

2. A way to verify their work

13 / 24

You've done this before. The verbs just change their object.

You did this (humans)	You'll do this (agents)
Hire engineer	Spin up agent
Set expectations & 1:1s	Define Goals, SLOs & instrument
Code reviews	Behavioural reviews via traces / metrics
Performance evaluations	SLO compliance reports

manage by outcome

Review code.

↓

Review observable behavior.

14 / 24

Scenario: agent ships a payment feature.

What the agent ships

            ✅ tests pass (47/47)

            ✅ no lint errors

            ✅ coverage 98%

            ✅ ready to merge

            ✅ code review

What you demand additionally

            → Add metrics: payment.retry.attempts, payment.retry.exhausted

            → Run synthetic failure in staging

            → Show me the trace + metric values

            → Verify dashboard in prod yourself
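
The first two demands can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the in-memory `metrics` counter stands in for a real metrics client, and `UpstreamTimeout`, `charge_with_retry`, and `always_times_out` are hypothetical names introduced here.

```python
from collections import Counter

# Stand-in for a real metrics client (hypothetical; swap in your own library).
metrics = Counter()


class UpstreamTimeout(Exception):
    """Hypothetical error raised when the payment provider times out."""


def charge_with_retry(charge, budget: int = 3):
    """Call `charge` up to `budget` times, counting every attempt made."""
    for _ in range(budget):
        metrics["payment.retry.attempts"] += 1
        try:
            return charge()
        except UpstreamTimeout:
            continue
    # Every attempt failed: record that the retry budget ran out.
    metrics["payment.retry.exhausted"] += 1
    raise UpstreamTimeout(f"gave up after {budget} attempts")


def always_times_out():
    """Synthetic failure, as you might inject in staging."""
    raise UpstreamTimeout()


try:
    charge_with_retry(always_times_out)
except UpstreamTimeout:
    pass

print(dict(metrics))
```

The point of the synthetic failure is that the metric values become evidence: if `payment.retry.exhausted` never moves when the upstream is forced to fail, the instrumentation is wrong, whatever the tests say.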


15 / 24

Four shifts.

Observability-first deployment ← the test
SLOs as the production contract
Policy-as-code guardrails ← the assertions
AI-generated investigation reports

Shift 1 of 4 16 / 24

Observability-first.

Instrument before you ship — it's the deploy gate, the unit-test of production
"Agent-observable code" = structured logs, semantic metric labels, trace span attributes that explain intent

Not intent

ERROR: failed

Intent

payment.retry.exhausted
user=42 reason=upstream_timeout
attempt=4 budget=3

Agent-observable code:

— Reason without reading the source.
— Structured logs
— Semantic event names
— Metrics with documented units and labels
— Traces with attributes that explain intent
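
As a rough sketch of the "Intent" example above, the event can be emitted as one structured log line using only the Python standard library. The helper name `format_event` and the logger name `payments` are assumptions made for illustration; the field names and values come from the slide.

```python
import json
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
log = logging.getLogger("payments")  # hypothetical logger name


def format_event(name: str, **fields) -> str:
    """Render a semantic event name plus its fields as one JSON line."""
    return json.dumps({"event": name, **fields}, sort_keys=True)


line = format_event(
    "payment.retry.exhausted",
    user=42,
    reason="upstream_timeout",
    attempt=4,
    budget=3,
)
log.error(line)
```

Because every field is a named key rather than free text, a human or an agent can filter on `reason` or compare `attempt` against `budget` without opening the source.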

Shift 2 of 4 17 / 24

SLOs as the contract.

The SLO defines what "working" means — it's how the agent verifies its own work
Pre-deploy: agent must prove the SLO holds under load — not "tests pass"
Post-deploy: SLO compliance overrides every other signal, including the test suite

"Production is where the rigor goes." — Charity Majors

SLOs turn your intent from the vibe of "working" into a measurable contract.

As a manager you set OKRs for your employees.
As an agent manager you set SLOs.
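
One way to picture an SLO as a contract is a check that either holds or does not. The sketch below is an assumption-laden illustration: the `SLO` class, the `checkout-availability` name, the 99.9% target, and the request counts are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed


def compliance(good: int, total: int) -> float:
    """Fraction of requests that met the objective."""
    return good / total if total else 1.0


def error_budget_remaining(slo: SLO, good: int, total: int) -> float:
    """Share of the error budget still unspent (negative means overspent)."""
    allowed_failures = (1 - slo.target) * total
    actual_failures = total - good
    if allowed_failures == 0:
        return 1.0 if actual_failures == 0 else float("-inf")
    return 1 - actual_failures / allowed_failures


checkout = SLO(name="checkout-availability", target=0.999)  # hypothetical SLO
good, total = 99_950, 100_000  # made-up load-test numbers

print(f"compliance: {compliance(good, total):.4%}")
print(f"budget remaining: {error_budget_remaining(checkout, good, total):.0%}")
print("SLO holds" if compliance(good, total) >= checkout.target else "SLO violated")
```

With these illustrative numbers, 50 failures against an allowance of 100 leaves about half the error budget, so the contract holds.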

Shift 3 of 4 18 / 24

Policy-as-code guardrails.

The deploy gate is code — version-controlled, enforced by CI
Common gates: latency budget, error rate, dependency health, security scan, IaC compliance
Agent-written code passes through the same gate as human code — automatically

Canonical tool: OPA (Open Policy Agent) — CNCF graduated, used at Netflix, Goldman Sachs, Google Cloud.
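
The shape of such a gate can be sketched in plain Python. This is not OPA and not Rego; it only illustrates the idea that policies are data in version control and the gate is a function CI can run. The gate names, thresholds, and metric keys are hypothetical.

```python
from typing import Callable, Dict, List

Metrics = Dict[str, float]
Policy = Callable[[Metrics], bool]

# Hypothetical gates and thresholds; real values belong in version control.
POLICIES: Dict[str, Policy] = {
    "latency_budget": lambda m: m["p99_latency_ms"] <= 300,
    "error_rate": lambda m: m["error_rate"] <= 0.01,
    "security_scan": lambda m: m["critical_vulns"] == 0,
}


def evaluate(metrics: Metrics) -> List[str]:
    """Return the names of every policy the candidate deploy violates."""
    return [name for name, check in POLICIES.items() if not check(metrics)]


def gate(metrics: Metrics) -> bool:
    """Allow the deploy only when no policy is violated."""
    violations = evaluate(metrics)
    for name in violations:
        print(f"DENY: {name}")
    return not violations


candidate = {"p99_latency_ms": 420, "error_rate": 0.002, "critical_vulns": 0}
print("deploy allowed" if gate(candidate) else "deploy blocked")
```

The gate never asks who wrote the change, which is the property the slide describes: agent-written and human-written code face the same checks.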

What's next

Emerging Policy-as-Prompt research and tools

Tooling exists — ecosystem is early

Shift 4 of 4 19 / 24

Start from the investigation report.

Old: tail logs

            $ tail -f /var/log/app.log | grep ERROR

            [03:11:58] ERROR req_id=a4f2 status=500

            [03:11:59] ERROR req_id=a4f3 status=500

            [03:12:00] ERROR req_id=a4f4 status=500

            [03:12:01] ERROR req_id=a4f5 status=500

            [03:12:01] WARN  db_query slow=true ms=4823

            [03:12:02] ERROR req_id=a4f6 status=500

            [03:12:02] ERROR req_id=a4f7 status=500

            ^C

You are the investigation engine.

New: AI investigation report

            // AI Investigation Report — 03:12 UTC

            Latency spike at 02:47:03 UTC → deploy cart-service@7f3a91

            N+1 query in OrderHistoryService.getRecentOrders()

            Fix: prefetch_related('items') line 47

            Confidence: HIGH · [trace] [deploy] [git blame]

You evaluate the hypothesis.

Real today: PagerDuty SRE Agent (Oct 2025), incident.io (Jul 2025), Azure AI SRE (May 2025 preview)

Use your AI bullshit filter

⟶

If the report is wrong, ask three questions:

1. Is the instrumentation right?
2. Is the agent's analysis correct?
3. Did the agent have enough context?

Shift 5 of 4 20 / 24

The new on-call skill.

Old

✕ Read the code. — Force 1: volume is exponential

✕ Find the bug. — Force 2: breadth spans stacks you don't own

✕ Know every service. — Force 3: lost intimacy, code you never wrote

Operational metrics matter more, not less: MTTR, MTBF, SLO compliance, error budget burn rate.
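
Two of these metrics reduce to simple arithmetic. The sketch below uses the common definition of burn rate as the observed error rate divided by the error rate the SLO allows; the function names and all numbers are made up for illustration.

```python
from statistics import mean
from typing import List


def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    return error_rate / (1 - slo_target)


def mttr_minutes(repair_minutes: List[float]) -> float:
    """Mean time to recovery across incidents, in minutes."""
    return mean(repair_minutes)


# Made-up numbers for illustration only.
rate = burn_rate(error_rate=0.005, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")
print(f"MTTR: {mttr_minutes([18, 42, 30]):.0f} min")
```

A burn rate above 1.0 means the budget will run out before the SLO window ends if nothing changes, which makes it a useful paging signal when you cannot read the code behind the errors.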

Things to be good at

1. Reading distributed traces
2. Recognising failure-mode patterns across stacks you didn't write
3. Validating AI hypotheses
4. Knowing when to escalate vs. when to trust the report


21 / 24

The reframe.

"You build it, you run it."

⇕

"You ship agents, you own their outcome."

Accountability hasn't moved. The keystrokes have.

Call to Action 22 / 24

Three questions before every agent-written deploy.

01

What does good behaviour look like?

Have you defined the SLOs?

02

How will I know if it's misbehaving?

Is the instrumentation in place?

03

Who bears the responsibility?

Is responsibility clear?

"If you can't answer those three, you're not ready to ship — regardless of who wrote the code."


Your job isn't writing/reading code.
It's writing/reading systems.

Ran Tavory · @rantav

Thank you.

