December 2025

Running Infrastructure at Scale

Lessons from the trenches on reliability, operations, and the work that compounds.

Running infrastructure at scale is different from building it. Building is creative. Running is disciplined. Building is about features. Running is about reliability. Building gets the glory. Running keeps the lights on.

Most infrastructure companies underinvest in operations until something breaks. Then they scramble. Then they promise to do better. Then they underinvest again. The cycle repeats until a major incident forces real change.

Operational excellence isn't a feature you ship. It's a culture you build. And it compounds over time.

What Breaks at Scale

Everything that works at small scale breaks at large scale. The question isn't if, but when and how badly.

The 10x Breaks

At 10x your current load, you'll discover:

• Database queries that were fast enough are now slow

• Connection pools that were big enough are now exhausted

• Caches that were optional are now essential

• Logs that were manageable are now overwhelming

• Manual processes that were tolerable are now impossible

These breaks are predictable. You can see them coming if you look. The database query that returns in 100ms at 1,000 requests/second won't stay at 100ms at 10,000 requests/second; as the database approaches saturation, latency doesn't degrade gracefully, it climbs a wall. The trend is easy to project. The fix is not.
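
To make the shape of that curve concrete, here's a minimal sketch using an M/M/1 queueing model: a single server, exponential arrivals and service times. The 0.9ms service time and the request rates are illustrative assumptions, not measurements from any real system.

```python
# A minimal sketch of why the same query gets slower as load grows, using an
# M/M/1 queueing model: one server, exponential arrivals and service times.
# The 0.9 ms service time and the request rates are illustrative assumptions.

def mm1_latency_ms(arrival_rate_per_s: float, service_time_ms: float) -> float:
    """Mean time in system (waiting + service) for an M/M/1 queue."""
    service_rate_per_s = 1000.0 / service_time_ms              # what the server can finish per second
    if arrival_rate_per_s >= service_rate_per_s:
        return float("inf")                                    # past saturation the queue grows without bound
    return 1000.0 / (service_rate_per_s - arrival_rate_per_s)  # W = 1 / (mu - lambda), in ms

# A database that needs 0.9 ms of work per query (~1,100 req/s of raw capacity):
for load in (500, 1000, 1100, 2000):
    print(f"{load:>5} req/s -> {mm1_latency_ms(load, 0.9):,.1f} ms")
# Latency barely moves until utilization nears 1.0, then it climbs a wall --
# which is why a system that looks fine today can be unusable at 10x.
```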

The 100x Breaks

At 100x, the architecture itself breaks:

• Single databases need to become clusters

• Monoliths need to become services

• Synchronous operations need to become async

• Single-region deployments need to become multi-region

• Manual deployments need to become automated

These aren't optimizations. They're rewrites. And they're expensive. A company that doesn't plan for 100x will spend 6-12 months rebuilding when it hits the wall.
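
To see why these are rewrites rather than tweaks, take the synchronous-to-async transition. The toy sketch below uses an in-process Python queue as a stand-in for a real broker (Kafka, SQS, or similar), with made-up function names. The important part is that the caller's contract changes from "done" to "accepted"; everything built on the old contract has to change with it.

```python
# A toy version of the synchronous-to-async rewrite. queue.Queue stands in
# for a real broker (Kafka, SQS, etc.); the function names are illustrative.
import queue
import threading
import time

work_queue: queue.Queue = queue.Queue()

def send_welcome_email(payload: dict) -> None:
    time.sleep(0.5)                      # stand-in for a slow downstream call

# Before: the caller waits for the slow work and gets a final answer.
def handle_signup_sync(payload: dict) -> dict:
    send_welcome_email(payload)
    return {"status": "done"}

# After: the caller gets an acknowledgement; the work happens later, elsewhere.
def handle_signup_async(payload: dict) -> dict:
    work_queue.put(payload)
    return {"status": "accepted"}        # the API contract itself has changed

def worker() -> None:
    while True:
        payload = work_queue.get()
        try:
            send_welcome_email(payload)  # retries and dead-lettering live here in a real system
        finally:
            work_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
```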

The 1000x Breaks

At 1000x, you're operating at a different level entirely:

• You need dedicated teams for each major system

• You need custom tooling for operations

• You need formal incident management processes

• You need capacity planning as a discipline

• You need reliability engineering as a function

Most infrastructure companies never reach 1000x. But the ones that do look completely different from where they started. The code is different. The architecture is different. The team is different. The culture is different.

The Operational Disciplines

Running infrastructure at scale requires disciplines that most startups don't have. These aren't optional at scale. They're survival.

Monitoring and Observability

You can't fix what you can't see. At scale, you need visibility into every layer of the stack.

The Observability Stack:

• Metrics: Request rates, latencies, error rates, saturation

• Logs: Structured, searchable, retained appropriately

• Traces: Request flows across services

• Alerts: Actionable, not noisy, with clear ownership

The goal isn't to collect data. It's to answer questions. When something breaks, you need to know what, where, when, and why within minutes. If your observability can't answer those questions, it's not working.

Good monitoring tells you something is wrong. Great observability tells you why.
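
As a deliberately simplified example of the metrics layer, here's a sketch that tracks request rate, error rate, and p99 latency over a sliding window in plain Python. A real system would use Prometheus or OpenTelemetry; the 60-second window and the field names are assumptions for illustration.

```python
# A sliding-window sketch of the metrics layer: request rate, error rate, and
# p99 latency, in plain Python. A real system would use Prometheus or
# OpenTelemetry; the 60-second window and field names are assumptions.
import time
from collections import deque

class GoldenSignals:
    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self.samples = deque()  # (timestamp, latency_ms, is_error)

    def record(self, latency_ms: float, is_error: bool = False) -> None:
        now = time.monotonic()
        self.samples.append((now, latency_ms, is_error))
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()          # drop samples older than the window

    def snapshot(self) -> dict:
        if not self.samples:
            return {"rps": 0.0, "error_rate": 0.0, "p99_ms": 0.0}
        latencies = sorted(latency for _, latency, _ in self.samples)
        errors = sum(1 for _, _, is_error in self.samples if is_error)
        return {
            "rps": len(self.samples) / self.window_s,
            "error_rate": errors / len(self.samples),
            "p99_ms": latencies[int(0.99 * (len(latencies) - 1))],
        }

signals = GoldenSignals()
signals.record(latency_ms=12.0)
signals.record(latency_ms=340.0, is_error=True)
print(signals.snapshot())   # alert rules would read from snapshots like this
```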

Incident Response

Incidents will happen. The question is how you respond. A mature incident response process includes:

• Clear escalation paths (who gets paged, when)

• Defined roles (incident commander, communications, technical lead)

• Communication templates (status page updates, customer notifications)

• Runbooks for common issues

• Post-incident reviews (blameless, focused on systems)

The worst time to figure out your incident response process is during an incident. Practice before you need it. Run game days. Simulate failures. Build the muscle memory.
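
One way to make escalation paths real is to write them down as data instead of tribal knowledge. Something like the sketch below, where the severity levels, team names, and timings are illustrative assumptions rather than a prescription.

```python
# A minimal sketch of a "clear escalation path" as data: who gets paged, in
# what order, and how long before escalating. Names and timings are made up.
ESCALATION_POLICY = {
    "sev1": [  # customer-facing outage
        {"page": "primary-oncall",   "escalate_after_min": 5},
        {"page": "secondary-oncall", "escalate_after_min": 10},
        {"page": "engineering-lead", "escalate_after_min": 15},
    ],
    "sev2": [  # degraded but not down
        {"page": "primary-oncall",   "escalate_after_min": 15},
        {"page": "secondary-oncall", "escalate_after_min": 30},
    ],
}

def next_page(severity: str, minutes_unacknowledged: float) -> str | None:
    """Return who should be holding the page right now, or None if the path is exhausted."""
    for step in ESCALATION_POLICY.get(severity, []):
        if minutes_unacknowledged < step["escalate_after_min"]:
            return step["page"]
    return None

print(next_page("sev1", minutes_unacknowledged=7))   # "secondary-oncall"
```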

On-Call

Someone needs to be responsible for the system 24/7. On-call is how you make that real.

Bad on-call burns out engineers. They get paged constantly for non-issues. They can't sleep. They dread their rotation. They quit.

Good on-call is sustainable. Pages are rare and meaningful. Runbooks exist for common issues. Escalation paths are clear. The on-call engineer has the authority and tools to fix problems.

On-Call Health Metrics:

• Pages per week (target: <5 during business hours, <2 overnight)

• Time to acknowledge (target: <5 minutes)

• Time to resolve (varies by severity)

• False positive rate (target: <10%)
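
These numbers are only useful if someone actually computes them. A minimal sketch, assuming a page log with fired and acknowledged timestamps and an actionable flag (the field names and sample data are made up):

```python
# A minimal sketch for computing the on-call health metrics above from a page
# log. Field names and sample data are illustrative; a real version would pull
# from the paging provider's API.
from datetime import datetime

pages = [
    {"fired": "2025-12-01T03:12:00", "acked": "2025-12-01T03:15:00", "actionable": True},
    {"fired": "2025-12-02T14:02:00", "acked": "2025-12-02T14:04:00", "actionable": False},
    {"fired": "2025-12-04T22:40:00", "acked": "2025-12-04T22:47:00", "actionable": True},
]

def oncall_health(pages: list[dict], weeks: float) -> dict:
    ack_minutes = sorted(
        (datetime.fromisoformat(p["acked"]) - datetime.fromisoformat(p["fired"])).total_seconds() / 60
        for p in pages
    )
    false_positives = sum(1 for p in pages if not p["actionable"])
    return {
        "pages_per_week": len(pages) / weeks,
        "median_ack_min": ack_minutes[len(ack_minutes) // 2],
        "false_positive_rate": false_positives / len(pages),
    }

print(oncall_health(pages, weeks=1.0))
# Compare against the targets above: <5 pages/week, <5 min to acknowledge, <10% false positives.
```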

Capacity Planning

Running out of capacity is a self-inflicted wound. You know your growth rate. You know your resource consumption. You can predict when you'll hit limits.

Capacity planning means:

• Tracking resource utilization over time

• Modeling growth scenarios

• Identifying bottlenecks before they bite

• Planning infrastructure changes with lead time

• Budgeting for capacity investments

The companies that do this well never have capacity emergencies. The companies that don't are always firefighting.
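
The modeling doesn't need to be sophisticated to be useful. A minimal sketch, assuming compound month-over-month growth and a single utilization threshold, with illustrative numbers:

```python
# A minimal sketch of the "modeling growth scenarios" step: given current
# utilization and a compound monthly growth rate, estimate how many months of
# headroom remain before a threshold is hit. The numbers are illustrative.
import math

def months_until(current_util: float, monthly_growth: float, threshold: float = 0.8) -> float:
    """Months until current_util * (1 + monthly_growth)^m reaches the threshold."""
    if current_util >= threshold:
        return 0.0
    return math.log(threshold / current_util) / math.log(1.0 + monthly_growth)

# Storage at 45% utilization, growing 12% month over month, alerting at 80%:
print(f"{months_until(0.45, 0.12):.1f} months of headroom")   # ~5.1 months
# If provisioning new capacity takes 8 weeks of lead time, the work has to
# start now, not when the dashboard turns red.
```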

Change Management

Most outages are caused by changes. A deployment. A configuration update. A database migration. A network change. Something changed, and something broke.

Good change management reduces this risk:

• All changes go through a defined process

• Changes are reviewed before deployment

• Changes are deployed incrementally (canary, blue-green)

• Changes can be rolled back quickly

• Changes are tracked and auditable

This feels like bureaucracy when you're small. It's essential when you're large. The cost of a bad change at scale is measured in customer trust and revenue.
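
Incremental rollout only reduces risk if there's an explicit gate between stages. Here's a minimal sketch of a canary check that compares error rates between the canary and the baseline; the thresholds are assumptions, and a production gate would also watch latency and use a proper statistical test.

```python
# A minimal sketch of a canary gate: compare the canary's error rate against
# the baseline and decide whether to wait, roll back, or widen the rollout.
# The thresholds are illustrative assumptions, not recommended values.
def canary_verdict(baseline_errors: int, baseline_total: int,
                   canary_errors: int, canary_total: int,
                   max_ratio: float = 2.0, min_requests: int = 500) -> str:
    if canary_total < min_requests:
        return "wait"                       # not enough canary traffic to judge yet
    baseline_rate = baseline_errors / max(baseline_total, 1)
    canary_rate = canary_errors / max(canary_total, 1)
    if canary_rate > max(baseline_rate * max_ratio, 0.001):
        return "rollback"                   # canary is meaningfully worse than baseline
    return "promote"                        # safe to widen the rollout

print(canary_verdict(baseline_errors=40, baseline_total=100_000,
                     canary_errors=12, canary_total=2_000))    # "rollback"
```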

The Technical Debt Tax

Technical debt accumulates silently. Every shortcut, every "we'll fix it later," every feature shipped without tests adds to the balance. At scale, this debt comes due.

Technical debt is like financial debt. A little is fine. A lot is dangerous. And the interest compounds.

Signs your technical debt is becoming critical:

• Deployments are scary (and getting scarier)

• Simple changes take weeks

• New engineers take months to become productive

• The same bugs keep coming back

• Nobody understands how certain systems work

• Performance is degrading without clear cause

Paying down technical debt requires dedicated investment. Not "we'll do it when we have time" (you never will). Dedicated sprints. Dedicated engineers. Dedicated budget. The companies that do this stay healthy. The companies that don't eventually collapse under the weight.

The Culture of Reliability

Operational excellence isn't a checklist. It's a culture. And culture comes from the top.

In a reliability culture:

• Reliability is a feature, not an afterthought

• Incidents are learning opportunities, not blame games

• On-call is respected and compensated

• Technical debt is tracked and prioritized

• Operational work is valued as much as feature work

• Everyone understands the cost of downtime

Building this culture takes years. Destroying it takes one bad incident response, one blamed engineer, one ignored warning.

The Compounding Effect

Operational investments compound. Good monitoring makes incidents shorter. Shorter incidents mean less customer impact. Less customer impact means better retention. Better retention means more revenue to invest in operations.

The reverse is also true. Poor operations lead to longer incidents, more customer churn, less revenue, and less investment in operations. It's a death spiral.

The best infrastructure companies treat operations as a competitive advantage, not a cost center.

We've seen this pattern across dozens of infrastructure companies. The ones that invest early in operational excellence grow faster, retain customers better, and build more durable businesses. The ones that don't eventually hit a wall they can't climb.

What We Look For

When we evaluate infrastructure companies, we look at operations as closely as we look at product. Questions we ask:

• What's your uptime over the last 12 months?

• How many incidents did you have? What caused them?

• What's your on-call rotation like?

• Show me your monitoring dashboards

• Walk me through your last major incident

• How do you handle technical debt?

• What breaks if you 10x tomorrow?

The answers tell us whether the company is built to scale or built to break.

If you're running infrastructure and want to talk about operational challenges, reach out. We've been in the trenches. We know what works.

Jarred Taylor

Capital at the inflection.