The Trenches
Craft Trenches · April 2026

High-Level Design

The engineering skill with the highest return on time. Six phases, one template, one full walkthrough. Steal it and ship better systems by next sprint.

Boyan Balev
10 min read
April 9, 2026
In ten seconds
An HLD is a 2 to 4 page document that answers "how will we build this, and why this way?" before code is written. Run the six phases below, fill in the seven-section template, hold a 30-minute review with two people, then build. Skipping it costs weeks. Doing it well is the highest-leverage habit on the team.
Prologue

The Wall at Fifteen People

Every engineering organization above roughly fifteen people hits the same wall. Features that should take weeks take months, not because the engineers are slow, but because nobody agreed on how to build the thing before building started. Rework, mismatched assumptions, and silent conflicts between services eat the calendar.

High-Level Design documents exist to prevent this. Done right, they compress weeks of ambiguity into a few pages of shared understanding. Done wrong, they are five pages of polished nothing.

  • 2–4 pages. If it is longer, the scope is too wide or the author is hiding uncertainty behind volume.
  • 1 question answered: "Given what we need and what we have, what is the build path?"
  • 30-minute review. Two reviewers, 24 hours of lead time, then ship.
Part I

What an HLD Is, and Why It Matters Right Now

An HLD sits between the product spec (which says what to build) and the code (which is the build itself). It is the document that names how, and it defends the choice against the obvious alternatives.

Where the HLD lives
Product Spec         HLD                   Code
─────────────        ─────────────         ─────────────
WHAT to build    →   HOW to build it   →   The build itself
Features, UX         Architecture,         Implementation
Requirements         tradeoffs, phases     details

A useful HLD is short (two to four pages), decision-forward (it names rejected options and explains why), and actionable (every engineer can answer "what do I build tomorrow?" without another meeting).

This is where AI-assisted development changes the math. The implementation layer is now cheap. The remaining moat is deciding what code should exist. The HLD is the artifact where that decision lives, and the difference in agent output is dramatic.

Two briefs, two outcomes
Without HLD → "Add LLM analysis to the pipeline"
Result: generic integration, wrong abstraction, no timeout handling, duplicates the existing NLP service.

With HLD → "AnalysisService, async via SQS, 3x retry w/ backoff, rule-based fallback, reuse NormalizationModule."
Result: focused implementation that fits the system.

Engineers who can produce HLDs direct entire systems. Engineers who cannot are left implementing other people's designs. AI has widened that gap from a career advantage to a career-defining split.

Part II

The Process: Six Phases

A sequence, not a template. Run each phase in order. Each one forces a question you cannot defer.

01 · Problem Definition
02 · Current Survey
03 · Quantify
04 · Minimize
05 · Decide Visibly
06 · Pre-mortem
Phase 01

Problem Definition

The most common failure in design is solving the wrong problem precisely. Invest here disproportionately. Four components, every time.

Problem: Checkout P95 latency is 900ms. Target: under 200ms.
Audience: End users on the purchase flow. Not internal ops tools.
Non-goals: InventoryService refactor, payment provider migration, checkout UI redesign, back-office tooling improvements.
Cost of inaction: At current latency, checkout abandonment is ~12%. Each 100ms reduction correlates with ~1.5% recovery, based on industry data.

Non-goals are the most underrated line item. They prevent scope creep before design begins. Cost of inaction forces you to justify the work itself. Some features should not be built at all, and this section catches them early.
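
A cost-of-inaction line should survive arithmetic scrutiny. A quick sanity check of the example's numbers (these are the article's illustrative figures, not measurements):

```python
# Sanity-check the cost-of-inaction estimate.
# Figures are the article's illustrative ones: 900ms current P95,
# 200ms target, ~1.5% completion recovery per 100ms saved.

current_p95_ms = 900
target_p95_ms = 200
recovery_per_100ms = 0.015

latency_saved_ms = current_p95_ms - target_p95_ms            # 700ms
estimated_recovery = (latency_saved_ms / 100) * recovery_per_100ms

print(f"Estimated completion recovery: {estimated_recovery:.1%}")  # 10.5%
```

Three lines of arithmetic, but they are what turns "checkout is slow" into a number a stakeholder can weigh against a sprint of work.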

Phase 02

Current System Survey

Document what already exists before proposing anything new. Most bad designs come from authors who skipped this step and proposed solutions that duplicate, conflict with, or break existing systems.

Current state, checkout path
PaymentService (owns checkout flow)
├── Synchronous call to InventoryService (~120ms)
├── Synchronous call to PricingService (~80ms)
├── Synchronous call to FraudCheck (~400ms)      ← bottleneck
└── Synchronous call to PaymentGateway (~200ms)

Existing retry logic: none
Existing caching: PricingService has 60s TTL

Teams consuming PaymentService events:
├── Analytics (async, Kafka)
├── Notifications (async, SQS)
└── Fulfillment (sync REST)      ← will break if API changes

The survey answers three questions: what already works (do not duplicate it), what are the real constraints (not theoretical, the things actually slow or fragile or at capacity), and who depends on this (every downstream consumer is a coordination cost).

Phase 03

Quantify Requirements

Vague requirements produce wrong architectures. Translate words into numbers. The translation is where most design quality is won or lost.

Vague requirement → quantified version → design implication:

  • "Fast" → P95 under 200ms → need to parallelize FraudCheck or make it async.
  • "Scalable" → handle 3x current load (15k RPS) within 6 months → the current single-instance FraudCheck will not scale; need horizontal scaling or a queue.
  • "Reliable" → 99.95% uptime (4.4h downtime/year) → requires redundancy on the payment path, no single points of failure.
  • "Many users" → 10M DAU, 5k RPS peak → connection pooling, read replicas, caching layer needed.

Quantification turns opinion debates into arithmetic. The 200ms target from Phase 1 reappears here as a hard constraint, and it will later eliminate caching as an option in Phase 5 (caching only helps repeat buyers, so it cannot meet the target for all users). One number, one decision closed.
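
That arithmetic can be literal. A sketch of the uptime translation, using the example's 99.95% target (an illustration, not a recommended SLO):

```python
# Translate an availability SLO into a concrete error budget.
# 99.95% is the example target from the table, not a recommendation.

slo = 0.9995
hours_per_year = 365.25 * 24            # 8766.0 hours

downtime_hours = (1 - slo) * hours_per_year
downtime_min_per_month = downtime_hours * 60 / 12

print(f"Allowed downtime: {downtime_hours:.1f} h/year "
      f"(~{downtime_min_per_month:.0f} min/month)")
```

Roughly 22 minutes a month. Seen that way, "99.95%" stops being a slogan and starts constraining how many single points of failure the payment path can tolerate.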

The architectures for 100, 10,000, and 10 million users are different systems. I once watched a team spend six weeks building a Kafka pipeline for a service handling 200 requests per day. A Postgres table with a polling job would have worked for two years. Designing for the wrong number is equally expensive in both directions.

Phase 04

Minimize Components

Start with the fewest possible moving parts. Justify each addition against a specific quantified requirement from Phase 3. The burden of proof sits on addition, not on simplicity.

  • Does a quantified requirement demand async processing?
    Yes: add a message queue, and reference the requirement by number.
    No: do not add a queue. Synchronous is simpler to debug, deploy, and monitor.
  • Is there an organizational boundary (a separate team owns this domain)?
    Yes: a separate service is justified on ownership grounds alone.
    No: can the existing service absorb this logic?
  • Does the existing service's resource profile (CPU, memory, latency budget) support the additional load?
    Yes: keep it in the existing service. One fewer deployment target.
    No: justify separation with the load numbers from Phase 3.

Every component is a new failure mode, a new deployment dependency, and a new thing to monitor at 3am.

Phase 05

Make Decisions Visible

Every HLD has two or three defining architectural choices. They need explicit side-by-side treatment, not a passing mention. Always include "do nothing" as a real option, and always name the downside of the option you chose.

Decision · How to handle the FraudCheck bottleneck

Option A · Do nothing
  Upside: no risk, no engineering work.
  Downside: checkout latency stays at 900ms; abandonment stays at ~12%.

Option B · Make FraudCheck async
  Upside: cuts 400ms from the critical path; meets the 200ms target; fastest win.
  Downside: orders proceed before the fraud verdict; needs a compensation flow for flagged orders; ~2 days of extra work.

Option C · Cache fraud verdicts
  Upside: keeps the sync flow; lower risk than full async.
  Downside: only helps repeat buyers (~30% estimated hit rate); does not meet the target for new users.
Phase 06

Design for Failure (Pre-mortem)

Assume the system has already failed. Work backwards. Try it now: spend sixty seconds on a project you shipped recently and ask "how would I guarantee this breaks?" The question activates a different mental model than "how will this work" and consistently surfaces risks optimistic planning misses.

01 · FraudCheck queue backs up during a flash sale
     Mitigation: dead-letter queue, alert at 1000 depth, automatic fallback to rule-based scoring.
02 · Payment gateway returns 500s for 10 minutes
     Mitigation: circuit breaker (open after 5 failures in 30s), retry with exponential backoff, user-facing "try again in 2 minutes" instead of silent failure.
03 · Async fraud verdict arrives after the order is shipped
     Mitigation: hold shipment for a 60s window; if no verdict, proceed and flag for manual review. Accept ~0.1% false-positive cost.
04 · Schema migration on the FraudCheck table locks writes
     Mitigation: online migration (pt-online-schema-change or pg_repack); zero-downtime deploy requirement documented.
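
Mitigation 02 is mechanical enough to sketch. A minimal circuit breaker using the example's thresholds (open after 5 failures in 30s); this is an illustration of the pattern, not production code:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures` failures
    within `window_s` seconds, rejects calls while open, then
    half-opens after `cooldown_s` to allow a single probe."""

    def __init__(self, max_failures=5, window_s=30, cooldown_s=120,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = []        # timestamps of recent failures
        self.opened_at = None

    def call(self, fn):
        now = self.clock()
        if self.opened_at is not None:
            if now - self.opened_at < self.cooldown_s:
                raise RuntimeError("circuit open: try again later")
            self.opened_at = None  # half-open: allow one probe call
        try:
            result = fn()
        except Exception:
            # keep only failures inside the sliding window, then record this one
            self.failures = [t for t in self.failures if now - t < self.window_s]
            self.failures.append(now)
            if len(self.failures) >= self.max_failures:
                self.opened_at = now
            raise
        self.failures.clear()      # success resets the failure count
        return result
```

Wrapping the gateway call in `breaker.call(...)` is what turns ten minutes of 500s into a fast, user-facing "try again" instead of a pile of hung requests.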

"The six phases are not a template to fill in mechanically. They are a sequence of questions that, taken seriously, surface problems while they are still cheap to fix."

Part III

The Template

Seven sections. Scale each to its actual complexity. Most HLDs skip the optional phases section entirely.

1. Context. One paragraph: the world today plus the problem.
2. Goals / Non-goals. What this will and will not do.
3. Design overview. A one-paragraph summary before going deep.
4. Detailed design. Components, data flow, tech choices, tradeoffs, diagrams.
5. Alternatives. Every serious option and why it was rejected.
6. Cross-cutting. Security, observability, privacy. Brief.
7. Phases. Only if rollout is complex; otherwise omit.
Part IV

The Review, Where Most HLDs Die

Writing the document is half the work. The other half is getting useful feedback without devolving into bike-shedding about naming conventions or diagram alignment.

01 · Roster
Who should review

Two or three people, maximum. One senior engineer who can validate structural decisions. One owner of a downstream dependency you touch. Optionally, one domain expert on the new technology. More than three and the feedback becomes contradictory noise.

02 · Signal
What good feedback looks like

Useful feedback does one of three things: names a failure mode you missed, challenges an assumption with data, or proposes a simpler alternative. "I would use Kafka instead of SQS" without a reason tied to your requirements is a preference statement. Acknowledge and move on.

03 · Meeting
How to run the review

Thirty minutes, not sixty. Share the doc 24 hours ahead. Open with five minutes flagging the two or three decisions you want scrutiny on. Discussion, not presentation. If anyone says "I haven't read it yet", reschedule. An uninformed review is worse than none.

04 · Exit
When to stop iterating

An HLD is a decision document, not a consensus document. If two of three reviewers are aligned and dissent is non-critical, publish, note the dissent, ship. Waiting for unanimous agreement is how docs sit in review for three weeks while the team builds nothing.

Part V

The Checkout Latency HLD

Six phases applied to the checkout problem threaded through this article. This is what a finished HLD looks like in practice. Steal the structure.

High-Level Design · Draft
Checkout Latency Reduction

Context

Our checkout flow has a P95 latency of 900ms, driven primarily by a synchronous FraudCheck call that takes 400ms on average. Checkout abandonment is approximately 12%, and industry data suggests each 100ms reduction correlates with a 1.5% recovery in completion rate. This design proposes making FraudCheck asynchronous to bring checkout P95 under 200ms.

Goals

  • Reduce checkout P95 latency to under 200ms.
  • Maintain fraud detection accuracy at current levels.
  • Ship within one sprint (two weeks) with zero downtime.

Non-goals

  • Refactoring InventoryService or PricingService.
  • Migrating payment providers.
  • Redesigning the checkout UI.
  • Improving back-office tooling latency.

Design overview

Move FraudCheck off the synchronous checkout path. Orders are submitted immediately after payment confirmation. FraudCheck runs asynchronously via SQS, with a 60-second hold window before shipment. Orders not cleared within the window proceed and are flagged for manual review.

Detailed design

Current path (sync, ~900ms total):

User → PaymentService → InventoryService (120ms) → PricingService (80ms) → FraudCheck (400ms) → PaymentGateway (200ms) → Response

Proposed path (async FraudCheck, ~500ms on the critical path):

User → PaymentService → InventoryService (120ms) → PricingService (80ms, cached)
     → PaymentGateway (200ms) → Response (order confirmed)
            ↓ (async)
           SQS → FraudCheckWorker → Verdict stored → Fulfillment hold window

Key components.

  • SQS queue for FraudCheck jobs. Standard queue, not FIFO (ordering is not required). Dead letter queue after 3 failed attempts.
  • FraudCheckWorker consumes from SQS, calls existing FraudCheck logic, writes verdict to the orders table.
  • Shipment hold in FulfillmentService: 60-second window. If verdict arrives, apply it. If not, release shipment and flag for manual review.
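
The shipment hold reduces to a small decision function. A sketch using the fraud_status values from this design (the function name and signature are illustrative, not part of the design):

```python
HOLD_WINDOW_S = 60  # from the design: hold shipment up to 60 seconds

def shipment_decision(fraud_status, seconds_since_order):
    """Decide what FulfillmentService does with a confirmed order.

    fraud_status: 'pending' | 'cleared' | 'flagged', per the
    orders-table column in this design. seconds_since_order is
    wall time since the order was confirmed.
    """
    if fraud_status == "cleared":
        return "ship"
    if fraud_status == "flagged":
        return "hold_for_fraud_team"
    # verdict still pending: wait out the hold window
    if seconds_since_order < HOLD_WINDOW_S:
        return "wait"
    # window expired with no verdict: release, but flag for review
    return "ship_and_flag_manual_review"
```

Keeping this as one pure function makes the ~0.1% false-positive tradeoff from the pre-mortem explicit and easy to unit-test.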

Tech choices.

  • SQS over Kafka because we do not need replay, ordering, or fan-out. SQS is already in our infrastructure. Kafka would add operational overhead for no benefit at this scale.
  • Verdict stored in existing orders table (new fraud_status column: pending, cleared, flagged, manual_review) rather than a separate table, to avoid cross-table consistency issues.

Alternatives considered

  • Do nothing. Checkout abandonment stays at 12%. Cost: roughly $X/month in lost conversions.
  • Cache fraud verdicts. Only helps repeat buyers (~30% hit rate). Does not meet the 200ms target for new users. Stale verdicts introduce risk with no clear expiration policy.
  • Parallelize all calls. Reduces latency but FraudCheck at 400ms still dominates. Best case ~420ms, still above target. Added complexity for insufficient gain.

Cross-cutting concerns

  • Security. Fraud verdicts are sensitive. SQS messages encrypted at rest (AWS default). Worker IAM role scoped to orders table only.
  • Observability. New CloudWatch metrics on queue depth, processing latency, and manual review rate. Alert if queue depth exceeds 1000 or manual review rate exceeds 0.5%.
  • Privacy. No new PII stored. Fraud verdicts reference order IDs, not user data directly.

Failure modes and mitigations

  • Queue backup during flash sale: fallback to synchronous rule-based scoring (degraded accuracy, maintained latency).
  • Payment gateway 500s: circuit breaker, exponential backoff, user-facing retry message.
  • Late verdict after shipment: accept ~0.1% false-positive cost, flag for post-shipment review.
  • Schema migration on orders table: pg_repack for zero-downtime column addition.
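
The flash-sale fallback in the first bullet is, at bottom, a routing check. A sketch, with the 1000-message depth taken from the design's alert threshold; `enqueue_async` and `score_sync` are hypothetical stand-ins for the SQS client and the rule-based scorer:

```python
QUEUE_DEPTH_FALLBACK = 1000  # alert threshold from the design

def route_fraud_check(queue_depth, enqueue_async, score_sync):
    """Route a fraud check: async via the queue under normal load,
    synchronous rule-based scoring when the queue is backed up.
    Returns (mode, verdict); verdict is None on the async path."""
    if queue_depth >= QUEUE_DEPTH_FALLBACK:
        # degraded accuracy, maintained latency
        return ("sync_fallback", score_sync())
    enqueue_async()
    return ("async", None)
```

The point of writing it down in the HLD is that the degraded mode is chosen deliberately, before the flash sale, not improvised during it.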

Rollout phases

  • Week 1: Deploy FraudCheckWorker and SQS queue. Run in shadow mode (async check runs but does not affect the sync path). Compare verdicts.
  • Week 2: Cut over. Route checkout through async path. Monitor queue depth and manual review rate for 48 hours before declaring stable.
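
Week 1's shadow mode boils down to a verdict diff. A sketch of the comparison step (the function name is illustrative):

```python
def shadow_mismatch_rate(paired_verdicts):
    """Compare sync and async verdicts collected during shadow mode.

    paired_verdicts: iterable of (sync_verdict, async_verdict) pairs
    for the same orders. Returns the fraction that disagree.
    """
    pairs = list(paired_verdicts)
    if not pairs:
        return 0.0
    mismatches = sum(1 for sync_v, async_v in pairs if sync_v != async_v)
    return mismatches / len(pairs)
```

A near-zero mismatch rate over Week 1 is the evidence that makes the Week 2 cutover a formality instead of a gamble.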
Part VI

Building the Skill Without a Project

No HLD to write at work right now? Three ways to build the muscle.

Approach 01
Reverse-engineer your own work
Take something you already built. Write the HLD after the fact. You will find decisions you made by accident that should have been deliberate. Judgment sharpens fastest here, because you already know how the story ends.
Approach 02
Design against open source
Pick a project you use (Postgres, Redis, Next.js). Find a feature request. Write the HLD for it across all six phases. Public issue discussion gives you a built-in review.
Approach 03
Review someone else's design
Find an RFC or ADR in any open source project and write a review. What is missing? What assumptions are unstated? What failure mode did they miss? Reviewing teaches pattern recognition faster than writing.

For published examples, the Google, Stripe, and Uber engineering blogs describe decisions in HLD style. Pay attention to how they handle "alternatives considered". It is almost always the most revealing part of any design doc.

Epilogue

The Thinking the Artifact Forces

An HLD outlives the sprint. It outlives the quarter. It becomes the institutional memory that stops the next team from repeating the same mistakes.

But the real leverage is not the artifact. It is the thinking the artifact forces: a sequence of questions that surface problems while they are still cheap to fix. Before the pull request. Before the deployment. Before the 3am page.

The highest leverage thing you can do this week is not write more code. It is write one clear document that makes the next three months of work obvious to everyone involved, get two people to poke holes in it, then build with confidence.

See you in the trenches.

About the Author

I'm Boyan Balev, a Senior Engineer with 7 years of building data systems across outsourcing, startups, and enterprise environments. Currently working on the MandateWire revamp.

I write about system design, infrastructure economics, and the pragmatic side of engineering at The Trenches.
