Design a system that scales to millions of users on AWS

Note: This document links directly to relevant areas found in the system design topics to avoid duplication. Refer to the linked content for general talking points, tradeoffs, and alternatives.

Step 1: Outline use cases and constraints

Gather requirements and scope the problem. Ask questions to clarify use cases and constraints. Discuss assumptions.

Without an interviewer to address clarifying questions, we’ll define some use cases and constraints.

Use cases

Solving this problem takes an iterative approach: 1) benchmark/load test, 2) profile for bottlenecks, 3) address bottlenecks while evaluating alternatives and trade-offs, and 4) repeat. This is a good pattern for evolving basic designs into scalable designs.
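The benchmarking step of this loop can be sketched in a few lines. This is a minimal, single-threaded illustration only: `load_test`, `summarize_latencies`, and the `handler` callable are hypothetical names, and a real load test would drive the deployed system with a dedicated tool rather than call a function in-process.

```python
import statistics
import time


def summarize_latencies(latencies):
    """Return p50, p95, and max for a list of at least two latency samples."""
    # n=100 yields 99 cut points; index 49 is the median, index 94 is p95.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "max": max(latencies)}


def load_test(handler, requests):
    """Call handler once per request and summarize the observed latencies."""
    latencies = []
    for request in requests:
        start = time.perf_counter()
        handler(request)  # hypothetical stand-in for a real request to the system
        latencies.append(time.perf_counter() - start)
    return summarize_latencies(latencies)


if __name__ == "__main__":
    # Time a trivial in-process "request" 1,000 times.
    print(load_test(lambda request: sum(range(10_000)), range(1_000)))
```

Tracking the tail (p95, max) alongside the median helps show whether a change in step 3 actually removed the bottleneck or only moved the average.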

Unless you have a background in AWS or are applying for a position that requires AWS knowledge, AWS-specific details are not a requirement. However, many of the principles discussed in this exercise apply more generally outside of the AWS ecosystem.

We’ll scope the problem to handle only the following use cases

Constraints and assumptions

State assumptions

Calculate usage

Clarify with your interviewer if you should run back-of-the-envelope usage calculations.

Handy conversion guide:
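As a minimal sketch of this kind of conversion, the snippet below turns monthly request counts into average per-second rates and back. It assumes a 30-day month; the function names are illustrative.

```python
# 30 days * 24 hours * 60 minutes * 60 seconds = 2,592,000 seconds,
# often rounded to 2.5 million in back-of-the-envelope estimates.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60


def per_month_to_per_second(requests_per_month):
    """Convert a monthly request count to an average requests-per-second rate."""
    return requests_per_month / SECONDS_PER_MONTH


def per_second_to_per_month(requests_per_second):
    """Convert an average requests-per-second rate to a monthly request count."""
    return requests_per_second * SECONDS_PER_MONTH


if __name__ == "__main__":
    for monthly in (100_000_000, 1_000_000_000):
        rate = per_month_to_per_second(monthly)
        print(f"{monthly:,} requests/month is about {rate:.0f} requests/second")
```

These are averages; peak traffic can be several times higher, so it is worth stating the peak factor you assume.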

Step 2: Create a high level design

Outline a high level design with all important components.

Step 3: Design core components

Dive into details for each core component.

Use case: User makes a read or write request

Goals

Start with a single box

Use Vertical Scaling:

Trade-offs, alternatives, and additional details:

Start with SQL, consider NoSQL

The constraints assume there is a need for relational data. We can start off using a MySQL Database on the single box.

Trade-offs, alternatives, and additional details:

Assign a public static IP

Use a DNS

Add a DNS service such as Route 53 to map the domain to the instance’s public IP.
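To make the mapping concrete, the sketch below builds the kind of record change that points a domain at an IPv4 address. The helper name, the domain `example.com.`, the address `203.0.113.10`, and the 300-second TTL are all assumptions for illustration; the function only constructs a dictionary and does not call AWS.

```python
import ipaddress


def build_a_record_change(domain, public_ip, ttl=300):
    """Build a change batch that upserts an A record pointing domain at public_ip."""
    ipaddress.IPv4Address(public_ip)  # raises ValueError on a malformed IPv4 address
    return {
        "Comment": f"Point {domain} at {public_ip}",
        "Changes": [
            {
                "Action": "UPSERT",  # create the record, or update it if it exists
                "ResourceRecordSet": {
                    "Name": domain,
                    "Type": "A",  # A record: hostname -> IPv4 address
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": public_ip}],
                },
            }
        ],
    }


if __name__ == "__main__":
    print(build_a_record_change("example.com.", "203.0.113.10"))
```

With boto3, a dictionary of this shape could be passed as the `ChangeBatch` argument of the Route 53 client's `change_resource_record_sets` call, together with the `HostedZoneId` of your zone. A short TTL makes it quicker to repoint the domain later, at the cost of more DNS lookups.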

Trade-offs, alternatives, and additional details:

Secure the web server

Trade-offs, alternatives, and additional details:

Step 4: Scale the design

Identify and address bottlenecks, given the constraints.

Users+

Assumptions

Our user count is starting to pick up and the load is increasing on our single box. Our Benchmarks/Load Tests and Profiling are pointing to the MySQL Database taking up more and more memory and CPU resources, while the user content is filling up disk space.

We’ve been able to address these issues with Vertical Scaling so far. Unfortunately, this has become quite expensive and it doesn’t allow for independent scaling of the MySQL Database and Web Server.

Goals

Store static content separately

Move the MySQL database to a separate box

Secure the system

Trade-offs, alternatives, and additional details:

Users++

Assumptions

Our Benchmarks/Load Tests and Profiling show that our single Web Server bottlenecks during peak hours, resulting in slow responses and in some cases, downtime. As the service matures, we’d also like to move towards higher availability and redundancy.
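One common way to remove a single-server bottleneck is to spread requests across several web servers and skip any that are down. The sketch below shows round-robin selection with a simple health flag. It is an illustration of the idea, not how a managed load balancer is implemented, and the class and server names are hypothetical.

```python
import itertools


class RoundRobinBalancer:
    """Hand out servers in rotation, skipping any that are marked down."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._healthy = set(self._servers)
        self._cycle = itertools.cycle(self._servers)

    def mark_down(self, server):
        self._healthy.discard(server)

    def mark_up(self, server):
        if server in self._servers:
            self._healthy.add(server)

    def next_server(self):
        # Look at each server at most once per call so we never loop forever.
        for _ in range(len(self._servers)):
            candidate = next(self._cycle)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])
    balancer.mark_down("web-2")
    print([balancer.next_server() for _ in range(4)])
```

Because any healthy server can take any request, this only works if the web servers are stateless, for example by keeping session data in a shared store.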

Goals

Trade-offs, alternatives, and additional details:

Users+++

Note: Internal Load Balancers not shown to reduce clutter

Assumptions

Our Benchmarks/Load Tests and Profiling show that we are read-heavy (a 100:1 read-to-write ratio) and our database is performing poorly under the high volume of read requests.

Goals

Trade-offs, alternatives, and additional details:

Add MySQL read replicas
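The idea behind read replicas is that writes go to the master while reads are spread across the replicas. A minimal routing sketch follows; the class name and node names are hypothetical, and the `SELECT` prefix check is a deliberate simplification of how a real application or proxy would classify statements.

```python
import itertools


class ReplicatedDatabase:
    """Route reads to replicas in rotation and everything else to the master."""

    def __init__(self, master, replicas):
        self.master = master
        # With no replicas configured, all traffic falls back to the master.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql):
        """Return the name of the node that should run this statement."""
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.master


if __name__ == "__main__":
    db = ReplicatedDatabase("master", ["replica-1", "replica-2"])
    for statement in ("SELECT * FROM users", "INSERT INTO users VALUES (1)"):
        print(statement, "->", db.route(statement))
```

Replication is typically asynchronous, so a read sent to a replica immediately after a write may return stale data. Reads that must see the latest write can be routed to the master instead.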

Trade-offs, alternatives, and additional details:

Users++++

Assumptions

Our Benchmarks/Load Tests and Profiling show that our traffic spikes during regular business hours in the U.S. and drops significantly when users leave the office. We think we can cut costs by automatically spinning servers up and down based on actual load. We’re a small shop, so we’d like to automate as much of the DevOps as possible, both for Autoscaling and for general operations.

Goals

Add autoscaling
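The core of a scaling decision can be sketched as a proportional rule: size the fleet so that the observed metric would land back on its target, then clamp to the group's limits. This is a simplified illustration under assumed inputs (average CPU utilization as the metric, hypothetical parameter names), not the algorithm AWS uses.

```python
import math


def desired_capacity(current_instances, current_cpu, target_cpu, min_size, max_size):
    """Return how many instances would bring average CPU back to target_cpu."""
    if target_cpu <= 0:
        raise ValueError("target_cpu must be positive")
    # Total load is roughly current_instances * current_cpu; divide by the target
    # per-instance load and round up so we never under-provision.
    raw = math.ceil(current_instances * current_cpu / target_cpu)
    # Stay within the group's configured bounds.
    return max(min_size, min(max_size, raw))


if __name__ == "__main__":
    # Business hours: 4 instances running hot against a 60% target.
    print(desired_capacity(4, 90.0, 60.0, min_size=2, max_size=10))
    # After hours: the same fleet mostly idle.
    print(desired_capacity(4, 15.0, 60.0, min_size=2, max_size=10))
```

In practice a cooldown period between adjustments helps avoid flapping when the metric hovers near the target.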

Users+++++

Note: Autoscaling groups not shown to reduce clutter

Assumptions

As the service continues to grow towards the figures outlined in the constraints, we iteratively run Benchmarks/Load Tests and Profiling to uncover and address new bottlenecks.

Goals

We’ll continue to address scaling issues due to the problem’s constraints:

SQL scaling patterns include:

To further address the high volume of read and write requests, we should also consider moving appropriate data to a NoSQL Database such as DynamoDB.

We can further separate out our Application Servers to allow for independent scaling. Batch processes or computations that do not need to be done in real-time can be done Asynchronously with Queues and Workers:
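A minimal sketch of the queue-and-worker pattern, using Python's standard library as an in-process stand-in for a managed queue and a separate worker fleet. The `run_workers` helper and the thumbnail job in the usage example are hypothetical.

```python
import queue
import threading


def run_workers(jobs, handler, num_workers=2):
    """Process jobs on background worker threads; return (job, outcome) pairs."""
    job_queue = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            job = job_queue.get()
            if job is None:  # sentinel: no more work for this worker
                job_queue.task_done()
                break
            try:
                outcome = handler(job)
            except Exception as exc:  # keep the worker alive; report the failure
                outcome = exc
            with results_lock:
                results.append((job, outcome))
            job_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for job in jobs:  # the producer returns as soon as the job is enqueued
        job_queue.put(job)
    for _ in threads:  # one sentinel per worker
        job_queue.put(None)
    job_queue.join()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    # Hypothetical job: derive a thumbnail name for each uploaded photo.
    print(run_workers(["a.jpg", "b.jpg", "c.jpg"], lambda name: "thumb_" + name))
```

The point of the pattern is that the request path only enqueues work, so slow processing does not hold up the user, and workers can be scaled independently of the servers that accept requests.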

Trade-offs, alternatives, and additional details:

Additional talking points

Additional topics to dive into, depending on the problem scope and time remaining.

SQL scaling patterns

NoSQL

Caching

Asynchronism and microservices

Communications

Security

Refer to the security section.

Latency numbers

See Latency numbers every programmer should know.

Ongoing