System Design Interview: Design Whatsapp w/ a Ex-Meta Senior Manager | COFYT

thesickboy125

9 months ago

Sources

youtube.com

About this Video

Video Title: System Design Interview: Design Whatsapp w/ a Ex-Meta Senior Manager
Channel: Hello Interview - SWE Interview Preparation
Speakers: Stefan
Duration: 00:58:12

Introduction

This video provides a detailed walkthrough of designing WhatsApp, a common system design interview question. The speaker, Stefan, a former senior manager at Meta and Amazon with extensive interview experience, guides viewers through a structured approach using a delivery framework to effectively tackle the problem.

Key Takeaways

Structured Approach: The video emphasizes a step-by-step approach to system design interviews, including defining requirements (functional and non-functional), identifying core entities, designing the API, and creating a high-level design, followed by deep dives into specific areas. This framework helps candidates manage their time and ensure comprehensive coverage.
Requirement Specification: The importance of clearly defining functional and non-functional requirements is stressed. Functional requirements outline the system's capabilities (e.g., sending messages, creating chats), while non-functional requirements specify performance characteristics (e.g., low latency, high throughput). Quantifying non-functional requirements whenever possible is recommended.
Scalability and Fault Tolerance: The video demonstrates how to design for scalability and fault tolerance. It starts with a simple, single-node design and iteratively evolves it to handle billions of users and potential failures, using techniques like load balancing, consistent hashing, and Redis Pub/Sub.
Choosing the Right Technology: The video showcases how to choose appropriate technologies based on system needs. The selection of WebSockets for real-time communication, S3 for media storage, and Redis Pub/Sub for inter-server communication is explained based on their respective strengths and limitations.
Iterative Design: The video highlights the iterative nature of system design. The design evolves as new challenges are identified and addressed, reflecting real-world system development. Proactive communication with the interviewer regarding limitations and trade-offs is encouraged.

can you provide a data models and apis for the design

Answer

The transcript doesn't provide explicit data models or APIs in a structured format like ER diagrams or OpenAPI specifications. However, it describes the essential components. I can summarize the implied data models and APIs based on the discussion:

Data Models (Implied):

Users: user_id (PK), username, ...other user attributes... (Likely a separate database or service outside the core chat system)
Chats: chat_id (PK), chat_name, creation_timestamp, ...other chat metadata...
Chat Participants: chat_id (PK), user_id (SK), join_timestamp, ...other participant metadata... (Note: The transcript suggests a composite key for efficient querying of participants within a chat and a global secondary index (GSI) on user_id for finding all chats a user participates in.)
Messages: message_id (PK), chat_id, sender_user_id, timestamp, message_content
Inbox: recipient_client_id (PK), message_id (SK), delivery_status, delivery_attempt_timestamp (This table tracks undelivered messages for offline clients. recipient_client_id combines user_id and client_id to handle multiple devices.)
Attachments: (This data is not stored directly in the main database; pre-signed URLs point to objects in S3 blob storage.)
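
The Chat Participants access pattern above (composite key plus a secondary index on user_id) can be illustrated with a minimal in-memory sketch. This is not DynamoDB code; the class and method names are hypothetical, and the two dictionaries only imitate what the base table and the GSI would answer.

```python
from collections import defaultdict


class ChatParticipantsTable:
    """In-memory stand-in for a table keyed by (chat_id, user_id).

    Hypothetical illustration only: _by_chat imitates the base table,
    _by_user imitates a secondary index on user_id.
    """

    def __init__(self):
        self._by_chat = defaultdict(dict)  # chat_id -> {user_id: item}
        self._by_user = defaultdict(set)   # user_id -> {chat_id}

    def add(self, chat_id, user_id, join_timestamp):
        self._by_chat[chat_id][user_id] = {"join_timestamp": join_timestamp}
        self._by_user[user_id].add(chat_id)

    def remove(self, chat_id, user_id):
        self._by_chat[chat_id].pop(user_id, None)
        self._by_user[user_id].discard(chat_id)

    def participants(self, chat_id):
        # "Who is in this chat?" is answered from the base table.
        return sorted(self._by_chat.get(chat_id, {}))

    def chats_for_user(self, user_id):
        # "Which chats is this user in?" is answered from the index.
        return sorted(self._by_user.get(user_id, set()))
```

Both lookups stay cheap because each one reads a single key; without the index, finding a user's chats would mean scanning every chat.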

APIs (Implied - WebSocket-based):

The APIs are described as commands sent between clients and servers via WebSockets. The transcript doesn't give precise request/response structures, but outlines the core functionalities:

Client to Server Commands:

create_chat(participants): Creates a new chat with specified participants.
send_message(chat_id, message_content): Sends a text message.
send_media(chat_id): Initiates media upload (receives pre-signed URL in response).
modify_participants(chat_id, add/remove_users): Adds or removes participants from a chat.

Server to Client Commands:

new_message(message): Notifies the client of a new message.
chat_created(chat): Notifies the client of a new chat they've joined.
participants_changed(chat, changes): Notifies the client of changes in chat participants.
presence_update(user_id, status): (Optional; depends on presence indicator implementation) Notifies of changes in user online status.

Important Notes:

Client IDs: The design needs client IDs to manage multiple devices per user. The transcript implies the inclusion of client IDs in the Inbox model and PubSub topic subscriptions.
Error Handling: The transcript doesn't describe error handling, but a production system would require robust error handling mechanisms for all API calls.
Authentication and Authorization: Security mechanisms for authentication and authorization (e.g., JWTs, API keys) are not detailed in the transcript but are critical aspects of a real-world WhatsApp-like system.
Database Choice: While DynamoDB is mentioned, the choice of database (and other services) is a design decision that would be discussed further in a real interview. The suitability of DynamoDB would depend on various factors.
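
The Client IDs note can be sketched as a per-device inbox. This is a minimal in-memory illustration, not the transcript's implementation: the "#" separator in the key, the method names, and the idea that an entry is deleted when the device acknowledges it are all assumptions.

```python
from collections import defaultdict


class Inbox:
    """Per-device store of undelivered messages (hypothetical sketch)."""

    def __init__(self):
        # "user_id#client_id" -> {message_id: message}
        self._pending = defaultdict(dict)

    @staticmethod
    def key(user_id, client_id):
        # Assumed key format combining user and device.
        return f"{user_id}#{client_id}"

    def enqueue(self, user_id, client_ids, message_id, message):
        # One entry per device, so each device drains independently.
        for client_id in client_ids:
            self._pending[self.key(user_id, client_id)][message_id] = message

    def pending(self, user_id, client_id):
        return list(self._pending[self.key(user_id, client_id)].values())

    def ack(self, user_id, client_id, message_id):
        # Assumption: a device acknowledgement removes only that device's entry.
        self._pending[self.key(user_id, client_id)].pop(message_id, None)
```

With this layout, a phone that has already received a message does not prevent the same user's laptop from receiving it later.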

To obtain a fully detailed specification, you would need to expand on this information, providing precise data types, constraints, and API request/response structures. This would typically be done using tools and notations like ER diagrams for databases and OpenAPI/Swagger for APIs.

Can you give detailed design

Answer

The transcript provides a high-level design, iteratively refining it to address scalability and other challenges. I can't create a completely detailed design from just the transcript, as that would involve making significant design choices not explicitly stated. However, I can offer a more structured outline expanding on the information provided, highlighting key design decisions and potential areas for further refinement:

I. System Architecture:

The system is client-server based, utilizing WebSockets for real-time communication and a distributed architecture to handle high scale.

Clients: Mobile and desktop applications (iOS, Android, Web, Desktop) connecting via WebSockets.
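Load Balancer (Layer 4): Distributes incoming WebSocket connections across multiple chat servers using a least-connections strategy. Because it balances at the TCP level, each long-lived connection stays pinned to the server it was assigned to, and new connections go to the least-loaded servers.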
Chat Servers (Stateful): Manage WebSocket connections, process messages, interact with databases, and communicate with each other via Redis Pub/Sub. Each server handles a subset of users determined by consistent hashing.
Chat Registry: A service (e.g., a simple key-value store like Redis or a distributed service like etcd/ZooKeeper) mapping user IDs to the responsible chat server. Clients query this registry to determine the server to connect to.
Data Storage: DynamoDB for structured data (users, chats, messages, inbox); S3 for media storage.
Redis Pub/Sub: Enables efficient inter-server communication for delivering messages to users connected to different servers.
Cleanup Service: Periodically removes old messages and inbox entries to manage storage space and comply with data retention policies.
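
The inter-server delivery path in the list above can be sketched with an in-process broker. This is a simulation rather than Redis: the PubSub class, the "user:<id>" topic naming, and using the subscriber count to detect an offline recipient are assumptions made for illustration.

```python
from collections import defaultdict


class PubSub:
    """Tiny in-process stand-in for a pub/sub broker (not Redis itself)."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        callbacks = list(self._subscribers.get(topic, []))
        for callback in callbacks:
            callback(message)
        return len(callbacks)  # number of receivers reached


class ChatServer:
    def __init__(self, name, broker):
        self.name = name
        self.broker = broker
        self.delivered = []  # (user_id, message) pairs pushed to local sockets

    def on_connect(self, user_id):
        # The server holding the user's socket subscribes to that user's topic.
        self.broker.subscribe(
            f"user:{user_id}",
            lambda msg, u=user_id: self.delivered.append((u, msg)),
        )

    def send(self, recipient_id, message):
        # Zero receivers means nobody holds a socket for the recipient;
        # a real system would fall back to the inbox in that case.
        return self.broker.publish(f"user:{recipient_id}", message)
```

The sending server never needs to know which server holds the recipient's connection; it only publishes to the recipient's topic.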

II. Data Models (Detailed):

These expand on the previous summary, including specific data types and constraints (examples only – the exact choices depend on the specific database technology):

Users:
- user_id (UUID, Primary Key)
- username (String, Unique, not null)
- email (String, Unique, not null)
- phone_number (String, Unique, nullable)
- ...other user attributes... (e.g., profile picture URL)
Chats:
- chat_id (UUID, Primary Key)
- chat_name (String, nullable)
- creation_timestamp (Timestamp)
- ...other metadata... (e.g., is_group, last_message_timestamp)
Chat Participants:
- chat_id (UUID, Partition Key)
- user_id (UUID, Sort Key)
- join_timestamp (Timestamp)
- ...other metadata... (e.g., last_read_timestamp)
- (Note: This model leverages DynamoDB's composite key for efficient query)
Messages:
- message_id (UUID, Primary Key)
- chat_id (UUID)
- sender_user_id (UUID)
- timestamp (Timestamp)
- message_content (JSON, containing text, media URLs, etc.)
Inbox:
- recipient_client_id (String, Partition Key; combines user_id and client_id)
- message_id (UUID, Sort Key)
- delivery_status (Enum: PENDING, DELIVERED, FAILED)
- delivery_attempt_timestamp (Timestamp)
Clients: (Required for multi-device support)
- client_id (UUID, Primary Key)
- user_id (UUID)
- device_type (Enum: iOS, Android, Web, Desktop)
- last_active_timestamp (Timestamp)
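
Two of the models above, written as typed records. This is a sketch under stated assumptions: the field names follow the list, but the string-encoded UUIDs, the default factories, and the "#" separator inside recipient_client_id are illustrative choices, not details from the transcript.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional
from uuid import uuid4


class DeliveryStatus(Enum):
    PENDING = "PENDING"
    DELIVERED = "DELIVERED"
    FAILED = "FAILED"


def _now() -> datetime:
    return datetime.now(timezone.utc)


def _new_id() -> str:
    return str(uuid4())


@dataclass
class Message:
    chat_id: str
    sender_user_id: str
    message_content: dict  # e.g. {"text": "..."} or a media URL
    message_id: str = field(default_factory=_new_id)
    timestamp: datetime = field(default_factory=_now)


@dataclass
class InboxEntry:
    user_id: str
    client_id: str
    message_id: str
    delivery_status: DeliveryStatus = DeliveryStatus.PENDING
    delivery_attempt_timestamp: Optional[datetime] = None

    @property
    def recipient_client_id(self) -> str:
        # Assumed encoding of the combined partition key.
        return f"{self.user_id}#{self.client_id}"
```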

III. APIs (Detailed – Conceptual):

This expands with more realistic request/response structures (JSON examples):

Client to Server (WebSocket):

create_chat:
- Request: {"command": "create_chat", "participants": ["user1", "user2"]}
- Response: {"status": "success", "chat_id": "chat123"} or error message.
send_message:
- Request: {"command": "send_message", "chat_id": "chat123", "message": "Hello!"}
- Response: {"status": "success"} or error message.
send_media:
- Request: {"command": "send_media", "chat_id": "chat123"}
- Response: {"status": "success", "presigned_url": "https://..."} or error message. (An S3 pre-signed URL is an HTTPS URL, not an s3:// URI.)

Server to Client (WebSocket):

new_message:
- Message: {"type": "new_message", "message": {...message details...}}
chat_created:
- Message: {"type": "chat_created", "chat": {...chat details...}}
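
A minimal sketch of how a chat server might parse and dispatch the client-to-server frames shown above. The handler table, the in-memory CHATS store, and the shape of the error response are assumptions; the outline only says "or error message".

```python
import json
from uuid import uuid4

CHATS = {}  # chat_id -> {"participants": [...], "messages": [...]}


def create_chat(payload):
    chat_id = str(uuid4())
    CHATS[chat_id] = {"participants": list(payload["participants"]), "messages": []}
    return {"status": "success", "chat_id": chat_id}


def send_message(payload):
    chat = CHATS.get(payload["chat_id"])
    if chat is None:
        return {"status": "error", "error": "unknown chat_id"}
    chat["messages"].append(payload["message"])
    return {"status": "success"}


HANDLERS = {"create_chat": create_chat, "send_message": send_message}


def handle_frame(raw):
    """Parse one WebSocket text frame and return the JSON response frame."""
    try:
        payload = json.loads(raw)
        handler = HANDLERS[payload["command"]]
        response = handler(payload)
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        # Malformed JSON, unknown command, or a missing field.
        response = {"status": "error", "error": f"bad request: {exc!r}"}
    return json.dumps(response)
```

Keeping the dispatch table separate from the socket loop means the same handlers can be exercised without a network connection.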

IV. Scaling Considerations:

The transcript covers consistent hashing and Redis Pub/Sub for inter-server communication. Further details would involve:

Consistent Hashing Implementation: Choosing a specific consistent hashing algorithm and implementing it within the chat registry.
Redis Pub/Sub Configuration: Determining the number of Redis nodes and appropriate topic naming conventions.
Monitoring and Alerting: Implementing monitoring and alerting to detect and respond to issues like overloaded servers or database performance bottlenecks.
Sharding Strategies: Defining how to shard data across multiple DynamoDB tables if necessary.
Caching: Implementing caching strategies (e.g., Redis) to reduce database load.
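
For the consistent hashing item, here is a minimal hash ring with virtual nodes. The server names, the use of SHA-256, and the default of 100 virtual nodes per server are assumptions for illustration; a real registry would also need to handle server removal and membership changes.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring mapping user IDs to chat servers (sketch)."""

    def __init__(self, servers, vnodes=100):
        self._ring = []  # sorted list of (hash, server)
        for server in servers:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{server}#{i}"), server))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        # First 8 bytes of SHA-256 as an integer position on the ring.
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def server_for(self, user_id):
        # Walk clockwise to the first virtual node at or after the key,
        # wrapping around at the end of the ring.
        idx = bisect.bisect_right(self._keys, self._hash(user_id)) % len(self._ring)
        return self._ring[idx][1]
```

The useful property is that adding a server only moves the users that land on the new server; everyone else keeps their existing assignment, which limits reconnect churn.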

V. Fault Tolerance:

Redundancy: Having multiple instances of each service (chat servers, chat registry, etc.).
Error Handling: Implementing robust error handling and retry mechanisms.
Circuit Breakers: Using circuit breakers to prevent cascading failures.
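
The circuit breaker item can be sketched as a small wrapper around calls to a dependency such as the database. The threshold, timeout, and class names are hypothetical defaults, and this version is single-threaded for brevity.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers fail fast instead of piling more requests onto a dependency that is already struggling.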

This more detailed outline expands on the transcript's high-level design. A truly comprehensive design document would require considerably more detail and address many more aspects (security, data backups, etc.). The choices made would depend on the specific requirements and constraints.

give the functional & non func requirements, and the capacity estimations

Answer

Based on the transcript, here's a breakdown of functional and non-functional requirements, along with capacity estimations. Remember that these estimations are based on assumptions and the information given; a real-world project would require much more detailed analysis.

I. Functional Requirements:

These describe what the system should do:

User Registration and Authentication: Users should be able to register accounts and securely log in using their credentials (phone number, email, etc.).
Chat Creation: Users should be able to create individual or group chats.
Message Sending and Receiving: Users should be able to send and receive text messages and media attachments (images, videos, audio) in real-time.
Offline Message Delivery: Messages should be stored and delivered to users even when they are offline.
Media Handling: The system should handle the storage and delivery of media attachments efficiently, using appropriate storage solutions (like S3).
Presence Indicators (Optional): The system should optionally allow users to indicate their online/offline status and possibly provide notifications about the status of contacts.
Group Management: In group chats, users should be able to add or remove participants.

II. Non-Functional Requirements:

These describe how the system should perform:

Low Latency: Messages should be delivered with low latency (the transcript suggests under 500 milliseconds).
High Throughput: The system should handle a very high volume of messages per second (billions of users sending numerous messages daily).
Guaranteed Delivery: Messages should be reliably delivered to recipients, even in case of temporary network issues.
Scalability: The system should be able to scale horizontally to handle a growing number of users and messages.
Fault Tolerance: The system should remain operational even if some components fail.
Data Consistency: Data should be consistent across the system.
Data Retention: Messages should be stored for a defined period (the transcript mentions 30 days) and then purged.
Security: User data and communications should be protected from unauthorized access.
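
The data retention requirement can be sketched as a filter that a periodic cleanup job might apply. The 30-day window comes from the requirement above; representing messages as dictionaries with a "timestamp" field and the function name are assumptions.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)


def purge_expired(messages, now=None):
    """Return only the messages still inside the retention window."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - RETENTION
    return [m for m in messages if m["timestamp"] >= cutoff]
```

A managed store may offer a built-in expiry mechanism instead, in which case a job like this would not be needed.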

III. Capacity Estimations (Assumptions and Calculations):

These are rough estimates based on assumptions and the information in the transcript:

Number of Users: Billions (as per the transcript's example); the calculations below use 1 billion.
Messages per User per Day: 100 (Assumption)
Total Messages per Day: 100 Billion (100 messages/user/day * 1 Billion users)
Message Size (Average): 1 KB (Assumption, accounting for text and small media)
Total Data per Day: 100 Terabytes (100 Billion messages * 1 KB/message)
Data Retention (Days): 30 (Transcript)
Total Storage Requirement (Estimated): About 3 petabytes (100 TB/day * 30 days of retention), before replication and any safety margin. This is a large but manageable amount for modern cloud storage.
Chat Servers: The transcript mentions WhatsApp using approximately 2 million connections per server. At that density, 1 billion simultaneous connections would need at least 500 chat servers (1 billion / 2 million). Precise numbers depend on the total user count, the number of connected devices per user, and desired redundancy.
Database Capacity: The capacity of the DynamoDB and S3 will depend heavily on message volume, media file sizes, and query patterns. Thorough performance testing would be necessary. The capacity would need to scale to accommodate the total storage and throughput requirements calculated above.
Redis Pub/Sub Capacity: The capacity of the Redis cluster will depend on the message rate for inter-server communication. This will require careful consideration of network bandwidth and Redis cluster configuration.
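
The arithmetic behind these estimates, worked through in a short script. The inputs are the assumptions listed above (1 billion users, 100 messages per user per day, 1 KB per message, 30 days of retention, 2 million connections per server); the server count additionally assumes one connection per user with every user connected at once.

```python
USERS = 1_000_000_000
MESSAGES_PER_USER_PER_DAY = 100
MESSAGE_SIZE_BYTES = 1_000          # 1 KB, decimal units
RETENTION_DAYS = 30
CONNECTIONS_PER_SERVER = 2_000_000
SECONDS_PER_DAY = 86_400

messages_per_day = USERS * MESSAGES_PER_USER_PER_DAY
avg_messages_per_second = messages_per_day / SECONDS_PER_DAY
bytes_per_day = messages_per_day * MESSAGE_SIZE_BYTES
retained_bytes = bytes_per_day * RETENTION_DAYS
min_chat_servers = -(-USERS // CONNECTIONS_PER_SERVER)  # ceiling division

print(f"{messages_per_day:,} messages/day")               # 100,000,000,000 messages/day
print(f"{avg_messages_per_second:,.0f} messages/s avg")   # 1,157,407 messages/s avg
print(f"{bytes_per_day / 1e12:.0f} TB/day")               # 100 TB/day
print(f"{retained_bytes / 1e15:.0f} PB retained")         # 3 PB retained
print(f"{min_chat_servers} chat servers minimum")         # 500 chat servers minimum
```

The per-second figure is a daily average; peak traffic would be higher, so it is a lower bound for sizing.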

Important Disclaimer: These are back-of-the-envelope calculations. Accurate capacity planning requires detailed performance testing, load modeling, and considering factors not mentioned in the transcript. The numbers presented should be treated as illustrative examples, not precise predictions.
