This video provides a detailed walkthrough of designing WhatsApp, a common system design interview question. The speaker, Stefan, a former senior manager at Meta and Amazon with extensive interview experience, guides viewers through a structured approach using a delivery framework to effectively tackle the problem.
Structured Approach: The video emphasizes a step-by-step approach to system design interviews, including defining requirements (functional and non-functional), identifying core entities, designing the API, and creating a high-level design, followed by deep dives into specific areas. This framework helps candidates manage their time and ensure comprehensive coverage.
Requirement Specification: The importance of clearly defining functional and non-functional requirements is stressed. Functional requirements outline the system's capabilities (e.g., sending messages, creating chats), while non-functional requirements specify performance characteristics (e.g., low latency, high throughput). Quantifying non-functional requirements whenever possible is recommended.
Scalability and Fault Tolerance: The video demonstrates how to design for scalability and fault tolerance. It starts with a simple, single-node design and iteratively evolves it to handle billions of users and potential failures, using techniques like load balancing, consistent hashing, and Redis Pub/Sub.
Choosing the Right Technology: The video showcases how to choose appropriate technologies based on system needs. The selection of WebSockets for real-time communication, S3 for media storage, and Redis Pub/Sub for inter-server communication is explained based on their respective strengths and limitations.
Iterative Design: The video highlights the iterative nature of system design. The design evolves as new challenges are identified and addressed, reflecting real-world system development. Proactive communication with the interviewer regarding limitations and trade-offs is encouraged.
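The consistent hashing mentioned under Scalability and Fault Tolerance can be sketched as follows. This is a minimal illustration under stated assumptions, not the video's implementation: the server names, the MD5 hash, and the 100 virtual nodes per server are all hypothetical choices.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, servers, vnodes=100):
        self.vnodes = vnodes          # assumed value, tune for balance
        self._ring = []               # sorted list of (hash, server)
        for server in servers:
            self.add(server)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, server):
        # Each server owns many points on the ring to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{server}#{i}"), server))

    def remove(self, server):
        self._ring = [(h, s) for h, s in self._ring if s != server]

    def lookup(self, key):
        """Return the first server clockwise from the key's position."""
        idx = bisect.bisect(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


# Hypothetical chat servers and users.
ring = HashRing(["chat-1", "chat-2", "chat-3"])
before = {f"user{i}": ring.lookup(f"user{i}") for i in range(1000)}
ring.remove("chat-3")
after = {user: ring.lookup(user) for user in before}
moved = [user for user in before if before[user] != after[user]]
```

In this sketch, removing `chat-3` only reassigns the users that were mapped to it; users on the other servers keep their assignment, which is the property that makes the technique attractive when chat servers join or fail.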
The transcript doesn't provide explicit data models or APIs in a structured format like ER diagrams or OpenAPI specifications. However, it describes the essential components. I can summarize the implied data models and APIs based on the discussion:
Data Models (Implied):
Users: user_id (PK), username, ...other user attributes... (Likely a separate database or service outside the core chat system)
Chats: chat_id (PK), chat_name, creation_timestamp, ...other chat metadata...
Chat Participants: chat_id (PK), user_id (SK), join_timestamp, ...other participant metadata... (Note: The transcript suggests a composite key for efficient querying of participants within a chat and a global secondary index (GSI) on user_id for finding all chats a user participates in.)
Messages: message_id (PK), chat_id, sender_user_id, timestamp, message_content
Inbox: recipient_client_id (PK), message_id (SK), delivery_status, delivery_attempt_timestamp (This table tracks undelivered messages for offline clients. recipient_client_id combines user_id and client_id to handle multiple devices.)
Attachments: (This data is not stored directly in the main database; pre-signed URLs point to objects in S3 blob storage.)
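To make the Inbox model more concrete, here is a minimal in-memory sketch of how per-device inboxes might behave. The class and function names, the "write to the inbox first, delete on acknowledgment" flow, and the example users and devices are assumptions for illustration; a real system would back this with a database table rather than a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Inbox:
    """In-memory stand-in for the Inbox table, keyed by (user_id, client_id)."""
    _pending: dict = field(default_factory=dict)

    def enqueue(self, user_id, client_id, message_id):
        self._pending.setdefault((user_id, client_id), []).append(message_id)

    def pending(self, user_id, client_id):
        # What a device should receive when it reconnects.
        return list(self._pending.get((user_id, client_id), []))

    def ack(self, user_id, client_id, message_id):
        # Remove the entry only once the device confirms receipt.
        queue = self._pending.get((user_id, client_id), [])
        if message_id in queue:
            queue.remove(message_id)


def deliver(inbox, clients_by_user, online, recipient, message_id):
    """Record the message for every device, then push to the online ones."""
    pushed = []
    for client_id in clients_by_user[recipient]:
        inbox.enqueue(recipient, client_id, message_id)
        if (recipient, client_id) in online:
            pushed.append(client_id)  # stands in for a WebSocket send
    return pushed


# Hypothetical user with two devices; only the phone is connected.
inbox = Inbox()
clients = {"bob": ["phone", "laptop"]}
pushed = deliver(inbox, clients, {("bob", "phone")}, "bob", "m1")
inbox.ack("bob", "phone", "m1")  # the phone confirms receipt
```

After this runs, the phone's inbox is empty while the laptop still has `m1` pending, so the laptop can fetch it when it next connects.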
APIs (Implied - WebSocket-based):
The APIs are described as commands sent between clients and servers via WebSockets. The transcript doesn't give precise request/response structures, but outlines the core functionalities:
Client to Server Commands:
create_chat(participants): Creates a new chat with specified participants.
send_message(chat_id, message_content): Sends a text message.
send_media(chat_id): Initiates media upload (receives pre-signed URL in response).
modify_participants(chat_id, add/remove_users): Adds or removes participants from a chat.
Server to Client Commands:
new_message(message): Notifies the client of a new message.
chat_created(chat): Notifies the client of a new chat they've joined.
participants_changed(chat, changes): Notifies the client of changes in chat participants.
presence_update(user_id, status): (Optional; depends on presence indicator implementation) Notifies of changes in user online status.
Important Notes:
Client IDs: The design needs client IDs to manage multiple devices per user. The transcript implies the inclusion of client IDs in the Inbox model and PubSub topic subscriptions.
Error Handling: The transcript doesn't describe error handling, but a production system would require robust error handling mechanisms for all API calls.
Authentication and Authorization: Security mechanisms for authentication and authorization (e.g., JWTs, API keys) are not detailed in the transcript but are critical aspects of a real-world WhatsApp-like system.
Database Choice: While DynamoDB is mentioned, the choice of database (and other services) is a design decision that would be discussed further in a real interview. The suitability of DynamoDB would depend on various factors.
To obtain a fully detailed specification, you would need to expand on this information, providing precise data types, constraints, and API request/response structures. This would typically be done using tools and notations like ER diagrams for databases and OpenAPI/Swagger for APIs.
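As one illustration of what pinning down request structures could look like, the sketch below validates incoming client-to-server commands. The command names come from the list above; the JSON framing, the required-field sets, and the error payload shape are assumptions, since the transcript does not specify them.

```python
import json

# Required fields per command. The command names are from the summary;
# the field sets are assumptions made for this sketch.
REQUIRED_FIELDS = {
    "create_chat": {"participants"},
    "send_message": {"chat_id", "message_content"},
    "send_media": {"chat_id"},
    "modify_participants": {"chat_id"},
}


def parse_command(raw):
    """Validate one WebSocket text frame.

    Returns (command, frame) on success or (None, error_payload) on failure.
    """
    try:
        frame = json.loads(raw)
    except json.JSONDecodeError:
        return None, {"status": "error", "reason": "invalid_json"}
    if not isinstance(frame, dict):
        return None, {"status": "error", "reason": "invalid_frame"}
    command = frame.get("command")
    if command not in REQUIRED_FIELDS:
        return None, {"status": "error", "reason": "unknown_command"}
    missing = REQUIRED_FIELDS[command] - set(frame)
    if missing:
        return None, {
            "status": "error",
            "reason": "missing_fields",
            "fields": sorted(missing),
        }
    return command, frame


ok = parse_command('{"command": "send_media", "chat_id": "chat123"}')
bad = parse_command('{"command": "send_message", "chat_id": "chat123"}')
unknown = parse_command('{"command": "delete_everything"}')
```

Returning a structured error instead of raising keeps the connection handler simple: it can send the second element straight back to the client.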
The transcript provides a high-level design, iteratively refining it to address scalability and other challenges. I can't create a completely detailed design from just the transcript, as that would involve making significant design choices not explicitly stated. However, I can offer a more structured outline expanding on the information provided, highlighting key design decisions and potential areas for further refinement:
I. System Architecture:
The system is client-server based, utilizing WebSockets for real-time communication and a distributed architecture to handle high scale.
II. Data Models (Detailed):
These expand on the previous summary, including specific data types and constraints (examples only – the exact choices depend on the specific database technology):
Users:
user_id (UUID, Primary Key)
username (String, Unique, not null)
email (String, Unique, not null)
phone_number (String, Unique, nullable)
...other user attributes... (e.g., profile picture URL)
Chats:
chat_id (UUID, Primary Key)
chat_name (String, nullable)
creation_timestamp (Timestamp)
...other metadata... (e.g., is_group, last_message_timestamp)
Chat Participants:
chat_id (UUID, Partition Key)
user_id (UUID, Sort Key)
join_timestamp (Timestamp)
...other metadata... (e.g., last_read_timestamp)
Messages:
message_id (UUID, Primary Key)
chat_id (UUID)
sender_user_id (UUID)
timestamp (Timestamp)
message_content (JSON, containing text, media URLs, etc.)
Inbox:
recipient_client_id (Composite Key: user_id, client_id)
message_id (UUID, Sort Key)
delivery_status (Enum: PENDING, DELIVERED, FAILED)
delivery_attempt_timestamp (Timestamp)
Clients: (Required for multi-device support)
client_id (UUID, Primary Key)
user_id (UUID)
device_type (Enum: iOS, Android, Web, Desktop)
last_active_timestamp (Timestamp)
III. APIs (Detailed – Conceptual):
This expands with more realistic request/response structures (JSON examples):
Client to Server (WebSocket):
create_chat:
Request: {"command": "create_chat", "participants": ["user1", "user2"]}
Response: {"status": "success", "chat_id": "chat123"} or an error message.
send_message:
Request: {"command": "send_message", "chat_id": "chat123", "message": "Hello!"}
Response: {"status": "success"} or an error message.
send_media:
Request: {"command": "send_media", "chat_id": "chat123"}
Response: {"status": "success", "presigned_url": "https://..."} or an error message. (A pre-signed S3 URL is an HTTPS URL, not an s3:// URI.)
Server to Client (WebSocket):
new_message:
{"type": "new_message", "message": {...message details...}}
chat_created:
{"type": "chat_created", "chat": {...chat details...}}
IV. Scaling Considerations:
The transcript covers consistent hashing and Redis Pub/Sub for inter-server communication. Further details would involve:
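Inter-server routing over Pub/Sub can be sketched with an in-memory stand-in for Redis, so the example runs without a Redis server. The `user:<id>` topic naming, the class names, and the per-user (rather than per-client) topics are assumptions for illustration; the one behavior borrowed from Redis is that publishing is fire-and-forget and reports how many subscribers received the message.

```python
from collections import defaultdict


class InMemoryPubSub:
    """Stand-in for Redis Pub/Sub: fire-and-forget, nothing is stored."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, payload):
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(payload)
        return len(callbacks)  # number of subscribers reached


class ChatServer:
    """One chat server holding WebSocket connections for some users."""

    def __init__(self, name, bus):
        self.name = name
        self.bus = bus
        self.delivered = []  # (user_id, payload) pushed to local sockets

    def on_connect(self, user_id):
        # Subscribe to the user's topic while their socket is open here.
        def push(payload):
            self.delivered.append((user_id, payload))

        self.bus.subscribe(f"user:{user_id}", push)

    def send(self, recipient_ids, payload):
        # Publish once per recipient; whichever server holds the socket gets it.
        return {uid: self.bus.publish(f"user:{uid}", payload) for uid in recipient_ids}


bus = InMemoryPubSub()
server_a, server_b = ChatServer("a", bus), ChatServer("b", bus)
server_b.on_connect("bob")               # bob's socket lives on server b
receivers = server_a.send(["bob", "carol"], "hi")
```

Here `carol` has no subscriber, so her count is zero. Because nothing is persisted by the bus, a message like that would have to be recovered from durable storage such as the Inbox table when she reconnects.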
V. Fault Tolerance:
This more detailed outline expands on the transcript's high-level design. A truly comprehensive design document would require considerably more detail and address many more aspects (security, data backups, etc.). The choices made would depend on the specific requirements and constraints.
Based on the transcript, here's a breakdown of functional and non-functional requirements, along with capacity estimations. Remember that these estimations are based on assumptions and the information given; a real-world project would require much more detailed analysis.
I. Functional Requirements:
These describe what the system should do:
User Registration and Authentication: Users should be able to register accounts and securely log in using their credentials (phone number, email, etc.).
Chat Creation: Users should be able to create individual or group chats.
Message Sending and Receiving: Users should be able to send and receive text messages and media attachments (images, videos, audio) in real-time.
Offline Message Delivery: Messages should be stored and delivered to users even when they are offline.
Media Handling: The system should handle the storage and delivery of media attachments efficiently, using appropriate storage solutions (like S3).
Presence Indicators (Optional): The system may show users' online/offline status and notify users when a contact's status changes.
Group Management: In group chats, users should be able to add or remove participants.
II. Non-Functional Requirements:
These describe how the system should perform:
Low Latency: Messages should be delivered with low latency (the transcript suggests under 500 milliseconds).
High Throughput: The system should handle a very high volume of messages per second (billions of users sending numerous messages daily).
Guaranteed Delivery: Messages should be reliably delivered to recipients, even in case of temporary network issues.
Scalability: The system should be able to scale horizontally to handle a growing number of users and messages.
Fault Tolerance: The system should remain operational even if some components fail.
Data Consistency: Data should be consistent across the system.
Data Retention: Messages should be stored for a defined period (the transcript mentions 30 days) and then purged.
Security: User data and communications should be protected from unauthorized access.
III. Capacity Estimations (Assumptions and Calculations):
These are rough estimates based on assumptions and the information in the transcript:
Number of Users: Billions (as per the transcript's example).
Messages per User per Day: 100 (Assumption)
Total Messages per Day: 100 Billion (100 messages/user/day * 1 Billion users)
Message Size (Average): 1 KB (Assumption, accounting for text and small media)
Total Data per Day: 100 Terabytes (100 Billion messages * 1 KB/message)
Data Retention (Days): 30 (Transcript)
Total Storage Requirement (Estimated): About 3 petabytes (100 TB/day * 30 days of retention), before replication and any safety margin. This is a large but manageable amount for modern cloud storage.
Chat Servers: The transcript mentions WhatsApp handling approximately 2 million connections per server. At that density, 1 billion concurrent connections need about 500 chat servers, so serving billions of users puts the fleet in the hundreds to low thousands of servers. Precise numbers depend on the number of connected devices per user and the desired redundancy.
Database Capacity: The capacity required of DynamoDB and S3 will depend heavily on message volume, media file sizes, and query patterns. Thorough performance testing would be necessary. The capacity would need to scale to accommodate the total storage and throughput requirements calculated above.
Redis Pub/Sub Capacity: The capacity of the Redis cluster will depend on the message rate for inter-server communication. This will require careful consideration of network bandwidth and Redis cluster configuration.
Important Disclaimer: These are back-of-the-envelope calculations. Accurate capacity planning requires detailed performance testing, load modeling, and considering factors not mentioned in the transcript. The numbers presented should be treated as illustrative examples, not precise predictions.
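The arithmetic behind these estimates can be checked with a few lines of code. The inputs are the assumptions stated in the list (1 billion users, 100 messages per user per day, 1 KB per message, 30 days of retention, 2 million connections per server); decimal units are used, and every user is assumed to hold exactly one connection.

```python
USERS = 1_000_000_000
MESSAGES_PER_USER_PER_DAY = 100     # assumption from the list
AVG_MESSAGE_BYTES = 1_000           # 1 KB, decimal
RETENTION_DAYS = 30
CONNECTIONS_PER_SERVER = 2_000_000
SECONDS_PER_DAY = 86_400

messages_per_day = USERS * MESSAGES_PER_USER_PER_DAY
avg_messages_per_second = messages_per_day / SECONDS_PER_DAY
bytes_per_day = messages_per_day * AVG_MESSAGE_BYTES
retained_bytes = bytes_per_day * RETENTION_DAYS
# Ceiling division: servers needed if every user is connected at once.
chat_servers = -(-USERS // CONNECTIONS_PER_SERVER)

print(f"messages/day:      {messages_per_day:,}")
print(f"avg messages/sec:  {avg_messages_per_second:,.0f}")
print(f"data/day:          {bytes_per_day / 1e12:,.0f} TB")
print(f"30-day retention:  {retained_bytes / 1e15:,.0f} PB")
print(f"chat servers:      {chat_servers:,}")
```

With these inputs the script reports 100 billion messages per day (about 1.16 million per second on average), 100 TB of new data per day, 3 PB retained over 30 days, and 500 chat servers. Peak traffic, replication, and multiple devices per user would all push these figures higher.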