System Design


Concepts to Know to Design a System


  • Vertical Scaling:
    • Add more memory, CPU, and disk to an existing host.
    • It can be expensive, and there is a limit on how much memory and CPU a single host can hold.
    • But it avoids the problems of distributed systems.
  • Horizontal Scaling:
    • Keep each host small and add more hosts.
    • We can keep adding hosts almost without limit, but we have to deal with all the distributed-system challenges.
  • Horizontal scaling is generally preferred over vertical scaling.

CAP (Consistency, Availability, Partition Tolerance) Theorem

  • Also known as Brewer’s theorem, it states that a distributed data store cannot simultaneously provide more than two of the following three guarantees:
    • Consistency: Every read receives the most recent write or an error.
    • Availability: Every request receives a (non-error) response – without the guarantee that it contains the most recent write.
    • Partition tolerance (acceptance of network failures in a distributed system): The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.
  • In particular, the CAP theorem implies that in the presence of a network partition or failure, one has to choose between consistency and availability.
  • Note: Consistency as defined in the CAP theorem is quite different from the consistency guaranteed in ACID database transactions.
  • No distributed system is safe from network failures, thus network partitioning generally has to be tolerated.
  • In the presence of a network partition or failure, one is then left with two options: consistency or availability.
    • Consistency over Availability: System will return an error or a time-out if particular information cannot be guaranteed to be up to date due to network partitioning.
    • Availability over Consistency: System will always process the query and try to return the most recent available version of the information, even if it cannot guarantee it is up to date due to network partitioning.
  • In the absence of network failure – that is, when the distributed system is running normally – both availability and consistency can be satisfied.
  • CAP is frequently misunderstood as if one has to choose to abandon one of the three guarantees at all times. In fact, the choice is really between consistency and availability only when a network partition or failure happens; at all other times, no trade-off has to be made.


  • ACID: Atomicity, Consistency, Isolation, Durability
    • Database systems designed with traditional ACID guarantees in mind, such as RDBMSs, choose consistency over availability.
  • BASE: Basically Available, Soft state, Eventual consistency
    • Database systems designed around the BASE philosophy, common in the NoSQL movement for example, choose availability over consistency.
    • When we start using NoSQL databases, we need to understand which of the ACID properties we are willing to sacrifice.

Partitioning / Sharding Data

  • Suppose we need to store trillions of records, far more than a single database node can hold.
  • We need to spread them across many database nodes; that is where sharding comes into the picture.
  • The question is how to shard (split) the data so that every node is responsible for some part of the records.
  • Consistent Hashing: A technique heavily used for sharding or partitioning data. We need to know how it works and its pros and cons.
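As a rough illustration of consistent hashing, the sketch below places each node at many positions on a hash ring and assigns a key to the first node found clockwise from the key's hash. The class name `HashRing`, the node names, the use of MD5, and the figure of 100 virtual nodes per host are all assumptions made for this example, not something these notes prescribe.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable integer position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical helper class)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes      # virtual positions per physical node
        self._positions = []      # sorted ring positions
        self._owner = {}          # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """Place `vnodes` positions for this node on the ring."""
        for i in range(self.vnodes):
            pos = _hash(f"{node}#{i}")
            self._owner[pos] = node
            bisect.insort(self._positions, pos)

    def remove_node(self, node: str) -> None:
        """Remove all positions that belong to this node."""
        for i in range(self.vnodes):
            pos = _hash(f"{node}#{i}")
            del self._owner[pos]
            self._positions.remove(pos)

    def get_node(self, key: str) -> str:
        """Return the first node clockwise from the key's position."""
        if not self._positions:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._positions, _hash(key))
        return self._owner[self._positions[idx % len(self._positions)]]


ring = HashRing(["db-1", "db-2", "db-3"])
shard = ring.get_node("user:42")  # one of db-1, db-2, db-3
```

The property worth noticing is that adding a fourth node only moves the keys that now belong to that node; every other key stays on its old shard, which is the main advantage over `hash(key) % number_of_nodes`.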

Optimistic vs. Pessimistic Locking

  • Optimistic Locking:
    • We do not acquire any locks while doing a database transaction.
    • When we are ready to commit, we check that no other transaction has updated the record we are working on. If one has, we typically abort and retry.
  • Pessimistic Locking:
    • We acquire all the locks beforehand, do the database transaction, and then commit.
  • Both have pros and cons, and we need to know which kind of locking suits which scenario.
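One common way to implement the optimistic check is a version column that every update must match. The sketch below uses an in-memory SQLite database; the `account` table, its columns, and the helper names `read_account` and `try_update` are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO account VALUES (1, 100, 0)")
conn.commit()


def read_account(conn, account_id):
    """Read the row without taking a lock; remember the version we saw."""
    return conn.execute(
        "SELECT balance, version FROM account WHERE id = ?", (account_id,)
    ).fetchone()


def try_update(conn, account_id, new_balance, expected_version):
    """Commit only if nobody changed the row since we read it."""
    cur = conn.execute(
        "UPDATE account SET balance = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_balance, account_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows updated means the version was stale


# Two transactions read the same row (balance 100, version 0).
balance_a, version_a = read_account(conn, 1)
balance_b, version_b = read_account(conn, 1)

first = try_update(conn, 1, balance_a - 30, version_a)   # accepted
second = try_update(conn, 1, balance_b + 50, version_b)  # rejected: stale version
```

The rejected transaction would normally re-read the row and retry. A pessimistic variant would instead lock the row before reading it, for example with `SELECT ... FOR UPDATE` in databases that support it.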

Strong vs. Eventual Consistency

  • Strong Consistency:
    • Reads will always see the latest writes.
    • Used in Relational Databases.
  • Eventual Consistency:
    • Reads may see stale data at first, but will eventually see all the writes.
    • In NoSQL Databases we need to decide whether we want strong or eventual consistency.
  • The benefit of eventual consistency is that it provides higher availability.

Relational vs. No-SQL Databases

  • These days many people prefer NoSQL databases, but do not discard relational databases just yet.
  • Relational databases provide all the nice ACID properties.
  • NoSQL databases scale a little better and have higher availability.
  • Depending on the problem and situation, we need to decide which fits better.

Types of No-SQL

  • Key-Value:
    • The simplest kind: the database stores a value under each key.
  • Wide-Column:
    • Each row can have its own format (a different set of columns), and a row can have a large number of columns.
  • Document Based:
    • If we have semi-structured data, such as XML or JSON, and want to persist it, we use a document-based database.
  • Graph Based:
    • If we have entities and edges (relationships) b/w those entities, that is, a graph to store, we use a graph-based NoSQL database.


Caching

  • Used to speed up requests.
  • When we know some data will be used frequently, we store it in a cache so that it can be accessed quickly.
  • 2 types of caching:
    1. Every node does its own caching, i.e. the cached data is not shared b/w nodes.
    2. Distributed cache: The cached data is shared b/w different nodes.
  • When we use caching we need to consider a few things:
    • The cache cannot be the source of truth.
    • Cached data has to be fairly small, as caches tend to keep all data in memory.
    • We need to choose an eviction policy for the cache.
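To make the idea of an eviction policy concrete, here is a minimal sketch of one well-known policy, least recently used (LRU). LRU is only one option among several, and the class name `LRUCache` and the capacity of 2 are assumptions for this example.

```python
from collections import OrderedDict


class LRUCache:
    """Bounded in-memory cache that evicts the least recently used entry."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default               # cache miss: caller falls back to the source of truth
        self._items.move_to_end(key)     # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # touching "a" makes "b" the eviction candidate
cache.put("c", 3)  # evicts "b"
```

On a miss the caller reads from the database and calls `put`, which keeps the database, not the cache, as the source of truth.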

Data-centres / Racks / Hosts

  • Need to be aware of how data-centres are architected and arranged today.
  • Data-centres have racks and racks have hosts.
  • Need to understand the latencies of communicating across hosts, across racks, and even across data-centres.
  • Need to consider the worst that can happen if a host goes down, if a complete rack goes down, or, worse still, if an entire data-centre goes down.

CPU / Memory / Hard-Drive / Network Bandwidth

  • All of these are limited resources so when we design our system we need to consider:
    • How do we work around these limitations?
    • How do we improve throughput and latency?
    • How do we scale the system around these limitations?

Random vs. Sequential Read/Write on Disk

  • We know that reads and writes on disk are slow.
  • But sequential reads and writes are much faster than random ones.
  • Need to design the system around sequential reads and writes.
  • Need to avoid random reads and writes where possible; they can be orders of magnitude slower than sequential reads and writes on disk.

HTTP vs. HTTP2 vs. Web-Sockets

  • HTTP:
    • Request-reply architecture b/w client and server.
    • Pretty much the entire web runs on HTTP.
  • HTTP2:
    • It addresses some of the deficiencies of HTTP; for example, it can carry multiple requests over a single connection.
  • Web-Sockets:
    • Fully bidirectional communication b/w client and server.
  • Need to know the differences b/w them and their inner workings.

TCP/IP Model

  • There are 4 layers in this model; we need to know what they are and how they work.

IPv4 vs. IPv6

  • IPv4 has 32-bit addresses and IPv6 has 128-bit addresses.
  • We are running out of IPv4 addresses, so the world is migrating towards IPv6.
  • Need to understand their details and how routing works.
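The difference in address width can be checked with Python's standard `ipaddress` module. The two addresses below come from the ranges reserved for documentation and are used purely as examples.

```python
import ipaddress

v4 = ipaddress.ip_address("192.0.2.1")    # IPv4 documentation range
v6 = ipaddress.ip_address("2001:db8::1")  # IPv6 documentation range

v4_bits = v4.max_prefixlen  # 32
v6_bits = v6.max_prefixlen  # 128

# Size of each address space: 2 to the power of the address width.
ipv4_total = 2 ** v4_bits   # 4294967296, about 4.3 billion
ipv6_total = 2 ** v6_bits   # about 3.4e38

# The raw address is 4 bytes for IPv4 and 16 bytes for IPv6.
v4_bytes = len(v4.packed)
v6_bytes = len(v6.packed)
```

Roughly 4.3 billion IPv4 addresses is fewer than the number of connected devices today, which is the arithmetic behind the exhaustion problem.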


  • TCP:
    • Connection-oriented and reliable, but somewhat slower, as packets must be acknowledged.
    • Used when we can’t afford to lose packets, e.g. when transferring sensitive documents.
  • UDP:
    • Connectionless and unreliable, but faster, as no acknowledgement is needed.
    • Used when we can afford to lose some packets but want them delivered quickly, e.g. in video streaming systems.

DNS Lookup

  • It translates a domain name into an IP address.
  • Need to know how it works, its hierarchy, and the caching around it.


  • TLS (Transport Layer Security):
    • Used to secure communication b/w client and server both in terms of Privacy and Data-integrity.
    • When used with HTTP it pretty much becomes HTTPS.

Public Key Infrastructure & Certificate Authority

  • Public Key Infrastructure is used to manage public keys and digital certificates.
  • Certificate Authority is a trusted entity which tells us if the public key is from the correct party.
  • Example:- If we visit KodeFork over HTTPS, then we get a public key back, and the certificate authority tells us that it definitely comes from KodeFork and not from a third party who has placed themselves b/w us and KodeFork.

Symmetric vs. Asymmetric Encryption

  • Asymmetric Encryption:
    • It is computationally more expensive, so it should be used to send small amounts of data, preferably a symmetric key.
    • Example: Public-Private Key Encryption
  • Symmetric Encryption:
    • Example: AES

Load Balancers L4 vs. L7

  • Load balancers sit in front of a service and delegate client requests to one of the nodes behind it.
  • This delegation can be round-robin or based on the load average of the nodes behind the service.
  • Load balancers can operate at L4 or L7 of the OSI model.
  • At L4: The load balancer considers the client and destination IP addresses and port numbers to do the routing.
  • At L7: This is the HTTP level, where the load balancer uses the HTTP URI to do the routing.
  • Most of the load balancers operate at L7.
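The selection strategies mentioned above can be sketched in a few lines. This is a toy model rather than how any particular load balancer is implemented; the node names, pool names, and function names are hypothetical.

```python
import itertools


class RoundRobinBalancer:
    """Hand out backends in a fixed rotating order."""

    def __init__(self, backends):
        backends = list(backends)
        if not backends:
            raise ValueError("at least one backend is required")
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)


def pick_least_loaded(load_by_backend):
    """Pick the backend with the lowest reported load average."""
    return min(load_by_backend, key=load_by_backend.get)


def route_by_path(path, routes, default_pool):
    """L7-style routing: choose a backend pool from the HTTP URI prefix."""
    for prefix, pool in routes.items():
        if path.startswith(prefix):
            return pool
    return default_pool


lb = RoundRobinBalancer(["node-a", "node-b", "node-c"])
order = [lb.pick() for _ in range(4)]  # node-a, node-b, node-c, node-a
routes = {"/api": "api-pool", "/static": "static-pool"}
pool = route_by_path("/api/users", routes, "web-pool")  # api-pool
```

An L4 balancer cannot do the last kind of routing, because it only sees IP addresses and ports, not the HTTP request.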


  • CDN (Content Delivery Network):
    • Suppose we are watching Netflix from California. Netflix puts its movies and series in a content delivery network close to us.
    • When we stream a movie, it can be served from the nearby CDN instead of all the way from the data-centre.
    • This helps both performance and latency for the end-user.
  • Edge:
    • A very similar concept, where we do processing close to the end user.
    • Another advantage of the edge is that it has a dedicated network all the way from the edge to the data-centre.
    • So our request can be routed through this dedicated network instead of going over the general internet.

Bloom Filters & Count-Min Sketch

  • Bloom Filters:
    • Space-efficient probabilistic data structure.
    • It is used to decide whether an element is a member of a set.
    • It can give false positives but never false negatives.
    • In other words, it may say an element is present when it is not, but it will never say an element is absent when it is present.
    • Example:- Username availability check in Gmail.
    • So if our design can tolerate false positives, we can use Bloom filters.
  • Count-Min Sketch:
    • A similar data structure, but it is used to count the frequency of events.
    • Suppose we have millions of events and we want to keep track of top k events.
    • Then we can consider using count-min sketch instead of keeping the count of all the events.
    • For a fraction of the space, it gives an answer that is close to the actual one, within some error rate.
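Both structures can be sketched with a shared family of hash functions. The sizes, the number of hashes, and the salted SHA-256 scheme below are assumptions chosen for readability, not tuned values.

```python
import hashlib


def _indexes(item: str, count: int, modulus: int):
    """Derive `count` indexes in [0, modulus) from salted SHA-256 digests."""
    for seed in range(count):
        digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % modulus


class BloomFilter:
    """Set membership with possible false positives, never false negatives."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)  # one byte per bit, for clarity

    def add(self, item: str) -> None:
        for idx in _indexes(item, self.num_hashes, self.size):
            self.bits[idx] = 1

    def might_contain(self, item: str) -> bool:
        return all(self.bits[idx] for idx in _indexes(item, self.num_hashes, self.size))


class CountMinSketch:
    """Approximate frequency counts; estimates never undercount."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def add(self, item: str, count: int = 1) -> None:
        for row, idx in enumerate(_indexes(item, self.depth, self.width)):
            self.rows[row][idx] += count

    def estimate(self, item: str) -> int:
        return min(
            self.rows[row][idx]
            for row, idx in enumerate(_indexes(item, self.depth, self.width))
        )


taken = BloomFilter()
taken.add("alice")
maybe_taken = taken.might_contain("alice")  # True: added items are always found

events = CountMinSketch()
for name in ["login", "login", "click", "login"]:
    events.add(name)
login_estimate = events.estimate("login")   # at least 3
```

A `False` from `might_contain` is definitive, while a `True` has to be confirmed against the real data store if the design cannot tolerate false positives. Likewise, `estimate` can overcount when items collide in every row, but it never undercounts.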

Paxos: Consensus over Distributed Hosts

  • Paxos is used to reach consensus over distributed hosts.
  • Before Paxos came along, finding consensus was a very hard problem.
  • Example:- Leader election among distributed hosts
  • We may not need to know the internal workings, but it is good to know some of the use-cases which Paxos solves.

Design Patterns & Object-Oriented Design

  • Design-Patterns:
    • Need to know patterns such as Factory Method, Singleton, and others.
  • Object-Oriented Design:
    • Classes, Objects, Methods
    • Inheritance
    • Polymorphism
    • Abstraction

Virtual Machines & Containers

  • Virtual Machines:
    • A way of giving us an operating system on top of shared hardware such that we feel like the exclusive owner of the hardware.
    • But in reality that hardware is shared b/w different isolated operating systems.
  • Containers:
    • A way of running our application and its dependencies in an isolated environment.
    • Containers have become extremely important and are widely run in production environments these days.

Publisher-Subscriber OR Queue Based

  • A publisher publishes a message to the queue and a subscriber receives that message from the queue.
  • This pattern has become extremely important in system design these days.
  • We should definitely use them whenever we have the opportunity.
  • One thing to remember is that a customer-facing request should not be directly exposed to a Publisher-Subscriber system.
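The publish and receive flow can be sketched in a single process with Python's `queue` and `threading` modules. A real deployment would use a message broker; the function name `run_demo`, the `None` shutdown marker, and the upper-casing "work" done by the subscriber are assumptions for this example.

```python
import queue
import threading


def run_demo(messages):
    """Publish messages to a queue and let one subscriber thread consume them."""
    q = queue.Queue()
    received = []

    def subscriber():
        while True:
            msg = q.get()               # blocks until a message is available
            if msg is None:             # shutdown marker sent by the publisher
                q.task_done()
                break
            received.append(msg.upper())  # stand-in for real processing
            q.task_done()

    worker = threading.Thread(target=subscriber)
    worker.start()

    for msg in messages:                # the publisher does not wait for processing
        q.put(msg)
    q.put(None)

    q.join()                            # wait until every message has been handled
    worker.join()
    return received


result = run_demo(["order-created", "order-paid"])
```

The publisher only needs the queue to accept the message, so a slow or temporarily unavailable subscriber does not block it. That decoupling is the main reason the pattern is so widely used.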


MapReduce

  • It is used to do distributed and parallel processing of big data.
  • Map filters and sorts the data, and Reduce summarizes it.
  • It’s very important in big data applications.
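The classic illustration is a word count. The sketch below runs the three phases (map, group by key, reduce) in one process; in a real framework the map and reduce calls would be spread across many machines. The function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: summarize all values for one key."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["the cat sat", "the dog sat down"])
# {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1, 'down': 1}
```

Because each document is mapped independently and each key is reduced independently, both phases can run in parallel.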

Multi-threading, Concurrency, Locks, Synchronization, CAS (Compare and Swap)

  • Very important concepts when dealing with multi-threaded applications or systems.
  • Some programming languages, like Java, come with these things built in, but in others, like C, we have to depend on platform-specific implementations.
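To show the shape of a compare-and-swap loop, the sketch below emulates CAS with a lock, since Python's standard library does not expose a hardware CAS instruction. The `AtomicInt` class and the function names are hypothetical; languages such as Java provide the real operation through classes like `AtomicInteger`.

```python
import threading


class AtomicInt:
    """Integer with an emulated compare-and-swap (a lock stands in for the CPU instruction)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Set the value to `new` only if it still equals `expected`."""
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


def increment(counter):
    """Retry until our update wins; no lock is held between read and write."""
    while True:
        current = counter.get()
        if counter.compare_and_swap(current, current + 1):
            return


def run_counter_demo(threads=4, per_thread=1000):
    counter = AtomicInt()

    def work():
        for _ in range(per_thread):
            increment(counter)

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter.get()


total = run_counter_demo()  # 4 threads x 1000 increments
```

If another thread changes the value between `get` and `compare_and_swap`, the swap fails and the loop simply tries again, which is the same optimistic idea as the version check used in optimistic locking.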