Messaging (Message Queue and PubSub)
A message-driven architecture decouples teams and lets them operate independently;
Check out PubSub vs Message Queue
Distributed publish-subscribe Messaging System
Kafka is a fast, scalable, partitioned, and replicated commit log service, distributed by design.
- No reconfiguration is needed in Kafka to add new consumers.
- A consumer can go offline and later resume from where it left off.
- A consumer can pull and process data at its own speed.
- It is designed as a distributed system which is very easy to scale out.
- It offers high throughput for both publishing and subscribing.
- It supports multi-subscribers and automatically balances the consumers during failure.
- It persists messages on disk and thus can be used for batch consumption such as ETL, in addition to real-time applications.
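A minimal sketch (plain Python, no real Kafka, all names illustrative) of the consumer-side properties above: the broker just keeps an append-only log, and each consumer tracks its own offset, so it can go offline and resume, pull at its own speed, and new consumers can join without any broker reconfiguration:

```python
log = []  # append-only log held by the "broker"

def publish(message: str) -> None:
    log.append(message)

def pull(offset: int, max_messages: int = 10):
    """Consumer-driven pull: return up to max_messages starting at offset."""
    return log[offset:offset + max_messages]

# The producer publishes regardless of who is listening.
for i in range(5):
    publish(f"event-{i}")

# Consumer A reads two messages, then "goes offline" remembering offset 2.
consumer_a_offset = 0
batch = pull(consumer_a_offset, max_messages=2)
consumer_a_offset += len(batch)

# More messages arrive while A is offline.
publish("event-5")

# A resumes exactly where it left off; a brand-new consumer B starts
# from offset 0 and replays the full history.
resumed = pull(consumer_a_offset)
consumer_b = pull(0)
```

Because the broker never tracks per-consumer state here, adding consumer B costs nothing on the broker side, which mirrors the decoupling the bullets above describe.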
How it works
- To uniquely identify a message: topic + partition + offset
- Message: a payload of bytes
- Topic: a category or feed name to which messages are published.
- Producer: anyone who can publish messages to a Topic.
- The published messages are then stored at a set of servers called Brokers or Kafka Cluster.
- Consumer can subscribe to one or more Topics and consume the published Messages by pulling data from the Brokers.
- Producer can choose their favorite serialization method to encode the message content.
- Unlike traditional iterators, the message stream iterator never terminates. If currently no message is there to consume, the iterator blocks until new messages are published to the topic.
- To balance load, a topic is divided into multiple partitions and each broker stores one or more of those partitions.
- Multiple producers and consumers can publish and retrieve messages at the same time.
- Each partition of a topic corresponds to a logical log.
- Physically, a log is implemented as a set of segment files of equal size. Every time a producer publishes a message to a partition, the broker simply appends the message to the last segment file. A segment file is flushed to disk after a configurable number of messages has been published or after a certain amount of time has elapsed. Messages are exposed to consumers only after the segment is flushed.
- Unlike traditional messaging systems, a message stored in Kafka doesn't have an explicit message id. Messages are addressed by their logical offset in the log. This avoids the overhead of maintaining auxiliary, seek-intensive random-access index structures that map message ids to actual message locations. Message ids are incremental but not consecutive: to compute the id of the next message, add the length of the current message to its logical offset.
- The consumer issues asynchronous pull requests to the broker to keep a buffer of bytes ready to consume. Each pull request contains the offset of the message to consume.
- Kafka brokers are stateless. This means that the consumer has to maintain how much it has consumed.
- Deleting messages from the broker is tricky because the broker doesn't know whether a consumer has consumed a message. Kafka sidesteps this with a simple time-based SLA for its retention policy: a message is automatically deleted once it has been retained in the broker longer than a configured period.
- A consumer can deliberately rewind to an old offset and re-consume data.
- Each Kafka broker coordinates with the other brokers using ZooKeeper. Producers and consumers are notified by ZooKeeper about the arrival of a new broker or the failure of an existing one.
- The Kafka producer doesn't wait for acknowledgements from the broker and sends messages as fast as the broker can handle.
- Kafka has a more efficient storage format. On average, each message had an overhead of 9 bytes in Kafka versus 144 bytes in ActiveMQ, because of the heavy message header required by JMS and the overhead of maintaining various indexing structures. LinkedIn observed that one of the busiest threads in ActiveMQ spent most of its time accessing a B-tree to maintain message metadata and state.
- Kafka has a more efficient storage format; fewer bytes were transferred from the broker to the consumer in Kafka.
- The brokers in both ActiveMQ and RabbitMQ had to maintain the delivery state of every message. The LinkedIn team observed that one of the ActiveMQ threads was busy writing KahaDB pages to disk during this test; in contrast, there was no disk write activity on the Kafka broker. Finally, by using the sendfile API, Kafka reduces transmission overhead.
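The partition and offset mechanics described above can be sketched in a few lines of plain Python (an illustration of the idea, not Kafka's actual code): a partition is an append-only byte log, a message is identified by topic + partition + logical byte offset, and the id of the next message is the current offset plus the current message's length, so ids are incremental but not consecutive:

```python
class Partition:
    """Toy model of one partition's logical log (stands in for segment files)."""

    def __init__(self):
        self.data = bytearray()  # append-only byte log

    def append(self, payload: bytes) -> int:
        """Append a message; return its logical offset, which is its id."""
        offset = len(self.data)
        self.data.extend(payload)
        return offset

    def read(self, offset: int, length: int) -> bytes:
        """Consumer pull: fetch `length` bytes starting at `offset`."""
        return bytes(self.data[offset:offset + length])

p = Partition()
off1 = p.append(b"hello")    # first message lands at offset 0
off2 = p.append(b"world!!")  # next id = previous offset + previous length

# The broker keeps no index mapping ids to locations: the id *is* the
# byte position, so a consumer addresses messages directly by offset.
assert off2 == off1 + len(b"hello")
```

This is also why the brokers can stay stateless: the consumer only has to remember a single integer per partition to know exactly what it has consumed.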
- Amazon Kinesis
- GCP PubSub
Why build instead of buy?
- Scales better for their needs;
- Cheaper than outsourced services;
- Full control over the overall architecture.
Segment's Centrifuge: database-as-a-queue https://segment.com/blog/introducing-centrifuge/
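A minimal database-as-a-queue sketch in the spirit of Centrifuge, which builds reliable delivery queues on top of a database. This is an illustration using stdlib sqlite3 with a hypothetical `jobs` schema, not Segment's actual design:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL,
    state TEXT NOT NULL DEFAULT 'pending')""")

def enqueue(payload: str) -> None:
    """Producing a message is just an INSERT."""
    db.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))

def dequeue():
    """Claim the oldest pending job, or return None if the queue is empty."""
    row = db.execute(
        "SELECT id, payload FROM jobs WHERE state='pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    db.execute("UPDATE jobs SET state='in_flight' WHERE id=?", (row[0],))
    return row

def ack(job_id: int) -> None:
    """On successful delivery, remove the job; unacked jobs can be retried."""
    db.execute("DELETE FROM jobs WHERE id=?", (job_id,))

enqueue("send-to-webhook-A")
enqueue("send-to-webhook-B")
job = dequeue()  # claims the oldest pending job
ack(job[0])      # delete on success
```

The appeal of this pattern is that delivery state lives in a transactional store, so retries, per-destination isolation, and inspection of stuck messages come from ordinary SQL rather than broker-specific tooling.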