Deep Dive Into a Debezium Community Connector: The Scylla CDC Source Connector

Debezium is an open source distributed platform for change data capture. Start it up, point it at your databases, and your apps can start responding to all of the inserts, updates, and deletes that other apps commit to your databases. Debezium is durable and fast, so your apps can respond quickly and never miss an event, even when things go wrong.
Even at the proof-of-concept stage, performance was a paramount requirement. Scylla clusters can scale to hundreds of nodes and petabytes of data, so it became clear that a single Kafka Connect worker node (even a multithreaded one) could not handle the load of a large Scylla cluster.

Thankfully, we took that into consideration while implementing CDC functionality in Scylla. Generally, you can think of Change Data Capture as a time-ordered queue of changes. To allow for horizontal scaling, Scylla maintains multiple time-ordered queues of changes, called streams. When there is only a single consumer of the CDC log, it has to query all streams to read all changes. A benefit of this design is that you can introduce additional consumers and assign a disjoint set of streams to each of them. As a result, you can greatly increase the parallelism of processing the CDC log.
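
To make the idea concrete, here is a minimal, hypothetical Java sketch (not the connector's actual code) of that partitioning scheme: a list of stream identifiers is split into disjoint groups, one per consumer, so together the consumers cover every stream and no stream is read twice.

```java
import java.util.ArrayList;
import java.util.List;

public class StreamAssignmentSketch {
    // Assign each stream identifier to exactly one of `consumerCount` consumers.
    // The resulting groups are disjoint and together cover all streams.
    static List<List<String>> assign(List<String> streamIds, int consumerCount) {
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < consumerCount; i++) {
            groups.add(new ArrayList<>());
        }
        for (int i = 0; i < streamIds.size(); i++) {
            groups.get(i % consumerCount).add(streamIds.get(i));
        }
        return groups;
    }

    public static void main(String[] args) {
        // Six streams, two consumers: each consumer reads three of the streams.
        List<String> streamIds = List.of("s0", "s1", "s2", "s3", "s4", "s5");
        assign(streamIds, 2).forEach(System.out::println);
    }
}
```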

That's the approach we implemented in the Scylla CDC Source Connector. On startup, the connector first reads the identifiers of all available streams. Next, it distributes them among many Kafka Connect tasks (the number of tasks is configurable via tasks.max).
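
A rough sketch of what that distribution step can look like in the Kafka Connect API is shown below: a SourceConnector's taskConfigs(maxTasks) method returns one configuration map per task, and Kafka Connect's ConnectorUtils.groupPartitions helper splits the stream identifiers into that many disjoint groups. The helper method and the "assigned.streams" key used here are illustrative placeholders, not the connector's actual names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.kafka.connect.util.ConnectorUtils;

public class TaskConfigSketch {
    // Sketch of the body of a SourceConnector's taskConfigs(int maxTasks):
    // split the stream identifiers into at most `maxTasks` disjoint groups
    // and hand each group to one task through its configuration map.
    static List<Map<String, String>> taskConfigs(List<String> streamIds, int maxTasks) {
        int numGroups = Math.min(maxTasks, streamIds.size());
        List<List<String>> groups = ConnectorUtils.groupPartitions(streamIds, numGroups);

        List<Map<String, String>> configs = new ArrayList<>();
        for (List<String> group : groups) {
            Map<String, String> taskConfig = new HashMap<>();
            // "assigned.streams" is a placeholder key, not the connector's real one.
            taskConfig.put("assigned.streams", String.join(",", group));
            configs.add(taskConfig);
        }
        return configs;
    }

    public static void main(String[] args) {
        List<String> streamIds = List.of("s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7");
        // With maxTasks = 4, each of the four tasks is assigned two of the eight streams.
        taskConfigs(streamIds, 4).forEach(System.out::println);
    }
}
```

Kafka Connect then schedules those tasks across the workers in the cluster, so the groups of streams are processed in parallel.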

Each created Kafka Connect task (which can run on a separate Kafka Connect node) reads CDC changes from its assigned set of streams. If you double the number of tasks, each task has to read only half as many streams, and therefore half the data throughput, making it possible to handle a higher load.
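
As a rough illustration of that knob, here is a hypothetical configuration fragment. Only generic Kafka Connect properties are shown, the Scylla-specific connection and table settings are omitted, and the connector.class value is an assumption to verify against the connector's documentation.

```properties
name=scylla-cdc-connector
# Assumed class name; check the connector's documentation for the exact value.
connector.class=com.scylladb.cdc.debezium.connector.ScyllaConnector
# With 8 tasks each task reads roughly 1/8 of the streams; raising this to 16
# roughly halves the number of streams (and throughput) each task handles.
tasks.max=8
```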

Solving the large stream count problem

While designing CDC functionality in Scylla, we had to carefully…