Enterprise Ecosystem: Change Data Capture (Organized Chaos)
The technological landscape of the modern enterprise is complex. The need to continuously build, scale, and maintain the entire ecosystem is a constant force that we trade off against the need to deliver on time or, better yet, ahead of schedule. The strategies organizations use to balance this trade-off are numerous and vary widely: defining SDLC processes, selecting technologies, designing system architecture, planning resource allocation, adopting and licensing tools, and so on.
This post delves into the concept of Change Data Capture (CDC), and specifically how it can be used at scale to build a unified event backbone that establishes data-level consistency and enables event-driven architectures across distributed systems.
Specific areas that we touch on:
– Change Data Capture (CDC) Use Cases
– Associated Advantages & Disadvantages
– Technology Options for Getting Started
CDC Use Cases:
Change Data Capture is the real-time (or near-real-time) capture of changes to the critical data points that drive action within the business. Traditionally, disparate systems across the enterprise are the source of this data, with each system acting as both the system of record for the data it owns and a custodian of derived data sourced from elsewhere.
Modern micro-service oriented architectures call for complete separation of concerns across systems, which inherently involves minimizing the duplication of data across sources. The main consequence is that micro-services should communicate with each other as needed to perform whatever actions are required (i.e. don’t copy the data).
In practice, it is common to find yourself (or your company) supporting multiple systems that share and duplicate data, with blurry lines between their functions and data ownership. Batch processing is often used to ETL data between systems, yielding point-to-point integrations that grow into a tangled web.
Regardless of the level of maturity of your ecosystem, change data capturing (in some form) will generally provide advantages that improve some or all of the following:
- Data consistency
- Event-driven (real-time) delivery
- Separation of system concerns
Technology options for implementing CDC vary. The best option for your change capture use case depends largely on the database technology being used and on the high-level use case (see below for our side by side tech review of these options). Two common, and potentially very different, change capture use cases are:
- Database Replication
  - High availability
  - Disaster-recovery
  - Read-Only-Replica(s)
- Event Sourcing
  - Event-driven system architectures
  - Pub/sub style architectures
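Both use cases can be fed from the same stream of change events. The following is a minimal sketch, assuming an in-memory event bus; the names `ChangeEvent`, `EventBus` and `ReadReplica` are hypothetical and stand in for a real broker and a real replica database. It shows one published change being applied to a read-only replica and also delivered to an unrelated subscriber.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class ChangeEvent:
    table: str
    op: str               # "insert", "update" or "delete"
    key: int              # primary key of the changed row
    row: Optional[dict]   # new row state, None for deletes


class EventBus:
    """Tiny in-memory pub/sub: subscribers register per table name."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[ChangeEvent], None]]] = defaultdict(list)

    def subscribe(self, table: str, handler: Callable[[ChangeEvent], None]) -> None:
        self._subscribers[table].append(handler)

    def publish(self, event: ChangeEvent) -> None:
        # Every subscriber of the table receives every event, in order.
        for handler in self._subscribers[event.table]:
            handler(event)


class ReadReplica:
    """Replication use case: keep a copy of the table by replaying changes."""

    def __init__(self) -> None:
        self.rows: Dict[int, dict] = {}

    def apply(self, event: ChangeEvent) -> None:
        if event.op == "delete":
            self.rows.pop(event.key, None)
        else:
            self.rows[event.key] = dict(event.row)


def demo():
    bus = EventBus()
    replica = ReadReplica()
    notifications: List[str] = []

    bus.subscribe("orders", replica.apply)
    # Event-driven use case: a second, independent consumer of the same stream.
    bus.subscribe("orders", lambda e: notifications.append(f"{e.op}:{e.key}"))

    bus.publish(ChangeEvent("orders", "insert", 1, {"status": "NEW"}))
    bus.publish(ChangeEvent("orders", "update", 1, {"status": "SHIPPED"}))
    bus.publish(ChangeEvent("orders", "insert", 2, {"status": "NEW"}))
    bus.publish(ChangeEvent("orders", "delete", 2, None))
    return replica.rows, notifications
```

The publisher knows nothing about its consumers, which is the property that lets new subscribers be added without another point-to-point integration.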
In either case, the natural goal is to evolve toward and standardize a more consistent approach to system integration and data access. This can be broken down into three potential scenarios:
- Point to Point
- Event Driven, Consistent, Database Exposed
- Complete Separation of Concerns (Micro-services Oriented)
Strong use cases for DB level change data capture include the following:
- Data that is manipulated/maintained by a variety of actors needs to be distributed
- Application level change capturing is not feasible (3rd party vendor, etc…)
- DB transaction logs are available (more on that later…)
CDC Advantages and Disadvantages
Before diving into an in-depth review of Change Data Capture options (of which there are many), let’s summarize some of the pros/cons commonly encountered when weighing CDC based approaches against others:
Advantages:
- Avoid application level change capture (can be complex: distributed transactions, commit/rollback …)
- Capture off-line changes (direct DB changes that bypass the application are still captured)
- Real-time or near-real time
Disadvantages:
- Schema change dependencies
- Licensing costs
- Required data transformation(s) (can be solved with the right event backbone…)
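The schema dependency and transformation concerns above are often handled by translating raw row-level changes into a stable, versioned event before consumers see them. The following is a minimal sketch of that idea; the column names, the field mapping and the event layout are all hypothetical, chosen only for illustration.

```python
from typing import Any, Dict

# Hypothetical mapping from physical column names to stable event field names.
# Consumers depend on the field names, not on the table layout.
COLUMN_TO_FIELD = {
    "ord_id": "orderId",
    "ord_sts_cd": "status",
    "cust_nm": "customerName",
}

EVENT_VERSION = 1  # bumped only when the published event contract changes


def to_domain_event(table: str, op: str, row: Dict[str, Any]) -> Dict[str, Any]:
    """Translate a raw row-level change into a versioned, consumer-facing event."""
    # Columns missing from the mapping are dropped, so adding a column to the
    # table does not change what consumers receive.
    data = {field: row[col] for col, field in COLUMN_TO_FIELD.items() if col in row}
    return {"type": f"{table}.{op}", "version": EVENT_VERSION, "data": data}


raw_row = {"ord_id": 42, "ord_sts_cd": "SHIPPED", "cust_nm": "Acme", "audit_ts": "2023-01-01"}
event = to_domain_event("orders", "updated", raw_row)
```

Here the unmapped `audit_ts` column never reaches consumers, and a renamed column only requires a change to the mapping rather than to every downstream system.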
CDC Options:
Capturing changes can be broken down into either a push based or a polling (read) based approach. Some examples of common CDC techniques can be summarized as follows:
- Timestamp or Rising Sequential Value Comparison (Polling) – Periodic polling of a database by an independent actor that filters for records whose timestamp or numeric value is newer than the last one it processed
- Trigger Based Change Capture – Database triggers record each insert, update and delete in dedicated change tables; an independent actor then periodically polls those tables to identify newly changed records
- Log Based Change Capture – Perhaps the most reliable and efficient option (when offered by the database in question). This form of CDC propagates change events based on writes to the underlying database transaction log that backs the entire database state.
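The first two techniques can be compared side by side with a small, self-contained sketch. It assumes an in-memory SQLite database and a hypothetical `orders` table whose `updated_seq` column is bumped by the application on every write; the `orders_changes` table and trigger names are likewise illustrative. Log based capture is not shown, since it depends on each database's own transaction log format.

```python
import sqlite3

SCHEMA = """
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    status      TEXT NOT NULL,
    updated_seq INTEGER NOT NULL      -- rising value the application bumps on every write
);

-- Change table filled by the triggers below (hypothetical layout).
CREATE TABLE orders_changes (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    op        TEXT NOT NULL,          -- 'I', 'U' or 'D'
    order_id  INTEGER NOT NULL,
    status    TEXT
);

CREATE TRIGGER orders_ai AFTER INSERT ON orders BEGIN
    INSERT INTO orders_changes (op, order_id, status) VALUES ('I', NEW.id, NEW.status);
END;
CREATE TRIGGER orders_au AFTER UPDATE ON orders BEGIN
    INSERT INTO orders_changes (op, order_id, status) VALUES ('U', NEW.id, NEW.status);
END;
CREATE TRIGGER orders_ad AFTER DELETE ON orders BEGIN
    INSERT INTO orders_changes (op, order_id, status) VALUES ('D', OLD.id, NULL);
END;
"""


def poll_by_sequence(conn, last_seq):
    """Sequence comparison: return rows past the checkpoint and the new checkpoint."""
    rows = conn.execute(
        "SELECT id, status, updated_seq FROM orders "
        "WHERE updated_seq > ? ORDER BY updated_seq",
        (last_seq,),
    ).fetchall()
    return rows, (rows[-1][2] if rows else last_seq)


def poll_change_table(conn, last_change_id):
    """Trigger based: return change-table entries past the checkpoint."""
    rows = conn.execute(
        "SELECT change_id, op, order_id, status FROM orders_changes "
        "WHERE change_id > ? ORDER BY change_id",
        (last_change_id,),
    ).fetchall()
    return rows, (rows[-1][0] if rows else last_change_id)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO orders (id, status, updated_seq) VALUES (1, 'NEW', 1)")
    conn.execute("INSERT INTO orders (id, status, updated_seq) VALUES (2, 'NEW', 2)")
    conn.execute("UPDATE orders SET status = 'SHIPPED', updated_seq = 3 WHERE id = 1")
    conn.execute("DELETE FROM orders WHERE id = 2")

    seq_rows, seq_checkpoint = poll_by_sequence(conn, last_seq=0)
    trig_rows, trig_checkpoint = poll_change_table(conn, last_change_id=0)
    return seq_rows, seq_checkpoint, trig_rows, trig_checkpoint
```

In this run the sequence poller sees only the final state of order 1 and misses both the intermediate state and the delete of order 2, while the trigger based change table records all four operations. That difference is the usual reason to move from plain polling to trigger or log based capture.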
Technologies and Options
All of the following databases support basic timestamp and trigger based change capture. The list below outlines some of the further options you have depending on your DB.
Relevant Information by Database:
** Please note, there are many other places to understand more about capabilities per DB. This is just a launch pad.
** Please note, cost & licensing requirements vary across databases.
MySQL:
- Trigger Based & Binary Logging: https://datacater.io/blog/2021-08-25/mysql-cdc-complete-guide.html
- Hevo Data (potential licensing required): https://hevodata.com/learn/mysql-cdc/
- Datacoral (potential licensing required): https://docs.datacoral.com/ingest_connectors/db/mysql_cdc/
- Netflix Blog for MySQL CDC: https://netflixtechblog.com/dblog-a-generic-change-data-capture-framework-69351fb9099b
- Materialize (potential licensing required): https://materialize.com/docs/guides/cdc-mysql/
PostgreSQL:
- Trigger Based Logging: https://eager.io/blog/audit-postgres/
- Logical Replication (& Trigger Based Logging): https://datacater.io/blog/2021-09-02/postgresql-cdc-complete-guide.html
- Hevo Data (potential licensing required): https://hevodata.com/blog/postgresql-cdc/
- Materialize (potential licensing required): https://materialize.com/docs/guides/cdc-postgres/
Oracle:
- CDC Overview: https://hevodata.com/learn/oracle-cdc/
- Oracle Docs: https://docs.oracle.com/cd/B28359_01/server.111/b28313/cdc.htm
MS SQL Server:
DB Agnostic CDC Approaches:
Standing up CDC processes that are DB agnostic should be the goal of any system. Below is a glimpse into some technologies that fit the DB agnostic bill.
- Debezium (Kafka Based): https://debezium.io/
- Confluent JDBC Connector (Kafka Based): https://docs.confluent.io/kafka-connect-jdbc/current/index.html
- Mulesoft: https://blogs.mulesoft.com/dev-guides/how-to-tutorials/howto-extract-transform-load-etlchange-data-capture/
- AWS RDS Read Replicas: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ReadRepl.html
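To give a feel for what a log based tool hands to consumers, here is a minimal sketch that reads a Debezium-style change event. It assumes the JSON envelope Debezium documents for its data change events (`before`, `after`, `source`, `op`, `ts_ms`, optionally wrapped in a `payload` field); the sample message is hand-written for illustration, not captured output, and `parse_debezium_value` is a hypothetical helper name.

```python
import json

# Debezium operation codes: create, update, delete, and snapshot read.
OPS = {"c": "create", "u": "update", "d": "delete", "r": "snapshot read"}


def parse_debezium_value(message_value: str) -> dict:
    """Reduce a Debezium-style event value to the fields most consumers need."""
    doc = json.loads(message_value)
    # With schemas enabled in the JSON converter, the event sits under "payload".
    payload = doc.get("payload", doc)
    op = payload["op"]
    # Deletes carry the last known row state in "before"; "after" is null.
    row = payload["before"] if op == "d" else payload["after"]
    source = payload.get("source", {})
    return {
        "operation": OPS.get(op, op),
        "table": source.get("table"),
        "row": row,
        "ts_ms": payload.get("ts_ms"),
    }


# Hand-written sample value for an update to a hypothetical "orders" table.
SAMPLE = json.dumps({
    "payload": {
        "before": {"id": 1, "status": "NEW"},
        "after": {"id": 1, "status": "SHIPPED"},
        "source": {"db": "shop", "table": "orders"},
        "op": "u",
        "ts_ms": 1677628800000,
    }
})

change = parse_debezium_value(SAMPLE)
```

In a real deployment the message value would come from a Kafka consumer subscribed to the connector's topic; that wiring is omitted here to keep the sketch self-contained.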
How PDG Can Help
PDG Consulting is a technology and business-focused consulting firm with direct experience with many of the technologies and concepts touched on above. Regardless of the current state of your technology ecosystem, PDG can help identify pain points and define the roadmap to achieve a cleaner, organized suite of services and applications. And finally, we can help you execute that plan.
About the Author
I am Daniel Perez, currently an Engineering Director at PDG, with over 8 years of experience delivering enterprise architecture, quality solutions, integrations, and workflows to Fortune 500 technology companies in the media & entertainment, manufacturing, telecom, consumer product, and other industries. My history includes an ongoing passion for full stack development, automated testing, and overall management of complex software development projects and processes.