In IT environments, those in management-level roles are responsible for strategizing, overseeing, and ensuring that the operation of information technology resources within an organization runs smoothly.
As such, managers need to gain and maintain a functional understanding of the concepts and techniques crucial to modern IT and DataOps environments. As businesses increasingly rely on real-time data access, compliance and regulatory readiness, and advanced analytics, one such technique has become essential to smooth IT operations: change data capture (CDC).
To that end, we’ll start by defining exactly what CDC is, then break down the common scenarios where it’s applied and the types of tools used in those applications. To bring it all home, we’ll offer a sense of where change data capture seems to be headed.
What is change data capture (CDC)?
Change data capture (CDC) is a technique used to identify and capture changes made to data, typically inserts, updates, and deletes. In doing so, CDC ensures that changes made to source data are reliably identified and processed by downstream systems (e.g., data warehouses, analytics platforms, or other data stores).
As a technique, there are several methods one can use to operationalize CDC:
- Log-based CDC: This method directly reads a database’s transaction logs to identify changes. Log-based CDC is a comparatively non-intrusive method for capturing all changes, which makes it widely used across industries.
- Database-native CDC: Many modern databases include built-in change data capture capabilities. Because this functionality is native to the database, it minimizes the need for custom development or reliance on third-party vendors.
- Trigger-based CDC: As the name suggests, this method relies on database triggers to capture changes. In situations where real-time data synchronization is crucial, trigger-based CDC can capture changes in real time at the expense of adding overhead to the source system.
- Polling-based CDC: When real-time data isn’t a priority to an organization, this method of identifying changes by periodically querying the source system may be preferable. While it can also add load to the source system, polling-based CDC is one of the simpler methods to implement.
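To make the trade-offs concrete, here is a minimal, hedged sketch of the polling-based approach in Python; the `orders` table and its `updated_at` column are invented for illustration:

```python
import sqlite3

# Minimal polling-based CDC sketch: keep a "high-water mark" timestamp and
# periodically query for rows modified since the last poll. The orders table
# and its updated_at column are illustrative assumptions; note that plain
# polling like this cannot observe deletes.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'new', '2024-01-01T10:00:00')")
conn.execute("INSERT INTO orders VALUES (2, 'shipped', '2024-01-01T11:00:00')")

def poll_changes(conn, last_seen):
    # Return rows changed since last_seen, plus the new high-water mark.
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    new_mark = rows[-1][2] if rows else last_seen
    return rows, new_mark

# Only row 2 was modified after the last poll's high-water mark.
changes, mark = poll_changes(conn, "2024-01-01T10:30:00")
```

In a real deployment this query would run on a schedule; a log-based approach avoids both the polling load on the source and the blind spot around deleted rows.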
Change data capture in 7 basic steps
Regardless of which method and CDC solution an organization chooses to implement, the change data capture process itself consists of seven basic sequential steps:
1. Initialization: Before any change monitoring begins, a snapshot or baseline of the current state of data will be established. Without this, the CDC process would have no reference point to compare against to detect changes.
With a baseline created, the CDC process is set up by specifying the source system, datasets, or tables to be monitored. Next, the type of changes to capture can be selected in addition to the destination or downstream systems in play.
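The baseline portion of this step can be sketched as follows; the `customers` table and its columns are illustrative assumptions, not a prescribed schema:

```python
import sqlite3

# Hedged sketch of initialization: capture a baseline snapshot of a monitored
# table, keyed by primary key, so later change detection has a reference point.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "a@example.com"), (2, "b@example.com")])

def take_snapshot(conn, table):
    # Baseline: map primary key -> full row, later compared against new reads.
    return {row[0]: row for row in conn.execute(f"SELECT * FROM {table}")}

baseline = take_snapshot(conn, "customers")
```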
2. Monitoring: Once initialized, the CDC solution sets about monitoring the source data system. Again, the monitoring method will depend on the type of change data capture implemented (i.e., trigger-based CDC, polling-based, etc.).
At this point, the solution will detect changes through the desired means as they occur in the source system.
3. Change recording: The CDC solution will then record changes as they’re detected in its change log. Information about each change will typically include the change type, any altered data, corresponding timestamps, and other relevant metadata.
In some systems, the order of changes matters (e.g., financial, customer relationship management, and transactional systems). In these cases, the CDC process can ensure that all identified changes are recorded in the correct sequence. This sequencing within the change log can be crucial for maintaining data integrity in downstream systems.
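One way to model such an ordered change-log record is sketched below; the field names and sequencing scheme are assumptions for illustration, not a standard format:

```python
from dataclasses import dataclass, field
from itertools import count

# Illustrative change-log record: each captured change carries its type, the
# altered data, a timestamp, and a monotonically increasing sequence number so
# downstream systems can replay changes in the correct order.

_sequence = count(1)

@dataclass
class ChangeRecord:
    change_type: str  # "insert", "update", or "delete"
    table: str
    data: dict
    timestamp: str
    sequence: int = field(default_factory=lambda: next(_sequence))

change_log = [
    ChangeRecord("insert", "orders", {"id": 1, "status": "new"}, "2024-01-01T10:00:00"),
    ChangeRecord("update", "orders", {"id": 1, "status": "shipped"}, "2024-01-01T10:05:00"),
]
```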
4. Data propagation: Next, the captured changes need to be transmitted and applied from the source system to the target system(s). This step is crucial in environments where real-time or near-real-time data availability is mission-critical. Therefore, handling this data propagation through an extract, transform, and load (ETL) pipeline process is common.
An ETL pipeline first extracts changes from the change log for transmission. If and as necessary, the pipeline transforms these changes into a format suitable for the destination system(s). Once data type conversions, data enrichment, or other transformations are complete, the ETL pipeline loads the changes into targeted downstream systems.
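The extract–transform–load flow above can be sketched in miniature like this; the record shapes, field names, and in-memory "warehouse" are all illustrative assumptions:

```python
# Minimal propagation sketch: extract changes from an in-memory change log,
# transform them for the destination, and load them into a target store
# modeled as a dict keyed by primary key.

change_log = [
    {"op": "insert", "id": 1, "amount_usd": 10.0},
    {"op": "update", "id": 1, "amount_usd": 12.5},
    {"op": "delete", "id": 2},
]

def transform(change):
    # Example transformation: enrich monetary fields with currency metadata.
    out = dict(change)
    if "amount_usd" in out:
        out["currency"] = "USD"
    return out

def load(target, change):
    # Upsert for inserts/updates; remove the row for deletes.
    if change["op"] == "delete":
        target.pop(change["id"], None)
    else:
        target[change["id"]] = {k: v for k, v in change.items() if k != "op"}

warehouse = {2: {"id": 2, "amount_usd": 5.0}}  # pre-existing target state
for change in change_log:                      # extract -> transform -> load
    load(warehouse, transform(change))
```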
5. Processing in downstream systems: Systems targeted downstream can now integrate these changes into their own datasets. Often, this data integration involves updating existing records, inserting new records, or deleting records.
While not mandatory, it’s also a good idea to validate data once integrated, as this ensures the changes are applied correctly and there are no inconsistencies.
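One cheap way to approximate that validation, sketched under the assumption that rows fit in memory, is to compare an order-independent aggregate rather than diffing every row:

```python
# Hedged validation sketch: compare row count plus a sum of per-row hashes
# between source and target. Summing makes the check order-independent.

def table_checksum(rows):
    return (len(rows), sum(hash(tuple(sorted(r.items()))) for r in rows))

source = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
target = [{"id": 2, "v": "b"}, {"id": 1, "v": "a"}]  # same rows, different order

consistent = table_checksum(source) == table_checksum(target)
```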
6. Feedback and error handling: While the CDC process is technically complete at this point, many solutions also provide feedback mechanisms to available monitoring tools or the source system itself. Valuable feedback provided can include success confirmations, error messages, or statistics about the changes that have been captured.
If any issues do occur during the change data capture process, the CDC solution will also offer mechanisms for handling these errors, ensuring data does not grow inaccurate. Typically manifesting as data transformation errors or connectivity issues, these mechanisms may include options for sending alerts, retrying operations, or logging errors for review.
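A retry mechanism for transient failures might look like the sketch below; the `flaky_load` operation and the backoff policy are assumptions for demonstration:

```python
import time

# Illustrative error handling: retry a transient failure (modeled here as a
# ConnectionError) with exponential backoff before surfacing the error for
# alerting and review.

def with_retries(operation, attempts=3, base_delay=0.01):
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts:
                raise  # retries exhausted: let alerting/logging take over
            time.sleep(base_delay * 2 ** (attempt - 1))

calls = {"n": 0}

def flaky_load():
    # Fails twice with a transient error, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network failure")
    return "loaded"

result = with_retries(flaky_load)
```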
7. Cleanup and maintenance: Regular maintenance is also critical, as change logs can grow in size over time. Log maintenance ensures older entries are archived or purged, which keeps the change log manageable.
The performance of the CDC process should also be monitored over time, confirming it consistently meets latency requirements. Regardless of which change data capture method is ultimately employed, performance monitoring keeps the process from affecting the source system any more than it has to.
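The log-maintenance portion of this step can be sketched as a simple retention policy; the seven-day window and record shape are illustrative assumptions:

```python
from datetime import datetime, timedelta

# Sketch of change-log maintenance: split entries into those within a
# retention window (kept) and those older (archived or purged).

def purge_old_entries(change_log, now, retention=timedelta(days=7)):
    cutoff = now - retention
    keep, archive = [], []
    for entry in change_log:
        (keep if entry["ts"] >= cutoff else archive).append(entry)
    return keep, archive

now = datetime(2024, 1, 10)
log = [
    {"seq": 1, "ts": datetime(2024, 1, 1)},  # older than 7 days -> archived
    {"seq": 2, "ts": datetime(2024, 1, 9)},  # recent -> kept
]
keep, archive = purge_old_entries(log, now)
```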
Some illustrative CDC scenarios
The concept of CDC is pretty simple. But with modern data systems becoming ubiquitous, it’s striking just how many different business functions it benefits.
Real-time data warehousing
Organizations that rely on time-sensitive business intelligence (BI) analytics need to keep their data warehouses updated in real or near-real time.
Change data capture helps make this possible by first monitoring the operational databases for changes. Then, as changes occur, they’re captured and recorded in a change log. These changes can then be propagated to the data warehouse, either in real time or in small batches.
Done properly, decision-makers at these organizations can then access up-to-date dashboards and reports, fueling informed decisions made in the moment.
Master data management (MDM)
Many enterprise organizations have to run multiple systems, and it’s common for each system to keep its own version of business-related data, like customer data. In these situations, change data capture can help when stakeholders need a unified view of their customers.
Here, CDC would monitor each system for any changes to the customer data. Changes made would be captured and could be integrated into a central MDM system. In doing so, change data capture would eliminate discrepancies across systems and improve overall data quality, in addition to enabling a consistent view of customer data.
ETL for data integration
Retail companies commonly need to integrate sales data from their physical and online stores into a central analytics platform. A change data capture solution can monitor both the online store databases and the point of sale (POS) systems in each physical store.
The CDC solution can then capture sales transactions as they occur. Captured changes can then be transported to a central analytics platform through an extract, transform, load (ETL) process, ensuring factors like currency conversions and data enrichment occur before data arrives.
This potent combination of a CDC solution and ETL makes it possible for leadership to get a unified view of sales across all channels.
Database replication and migration
It’s also common for an organization to migrate from old to newer database systems as they scale. Alternatively, business operations may necessitate creating and maintaining a replica of a primary database, usually for reporting or backup purposes.
In either case, the CDC solution monitors the source database for changes, and the captured changes are applied to the target database (or databases), keeping them in sync with the source.
As a result of this sync, database migrations can occur smoothly, without downtime, while teams and stakeholders can be confident any replicas stay up to date.
Auditing and compliance
Financial institutions overwhelmingly need to track changes to all transactions for auditing and regulatory compliance. Change data capture can monitor these millions of transactions as they occur every day.
In the case of highly regulated industries like finance, these captured transactions will then be stored in an immutable audit log. In doing so, an organization is ever-ready to provide an audit trail as needed, ensuring regulatory compliance.
Much the same as the scenarios above, CDC plays a crucial role in making sure data is timely, consistent, and accurate across a variety of systems. What’s more, the ETL example in particular showcases how change data capture can be woven into a traditional data integration process, making it more responsive and better suited to real-time tasks.
5 categories of CDC-friendly tools that can complement your change data capture needs
Several tools and platforms are available for implementing CDC, ranging from database-native features to specialized third-party solutions. But our goal here is to give you fewer things to focus on, not more. With that in mind, here are five broad categories that can enhance any CDC process, with some specific examples of each:
1. ETL tools with CDC capabilities: Tools like Talend, Informatica PowerCenter, and our very own Shipyard can assist with important ETL functionality, as discussed above.
Talend, an open-source data integration platform, offers CDC components for certain databases. PowerCenter is a widely used ETL tool that provides CDC capabilities for efficient, incremental data loads. Shipyard’s always-on monitoring provides immediate alerting, detailed logging, and seamless investigations.
2. Database-native CDC features: It’s much easier these days to also find databases that come with built-in CDC capabilities.
Oracle GoldenGate offers a comprehensive software package for replication and real-time data integration, in addition to CDC capabilities for Oracle databases and other supported platforms.
Microsoft SQL Server provides built-in CDC features that capture insert, update, and delete activity in tables, making the corresponding details available in easily consumed relational tables. MySQL’s binary log contains events that describe database changes and, with the proper tools, can be processed for CDC purposes.
3. Open source tools: While they should be used with caution in some scenarios, open source CDC platforms like Debezium and Maxwell can be leveraged to (in Debezium’s case) stream database changes into Apache Kafka or (with Maxwell) read MySQL BinLogs and write row updates as JSON to Kafka, Kinesis, or other streaming platforms.
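As a concrete (and hedged) illustration, a Debezium MySQL connector is registered with Kafka Connect through a JSON configuration along these lines. The hostnames, credentials, and topic and table names below are placeholders, and exact property names vary across Debezium versions (this sketch follows the 2.x conventions):

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.example.internal",
    "database.port": "3306",
    "database.user": "cdc_user",
    "database.password": "********",
    "database.server.id": "184054",
    "topic.prefix": "inventory",
    "table.include.list": "inventory.orders,inventory.customers",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes.inventory"
  }
}
```

Once registered, the connector reads the MySQL binlog and publishes one Kafka topic per captured table under the configured prefix.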
Alternatively, Apache NiFi, an open source data integration tool, can be configured to capture changes from databases and stream them to a variety of destinations.
4. Commercial solutions: Attunity Replicate (now Qlik Replicate), Striim, and IBM InfoSphere Data Replication are three commercial solutions that can complement a change data capture process.
Replicate provides comprehensive data replications and ingestion solutions that support CDC for an exceptional range of source and target systems.
Striim and IBM InfoSphere Data Replication are both real-time data integration platforms with CDC capabilities. Striim additionally offers stream processing and analytics to its users, while InfoSphere Data Replication provides log-based CDC that supports a variety of source and target platforms.
5. Cloud-native solutions: Cloud-native solutions offer a multitude of benefits that can make your CDC work better and harder while increasing scalability. And these tools may deserve an additional look if they happen to be part of a software ecosystem an organization is already investing in.
Google Cloud Datastream, a serverless CDC and replication service, can stream database changes to Google Cloud services in real time. Azure Data Factory, Microsoft’s cloud-based ETL service, supports CDC for some source databases, allowing for incremental data loads.
Finally, while primarily designed for database migrations, AWS Database Migration Service (DMS) supports ongoing replication using CDC, allowing changes to be captured and replicated to target databases in real time.
Details are good, tools are key, but what does the future hold for CDC?
A wonderful question! The future of change data capture is being shaped by evolving data architectures, increasing data volumes, and the growing need for real-time analytics and decision-making. This is a good reminder that, as managers, we need to look at what the future of CDC may hold, even as we work to get the most out of its current capabilities.
First, we expect integrations with streaming platforms like Apache Kafka and Apache Pulsar to grow tighter as event-driven architectures see wider adoption.
The commoditization of cloud computing should, in turn, drive more CDC solutions to become cloud-native, bringing more seamless integration with cloud services, greater scalability, and better overall effectiveness. And as more organizations adopt multi-cloud and hybrid cloud strategies, change data capture tools will increasingly need to support data synchronization and replication across on-premises and cloud environments.
We should expect to see expanding support for more data sources, increasingly far beyond traditional relational databases. This ever-widening variety of sources will naturally include NoSQL databases, data lakes, and more devices that make up the Internet of Things (IoT).
As more organizations focus on data governance (as they most definitely should) CDC tools should begin to integrate more closely with metadata management solutions. This will ensure changes in data continue to be accompanied by updates to metadata lineage and data quality metrics.
Finally, we should absolutely expect AI and machine learning integrations: advanced CDC solutions may incorporate machine learning algorithms to optimize data transfer, detect anomalies in data changes, and provide predictive analytics on data flows.
That’s a lot to wrap one’s head around. This is exactly why you should take a moment to sign up for our weekly newsletter. Leave the staying-ahead to our team (we love it), and sit back and enjoy a steady stream of insider knowledge and insights regarding CDC, data orchestration, DevOps, and more.