

🔧 How to Prevent Duplication in Data Aggregation with BladePipe


🔗 Source: dev.to

Data aggregation combines data from multiple sources into a single body and presents it in a summarized format. When the data is moved, tables from different sources are likely to share a table name or a primary key/unique key value, which causes duplication in the target instance.

BladePipe can prevent table name conflicts and duplicate primary/unique key values when aggregating data from different sources.

The highlights include:

  • Adding common virtual columns
  • Setting a virtual column as the primary key of the target table
  • Concatenating metadata as target table names
  • No-code intuitive interface

About BladePipe

BladePipe is a real-time end-to-end data replication tool, simplifying your data movement between diverse data sources, including databases, message queues, real-time data warehouses, etc.

Using Change Data Capture (CDC), BladePipe tracks, captures and delivers data changes automatically and accurately with ultra-low latency, which improves the efficiency of data integration. It suits use cases that require real-time data replication, supporting data-driven decision-making and business agility.

Highlights

Adding Various Virtual Columns

For different use cases and data sources, you can generate the following virtual columns in BladePipe, as shown in the table below.

BladePipe also supports setting multiple virtual columns for a table, setting specific virtual columns for certain tables, and batch setting.

| Virtual Column Type | Description | Valid Operations |
| --- | --- | --- |
| Specified Value | Add a new column to the target table with a specified number or string filled in | INSERT |
| Data Sync Time | Add a new column to the target table with the time that the data arrives at BladePipe filled in | INSERT |
| Source InstID_Schema_Table_PKValues | Add a new column to the target table and the values are generated by concatenating the source Instance ID, Schema, Table, and Primary Key | INSERT, UPDATE, DELETE |
| Source InstID_DB_Schema_Table_PKValues | Add a new column to the target table and the values are generated by concatenating the source Instance ID, Catalog, Schema, Table, and Primary Key | INSERT, UPDATE, DELETE |
| Source DB_Schema_Table_PKValues | Add a new column to the target table and the values are generated by concatenating the source Catalog, Schema, Table, and Primary Key | INSERT, UPDATE, DELETE |
| Source Schema_Table_PKValues | Add a new column to the target table and the values are generated by concatenating the source Schema, Table, and Primary Key | INSERT, UPDATE, DELETE |
| Source Table_PKValues | Add a new column to the target table and the values are generated by concatenating the source Table and Primary Key | INSERT, UPDATE, DELETE |
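
To make the concatenated column types above more concrete, here is a minimal Python sketch of how such a value could be assembled. It is not BladePipe's implementation: the function `virtual_key`, the underscore separator, and the instance, schema, and table names are all assumptions chosen for illustration.

```python
from typing import Sequence

# Hypothetical separator: the column type names suggest underscore-joined
# parts, but the exact format BladePipe writes is an assumption here.
SEPARATOR = "_"


def virtual_key(parts: Sequence[str], pk_values: Sequence[object]) -> str:
    """Join source metadata and primary key values into a single string."""
    return SEPARATOR.join([*parts, *(str(v) for v in pk_values)])


# The same schema, table, and auto-increment id in two source instances.
key_eu = virtual_key(["inst_eu", "shop", "orders"], [1001])
key_us = virtual_key(["inst_us", "shop", "orders"], [1001])

print(key_eu)  # inst_eu_shop_orders_1001
print(key_us)  # inst_us_shop_orders_1001
```

Because the instance ID is part of the value, two rows that share a primary key in their own sources still end up with different values in the target column.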

Setting a Virtual Column as the Target Primary Key (Unique Key)

When aggregating data from multiple sources into a single table, conflicts often arise if the table has primary key or unique constraints. A typical example: when MySQL tables with auto-increment primary keys from different regions are combined into one target table, the same key values arrive from several sources and primary key conflicts occur.

To address this issue, BladePipe lets you set a virtual column as the target primary key. By setting Source InstID_Schema_Table_PKValues or another virtual column as the target primary key, you can maintain the primary key or unique key constraint.

You can apply this setting to specific tables or in batch.
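
The conflict and its fix can be sketched in a few lines of Python. This is a simplified model under stated assumptions: a dict stands in for the target table, with the dict key playing the role of the primary key, and the instance names, the `shop_users` table, and the key format are hypothetical. A real target database might reject the duplicate row instead of overwriting it.

```python
# Rows from two hypothetical regional MySQL instances; both start ids at 1.
region_a = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
region_b = [{"id": 1, "name": "carol"}, {"id": 2, "name": "dave"}]
sources = {"inst_a": region_a, "inst_b": region_b}


def aggregate(sources, key_fn):
    """Merge all rows into one dict keyed by the chosen target primary key."""
    target = {}
    for inst_id, rows in sources.items():
        for row in rows:
            # A repeated key overwrites the earlier row in this model.
            target[key_fn(inst_id, row)] = {**row, "source": inst_id}
    return target


# Keyed by the source primary key: rows from inst_b collide with inst_a.
by_source_pk = aggregate(sources, lambda inst, row: row["id"])

# Keyed by a concatenated virtual column (assumed format): no collisions.
by_virtual_pk = aggregate(
    sources, lambda inst, row: f"{inst}_shop_users_{row['id']}"
)

print(len(by_source_pk))   # 2
print(len(by_virtual_pk))  # 4
```

With the source id as the key, only two of the four rows survive. With the virtual column as the key, all four rows are kept and the uniqueness constraint still holds.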

Concatenating Metadata as Target Table Names

During data aggregation, you may need to keep data in separate tables instead of a single table in the target instance, which introduces the problem of table name conflicts.

BladePipe offers several table name concatenation rules for different data sources. The tables are renamed during schema migration, and the metadata mapping applies to both data migration and synchronization.

| Table Name Mapping Rule | Description |
| --- | --- |
| Concatenate by SCHEMA_TABLE (metadata mirroring) | Concatenate the source Schema and Table to form the target table name |
| Concatenate by SCHEMA_TABLE (converted to lowercase) | Concatenate the source Schema and Table and convert the name to lowercase for the target table |
| Concatenate by SCHEMA_TABLE (converted to uppercase) | Concatenate the source Schema and Table and convert the name to uppercase for the target table |
| Concatenate by DB_SCHEMA_TABLE (metadata mirroring) | Concatenate the source Catalog, Schema, and Table to form the target table name |
| Concatenate by DB_SCHEMA_TABLE (converted to lowercase) | Concatenate the source Catalog, Schema, and Table and convert the name to lowercase for the target table |
| Concatenate by DB_SCHEMA_TABLE (converted to uppercase) | Concatenate the source Catalog, Schema, and Table and convert the name to uppercase for the target table |
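
The six rules above come down to two choices: whether the catalog is included, and how the case is handled. The following Python sketch illustrates that logic only. The function `target_table_name`, its `case` parameter, and the underscore separator are assumptions, not BladePipe's API.

```python
from typing import Optional


def target_table_name(
    schema: str,
    table: str,
    catalog: Optional[str] = None,
    case: str = "mirror",
) -> str:
    """Build a target table name from source metadata.

    Leaving catalog as None models the SCHEMA_TABLE rules; passing a
    catalog models the DB_SCHEMA_TABLE rules. The "_" separator is an
    assumption based on the rule names.
    """
    parts = [schema, table] if catalog is None else [catalog, schema, table]
    name = "_".join(parts)
    if case == "mirror":
        return name
    if case == "lower":
        return name.lower()
    if case == "upper":
        return name.upper()
    raise ValueError(f"unknown case rule: {case}")


print(target_table_name("Sales", "Orders"))                # Sales_Orders
print(target_table_name("Sales", "Orders", case="lower"))  # sales_orders
print(target_table_name("Sales", "Orders", catalog="Erp", case="upper"))
# ERP_SALES_ORDERS
```

Two source tables that are both named `Orders` but live in different schemas map to different target names, which is what removes the table name conflict.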

Supported Data Pipelines

The features described above currently apply to the following data pipelines:

| Source | Target |
| --- | --- |
| MySQL | StarRocks, Doris, SelectDB, ClickHouse, MySQL, PostgreSQL, Oracle, Kafka |
| Oracle | StarRocks, Doris, MySQL, Oracle, SQL Server, Kafka |
| SQL Server | MySQL, StarRocks, Kafka |
| PostgreSQL | MySQL |
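
If you plan pipelines in a script, the support matrix above can be encoded as a lookup. This is only a convenience sketch: the `SUPPORTED` mapping copies the table as given in this article, and `is_supported` is a hypothetical helper, not part of BladePipe.

```python
# Source -> supported targets, copied from the table in this article.
SUPPORTED = {
    "MySQL": {
        "StarRocks", "Doris", "SelectDB", "ClickHouse",
        "MySQL", "PostgreSQL", "Oracle", "Kafka",
    },
    "Oracle": {"StarRocks", "Doris", "MySQL", "Oracle", "SQL Server", "Kafka"},
    "SQL Server": {"MySQL", "StarRocks", "Kafka"},
    "PostgreSQL": {"MySQL"},
}


def is_supported(source: str, target: str) -> bool:
    """Return True if the article lists this source/target pair."""
    return target in SUPPORTED.get(source, set())


print(is_supported("MySQL", "StarRocks"))   # True
print(is_supported("PostgreSQL", "Kafka"))  # False
```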

Procedure

Here we build a data pipeline from MySQL to StarRocks, showing how to prevent duplication of table names and primary/unique key values in data aggregation.

Step 1: Install BladePipe

Follow the instructions in Install Worker (Docker) or Install Worker (Binary) to download and install a BladePipe Worker.

Step 2: Add DataSources

  1. Log in to the BladePipe Cloud.
  2. Click DataSource > Add DataSource.
  3. Select the source and target DataSource types, and fill out the setup form for each.

Step 3: Create a DataJob

  1. Click DataJob > Create DataJob.
  2. Select the source and target DataSources, and click Test Connection to confirm that the connections to both the source and target DataSources succeed.
  3. Select Incremental for DataJob Type, together with the Full Data option.
  4. Select the tables. Click Mapping Rules, select SCHEMA_TABLE (mirror), and the target table name will be concatenated based on the rule.

  5. On the Data Processing page, click Batch Operation > Set Virtual Columns, and add the virtual columns as needed.

  6. Click Batch Operation > Set Target Primary Key. In this demonstration, we select vir_instid_sche_tab_pk as the target primary key.

  7. Confirm the DataJob creation.

    Now the DataJob starts. BladePipe will automatically run the following DataTasks:

    • Schema Migration: The schemas of the source tables will be migrated to the target instance.
    • Full Data: All existing data of the source tables will be fully migrated to the target instance.
    • Incremental: Ongoing data changes will be continuously synchronized to the target instance.

Conclusion

BladePipe prevents duplication of table names and primary/unique key values in data aggregation through virtual columns, virtual primary keys, and table name mapping rules, making it easier to aggregate data and make full use of it.
