Lädt...

🔧 25 minutes 👉 23 seconds, Data into an Oracle table — Apache SeaTunnel performance optimization


Nachrichtenbereich: 🔧 Programmierung
🔗 Quelle: dev.to

1. Introduction

In JDBC source batch processing tasks, iftable_pathandpartition_columnare configured, the engine will dynamically partition the data. We can optimize partition intervals by analyzing sample data to avoid data skew issues. However, it has been observed that even whenwhere_conditionis configured, the dynamic partitioning algorithm still partitions the entire table. This leads to excessive time consumption in the partitioning phase, especially when retrieving a small amount of data from a large table. To address this, we need to modify the relevant processes to optimize performance.

All SQL statements in the following explanations use MySQL as an example. Different data sources have specific subclass methods implemented via overwriting.

Key Problem Analysis

  • 55GB MySQL Table Case: Original implementation took*25 minutesto read 1000 rows
    Optimized version reduced to
    23 seconds*(PR #8760)

1.1 Main Process of Data Partitioning

The entry point for dynamic partitioning in the code is thesplitTableIntoChunksmethod in theDynamicChunkSplitterclass. The red-highlighted boxes in the flowchart indicate the parts that need modifications, which will be detailed in the following sub-processes.

1.1.1 Querying Minimum and Maximum Values

It is necessary to include thewhere_conditionconfiguration in the source and append it to the query.

1.1.2 General Column Partitioning

1. Querying the Total Number of Records

  • Add a condition to ensure that the “yes” branch is executed only whenwhere_conditionis empty.
  • Modify the “no” branch to includewhere_conditionvalidation and concatenate it into the corresponding query statement. The query rules are as follows:
  • Ifqueryis configured, use:

SELECT COUNT(*) FROM () T

  • Otherwise, use:

SELECT COUNT(*) FROM

  • Ifwhere_conditionis configured, append it to the end of the query.

2. Partitioning Data Ranges

Refer to the following sub-process for details.

1.1.2.1 Pagination-Based Partitioning

Querying the next partition boundary (**nextChunkEnd**)

  1. Max Query Section
  • Ifwhere_conditionis configured, append it to theLIMITquery layer.
  1. Min Query Section
  • Ifwhere_conditionis configured, append it to the query.

1.1.2.2 Sample-Based Partitioning

  • Ifwhere_conditionis configured, append it to the query.

1.1.3 Date Column Partitioning

This section reuses the logic from*1.1.2.1*, requiring only a single modification.

1.2 Usage of Partitions

This section does not require modifications. The analysis here aims to understand how partitions are utilized, ensuring the correctness, necessity, and risks of the previous modifications.

Once the data is partitioned, it is distributed to the worker’sSourceSeaTunnelTask. Finally, it is used in theopenmethod of theJdbcInputFormatclass. The primary process is as follows:

From the process above, it is evident that itwhere_conditionis appended to the generated SQL in the final step. If itwhere_conditionis not considered during partition generation, some partitions may end up querying no data whenwhere_conditionis applied. When numerous such partitions exist, it not only impacts partitioning performance but also degrades data retrieval performance due to a high volume of ineffective queries.

1.3 Optimization Results

After optimization, local testing showed that the time required to filter and retrieve 1,000 rows from a 55GB MySQL table usingwhere_conditionand write them to an Oracle table reduced from*25 minutes to 23 seconds*.

The corresponding PR for this optimization can be found here:PR #8760.

...

🔧 Apache SeaTunnel optimization strategy for synchronizing MySQL to Doris


📈 37.32 Punkte
🔧 Programmierung

🔧 Building a Trillion-Scale Data Sync System: The Untold Story of Apache SeaTunnel


📈 30.4 Punkte
🔧 Programmierung

🔧 Exploring Data Integration and the Evolution of Apache SeaTunnel Architecture


📈 30.4 Punkte
🔧 Programmierung

🔧 VTS: An Open-Source Vector Data Migration Tool Based on Apache SeaTunnel


📈 30.4 Punkte
🔧 Programmierung

🔧 Achieving True Zero Loss and Zero Duplication: Deep Dive into SeaTunnel's Data Consistency


📈 29.46 Punkte
🔧 Programmierung

🔧 Source Code Analysis of Apache SeaTunnel Zeta Engine (Part 1): Server Initialization


📈 27.77 Punkte
🔧 Programmierung

🔧 Building a Semantic Search System with Apache SeaTunnel and Amazon Bedrock


📈 27.77 Punkte
🔧 Programmierung

🕵️ CVE-2023-49198 | Apache SeaTunnel Web 1.0.0 MySQL URL path traversal


📈 27.77 Punkte
🕵️ Sicherheitslücken

🔧 Seamlessly Merging and Syncing MySQL Databases with Apache SeaTunnel


📈 27.77 Punkte
🔧 Programmierung

🔧 You’re Invited: Apache SeaTunnel Biweekly Community Meeting on April 8, 2025


📈 27.77 Punkte
🔧 Programmierung

🔧 My Journey in the Apache SeaTunnel Community: Contributions, Challenges, and Reflections


📈 27.77 Punkte
🔧 Programmierung

🔧 Apache SeaTunnel Won an Award for "2024 Open Source Community"!


📈 27.77 Punkte
🔧 Programmierung

🔧 November Report on Apache SeaTunnel Community Development


📈 27.77 Punkte
🔧 Programmierung

🔧 October Recap | Apache SeaTunnel Community Updates and Progress


📈 27.77 Punkte
🔧 Programmierung

🔧 The Apache SeaTunnel Community Welcomes A New Committer From India!


📈 27.77 Punkte
🔧 Programmierung

🔧 Welcome, @dailai, to the Apache SeaTunnel Committer team!


📈 27.77 Punkte
🔧 Programmierung

🔧 Apache SeaTunnel 2.3.8 Officially Released!


📈 27.77 Punkte
🔧 Programmierung

🔧 Exciting Updates Coming in Apache SeaTunnel 2.3.8


📈 27.77 Punkte
🔧 Programmierung

🔧 The series on Source Code Analysis of the Apache SeaTunnel Zeta Engine


📈 27.77 Punkte
🔧 Programmierung

🔧 Source Code Analysis of Apache SeaTunnel Zeta Engine (Part 3): Server-Side Task Submission


📈 27.77 Punkte
🔧 Programmierung

🔧 How to Become an Apache SeaTunnel Committer?


📈 27.77 Punkte
🔧 Programmierung

🔧 Source Code Analysis of Apache SeaTunnel Zeta Engine (Part 1): Server Initialization


📈 27.77 Punkte
🔧 Programmierung

🔧 SeaTunnel-Powered Data Integration: How 58 Group Handles Over 500 Billion+ Data Points Daily


📈 27.11 Punkte
🔧 Programmierung

🔧 Building and Comparing Table Data Made Easy with @libs-jd/table-data-kit


📈 24.95 Punkte
🔧 Programmierung

🔧 Building and Comparing Table Data Made Easy with @libs-jd/table-data-kit


📈 24.95 Punkte
🔧 Programmierung

🔧 Synchronizing Data from InfluxDB to Doris with SeaTunnel


📈 24.48 Punkte
🔧 Programmierung

🔧 How to Create a Socket Data Synchronization Job in SeaTunnel


📈 24.48 Punkte
🔧 Programmierung

🔧 Unity in React.js: From 20 Seconds to 1.2 Seconds loading time


📈 24.46 Punkte
🔧 Programmierung

🐧 PSA: I reduced my boot time from ~30 seconds to ~3 seconds by editing a single file (Linux Mint 19.1)


📈 24.46 Punkte
🐧 Linux Tipps

matomo