📚 Mastering Batch Data Processing with Versatile Data Kit (VDK)


💡 News category: AI
🔗 Source: towardsdatascience.com

Data Management

A tutorial on how to use VDK to perform batch data processing

Photo by Mika Baumeister on Unsplash

Versatile Data Kit (VDK) is an open-source data ingestion and processing framework designed to simplify data management complexities. While VDK can handle various data integration tasks, including real-time streaming, this article will focus on how to use it in batch data processing.

This article covers:

  • Introducing Batch Data Processing
  • Creating and Managing Batch Processing Pipelines in VDK
  • Monitoring Batch Data Processing in VDK

1 Introducing Batch Data Processing

Batch data processing is a method for processing large volumes of data at specified intervals. Batch data must be:

  • Time-independent: data doesn’t require immediate processing and is typically not sensitive to real-time requirements. Unlike streaming data, which needs instant processing, batch data can be processed at scheduled intervals or when resources become available.
  • Splittable in chunks: instead of processing an entire dataset in a single, resource-intensive operation, batch data can be divided into smaller, more manageable segments. These segments can then be processed sequentially or in parallel, depending on the capabilities of the data processing system.

In addition, batch data can be processed offline, meaning it doesn’t require a constant connection to data sources or external services. This characteristic is especially valuable when data sources are intermittent or temporarily unavailable.
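To make the chunking idea concrete, here is a minimal, self-contained Python sketch: a dataset is split into fixed-size chunks, and each chunk is processed independently. The dataset and the process_batch function are illustrative placeholders, not VDK APIs.

```python
# Minimal sketch of batch chunking; all names here are illustrative.
records = list(range(10))  # a small stand-in dataset
chunk_size = 4

def process_batch(batch):
    # Stand-in for any per-chunk computation (cleaning, aggregation, ...)
    return sum(batch)

# Chunks can be processed sequentially, as here, or in parallel.
results = [
    process_batch(records[i:i + chunk_size])
    for i in range(0, len(records), chunk_size)
]
print(results)  # [6, 22, 17]
```

Because each chunk is independent, a failure in one chunk does not force reprocessing of the whole dataset.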

ELT (Extract, Load, Transform) is a typical use case for batch data processing. ELT comprises three main phases:

  • Extract (E): data is extracted from multiple sources in different formats, both structured and unstructured.
  • Load (L): data is loaded into a target destination, such as a data warehouse.
  • Transform (T): the extracted data typically requires preliminary processing, such as cleaning, harmonization, and transformations into a common format.
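The three phases can be illustrated with a toy pandas sketch. In a real pipeline the Extract step would call an API and the Load target would be a warehouse table; here both are in-memory stand-ins, and the sample records are invented for the example.

```python
import pandas as pd

# Toy ELT sketch with in-memory stand-ins for the source and the warehouse.
raw = [                                    # Extract: records from a source
    {"title": " Sunflowers ", "year": "1888"},
    {"title": "Irises", "year": "1889"},
]
df = pd.DataFrame(raw)                     # Load: into a tabular destination
df["title"] = df["title"].str.strip()      # Transform: clean into a common format
print(df["title"].tolist())  # ['Sunflowers', 'Irises']
```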

Now that you have learned what batch data processing is, let’s move on to the next step: creating and managing batch processing pipelines in VDK.

2 Creating and Managing Batch Processing Pipelines in VDK

VDK adopts a component-based approach, enabling you to build data processing pipelines quickly. For an introduction to VDK, refer to my previous article, An Overview of Versatile Data Kit. This article assumes that you have already installed VDK on your computer.

To explain how the batch processing pipeline works in VDK, we consider a scenario where you must perform an ELT task.

Imagine you want to ingest and process, in VDK, Vincent Van Gogh’s paintings available in Europeana, a well-known European aggregator for cultural heritage. Europeana provides all cultural heritage objects through its public REST API. Regarding Vincent Van Gogh, Europeana provides more than 700 works.

The following figure shows the steps for batch data processing in this scenario.

Image by Author

Let’s investigate each point separately. You can find the complete code to implement this scenario in the VDK GitHub repository.

2.1 Extract and Load

This phase includes VDK jobs calling the Europeana REST API to extract raw data. Specifically, it defines three jobs:

  • job1 — delete the existing table (if any)
  • job2 — create a new table
  • job3 — ingest table values directly from the REST API.
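job1 and job2 boil down to a DROP and a CREATE statement. As a rough sketch, with sqlite3 standing in for the VDK-managed database (in an actual data job these statements would live in .sql step files or be passed to job_input.execute_query), they might look like:

```python
import sqlite3

# Sketch of job1 and job2; sqlite3 is only a stand-in for the VDK-managed
# database, and the column list mirrors the fields kept later in the pipeline.
conn = sqlite3.connect(":memory:")
conn.execute("DROP TABLE IF EXISTS assets")        # job1: drop the table if any
conn.execute(
    "CREATE TABLE assets ("
    "country TEXT, edmPreview TEXT, provider TEXT, "
    "title TEXT, rights TEXT)"
)                                                  # job2: create a new table
cols = [row[1] for row in conn.execute("PRAGMA table_info(assets)")]
print(cols)  # ['country', 'edmPreview', 'provider', 'title', 'rights']
```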

This example requires an active Internet connection to access the Europeana REST API. This operation is a batch process because it downloads the data only once and does not require streaming.

We’ll store the extracted data in a table. The difficulty of this task lies in building a mapping between the REST API response and the destination table, which is done in job3.

Writing job3 simply involves writing the Python code that performs this mapping, but instead of saving the extracted data to a local file, we call a VDK function (job_input.send_tabular_data_for_ingestion) to ingest it into VDK, as shown in the following snippet of code:

import logging

import pandas as pd
import requests
from vdk.api.job_input import IJobInput

log = logging.getLogger(__name__)


def run(job_input: IJobInput):
    """
    Download datasets required by the scenario and put them in the data lake.
    """
    log.info(f"Starting job step {__name__}")

    api_key = job_input.get_property("api_key")

    start = 1
    rows = 100
    basic_url = f"https://api.europeana.eu/record/v2/search.json?wskey={api_key}&query=who:%22Vincent%20Van%20Gogh%22"
    url = f"{basic_url}&rows={rows}&start={start}"

    response = requests.get(url)
    response.raise_for_status()
    payload = response.json()
    n_items = int(payload["totalResults"])

    while start < n_items:
        if start > n_items - rows:
            rows = n_items - start + 1

        url = f"{basic_url}&rows={rows}&start={start}"
        response = requests.get(url)
        response.raise_for_status()
        payload = response.json()["items"]

        df = pd.DataFrame(payload)
        job_input.send_tabular_data_for_ingestion(
            df.itertuples(index=False),
            destination_table="assets",
            column_names=df.columns.tolist(),
        )
        start = start + rows

For the complete code, refer to the example in GitHub. Please note that you need a free API key to download data from Europeana.
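The pagination logic in the loop above (advance start by rows until all totalResults items are fetched, shrinking the final request) can be isolated and checked without touching the API. batch_windows below is a hypothetical helper, not part of the example's code, that reproduces the same 1-indexed start/rows arithmetic:

```python
def batch_windows(n_items, rows):
    """Yield (start, rows) request windows covering n_items records,
    using the same 1-indexed pagination arithmetic as the job above."""
    start = 1
    while start < n_items:
        if start > n_items - rows:
            rows = n_items - start + 1  # shrink the final window
        yield start, rows
        start += rows

print(list(batch_windows(250, 100)))  # [(1, 100), (101, 100), (201, 50)]
```

The three windows together cover exactly 250 records, with no gaps or overlaps.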

The output produced during the extraction phase is a table containing the raw values.

2.2 Transform

This phase involves cleaning the data and extracting only the relevant information. We can implement it in VDK through two jobs:

  • job4 — delete the existing table (if any)
  • job5 — create the cleaned table.

Job5 simply involves writing an SQL query, as shown in the following snippet of code:

CREATE TABLE cleaned_assets AS (
    SELECT
        SUBSTRING(country, 3, LENGTH(country)-4) AS country,
        SUBSTRING(edmPreview, 3, LENGTH(edmPreview)-4) AS edmPreview,
        SUBSTRING(provider, 3, LENGTH(provider)-4) AS provider,
        SUBSTRING(title, 3, LENGTH(title)-4) AS title,
        SUBSTRING(rights, 3, LENGTH(rights)-4) AS rights
    FROM assets
)
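The SUBSTRING(x, 3, LENGTH(x)-4) pattern keeps characters 3 through LENGTH(x)-2, i.e. it strips the first two and last two characters of each value, which removes the ["…"] wrapping around the raw values. The same cleaning in plain Python (the sample value is illustrative):

```python
# SUBSTRING(x, 3, LENGTH(x)-4) drops the first two and last two characters,
# e.g. the ["..."] wrapping around a raw value; the sample is illustrative.
raw = '["Netherlands"]'
cleaned = raw[2:-2]
print(cleaned)  # Netherlands
```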

Running this job in VDK produces another table, named cleaned_assets, containing the processed values. Finally, we are ready to use the cleaned data somewhere. In our case, we can build a Web app that shows the extracted paintings. You can find the complete code to perform this task in the VDK GitHub repository.

3 Monitoring Batch Data Processing in VDK

VDK provides the VDK UI, a graphical user interface to monitor data jobs. To install the VDK UI, follow the official VDK video. The following figure shows a snapshot of the VDK UI.

Image by Author

There are two main pages:

  • Explore: This page enables you to explore data jobs, such as the job execution success rate, jobs with failed executions in the last 24 hours, and the most failed executions in the last 24 hours.
  • Manage: This page gives more job details. You can order jobs by column, search multiple parameters, filter by some of the columns, view the source for the specific job, add other columns, and so on.

Watch the official VDK video to learn how to use the VDK UI.

Summary

Congratulations! You have just learned how to implement batch data processing in VDK! It only requires ingesting raw data, manipulating it, and, finally, using it for your purposes! You can find many other examples in the VDK GitHub repository.

Stay up-to-date with the latest data processing developments and best practices in VDK. Keep exploring and refining your expertise!

Mastering Batch Data Processing with Versatile Data Kit (VDK) was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

