📚 Mastering Batch Data Processing with Versatile Data Kit (VDK)


💡 News category: AI
🔗 Source: towardsdatascience.com

Data Management

A tutorial on how to use VDK to perform batch data processing

Photo by Mika Baumeister on Unsplash

Versatile Data Kit (VDK) is an open-source data ingestion and processing framework designed to simplify data management complexities. While VDK can handle various data integration tasks, including real-time streaming, this article will focus on how to use it in batch data processing.

This article covers:

  • Introducing Batch Data Processing
  • Creating and Managing Batch Processing Pipelines in VDK
  • Monitoring Batch Data Processing in VDK

1 Introducing Batch Data Processing

Batch data processing is a method for processing large volumes of data at specified intervals. Batch data must be:

  • Time-independent: data doesn’t require immediate processing and is typically not sensitive to real-time requirements. Unlike streaming data, which needs instant processing, batch data can be processed at scheduled intervals or when resources become available.
  • Splittable in chunks: instead of processing an entire dataset in a single, resource-intensive operation, batch data can be divided into smaller, more manageable segments. These segments can then be processed sequentially or in parallel, depending on the capabilities of the data processing system.

In addition, batch data can be processed offline, meaning it doesn’t require a constant connection to data sources or external services. This characteristic is especially valuable when data sources are intermittent or temporarily unavailable.
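To make the chunking idea concrete, here is a minimal, self-contained Python sketch: a dataset is split into fixed-size chunks, and each chunk is processed independently. The dataset and the process_batch function are illustrative placeholders, not VDK APIs.

```python
# Minimal sketch of batch chunking; all names here are illustrative.
records = list(range(10))  # a small stand-in dataset
chunk_size = 4

def process_batch(batch):
    # Stand-in for any per-chunk computation (cleaning, aggregation, ...)
    return sum(batch)

# Chunks can be processed sequentially, as here, or in parallel.
results = [
    process_batch(records[i:i + chunk_size])
    for i in range(0, len(records), chunk_size)
]
print(results)  # [6, 22, 17]
```

Because each chunk is independent, a failure in one chunk does not force reprocessing of the whole dataset.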

ELT (Extract, Load, Transform) is a typical use case for batch data processing. ELT comprises three main phases:

  • Extract (E): data is extracted from multiple sources in different formats, both structured and unstructured.
  • Load (L): data is loaded into a target destination, such as a data warehouse.
  • Transform (T): the extracted data typically requires preliminary processing, such as cleaning, harmonization, and transformations into a common format.
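The three phases can be illustrated with a toy pandas sketch. In a real pipeline the Extract step would call an API and the Load target would be a warehouse table; here both are in-memory stand-ins, and the sample records are invented for the example.

```python
import pandas as pd

# Toy ELT sketch with in-memory stand-ins for the source and the warehouse.
raw = [                                    # Extract: records from a source
    {"title": " Sunflowers ", "year": "1888"},
    {"title": "Irises", "year": "1889"},
]
df = pd.DataFrame(raw)                     # Load: into a tabular destination
df["title"] = df["title"].str.strip()      # Transform: clean into a common format
print(df["title"].tolist())  # ['Sunflowers', 'Irises']
```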

Now that you have learned what batch data processing is, let’s move on to the next step: creating and managing batch processing pipelines in VDK.

2 Creating and Managing Batch Processing Pipelines in VDK

VDK adopts a component-based approach, enabling you to build data processing pipelines quickly. For an introduction to VDK, refer to my previous article, An Overview of Versatile Data Kit. This article assumes that you have already installed VDK on your computer.

To explain how the batch processing pipeline works in VDK, we consider a scenario where you must perform an ELT task.

Imagine you want to ingest and process, in VDK, Vincent Van Gogh’s paintings available in Europeana, a well-known European aggregator for cultural heritage. Europeana provides all cultural heritage objects through its public REST API. Regarding Vincent Van Gogh, Europeana provides more than 700 works.

The following figure shows the steps for batch data processing in this scenario.

Image by Author

Let’s investigate each point separately. You can find the complete code to implement this scenario in the VDK GitHub repository.

2.1 Extract and Load

This phase includes VDK jobs calling the Europeana REST API to extract raw data. Specifically, it defines three jobs:

  • job1 — delete the existing table (if any)
  • job2 — create a new table
  • job3 — ingest table values directly from the REST API.
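job1 and job2 boil down to a DROP and a CREATE statement. As a rough sketch, with sqlite3 standing in for the VDK-managed database (in an actual data job these statements would live in .sql step files or be passed to job_input.execute_query), they might look like:

```python
import sqlite3

# Sketch of job1 and job2; sqlite3 is only a stand-in for the VDK-managed
# database, and the column list mirrors the fields kept later in the pipeline.
conn = sqlite3.connect(":memory:")
conn.execute("DROP TABLE IF EXISTS assets")        # job1: drop the table if any
conn.execute(
    "CREATE TABLE assets ("
    "country TEXT, edmPreview TEXT, provider TEXT, "
    "title TEXT, rights TEXT)"
)                                                  # job2: create a new table
cols = [row[1] for row in conn.execute("PRAGMA table_info(assets)")]
print(cols)  # ['country', 'edmPreview', 'provider', 'title', 'rights']
```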

This example requires an active Internet connection to access the Europeana REST API. This operation is a batch process because it downloads the data only once and does not require streaming.

We’ll store the extracted data in a table. The difficulty of this task lies in building a mapping between the REST API response and the destination table, which is done in job3.

Writing job3 simply involves writing the Python code that performs this mapping, but instead of saving the extracted data to a local file, we call a VDK function (job_input.send_tabular_data_for_ingestion) to ingest it into VDK, as shown in the following snippet of code:

import logging

import pandas as pd
import requests
from vdk.api.job_input import IJobInput

log = logging.getLogger(__name__)


def run(job_input: IJobInput):
    """
    Download datasets required by the scenario and put them in the data lake.
    """
    log.info(f"Starting job step {__name__}")

    api_key = job_input.get_property("api_key")

    start = 1
    rows = 100
    basic_url = f"https://api.europeana.eu/record/v2/search.json?wskey={api_key}&query=who:%22Vincent%20Van%20Gogh%22"
    url = f"{basic_url}&rows={rows}&start={start}"

    response = requests.get(url)
    response.raise_for_status()
    payload = response.json()
    n_items = int(payload["totalResults"])

    while start < n_items:
        if start > n_items - rows:
            rows = n_items - start + 1

        url = f"{basic_url}&rows={rows}&start={start}"
        response = requests.get(url)
        response.raise_for_status()
        payload = response.json()["items"]

        df = pd.DataFrame(payload)
        job_input.send_tabular_data_for_ingestion(
            df.itertuples(index=False),
            destination_table="assets",
            column_names=df.columns.tolist(),
        )
        start = start + rows

For the complete code, refer to the example in GitHub. Please note that you need a free API key to download data from Europeana.
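The pagination logic in the loop above (advance start by rows until all totalResults items are fetched, shrinking the final request) can be isolated and checked without touching the API. batch_windows below is a hypothetical helper, not part of the example's code, that reproduces the same 1-indexed start/rows arithmetic:

```python
def batch_windows(n_items, rows):
    """Yield (start, rows) request windows covering n_items records,
    using the same 1-indexed pagination arithmetic as the job above."""
    start = 1
    while start < n_items:
        if start > n_items - rows:
            rows = n_items - start + 1  # shrink the final window
        yield start, rows
        start += rows

print(list(batch_windows(250, 100)))  # [(1, 100), (101, 100), (201, 50)]
```

The three windows together cover exactly 250 records, with no gaps or overlaps.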

The output produced during the extraction phase is a table containing the raw values.

2.2 Transform

This phase involves cleaning the data and extracting only the relevant information. We can implement it in VDK through two jobs:

  • job4 — delete the existing table (if any)
  • job5 — create the cleaned table.

Job5 simply involves writing an SQL query, as shown in the following snippet of code:

CREATE TABLE cleaned_assets AS (
    SELECT
        SUBSTRING(country, 3, LENGTH(country)-4) AS country,
        SUBSTRING(edmPreview, 3, LENGTH(edmPreview)-4) AS edmPreview,
        SUBSTRING(provider, 3, LENGTH(provider)-4) AS provider,
        SUBSTRING(title, 3, LENGTH(title)-4) AS title,
        SUBSTRING(rights, 3, LENGTH(rights)-4) AS rights
    FROM assets
)
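The SUBSTRING(x, 3, LENGTH(x)-4) pattern keeps characters 3 through LENGTH(x)-2, i.e. it strips the first two and last two characters of each value, which removes the ["…"] wrapping around the raw values. The same cleaning in plain Python (the sample value is illustrative):

```python
# SUBSTRING(x, 3, LENGTH(x)-4) drops the first two and last two characters,
# e.g. the ["..."] wrapping around a raw value; the sample is illustrative.
raw = '["Netherlands"]'
cleaned = raw[2:-2]
print(cleaned)  # Netherlands
```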

Running this job in VDK produces another table, named cleaned_assets, containing the processed values. Finally, we are ready to use the cleaned data somewhere. In our case, we can build a Web app that shows the extracted paintings. You can find the complete code to perform this task in the VDK GitHub repository.

3 Monitoring Batch Data Processing in VDK

VDK provides the VDK UI, a graphical user interface to monitor data jobs. To install the VDK UI, follow the official VDK video. The following figure shows a snapshot of the VDK UI.

Image by Author

There are two main pages:

  • Explore: This page enables you to explore data jobs, such as the job execution success rate, jobs with failed executions in the last 24 hours, and the most failed executions in the last 24 hours.
  • Manage: This page gives more job details. You can order jobs by column, search multiple parameters, filter by some of the columns, view the source for the specific job, add other columns, and so on.

Watch the official VDK video to learn how to use the VDK UI.

Summary

Congratulations! You have just learned how to implement batch data processing in VDK! It only requires ingesting raw data, manipulating it, and, finally, using it for your purposes! You can find many other examples in the VDK GitHub repository.

Stay up-to-date with the latest data processing developments and best practices in VDK. Keep exploring and refining your expertise!

Mastering Batch Data Processing with Versatile Data Kit (VDK) was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

