How to bring custom ML Models into OpenMetadata
News category: AI News
Source: towardsdatascience.com
Build custom CICD pipelines to put your ML assets on the map
OpenMetadata is more than a data catalog. Built on standard definitions and APIs, the catalog is just one of many applications exploiting the metadata of your platform. Since the beginning, the goal of OpenMetadata has been to solve the metadata problem in the industry. By not having to figure out essential components such as metadata ingestion, or how to bring collaboration back into data, teams can focus on improving their processes and automations.
This post aims to showcase how we can integrate multiple metadata sources, both from existing services and in-house solutions. With every action being powered by APIs, there is no difference between metadata coming from featured connectors such as Postgres or being sent via the Python SDK. This high degree of flexibility allows us to explore the metadata from custom-built ML Models and the tables feeding their features.
If this sounds interesting, follow the steps with the material in this repository.
OpenMetadata and ML
One of the main challenges of the ML model lifecycle is closing the gap between the ML model and the Data Platform. We have tools that help us train, test, tune, and deploy ML models, but those tools rarely put ML Models in the context of the platform they live in.
How all the pieces fit together is information that is usually held by Data Scientists or ML Engineers but hardly ever shared. Typical causes are:
- No generic approach to how to define and maintain the metadata.
- No central place to publish the results for users to explore.
- The lack of clarity and the work involved in the previous two points make it hard to justify the benefits and measure the impact.
In this demo, we’ll follow a use case where:
- We have an ML model using features from Postgres,
- The model is regularly updated and deployed,
- The documentation of the model is hosted as code,
- We’ll use OpenMetadata’s Python SDK to create the ML Model assets and push them to OpenMetadata.
Getting our models into OpenMetadata helps us share the documentation, keep track of metadata changes and versioning, discover lineage with the sources, and drive discussions and collaboration. A few quick wins from bringing this holistic view of ML and AI assets are:
- Teams can quickly start to collaborate instead of trying to reach similar outcomes in different ways.
- Knowledge about the most-used features can be gathered, as a first step toward building a Feature Store with the highest possible value for the whole organization.
- ML teams can start building Data Quality tests and alerts directly in OpenMetadata to prevent feature drifts and performance decreases.
Ingesting Postgres metadata
The first step will be ingesting Postgres metadata, since that is where the sources for the ML features live. You can follow these steps to configure and deploy the Postgres metadata ingestion.
The OpenMetadata UI will guide us through the two main steps:
- Creating the Database Service: A service represents the source system we want to ingest. Here is where we will define the connection to Postgres, and this service will hold the assets that will be sent to OpenMetadata: databases, schemas, and tables.
- Creating and deploying the Ingestion Pipelines: which are internally handled by OpenMetadata using the Ingestion Framework, a Python library holding the logic to connect to multiple sources, translate their original metadata into the OpenMetadata standard, and send it to the server using the APIs.
What’s interesting here is that the Ingestion Framework package can be directly used to configure and host the ingestion processes. Moreover, any operation in the UI or in the Ingestion Framework is entirely open and supported by the server APIs. This means full automation possibilities for any metadata-related activity, which can be achieved directly via REST or the OpenMetadata SDKs.
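As a rough sketch, the workflow configuration consumed by the Ingestion Framework looks like the following, shown here as a Python dict (the same structure is usually kept as YAML). The host, credentials, and the `postgres_service` name are placeholders, and exact field names can vary between OpenMetadata releases, so check the documentation for your version:

```python
import json

# Illustrative ingestion workflow config for a Postgres source.
# All values are placeholders, not working credentials.
postgres_ingestion = {
    "source": {
        "type": "postgres",
        "serviceName": "postgres_service",  # how the service appears in the catalog
        "serviceConnection": {
            "config": {
                "type": "Postgres",
                "hostPort": "localhost:5432",   # placeholder
                "username": "openmetadata_user",  # placeholder
                "database": "ecommerce",          # placeholder
            }
        },
        # DatabaseMetadata tells the framework to extract schemas and tables
        "sourceConfig": {"config": {"type": "DatabaseMetadata"}},
    },
    # The metadata-rest sink pushes the results to the server APIs
    "sink": {"type": "metadata-rest", "config": {}},
    "workflowConfig": {
        "openMetadataServerConfig": {
            "hostPort": "http://localhost:8585/api",
            "authProvider": "openmetadata",
        }
    },
}

print(json.dumps(postgres_ingestion, indent=2))
```

The same dict could be serialized back to YAML and handed to the framework's CLI or scheduled from Airflow, which is what the UI does for us behind the scenes.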
These are the capabilities we will exploit next when creating the CICD process.
Building a CICD pipeline
In the discussion above, we highlighted two pains that usually become blockers to maintaining updated ML models’ metadata: No generic metadata definition and no single place to publish it. Thankfully, OpenMetadata takes care of both of these aspects.
The missing piece for building a successful process? It should be simple to maintain and evolve. That’s why we base our example on a YAML file checked into the code repository. Data Scientists and ML Engineers can then rely on their deployment pipelines to also update the metadata of their fresh production model.
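As a sketch, such a YAML file might look like the one below. The schema is entirely illustrative — the key names, feature names, and the `postgres_service.ecommerce...` fully qualified source names are assumptions for this example, not a format mandated by OpenMetadata:

```yaml
# Hypothetical model metadata file, maintained next to the model code.
name: revenue_predictions
displayName: Revenue Predictions
description: Regression model predicting next-quarter revenue.
algorithm: GradientBoostingRegressor
target: revenue
features:
  - name: customer_tier
    dataType: categorical
    source: postgres_service.ecommerce.public.customers.tier
  - name: avg_order_value
    dataType: numerical
    source: postgres_service.ecommerce.public.orders.amount
hyperparameters:
  - name: learning_rate
    value: "0.05"
```

Keeping the `source` entries as fully qualified table-column names is what later lets OpenMetadata draw the lineage between the model and the ingested Postgres tables.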
The CICD process will then have a specific step that will:
- Read the YAML file with the metadata,
- Translate the structure of the YAML to the ML Model Entity definition from the OpenMetadata standard,
- Push the ML Model asset into OpenMetadata using the Python SDK.
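The translation step above can be sketched in plain Python. To keep the example self-contained, the YAML is assumed to be already loaded into a dict (in the real pipeline, `yaml.safe_load` would do that), and the helper name `to_mlmodel_payload` is ours; the `mlFeatures`/`featureSources` field names follow the OpenMetadata MlModel schema as we understand it, so verify them against the version of the standard you run:

```python
import json

# In the real CICD step this dict would come from yaml.safe_load(open("model.yaml"))
model_meta = {
    "name": "revenue_predictions",
    "algorithm": "GradientBoostingRegressor",
    "features": [
        {"name": "avg_order_value", "dataType": "numerical",
         "source": "postgres_service.ecommerce.public.orders.amount"},
    ],
    "hyperparameters": [{"name": "learning_rate", "value": "0.05"}],
}

def to_mlmodel_payload(meta: dict) -> dict:
    """Map our illustrative YAML structure onto an MlModel create-request shape."""
    return {
        "name": meta["name"],
        "algorithm": meta["algorithm"],
        "mlFeatures": [
            {
                "name": f["name"],
                "dataType": f["dataType"],
                # featureSources ties the feature back to the ingested table column,
                # which is what powers the lineage view
                "featureSources": [{
                    "name": f["source"].rsplit(".", 1)[-1],       # column name
                    "dataSource": f["source"].rsplit(".", 1)[0],  # table FQN
                }],
            }
            for f in meta["features"]
        ],
        "mlHyperParameters": meta["hyperparameters"],
    }

payload = to_mlmodel_payload(model_meta)
print(json.dumps(payload, indent=2))
# The actual push would hand this payload to the Python SDK client
# (e.g. a create-or-update call with an MlModel create request) -- see the
# repository linked above for the working version.
```

Because the step is just "read, map, POST", it drops naturally into any CI system as one extra job after the model deployment succeeds.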
Here you will find an example of such a pipeline. Hopefully, that will help you start putting your ML assets on the map!
After the script has finished running, we’ll see our Revenue Predictions model in OpenMetadata.
One key benefit of having the metadata available in the platform is being able to see the lineage information between our models and the sources containing the features. In our example, we already ingested the Postgres metadata. Then, if we check the Lineage tab, we’ll be able to see all of our models’ dependencies.
Summary
In this post, we have:
- Discussed the industry’s need for a common approach to defining, ingesting, and exploiting metadata, and how OpenMetadata covers it.
- Ingested Postgres metadata directly from the UI.
- Built a CICD process that pushes custom-built ML Model metadata during the release process.
Putting ML Models into the context of the Data Platform has essential benefits, such as exploring dependencies and fueling collaboration. If you need a simple approach to putting your ML assets on the map, OpenMetadata has you covered.
How to bring custom ML Models into OpenMetadata was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.