Data Observability & Discovery Platform: OpenMetadata
Amit Singh Rathore · Follow
Published in Geek Culture
3 min read · Aug 16, 2022


Managing Data about data

Data discovery is a crucial first step in the data consumption workflow. It
answers questions about a dataset such as where it comes from, where it is
stored, what it means, how recent and relevant it is, how others use it, and
how it came into its current form (lineage). That makes data discovery an
essential part of a data platform.

A metadata platform needs four major capabilities: search (e.g. Solr),
attribute lookup (databases), entity relations (graph databases), and regular
refresh of metadata (schedulers/queues). Depending on the tools they chose for
each capability, several companies have built their own metadata platforms. A
few of the major ones are Amundsen, DataHub, Atlas, Metacat, Databook, and
Marquez. Each product has its own way of collecting metadata and its own
specification for it. Some support a broad set of sources, while others have
very limited integrations.

In general, the catalog/metadata segment of the data platform has the following
shortcomings.

1. Non-standardized metadata collection

2. Incompatibility of data catalogs (the need to recollect data)

3. Limited, not truly company-wide end-to-end data lineage

4. Absent or insufficient data quality and observability

5. Undiscoverable ML assets

An open standard for collecting metadata could be a sound solution to the lack
of efficient discovery and observability, and a solid foundation for the
next-gen data platform.

Open Data Discovery Specification (ODD Spec) is an attempt at creating an
open-source, industry-wide metadata standard that would enable engineers to
collect and export metadata from cloud-native applications, infrastructures, and
other data sources.

OpenMetadata
OpenMetadata is touted as an open standard for metadata: a single place to
discover, collaborate on, and get your data right.

OpenMetadata publishes its own specification. Each schema definition maps to a
data/asset entity type.

Five major Pillars


OpenMetadata takes a schema-first approach to metadata, built on JSON Schema.
Metadata schemas define the core abstractions and vocabulary for metadata, with
schemas for Types, Entities, and Relationships between entities. This is the
foundation of the Open Metadata Standard.
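
To make the schema-first idea concrete, here is a minimal sketch in plain Python. The schema below is a hypothetical, heavily simplified "table" entity, not the actual OpenMetadata schema, and the `validate` helper is an illustrative stand-in that checks only required keys and top-level types, a small subset of what a full JSON Schema validator does.

```python
# Hypothetical, heavily simplified schema for a "table" entity.
# It is NOT the real OpenMetadata schema; field names are illustrative.
TABLE_SCHEMA = {
    "title": "Table",
    "type": "object",
    "required": ["id", "name", "columns"],
    "properties": {
        "id": {"type": "string"},
        "name": {"type": "string"},
        "description": {"type": "string"},
        "columns": {"type": "array"},
    },
}

# Map the JSON Schema type names used above to Python types.
_PY_TYPES = {"string": str, "array": list, "object": dict}


def validate(entity, schema):
    """Return a list of problems; an empty list means the entity is valid.

    Only required keys and top-level types are checked here.
    """
    errors = []
    for key in schema.get("required", []):
        if key not in entity:
            errors.append(f"missing required field: {key}")
    for key, value in entity.items():
        spec = schema["properties"].get(key)
        if spec is None:
            errors.append(f"unknown field: {key}")
        elif not isinstance(value, _PY_TYPES[spec["type"]]):
            errors.append(f"{key} should be of type {spec['type']}")
    return errors
```

Because every producer and consumer validates against the same schema, a record such as `{"id": "1", "name": "orders", "columns": []}` passes, while one with a numeric `id` or a missing `columns` field is rejected before it reaches the store.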

SAML-protected metadata APIs: for producing and consuming metadata, built on
the schemas, for user interfaces and for integrating tools, systems, and
services.

Metadata store: organizes the entity and relationship graph that connects data
assets with user-generated and tool-generated metadata.
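
The idea of an entity and relationship graph can be sketched with a toy in-memory store. This is only an illustration under stated assumptions: the class name `MetadataGraph`, its methods, and the relation names `feeds` and `owns` are all hypothetical, and it says nothing about how OpenMetadata actually persists the graph.

```python
from collections import defaultdict, deque


class MetadataGraph:
    """Toy in-memory store of entities and typed, directed relationships."""

    def __init__(self):
        self.entities = {}              # entity id -> attribute dict
        self.edges = defaultdict(list)  # entity id -> [(relation, target id)]

    def add_entity(self, entity_id, **attributes):
        self.entities[entity_id] = attributes

    def relate(self, source_id, relation, target_id):
        # Refuse edges that point at entities the store does not know about.
        if source_id not in self.entities or target_id not in self.entities:
            raise KeyError("both entities must exist before relating them")
        self.edges[source_id].append((relation, target_id))

    def downstream(self, start_id, relation):
        """Breadth-first walk that follows edges of one relation type."""
        seen, order, queue = {start_id}, [], deque([start_id])
        while queue:
            current = queue.popleft()
            for rel, target in self.edges.get(current, []):
                if rel == relation and target not in seen:
                    seen.add(target)
                    order.append(target)
                    queue.append(target)
        return order
```

With tables `raw.orders` and `clean.orders`, a dashboard `sales_dashboard`, and `feeds` edges between them in that order, `downstream("raw.orders", "feeds")` returns the lineage `["clean.orders", "sales_dashboard"]`. An `owns` edge from a user to the same table lives in the same graph without affecting that lineage walk.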

Ingestion framework: a pluggable framework for integrating tools and ingesting
metadata into the metadata store. It already supports 50+ well-known data
warehouses (Google BigQuery, Snowflake, Amazon Redshift, Apache Druid, and
Apache Hive) and databases (MySQL, Postgres, Oracle, and MSSQL). It also has
connectors for Airbyte, Airflow, and dbt.
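
What "pluggable" means here can be sketched as a small connector registry. Everything in this example is hypothetical (the `connector` decorator, the `demo-warehouse` source, and `run_ingestion`); it is not the API of the real OpenMetadata ingestion package, only an illustration of the source-to-store pattern.

```python
from typing import Callable, Dict, Iterable, List

# Registry of connector name -> function that yields metadata records.
# All names are hypothetical and unrelated to the real ingestion package.
CONNECTORS: Dict[str, Callable[[dict], Iterable[dict]]] = {}


def connector(name: str):
    """Decorator that registers a source connector under a name."""
    def register(func):
        CONNECTORS[name] = func
        return func
    return register


@connector("demo-warehouse")
def demo_warehouse(config: dict) -> Iterable[dict]:
    # A real connector would query the source system's catalog;
    # this one reads table names from the config for illustration.
    for table in config.get("tables", []):
        yield {"service": config["service"], "table": table}


def run_ingestion(source: str, config: dict, sink: List[dict]) -> int:
    """Pull records from the named connector and append them to the sink."""
    if source not in CONNECTORS:
        raise ValueError(f"no connector registered for {source!r}")
    count = 0
    for record in CONNECTORS[source](config):
        sink.append(record)
        count += 1
    return count
```

Adding support for a new source then means registering one more function, while the ingestion loop and the store stay unchanged. In a deployment, a scheduler such as Airflow would trigger a run like `run_ingestion("demo-warehouse", config, store)` on a regular cadence.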

OpenMetadata user interface: an easy-to-use interface for users to discover and
collaborate on all data.

OpenMetadata components
Server: UI and API

Elasticsearch: search and analytics engine

MySQL: storage layer for entities, their attributes, and relationships

Ingestion: Airflow

OpenMetadata features
Support for personas using RBAC

Support for keyword and advanced search

Support for table, column, and pipeline lineage

Providing usage metadata

Support for entities like topics, dashboards, and pipelines

Support for custom labels for asset importance

Support for Glossary: a universal language to define, standardize, and
contextualize data assets

Activity feeds: show all change events linked to assets in a single view

Task workflow for raising requests to data owners for any changes

Quality, profiler, and metrics: quality tests supported by Great Expectations,
dbt, or other data quality tools

Metadata versioning

Happy cataloging!!!!
