Data Observability & Discovery Platform - OpenMetadata - by Amit Singh Rathore - Geek Culture - Medium

Data Observability & Discovery Platform: OpenMetadata
Amit Singh Rathore
Published in Geek Culture
3 min read · Aug 16, 2022
Managing Data about data
Data discovery is a crucial first step in the data consumption workflow. It
answers questions about a dataset: what its source is, where it is stored, what
it means, how recent and relevant it is, how others use it, and how it came
into its current form (lineage). This makes data discovery an essential part of
a data platform.
A metadata platform needs four major capabilities: search (e.g. Solr),
attribute lookup (databases), entity relationships (graph databases), and
regular refresh of metadata (schedulers/queues). Depending on the tools they
selected for each, multiple companies have built their own versions of metadata
platforms. A few of the major ones are Amundsen, DataHub, Atlas, Metacat,
Databook, and Marquez. Each product has its own way and specification of
collecting metadata. Some support a fair number of sources, while others have
very limited integration.
In general, the catalog/metadata segment of the data platform has the following
shortcomings.
1. Non-standardized metadata collection
2. Incompatibility of data catalogs (the need to recollect data)
3. Limited, not truly company-wide end-to-end data lineage
4. Absent or insufficient data quality and observability
5. Undiscoverable ML assets
An open standard for collecting metadata could be a sound solution to the lack
of efficient discovery and observability, and a solid foundation for the
next-gen data platform.
Open Data Discovery Specification (ODD Spec) is an attempt at creating an
open-source, industry-wide metadata standard that would enable engineers to
collect and export metadata from cloud-native applications, infrastructures, and
other data sources.
OpenMetadata
OpenMetadata describes itself as an open standard for metadata: a single place
to discover, collaborate, and get your data right.
OpenMetadata publishes its own specification. Each schema definition is mapped
to a data/asset entity type.
Five major pillars

Metadata schemas: OpenMetadata takes a JSON Schema-first approach to metadata.
Metadata schemas define the core abstractions and vocabulary for metadata, with
schemas for Types, Entities, and Relationships between entities. This is the
foundation of the Open Metadata Standard.
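To make the schema-first idea concrete, here is a minimal sketch in Python. The schema below is a hypothetical, heavily simplified "table" entity written in the style of JSON Schema; it is not copied from the OpenMetadata specification, and the tiny validator only checks required fields and basic types, which is far less than a real JSON Schema validator does.

```python
from typing import List

# Hypothetical, simplified schema for a "table" entity, in JSON Schema style.
# Field names are illustrative assumptions, not the OpenMetadata spec.
TABLE_SCHEMA = {
    "title": "Table",
    "type": "object",
    "required": ["id", "name", "columns"],
    "properties": {
        "id": {"type": "string"},
        "name": {"type": "string"},
        "description": {"type": "string"},
        "columns": {"type": "array"},
    },
}

# Map JSON Schema type names to Python types for this toy validator.
_JSON_TYPES = {"string": str, "object": dict, "array": list}


def validate(entity: dict, schema: dict) -> List[str]:
    """Return a list of validation errors; an empty list means the entity conforms."""
    errors = []
    # Every required field must be present.
    for field in schema.get("required", []):
        if field not in entity:
            errors.append(f"missing required field: {field}")
    # Every present field must have the declared type.
    for field, rule in schema.get("properties", {}).items():
        if field in entity and not isinstance(entity[field], _JSON_TYPES[rule["type"]]):
            errors.append(f"{field}: expected {rule['type']}")
    return errors
```

Because the schema is plain data, the same definition can drive validation, documentation, and client code generation, which is the main appeal of starting from schemas rather than from code.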
SAML-protected metadata APIs: APIs for producing and consuming metadata, built
on the schemas, for user interfaces and for integrating tools, systems, and
services.
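As a rough illustration of consuming metadata over HTTP, the sketch below only builds a request object and does not send it. The host and port, the `/api/v1/tables` path, the `limit` parameter, and the bearer-token header are all assumptions made for illustration; check the API documentation of your deployed version, and note that your deployment's authentication setup may differ.

```python
import urllib.request


def build_list_tables_request(base_url: str, token: str, limit: int = 10) -> urllib.request.Request:
    """Build (but do not send) a GET request that lists table entities.

    The endpoint path and auth header are assumptions, not a confirmed contract.
    """
    url = f"{base_url.rstrip('/')}/api/v1/tables?limit={limit}"
    headers = {
        "Authorization": f"Bearer {token}",  # assumed token-based auth
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers)


# Example: a request against a hypothetical local server.
request = build_list_tables_request("http://localhost:8585/", token="my-token")
```

Passing the request to `urllib.request.urlopen` would perform the call; it is left out here so the sketch runs without a server.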
Metadata store: organizes the entity and relationship graph that connects data
assets with user-generated and tool-generated metadata.
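The entity and relationship graph can be pictured with a small in-memory sketch. The class, the relation names ("feeds", "owns"), and the entity ids are hypothetical; the actual store persists entities and relationships in a database rather than in Python dictionaries.

```python
from collections import defaultdict, deque


class MetadataGraph:
    """Toy entity/relationship graph; an illustration, not the real store."""

    def __init__(self):
        self.entities = {}               # entity id -> attributes
        self.edges = defaultdict(list)   # source id -> [(relation, target id)]

    def add_entity(self, entity_id, entity_type, **attributes):
        self.entities[entity_id] = {"type": entity_type, **attributes}

    def relate(self, source_id, relation, target_id):
        self.edges[source_id].append((relation, target_id))

    def downstream(self, entity_id, relation="feeds"):
        """Breadth-first walk over one relation type, e.g. to trace lineage."""
        seen, order = {entity_id}, []
        queue = deque([entity_id])
        while queue:
            current = queue.popleft()
            for rel, target in self.edges.get(current, []):
                if rel == relation and target not in seen:
                    seen.add(target)
                    order.append(target)
                    queue.append(target)
        return order


graph = MetadataGraph()
graph.add_entity("raw_orders", "table")
graph.add_entity("clean_orders", "table")
graph.add_entity("sales_dashboard", "dashboard")
graph.add_entity("alice", "user")
graph.relate("raw_orders", "feeds", "clean_orders")
graph.relate("clean_orders", "feeds", "sales_dashboard")
graph.relate("alice", "owns", "raw_orders")
```

Keeping ownership and lineage as typed edges in one graph is what lets a single query answer both "who owns this?" and "what breaks if this table changes?".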
Ingestion framework: a pluggable framework for integrating tools and ingesting
metadata into the metadata store. It already supports 50+ well-known sources,
including data warehouses (Google BigQuery, Snowflake, Amazon Redshift, Apache
Druid, and Apache Hive) and databases (MySQL, Postgres, Oracle, and MSSQL). It
also has connectors for Airbyte, Airflow, and dbt.
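"Pluggable" here roughly means that each source is a connector registered under a name, and the framework only knows how to run a connector and push its records to a sink. The sketch below shows that pattern in plain Python; the registry, the decorator, and the `static-demo` connector are hypothetical and are not the OpenMetadata ingestion API.

```python
from typing import Callable, Dict, Iterable, List

# Registry of connector name -> function that yields metadata records.
CONNECTORS: Dict[str, Callable[..., Iterable[dict]]] = {}


def connector(name: str):
    """Decorator that registers a source connector under a name."""
    def register(func):
        CONNECTORS[name] = func
        return func
    return register


@connector("static-demo")
def static_demo_source(tables: List[str]) -> Iterable[dict]:
    """Hypothetical connector that emits one record per configured table name."""
    for table in tables:
        yield {"entityType": "table", "name": table}


def ingest(source_name: str, sink: List[dict], **config) -> int:
    """Run the named connector and append its records to the sink."""
    count = 0
    for record in CONNECTORS[source_name](**config):
        sink.append(record)
        count += 1
    return count


store: List[dict] = []
ingested = ingest("static-demo", store, tables=["orders", "users"])
```

Adding a new source then means writing one more decorated function; the `ingest` loop and the sink stay unchanged.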
OpenMetadata user interface: an easy-to-use interface for users to discover and
collaborate on all data.
OpenMetadata components
Server: UI & API
Elasticsearch: search & analytics engine
MySQL: storage layer for entities, their attributes & relationships
Ingestion: Airflow
OpenMetadata features
Support for personas using RBAC
Support for keyword & advanced search
Support for table, column & pipeline lineage
Providing usage metadata
Support for entities like topics, dashboards, and pipelines
Support for custom Labels for asset importance
Support for Glossary: a universal language to define, standardize, and
contextualize data assets
Activity Feeds: show all change events linked to assets in a single view
Task workflow for raising requests to data owners for any changes
Quality, profiler, and metrics: quality tests supported by Great Expectations,
dbt, or other data quality tools
Metadata versioning
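To give a feel for what the profiler and quality features in the list above compute, here is a minimal sketch of a column profile and one test built on it. The function names, the metrics chosen, and the threshold are hypothetical; real setups would typically delegate such tests to a tool like Great Expectations or dbt.

```python
def profile_column(values):
    """Compute a few basic profile metrics for one column's values."""
    total = len(values)
    nulls = sum(1 for v in values if v is None)
    non_null = [v for v in values if v is not None]
    return {
        "row_count": total,
        "null_ratio": nulls / total if total else 0.0,
        "distinct_count": len(set(non_null)),
    }


def check_null_ratio(profile, max_ratio):
    """Quality test: pass when the share of nulls is within the allowed maximum."""
    return profile["null_ratio"] <= max_ratio


# Example: one of four values is missing.
profile = profile_column([1, None, 2, 2])
passed = check_null_ratio(profile, max_ratio=0.3)
```

Storing such profiles alongside the asset's metadata is what lets a catalog show quality trends next to ownership and lineage.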
Happy cataloging!!!!
Written by Amit Singh Rathore

Staff Data Engineer @ Visa — Writes about Cloud | Big Data | ML