Data Curation and Management, Chapters 1-5


University of Gondar

College of Informatics
Department of Information Science

Data Curation and Management


Chapter 1
June 2023
Chapter 1
Data curation

• Data curation is the practice of gathering and managing data to use for analytical
purposes.
• The purpose of data curation is to expand the awareness and knowledge of a specific
subject.
• Data curation involves collecting information using research methodology and then transforming independent data into organized data sets. In short, it ensures that people can find and use data now and in the future.
Data curation (…)
• Data curation is the process of collecting, organizing, preserving, and maintaining data for
current and future use. It is an important part of data management and involves a variety of
activities such as selecting and acquiring data, cleaning and transforming data, organizing
data, and making data accessible.

Advantages of Data Curation

Data curation helps organizations:
• Better understand their growth potential
• Add value for stakeholders
• Maintain high-quality data
• Discover new information and advancements within their industry
Data curation (…)
What is a data curator?
A data curator is a professional who collects and organizes data that a business can access and
analyze. Data curators may gather new data or perform a more thorough analysis of existing
research.
Who uses data curation?
• Sales
• Science
• Health care
• Education
Steps of data curation:
1. Choose a topic
2. Identify data
3. Gather data
4. Clean data
5. Transform data (steps 4 and 5 are sketched below)
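As a minimal Python sketch of steps 4 and 5, assuming the pandas library is available; the file name and column names are hypothetical:

import pandas as pd

# Gather: load raw survey data (the output of step 3)
raw = pd.read_csv("survey_responses.csv")

# Clean: drop duplicates, remove rows missing the key field, normalize text
clean = (
    raw.drop_duplicates()
       .dropna(subset=["respondent_id"])
       .assign(region=lambda df: df["region"].str.strip().str.title())
)

# Transform: shape independent records into an organized, analysis-ready set
curated = clean.groupby("region", as_index=False)["score"].mean()
curated.to_csv("curated_scores_by_region.csv", index=False)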
Data Management
• Data management is the practice of collecting, organizing, protecting, and storing an
organization’s data so it can be analyzed for business decisions.
• As organizations create and consume data at unprecedented rates, data management
solutions become essential for making sense of the vast quantities of data. Today’s leading
data management software ensures that reliable, up-to-date data is always used to drive
decisions. The software helps with everything from data preparation to cataloging, search,
and governance, allowing people to quickly find the information they need for analysis.

Why data management is important


• Visibility
• Reliability
• Security
• Scalability
Data Management (…)
• Data management types and techniques

• Data management systems (DBMSs)


• Examples of DBMS vendors and platforms: SQL-based systems, Oracle, IBM, Microsoft Access, SAP, Apache, Amazon, …
Data ontology
• Data ontology is a way of linking data in various formats based on various concepts. In the
early days of the internet, data were linked using HTTP protocols. Nowadays, one can add
another layer, an ontology, to define a specific concept, and then automatically link data
points that are pertinent to that concept.
• Ontology is the study of what is. To make this a little more concrete, one could also say ontology is the study of what exists or of what is real. "Does God exist?", "Are my feelings real?", and "What is 'nothing', and does it exist?" are all examples of ontological questions.
• Data should not solely exist in the form of hypertext documents and hyperlinks between
them. Rather, data should be viewed as what it represents — people, places, events, ideas,
activities, and so on — and linked in a human-readable way.
• What ontology in philosophy and in computer science have in common is that both attempt to describe everything that is, i.e., entities, ideas, and events, and all the relations between these things.
Data ontology
• In computer science and information science, an ontology encompasses a representation,
formal naming, and definition of the categories, properties, and relations between the
concepts, data, and entities that substantiate one, many, or all domains of discourse. More
simply, an ontology is a way of showing the properties of a subject area and how they are
related, by defining a set of concepts and categories that represent the subject.
• In philosophy, ontology is the study of the nature of being; but in information technology
it’s “the working model of entities and interactions in some particular domain of
knowledge or practices.”
• An ontology is a formal description of knowledge as a set of concepts within a domain and
the relationships that hold between them. It ensures a common understanding of information
and makes explicit domain assumptions thus allowing organizations to make better sense of
their data.
• Data Ontology ensures that business rules and data are unambiguous, unified, linked,
and most importantly, readable both by humans and machines.
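To make this concrete, here is a minimal Python sketch of a tiny ontology using the rdflib library; the namespace, the classes (Person, Organization), the relation (worksFor), and the individuals are hypothetical examples, not part of the slides:

from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/ontology#")
g = Graph()
g.bind("ex", EX)

# Concepts (categories) and the relation between them
g.add((EX.Person, RDF.type, RDFS.Class))
g.add((EX.Organization, RDF.type, RDFS.Class))
g.add((EX.worksFor, RDF.type, RDF.Property))
g.add((EX.worksFor, RDFS.domain, EX.Person))
g.add((EX.worksFor, RDFS.range, EX.Organization))

# Data points linked through the ontology, readable by humans and machines
g.add((EX.Abebe, RDF.type, EX.Person))
g.add((EX.UoG, RDF.type, EX.Organization))
g.add((EX.UoG, RDFS.label, Literal("University of Gondar")))
g.add((EX.Abebe, EX.worksFor, EX.UoG))

print(g.serialize(format="turtle"))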
University of Gondar
College of Informatics
Department of Information Science

Data Curation and Management


Chapter 2
June 2023
Chapter 2
Data Models

• Data modeling is the process of creating a visual representation of either a whole information system or parts of it to communicate connections between data points and structures.
• The goal is to illustrate the types of data used and stored within the system, the relationships
among these data types, the ways the data can be grouped and organized and its formats and
attributes.
• Data can be modeled at various levels of abstraction. The process begins by collecting
information about business requirements from stakeholders and end users. These business
rules are then translated into data structures to formulate a concrete database design.
• Data modeling employs standardized schemas and formal techniques. This provides a
common, consistent, and predictable way of defining and managing data resources across an
organization, or even beyond.
Data Models (…)
• Types of data model:
• Classified by their degree of abstraction, data models start with a conceptual model, progress to a logical model, and conclude with a physical model.
1. Conceptual data model: also called a domain model; offers a big-picture view of what the system will contain, how it will be organized, and which business rules are involved. It captures high-level, static business structures and concepts.
2. Logical data model: less abstract; provides greater detail about the concepts and relationships in the domain under consideration.
3. Physical data model: provides a schema for how the data will be physically stored within a database. It offers a finalized design that can be implemented as a relational database, including associative tables that illustrate the relationships among entities as well as the primary keys and foreign keys that will be used to maintain those relationships, and it is tied to a specific database management system (DBMS). (A minimal sketch follows below.)
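A minimal sketch of a physical data model in Python, using SQLite as a stand-in target DBMS; the Customer/Order domain, table names, and columns are hypothetical:

import sqlite3

# An in-memory SQLite database with primary and foreign keys enforcing
# the one-to-many customer-to-order relationship.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        email       TEXT UNIQUE
    )
""")

conn.execute("""
    CREATE TABLE customer_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        order_date  TEXT NOT NULL,
        total       REAL
    )
""")
conn.commit()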
Data Models (…)
Data modeling process:
1. Identify the entities
2. Identify the key properties of each entity
3. Identify the relationships among entities
4. Map attributes to entities completely
5. Assign keys as needed, and decide on a degree of normalization that balances the need to reduce redundancy with performance requirements (a normalization sketch follows below).
6. Finalize and validate the data model.
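A minimal plain-Python sketch of the normalization decision in step 5: a flat, redundant record set is split into two related tables; the sample records are hypothetical.

flat_orders = [
    {"order_id": 1, "customer_id": 10, "customer_name": "Abebe", "total": 250.0},
    {"order_id": 2, "customer_id": 10, "customer_name": "Abebe", "total": 90.0},
    {"order_id": 3, "customer_id": 11, "customer_name": "Sara", "total": 120.0},
]

# Customer attributes move to their own table, keyed by customer_id,
# so each customer's name is stored only once.
customers = {r["customer_id"]: {"name": r["customer_name"]} for r in flat_orders}

# Orders keep only the foreign-key reference to the customer.
orders = [
    {"order_id": r["order_id"], "customer_id": r["customer_id"], "total": r["total"]}
    for r in flat_orders
]

# A denormalized copy may still be kept where query performance matters
# more than avoiding redundancy.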
Data Models (…)
Types of data models used alongside database management systems:
1. Hierarchical data model
2. Relational data model
3. Entity-relationship (ER) data model
4. Object oriented data model
5. Dimensional data model
Benefits of data modeling
• Reduce errors in software and database development.
• Increase consistency in documentation and system design across the enterprise.
• Improve application and database performance.
• Ease data mapping throughout the organization.
• Improve communication between developers and business intelligence teams.
• Ease and speed the process of database design at the conceptual, logical and physical
levels.
Data Models (…)
Data modeling tools
• erwin Data Modeler
• Enterprise Architect
• ER/Studio
• Free data modeling tools include open source solutions such as Open ModelSphere
University of Gondar
College of Informatics
Department of Information Science

Data Curation and Management


Chapter 3
June 2023
Chapter 3
Metadata fundamentals
Metadata is a cornerstone of a modern enterprise data stack. Metadata can be defined as the
information that describes and explains data. It provides context with details such as the
source, type, owner, and relationships to other data sets, thus helping you understand the
relevance of a particular data set and guiding you on how to use it.
Metadata describes a data set by answering questions such as the following (an illustrative record follows the list):
• How was it collected? When was it collected?
• What assumptions were made in the data collection methodology? What is the geographic scope?
• Are there multiple files? If yes, how do they relate to one another?
• What are the definitions of individual variables and, if applicable, what were the possible answers?
• What was the calibration of any equipment used in data collection, and which version of software was used for analysis?
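A minimal sketch of such a descriptive metadata record as a Python dictionary; the dataset and all field values are hypothetical:

dataset_metadata = {
    "title": "Household Water Access Survey",
    "collection_method": "structured interviews, stratified random sample",
    "collection_period": "2022-01 to 2022-06",
    "assumptions": "one respondent answers for the whole household",
    "geographic_scope": "Amhara Region, Ethiopia",
    "files": {
        "responses.csv": "one row per household",
        "codebook.pdf": "variable definitions and possible answers",
    },
    "variables": {
        "hh_size": "number of household members (integer)",
        "water_source": "coded 1=piped, 2=well, 3=surface water",
    },
    "equipment_calibration": "not applicable (questionnaire-based)",
    "analysis_software": "R 4.2.1",
}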
Metadata fundamentals (…)

Metadata can be classified into 6 types:

• Technical: This includes technical metadata such as row or column count, data type,
schema, etc.
• Governance: This includes governance terms, data classification, ownership information,
etc.
• Operational: This includes information on the flow of data such as dependencies, code, and
runtime.
• Collaboration: This includes data-related comments, discussions, and issues
• Quality: This includes quality metrics and measures, such as dataset status, freshness, tests
run, and their statuses
• Usage: This includes information on how much a dataset is used, such as view count,
popularity, top users, and more.
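As a sketch, a data catalog entry might group metadata by these six types as a Python dictionary; every name and value below is illustrative:

table_metadata = {
    "technical": {"rows": 1_200_000, "columns": 14, "schema": "sales.orders"},
    "governance": {"classification": "internal", "owner": "sales-data-team"},
    "operational": {"pipeline": "orders_etl", "last_run": "2023-06-01T02:00Z"},
    "collaboration": {"open_issues": 2, "comments": ["check the currency field"]},
    "quality": {"freshness_hours": 24, "tests_passed": 18, "tests_failed": 0},
    "usage": {"views_last_30_days": 342, "top_users": ["analyst_a", "analyst_b"]},
}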
Metadata fundamentals (…)

Why is metadata important?


The right context for data is essential for understanding and putting it to use. Metadata helps
you make data discoverable, accessible, trustworthy, and valuable.
Metadata ensures that data is:
• Discoverable
• Trustworthy
• Relevant
• Accessible
• Secure
• Interoperable
Metadata fundamentals (…)

Metadata plays a significant role in everything from data discovery to lineage and governance.
So, let's look at three prominent metadata use cases:
 Speeding up root cause analysis
 Managing security classifications
 Optimizing data stack spending
Metadata Characteristics
• They are highly structured packages of information that explain the content, quality and
characteristics of the data.
• They are precise and in many cases short and made up of simple words.
• They offer access points to the information.
• They encode the description.
University of Gondar
College of Informatics
Department of Information Science

Data Curation and Management


Chapter 4
June 2023
Chapter 4
Big Data fundamentals
• Nowadays we live in the data age. Large volumes of data are generated daily from different sources such as social networks (Facebook, Twitter, YouTube), e-commerce, telecommunication, healthcare, news media, government, and personal data. This is driven by the emergence and advancement of technologies such as IoT, cloud computing, and smart devices.
• It is difficult to measure the size of today's data universe.
• There are different reasons for this explosion of data. These include the increase in online users, business process automation, the generation of structured, semi-structured, and unstructured data, the availability of high-speed networking, and the use of distributed computing.
• Due to the increase in the volume, velocity, and variety (the 3Vs) of today's data, it is difficult to get insight and value from these data using classical data analytics and processing methods. Even though storage capacity is increasing, it is not cost effective for an organization to keep buying ever-larger servers; instead, it is preferable to use clusters of commodity hardware.
Big Data fundamentals (…)
The three dimensions of big data are the 3Vs: volume, variety, and velocity.
• Volume: data is being generated at an accelerating rate.
• Variety: different types of data and data sources.
• Velocity: data in motion.
• Data is found in structured, semi-structured, and unstructured forms.
Driving forces of big data
• The advancement of technologies such as IoT, cloud computing, and smart devices.
Big Data fundamentals (…)
Application areas of Big data analytics
Big data analytics can be used in different application areas nowadays. These areas are
telecommunication, marketing and business, healthcare and social networking.
Use cases
• Credit card fraud detection
• Social media analysis
• Customer relationship management
• Health care analysis (e.g., genome sequencing)
• Sentiment analysis
• Market basket analysis
• Customer churn analysis
• Weather forecasting
• Call detail analysis in telecom and IoT
Big Data fundamentals (…)
Big data analytics platforms
Nowadays, there are several open-source and commercial big data analytics tools. The most widely used open-source tools are Apache Hadoop, Apache Spark, and the R programming language.
• Hadoop
• Spark
• R
Apache Hadoop is an open-source Apache Software Foundation project written in Java. It is a distributed software platform for the reliable, scalable, and distributed processing of large datasets across clusters of commodity hardware using a simple programming model.
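A minimal PySpark sketch of one of the use cases above (call detail analysis), assuming PySpark is installed and a local or cluster Spark environment is available; the input path and column names are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("call-detail-analysis").getOrCreate()

# Read a (hypothetical) call-detail CSV stored on the cluster
calls = spark.read.csv("hdfs:///data/call_details.csv", header=True, inferSchema=True)

# Aggregate call minutes per subscriber across the cluster
minutes_per_subscriber = (
    calls.groupBy("subscriber_id")
         .agg(F.sum("duration_min").alias("total_minutes"))
         .orderBy(F.desc("total_minutes"))
)
minutes_per_subscriber.show(10)
spark.stop()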
University of Gondar
College of Informatics
Department of Information Science

Data Curation and Management


Chapter 5
June 2023
Chapter 5
Data Management Plan and Policy
• A Data Management Plan, or DMP (sometimes also called a data sharing plan), is a formal document that outlines what you will do with your data during and after a research project. It describes the types of data you use for your research, how they are collected, organized, and stored, and what formats you use. When sensitive data are used, the DMP must also describe what steps you are taking to keep your data secure and compliant with regulations. It details how data will be made accessible and documented for sharing and reuse during and after the project is finished. DMPs are required by an increasing number of funding entities and research institutions.
• A DMP might also be informal, to be used internally, guided by policies established by the head of a research lab, collaborators, or IT groups.
• Who requires a Data Management Plan? (private, public, and federal institutions, …)
Data Management Plan and Policy (…)
What should be included in a Data Management Plan?
Type of data: observational, experimental, simulation, derived/compiled
Form of data: text, numeric, audiovisual, discipline- or instrument-specific
File formats: research community standards preferred (e.g., FITS for astronomy); preservation formats preferred (e.g., tables in CSV, documents in PDF, images in JPG)
Size of data and stable data: plans for where you will store them
Sensitive data: plans for secure storage
Metadata: data documentation (general, content, technical, access) (see the sketch below)
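A minimal sketch of these DMP elements in machine-readable form as a Python dictionary; all values are placeholders, not requirements from any particular funder:

data_management_plan = {
    "data_types": ["observational", "derived/compiled"],
    "data_forms": ["numeric", "text"],
    "file_formats": {"tables": "CSV", "documents": "PDF", "images": "JPG"},
    "estimated_size": "about 50 GB",
    "storage": "institutional repository plus off-site backup",
    "sensitive_data": {"present": True, "protection": "encrypted, access-controlled"},
    "metadata": "general, content, technical, and access documentation",
    "sharing": "public release after the project ends",
}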
