
Chapter 6: Big Data Curation Part II

Topic 7B

Hyung-Eun Choi

Assistant Professor of Finance, NEOMA Business School

Big Data for Finance, Spring 2024

6.5 Big Data Curation State of the Art

Technologies widely adopted for data curation


Master Data Management is composed of the processes and
tools that support a single point of reference for the data of an
organization, an authoritative data source of master data.
Curation at Source is an approach to curating data where
lightweight curation activities are integrated into the normal
workflow of those creating and managing data and other digital
assets.
Crowdsourcing has been fuelled by the rapid development in
web technologies that facilitate contributions from millions of
online users.

6.5 Big Data Curation State of the Art
Master Data Management (MDM):
MDM tools can be used to remove duplicates and standardize
data syntax, making the MDM repository an authoritative source
of master data.
MDM focuses on ensuring that an organization does not use
multiple and inconsistent versions of the same master data in
different parts of its systems.
Processes in MDM include source identification, data
transformation, normalization, rule administration, error
detection and correction, data consolidation, data storage,
classification, taxonomy services, schema mapping, and
semantic enrichment (a minimal sketch of two of these steps
appears after the objectives below).
The three main objectives of MDM:
Synchronizing master data across multiple instances of an
enterprise application
Coordinating master data management during an application
migration
Compliance and performance management reporting across
multiple analytic systems
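
As a rough illustration of two of the MDM processes listed above, here is a minimal Python sketch of syntax standardization and duplicate consolidation. The record fields and normalization rules are illustrative assumptions, not a real MDM tool.

```python
# A minimal sketch of two MDM-style curation steps: syntax
# standardization and duplicate removal. Fields and rules are
# illustrative assumptions, not a real MDM product.

def normalize(record):
    """Standardize syntax so equivalent records compare equal."""
    return {
        "name": record["name"].strip().lower(),
        "country": record["country"].strip().upper()[:2],  # 2-letter code
        "email": record["email"].strip().lower(),
    }

def consolidate(records):
    """Keep one 'golden' record per unique key (here: email)."""
    master = {}
    for rec in map(normalize, records):
        master.setdefault(rec["email"], rec)  # first occurrence wins
    return list(master.values())

raw = [
    {"name": "ACME Corp ", "country": "France", "email": "INFO@acme.com"},
    {"name": "acme corp", "country": "FR", "email": "info@acme.com "},
]
print(consolidate(raw))  # one consolidated master record
```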
6.5.1 Data Curation Platforms

A Business Model Example of MDM: Data Tamer


This prototype aims to replace the current developer-centric
extract-transform-load (ETL) process with automated data
integration.
The system uses a suite of algorithms to automatically map
schemas and de-duplicate entities.
However, human experts and crowds are leveraged to verify
integration updates that are particularly difficult for algorithms.
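
A minimal sketch of this hybrid pattern follows: an algorithm scores candidate duplicate pairs, decides automatically when confident, and routes only uncertain pairs to human reviewers. The similarity measure and thresholds are illustrative assumptions, not Data Tamer's actual algorithms.

```python
# Sketch: algorithmic de-duplication with human review of
# low-confidence pairs. Thresholds are assumed values.
from difflib import SequenceMatcher

AUTO_MATCH, AUTO_DISTINCT = 0.9, 0.4  # assumed confidence thresholds

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def triage(pairs):
    merged, review = [], []
    for a, b in pairs:
        score = similarity(a, b)
        if score >= AUTO_MATCH:
            merged.append((a, b))         # algorithm decides: duplicates
        elif score > AUTO_DISTINCT:
            review.append((a, b, score))  # too uncertain: ask an expert/crowd
        # else: algorithm decides the pair is distinct
    return merged, review

pairs = [("J.P. Morgan", "JP Morgan"), ("BNP Paribas", "Banco Popular")]
merged, review = triage(pairs)
```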

6.5 Big Data Curation State of the Art

Curation-at-Source or Sheer curation


Sheer curation can include lightweight categorization and
normalization activities.
An example would be vetting or “rating” the results of a
categorization process performed by a curation algorithm.
Sheer curation activities can also be composed with other
curation activities, allowing more immediate access to curated
data while also ensuring the quality control that is only possible
with an expert curation team.
The high-level objectives of sheer curation:
Avoid data deposit by integrating with normal workflow tools
Capture provenance information of the workflow
Seamless interfacing with data curation infrastructure
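
A minimal sketch of these objectives, assuming a hypothetical save_asset workflow function: categorization and provenance capture happen as a side effect of the user's normal save step, so no separate data deposit is needed.

```python
# Sketch of sheer curation: lightweight categorization and provenance
# capture wrapped around a normal "save" step. The function names
# (save_asset, categorize) and the store are hypothetical.
import datetime
import getpass

def categorize(text):
    """Trivial keyword categorizer standing in for a curation algorithm."""
    return "finance" if "market" in text.lower() else "general"

def save_asset(text, store):
    record = {
        "content": text,
        "category": categorize(text),   # lightweight curation step
        "provenance": {                 # captured from the workflow itself
            "created_by": getpass.getuser(),
            "created_at": datetime.datetime.now(
                datetime.timezone.utc).isoformat(),
            "tool": "save_asset v0.1",
        },
    }
    store.append(record)  # later vetted/"rated" by expert curators
    return record

store = []
save_asset("Equity markets rallied on rate-cut hopes.", store)
```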

6.5.1 Data Curation Platforms
An Example of Sheer Curation: Feedly
Leo is a news-feed AI research assistant developed by Feedly.
Feedly has been teaching Leo to read and analyze
information from media sources and social-media platforms.
Leo allows you to prioritize topics, trends, and keywords of
choice; deduplicate repetitive news; mute irrelevant information;
summarize articles, etc.
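
A toy sketch of what three of these features (prioritize, deduplicate, mute) could look like over a list of headlines; this is an illustrative stand-in, not Feedly/Leo's actual system.

```python
# Toy feed curation: prioritize by keyword, drop duplicates, mute topics.
PRIORITY = {"inflation", "fed"}   # assumed keywords of interest
MUTED = {"celebrity"}             # assumed irrelevant topics

def curate(headlines):
    seen, feed = set(), []
    for h in headlines:
        key = frozenset(h.lower().split())   # crude duplicate signature
        if key in seen or MUTED & key:
            continue                         # dedupe / mute
        seen.add(key)
        score = len(PRIORITY & key)          # prioritize by keyword hits
        feed.append((score, h))
    return [h for _, h in sorted(feed, reverse=True)]

news = [
    "Fed signals inflation fight continues",
    "fed signals inflation fight continues",   # duplicate
    "Celebrity launches new fragrance",        # muted
]
print(curate(news))
```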

6.5 Big Data Curation State of the Art
Crowdsourcing
The notion of “wisdom of crowds” advocates that potentially
large groups of non-experts can solve complex problems usually
considered to be solvable only by experts.
Crowdsourcing has emerged as a powerful paradigm for
outsourcing work at scale to online crowds.
The underlying assumption is that large-scale and cheap labour
can be acquired on the web.
The effectiveness of crowdsourcing has been demonstrated
by platforms such as Wikipedia, Amazon Mechanical Turk, and Kaggle.
Three representative crowdsourcing models:
Wikipedia follows a volunteer crowdsourcing approach where the
general public is asked to contribute to the encyclopaedia
creation project for the benefit of everyone.
Amazon Mechanical Turk provides a labour market where
crowdsourcing tasks are completed for payment.
Kaggle enables organizations to publish problems to be solved
through a competition between participants for a predefined
reward.
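
The "wisdom of crowds" idea is often operationalized by aggregating many noisy judgments, for example the labeling tasks run on Amazon Mechanical Turk. A minimal majority-vote sketch over made-up crowd labels:

```python
# Aggregate several noisy worker labels per item by majority vote.
# The votes below are fabricated purely for illustration.
from collections import Counter

votes = {  # item -> labels from different crowd workers
    "tweet_1": ["positive", "positive", "negative"],
    "tweet_2": ["negative", "negative", "positive", "negative"],
}

consensus = {
    item: Counter(labels).most_common(1)[0][0]
    for item, labels in votes.items()
}
print(consensus)  # {'tweet_1': 'positive', 'tweet_2': 'negative'}
```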
6.5.1 Data Curation Platforms
An Example of Crowdsourcing: Kaggle
Kaggle, a subsidiary of Google LLC, is an online community of
data scientists and machine learning practitioners.
Kaggle allows users to find and publish data sets, explore and
build models in a web-based data-science environment, work
with other data scientists and machine learning engineers, and
enter competitions to solve data science challenges.*

*Source: Wikipedia
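
A hedged sketch of finding and downloading a Kaggle data set programmatically, assuming the official kaggle package is installed (pip install kaggle) and an API token is configured in ~/.kaggle/kaggle.json; the dataset slug below is illustrative, and any public <owner>/<dataset> slug works the same way.

```python
# Download a public Kaggle dataset via the official API client.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads ~/.kaggle/kaggle.json credentials

# Illustrative slug; replace with any public <owner>/<dataset>.
api.dataset_download_files("zillow/zecon", path=".", unzip=True)
```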

Discussion

Discussion: Big Data Curation in Finance


One group to discuss the sheer curation model, i.e., Leo, for
finance.
Another group to discuss the crowdsourcing model, i.e., Kaggle,
for finance.
How can you apply the above data curation model to finance
areas?
Start with a sub-category of fintech:
PayTech, CreditTech, InvestTech, InsureTech, RealEstateTech,
or CryptoFinance, etc.
Discuss how data curation can address data quality and data
heterogeneity issues in your business model.

Group Assignment Guidelines

General Guidelines: Submit “two” separate materials


A: Non-manual data collection files
Non-hand-collected data files and the source programming code
used to obtain them via web scraping, APIs, etc.
Should be submitted together with the programming code
demonstrating your API or web-scraping procedure (a minimal
API-collection sketch is shown below).
Grades are based on the Three Vs: the larger the volume, the
higher the velocity (frequency), and the greater the variety of
formats, the higher the grade.
A Dropbox link will be given to each group
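
A minimal sketch of the kind of API-based collection expected in Part A, using the requests library against CoinGecko's public price endpoint (URL and parameters as assumed from its public docs); any other public data API follows the same pattern.

```python
# Poll a public price API and log observations to CSV.
import csv
import time
import requests

URL = "https://api.coingecko.com/api/v3/simple/price"

with open("prices.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "btc_usd", "eth_usd"])
    for _ in range(3):  # a few polls; raise the count for more volume
        data = requests.get(
            URL,
            params={"ids": "bitcoin,ethereum", "vs_currencies": "usd"},
            timeout=10,
        ).json()
        writer.writerow(
            [time.time(), data["bitcoin"]["usd"], data["ethereum"]["usd"]]
        )
        time.sleep(60)  # respect the API's rate limits
```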

Group Assignment Guidelines
General Guidelines: Submit “two” separate materials

B: Brief report (four pages max: two for figures, two for the
report) using the above data set

B-1: Figures (visualization outputs) in 2D, 3D, or interactive
form, AI images, etc., to emphasize the role of visualization in
extracting meaningful insights.
B-2: Submit either data-analytics results or a business-proposal report.
Topics include, but are not limited to: NLP (textual analysis),
algorithmic trading with technical analysis, stock and crypto
screeners with backtesting, and on-chain data analytics for
cryptos, DeFi, and NFTs (a minimal figure sketch follows this list).
For the business proposal, applying the Big Data Framework is
necessary, i.e., the Three Vs and big data acquisition, analysis,
curation, storage, and usage, etc.
Upload in the same Dropbox folder given to your group
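
A minimal sketch in the spirit of B-1/B-2: a moving-average crossover figure, a basic technical-analysis visualization. The prices here are synthetic random-walk data so the example is self-contained; in the report, use the data set collected in Part A.

```python
# Plot a price series with fast/slow moving averages (matplotlib).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
price = 100 + np.cumsum(rng.normal(0, 1, 250))  # synthetic daily prices

fast = np.convolve(price, np.ones(10) / 10, mode="valid")  # 10-day MA
slow = np.convolve(price, np.ones(50) / 50, mode="valid")  # 50-day MA

plt.plot(price, label="price")
plt.plot(range(9, 250), fast, label="MA(10)")
plt.plot(range(49, 250), slow, label="MA(50)")
plt.legend()
plt.title("Moving-average crossover (synthetic data)")
plt.savefig("figure1.png")
```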

Group assignment submission due: Friday, 22 March

