
Big Data Analytics

Andreas Harth
Karlsruhe Institute of Technology (Germany)
EPAC Anti-Corruption Meeting
Barcelona, Nov 2012
Goal
• Do more with less
• Find most egregious cases
• High-level filtering of data deluge
Volume, Velocity, Variety (V3)
Story Time!
• Ph.D. Student (Albania)
– Open Government, Linked Data
• Retired Banker (Germany), Ph.D. Student
(Germany), Post-Doc (Germany), Accountant
(United States)
– XBRL
• Consultant/Accountant (Germany)
– E-Discovery
• Macro-Fund Manager (United States)
– Correlations
Open Data – „Raw Data Now!“
Linked Data I

http://edgarwrap.ontologycentral.com/cik/1108366#id

http://dbpedia.org/property/coFounder

http://www.kellogghongkong.org/advisor-JianmingYu.aspx
Linked Data II

http://www.linkedin.com/in/xiaomengyang

http://dbpedia.org/property/investor

http://edgarwrap.ontologycentral.com/cik/1108366#id
Linked Data III

=
Linked Data IV

http://www.kellogghongkong.org/advisor-JianmingYu.aspx

owl:sameAs

http://cn.linkedin.com/pub/jianming-yu/12/2a/7a4
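The linking steps on the Linked Data I to IV slides can be sketched in a few lines. This is a minimal illustration, not the tooling used in the talk: triples are plain Python tuples, the subject/object direction of each link is an assumption (the slides show only the link labels), and `merge_same_as` is a hypothetical helper that collapses identifiers connected by owl:sameAs into one canonical URI.

```python
# Illustrative sketch only: triples as plain tuples, no RDF library.
SAME_AS = "owl:sameAs"

EDGAR = "http://edgarwrap.ontologycentral.com/cik/1108366#id"
KELLOGG = "http://www.kellogghongkong.org/advisor-JianmingYu.aspx"
LINKEDIN_CN = "http://cn.linkedin.com/pub/jianming-yu/12/2a/7a4"
LINKEDIN_XY = "http://www.linkedin.com/in/xiaomengyang"

# Subject/object direction is assumed; the slides show only the link labels.
triples = [
    (EDGAR, "http://dbpedia.org/property/coFounder", KELLOGG),
    (LINKEDIN_XY, "http://dbpedia.org/property/investor", EDGAR),
    (KELLOGG, SAME_AS, LINKEDIN_CN),
]


def merge_same_as(triples):
    """Rewrite all triples onto one canonical URI per owl:sameAs group."""
    parent = {}

    def find(uri):
        # Union-find lookup with path halving.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    for s, p, o in triples:
        if p == SAME_AS:
            root_s, root_o = find(s), find(o)
            if root_s != root_o:
                # Keep the lexicographically smaller URI as canonical,
                # so the result does not depend on triple order.
                parent[max(root_s, root_o)] = min(root_s, root_o)

    return {(find(s), p, find(o)) for s, p, o in triples if p != SAME_AS}


merged = merge_same_as(triples)
```

After merging, facts stated about either of the two person URIs end up on a single identifier, which is what allows data from different sources to be queried together.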
Linked Open Data
• World Bank (corruption – Linked Data)
• Transparency International (available as Linked
Data)
PH.D. STUDENT (ALBANIA)
Open Data - Framework
• Semantic Technologies
– Ontology modeling
– Conversion to RDF format
– Publish as Linked Data
• Archive
– Portal open.data.al
– CKAN Catalog
[Diagram labels: Civic Society; Public Offices; Source Institutions; Visualization Explorer; Ontology Model and Knowledge Base; RDF Storage; RDF Converter; Raw Data (structured data, text, semi-structured data)]
Open Spending
• Analysis of the Central Treasury Office public
spending system
• Process, transform, and restructure
information queried from this database
Invoice Nr. | Description | Institution | Service Provider | Amount | Date
107310110402012 | 600 Up inxh matemt.& fizik ore mesimore, shkrese 149/1, 10.07.2012, listpagesa | Universiteti Politeknik | RAIFFEISEN BANK | 1679040 | 20.07.2012
10110120212012 | 1012021 604 GALERIA energji qershor 2012 kont a107846, a107847 | Galeria Kombetare | CEZ SHPERNDARJE | 156328 | 20.07.2012
Open Spending (contd.)
• Semantically annotate the spending records
• Offer faceted search and user-friendly query
interfaces
• Showcase money allocation (outlier
detection) and public administration
expenditures: where funds go and how
much is paid for different services
• Offer visualizations to graphically highlight
these cases
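One simple way to approach the outlier detection mentioned above is a robust score based on the median absolute deviation (MAD). This is a sketch under assumptions, not the method used by the project: the function name, the 3.5 threshold, and most of the amounts are hypothetical (only 156328 and 1679040 appear in the sample records, and they belong to different institutions, so the comparison is purely illustrative).

```python
from statistics import median


def flag_outliers(amounts, threshold=3.5):
    """Return amounts whose modified z-score (MAD-based) exceeds threshold."""
    med = median(amounts)
    mad = median(abs(a - med) for a in amounts)
    if mad == 0:
        # All values (nearly) identical: nothing can be scored.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # for normally distributed data.
    return [a for a in amounts if 0.6745 * abs(a - med) / mad > threshold]


# Hypothetical payment amounts; two values are taken from the sample table.
payments = [156328, 149000, 162500, 151200, 158900, 1679040]
suspicious = flag_outliers(payments)
```

A median-based score is used here because a single very large payment would distort a mean and standard deviation computed from the same data.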
Fund Diversion
• Government may inappropriately divert funds,
especially in times close to elections
• Monitoring Budget Oversight
– measure the magnitude of the fund diversion
– calculate difference between Budget Execution
and Budget Approval
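The difference between budget execution and budget approval can be computed per budget line. A minimal sketch with hypothetical line items and figures; `budget_deviation` is an illustrative name, not part of any existing system.

```python
def budget_deviation(approved, executed):
    """Per budget line: (executed - approved, deviation in percent)."""
    rows = {}
    for item, planned in approved.items():
        spent = executed.get(item, 0.0)
        diff = spent - planned
        # Percentage is undefined when nothing was approved.
        pct = diff * 100 / planned if planned else None
        rows[item] = (diff, pct)
    return rows


# Hypothetical figures, in millions of an arbitrary currency.
approved = {"roads": 100.0, "schools": 200.0}
executed = {"roads": 130.0, "schools": 180.0}
deviation = budget_deviation(approved, executed)
```

Large positive deviations shortly before an election would be the cases to look at first.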
Open Gov Data
• open.data.al
• data.gov
• data.gov.uk
• …
Crowdsourcing – The Guardian, UK
RETIRED BANKER (GERMANY), PH.D.
STUDENT (GERMANY), POST-DOC
(GERMANY), ACCOUNTANT (UNITED
STATES)
Monitor Indicators over Time
• Check numbers individually (Benford’s law)
• Check stats over time, e.g., cost of goods sold
vs. sales (hide items in incoming goods)
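Benford's law states that in many naturally occurring sets of numbers the leading digit d appears with probability log10(1 + 1/d). A minimal sketch of such a check follows; the function names are hypothetical, and a real audit would add a significance test rather than inspect raw deviations.

```python
import math
from collections import Counter


def first_digit(x):
    """Leading non-zero digit of a non-zero number."""
    # Strip the sign, then leading zeros and the decimal point.
    return int(str(abs(x)).lstrip("0.")[0])


def benford_deviation(values):
    """Observed minus expected share for each leading digit 1..9."""
    digits = [first_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    n = len(digits)
    return {
        d: counts.get(d, 0) / n - math.log10(1 + 1 / d)
        for d in range(1, 10)
    }


# Powers of two are a textbook example of a Benford-conforming sequence.
deviations = benford_deviation([2 ** k for k in range(100)])
```

Reported figures whose leading digits deviate strongly from the expected shares are candidates for closer inspection, not proof of manipulation.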
Extensible Business Reporting
Language (XBRL)
• Enhance the creation, exchange and
comparison of business reporting information,
e.g., financial statements and directors/board
memberships
• The U.S. Securities and Exchange Commission
requires companies listed in the U.S. to
provide financial statement information in XBRL
• Completely new field of "mechanics" and
"analysis" of financial data
SEC EDGAR
DataCube Vocabulary
Extending the DataCube Vocabulary
Aggregate Data
avg, count, sum along dimensions
formulas/calculation arcs
Numeric: financial facts
Time series: valid start and end date of facts
Multidimensional: filing, issuer, segment…
Highly structured: concepts from taxonomies
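Aggregating along a dimension (avg, count, sum) can be illustrated on a handful of facts. This is a sketch with hypothetical data: the issuers, concept name, and values below are invented, and facts are modeled as plain dictionaries rather than DataCube observations.

```python
from collections import defaultdict


def aggregate(facts, dimension, measure="value"):
    """Group facts by one dimension and compute count, sum and avg."""
    groups = defaultdict(list)
    for fact in facts:
        groups[fact[dimension]].append(fact[measure])
    return {
        key: {"count": len(vals), "sum": sum(vals), "avg": sum(vals) / len(vals)}
        for key, vals in groups.items()
    }


# Hypothetical facts: issuer, concept, period and value are illustrative only.
facts = [
    {"issuer": "A", "concept": "Revenues", "period": "2010", "value": 100.0},
    {"issuer": "A", "concept": "Revenues", "period": "2011", "value": 140.0},
    {"issuer": "B", "concept": "Revenues", "period": "2011", "value": 60.0},
]
by_issuer = aggregate(facts, "issuer")
by_period = aggregate(facts, "period")
```

Choosing a different dimension ("issuer", "period", "concept") corresponds to rolling up the cube along a different axis.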
Linking Data Sources
OLAP Query (SEC XBRL and Freebase)
From Statistics to Explanations

Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
Via Heiko Paulheim, TU Darmstadt
Prototype: Explain-a-LOD

Via Heiko Paulheim, TU Darmstadt


Countries with low corruption
• HDI > 78%
• Human Development Index, calculated from life
expectancy, education level, economic
performance
• OECD member states
• Foundation place of more than nine organizations
• More than ten mountains
• More than ten companies with their headquarters
in that state, but fewer than two cargo airlines

Via Heiko Paulheim, TU Darmstadt


Money Flows

Sueddeutsche Zeitung, Nov 2012


ACCOUNTANT/CONSULTANT
(GERMANY)
E-Discovery!
• Corporate Governance, Compliance, Fraud
• More regulations (Foreign Corrupt Practices
Act US 1977, United Kingdom Bribery Act UK
2011)
• E-Discovery/forensic data analysis were little
known at surveyed companies
• Increasing relevance for reactive investigation
(court cases) but also as part of a Compliance
Management System

Ernst & Young Study on E-Discovery (Renato Fazzone)


Reason for Starting E-Discovery

Ernst & Young Study on E-Discovery (Renato Fazzone)


Challenges for Forensic Data Analysis

Ernst & Young Study on E-Discovery (Renato Fazzone)


MACRO FUND MANAGER (U.S.)
Correlations!
• GDP growth (companies) vs. tax increases
Current taxes on income, wealth, etc.
• Eurostat tec00018 (national level)
• Same stats are available on a regional level
• E.g., Katerini, EL: population 83,764
(Wikipedia)
Eurostat NUTS – GR125
Land Use Satellite Imagery

http://vision.ucmerced.edu/datasets/landuse.html
Ferraris

http://en.wikipedia.org/wiki/Ferrari
Correlations!
• Swimming pools vs. tax levels
• Ferraris vs. tax levels
• But: correlation does not imply causation

XKCD comic http://xkcd.com/552/ via Heiko Paulheim, TU Darmstadt
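A correlation such as "swimming pools vs. tax levels" is typically quantified with the Pearson coefficient. A minimal sketch follows; the regional figures are hypothetical and only show how the number is computed, not what any real data says.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (sd_x * sd_y)


# Hypothetical per-region figures: pools counted on imagery, declared taxes.
pools = [12, 40, 7, 55, 23]
declared_tax = [1.1, 3.9, 0.8, 5.2, 2.0]
r = pearson(pools, declared_tax)
```

A region with many pools but low declared taxes would stand out against an otherwise strong correlation; as the slide notes, that is a pointer for investigation, not evidence of causation.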


DISCUSSION
Data Quality Issues
• Integration requires (manual) effort
• Time lag of municipal data
KIT’s Research Portfolio
• Linked Data as abstraction for data integration
between heterogeneous systems
• Crowdsourcing manual integration effort
• Query processing and reasoning (mappings)
on Linked Data
• User interfaces based on OLAP systems
PlanetData
• European Commission FP7 Network of
Excellence
• Goal: help organisations to publish and use
(Linked|Open|Streaming) Data
• Collaborations possible (e.g., in the geospatial
area, potentially with DLR/ESA)
Conclusion
• Technology can help to identify hot-spots
• Open Data helps to find and access data
• Linked Data helps to integrate data
• Techniques for integrated data include time-series
analysis and mining of correlations/ratios
• Finding corruption still requires detective
work (and always will)
http://upload.wikimedia.org/wikipedia/en/0/0f/2000AD168.jpg
