Large Scale Semantic Data Integration and Analytics through Cloud: A Case Study in Bioinformatics

Tat Thang, Parallel and Distributed Computing Centre, School of Computer Engineering, NTU, Singapore

Michael Li, Semantic Technology Group, Institute for Infocomm Research (I2R), A-Star, Singapore
11th Feb 2011

Overview
- Motivation
- Problem Definition
- Objective
- Proposed Architecture
- A case study in Bio-informatics
- Demo
- Future works
- Summary

Motivation
- Deluge of biological data
- Biomedical data is spread across heterogeneous databases
- Data comes in structured and semi-/unstructured formats
- Demand for fast, large-scale and cost-effective computing strategies

Problem Definition
Data
- PubMed contains 20+ million abstracts
- UniProt contains 13.5+ million records

Case study on antiviral proteins
- Over 70,000 citations in PubMed
- Over 14,000 proteins in UniProt

Integration and Analysis

Related Works
Using NLP to link documents to existing ontologies (e.g. GoPubMed, Textpresso)
- No querying & reasoning
- Not scalable

RDF/OWL based integration tools (e.g. TopBraid Suite)
- No NLP
- Not bio specific; also not biologist friendly

Cloud-based bio data mining works (e.g. Kudtarkar P 2010)
- Still in early stages
- Challenging to perform semantic integration on cloud

Objective
To provide a framework that enables:

Better data infrastructure
- Scalability
- Management of heterogeneity
- Cost-effectiveness

Better data analytics
- Integrative data mining
- Visual query interface

Our Approach
Proposed Framework
- Data Infrastructure Module
- Data Analytics Module

Our Approach
[Figure: architecture diagram with two parts, the Data Infrastructure module and the Data Analytics module. Components shown: Biomedical sources, Web Crawler, Parser, Cloud-based data store, Knowle Population Service, Ontology, Query & Reasoner, User Interface]

Our Approach
Data Infrastructure Module
- Cloud based: Amazon EC2, Hadoop, Microsoft Azure
- Parallel processing: MapReduce
- Distributed storage: BigTable, HBase, HDFS

Data Analytics Module
- Non-semantic: database driven
- Semantic: ontology driven (Knowle, AllegroGraph, TopBraid)

Data Infrastructure Module (Hadoop)


- Software framework for data-intensive, distributed applications
- The Hadoop Distributed File System (HDFS) provides a distributed, scalable, and portable file system that supports large data sets
- Hadoop MapReduce lets programs process large amounts of data in parallel
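To make the MapReduce idea concrete, here is a minimal single-process sketch in Python. It is not Hadoop code and not the authors' implementation: the toy abstracts and the names `map_words`, `reduce_counts`, and `run_mapreduce` are invented for illustration. It only shows the map, shuffle (group by key), and reduce phases that Hadoop would distribute across nodes.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator

# Toy stand-ins for PubMed abstracts; the text is invented for illustration.
ABSTRACTS = [
    "interferon inhibits viral replication",
    "viral protease cleaves host protein",
    "interferon induces antiviral protein",
]


def map_words(abstract: str) -> Iterator[tuple[str, int]]:
    """Map step: emit (word, 1) for every word in one abstract."""
    for word in abstract.split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> tuple[str, int]:
    """Reduce step: sum all counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer) -> dict:
    """Run map, shuffle (group values by key), and reduce in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


word_counts = run_mapreduce(ABSTRACTS, map_words, reduce_counts)
```

In a real cluster the map calls run on the nodes that hold the input blocks and the grouped keys are partitioned across reducers; the sketch keeps everything in one dictionary to show only the data flow.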

Cloud-Based Data Store: Hadoop Distributed File System

Name node (with a secondary name node)
- Metadata (in memory)
- Data nodes
- Data blocks
- Node attributes
- Names of files
- Block-to-node mapping

Data nodes (five shown in the diagram)
- Store file contents
- Each file is chunked into blocks
- Each block is spread across the data nodes
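The split between name-node metadata and data-node contents can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the HDFS API: the function names, the tiny block size, and the round-robin placement are all hypothetical, and replication (which real HDFS performs) is left out.

```python
from __future__ import annotations


def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Chunk file contents into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(filename: str, data: bytes, data_nodes: list[str], block_size: int):
    """Spread blocks over data nodes and record where each block went.

    Returns (block_map, storage):
      block_map -- name-node style metadata: block id -> data node
      storage   -- data-node contents: node -> {block id -> bytes}
    """
    block_map: dict[str, str] = {}
    storage: dict[str, dict[str, bytes]] = {node: {} for node in data_nodes}
    for index, block in enumerate(split_into_blocks(data, block_size)):
        block_id = f"{filename}#{index}"
        # Round-robin placement is an assumption made for this sketch.
        node = data_nodes[index % len(data_nodes)]
        block_map[block_id] = node
        storage[node][block_id] = block
    return block_map, storage


def read_file(filename: str, block_map: dict, storage: dict) -> bytes:
    """Reassemble a file by looking up each block's node in the metadata."""
    block_ids = sorted(
        (b for b in block_map if b.startswith(filename + "#")),
        key=lambda b: int(b.rsplit("#", 1)[1]),
    )
    return b"".join(storage[block_map[b]][b] for b in block_ids)


NODES = ["datanode1", "datanode2", "datanode3", "datanode4"]
block_map, storage = place_blocks("uniprot.dat", b"ABCDEFGHIJ", NODES, block_size=3)
```

Reading goes through the metadata first, which is why the name node keeps the block-to-node mapping in memory while the data nodes hold only block contents.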

Data Analytics Module (Knowle)


- Semantic technology toolkit
- Knowle services used in the Data Analytics Module:
  - Data/text mining
  - Ontology population
  - Ontology query
  - Visual ontology query
- Developed at the Institute for Infocomm Research, Singapore

Web Crawler

[Figure: biomedical data sources (UniProt, PubMed), each read by its own crawler (UniProt Crawler, PubMed Crawler), with the crawled data written to the cloud-based data store]

Parser

[Figure: crawled UniProt data goes to the UniProt Parser and crawled PubMed data goes to the PubMed Parser; the diagram also shows the cloud-based data store and the Knowle Ontology Population Service]
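As an illustration of what a UniProt parser step might do, the sketch below reads records in the UniProt flat-file layout, where each line starts with a two-letter code and `//` ends an entry. The sample entries (accessions `P00000` and `P11111`, the names, the lengths) are invented, and the function names are hypothetical; the slides do not describe the authors' actual parser.

```python
from __future__ import annotations

# Two invented entries in UniProt flat-file style (not real UniProt records).
SAMPLE = "\n".join([
    "ID   EXAMPLE1_HUMAN          Reviewed;         120 AA.",
    "AC   P00000;",
    "DE   RecName: Full=Example antiviral protein;",
    "KW   Antiviral defense.",
    "//",
    "ID   EXAMPLE2_HUMAN          Reviewed;         300 AA.",
    "AC   P11111;",
    "DE   RecName: Full=Example kinase;",
    "//",
])


def parse_flat_file(text: str) -> list[dict[str, list[str]]]:
    """Group the lines of each entry by their two-letter line code."""
    records: list[dict[str, list[str]]] = []
    current: dict[str, list[str]] = {}
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-entry marker
            if current:
                records.append(current)
            current = {}
            continue
        code, value = line[:2], line[5:].strip()
        if code.strip():                   # skip blank or continuation lines
            current.setdefault(code, []).append(value)
    if current:                            # entry without a trailing "//"
        records.append(current)
    return records


def accessions(record: dict[str, list[str]]) -> list[str]:
    """Split the semicolon-separated AC line(s) into accession numbers."""
    joined = " ".join(record.get("AC", []))
    return [acc.strip() for acc in joined.split(";") if acc.strip()]


records = parse_flat_file(SAMPLE)
```

The output of such a step is a list of plain dictionaries, which is a convenient shape to hand to an ontology population service.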

Ontology
[Figure: the Protein Ontology and the combined Protein + Literature Ontology]

Ontology Populator
Parsed UniProt data

Knowle Ontology Population Service
- Populate concepts
- Assert datatype properties
- Assert object properties

Parsed PubMed data

Knowle Text Mining Service
- Entity detection
- Relation extraction

Ontology triplestore (Protein + Literature ontology)
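The three population steps named on this slide (populate concepts, assert datatype properties, assert object properties) can be sketched as operations on a set of triples. This is not the Knowle API: the `ex:` terms, the `Literal` class, and the function names are hypothetical, and a Python set stands in for the triplestore.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """A plain data value, as opposed to a resource identifier."""
    value: str


Triple = tuple  # (subject, predicate, object)


def populate_concept(store: set, individual: str, concept: str) -> None:
    """Create an individual as an instance of an ontology class."""
    store.add((individual, "rdf:type", concept))


def assert_datatype_property(store: set, subject: str, prop: str, value: str) -> None:
    """Attach a literal value (name, keyword, title) to an individual."""
    store.add((subject, prop, Literal(value)))


def assert_object_property(store: set, subject: str, prop: str, obj: str) -> None:
    """Link two individuals, for example a citation to a protein."""
    store.add((subject, prop, obj))


store: set = set()

# From parsed UniProt data (invented example values).
populate_concept(store, "ex:P00000", "ex:Protein")
assert_datatype_property(store, "ex:P00000", "ex:hasName", "Example antiviral protein")

# From parsed PubMed data, after entity detection and relation extraction.
populate_concept(store, "ex:PMID1", "ex:Citation")
assert_datatype_property(store, "ex:PMID1", "ex:hasTitle", "An invented citation title")
assert_object_property(store, "ex:PMID1", "ex:mentions", "ex:P00000")
```

The object property is what ties the two sources together: once a citation individual points at a protein individual, a single query can cross from literature to protein data.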

Query & Reasoner

[Figure: query stack. Components shown: User Interface, Knowle Query Service, Sesame, SAIL, OWLIM Reasoner, Ontology Triplestore]
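To show what a cross-query over the integrated triples amounts to, the sketch below matches a list of triple patterns against an in-memory list of triples. It uses neither Sesame nor OWLIM and does no reasoning; the data, the `ex:` terms, and the function names are invented for illustration.

```python
from __future__ import annotations

# Invented triples linking PubMed-style citations to UniProt-style proteins.
TRIPLES = [
    ("ex:P00000", "rdf:type", "ex:Protein"),
    ("ex:P00000", "ex:hasKeyword", "Antiviral"),
    ("ex:P11111", "rdf:type", "ex:Protein"),
    ("ex:P11111", "ex:hasKeyword", "Kinase"),
    ("ex:PMID1", "ex:mentions", "ex:P00000"),
    ("ex:PMID2", "ex:mentions", "ex:P11111"),
]


def match(pattern: tuple, triple: tuple, binding: dict) -> dict | None:
    """Extend a variable binding so that pattern equals triple, else None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was first bound to.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def query(triples: list, patterns: list) -> list[dict]:
    """Join the patterns one by one, carrying bindings forward."""
    bindings: list[dict] = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Which citations mention a protein tagged with the keyword "Antiviral"?
results = query(TRIPLES, [
    ("?protein", "ex:hasKeyword", "Antiviral"),
    ("?citation", "ex:mentions", "?protein"),
])
```

A reasoner would add inferred triples (for example, subclass membership) before this matching step, so the same query could return results that were never asserted explicitly.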

User Interface
[Figure: user interface components. Search; KnowleGator Ontology Visual Query; Visual Query Translator; Ontology Query & Reasoner; Ontology Triplestore; also shown: Web Crawler, Parser, Knowle Population Service]
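One plausible role for a visual query translator is to turn the nodes and edges a user draws into a textual query. The sketch below emits SPARQL from a list of edges; the slides do not say which query language KnowleGator actually produces, so the target language, the `Edge` class, the `ex:` namespace, and the function name are all assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One edge drawn in a visual query: source --predicate--> target."""
    source: str      # a variable such as "?protein" or a prefixed term
    predicate: str   # a prefixed property name
    target: str


def to_sparql(edges: list[Edge], select: list[str], prefixes: dict[str, str]) -> str:
    """Render the edges as one SPARQL SELECT query with a basic graph pattern."""
    lines = [f"PREFIX {name}: <{iri}>" for name, iri in prefixes.items()]
    lines.append("SELECT " + " ".join(select))
    lines.append("WHERE {")
    for edge in edges:
        lines.append(f"  {edge.source} {edge.predicate} {edge.target} .")
    lines.append("}")
    return "\n".join(lines)


# A two-node visual query: citations that mention some protein.
VISUAL_QUERY = [
    Edge("?protein", "rdf:type", "ex:Protein"),
    Edge("?citation", "ex:mentions", "?protein"),
]

sparql = to_sparql(
    VISUAL_QUERY,
    select=["?protein", "?citation"],
    prefixes={
        "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
        "ex": "http://example.org/",  # placeholder namespace
    },
)
```

Keeping the visual query as plain edge objects means the same structure could be rendered to a different query language without changing the interface code.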

A case study in Bio-informatics


Integration and cross-querying of PubMed and UniProt data
- 70,054 citations from PubMed
- 14,527 proteins in UniProt

Infrastructure (virtual computers)
- 4 data nodes (RAM: 1 GB, CPU: Intel Xeon 2.4 GHz)
- 2 master nodes: 1 name node, 1 secondary name node (RAM: 512 MB, CPU: Intel Xeon 2.4 GHz)
- 1 virtual CPU = Intel Xeon 2.4 GHz

Demo
Data
- UniProt: 853 antiviral protein entries
- PubMed: 2,000 citations

Demo Snapshot

Summary
We proposed a new framework
- Data infrastructure module (cloud-based infrastructure)
- Data analytics module (semantic technologies)

We tested it on a prototype
- Using our own infrastructure
- With integration and cross-querying of PubMed and UniProt data

Future works
- Integrated user interface
- Explore other cloud-based data stores: HBase, BigTable
- Apply the map-reduce concept to data analytics and crawling
- Integrate Knowle into the cloud-based environment
