Large Scale Semantic Data Integration and Analytics through Cloud: A Case Study in Bioinformatics
Tat Thang, Parallel and Distributed Computing Centre, School of Computer Engineering, NTU, Singapore
Michael Li, Semantic Technology Group, Institute for Infocomm Research (I2R), A-Star, Singapore
11th Feb 2011
Overview
- Motivation
- Problem Definition
- Objective
- Proposed Architecture
- A case study in bioinformatics
- Demo
- Future works
- Summary
Motivation
- Deluge of biological data
- Biomedical data is spread across heterogeneous databases
- Data comes in structured, semi-structured and unstructured formats
- Demand for fast, large-scale and cost-effective computing strategies
Problem Definition
Data
- PubMed contains 20+ million abstracts
- UniProt contains 13.5+ million records
Related Works
Using NLP to link documents to existing ontologies (e.g. GoPubMed, Textpresso)
- No querying and reasoning
- Not scalable
Objective
To provide a framework that enables better data infrastructure:
- Scalability
- Management of heterogeneity
- Cost-effectiveness
Our Approach
Proposed Framework
[Figure: proposed framework. Labels: Data Infrastructure module, Data Analytics module, Query & Reasoner, Web Crawler, Parser, Knowle Population Service, User Interface, Ontology, Biomedical sources]
Our Approach
Data Infrastructure Module
- Cloud based: Amazon EC2, Hadoop, Microsoft Azure
- Parallel processing: MapReduce
- Distributed storage: BigTable, HBase, HDFS
[Figure: distributed storage with in-memory metadata and five data nodes]
Metadata (in memory)
- Data nodes
- Data blocks
- Node attributes
- Names of files
- Mapping of blocks to nodes
Data node
- Stores file contents
- Each file is chunked into blocks
- Each block is spread across data nodes
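The chunking and block-to-node mapping described above can be illustrated with a minimal sketch. It is not the HDFS implementation: the round-robin placement, the function names `split_into_blocks` and `place_blocks`, and the node names are assumptions made for illustration, and real HDFS placement is more involved.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Chunk file contents into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, data_nodes: list[str],
                 replication: int = 1) -> dict[int, list[str]]:
    """Build the block-to-node mapping that the metadata side keeps in memory.

    Assumption: simple round-robin placement, chosen only for illustration.
    """
    if replication > len(data_nodes):
        raise ValueError("replication cannot exceed the number of data nodes")
    return {
        block_id: [data_nodes[(block_id + r) % len(data_nodes)]
                   for r in range(replication)]
        for block_id in range(num_blocks)
    }


# Hypothetical 10-byte file, 4-byte blocks, three data nodes, two replicas.
blocks = split_into_blocks(b"x" * 10, block_size=4)
block_map = place_blocks(len(blocks), ["node-a", "node-b", "node-c"],
                         replication=2)
```

The data nodes hold the block contents, while `block_map` plays the role of the in-memory metadata that records where each block lives.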
Web Crawler
- UniProt
- UniProt Crawler
Parser
- PubMed Parser
Ontology
- Protein + Literature Ontology
- Protein Ontology
Ontology Populator
- Parsed UniProt Data
Knowle Ontology Population Service
- Populate concepts
- Assert datatype properties
- Assert object properties
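The three population steps can be sketched as operations that emit subject-predicate-object triples. This is only an illustration and not the Knowle API: the class `OntologyPopulator`, the namespace `EX`, and the individuals and property names (`hasName`, `hasTitle`, `mentions`) are all hypothetical.

```python
EX = "http://example.org/bio#"  # hypothetical namespace, not from the slides
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class OntologyPopulator:
    """Collects triples for the three population steps."""

    def __init__(self):
        self.triples = set()

    def populate_concept(self, individual, concept):
        # Declare an individual as an instance of an ontology class.
        self.triples.add((EX + individual, RDF_TYPE, EX + concept))

    def assert_datatype_property(self, individual, prop, value):
        # Attach a literal value (quoted here to mark it as a literal).
        self.triples.add((EX + individual, EX + prop, f'"{value}"'))

    def assert_object_property(self, subject, prop, obj):
        # Link two individuals to each other.
        self.triples.add((EX + subject, EX + prop, EX + obj))


# Hypothetical parsed records standing in for UniProt and PubMed entries.
populator = OntologyPopulator()
populator.populate_concept("protein_1", "Protein")
populator.populate_concept("citation_1", "Citation")
populator.assert_datatype_property("protein_1", "hasName", "example protein")
populator.assert_datatype_property("citation_1", "hasTitle", "example title")
populator.assert_object_property("citation_1", "mentions", "protein_1")
```

Object properties are what connect the literature side to the protein side, which is what later makes cross-source queries possible.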
OWLIM Reasoner
- SAIL
- Ontology Triplestore
User Interface
- Sesame
[Figure: user interface. Labels: Knowle Population Service, Search, Web Crawler, Parser, Ontology Triplestore]
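To make the idea of searching a triplestore concrete, here is a minimal in-memory sketch of triple-pattern matching. It is a stand-in rather than the Sesame or OWLIM interface, performs no reasoning, and uses hypothetical identifiers and predicate names throughout.

```python
def match(triples, pattern):
    """Return the triples matching an (s, p, o) pattern; None is a wildcard."""
    return [
        t for t in triples
        if all(want is None or want == have for want, have in zip(pattern, t))
    ]


# Hypothetical triples standing in for populated UniProt and PubMed data.
triples = [
    ("protein:1", "rdf:type", "Protein"),
    ("citation:1", "rdf:type", "Citation"),
    ("citation:2", "rdf:type", "Citation"),
    ("citation:1", "mentions", "protein:1"),
]


def citations_for(store, protein):
    """Cross-source lookup: which citations mention the given protein?"""
    return sorted(s for s, _, _ in match(store, (None, "mentions", protein)))


result = citations_for(triples, "protein:1")
```

A real triplestore would answer the same kind of question through a query language and could also return facts inferred by the reasoner.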
Demo
Data
- UniProt: 853 antiviral protein entries
- PubMed: 2000 citations
[Figure: demo snapshot]
Summary
We proposed a new framework
- Data infrastructure module (cloud-based infrastructure)
- Data analytics module (semantic technologies)
We tested on a prototype
- Using our own infrastructure
- With integration and cross-querying across PubMed and UniProt
Future works
- Integrated user interface
- Explore other cloud-based data stores: HBase, BigTable
- Apply the MapReduce concept to data analytics and crawling
- Integrate Knowle into a cloud-based environment
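As a rough illustration of what applying MapReduce to analytics could look like, the sketch below counts term occurrences across documents using explicit map, shuffle and reduce phases. It runs in a single process, does not use Hadoop, and the documents and function names are made up for the example.

```python
from collections import defaultdict


def map_phase(text):
    """Emit a (term, 1) pair for every token in one document."""
    for token in text.lower().split():
        yield token.strip(".,;:()"), 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one term."""
    return key, sum(values)


def term_counts(docs):
    pairs = [pair for text in docs.values() for pair in map_phase(text)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


# Hypothetical documents, not taken from PubMed.
docs = {
    "doc-1": "Antiviral protein binds RNA.",
    "doc-2": "The protein is antiviral.",
}
counts = term_counts(docs)
```

In a distributed setting the map calls would run in parallel on the nodes holding each document, and only the grouped pairs would move across the network.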