Download as pdf or txt
Download as pdf or txt
You are on page 1of 4

Emerging multidisciplinary research across database

management systems
Anisoara Nica1, Fabian M. Suchanek2 * and Aparna S. Varde3
1. Sybase iAnywhere, Waterloo, ON, Canada
2. INRIA Saclay, Paris, France
3. Department of Computer Science, Montclair State University, Montclair, NJ, USA
anica@sybase.com, fabian@suchanek.name, vardea@montclair.edu

ABSTRACT Leveraging Logical and Statistical Inference in Ontology


Alignment [9] by Professor Renée J. Miller from the
The database community is exploring more and more University of Toronto, gave an overview on the work
multidisciplinary avenues: Data semantics overlaps with done in this area. It also presented a new approach,
ontology management; reasoning tasks venture into the “Integrated Learning In Alignment of Data and Schema”
domain of artificial intelligence; and data stream (ILIADS), which combines data matching and logical
management and information retrieval shake hands, e.g., reasoning to achieve better matching of ontologies.
when processing Web click-streams. These new research Professor Miller ended her talk with advice for graduate
avenues become evident, for example, in the topics that students based on her own experience – both as a student
doctoral students choose for their dissertations. This and as a graduate adviser.
paper surveys the emerging multidisciplinary research by
doctoral students in database systems and related areas. It Another problem in the area of data mining is
is based on the PIKM 2010, which is the 3rd Ph.D. frequent subgraph mining. Garcia et al. [5] study this
workshop at the International Conference on Information problem with techniques available inside a database
and Knowledge Management (CIKM). The topics management system. Three fundamental research
addressed include ontology development, data streams, problems under a database approach are discussed:
natural language processing, medical databases, green efficient graph storage and indexing, searching for
energy, cloud computing, and exploratory search. In frequent subgraphs and finding subgraph isomorphism
addition to core ideas from the workshop, we list some using SQL. The solutions toward these problems are
open research questions in these multidisciplinary areas. outlined, together with the preliminary experimental
validation, focusing on query optimizations and time
complexity.
1. SURVEY OF TOPICS
Our survey groups the topics of the workshop Park et al. [13] research the issue of finding
into six research domains. We start with two core areas of statistical correlations among pairs of attribute values in
database research, data mining and stream processing. We relational database systems. By extending the Bayesian
proceed to the sister domain of database management, network models the authors provide a probabilistic
Information Retrieval, before venturing into Information ranking function based on a limited assumption of value
Extraction and the relatively new area of Privacy and independence. Experimental results show that the
Trust. We conclude with an application-driven domain, proposed model improves the retrieval effectiveness on
data warehouse management. We shed light on new real datasets and has a reasonable query processing time
research proposals in each of these domains, as well as on compared to previous related work.
open issues. Discussions with the speakers at the workshop
pointed to open research avenues in the area of data
mining in general. One such avenue, inspired by the
1.1. Data mining keynote talk, is investigating the usefulness of data
mining in the Semantic Web: Data can be mined from the
Data mining, the science of discovering
Semantic Web (for tasks such as summarization or
meaningful knowledge in data, has always been a core
alignment) as well as for the Semantic Web (with the
area of database research. Nowadays, with the advent of
purpose of adding new knowledge to an ontology). The
large semantic knowledge bases, the area faces new
interplay of these two directions appears still largely
challenges, for example with the task of integrating and
unexplored.
aligning ontologies. The keynote of our workshop,

*
The work by Fabian Suchanek has been partially funded by the European Research Council under the European Community's Seventh
Framework Programme (FP7/2007-2013) / ERC grant Webdam, agreement 226513. http://webdam.inria.fr/
1.2. Stream processing information. Mirizzi et al. [10] propose to combine these
two concepts. They present a Web-based tool called
As more and more data (such as sensor data)
“Lookup Discover Explore” (LED) that improves lookup
becomes available in the form of continuous streams, new
search by adding exploration capabilities. LED supports a
avenues of research open up. Amiguet et al. [2], for
lookup step, an exploratory step and a meta search
example, address the issue of semantic changes
process. Given a keyword query, their system can display
(annotations) in the stream data processing of sensor
DBpedia resources with a graph explorer, a ranker and a
networks. The propagation of annotations through the
context analyzer. One of the contributions of their work is
workflow, as part of the data processing, raises interesting
a novel approach to browsing by exploiting the Semantic
research questions related to how the propagation
Web.
depends on the structural and temporal contributions, and
on the annotation significance. The work shows its Di Buccio et al. [3] address the related problem
practical viability through a case study from a climate of querying a document collection with short textual
forecasting application that processes temperature sensor queries. Newer systems use relevance feedback from
data. This paper won the best paper award in the PIKM different sources of evidence. In this paper, the authors
2010 workshop. propose a uniform model for these evidence sources.
Another problem in stream processing is the One area of Information Retrieval that is gaining
adaptation of the system to different utilizations. Farag et more and more attention lately is the exploitation of the
al. [4] propose new algorithms for providing good data deep Web. Tjin-Kam-Jet [17] proposes to address this
stream system performance during periods of peak load challenge in a distributed environment. Given the fact that
and periods of delays. The “External Memory Sliding the deep Web is up to two orders of magnitude larger than
Window Join” (EM-SWJoin) algorithm utilizes external the surface Web, they argue that distribution might be the
memory data structures to adapt to the variable data key to scalability. This work proposes to automatically
arrival rates while keeping disk access latency at convert free-text queries to structured queries for complex
minimum. The “ADaptive Execution of DAta Streams” web forms in order to make the deep web more easily
(ADEDAS) algorithm guarantees ordered release of searchable. Challenges include developing a formal query
output results to the user/application while controlling the description syntax, translating queries with correct
impact of delays over stream processing. Thus, it interpretation, bridging the gap between user expectations
addresses the problem of releasing output results in an and system capabilities, adapting query description for
increasing order of timestamp when the input data items resource selection, ranking top k resources, merging
arrive out-of order. The authors exemplify their approach results from resources to maximize precision, recall and
through a system that monitors online stocks. suggestion ranking for users with respect to resources. He
In some scenarios, several data streams come aims to evaluate this solution with a prototype system and
together. Huq et al. [6] discuss how different data streams user studies, criteria being processing time and user
can be coordinated, how storage space can be optimized satisfaction.
and how the database state can be kept consistent in a Discussions following the talk [17] brought up
distributed scenario. The main application the authors the idea of studying distributed deep web searches in the
have in mind is sensor-based experiments, for example context of handheld mobile devices. As mobile devices
based on motion sensors. become practically ubiquitous (and reasonably powerful,
This last talk brought up the idea to investigate too), it might become possible to utilize their capacities
column-oriented data representations in data stream for distributed query processing.
environments. Such a representation will require stronger
coordination (horizontally in addition to vertically), but
might potentially improve output efficiency and overall 1.4. Information extraction
system performance. Information extraction, in its widest sense, is the
extraction of structured data from unstructured data. One
domain where large corpora of unstructured text would
1.3. Information retrieval particularly benefit from information extraction is the
Recent years have seen an opening up of the medical domain. Raghavan et al. [15] are concerned with
border between the database research and information extracting information from medical reports. A medical
retrieval. Storing data and querying data go hand in hand, report is a natural language description of diagnoses,
and this applies to both structured data and unstructured treatments or medications, together with structured
data. In many scenarios, queries can be classified into information about the patient. The goal is to extract a
“lookup searches” and “exploratory searches”. In lookup chronology of events from the reports. Such chronologies
searches, users “look up” details on topics known to can then be used to review a patient’s history or to gather
them; in exploratory searchers they “explore” new statistical data about the effectiveness or consequences of
medical treatments. The task is challenging, because the and connected nodes. This encourages work on a context-
reports use medical jargon and colloquial temporal neutral autonomous cloud monitor agent.
expressions (“two days ago”). The paper conducts two Just as [15], the authors of [1] target the medical
initial case studies: In the first, machine learning on domain. We take this as an indication that this domain
medical reports is used to determine whether patients can attract research from different areas of computer
qualify for Leukemia trials. In the second, a bio-specimen science, including security, privacy, data management,
repository is augmented with data from medical reports. information extraction and systems research.
This additional data facilitates the classification of tissue
probes and also information retrieval on the specimen
database. 1.6. Workflow and Management
Kliegr [7] discusses the problem of noun phrase Research in databases can also touch quite
classification: Given a noun phrase (such as an image practical issues such as the management of data centers.
caption) and given a set of classes (such as “Nature” or As data centers grow larger and larger and ever more
“City”), the task is to assign a class to the noun phrase. powerful, pragmatic concerns such as heat management
This problem is challenging, because the classes are user- and environmental compatibility arise. Pawlish et al. [14],
defined and the noun phrases do not necessarily have for example, are concerned with the environment-friendly
context. The paper proposes to map the noun phrase to a management of data centers. The ultimate aim is to
weighted set of related Wikipedia articles (a “bag of provide the data center manager with a tool that can help
articles”, BOA). Likewise, each class is mapped to a decide management questions, such as whether to buy
BOA. Then, the classification problem is reduced to new hardware and what hardware, in order to minimize
determining the similarity between the noun phrase BOA the environmental impact of the facility. For this purpose,
and the class BOAs. the paper first explains what data is necessary to support
This last talk [7] has shown us that the such decisions. These include, e.g., data on energy use,
possibilities that Wikipedia offers for information humidity, acoustic levels, and the carbon footprint of the
extraction are not yet fully exploited. The first talk [15] center. For proof of concept, the authors have collected
has pointed us to an area where Information Extraction such data for one data center. Then, the paper proposes to
(and possibly data mining) could find attractive build decision trees on top of these data and to use case
applications: the medical domain. based reasoning in order to determine economically
sensible management steps. The paper outlines this
framework supported by some real world examples.
1.5. Privacy and Trust Another issue in data warehousing environments
As more and more data is being produced, stored is the problem of consistent data quality. As warehousing
and made public, issues such as security, trust and privacy environments are often dynamically changing the quality
gain more and more attention. Consider for example of various resources and also the users’ quality
pervasive health care systems. These are systems that requirements, there is a need to capture quality changes
monitor medical indicators on the patient and send their on-the-fly and also provide automated quality
data to a medical center. These systems can greatly notifications back to end users. Therefore, Li et al. [8]
increase efficiency and safety, but come with the problem propose an extended data warehousing systems
of data security. Acharya et al. [1] address the possible architecture that incorporates and extends the concepts of
security threats in pervasive health care systems. One the “Quality Factory” (QF) and the “Quality Notification
particular security loophole is the need to exchange Service” (QNS), and adopts the “Data Warehouse
encryption keys. The paper proposes a communication Quality” (DWQ) methodology to provide both subjective
scheme that eliminates the need to broadcast encryption and objective quality assessments of the end-users’
keys. quality requirements.
Security and trust are even more obvious issues Apart from quality, efficiency of processes is
when it comes to sharing data in cloud computing also almost always an issue. Naseri et al. [11] are
systems. Driven by this insight, Thorpe [16] develops a concerned with optimizing workflow management. They
trust paradigm for the cloud. The paper considers propose to systematically store, analyze and mine data
interactions, reputation, knowledge and experience with about successful workflow executions in order to
respect to the cloud. Concepts such as user trust, cloud optimize workflow composition, workflow selection and
trust, cloud trust peer and cloud distrust are defined and workflow refinement.
an algorithm is proposed for a trust cloud context graph. Discussions during the workshop were
The algorithm takes all the variables into account that are particularly attracted to the concept of managing data
relevant to the cloud, i.e., machines, storage components warehouses. This area touches not just several domains of
computer science (such as storage, querying, and quality
estimation), but also other domains such as economics, [3] E. Di Buccio and M. Melucci. Toward the design of a
the environmental sciences and the legal domain. The methodology to predict relevance through multiple
talks in this session have raised awareness for the sources of evidence. In PIKM 2010.
practical impacts of our ever-growing amounts of data, [4] F. Farag, M. Hammad, and R. Alhajj. Adaptive query
their quality and their storage. processing in data stream management systems under
limited memory resources. In PIKM 2010.
2. CONCLUSIONS [5] W. Garcia, C. Ordonez, K. Zhao, and P. Chen.
This paper presents a short survey of emerging Efficient algorithms based on relational queries to
research across database systems and related areas. It mine frequent graphs. In PIKM 2010.
focused on the work that doctoral students presented at [6] M. R. Huq, A. Wombacher, and P. M. G. Apers.
the PIKM workshop in ACM CIKM 2010. From a Identifying the challenges for optimizing the process
research point of view, the work is spread over areas as to achieve reproducible results in E-science
diverse as information extraction and workflow applications. In PIKM 2010.
management, stream processing and security. From an [7] T. Kliegr. Entity classification by bag of Wikipedia
application point of view, the work is concerned with articles. In PIKM 2010.
cloud computing systems, sensor networks, practical
aspects of data warehouse management and Web search. [8] Y. Li and K. M. Osei-Bryson. Quality factory and
The medical domain with its applications has also quality notification service in data warehouse. In
attracted particular interest. The new research proposals PIKM 2010.
as well as discussions with the students have illuminated [9] R. J. Miller, Leveraging Logical and Statistical
some of the new research avenues in the domain of Inference in Ontology Alignment, Keynote Talk In
databases and beyond. PIKM 2010.
[10] R. Mirizzi and T. Di Noia. From exploratory search
to web search and back. In PIKM 2010.
3. ACKNOWLEDGMENTS
We sincerely thank the CIKM 2010 organizers for their [11] M. Naseri and S. A. Ludwig. A multi-functional
help and support in organizing the Ph.D. workshop architecture addressing workflow and service
PIKM. In particular we thank the General Chair, Jimmy challenges using provenance data. In PIKM 2010.
Huang and the Workshops Chair, Mounia Lalmas, for [12] A. Nica and A. S. Varde, editors. Proceedings of the
their cooperation. We also express our sincere gratitude Third Ph.D. Workshop in CIKM, PIKM 2010,
towards the 24 PC members of PIKM, spread over 12 Nineteenth ACM Conference on Information and
countries across the globe. They comprised well-known Knowledge Management, CIKM 2010, Toronto,
experts from industry and academia and their exhaustive Canada, Oct. 2010. ACM.
reviews helped to make this workshop a fruitful and [13] J. Park and S. G. Lee. Probabilistic ranking for
enriching experience. All of this helped set the stage for relational databases based on correlations. In PIKM
writing this article. 2010.
[14] M. J. Pawlish and A. S. Varde. A decision support
4. REFERENCES system for green data centers. In PIKM 2010.
[1] D. Acharya and V. Kumar. A secure pervasive health care [15] P. Raghavan and A. M. Lai. Leveraging natural
system using location dependent unicast key generation language processing of clinical narratives for
scheme. In PIKM 2010. phenotype modeling. In PIKM 2010.
[2] J. Amiguet, A. Wombacher, and T. E. Klifman. [16] S. Thorpe. Modeling a trust cloud context. In PIKM
Annotations: Dynamic semantics in stream 2010.
processing. In PIKM 2010.
[17] K.T. Tjin-Kam-Jet. Research proposal for distributed
deep web search. In PIKM 2010.

You might also like