

[Winter 2014] ASSIGNMENT

PROGRAM Master of Science in Information Technology (MSc IT) Revised Fall 2011
SUBJECT CODE & NAME MIT401 Data Warehousing and Data Mining
Email Id:

Q.No 1 Explain the Top-Down and Bottom-up Data Warehouse development Methodologies. 10
Top- Down and Bottom - Up Development Methodology
Although Data Warehouses can be designed in a number of different ways, they all share a number
of important characteristics. Most Data Warehouses are Subject Oriented. This means that the information
in the Data Warehouse is stored in a way that allows it to be connected to objects or events that occur
in reality.
Another characteristic that is frequently seen in Data Warehouses is called Time Variant. A time variant Data
Warehouse allows changes in the information to be monitored and recorded over time. A Data Warehouse is
also Integrated: data from all the programs used by a particular institution is stored in it and combined.
The first Data Warehouses were developed in the 1980s. As societies entered the information age, there was a
large demand for efficient methods of storing information.

Many of the systems that existed in the 1980s were not powerful enough to store and manage large amounts of
data. There were a number of reasons for this. The systems that existed at the time took too long to report and
process information, and many of them were not designed to analyze or report information. In addition,
the computer programs that were necessary for reporting information were both costly and slow. To solve
these problems, companies began designing computer databases that placed an emphasis on managing and
analyzing information. These were the first Data Warehouses, and they could obtain data from a variety of
different sources, including PCs and mainframes.
Spreadsheet programs have also played an important role in the development of Data Warehouses. By the end
of the 1990s, the technology had greatly advanced, and was much lower in cost. The technology has continued
to evolve to meet the demands of those who are looking for more functions and speed. Four advances
in Data Warehouse technology have allowed it to evolve: offline operational databases,
real time Data Warehouses, offline Data Warehouses, and integrated Data Warehouses.
The offline operational database is a system in which the information within the database of an operational
system is copied to a server that is offline. When this is done, the operational system will perform at a much
higher level. As the name implies, a real time Data Warehouse system will be updated every time an event
occurs. For example, if a customer orders a product, a real time Data Warehouse will automatically update the
information in real time.
With the integrated Data Warehouse, transactions will be transferred back to the operational systems each day,
and this will allow the data to be analyzed easily by companies and organizations. A typical Data Warehouse
is made up of a number of layers. Some of these are the source data layer,
reporting layer, Data Warehouse layer, and transformation layer. There are a number of different data sources for
Data Warehouses. Some popular data sources are Teradata, Oracle Database, and Microsoft SQL Server.
Another important concept that is related to Data Warehouses is called data transformation. As the name
suggests, data transformation is a process in which information transferred from specific sources is cleaned
and loaded into a repository.
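The clean-and-load idea can be illustrated with a minimal Python sketch. The record layout (`customer`, `amount`), the cleaning rules, and the in-memory list standing in for the repository are all assumptions made for illustration; a real transformation layer would be far more involved.

```python
def clean(record):
    """Normalise one raw record; return None if it cannot be used."""
    name = str(record.get("customer", "")).strip().title()
    try:
        amount = float(record.get("amount"))
    except (TypeError, ValueError):
        return None  # amount is missing or not numeric
    if not name:
        return None  # customer name is missing
    return {"customer": name, "amount": amount}


def load(raw_records, repository):
    """Clean each raw record and append the usable ones to the repository."""
    for record in raw_records:
        cleaned = clean(record)
        if cleaned is not None:
            repository.append(cleaned)
    return repository


# Hypothetical raw records arriving from different source systems.
raw = [
    {"customer": "  alice smith ", "amount": "10.50"},
    {"customer": "bob", "amount": "n/a"},
    {"customer": "", "amount": "3"},
]
warehouse = load(raw, [])
```

Only the first record survives cleaning here; the other two are rejected because of a non-numeric amount and a missing customer name.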

2 Explain the Functionalities and advantages of Data Warehouses 5+5=10


Functionality of Data Warehouses

Data Warehouses exist to facilitate complex, data-intensive and frequent ad hoc queries. Data Warehouses must
provide far greater and more efficient query support than is demanded of transactional databases. Data
Warehouses provide the following functionality:
Roll-up: Data is summarized with increased generalization.
Drill-down: Increasing levels of detail are revealed.
Pivot: Cross tabulation (that is, rotation) is performed.
Slice and Dice: Performing projection operations on the dimensions.
Sorting: Data is sorted by ordinal value.
Selection: Data is available by value or range.
Derived or Computed Attributes: Attributes are computed by operations on stored data and values.
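
Some of the operations listed above can be sketched in plain Python over a small fact table. The table contents, the field names, and the helper functions `aggregate` and `slice_rows` are hypothetical; this is an illustration of the ideas, not how a warehouse engine implements them.

```python
from collections import defaultdict

# Hypothetical fact table: (year, quarter, region, product, sales)
FIELDS = ("year", "quarter", "region", "product", "sales")
SALES = [
    (2013, "Q1", "North", "Pens", 100),
    (2013, "Q2", "North", "Pens", 150),
    (2013, "Q1", "South", "Ink", 80),
    (2014, "Q1", "North", "Ink", 120),
    (2014, "Q2", "South", "Pens", 90),
]


def aggregate(rows, dims):
    """Sum the sales measure grouped by the named dimensions."""
    positions = [FIELDS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in positions)] += row[-1]
    return dict(totals)


def slice_rows(rows, dim, value):
    """Slice: keep only the rows where one dimension has a fixed value."""
    i = FIELDS.index(dim)
    return [row for row in rows if row[i] == value]


# Detailed (drill-down) view: sales by year and quarter.
by_quarter = aggregate(SALES, ["year", "quarter"])
# Roll-up: generalize from (year, quarter) to year alone.
by_year = aggregate(SALES, ["year"])
# Slice on region, then cross-tabulate the result by (product, year).
north = slice_rows(SALES, "region", "North")
north_by_product_year = aggregate(north, ["product", "year"])
```

Rolling up and drilling down are the same grouping step run with fewer or more dimensions, which is why one `aggregate` helper covers both in this sketch.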

Advantages of Data Warehouse

A Data Warehouse provides a common data model for data, regardless of the data source. This makes it easier
to report and analyze information than it would be if multiple data models from disparate sources were used to
retrieve information such as sales invoices, order receipts, general ledger charges, etc.
Prior to loading data into the Data Warehouse inconsistencies are identified and resolved. This greatly
simplifies reporting and analysis.
Information in the Data Warehouse is under the control of Data Warehouse users so that, even if the source
system data is purged over time, the information in the warehouse can be stored safely for extended periods of time.
Because they are separate from operational systems, Data Warehouses provide fast retrieval of data without
slowing down operational systems.
Data Warehouses facilitate Decision Support System applications such as trend reports (e.g., the items with
the most sales in a particular area within the last two years), exception reports, and reports that show actual
performance versus goals.



3 Describe about Hyper Cube and Multicube 5+5=10

Hypercubes and Multicubes
Multidimensional databases can present their data to an application using two types of cubes: hypercubes and
multicubes.
Hypercube: In the hypercube model, as shown in the following illustration, all data appears logically as a
single cube. The name is borrowed from the cube with four dimensions, but here it is used as a general
metaphor for representing multidimensional data: a hypercube accommodates more than three dimensions and,
in simpler cases, can just as well accommodate three. Often, Multi Dimensional Structures (MDS) are used to
represent such data.
Multicube: In the multicube model, data is segmented into a set of smaller cubes, each of which is composed
of a subset of the available dimensions. This means the data can be viewed through several cubes with
different dimensions.

Fig.: Multicube
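
The difference between the two models can be sketched with Python dictionaries. The dimension names, the measures, and the numbers are hypothetical, and real multidimensional databases use specialised storage rather than dictionaries.

```python
# Hypercube: one logical cube; every cell is addressed by ALL dimensions.
# Assumed dimension order: (product, region, year, measure)
hypercube = {
    ("Pens", "North", 2014, "sales"): 120,
    ("Pens", "North", 2014, "budget"): 100,
    ("Ink", "South", 2014, "sales"): 90,
}

# Multicube: several smaller cubes, each over a subset of the dimensions.
multicube = {
    "sales": {  # dimensions: (product, region, year)
        ("Pens", "North", 2014): 120,
        ("Ink", "South", 2014): 90,
    },
    "budget": {  # dimensions: (product, year) only
        ("Pens", 2014): 100,
    },
}


def hypercube_lookup(product, region, year, measure):
    """Every lookup must supply a coordinate for every dimension."""
    return hypercube.get((product, region, year, measure))


def multicube_lookup(cube_name, *coords):
    """Pick a cube first, then supply only that cube's dimensions."""
    return multicube[cube_name].get(coords)
```

In this sketch the budget figure needs a region coordinate in the hypercube even if budgets are not planned by region, whereas the multicube keeps it in a smaller cube that leaves that dimension out.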

4 List and explain the Strategies for data reduction. 5*2=10

Strategies for data reduction include the following:
1) Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
2) Dimension reduction, where irrelevant, weakly relevant, or redundant attributes or dimensions may be
detected and removed.
3) Data compression, where encoding mechanisms are used to reduce the data set size.
4) Numerosity reduction, where the data are replaced or estimated by alternative, smaller data
representations such as parametric models (which need to store only the model parameters instead of the actual
data), or nonparametric methods such as clustering, sampling, and the use of histograms.
5) Discretization and concept hierarchy generation, where raw data values for attributes are replaced
by ranges or higher conceptual levels. Concept hierarchies allow the mining of data at multiple levels of
abstraction and are a powerful tool for data mining.
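
Two of the strategies above, numerosity reduction and discretization, can be sketched in a few lines of Python. The sample ages, the bucket width, and the "youth / adult / senior" thresholds are hypothetical values chosen only for illustration.

```python
import random

# Hypothetical raw attribute values.
ages = [13, 15, 16, 19, 20, 21, 22, 25, 30, 33, 35, 36, 40, 45, 46, 52, 70]


def equal_width_histogram(values, width):
    """Numerosity reduction: replace raw values by counts per bucket."""
    counts = {}
    for v in values:
        low = (v // width) * width
        bucket = (low, low + width - 1)
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts


def simple_random_sample(values, n, seed=0):
    """Numerosity reduction: keep n values drawn without replacement."""
    return random.Random(seed).sample(values, n)


def discretize(age):
    """Discretization: map a raw value to a higher conceptual level."""
    if age < 20:
        return "youth"
    if age < 60:
        return "adult"
    return "senior"


histogram = equal_width_histogram(ages, 20)
sample = simple_random_sample(ages, 5)
concepts = [discretize(a) for a in ages]
```

The histogram stores four bucket counts in place of seventeen raw values, and the concept labels allow the same data to be mined at a coarser level of abstraction.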

5. Describe K-means method for clustering. List its advantages and drawbacks. 5+5=10
K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known
clustering problem. The procedure classifies a given data set into a certain number of clusters (assume k
clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. In the beginning we
determine the number of clusters K and assume the centroid, or center, of each cluster. We can take any
random objects as the initial centroids, or the first K objects in sequence can also serve as the initial
centroids. Then the K-means algorithm repeats the three steps given below until convergence, that is, until
the grouping is stable (no object moves to another group):
1. Determine the centroid coordinate
2. Determine the distance of each object to the centroids
3. Group the object based on minimum distance

These steps are given in the form of a flow chart. (See fig. below)

Fig.: Flow chart representation of K-means
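
The three steps can also be written as a minimal Python sketch. It assumes Euclidean distance, uses the first K objects as the initial centroids (one of the two options mentioned above), and adds a `max_iter` safety limit that is not part of the description; the sample points are hypothetical.

```python
import math


def kmeans(points, k, max_iter=100):
    """Cluster points (tuples of numbers) into k groups.

    Returns (centroids, assignment), where assignment[i] is the index
    of the cluster that points[i] belongs to.
    """
    # Initial centroids: the first k objects in sequence.
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Steps 2 and 3: measure the distance of each object to every
        # centroid and group the object with the nearest one.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # stable: no object moved to another group
        assignment = new_assignment
        # Step 1 (next round): recompute each centroid as the mean
        # of the objects currently assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, assignment


points = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (5.0, 6.0)]
centroids, assignment = kmeans(points, k=2)
# centroids  -> [(1.0, 1.5), (5.0, 5.5)]
# assignment -> [0, 0, 1, 1]
```

With these points the first pass groups three objects around the second centroid; after the centroids are recomputed, the object (1.0, 2.0) moves to the first group and the grouping becomes stable.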

The K-means method has the following advantages:
With a large number of variables, K-Means may be computationally faster than hierarchical clustering (if K
is small).
K-Means may produce tighter clusters than hierarchical clustering, especially if the clusters are globular.
The K-means method as described has the following drawbacks:
It does not do well with overlapping clusters.
The clusters are easily pulled off-center by outliers.
Each record is either inside or outside of a given cluster.



6 Describe about Multilevel Databases and Web Query Systems 5+5=10

Multilevel Databases
Several researchers have proposed a multilevel database approach to organizing Web-based information. The
main idea behind these proposals is that the lowest level of the database contains primitive semi-structured
information stored in various web repositories, such as hypertext documents. At the higher level(s), metadata
or generalizations are extracted from lower levels and organized in structured collections such as relational or
object-oriented databases.
Web Query Systems
Many web-based query systems and languages have been developed recently that attempt to utilize
standard database query languages such as SQL, structural information about web documents, and even
natural language processing to accommodate the types of queries that are used in World Wide Web
searches. We mention a few examples of these web-based query systems here. W3QL combines structure
queries, based on the organization of hypertext documents, and content queries, based on information retrieval
techniques. WebLog is a logic-based query language for restructuring extracted information from Web
information sources. Lorel and UnQL support querying of heterogeneous and semi-structured information on
the Web using a labeled graph data model. TSIMMIS helps to extract data from heterogeneous and
semi-structured information sources and correlates them to generate an integrated database representation of the
extracted information.
