Business Intelligence
Business intelligence (BI) refers to a collection of decision support technologies for the enterprise aimed at enabling knowledge workers such as executives, managers, and analysts
to make better and faster decisions. The past two decades have seen explosive growth, both in
the number of products and services offered and in the adoption of these technologies by
industry. This growth has been fueled by the declining cost of acquiring and storing very large amounts of data arising from sources such as customer transactions in banking, retail, and e-business; RFID tags for inventory tracking; email; query logs for Web sites; blogs; and product reviews. Enterprises today collect data at a finer granularity and therefore in much larger volumes. Businesses are leveraging their data asset aggressively by
deploying and experimenting with more sophisticated data analysis techniques to drive
business decisions and deliver new functionality such as personalized offers and services to
customers. Today, it is difficult to find a successful enterprise that has not leveraged BI
technology for its business. For example, BI technology is used in manufacturing for order
shipment and customer support, in retail for user profiling to target grocery coupons during
checkout, in financial services for claims analysis and fraud detection, in transportation for
fleet management, in telecommunications for identifying reasons for customer churn, in utilities for power usage analysis, and in health care for outcomes analysis.
A typical architecture for supporting BI within an enterprise is shown in Figure 1 (the shaded
boxes are technology that we focus on in this article). The data over which BI tasks are
performed often comes from different sources—typically from multiple operational databases
across departments within the organization, as well as from external vendors. Different sources contain data of varying quality and use inconsistent representations, codes, and formats, all of which have to be reconciled. Thus, the problems of integrating, cleansing, and standardizing data in preparation for BI tasks can be rather challenging. Moreover, BI tasks usually need to be performed incrementally as new data arrives, for example, last month's sales data, which makes efficient and scalable data loading and refresh capabilities imperative for enterprise BI. These back-end technologies for preparing the data
for BI are collectively referred to as Extract-Transform-Load (ETL) tools. Increasingly there
is a need to support BI tasks in near real time, that is, make business decisions based on the
operational data itself. Specialized engines referred to as Complex Event Processing (CEP)
engines have emerged to support such scenarios.
The data over which BI tasks are performed is typically loaded into a repository called the
data warehouse that is managed by one or more data warehouse servers. A popular choice of
engines for storing and querying warehouse data is relational database management systems
(RDBMS). Over the past two decades, several data structures, optimizations, and query
processing techniques have been developed primarily for executing complex SQL queries
over large volumes of data—a key requirement for BI. An example of such an ad hoc SQL
query is: find customers who have placed an order during the past quarter whose amount
exceeds the average order amount by at least 50%. Large data warehouses typically deploy
parallel RDBMS engines so that SQL queries can be executed over large volumes of data
with low latency.
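As a concrete illustration, the ad hoc query above can be expressed in SQL and run against a small hypothetical Orders table; the schema, sample data, and quarter cutoff date here are all invented for this sketch, shown via Python's built-in sqlite3 module:

```python
import sqlite3

# Hypothetical schema and data, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (CustomerId INT, OrderDate TEXT, Amount REAL)")
conn.executemany("INSERT INTO Orders VALUES (?,?,?)", [
    (1, "2024-02-10", 300.0),
    (2, "2024-02-15", 90.0),
    (3, "2024-03-01", 150.0),
])

# Customers with an order in the past quarter whose amount exceeds
# the average order amount by at least 50%.
rows = conn.execute("""
    SELECT DISTINCT CustomerId
    FROM Orders
    WHERE OrderDate >= '2024-01-01'
      AND Amount >= 1.5 * (SELECT AVG(Amount) FROM Orders)
""").fetchall()
print(sorted(r[0] for r in rows))  # [1]
```

The average order amount is 180.0, so only the 300.0 order qualifies.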
As more data is born digital, there is an increasing desire to architect low-cost data platforms that can support much larger data volumes than those traditionally handled by RDBMSs. This is
often described as the "Big Data" challenge. Driven by this goal, engines based on the
MapReduce9 paradigm—originally built for analyzing Web documents and Web search query
logs—are now being targeted for enterprise analytics. Such engines are currently being
extended to support complex SQL-like queries essential for traditional enterprise data
warehousing scenarios.
Data warehouse servers are complemented by a set of mid-tier servers that provide
specialized functionality for different BI scenarios. Online analytic processing (OLAP)
servers efficiently expose the multidimensional view of data to applications or users and
enable the common BI operations such as filtering, aggregation, drill-down and pivoting. In
addition to traditional OLAP servers, newer "in-memory BI" engines are appearing that
exploit today's large main memory sizes to dramatically improve performance of
multidimensional queries. Reporting servers enable definition, efficient execution and
rendering of reports—for example, report total sales by region for this year and compare with
sales from last year. The increasing availability and importance of text data such as product
reviews, email, and call center transcripts for BI brings new challenges. Enterprise search
engines support the keyword search paradigm over text and structured data in the warehouse
(for example, find email messages, documents, history of purchases and support calls related
to a particular customer), and have become a valuable tool for BI over the past decade. Data
mining engines enable in-depth analysis of data that goes well beyond what is offered by
OLAP or reporting servers, and provide the ability to build predictive models to help answer
questions such as: which existing customers are likely to respond to my upcoming catalog
mailing campaign? Text analytic engines can analyze large amounts of text data (for example,
survey responses or comments from customers) and extract valuable information that would
otherwise require significant manual effort, for example, which products are mentioned in the
survey responses and the topics that are frequently discussed in connection with those
products.
There are several popular front-end applications through which users perform BI tasks:
spreadsheets, enterprise portals for searching, performance management applications that
enable decision makers to track key performance indicators of the business using visual
dashboards, tools that allow users to pose ad hoc queries, viewers for data mining models,
and so on. Rapid, ad hoc visualization of data can enable dynamic exploration of patterns,
outliers and help uncover relevant facts for BI.
In addition, there are other BI technologies (not shown in Figure 1) such as Web analytics,
which enables understanding of how visitors to a company's Web site interact with the pages;
for example which landing pages are likely to encourage the visitor to make a purchase.
Likewise, vertical packaged applications such as customer relationship management (CRM)
are widely used. These applications often support built-in analytics, for example, a CRM
application might provide functionality to segment customers into those most likely and least
likely to repurchase a particular product. Another nascent but important area is mobile BI that
presents opportunities for enabling novel and rich BI applications for knowledge workers on
mobile devices.
In this short article, we are not able to provide comprehensive coverage of all technologies
used in BI (see Chaudhuri et al.5 for additional details on some of these technologies). We
therefore chose to focus on technology where research can play, or has historically played, an
important role. In some instances, these technologies are mature but challenging research
problems still remain—for example, data storage, OLAP servers, RDBMSs, and ETL tools.
In other instances, the technology is relatively new with several open research challenges, for
example, MapReduce engines, near real-time BI, enterprise search, data mining and text
analytics, cloud data services.
Data Storage
Access structures. Decision support queries require operations such as filtering, join, and
aggregation. To efficiently support these operations, special data structures (not typically
required for OLTP queries) have been developed in RDBMSs, described here. Access
structures used in specialized OLAP engines that do not use RDBMSs are discussed later.
Index structures. An index enables associative access based on values of a particular column.
When a query has one or more filter conditions, the selectivities of these conditions can be
exploited through index scans (for example, an index on the StoreId column can help retrieve
all sales for StoreId = 23) and index intersection (when multiple conditions exist). These
operations can significantly reduce, and in some cases eliminate, the need to access the base
tables, for example, when the index itself contains all columns required to answer the query.
Bitmap indexes support efficient index operations such as union and intersection. A bitmap index on a column maintains one bit vector per value in the domain of that column, with one bit per record. To
process a query of the form column1 = val1 AND column2 = val2 using bitmap indexes, we
identify the qualifying records by taking the bitwise AND of the respective bit vectors. While
such representations are very effective for low cardinality domains (for example, gender),
they can also be used for higher cardinality domains using bitmap compression.
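The intersection operation can be sketched in a few lines of Python, using integers as bit vectors (bit i set means record i has that value); the two-column table and its values are illustrative:

```python
# Records with (gender, state) attributes; invented sample data.
records = [("F", "CA"), ("M", "CA"), ("F", "NY"), ("F", "CA")]

def build_bitmaps(col):
    # One bit vector (an int) per distinct value of the column.
    bitmaps = {}
    for i, row in enumerate(records):
        v = row[col]
        bitmaps[v] = bitmaps.get(v, 0) | (1 << i)
    return bitmaps

gender, state = build_bitmaps(0), build_bitmaps(1)

# gender = 'F' AND state = 'CA' -> bitwise AND of the two bit vectors.
match = gender["F"] & state["CA"]
qualifying = [i for i in range(len(records)) if match >> i & 1]
print(qualifying)  # [0, 3]
```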
Materialized views. Reporting queries often require summary data, for example, aggregate
sales of the most recent quarter and the current fiscal year. Hence, precomputing and
materializing summary data (also referred to as materialized views) can help dramatically
accelerate many decision support queries. The greatest strength of a materialized view is its
ability to specifically target certain queries by effectively caching their results. However this
very strength also can limit its applicability, that is, for a slightly different query it may not be
possible to use the materialized view to answer that query. This is in contrast to an index,
which is a much more general structure, but whose impact on query performance may not be
as dramatic as a materialized view. Typically, a good physical design contains a judicious
mix of indexes and materialized views.
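A materialized view can be mimicked with a plain summary table, as in this sqlite3 sketch (table names and data are invented; a real RDBMS would also keep the view incrementally up to date as the base table changes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (Region TEXT, Quarter TEXT, Amount REAL)")
conn.executemany("INSERT INTO Sales VALUES (?,?,?)", [
    ("East", "Q1", 100.0), ("East", "Q1", 50.0), ("West", "Q1", 70.0),
])

# "Materialize" the aggregate once; later queries read the small
# summary table instead of scanning the base table.
conn.execute("""
    CREATE TABLE SalesSummary AS
    SELECT Region, Quarter, SUM(Amount) AS Total
    FROM Sales GROUP BY Region, Quarter
""")

total_east = conn.execute(
    "SELECT Total FROM SalesSummary WHERE Region='East' AND Quarter='Q1'"
).fetchone()[0]
print(total_east)  # 150.0
```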
Partitioning. Data partitioning can be used to improve both performance (discussed later) and
manageability. Partitioning allows tables and indexes to be divided into smaller, more
manageable units. Database maintenance operations such as loading and backup can be
performed on partitions rather than an entire table or index. The common types of
partitioning supported today are hash and range. Hybrid schemes that first partition by range
followed by hash partitioning within each range partition are also common.
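The two partitioning schemes can be sketched as follows; the partition count, range boundaries, and integer keys are all invented for illustration:

```python
# Rows keyed by an integer id, e.g. (OrderId, payload).
rows = [(1, "a"), (2, "b"), (101, "c"), (250, "d")]

def hash_partition(rows, n):
    # Assign each row to one of n partitions by hashing its key
    # (identity hash modulo n suffices for these small integers).
    parts = [[] for _ in range(n)]
    for r in rows:
        parts[r[0] % n].append(r)
    return parts

def range_partition(rows, boundaries):
    # boundaries [100, 200] yield ranges (-inf,100), [100,200), [200,inf).
    parts = [[] for _ in range(len(boundaries) + 1)]
    for r in rows:
        idx = sum(r[0] >= b for b in boundaries)
        parts[idx].append(r)
    return parts

hp = hash_partition(rows, 2)
rp = range_partition(rows, [100, 200])
print([len(p) for p in hp], [len(p) for p in rp])  # [2, 2] [2, 1, 1]
```

A hybrid scheme would simply apply `hash_partition` within each range partition.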
Finally, we note that in the past decade, major commercial database systems have added
automated physical design tools that can assist database administrators (DBAs) in choosing
appropriate access structures (see Chaudhuri and Narasayya7 for an overview) based on
workload information, such as queries and updates executed on the system, and constraints,
for example, total storage allotted to access structures.
Data Compression can have significant benefits for large data warehouses. First, compression can reduce the amount of data that needs to be scanned, and hence the I/O cost of the query.
Second, since compression reduces the amount of storage required for a database, it can also
lower storage and backup costs. A third benefit is that compression effectively increases the
amount of data that can be cached in memory since the pages can be kept in compressed
form, and decompressed only on demand. Fourth, certain common query operations (for
example, equality conditions, duplicate elimination) can often be performed on the
compressed data itself without having to decompress the data. Finally, compressing data that
is transferred over the network effectively increases the available network bandwidth. This is
important for parallel DBMSs where data must be moved across nodes. Data compression
plays a key role not just in relational DBMSs, but also in other specialized engines, for
example, in OLAP.
There are different compression techniques used in relational DBMSs. Null suppression
leverages the fact that several commonly used data types in DBMSs are fixed length (for
example, int, bigint, datetime, money), and significant compression is possible if they are
treated as variable length for storage purposes. Only the non-null part of the value is stored
along with the actual length of the value. Dictionary compression identifies repetitive values
in the data and constructs a dictionary that maps such values to more compact
representations. For example, a column that stores the shipping mode for an order may
contain string values such as 'AIR', 'SHIP', 'TRUCK'. Each value can be represented using
two bits by mapping them to values 0,1,2 respectively. Finally, unlike compression schemes
in row-oriented stores where each instance of a value requires an entry (potentially with
fewer bits), in column-oriented stores other compression techniques such as run-length
encoding (RLE) can become more effective. In RLE compression, a sequence of k instances
of value v is encoded by the pair (v,k). RLE is particularly attractive when long runs of the
same value occur; this can happen for columns with relatively few distinct values, or when
the column values are sorted.
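Dictionary compression and RLE can be combined in a short sketch; the shipping-mode column and its code assignment are taken from the example above:

```python
# A low-cardinality column, sorted so that runs form.
column = ["AIR", "AIR", "SHIP", "TRUCK", "TRUCK", "TRUCK"]

# Dictionary compression: map each distinct string to a small code.
codes = {v: i for i, v in enumerate(sorted(set(column)))}  # AIR->0, SHIP->1, TRUCK->2
encoded = [codes[v] for v in column]

def rle(values):
    # Run-length encoding: k consecutive copies of v become (v, k).
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

print(rle(encoded))  # [(0, 2), (1, 1), (2, 3)]
```

Note how sorting the column first is what makes the runs (and hence RLE) effective.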
There are several interesting technical challenges in data compression. First, new
compression techniques suitable for large data warehouses, incurring an acceptable trade-off with decompression and update costs, are important. Second, even for known compression techniques, important open problems remain; for RLE, for example, the choice of sort order of the table can significantly affect the amount of compression possible. Determining the best
sort order to use is a non-trivial optimization problem. Finally, the decision of whether to
compress access structures is workload dependent. Thus, there is a need for automated
physical design tools to also recommend which access structures should be compressed and
how based on workload information.
Query Processing
A popular conceptual model used for BI tasks is the multidimensional view of data, as shown
in Figure 2. In a multidimensional data model, there is a set of numeric measures that are the
objects of analysis. Examples of such measures are sales, budget, revenue, and inventory.
Each of the numeric measures is associated with a set of dimensions, which provide the
context for the measure. For example, the dimensions associated with a sale amount can be
the Product, City, and the Date when the sale was made. Thus, a measure can be viewed as a
value in the multidimensional space of dimensions. Each dimension is described by a set of
attributes, for example, the Product dimension may consist of the following attributes: the
category, industry, model number, year of its introduction. The attributes of a dimension may
be related via a hierarchy of relationships. For example, a product is related to its category
and the industry attributes through a hierarchical relationship (Figure 2). Another distinctive
feature of the conceptual model is its stress on aggregation of measures by one or more
dimensions; for example, computing and ranking the total sales by each county for each year.
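Aggregation of a measure by a dimension attribute can be sketched over a small fact set (the products, cities, and amounts are invented):

```python
# Fact tuples: (product_category, city, year, sales_amount).
facts = [
    ("Laptop", "Seattle", 2024, 1200.0),
    ("Laptop", "Portland", 2024, 800.0),
    ("Phone", "Seattle", 2024, 600.0),
]

def total_by(dim_index):
    # Sum the sales measure, grouping by one dimension attribute.
    totals = {}
    for row in facts:
        key = row[dim_index]
        totals[key] = totals.get(key, 0.0) + row[3]
    return totals

# Compute and rank total sales by product category.
ranked = sorted(total_by(0).items(), key=lambda kv: -kv[1])
print(ranked)  # [('Laptop', 2000.0), ('Phone', 600.0)]
```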
OLAP Servers. Online analytic processing (OLAP) supports operations such as filtering, aggregation, pivoting, rollup, and drill-down on the multidimensional view of the data. OLAP servers are implemented using either a multidimensional storage engine (MOLAP), a relational DBMS engine (ROLAP) as the backend, or a hybrid combination called HOLAP.
MOLAP servers. MOLAP servers directly support the multidimensional view of data through
a storage engine that uses the multidimensional array abstraction. They typically precompute
large data cubes to speed up query processing. Such an approach has the advantage of
excellent indexing properties and fast query response times, but provides relatively poor
storage utilization, especially when the data set is sparse. To better adapt to sparse data sets,
MOLAP servers identify dense and sparse regions of the data, and store/index these regions
differently. For example dense sub-arrays of the cube are identified and stored in array
format, whereas the sparse regions are compressed and stored separately.
ROLAP servers. In ROLAP, the multidimensional model and its operations have to be
mapped into relations and SQL queries. They rely on the data storage techniques described
earlier to speed up relational query processing. ROLAP servers may also need to implement
functionality not supported in SQL, for example, extended aggregate functions such as
median, mode, and time window based moving average. The database designs used in
ROLAP are optimized for efficiency in querying and in loading data. Most ROLAP systems
use a star schema to represent the multidimensional data model. The database consists of a
single fact table and a single table for each dimension. Each row in the fact table consists of a
pointer (a.k.a. foreign key) to each of the dimensions that provide its multidimensional
coordinates, and stores the numeric measures for those coordinates. Each dimension table
consists of columns that correspond to attributes of the dimension. Star schemas do not
explicitly provide support for attribute hierarchies. Snowflake schemas (shown in Figure 3)
provide a refinement of star schemas where the dimensional hierarchy is explicitly
represented by normalizing the dimension tables. This leads to advantages in maintaining the
dimension tables.
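A minimal star-schema join, with one fact table and one dimension table (the schema and rows are invented), can again be sketched with sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product (ProductId INTEGER PRIMARY KEY, Category TEXT);
CREATE TABLE Sales (ProductId INT, CityId INT, Amount REAL);
INSERT INTO Product VALUES (1, 'Laptop'), (2, 'Phone');
INSERT INTO Sales VALUES (1, 10, 500.0), (1, 11, 700.0), (2, 10, 300.0);
""")

# The fact table joins to the dimension via its foreign key, and the
# measure is aggregated by a dimension attribute.
rows = conn.execute("""
    SELECT p.Category, SUM(s.Amount)
    FROM Sales s JOIN Product p ON s.ProductId = p.ProductId
    GROUP BY p.Category ORDER BY p.Category
""").fetchall()
print(rows)  # [('Laptop', 1200.0), ('Phone', 300.0)]
```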
HOLAP servers. The HOLAP architecture combines ROLAP and MOLAP by splitting
storage of data in a MOLAP and a relational store. Splitting the data can be done in different
ways. One method is to store the detailed data in an RDBMS, as ROLAP servers do, and precompute aggregated data in the MOLAP store. Another method is to store more recent data in MOLAP to provide faster access, and older data in ROLAP. Since MOLAP performs better when the data is reasonably dense and ROLAP servers perform better for sparse data, HOLAP servers, like MOLAP servers, perform density analysis to identify sparse and dense sub-regions of the multidimensional space. All major data warehouse vendors today offer OLAP
servers (for example, IBM Cognos,15 Microsoft SQL,17 and Oracle Hyperion23).
In-memory BI engines. Technology trends are providing an opportunity for a new class of
OLAP engines focused on exploiting large main memory to make response times for ad-hoc
queries interactive. First, the ratio of time to access data on disk vs. data in memory is
increasing. Second, with 64-bit operating systems becoming common, very large addressable
memory sizes (for example, 1TB) are possible. Third, the cost of memory has dropped
significantly, which makes servers with large amounts of main memory affordable. Unlike
traditional OLAP servers, in-memory BI engines (for example, QlikView24) rely on a
different set of techniques for achieving good performance. First, since the detailed data is
memory resident they avoid expensive I/Os required to access data cubes, indexes, or
materialized views. Second, they use data structures that would not be suitable for disk-based
access, but are very effective for in-memory access. For example, consider a query that
computes the total sales for each customer in a particular state. When the data is initially
loaded into the system, the engine can associate pointers from each state to customers in that
state, and similarly pointers from a customer to all the order detail records for that customer.
This allows fast associative access required to answer the query quickly, and is reminiscent of
approaches used by object-oriented databases as well as optimizations in traditional DBMSs
such as join indices. Third, in-memory BI engines can significantly increase the effective data
sizes over which they can operate in memory by using data organization techniques such as
column-oriented storage and data compression. Owing to the cost of decompression, in-memory BI engines are best suited for read-mostly data with no in-place updates, where new data arrives primarily in the form of incremental batch inserts.
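The pointer-chasing idea described above can be sketched with ordinary dictionaries (the customers, states, and order amounts are invented):

```python
# Detail records and a customer->state mapping; illustrative data.
orders = [("c1", 100.0), ("c1", 50.0), ("c2", 70.0), ("c3", 40.0)]
customer_state = {"c1": "WA", "c2": "WA", "c3": "OR"}

# Built once at load time: "pointers" from each state to its
# customers, and from each customer to its order detail records.
state_to_customers = {}
for c, s in customer_state.items():
    state_to_customers.setdefault(s, []).append(c)
customer_to_orders = {}
for c, amt in orders:
    customer_to_orders.setdefault(c, []).append(amt)

# Total sales per customer in state 'WA', via direct pointer chasing
# rather than scans or disk-based indexes.
totals = {c: sum(customer_to_orders.get(c, []))
          for c in state_to_customers["WA"]}
print(totals)  # {'c1': 150.0, 'c2': 70.0}
```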
Relational Servers. Relational database servers (RDBMSs) have traditionally served as the
backend of large data warehouses. Such data warehouses need to be able to execute complex
SQL queries as efficiently as possible against very large databases. The first key technology
needed to achieve this is query optimization, which takes a complex query and compiles that
query into an execution plan. To ensure that the execution plan can scale well to large
databases, data partitioning and parallel query processing are leveraged extensively (see
Graefe13 for an overview of query processing techniques). We therefore discuss two pieces of
key technology—query optimization and parallel query processing.
Query optimization technology has been a key enabler for BI. The query optimizer is
responsible for selecting an execution plan for answering a query. The execution plan is a
composition of physical operators (such as Index Scan, Hash Join, Sort) that when evaluated
generates the results of the query. The performance of a query crucially depends on the
ability of the optimizer to choose a good plan from a very large space of alternatives. The
difference in execution time between a good and a bad plan for such complex queries can be several orders of magnitude (for example, days instead of minutes). This topic has been of
keen interest in database research and industry (an overview of the field appears in
Chaudhuri4). Following the pioneering work done in the System R optimizer from IBM
Research in the late 1970s, the next major architectural innovation came about a decade later:
extensible optimizers. This allowed system designers to "plug-in" new rules and extend the
capabilities of the optimizer. For example, a rule could represent equivalence in relational
algebra (for example, pushing down an aggregation below join). Application of such rules
can potentially transform the execution plan into one that executes much faster. Extensible
optimizers allowed many important optimizations developed by industry and research over
the years to be incorporated relatively easily without having to repeatedly modify the search
strategy of the optimizer.
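One classic rewrite rule, pushing a filter below a join, can be illustrated in a few lines; the tables and selectivity are invented, and a real optimizer would of course choose between such plans using cost estimates:

```python
# Two invented tables: R = (key, val), S = (key, payload).
R = [(i, i % 10) for i in range(1000)]
S = [(i, "x") for i in range(1000)]

def join(a, b):
    # Simple hash join on the first attribute.
    index = {k: payload for k, payload in b}
    return [(k, v, index[k]) for k, v in a if k in index]

# Plan 1: join first, filter afterwards (1000 intermediate rows).
plan1 = [t for t in join(R, S) if t[1] == 0]
# Plan 2, after the rewrite rule: filter R first, then join
# (only 100 rows ever reach the join).
plan2 = join([(k, v) for k, v in R if v == 0], S)

print(sorted(plan1) == sorted(plan2), len(plan2))  # True 100
```

Both plans produce the same answer; the rule matters because the second processes far fewer intermediate rows.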
Despite the success of query optimization and the crucial role it plays in BI, many
fundamental challenges still remain. The optimizer needs to address the inherently difficult
problem of estimating the cost of a plan, that is, the total work (CPU, I/O, among others)
done by the plan. However, constrained by the requirement to impose only a small overhead,
the optimizer typically uses limited statistical information such as histograms describing a
column's data distribution. Such approximations sometimes result in brittleness since large
inaccuracies can lead to generation of very poor plans. There has been research in leveraging
feedback from query execution to overcome errors made by the query optimizer by observing
actual query execution behavior (for example, the actual result size of a query expression),
and adjusting the execution plan if needed. However, collecting and exploiting feedback at
low overhead is also challenging, and much more work is needed to realize the benefits of
this approach.
Parallel processing and appliances. Parallelism plays a significant role in processing queries
over massive databases. Relational operators such as selection, projection, join, and
aggregation present many opportunities for parallelism. The basic paradigm is data
parallelism, that is, to apply relational operators in parallel on disjoint subsets of data
(partitions), and then combine the results. The article by Dewitt and Gray10 provides an
overview of work in this area. For several years now, all major vendors of database
management systems have offered data partitioning and parallel query processing technology.
There are two basic architectures for parallelism: shared disk, where each processor has private memory but shares disks with all other processors; and shared nothing, where each processor has private memory and disk, and is typically a low-cost commodity machine.
Interestingly, while these architectures date back about two decades, neither has yet emerged
as a clear winner in the industry and successful implementations of both exist today.
In shared disk systems all nodes have access to the data via shared storage, so there is no
need to a priori partition the data across nodes as in the shared nothing approach. During
query processing, there is no need to move data across nodes. Moreover, load balancing is
relatively simple since any node can service any request. However, there are a couple of
issues that can affect scalability of shared disk systems. First, the nodes need to communicate
in order to ensure data consistency. Typically this is implemented via a distributed lock
manager, which can incur non-trivial overhead. Second, the network must support the
combined I/O bandwidth of all processors, and can become a bottleneck. Shared disk systems
are relatively cost effective for small- to medium-sized data warehouses.
In shared nothing systems (for example, Teradata31) data needs to be distributed across nodes
a priori. They have the potential to scale to much larger data sizes than shared disk systems.
However, the decision of how to effectively distribute the data across nodes is crucial for
performance and scalability. This is important both from the standpoint of leveraging
parallelism, but also to reduce the amount of data that needs to be transferred over the
network during query processing. Two key techniques for data distribution are partitioning
and cloning. For example consider a large database with the schema shown in Figure 3. Each
of the two large fact tables, Orders and OrderDetails, can be hash partitioned across all nodes on the OrderId attribute, that is, on the attribute on which the two tables are
joined. All other dimension tables, which are relatively small, could be cloned (replicated) on
each node. Now consider a query that joins Customers, Orders and OrderDetails. This query
can be processed by issuing one query per node, each operating on a subset of the fact data
and joining with the entire dimension table. As a final step, the results of each of these
queries are sent over the network to a single node that combines them to produce the final
answer to the query.
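The partitioned-and-cloned execution just described can be sketched as a small single-process simulation; the tables, two-node setup, and hash function are all illustrative:

```python
# Cloned (replicated) dimension table, present on every node.
customers = {1: "Alice", 2: "Bob"}
# Fact tables: (OrderId, CustomerId) and (OrderId, Amount).
orders = [(10, 1), (11, 2), (12, 1)]
details = [(10, 5.0), (10, 7.0), (11, 3.0), (12, 2.0)]

# Hash partition both fact tables on OrderId across N nodes, so
# matching rows land on the same node.
N = 2
order_parts = [[o for o in orders if o[0] % N == n] for n in range(N)]
detail_parts = [[d for d in details if d[0] % N == n] for n in range(N)]

def node_query(node):
    # Per-node join of the local fact partitions with the cloned
    # dimension, producing partial per-customer totals.
    local_orders = {oid: cid for oid, cid in order_parts[node]}
    out = {}
    for oid, amt in detail_parts[node]:
        name = customers[local_orders[oid]]
        out[name] = out.get(name, 0.0) + amt
    return out

# Final step: one node combines the partial results.
combined = {}
for node in range(N):
    for name, amt in node_query(node).items():
        combined[name] = combined.get(name, 0.0) + amt
print(combined)  # {'Alice': 14.0, 'Bob': 3.0}
```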
Data warehouse appliances. Recently a new generation of parallel DBMSs referred to as data
warehouse appliances (for example, Netezza19) have appeared. An appliance is an integrated
set of server and storage hardware, operating system and DBMS software specifically pre-
installed and pre-optimized for data warehousing. These appliances have gained impetus
from the following trends. First, since DW appliance vendors control the full hardware/software stack, they can offer a more attractive one-service-call support model. Second,
some appliances push part of the query processing into specialized hardware thereby
speeding up queries. For example, Netezza uses FPGAs (field-programmable gate arrays) to
evaluate selection and projection operators on a table in the storage layer itself. For typical
decision support queries this can significantly reduce the amount of data that needs to be
processed in the DBMS layer.
Data platforms based on the MapReduce paradigm and its variants have attracted strong
interest in the context of the "Big Data" challenge in enterprise analytics, as described in the
introduction. Another factor that makes such platforms attractive is the ability to support
analytics on unstructured data such as text documents (including Web crawls), image and
sensor data by enabling execution of custom Map and Reduce functions in a scalable manner.
Recently, these engines have been extended to support features necessary for enterprise
adoption (for example, Cloudera8). While serious enterprise adoption is still in early stages
compared to mature parallel RDBMS systems, exploration using such platforms is growing
rapidly, aided by the availability of the open source Hadoop14 ecosystem. Driven by the goal
of improving programmer productivity while still exploiting the advantages noted here, there
have been recent efforts to develop engines that can take a SQL-like query, and automatically
compile it to a sequence of jobs on a MapReduce engine (for example, Thusoo et al.32). The
emergence of analytic engines based on MapReduce is having an impact on parallel DBMS
products and research. For example, some parallel DBMS vendors (for example, Aster Data2)
allow invocation of MapReduce functions over data stored in the database as part of a SQL
query. The MapReduce function appears in the query as a table that allows its results to be
composed with other SQL operators in the query. Many other DBMS vendors provide
utilities to move data between MapReduce-based engines and their relational data engines. A
primary use of such a bridge is to ease the movement of structured data distilled from the data
analysis on the MapReduce platform into the SQL system.
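A minimal sketch of how a SQL-like grouped aggregation can be compiled into map and reduce functions; the record layout and function names are invented, and real engines such as Hive generate such jobs over distributed data rather than a local list:

```python
from itertools import groupby

# "SELECT region, SUM(amount) GROUP BY region" as map/reduce functions.
def map_fn(record):
    region, amount = record
    yield (region, amount)

def reduce_fn(key, values):
    yield (key, sum(values))

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase, then a local stand-in for the shuffle/sort phase,
    # then the reduce phase over each key's group.
    pairs = [p for r in records for p in map_fn(r)]
    pairs.sort(key=lambda kv: kv[0])
    out = []
    for k, group in groupby(pairs, key=lambda kv: kv[0]):
        out.extend(reduce_fn(k, [v for _, v in group]))
    return out

data = [("East", 100.0), ("West", 70.0), ("East", 50.0)]
result = run_mapreduce(data, map_fn, reduce_fn)
print(result)  # [('East', 150.0), ('West', 70.0)]
```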
Near Real-Time BI. The competitive pressure of today's businesses has led to the increased
need for near real-time BI. The goal of near real-time BI (also called operational BI or just-
in-time BI) is to reduce the latency between when operational data is acquired and when
analysis over that data is possible. Consider an airline that tracks its most profitable
customers. If a high-value customer faces a lengthy flight delay, proactively alerting the ground staff can help the airline ensure that the customer is rerouted where possible. Such near
real-time decisions can increase customer loyalty and revenue.
A class of systems that enables such real-time BI is Complex Event Processing (CEP)
engines (for example, Streambase29). Businesses can specify the patterns or temporal trends
that they wish to detect over streaming operational data (referred to as events), and take
appropriate actions when those patterns occur. The genesis of CEP engines was in the
financial domain where they were used for applications such as algorithmic stock trading,
which requires detecting patterns over stock ticker data. However, they are now being used in
other domains as well to make decisions in real time, for example, clickstream analysis or
manufacturing process monitoring (for example, over RFID sensor data).
CEP is different from traditional BI since operational data does not need to be first loaded
into a warehouse before it can be analyzed (see Figure 4). Applications define declarative
queries that can contain operations over streaming data such as filtering, windowing,
aggregations, unions, and joins. The arrival of events in the input stream(s) triggers
processing of the query. These are referred to as "standing" or "continuous" queries, since computation may be performed continuously for as long as events arrive in the input stream, until the query is explicitly stopped. In general, there could be multiple queries defined
on the same stream; thus one of the challenges for the CEP engine is to effectively share
computation across queries when possible. These engines also need to handle situations
where the streaming data is delayed, missing, or out of order, which raises both semantic and efficiency challenges.
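A minimal continuous-query sketch, a standing query that maintains a per-key event count over a sliding time window (the event names, timestamps, and window length are invented):

```python
from collections import deque

class WindowedCount:
    def __init__(self, window):
        self.window = window
        self.events = deque()  # (timestamp, key), in arrival order

    def on_event(self, ts, key):
        # Each arriving event triggers incremental recomputation.
        self.events.append((ts, key))
        # Expire events that have fallen outside the window.
        while self.events and self.events[0][0] <= ts - self.window:
            self.events.popleft()
        return sum(1 for _, k in self.events if k == key)

q = WindowedCount(window=10)
q.on_event(1, "login_fail")
q.on_event(4, "login_fail")
count = q.on_event(12, "login_fail")
print(count)  # the event at ts=1 has expired -> 2
```

A real CEP engine would additionally share such state across queries and cope with delayed or out-of-order events.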
There are several open technical problems in CEP; we touch upon a few of them here. One
important challenge is to handle continuous queries that reference data in the database (for
example, the query references a table of customers stored in the database) without affecting
near real-time requirements. The problem of optimizing query plans over streaming data has
several open challenges. In principle, the benefit of an improved execution plan for the query
is unlimited since the query executes "forever." This opens up the possibility of more
thorough optimization than is feasible in a traditional DBMS. Moreover, the ability to
observe execution of operators in the execution plan over an extended period of time can be
potentially valuable in identifying suboptimal plans. Finally, the increasing importance of
real-time analytics implies that many traditional data mining techniques may need to be
revisited in the context of streaming data. For example, algorithms that require multiple
passes over the data are no longer feasible for streaming data.
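Reservoir sampling is a classic example of the one-pass style such revisited algorithms must adopt: it maintains a uniform random sample of a stream without a second pass or storing the stream. The sketch below uses a fixed seed purely for reproducibility.

```python
import random

# One-pass reservoir sampling: after seeing n items, each item has
# probability k/n of being in the sample.
def reservoir_sample(stream, k, seed=0):
    rng = random.Random(seed)
    sample = []
    for n, item in enumerate(stream, 1):
        if n <= k:
            sample.append(item)
        else:
            j = rng.randrange(n)   # keep the new item with probability k/n
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1000), 5))
```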
Enterprise Search
BI tasks often require searching over different types of data within the enterprise. For
example, a salesperson who is preparing for a meeting with a customer would like to know
relevant customer information before the meeting. This information is today siloed into
different sources: CRM databases, email, documents, and spreadsheets, both in enterprise
servers as well as on the user's desktop. Increasingly, a large amount of valuable data is
present in the form of text, for example, product catalogs, customer emails, annotations by
sales representatives in databases, survey responses, blogs and reviews. In such scenarios, the
ability to retrieve and rank the required information using the keyword search paradigm is
valuable for BI. Enterprise search focuses on supporting the familiar keyword search
paradigm over text repositories and structured enterprise data. These engines typically exploit
structured data to enable faceted search. For example, they might enable filtering and sorting
over structured attributes of documents in the search results such as authors, last modification
date, document type, companies (or other entities of interest) referenced in documents.
Today, a number of vendors (for example, FAST Engine Search11 and Google Search
Appliance12) provide enterprise search capability.
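As a rough sketch of faceted filtering over structured attributes, consider the following Python fragment; the document fields and tiny corpus are invented for illustration.

```python
# Documents carry structured attributes (author, type) alongside text;
# a query combines a keyword match with facet filters on those attributes.
docs = [
    {"author": "kim", "type": "pdf",  "text": "quarterly sales report"},
    {"author": "lee", "type": "docx", "text": "sales forecast draft"},
    {"author": "kim", "type": "docx", "text": "customer churn analysis"},
]

def search(keyword, **facets):
    hits = [d for d in docs if keyword in d["text"]]
    for field, value in facets.items():
        hits = [d for d in hits if d[field] == value]
    return hits

print(search("sales"))                 # two keyword hits
print(search("sales", author="kim"))   # narrowed by the author facet
```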
A popular architecture for enterprise search engines is the integrated model, shown in Figure
5. The search engine crawls each data source and stores the data into a central content index
using an internal representation that is suitable for fast querying. The configuration data
controls what objects to index (for example, a crawl query that returns objects from a
database) as well as what objects to return in response to a user query (for example, a serve
query to run against the database when the query keywords match a crawled object). Several
technical challenges need to be addressed by enterprise search engines. First, crawling relies
on the availability of appropriate adapters for each source. Achieving a high degree of data
freshness requires specialized adapters that can efficiently identify and extract data changes at
the source. Second, ranking results across data sources is non-trivial since there may be no
easy way to compare relevance across sources. Unlike ranking in Web search, links across
documents in an enterprise are much sparser and thus not as reliable a signal. Similarly, query
logs and click-through information are typically not available at sufficient scale to be useful
for ranking. Finally, deploying enterprise search can involve manually tuning the relevance,
for example, by adjusting the weight of each source.
Extract-Transform-Load Tools
The accuracy and timeliness of reporting, ad hoc queries, and predictive analysis depends on
being able to efficiently get high-quality data into the data warehouse from operational
databases and external data sources. Extract-Transform-Load (ETL) refers to a collection of
tools that play a crucial role in helping discover and correct data quality issues and efficiently
load large volumes of data into the warehouse.
Data quality. When data from one or more sources is loaded into the warehouse, there may
be errors (for example, a data entry error may lead to a record with State = 'California' and
Country = 'Canada'), inconsistent representations for the same value (for example, 'CA',
'California'), and missing information in the data. Therefore, tools that help detect data
quality issues and restore data integrity in the warehouse can have a high payoff for BI. Data
profiling tools enable identification of data quality issues by detecting violations of properties
that are expected to hold in the data. For example, consider a database of customer names and
addresses. In a clean database, we might expect that (Name, Address) combinations are
unique. Data profiling tools verify whether this uniqueness property holds, and can quantify
the degree to which it is violated, for example, this might happen if Name or Address
information is missing. Data profiling tools can also discover rules or properties that hold in a
given database. For example, consider an external data source that needs to be imported into
a data warehouse. It is important to know which columns (or sets of columns) are keys
(unique) for the source. This can help in matching the incoming data against existing data in
the warehouse. For efficiency, these tools often use techniques such as sampling when
profiling large databases.
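The uniqueness check described above can be sketched as follows; the sample rows are made up, and real profiling tools would run such checks at scale, often over samples.

```python
from collections import Counter

# Profiling sketch: verify that (Name, Address) combinations are unique
# and quantify the degree to which the property is violated.
rows = [
    ("Ann Roe", "12 Elm St"),
    ("Bob Fox", "9 Oak Ave"),
    ("Ann Roe", "12 Elm St"),   # duplicate combination
    ("Cal Day", None),          # missing address
]

counts = Counter(rows)
duplicates = sum(c - 1 for c in counts.values() if c > 1)
missing = sum(1 for name, addr in rows if name is None or addr is None)
violation_rate = (duplicates + missing) / len(rows)
print(f"duplicates={duplicates} missing={missing} rate={violation_rate:.2f}")
```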
Accurately extracting structure from a string can play an important role in improving data
quality in the warehouse. For example, consider a shopping Web site that stores MP3 player
product data with attributes such as Manufacturer, Brand, Model, Color, Storage Capacity
and receives a data feed for a product as text, for example, "Coby MP3 512MB MP-C756 –
Blue." Being able to robustly parse the structured information present in the text into the
appropriate attributes in the data warehouse is important, for example, for answering queries
on the Web site. Vendors have developed extensive sets of parsing rules for important
verticals such as products and addresses. The survey article by Sarawagi28 discusses
techniques in the broader area of information extraction.
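As an illustration of such a parsing rule, the following regular expression extracts the attributes from the example feed string; the pattern is an assumption tailored to this one format, whereas commercial ETL suites ship large rule sets per vertical.

```python
import re

# Hypothetical parsing rule for one product feed format:
# "<Manufacturer> MP3 <Storage> <Model> – <Color>"
PATTERN = re.compile(
    r"(?P<Manufacturer>\w+)\s+MP3\s+(?P<Storage>\d+[MG]B)\s+"
    r"(?P<Model>[\w-]+)\s+[–-]\s+(?P<Color>\w+)"
)

def parse_feed(text):
    m = PATTERN.search(text)
    return m.groupdict() if m else None

rec = parse_feed("Coby MP3 512MB MP-C756 – Blue")
# rec == {'Manufacturer': 'Coby', 'Storage': '512MB',
#         'Model': 'MP-C756', 'Color': 'Blue'}
```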
Another important technology that can help improve data quality is de-duplication:
identifying groups of approximately duplicate entities (for example, customers). This can be
viewed as a graph clustering problem where each node is an entity and an edge exists
between two nodes if the degree of similarity between two entities is sufficiently high. The
function that defines the degree of similarity between two entities is typically based on string
similarity functions such as edit distance (for example, 'Robert' and 'Robet' have an edit
distance of 1) as well as domain-specific rules (for example, 'Bob' and 'Robert' are
synonymous). Thus, the ability to efficiently perform such approximate string matching
across many pairs of entities (also known as fuzzy matching) is important for de-duplication.
Most major vendors support fuzzy matching and de-duplication as part of their ETL suite of
tools. An overview of tools for merging data from different sources can be found in
Bernstein.3
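A minimal sketch of fuzzy matching combines an edit-distance computation with domain-specific synonym rules; the distance threshold and the synonym table below are illustrative assumptions.

```python
# Classic dynamic-programming edit distance plus a domain rule, as used
# when deciding whether two entities are approximate duplicates.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

SYNONYMS = {("bob", "robert"), ("robert", "bob")}

def similar(a, b, max_dist=1):
    a, b = a.lower(), b.lower()
    return (a, b) in SYNONYMS or edit_distance(a, b) <= max_dist

print(edit_distance("Robert", "Robet"))  # 1
print(similar("Bob", "Robert"))          # True, via the domain rule
```

In de-duplication, a function like `similar` supplies the edges of the graph whose clusters are the duplicate groups.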
Data load and refresh. Data load and refresh utilities are responsible for moving data from
operational databases and external sources into the data warehouse quickly and with as little
performance impact as possible at both ends. There are two major challenges. First, there is a
need to efficiently capture data at the sources, that is, identify and collect data to be moved to
the data warehouse. Triggers are general-purpose constructs supported by SQL that allow
rows modified by an insert/update SQL statement to be identified. However, triggers are a
relatively heavyweight mechanism and can impose non-trivial overheads on the operational
database running OLTP queries. A more efficient way of capturing changed data is to sniff
the transaction log of the database. The transaction log is used by the database system to
record all changes so that the system can recover in case of a crash. Some utilities allow
pushing filters when processing transaction log records, so that only relevant changed data is
captured; for example, only changed data pertaining to a particular department within the
organization.
The second aspect relates to techniques for efficiently moving captured data into the
warehouse. Over the years, database engines have developed specialized, performance
optimized APIs for bulk-loading data rather than using standard SQL. Partitioning the data at
the warehouse helps minimize disruption of queries at the data warehouse server. The data is
loaded into a partition, which is then switched in using a metadata operation only. This way,
queries referencing that table are blocked only for a very short duration required for the
metadata operation rather than during the entire load time. Finally, load utilities also typically
checkpoint the operation so that in case of a failure the entire work does not need to be
redone. Using the techniques discussed above for capturing changed data and efficient
loading, utilities today can achieve refresh latencies of a few seconds (for example, Oracle
GoldenGate22). Thus, it is potentially possible to even serve some near real-time BI
scenarios, as discussed earlier.
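The checkpointing idea can be sketched as follows; the checkpoint file format and batch size are arbitrary illustrative choices, not any utility's actual mechanism.

```python
import json, os, tempfile

# Checkpointed load sketch: rows are applied in batches and the position of
# the last completed batch is recorded, so a failed load resumes where it
# left off instead of redoing the entire work.
def load(rows, batch_size, checkpoint_path, apply_batch):
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_row"]   # resume after a crash
    for i in range(start, len(rows), batch_size):
        apply_batch(rows[i:i + batch_size])
        with open(checkpoint_path, "w") as f:
            json.dump({"next_row": i + batch_size}, f)
    os.remove(checkpoint_path)   # load completed; no checkpoint needed

loaded = []
ckpt = os.path.join(tempfile.mkdtemp(), "load.ckpt")
load(list(range(10)), 3, ckpt, loaded.extend)
# A crash mid-way would leave the checkpoint file behind, and the next
# call would skip the batches that were already applied.
```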
Other BI Technology
Here, we discuss two areas we think are becoming increasingly important and where research
plays a key role.
Data Mining and Text Analytics. Data mining enables in-depth analysis of data including
the ability to build predictive models. The set of algorithms offered by data mining go well
beyond what is offered as aggregate functions in relational DBMSs and in OLAP servers.
Such analysis includes decision trees, market basket analysis, linear and logistic regression,
neural networks, and more (see the survey6). Traditionally, data mining technology has been
packaged separately by statistical software companies, for example, SAS,26 and SPSS.27 The
approach is to select a subset of data from the data warehouse, perform sophisticated data
analysis on the selected subset of data to identify key statistical characteristics, and to then
build predictive models. Finally, these predictive models are deployed in the operational
database. For example, once a robust model to offer a room upgrade to a customer has been
identified, the model (such as a decision tree) must be integrated back in the operational
database to be actionable. This approach leads to several challenges: data movement from
warehouse to the data mining engine, and potential performance and scalability issues at the
mining engine (or implied limitations on the amount of data used to build a model). To be
practical, such models need to be efficient to apply when new data arrives. Increasingly, the
trend is toward "in-database analytics," that is, integrating the data mining functionality in the
backend data-warehouse architecture so that these limitations may be overcome (for example,
Netz et al.20 and Oracle Data Mining21).
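To illustrate why deployed models must be efficient to apply to new data, here is a toy hand-written decision tree of the kind a mining tool might produce for the room-upgrade example, scored row by row; the rule and its fields are invented.

```python
# A mined decision tree, once deployed in the operational database, is
# just a cheap function applied to each arriving customer record.
def offer_upgrade(customer):
    # Frequent guests always get the upgrade offer; others only
    # when the hotel is lightly booked.
    if customer["stays_per_year"] >= 10:
        return True
    return customer["occupancy"] < 0.5

print(offer_upgrade({"stays_per_year": 12, "occupancy": 0.9}))  # True
print(offer_upgrade({"stays_per_year": 2,  "occupancy": 0.9}))  # False
```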
Text analytics. Consider a company making portable music players that conducts a survey of
its products. While many survey questions are structured (for example, demographic
information), other open-ended survey questions (for example, "Enter other comments here")
are often free text. Based on such survey responses, the company would like to answer
questions such as: Which products are referenced in the survey responses? What topics about
the product are people mentioning? In these scenarios, the challenge is to reduce the human
cost of having to read through large amounts of text data such as surveys, Web documents,
blogs, and social media sites in order to extract structured information necessary to answer
these queries. This is the key value of text analytic engines. Today's text analysis engines (for
example, FAST11 and SAS26) primarily extract structured data that can be broadly
categorized as follows. Named entities are references to known objects such as locations, people,
products, and organizations. Concepts/topics are terms in the documents that are frequently
referenced in a collection of documents. For example, in the above scenario of portable music
players, terms such as "battery life," "appearance," and "accessories" may be important
concepts/topics that appear in the survey. Such information can potentially be used as a basis
for categorizing the results of the survey. Sentiment analysis associates labels such as
"positive," "neutral," or "negative" with each text document (or part of a document such as a
sentence). This analysis can help answer questions such as which product received the most
negative feedback.
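A minimal lexicon-based sentiment labeler of the kind described can be sketched as follows; the word lists are tiny illustrative assumptions, and real engines use far richer models.

```python
# Lexicon-based sentiment: count positive and negative cue words and
# label the text by the sign of the difference.
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"poor", "broken", "terrible"}

def sentiment(text):
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("battery life is excellent"))  # positive
print(sentiment("the case arrived broken"))    # negative
```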
Cloud Data Services. Managing enterprise BI today requires handling tasks such as
hardware provisioning, availability, and security patching. Cloud virtualization technology
(for example, Amazon EC21) allows a server to be hosted in the cloud in a virtual machine,
and enables server consolidation through better utilization of hardware resources. Hosted
servers also offer the promise of reduced cost by offloading manageability tasks, and
leveraging the pay-as-you-go pricing model to only pay for services that are actually used.
The success of hardware virtualization in the cloud has prompted database vendors to
virtualize data services so as to further improve resource utilization and reduce cost. These
data services initially started as simple key-value stores but have now begun to support the
functionality of a single node relational database as a hosted service (for example, Microsoft
SQL Azure18). While the primary initial users of such cloud database services are relatively
simple departmental applications (OLTP), the paradigm is being extended to BI as well (for
example, Pentaho25).
The need for the full range of BI services over the data collected by these applications raises
new challenges for cloud database services. First, the performance and scale requirements of
large reporting or ad hoc queries will require database service providers to support a
massively parallel processing system (parallel DBMS and/or MapReduce-based engine) in
the cloud. Second, these services are multi-tenant, and complex SQL queries can be resource
intensive. Thus, the ability to provide performance Service Level Agreements (SLAs) to
tenants and judiciously allocate system resources across tenant queries becomes important.
Third, many of the technical challenges of traditional "in-house" BI, such as security and
fine-grained access control, become amplified in the context of cloud data services. For example,
techniques for processing queries on encrypted data become more important in public clouds.
For these reasons, an intermediate step in adoption of BI technologies may be in private
clouds, which hold promise similar to public clouds but with more control over aspects such
as security.
Conclusion
The landscape of BI in research and industry is vibrant today. Data acquisition is becoming
easier, and large data warehouses with tens to hundreds of terabytes or more of relational data are
becoming common. Text data is also being exploited as a valuable source for BI. Changes in
the hardware technology such as decreasing cost of main memory are impacting how the
backend of large data-warehouses are architected. Moreover, as cloud data services take root,
more changes in the BI backend architecture are expected. Finally, there is increasing
demand to deliver interactive BI experiences on mobile devices for today's knowledge
workers. There are ample opportunities to enable novel, rich, and interactive BI applications
on the next generation of mobile devices. Thus, business intelligence software has many
exciting technical challenges and opportunities still ahead that will continue to reshape its
landscape.
1. Boost productivity. Through traditional data gathering methods, users need to compile and
analyze data and write related reports. This can be incredibly time consuming, especially for
small businesses that may not have the employees to do it. With a BI program, you can pull
data and create the reports at the click of a button, freeing up time and resources and
allowing employees to be more productive on their own tasks.
2. To gain sales & market intelligence. Whether you are a sales person or a marketer, you
probably like to keep track of your customers – probably using a Customer Relationship
Management (CRM) solution to help you. A CRM solution collects the data about your
customers and tries to make sense of it, presents it to you in various tables and charts. That
may include the entire sales cycle, from winning new customers, to servicing and tracking
existing customers, to providing post-sales services.
3. Results closer to established goals. How many times have you established a set of
seemingly reasonable goals for a project or other venture, only to end up with an outcome
that is drastically different? With BI you can keep track of information, and have it accessible
by parties that need it, when they need it. BI goes a long way in helping achieving what you
aim for.
5. Gain insights into consumer behaviour. One of the main advantages of investing in BI
and skilled personnel is the fact that it will boost your ability to analyze the current consumer
buying trends. Once you understand what your consumers are buying, you can use this
information to develop products that match the current consumption trends and consequently
improve your profitability.
6. To improve visibility. If you want to improve your control over various important
processes in your organization, BI will improve the visibility of these processes and make it
possible to identify any areas that need improvement.
7. To turn data into actionable information. A BI system is an analytical tool that can give
you the insight you need to make successful strategic plans for your organization. This is
because such a system would be able to identify key trends and patterns in your
organization’s data and consequently make it easier for you to make important connections
between different areas of your business that may otherwise seem unrelated.
These are just seven of the benefits your company can realize through implementing a
Business Intelligence system, and for 7 steps to generating your own BI reports, request a
copy of our eBook: 7 Steps To Superior Business Intelligence.
Business Intelligence (BI) is a concept that usually involves the delivery and integration of
relevant and useful business information in an organization. Companies use BI to detect
significant events and identify/monitor business trends in order to adapt quickly to their
changing environment. If you use effective business intelligence training in
your organization, you can improve the decision making processes at all levels of
management and improve your tactical and strategic management processes.
Here are some of the main reasons for investing in a good BI strategy and system.
To Improve Visibility:
Is stock going missing and you can’t identify where it’s going?
Is there an error in your accounts and you need to drill down to invoicing or billing quickly to
make your P&L balance – can you do this in a few clicks or do you need to re-run a report
that takes forever?
If you want to improve your control over various important processes in your organization,
you should consider investing in a good BI system. This will improve the visibility of these
processes and make it possible to identify any areas that need improvement. Moreover, if you
currently have to skim through hundreds of pages in your detailed periodic reports to assess
the performance of your organisation’s processes, you can save time and improve
productivity by having skilled intelligence analysts using relevant BI software. Good BI
should be at the heart of every organisation as it can provide increased control. Visibility is
essential.
A BI system is an analytical tool that can give you the insight you need to make successful
strategic plans for your organization. This is because such a system would be able to identify
key trends and patterns in your organisation’s data and consequently make it easier for you to
make important connections between different areas of your business that may otherwise
seem unrelated. As such, a BI system can help you understand the implications of various
organisational processes better and enhance your ability to identify suitable opportunities
for your organization, thus enabling you to plan for a successful future.
To Improve Efficiency:
One of the most important reasons why you need to invest in an effective BI system is
because such a system can improve efficiency within your organisation and, as a result,
increase productivity. You can use business intelligence to share information across different
departments in your organization, saving time on reporting processes and analytics. This ease
of information sharing is likely to reduce duplication of roles/duties within the organization
and improve the accuracy and usefulness of the data generated by different departments.
Whether you are a sales person or a marketer, you probably like to keep track of your
customers – probably using a CRM to help you. CRM stands for Customer Relationship
Management. It refers to software that handles all aspects of an organization’s interactions
with its customers. In other words, it collects the data about your customer and tries to make
sense of it, presents it to you in various tables and charts. That may include the entire sales
cycle, from winning new customers, to servicing and tracking existing customers, to
providing post-sales services. CRM systems are now more involved in decision-support
processes than ever before. In our next articles, we will look at how BI can help increase your
sales efficiently and gain a further insight into your current market as well as market entry
support.
BI can also be used to gain an insight into what your competitors are doing. This strengthens
your company’s ability to make decisions and plan for the future. We will look further into
this area of expertise in our next articles.
Conclusion
In order to reap all the benefits of an effective business intelligence system and strategy for
your company, we recommend you invest in skilled BI personnel as well as training, and
software designed for analytical efficiency and accessibility. You should also make sure that
the system and strategy you choose can analyze both the content and context of data. If you
require any training or consulting in Business Intelligence, then please contact us.
If you’re looking around for Business Intelligence (BI) solutions, you’ll certainly find plenty
of options. Some are flashy. Others are plain. Some are expensive. Others are free. So how
do you choose which one is right for you?
To answer that question, let's get down to the absolute basics of BI. A good BI solution will
provide simple access to any aspect of your data whenever necessary–ultimately leading to
well-informed business decisions.
Now, that’s easy to say, but what does that look like in real-life? How can you know whether
or not a BI solution will fit that description before you buy? What features or elements do
you look for?
To help you answer that question, let’s take a look at a full-featured BI application, and then
highlight each feature and explain why it’s important. Sound good? Okay, let’s get started.
Now, that application has a ton of useful features, which may not be obvious at first
glance. Let’s go through each feature and explain what it does and why it’s an essential part
of a BI application.
1. Automatic Drill-downs
A good BI application gives you a broad view of your data
with the option to zoom in and out as needed. The automatic drill-downs found on each row
in the demo serve this purpose, starting users out with a high level view of their data and
letting them drill down to the most minute details, all within the same application.
2. Report Filters
3. Smart Selections
4. Interactive Graphs
Interactive graphs make Business Intelligence apps more intuitive for
end users. Users can click on any aspect of the graph to drill down for more information.
Additionally, you’ll notice from the demo that the graphs automatically adjust when filters
are selected.
5. Automatic Execution
6. Import Applications
To get a complete view of your data, it’s often necessary to embed data or charts
from other applications. A good Business Intelligence application will make this process
simple, letting you seamlessly pull all necessary data into one BI application. For instance,
while it’s not obvious, the second graph in the demo application is actually imported from
another application.
7. Saved Searches
The ability to save your searches is a real time-saving BI feature, especially for those
who frequently sort and filter by the same criteria. Rather than making the same selections
every time, saved searches let you store that criteria for easy access in the future.
8. Mobile versions
As smartphones and tablets become more prevalent, BI applications must be available
on multiple device types. In the example above, the application will automatically display a
different presentation layer depending on the device used.
Conclusion
Choosing between BI solutions can be difficult if you don’t know what to look for. While the
right BI solution largely depends on your business, the features listed above should provide
you with a good checklist of basic areas to look for before choosing the right solution. If
you’d like to read more on the subject, here’s a list of a few essential application types that
every good Business Intelligence solution should include.
Data sourcing:
Business Intelligence is about extracting information from multiple sources of data. The data
might be: text documents - e.g. memos or reports or email messages; photographs and
images; sounds; formatted tables; web pages and URL lists. The key to data sourcing is to
obtain the information in electronic form. So typical sources of data might include: scanners;
digital cameras; database queries; web searches; computer file access etc.
Data analysis:
Situation awareness:
Business Intelligence is about filtering out irrelevant information, and setting the remaining
information in the context of the business and its environment. The user needs the key items
of information relevant to his or her needs, and summaries that are syntheses of all the
relevant data (market forces, government policy etc.). Situation awareness is the grasp of the
context in which to understand and make decisions. Algorithms for situation assessment
provide such syntheses automatically.
Risk assessment:
Decision support:
Business Intelligence is about using information wisely. It aims to warn you of important
events, such as takeovers, market changes, and poor staff performance, so that you can take
preventive steps. It seeks to help you analyze and make better business decisions,
to improve sales or customer satisfaction or staff morale. It presents the information you
need, when you need it.
Some Definitions:
Transactional data is the information stored to track the interactions, or business transactions, carried
out by an organization. (Brian Larson, 2008)
Online transaction processing (OLTP) systems record business interactions as they happen.
They support the day-to-day operation of an organization. (Brian Larson, 2008)
Data Mart
A data mart is a body of historical data in an electronic repository that does not participate in
the daily operations of the organization. Instead, this data is used to create business
intelligence. The data in the data mart usually applies to a specific area of the organization.
(Brian Larson, 2008)
Using organizational OLTP systems directly as a source for BI can lead to a number of
problems. To avoid them, we take the information stored in these OLTP systems and move it
into a different data store, so the data is available for BI needs outside of the OLTP systems.
When data is stored in this manner, it is referred to as a Data Mart. (Data copied from OLTP
systems periodically and written to the data mart is known as a data load.)
When designing a data mart, the rules of normalization are replaced by a different method of
design organized around "facts". These new design approaches are called stars and
snowflakes.
The data used for BI can be divided into four categories: measures, dimensions, attributes,
and hierarchies. These four types of data help us to define the structure of data mart.
Measure:
Measures form the basis of Business Intelligence. They are the basic building blocks for
effective decision making.
Measures are the facts used for analysis. Therefore, the tables that hold measure
information are known as fact tables.
Distributive: if the result derived by applying the function to n aggregate values is the same
as that derived by applying the function on all the data without partitioning, e.g., count(),
sum(), min(), max().
Holistic: if there is no constant bound on the storage size needed to describe a subaggregate,
e.g., median(), mode(), rank().
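The distinction can be checked directly: a distributive function such as sum() computed over partitions agrees with sum() over all the data, while a median of partition medians generally does not. A small Python sketch with made-up data:

```python
import statistics

# Distributive vs. holistic aggregates, demonstrated on two partitions.
data = [1, 2, 3, 4, 5, 100]
parts = [data[:2], data[2:]]

# Distributive: summing partition sums equals summing all the data.
assert sum(sum(p) for p in parts) == sum(data)

# Holistic: the median of partition medians differs from the true median.
median_of_medians = statistics.median(statistics.median(p) for p in parts)
print(median_of_medians, statistics.median(data))   # 3.0 versus 3.5
```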
Dimension:
Dimensions provide the context for measures: they are the categories, such as time, product,
or geography, by which measures are sliced and grouped during analysis.
Measures and dimensions are stored in a data mart in one of two layouts/schemas: the star
schema and the snowflake schema.
Star Schema
A star schema is a relational database schema used to hold measures and dimensions in a data
mart. The measures are stored in a fact table and the dimensions are stored in dimension tables.
(Brian Larson, 2008)
Star schema uses two types of tables: fact tables and dimension tables.
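A minimal star schema can be sketched with an in-memory SQLite database; the table and column names are illustrative.

```python
import sqlite3

# A Sales fact table holding a measure (amount) plus a foreign key into
# a Product dimension table, i.e., the two table types of a star schema.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE DimProduct (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE FactSales  (product_id INTEGER REFERENCES DimProduct,
                             amount REAL);
    INSERT INTO DimProduct VALUES (1, 'player'), (2, 'case');
    INSERT INTO FactSales  VALUES (1, 99.0), (1, 120.0), (2, 15.0);
""")
# Aggregate the measure, grouped by a dimension attribute.
rows = con.execute("""
    SELECT d.name, SUM(f.amount)
    FROM FactSales f JOIN DimProduct d USING (product_id)
    GROUP BY d.name ORDER BY d.name
""").fetchall()
print(rows)   # [('case', 15.0), ('player', 219.0)]
```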
Attribute:
Attributes are the additional information about the dimension members in a data mart. They
are also used to store the information that may be used to limit or filter the records selected
from the data mart during the data analysis.
Hierarchy:
A dimension is often part of a larger structure with many levels. This structure is known as a
hierarchy.
Snowflake Schema
A snowflake schema retains all the advantages of a good relational design. It doesn't result in
duplicate data and is, therefore, easier to maintain. Its main disadvantage is that it requires a
number of table joins when aggregating measures at upper levels of the hierarchy; in larger
data marts this can lead to performance problems.
Data Warehouse
Ralph Kimball defines the Data Warehouse as "the conglomeration of an organization's data
warehouse staging and presentation areas, where operational data is specifically structured for
query and analysis performance and ease-of-use."
In simple words, Kimball said "A data warehouse is a central repository for all or
significant parts of the data that an enterprise's various business systems collect."
A data warehouse is a copy of transaction data specifically structured for query and analysis.
• It is subject oriented, since all the data should be related to a specific subject instead
of to the organization's ongoing operations.
• It is time variant, due to the fact that every record stored has been accurate at one
moment in time.
Enterprise warehouse
a warehouse that collects all of the information about subjects spanning the entire
organization
Data Mart
a subset of data warehouse that is of value to a specific groups of users. Its scope is confined
to specific, selected groups, such as marketing data mart
Virtual warehouse
a set of views over operational databases, of which only selected summary views may be
materialized for efficient query processing
Information processing
This supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts
and graphs
Analytical processing
This supports basic OLAP operations, including slice-and-dice, drill-down, roll-up, and
pivoting
Data mining
Data mining is the analysis of data with the intent to discover gems of hidden information in
the vast quantity of data that has been captured in the normal course of running the business
(Moss, 2003).
Hence, from the above definition, we can say that data mining discovers knowledge hidden in
the data.
A data warehouse must make an organization's information easily accessible and consistent.
Ralph Kimball and Bill Inmon formed the two different approaches to data warehouse design.
These two approaches are the top-down and bottom-up approaches, or a combination of both.
In the top-down approach, the data warehouse is built first and then the data marts.
Here the data warehouse is designed based on the normalized, enterprise-wide data model.
In this approach the data warehouse acts as a single repository that feeds data into the data marts.
Advantages:
The top-down design approach exhibits highly consistent dimensional views of data across
data marts as the data marts are loaded from the centralized repository.
It is easy to create a data mart against the data stored in the data warehouse.
Disadvantages:
It represents a very large project with a broad scope, and it is therefore time consuming and
expensive to build.
In the bottom-up approach, the data marts are created first and then the data warehouse.
It starts with one data mart, but more data marts can be added later on.
Advantages:
This approach is easier and faster to implement as one needs to deal just with smaller subject
areas in the beginning.
Disadvantages:
In this approach, there is a long term risk of inconsistencies due to the use of multiple data
marts.
The hybrid approach tries to mix the best of both the top-down and bottom-up approaches:
First build out a few data marts that are mutually exclusive and critical.
Then build the enterprise model and move the data to the data warehouse.
Business Process:
The most important thing in the business process is to identify the business requirements of a
company and to analyze them thoroughly.
Granularity:
Granularity is the level of detail captured in the data warehouse. The more detail, the
higher the granularity, and vice versa.
Fact table:
It is similar to the transaction table in an OLTP system. It stores the facts or measures of the
business. E.g.: SALES, ORDERS
Hence it contains the metrics resulting from a business process or measurement event, such
as the sales ordering process or service call event.
In addition to the measurements, the only other things a fact table contains are foreign keys
for the dimension tables.
Dimension table:
It is similar to the master table in an OLTP system. It stores the textual descriptors of the
business. E.g.: CUSTOMER, PRODUCT
Hence it contains the descriptive attributes and characteristics associated with specific
events, such as the customer, product, or sales representative associated with an order being
placed.
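A minimal sketch of these two table types, using a hypothetical SALES fact with CUSTOMER and PRODUCT dimensions (SQLite syntax; names and values are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Dimension tables hold the textual descriptors of the business.
CREATE TABLE customer (customer_key INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE product  (product_key INTEGER PRIMARY KEY, name TEXT, brand TEXT);

-- The fact table holds only foreign keys to the dimensions plus numeric measures.
CREATE TABLE sales_fact (
    customer_key INTEGER REFERENCES customer(customer_key),
    product_key  INTEGER REFERENCES product(product_key),
    quantity     INTEGER,
    amount       REAL
);

INSERT INTO customer VALUES (1, 'HAL', 'Bangalore');
INSERT INTO product  VALUES (1, 'Widget', 'Acme');
INSERT INTO sales_fact VALUES (1, 1, 3, 29.97);
""")

# Measures are aggregated from the fact table.
total = con.execute("SELECT SUM(amount) FROM sales_fact").fetchone()[0]
print(total)  # 29.97
```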
Data Integration Process Design
Extract, transform, and load (ETL) is the core process of data integration and is typically
associated with data warehousing. ETL tools are employed to populate the data
warehouse with up-to-date records extracted from source systems; therefore, they are useful in
organizing the steps of the whole process as a workflow. There are some prominent tasks
carried out under this workflow, which include: (i) the identification of the relevant
carried out under this work flow which include: (i) the identification of the relevant
information at the source side; (ii) the extraction of this information; (iii) the transportation of
this information to the Data Staging Area (DSA), where most of the transformation takes
place usually; (iv) the transformation (i.e., customization and integration) of the information
extracted from the multiple sources into a common format; (v) the cleansing of the resulting
data set, on the basis of the database and business rules; and (vi) the propagation and loading
of the data to the data warehouse and the refreshment of data marts.
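The six tasks can be sketched as a minimal workflow. The function names, record layout, and business rule below are illustrative assumptions, not part of any particular ETL tool:

```python
# Minimal ETL workflow sketch; names, fields, and rules are hypothetical.

def extract(source_rows):
    """(i)-(iii) Identify relevant records and pull them into the staging area."""
    return [r for r in source_rows if r.get("amount") is not None]

def transform(staged_rows):
    """(iv) Customize and integrate records into one common format."""
    return [{"customer": r["cust"].strip().upper(), "amount": float(r["amount"])}
            for r in staged_rows]

def cleanse(rows):
    """(v) Apply a business rule, e.g. drop non-positive amounts."""
    return [r for r in rows if r["amount"] > 0]

def load(rows, warehouse):
    """(vi) Propagate the cleansed data into the warehouse."""
    warehouse.extend(rows)
    return warehouse

source = [{"cust": " hal ", "amount": "100.5"},
          {"cust": "acme", "amount": None},
          {"cust": "zeta", "amount": "-3"}]
warehouse = []
load(cleanse(transform(extract(source))), warehouse)
print(warehouse)  # [{'customer': 'HAL', 'amount': 100.5}]
```

A real ETL tool adds scheduling, logging, and restartability around essentially this pipeline shape.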
One important function of ETL is "cleansing" data. The ETL consolidation protocols also
include the elimination of duplicate or fragmentary data, so that what passes from the E
portion of the process to the L portion is easier to assimilate and/or store. Such cleansing
operations can also include eliminating certain kinds of data from the process.
Data cleansing removes inconsistencies and errors from transactional data so it has the
consistency necessary for use in a data mart. (Brian Larson, 2008)
Data cleansing transforms data into a format that doesn't cause problems in the data mart
environment. It converts inconsistent data types into a single type. Data cleansing translates
dissimilar identifiers to a standard set of codes for the data mart. In addition, it repairs or
removes any data that does not meet the business rules required by the measures calculated
from this data mart.
Data cleansing is usually done as a part of a larger process. This process extracts the data
from the OLTP systems and loads it into a data mart. Thus, the entire procedure is known as
extract, transform , and load - or ETL.
The Extract, Transform, and Load (ETL) process extracts data to copy from one or more
OLTP systems, performs any required data cleansing to transform the data into a consistent
format, and loads the cleansed data by inserting it into the data mart. (Brian Larson, 2008)
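A small sketch of such cleansing, assuming a hypothetical code table that maps dissimilar customer identifiers to one standard code and converts inconsistent amount types to a single type:

```python
# Illustrative cleansing rules; the code table and field names are hypothetical.
STANDARD_CODES = {
    "HAL": "C001",
    "HINDUSTAN AERONAUTICS LIMITED": "C001",  # same customer, one code
    "ACME CORP": "C002",
}

def cleanse_row(raw):
    """Translate dissimilar identifiers to a standard code and unify types."""
    name = raw["customer"].strip().upper()
    return {
        "customer_code": STANDARD_CODES.get(name, "UNKNOWN"),
        "amount": float(raw["amount"]),   # e.g. '42' or 42 both become 42.0
    }

print(cleanse_row({"customer": "  hal ", "amount": "42"}))
# {'customer_code': 'C001', 'amount': 42.0}
```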
There are several ways of implementing ETL. The most prominent is to pull the data
from the source systems, put it in a staging area, and then transform it and load it into the data
warehouse, as per the top diagram of the figure. Alternatively, instead of putting the data in a
staging area, the ETL server sometimes does the transformation directly, with no staging, and
then updates the data warehouse, as shown in the bottom diagram of the figure. The staging
area is a physical database or a set of files; putting the data into the staging area means
inserting it into the database or writing it to files.
SQL Server 2008 Business Intelligence Application Development (SSIS, SSAS, SSRS)
Microsoft says that SQL Server Integration Services (SSIS) "is a platform for building high
performance data integration solutions, including extraction, transformation, and load (ETL)
packages for data warehousing."
SSIS packages define such a data integration workflow.
The first version of SSIS was released with SQL Server 2005. SSIS is a replacement for Data
Transformation Services (DTS) which was available with SQL Server 7.0 and SQL Server
2000. SSIS builds on the capabilities introduced with DTS.
Microsoft SQL Server 2005 Analysis Services (SSAS) delivers online analytical processing
(OLAP) and data mining functionality for business intelligence applications. Analysis
Services supports OLAP by letting you design, create, and manage multidimensional
structures that contain data aggregated from other data sources, such as relational databases.
For data mining applications, Analysis Services lets you design, create, and visualize data
mining models that are constructed from other data sources by using a wide variety of
industry-standard data mining algorithms.
SSRS
Microsoft SQL Server 2005 Reporting Services (SSRS) delivers enterprise, Web-enabled
reporting functionality so you can create reports that draw content from a variety of data
sources, publish reports in various formats, and centrally manage security and subscriptions.
Microsoft
SQL Server Reporting Services enables organizations to transform valuable enterprise data
into shared information for insightful, timely decisions at a lower total cost of ownership.
SQL Server Reporting Services is a comprehensive, server-based solution that enables the
creation, management, and delivery of both traditional, paper-oriented reports and interactive,
Web-based reports. An integrated part of the Microsoft Business Intelligence framework,
Reporting Services combines the data management capabilities of SQL Server and Microsoft
Windows Server with familiar and powerful Microsoft Office System applications to deliver
real-time information to support daily operations and drive decisions.
Staging process
With the help of SSIS (SQL Server Integration Services), we can load the data from the trade
capture system into the database, building extract, transform, and load (ETL) packages for
data warehousing. Since there is no organization supporting this system, all of the data was
created by collecting information from external sources.
Following are the steps to create the SSIS packages using Microsoft Visual Studio for
this system:
1. Create a new flat file connection manager pointing to the source file you want to load.
Also create a new OLEDB connection manager.
2. Drag a new flat file source. Point it to the flat file connection you created before.
3. Inside the Data Flow Task, drag a new OLEDB destination. Point it to the FRM db.
4. Create a new table and join it to the source. Check the mappings.
5. Again drag and drop the Data Flow Task in the control flow.
6. Now, for the data conversion, drag a Data Conversion component from the toolbox.
Double-click it and select the column whose data type has to be changed.
7. Then drag two more OLEDB destinations from the toolbox (one will be the final,
successfully typed load destination and the other will be the error output destination).
8. Double-click the typed load destination, point it to the FRM db, create a new table,
and join it to the source. Also check the mappings.
9. Double-click the error output destination, point it to the FRM db, enter the table name
created before, and check the mappings.
The above are the overall simple steps describing how SSIS packages are created.
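The data-conversion and error-output pattern in these steps can be sketched outside SSIS as follows (the column names and sample rows are hypothetical): rows whose column converts cleanly go to the typed destination, while rows that fail conversion are diverted to the error destination.

```python
import csv
import io

typed_rows, error_rows = [], []

# Hypothetical flat file source with one bad row.
sample = io.StringIO("trade_id,price\n1,100.5\n2,not-a-number\n3,99.0\n")
for row in csv.DictReader(sample):
    try:
        row["price"] = float(row["price"])   # the data-conversion step
        typed_rows.append(row)               # typed load destination
    except ValueError:
        error_rows.append(row)               # error output destination

print(len(typed_rows), len(error_rows))  # 2 1
```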
Now will see in detail how all these steps go in actual creation.
For this system, the data created was initially analyzed using Excel with the .csv file extension.
There are two .csv files in this system, containing quite a large amount of data; these are the
flat file sources for this system.
The following figure shows the total number of tables created in the database named FRM
for this system.
SQL terms:
DBMS - Database management system.
SQL - Structured Query Language is a standard language for communication with a relational
database management system (RDBMS).
Schema - Consists of a library, a journal, a journal receiver, an SQL catalog, and optionally a
data dictionary. A schema groups related objects and allows you to find the objects by name.
SQL statements:
The most common statements used in this report for query writing are as follows, whereas
other statements are just listed under the category of statements to which they belong.
Command - Description
CREATE DATABASE - Creates a new database
CREATE INDEX - Creates a new index on a table column
CREATE SEQUENCE - Creates a new sequence in an existing database
CREATE TABLE - Creates a new table in an existing database
CREATE TRIGGER - Creates a new trigger definition
CREATE VIEW - Creates a new view on an existing table
SELECT - Retrieves records from a table
INSERT - Adds one or more new records to a table
UPDATE - Modifies the data in existing table records
DELETE - Removes existing records from a table
DROP DATABASE - Destroys an existing database
DROP INDEX - Removes a column index from an existing table
DROP SEQUENCE - Destroys an existing sequence generator
DROP TABLE - Destroys an existing table
DROP TRIGGER - Destroys an existing trigger definition
DROP VIEW - Destroys an existing view
CREATE USER - Adds a new user account to the system
ALTER USER - Modifies an existing user account
DROP USER - Removes an existing user account
GRANT - Grants access privileges on database objects to a user
REVOKE - Removes access privileges from a user
CREATE FUNCTION - Creates a new SQL function within a database
CREATE LANGUAGE - Creates a new language definition within a database
CREATE OPERATOR - Creates a new SQL operator within a database
CREATE TYPE - Creates a new SQL data type within a database
Miscellaneous statements
Following are the most common and basic SQL statements in detail as per the above written
statements:
ALTER TABLE
CREATE FUNCTION
CREATE INDEX
CREATE PROCEDURE
CREATE SCHEMA
CREATE TABLE
CREATE TRIGGER
CREATE VIEW
DROP TABLE
DROP FUNCTION
DROP INDEX
DROP PROCEDURE
DROP SCHEMA
DROP TRIGGER
DROP VIEW
GRANT FUNCTION
GRANT PROCEDURE
RENAME
REVOKE FUNCTION
REVOKE PROCEDURE
REVOKE TABLE
CLOSE
COMMIT
DELETE
FETCH
INSERT
LOCK TABLE
OPEN
REFRESH TABLE
ROLLBACK
SAVEPOINT
SELECT INTO
SET variable
UPDATE
VALUES INTO
DESCRIBE
EXECUTE
EXECUTE IMMEDIATE
PREPARE
Miscellaneous Statements
CALL
CONNECT
DECLARE PROCEDURE
DECLARE STATEMENT
DECLARE VARIABLE
DESCRIBE TABLE
DISCONNECT
FREE LOCATOR
GET DIAGNOSTICS
HOLD LOCATOR
INCLUDE
RELEASE
SET CONNECTION
SET OPTION
SET PATH
SET SCHEMA
SET TRANSACTION
SIGNAL
WHENEVER
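A few of the data statements listed above (CREATE TABLE, INSERT, UPDATE, DELETE, SELECT, COMMIT) can be exercised end-to-end. The sketch below uses SQLite with hypothetical table and column names, so the exact syntax of some statements differs across database products:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.execute("CREATE TABLE trade (trade_id INTEGER PRIMARY KEY, price REAL)")
cur.execute("INSERT INTO trade VALUES (1, 100.5), (2, 99.0)")
cur.execute("UPDATE trade SET price = 101.0 WHERE trade_id = 1")
cur.execute("DELETE FROM trade WHERE trade_id = 2")
con.commit()                       # COMMIT makes the changes permanent

rows = cur.execute("SELECT trade_id, price FROM trade").fetchall()
print(rows)  # [(1, 101.0)]
```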
For creating the tables, six CREATE TABLE statements were executed against the FRM
database, each ending with ") ON [PRIMARY]" followed by "GO" (the column definitions
are not reproduced here).
Finance
Swap
A swap is an agreement between two or more parties to exchange sequences of cash flows
over a period in the future. Suppose a company has borrowed money under an
adjustable-rate security such as a mortgage and is now fearful that interest rates are
going to rise. It wants to protect itself against rises in interest rates without going through
the refinancing of the mortgage. The company or individual liable for the adjustable rate looks
for someone who will pay the adjustable interest payments in return for receipt of fixed-rate
payments. This is called a swap. A swaption (an option on a swap) gives the holder the right to
enter into, or the right to cancel out of, a swap. (R. Kolb, J. Overdahl, 2007)
Currency swaps
Credit swaps
Commodity swaps
Equity swaps.
Options
An option gives the buyer or holder the right, but not the obligation, to buy or sell shares (or
other financial instruments) at a fixed price on or before a given date.
There are two major classes of options: call options and put options.
The owner of a call option has the right to purchase the underlying good at a specific price,
and this right lasts until a specific date. The owner of a put option has the right to sell the
underlying good at a specific price, and this right lasts until a specific date. (R. Kolb, J.
Overdahl, 2007)
Forwards
A forward contract is an agreement between two parties for the delivery of a physical asset
(e.g. oil or gold) at a certain time in the future for a certain price, fixed at the inception of the
contract. The parties agreeing to the contract are known as counterparties. (R.Kolb, J.
Overdahl, 2007)
Futures
A futures contract is a type of forward contract with highly standardized and precisely
specified contract terms. As in all forward contracts, a futures contract calls for the exchange
of some good at a future date for cash, with the payment for the good to occur at that future
date. The purchaser of the futures contract undertakes to receive delivery of the good and pay
for it, while the seller of a futures contract promises to deliver the good and receive payment.
The price of the good is determined at the initial time of the contract. (R.Kolb, J. Overdahl,
2007)
There are several financial risk parameters such as Delta, Theta, Gamma, Vega, and Rho.
Delta
Delta for individual options, and position Delta for strategies involving combinations of
positions, are measures of risk from a move of the underlying price. For example, if you buy
an at-the-money call or put, it will have a Delta of approximately 0.5, meaning that if the
underlying stock price moves 1 point, the option price will change by 0.5 points (all other
things remaining the same). If the price moves up, the call will increase by 0.5 points and the
put will decrease by 0.5 points. While a 0.5 Delta applies to at-the-money options, Delta
values for calls range from 0 to 1.0 (1.0 being a long stock equivalent position) and for puts
from -1.0 to 0 (with -1.0 being an equivalent short stock position). (John Summa, 1998).
Gamma
Delta measures the change in price of an option resulting from the change in the underlying
price. However, Delta is not a constant. When the underlying moves so does the Delta value
on any option. This rate of change of Delta resulting from movement of the underlying is
known as Gamma. And Gamma is largest for options that are at-the-money, while smallest
for those options that are deepest in- and out-of-the-money. Gammas that get too big are risky
for traders, but they also hold potential for large-size gains. Gammas can get very large as
expiration nears, particularly on the last trading day for near-the-money options. (John
Summa, 1998).
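The combined effect of Delta and Gamma can be sketched with a standard second-order (Taylor) approximation; the Delta and Gamma values below are illustrative, not taken from the cited source:

```python
def approx_option_price_change(delta, gamma, underlying_move):
    """First-order (Delta) plus second-order (Gamma) price-change estimate."""
    return delta * underlying_move + 0.5 * gamma * underlying_move ** 2

def updated_delta(delta, gamma, underlying_move):
    """Delta itself shifts by roughly Gamma per point of underlying movement."""
    return delta + gamma * underlying_move

# An at-the-money call with Delta 0.5 and Gamma 0.05; underlying rises 1 point.
print(round(approx_option_price_change(0.5, 0.05, 1.0), 3))  # 0.525
print(round(updated_delta(0.5, 0.05, 1.0), 2))               # 0.55
```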
Theta
Theta is a measure of the rate of time premium decay and it is always negative (leaving
position Theta aside for now). Anybody who has purchased an option knows what Theta is,
since it is one of the most difficult hurdles to surmount for buyers. As soon as you own an
option (a wasting asset), the clock starts ticking, and with each tick the amount of time value
remaining on the option decreases, other things remaining the same. Owners of these wasting
assets take the position because they believe the underlying stock or futures will make a
move quick enough to put a profit on the option position before the clock has ticked too long.
In other words, Delta beats Theta and the trade can be closed profitably.
When Theta beats Delta, the seller of the option would show gains. This tug of war
between Delta and Theta characterizes the experience of many traders, whether long
(purchasers) or short (sellers) of options. (John Summa, 1998).
Vega
When any position is taken in options, not only is there risk from changes in the underlying
but there is risk from changes in implied volatility. Vega is the measure of that risk. When the
underlying changes, or even if it does not in some cases, implied volatility levels may change.
Whether large or small, any change in the levels of implied volatility will have an impact on
unrealized profit/loss in a strategy. Some strategies are long volatility and others are short
volatility, while some can be constructed to be neutral volatility. For example, a put that is
purchased is long volatility, which means the value increases when volatility increases and
falls when volatility drops (assuming the underlying price remains the same). Conversely, a
put that is sold (naked) is short volatility (the position loses value if the volatility increases).
When a strategy is long volatility, it has a positive position Vega value and when short
volatility, its position Vega is negative. When the volatility risk has been neutralized,
position Vega will be neither positive nor negative. (John Summa, 1998).
Rho
Rho is a risk measure related to changes in interest rates. Since the interest rate risk is
generally of a trivial nature for most strategists (the risk free interest rate does not make large
enough changes in the time frame of most options strategies), it will not be dealt with at
length in this tutorial.
When interest rates rise, call prices will rise and put prices will fall. Just the reverse occurs
when interest rates fall. Rho is a risk measure that tells strategists by how much call and put
prices change as a result of a rise or fall in interest rates. The Rho values for in-the-money
options will be largest due to arbitrage activity with such options. Arbitragers are willing to
pay more for call options and less for put options when interest rates rise because of the
interest earnings potential on short sales made to hedge long calls and opportunity costs of
not earning that interest. (John Summa,1998).
Future value with simple interest:
FV = PV * (1 + i * N)
Where,
FV = future value
PV = present value
i = interest rate per period
N = number of periods
Simple Interest
I = PV * i * N
Where,
I = interest earned
Compound interest future value:
FV = PV * (1 + i)^N
Compounded Interest
I = FV - PV
Annuity
FV = PMT * [((1 + i)^N - 1) / i]
Where,
PMT = periodic payment
Annuity payment for a loan of present value PV:
PV * (1 + i)^N = PMT * [((1 + i)^N - 1) / i]
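The interest and annuity formulas above translate directly into code; the input values below are illustrative:

```python
# Direct translations of the time-value-of-money formulas above.

def simple_interest_fv(pv, i, n):
    """Future value with simple interest: FV = PV * (1 + i*N)."""
    return pv * (1 + i * n)

def simple_interest(pv, i, n):
    """Simple interest earned: I = PV * i * N."""
    return pv * i * n

def compound_fv(pv, i, n):
    """Future value with compound interest: FV = PV * (1 + i)^N."""
    return pv * (1 + i) ** n

def annuity_fv(pmt, i, n):
    """Future value of an annuity: FV = PMT * [((1 + i)^N - 1) / i]."""
    return pmt * (((1 + i) ** n - 1) / i)

# 1000 at 5% per period for 2 periods; annuity of 100 for 3 periods.
print(round(simple_interest_fv(1000, 0.05, 2), 2))  # 1100.0
print(round(compound_fv(1000, 0.05, 2), 2))         # 1102.5
print(round(annuity_fv(100, 0.05, 3), 2))           # 315.25
```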
Present Value:
Present value is the value on a given date of a future payment or series of future payments,
discounted to reflect the time value of money and other factors such as investment risk.
Present value calculations are widely used in business and economics to provide a means to
compare cash flows at different times on a meaningful "like to like" basis.
Present value of an ordinary annuity:
PV = PMT * [(1 - (1 + i)^-N) / i]
Where PMT = periodic payment, i = interest rate per period, and N = number of periods.
BI solutions in the financial services industry have been delivered most effectively within
individual lines of business. For financial analysis and customer service, BI has allowed
individual financial products and lines of business to use integrated information. BI is used to
consolidate information, support sales and marketing, and manage risk.
Risk Management: Richard Skriletz (Aug 2003) classified risk in financial services in two
forms. The first is the risk to lending institutions of default on payments such as mortgage
payments. Defaults are a problem because they disrupt cash flow, affect securities issued by
the financial institution that are backed by loans and can affect (as one of the rating factors)
the rating agencies' evaluations of the financial strength of the company. BI data mining and
statistical modeling are being applied successfully to the problem of managing risk associated
with defaults on loans. These tools help to identify patterns in data that correlate with loan
defaulters so that the process of underwriting or approving loan applications can screen out
applicants with these data patterns.
The second area of risk is the evaluation of the company by rating agencies. Evaluations are
meant to provide an assessment of the stability of the financial institution, its ability to fulfill
its obligations to customers and its relative (compared to other companies rated) strength for
investors. Relative strength is a risk factor because an agency's rating can affect the cost of
capital to the financial institution, and the cost of capital is the "raw material" of financial
products and services. It determines the operating margin (the difference between costs and
gross revenues) of the company because rates for many financial products and services (such
as mortgage rates) are set by the market. A lower cost of capital (due to a better rating) can
increase the operating margin and provide an advantage over competitors with lower ratings.
Agencies ask for reports and data to use in their evaluation and monitoring processes. RCG
IT clients using BI solutions to consolidate information and monitor risks such as defaults
have been able to meet the information needs of rating agencies easily and demonstrate their
ability to respond to changes in the business environment. Consider the question, no longer
hypothetical, of assessing the impact of a terrorist attack such as that of 9/11 on your
business. An effective BI solution can help answer this question quickly and easily.
The major opportunity for BI in financial services is clearly for enterprise-wide integration of
customer, product, channel and operational data. After all, the financial services business is
largely about information, and most financial services companies have used or experimented
with BI technology successfully. Such companies should value their information assets and
use them to improve products, services and business operations. Too much information is
locked in the wide array of application systems including those for departmental and line of
business products, customer relationship management, enterprise resource planning, financial
management and e-business.
Henry Vlokya (12 Aug 2009) said that there are challenges to creating a truly integrated,
enterprise-wide information environment in financial services companies. There are decision-
makers in large financial services organizations that believe that their data volumes are too
great, their information processing needs too specialized, and their end users too set in their
ways for a BI solution to be effective. Here they should be encouraged to look at the retail
industry where many BI best practices have been developed. In retail, the volume of
transactions processed, the wide range of BI analytics used and the variety of end users who
employ BI results are all greater than those found in financial services.
The challenges are not technical. Rather, organizational issues present the greatest challenge
for financial services companies. In particular, the management culture of financial services
companies is focused on financial performance and profit. This culture comes from the
entrepreneurial moneymaking that is at the heart of the financial services industry. Currently,
it is the rare executive who is willing to invest in an enterprise-wide solution and all that it
entails; but there are more such executives appearing every day. (Justin Aucoin , 17 Sep
2009)
RCG IT is working with financial services companies that are creating a centralized,
integrated information facility to perform in-depth customer analytics. The intents vary,
including developing a better understanding of customer behavior, providing better
performing cross-selling capabilities and identifying new opportunities to improve products
and services. In each of these cases, the drive for enterprise-wide information integration
comes from a high-level executive.
One financial services client, who was dissatisfied with the results of its five-year experience
with in-house developed BI, outsourced its BI function. The company did this in order to get
results needed from its operationally mission-critical use of BI without continuing to invest in
learning and experimentation with the technology. As BI solutions become as important to
the financial services industry as they are in retail and other industries, more executives will
be making decisions to integrate information on an enterprise-wide basis and ensure that the
right skills and experiences are in place to deliver high-value business results.
BI tools
Spreadsheets
Reporting and querying software: applications that extract, sort, summarize, and
present selected data
Online analytical processing (OLAP)
Digital dashboards
Data mining
Business activity monitoring
Data warehouse[13]
Local information systems
Data cleansing[14]
A data warehouse is a federated repository for all the data that an enterprise's various
business systems collect. The repository may be physical or logical.
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For
example, "sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example,
source A and source B may have different ways of identifying a product, but in a data
warehouse, there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve
data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This
contrasts with a transaction system, where often only the most recent data is kept. For
example, a transaction system may hold the most recent address of a customer, where a data
warehouse can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a
data warehouse should never be altered.
A data warehouse is a copy of transaction data specifically structured for query and
analysis.
This is a functional view of a data warehouse. Kimball did not address how the data
warehouse is built like Inmon did; rather he focused on the functionality of a data warehouse.
A data warehouse is a powerful database model that significantly enhances the user’s ability
to quickly analyze large, multidimensional data sets.
It cleanses and organizes data to allow users to make business decisions based on facts.
Hence, the data in the data warehouse must have strong analytical characteristics. Making
data analytical requires that it be subject-oriented, integrated, time-referenced, and
non-volatile.
SUBJECT-ORIENTED DATA
This means a data warehouse has a defined scope and it only stores data under that
scope. So, for example, if the sales team of your company is creating a data warehouse,
the data warehouse by definition is required to contain data related to sales.
Data Warehouses group data by subject rather by activity. In contrast, transactional systems
are organized around activities – payroll processing, shipping products, loan processing, and
the like.
Data organized around activities cannot answer questions such as, "How many salaried
employees have a tax deduction of 'X' amount across all branches of the company?" This
request would require searching and aggregating the employee and account records of all
the branches.
Imagine the query response time for a company having branches all over the country with
employee strength of 20,000!
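Once the data is grouped by subject, the question above collapses into a single aggregate query; the schema and sample values below are hypothetical:

```python
import sqlite3

# Subject-oriented employee data consolidated across branches.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employee (emp_id INTEGER, branch TEXT, salaried INTEGER,
                       tax_deduction REAL);
INSERT INTO employee VALUES
    (1, 'North', 1, 500.0),
    (2, 'South', 1, 500.0),
    (3, 'South', 0, 500.0),
    (4, 'East',  1, 250.0);
""")

# How many salaried employees have a tax deduction of 500 across all branches?
count = con.execute(
    "SELECT COUNT(*) FROM employee WHERE salaried = 1 AND tax_deduction = 500.0"
).fetchone()[0]
print(count)  # 2
```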
INTEGRATED DATA
Integrated data refers to de-duplicating information and merging it from many sources into
one consistent location.
When shortlisting your top 20 customers, you must know that "HAL" and "Hindustan
Aeronautics Limited" are one and the same. There must be just one customer number for any
form of HAL or Hindustan Aeronautics Limited in your database.
This means that the data stored in a data warehouse makes sense. Facts and figures are
related to each other, and they are integrable and project a single point of truth.
Much of the transformation and loading work that goes into the data warehouse is centered on
integrating data and standardizing it.
TIME-REFERENCED DATA
The most important and most scrutinized characteristic of analytical data is its prior state
of being. In other words, time-referenced data essentially refers to its time-valued
characteristic. For example, the user may ask, "What were the total sales of product 'A' for
the past three years on New Year's Day across region 'Y'?" To answer this question, you
need to know the sales figures of the product on New Year's Day in all the branches of that
particular region.
This means that the data is not constant; as more and more data gets loaded into the
warehouse, the data warehouse also grows in size.
Time-referenced data, when analyzed, can also help in spotting hidden trends between
different associative data elements, which may not be obvious to the naked eye. This
exploration activity is termed "data mining".
NON-VOLATILE DATA
Since the information in a data warehouse is heavily queried against time, it is extremely
important to preserve it pertaining to each and every business event of the company. The
non-volatility of data, a characteristic of data warehouses, enables users to dig deep into
history and arrive at specific business decisions based on facts.
This means that data once stored in the data warehouse is not removed or deleted from it,
and it always stays there no matter what.
If there is a single key to survival in the 1990s and beyond, it is being able to analyze, plan,
and react to changing business conditions in a much more rapid fashion. In order to do this,
managers, analysts, and knowledge workers in our enterprises need more and better
information.
Information technology (IT) has made possible the revolution in the way organizations
operate throughout the world today. But the sad truth is that in many organizations, despite
the availability of powerful computers on each desk and communications that span the globe,
large numbers of executives and decision-makers cannot get their hands on existing critical
information in the organization.
Every day, organizations large and small, create billions of bytes of data about all aspects of
their business; millions of individual facts about their customers, products, operations and
people. But for the most part, this is locked up in a maze of computer systems and is
exceedingly difficult to get at. This phenomenon has been described as “data in jail”.
Industry experts have estimated that only a small fraction of the data that is captured,
processed and stored in the enterprise, is actually available to executives and decision
makers. While technologies for the manipulation and presentations of data have literally
exploded, it is only recently that those involved in developing IT strategies for large
enterprise have concluded that large segments of the enterprise are “data poor”.
The fundamental difference between operational systems and data warehousing systems is
that operational systems are designed to support transaction processing whereas data
warehousing systems are designed to support online analytical processing (or OLAP, for
short).
Based on this fundamental difference, data usage patterns associated with operational
systems are significantly different than usage patterns associated with data warehousing
systems. As a result, data warehousing systems are designed and optimized using
methodologies that drastically differ from that of operational systems.
The table below summarizes many of the differences between operational systems and data
warehousing systems
The business analyst gets information from the data warehouse to measure
performance and make critical adjustments in order to win over other business holders in the
market. Having a data warehouse offers the following advantages:
Since a data warehouse can gather information quickly and efficiently, it can enhance
business productivity.
A data warehouse provides us a consistent view of customers and items, hence, it
helps us manage customer relationship.
A data warehouse also helps in bringing down the costs by tracking trends, patterns
over a long period in a consistent and reliable manner.
To design an effective and efficient data warehouse, we need to understand and analyze the
business needs and construct a business analysis framework. Each person has different
views regarding the design of a data warehouse. These views are as follows:
The top-down view - This view allows the selection of relevant information needed
for a data warehouse.
The data source view - This view presents the information being captured, stored,
and managed by the operational system.
The data warehouse view - This view includes the fact tables and dimension tables.
It represents the information stored inside the data warehouse.
The business query view - It is the view of the data from the viewpoint of the end-
user.
Generally, a data warehouse adopts a three-tier architecture. Following are the three tiers of
the data warehouse architecture.
Bottom Tier - The bottom tier of the architecture is the data warehouse database
server. It is a relational database system. We use back-end tools and utilities to
feed data into the bottom tier. These back-end tools and utilities perform the
extract, clean, load, and refresh functions.
Middle Tier - In the middle tier, we have the OLAP Server that can be implemented
in either of the following ways.
o By Relational OLAP (ROLAP), which is an extended relational database
management system. The ROLAP maps the operations on multidimensional
data to standard relational operations.
o By Multidimensional OLAP (MOLAP) model, which directly implements the
multidimensional data and operations.
Top-Tier - This tier is the front-end client layer. This layer holds the query tools and
reporting tools, analysis tools and data mining tools.
From the perspective of data warehouse architecture, we have the following data warehouse
models:
Virtual Warehouse
Data mart
Enterprise Warehouse
Virtual Warehouse
A view over operational databases is known as a virtual warehouse. It is easy to
build a virtual warehouse, but building one requires excess capacity on
operational database servers.
Data Mart
A data mart contains a subset of organization-wide data. This subset of data is valuable to
specific groups of an organization.
In other words, we can say that data marts contain data specific to a particular group. For
example, the marketing data mart may contain data related to items, customers, and sales.
Data marts are confined to subjects.
Enterprise Warehouse
An enterprise warehouse collects all the information and the subjects spanning an
entire organization
It provides us enterprise-wide data integration.
The data is integrated from operational systems and external information providers.
This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or
beyond.
Load Manager
This component performs the operations required by the extract and load process.
The size and complexity of the load manager varies between specific solutions, from one
data warehouse to another.
The data is extracted from the operational databases or from external information providers.
Gateways are the application programs used to extract data. A gateway is supported by the
underlying DBMS and allows a client program to generate SQL to be executed at a server.
Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples
of gateways.
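The gateway pattern described above, where a client program generates SQL that is executed at the source server, can be sketched in Python. This is a minimal illustration only: it uses the standard library's sqlite3 module as a stand-in for a real ODBC/JDBC connection, and the orders table and its columns are invented for the example.

```python
import sqlite3

def extract(conn, query):
    """Run SQL at the source through the gateway connection and return
    the rows as dictionaries, ready to be fed into the warehouse."""
    cur = conn.execute(query)
    cols = [c[0] for c in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]

# Stand-in operational source; in practice this would be an ODBC/JDBC
# connection to the operational database or an external provider.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

rows = extract(conn, "SELECT order_id, amount FROM orders ORDER BY order_id")
```

The point of the sketch is that the client composes the SQL while the DBMS underneath the gateway executes it, exactly the division of labor the text describes.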
Fast Load
In order to minimize the total load window, the data needs to be loaded into the
warehouse in the fastest possible time.
Transformations affect the speed of data processing.
It is more effective to load the data into a relational database prior to applying
transformations and checks.
Gateway technology is not suitable here, since gateways tend not to perform well
when large data volumes are involved.
Simple Transformations
While loading, it may be required to perform simple transformations. After these have been
completed, we are in a position to do the complex checks. Suppose we are loading EPOS
sales transactions; we need to perform the following checks:
Strip out all the columns that are not required within the warehouse.
Convert all the values to required data types.
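A minimal sketch of these two simple transformations, stripping columns not required within the warehouse and converting values to the required data types, might look like the following; the EPOS column names here are hypothetical, not taken from the text.

```python
# Columns required in the warehouse (illustrative names).
KEEP = {"item_code", "quantity", "price"}

def simple_transform(record):
    """Strip columns not required within the warehouse and convert
    the remaining values to their required data types."""
    out = {k: v for k, v in record.items() if k in KEEP}
    out["quantity"] = int(out["quantity"])    # till sends strings
    out["price"] = float(out["price"])
    return out

raw = {"item_code": "A17", "quantity": "3", "price": "4.99", "till_id": "T2"}
clean = simple_transform(raw)
```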
Warehouse Manager
The size and complexity of warehouse managers varies between specific solutions.
Note: A warehouse manager also analyzes query profiles to determine whether indexes and
aggregations are appropriate.
Query Manager
Query manager is responsible for directing the queries to the suitable tables.
By directing the queries to appropriate tables, the speed of querying and response
generation can be increased.
Query manager is responsible for scheduling the execution of the queries posed by the
user.
The following diagram shows the architecture of a query manager.
Detailed information is not kept online, rather it is aggregated to the next level of detail and
then archived to tape. The detailed information part of data warehouse keeps the detailed
information in the starflake schema. Detailed information is loaded into the data warehouse to
supplement the aggregated data.
The following diagram shows a pictorial impression of where detailed information is stored
and how it is used.
Note: If detailed information is held offline to minimize disk storage, we should make sure
that the data has been extracted, cleaned up, and transformed into starflake schema before it
is archived.
Summary Information
Summary Information is a part of data warehouse that stores predefined aggregations. These
aggregations are generated by the warehouse manager. Summary Information must be treated
as transient. It changes on-the-go in order to respond to the changing query profiles.
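As a sketch of how a warehouse manager might generate one of these predefined aggregations from detailed fact rows, consider the following; the region and sales fields are illustrative, and because the summary is transient it can simply be rebuilt when query profiles change.

```python
from collections import defaultdict

def build_summary(detail_rows, group_key, measure):
    """Generate a predefined aggregation, as the warehouse manager would.
    The result is transient: it is cheap to regenerate from detail."""
    summary = defaultdict(float)
    for row in detail_rows:
        summary[row[group_key]] += row[measure]
    return dict(summary)

facts = [
    {"region": "West", "sales": 100.0},
    {"region": "East", "sales": 50.0},
    {"region": "West", "sales": 25.0},
]
by_region = build_summary(facts, "region", "sales")
```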
Schema is a logical description of the entire database. It includes the name and description of
records of all record types, including all associated data items and aggregates. Much like a
database, a data warehouse also needs to maintain a schema. A database uses the relational
model, while a data warehouse uses the Star, Snowflake, and Fact Constellation schemas. In
this chapter, we will discuss the schemas used in a data warehouse.
Star Schema
Note: Each dimension has only one dimension table and each table holds a set of attributes.
For example, the location dimension table contains the attribute set {location_key, street,
city, province_or_state, country}. This constraint may cause data redundancy. For example,
"Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia.
The entries for such cities may cause data redundancy along the attributes province_or_state
and country.
Snowflake Schema
Now the item dimension table contains the attributes item_key, item_name, type,
brand, and supplier-key.
The supplier key is linked to the supplier dimension table. The supplier dimension
table contains the attributes supplier_key and supplier_type.
Note: Due to normalization in the Snowflake schema, redundancy is reduced and therefore
it becomes easier to maintain and saves storage space.
A fact constellation has multiple fact tables. It is also known as galaxy schema.
The following diagram shows two fact tables, namely sales and shipping.
Schema Definition
Multidimensional schema is defined using Data Mining Query Language (DMQL). The two
primitives, cube definition and dimension definition, can be used for defining the data
warehouses and data marts.
Syntax for Cube Definition
define cube <cube_name> [<dimension_list>]: <measure_list>
The star schema that we have discussed can be defined using Data Mining Query Language
(DMQL) as follows:
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
Snowflake Schema Definition
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier (supplier key, supplier
type))
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city (city key, city, province or state,
country))
Fact Constellation Schema Definition
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state,country)
define cube shipping [time, item, shipper, from location, to location]:
ETL refers to three processes - Extract, Transform and Load. Simply defined, ETL enables
the collection of data from various sources into one data store, ready for analysis.
ETL can be implemented with scripts (custom DIY code) or with a dedicated ETL tool. ETL
performs a number of important functions including:
Parsing/Cleansing - Data generated by applications is created in various formats like
JSON, XML or CSV. During the parsing stage, data is mapped into a table format
with headers, columns and rows, and the specified fields are extracted.
Data Enrichment - In order to prepare data for analytics, certain enrichment steps are
usually required, including: tweaking, injecting expert knowledge, geo modifications,
matching between sources and correcting bugs.
Setting Velocity – Velocity refers to the frequency of data loading, whether new data
should be inserted, or if existing data needs to be updated.
Data Validation - There are cases where data is empty, corrupted, missing crucial
elements, too thin or too bloated. ETL finds these occurrences and determines
whether to stop the entire process, skip it or set it aside for inspection while alerting
the relevant administrators.
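The validation behavior described above, deciding per record whether to load it, skip it, or set it aside for inspection, can be sketched as follows; the field names and rules are assumptions for illustration, not a fixed API.

```python
# Fields a record must carry to be loadable (illustrative).
REQUIRED = ("customer_id", "amount")

def validate(record):
    """Classify a record: 'skip' if empty, 'inspect' if crucial
    elements are missing, otherwise 'ok' to load."""
    if not record:
        return "skip"
    if any(record.get(f) in (None, "") for f in REQUIRED):
        return "inspect"
    return "ok"

results = [
    validate({}),                                  # empty record
    validate({"customer_id": 1, "amount": 50.0}),  # complete record
    validate({"customer_id": 1}),                  # missing crucial element
]
```

In a fuller pipeline, the "inspect" branch would also alert the relevant administrators, as the text notes.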
The short answer is that ETL saves significant time on data extraction and preparation - time
that could be better spent on extracting insights.
Each of the three main components of ETL saves time and development effort by doing its
work once in a dedicated data flow:
Extract – There is a saying, “A chain is only as strong as its weakest link.” In the context of
ETL, the strength of the chain is also determined by its first link. The extract stage determines
different data sources, refresh rate (velocity) of each source, and priorities (extract order)
between them – all of which heavily impact time-to-insights.
Transform - After extracting the data into an ETL environment, transformations bring clarity
and order to the initial data swamp. For example, dates are consolidated into specified time
buckets, strings are parsed into their business meanings, transactions are modeled into events,
location data is translated into coordinates, zip codes or cities/countries, measures are
summed up, averaged, or rounded, and useless data and errors are set aside for later inspection.
Load – In the last phase, much as in the first, targets and refresh rates are determined.
Moreover, the load phase determines whether loading will be done by increments or “upsert”
(update existing data and insert new data) is required for the new batches of data.
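The "upsert" mode described for the load phase can be sketched with SQLite's INSERT ... ON CONFLICT clause via the Python standard library's sqlite3 module; the customer dimension table here is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT)")

def upsert(conn, rows):
    """Update existing rows and insert new ones in a single pass,
    as the load phase's 'upsert' mode does for each new batch."""
    conn.executemany(
        "INSERT INTO dim_customer (customer_id, name) VALUES (?, ?) "
        "ON CONFLICT(customer_id) DO UPDATE SET name = excluded.name",
        rows,
    )

upsert(conn, [(1, "Ada"), (2, "Grace")])
upsert(conn, [(2, "Grace H."), (3, "Edsger")])  # batch 2: one update, one insert
result = sorted(conn.execute("SELECT customer_id, name FROM dim_customer"))
```

The alternative, loading by increments, would simply append new rows instead of reconciling them against existing keys.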
Implementing ETL in a DWH
When an ETL process is used to load a database into a data warehouse (DWH), each phase is
represented by a physical layer:
Mirror/Raw layer - This layer is a copy of the source files or tables, with no logic or
enrichment. Source data is copied and added to the target mirror tables, which then
hold historical raw data that is ready to be transformed.
Staging layer - Once the raw data from the mirror tables is transformed, all
transformations are stored in staging tables. These tables hold the final form of the
data for the incremental part of the ETL cycle in progress.
Schema layer - These are the destination tables, which contain all the data in its final
form after cleansing, enrichment and transformation.
Aggregating layer - In some cases it is beneficial to aggregate data to a daily or store
level from the full dataset. This can improve report performance, enable the addition
of business logic to calculated measures and make it easier for report developers to
understand the data.
Today's data is indeed big - not only in size but in impact, possible interpretations and use
cases.
Beyond the management and monitoring of revenue flow and users, each element of the
modern organization requires unique insights from big data. In order to satisfy such diverse
data demands without getting lost in parallel data versions, ETL creates an environment that
sustains democratic data governance.
In computing, Extract, Transform, Load (ETL) refers to a process in database usage and
especially in data warehousing. The ETL process became a popular concept in the 1970s.[1]
Data extraction is where data is extracted from homogeneous or heterogeneous data sources;
data transformation where the data is transformed for storing in the proper format or structure
for the purposes of querying and analysis; data loading where the data is loaded into the final
target database, more specifically, an operational data store, data mart, or data warehouse.
Since the data extraction takes time, it is common to execute the three phases in parallel.
While the data is being extracted, a transformation process executes on the data already
received and prepares it for loading, and the data loading begins without waiting for the
completion of the previous phases.
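This pipelined execution of the three phases can be sketched with threads and queues: rows stream from extract to transform to load, so loading starts before extraction has finished. The doubling step below is a stand-in for a real transformation.

```python
import queue
import threading

def run_pipeline(source_rows):
    """Run extract, transform and load concurrently; rows flow
    through queues so the phases overlap in time."""
    extracted = queue.Queue()
    transformed = queue.Queue()
    loaded = []
    DONE = object()  # sentinel marking the end of a stream

    def extract():
        for row in source_rows:
            extracted.put(row)
        extracted.put(DONE)

    def transform():
        while (row := extracted.get()) is not DONE:
            transformed.put(row * 2)  # stand-in transformation
        transformed.put(DONE)

    def load():
        while (row := transformed.get()) is not DONE:
            loaded.append(row)

    threads = [threading.Thread(target=fn) for fn in (extract, transform, load)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return loaded

result = run_pipeline([1, 2, 3])
```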
ETL systems commonly integrate data from multiple applications (systems), typically
developed and supported by different vendors or hosted on separate computer hardware. The
disparate systems containing the original data are frequently managed and operated by
different employees. For example, a cost accounting system may combine data from payroll,
sales, and purchasing.
Data warehousing is the backbone of any business intelligence solution, and it needs to be
built with careful consideration so that it is scalable and can evolve as your business grows.
It also requires a systematically built and easy-to-maintain ETL process.
Congruent aids you in planning, designing and implementing an effective database solution
to keep you way ahead of your competitors. With data coming in from various disparate
sources and in different forms, it is important to have a Data Warehousing development
partner who has deep understanding and experience of working with various source systems
as well to enable faster and effective development of the data warehouse.
Congruent thoroughly studies your current systems and business needs to understand what
you are trying to achieve with the data warehousing solution and what level of reporting and
analytical capabilities you expect of it. Congruent is technically equipped to deliver a world-
class, business-centric database solution to meet your current and future requirements for
gathering & presenting business intelligence to users.
Our Data Warehousing Development Services
Data Modeling - Define and analyze business data requirements. SQL Server
Management Studio, MySQL Workbench, Oracle SQL Developer Data Modeler etc.
are some of the data modelling tools we are familiar with
Data Integration - Integrate data from different unrelated sources stored using
various technologies to provide a combined view of data which provides useful
information. Our developers have the necessary technical experience to work with
ETL tools
Data Mart Development - Our developers are capable of building individual data
marts dedicated separately to each business function addressing all the needs of a
specific team or department
Data Analysis - Our data analysis process helps to discover business critical
information from raw data after thoroughly analyzing it with various tools
Design, development, operation & support of the Data Warehouse
Features of our Data Warehousing Development services
Scalable - A data warehousing solution that is scalable and grows along with your
business
Data Accuracy - Highly accurate and consistent data helping you make better
informed decisions
Consolidated view - One single place to store the data from all disparate sources and
in the right format to support Business Intelligence solutions
Quick to respond - Minimal response time for queries enabling faster decision
making
Self-service - Supports self-service BI with little or no dependence on IT
Why Congruent?
The objective of a data warehouse is to make large amounts of data easily accessible to the
users, hence allowing the users to extract information about the business as a whole. But
there may be security restrictions applied on the data that are an obstacle to accessing the
information. If the analyst has a restricted view of data, then it is impossible to capture a
complete picture of the trends within the business.
The data from each analyst can be summarized and passed on to management, where the
different summaries can be aggregated. Since an aggregation of summaries is not necessarily
the same as an aggregation over the data as a whole, it is possible to miss some information
trends in the data unless someone is analyzing the data as a whole.
Security Requirements
Adding security features affects the performance of the data warehouse, therefore it is
important to determine the security requirements as early as possible. It is difficult to add
security features after the data warehouse has gone live.
During the design phase of the data warehouse, we should keep in mind what data sources
may be added later and what would be the impact of adding those data sources. We should
consider the following possibilities during the design phase.
Will the new data sources require new security and/or audit restrictions to be
implemented?
Will new users be added who have restricted access to data that is already
generally available?
This situation arises when the future users and the data sources are not well known. In such a
situation, we need to use the knowledge of business and the objective of data warehouse to
know likely requirements.
User access
Data load
Data movement
Query generation
User Access
We need to first classify the data and then classify the users on the basis of the data they can
access. In other words, the users are classified according to the data they can access.
Data Classification
There are some issues in the second approach. To understand them, consider an example.
Suppose you are building the data warehouse for a bank, and the data being stored in the
data warehouse is the transaction data for all the accounts. The question here is who is
allowed to see the transaction data. The solution lies in classifying the data according to
function.
User classification
The following approaches can be used to classify the users:
Users can be classified as per the hierarchy of users in an organization, i.e., users can
be classified by departments, sections, groups, and so on.
Users can also be classified according to their role, with people grouped across
departments based on their role.
Let's have an example of a data warehouse where the users are from sales and marketing
department. We can have security by top-to-down company view, with access centered on the
different departments. But there could be some restrictions on users at different levels. This
structure is shown in the following diagram.
But if each department accesses different data, then we should design the security access for
each department separately. This can be achieved by departmental data marts. Since these
data marts are separated from the data warehouse, we can enforce separate security
restrictions on each data mart. This approach is shown in the following figure.
Classification on basis of Role
If the data is generally available to all the departments, then it is useful to follow the role
access hierarchy. In other words, if the data is generally accessed by all departments,
access is best controlled according to role.
Audit Requirements
Auditing is a subset of security and a costly activity; it can cause heavy overheads on the
system. To complete an audit in time, we require more hardware, and therefore it is
recommended that, wherever possible, auditing should be switched off. Audit requirements
can be categorized as follows:
Connections
Disconnections
Data access
Data change
Note: For each of the above-mentioned categories, it is necessary to audit success, failure, or
both. From a security perspective, the auditing of failures is very important, because failures
can highlight unauthorized or fraudulent access.
Network Requirements
Network security is as important as other aspects of security; we cannot ignore the network
security requirements. We need to consider the following issues:
These restrictions need to be considered carefully. Following are the points to remember:
The process of encryption and decryption will increase overheads; it requires
more processing power and processing time.
The cost of encryption can be high if the source system is already heavily loaded,
because the cost of encryption is borne by the source system.
Data Movement
There are potential security implications when moving data. Suppose we need to
transfer some restricted data as a flat file to be loaded. When the data is loaded into the data
warehouse, the following questions are raised:
If we talk about the backup of these flat files, the following questions are raised:
Some other forms of data movement like query result sets also need to be considered. The
questions raised while creating the temporary table are as follows:
We should avoid the accidental flouting of security restrictions. If a user with access to the
restricted data can generate accessible temporary tables, data can be visible to non-authorized
users. We can overcome this problem by having a separate temporary area for users with
access to restricted data.
Documentation
The audit and security requirements need to be properly documented. This will be treated as a
part of justification. This document can contain all the information gathered from:
Data classification
User classification
Network requirements
Data movement and storage requirements
All auditable actions
Security affects the application code and the development timescales. Security affects the
following areas:
Application development
Database design
Testing
Application Development
Security affects overall application development and it also affects the design of the
important components of the data warehouse such as the load manager, warehouse manager,
and query manager. The load manager may require checking code to filter records and place
them in different locations. More transformation rules may also be required to hide certain
data. There may also be a requirement for extra metadata to handle any extra objects.
To create and maintain extra views, the warehouse manager may require extra code to
enforce security. Extra checks may have to be coded into the data warehouse to prevent it
from being fooled into moving data into a location where it should not be available. The
query manager requires changes to handle any access restrictions, and will need to be aware
of all extra views and aggregations.
Database design
The database layout is also affected because when security measures are implemented, there
is an increase in the number of views and tables. Adding security increases the size of the
database and hence increases the complexity of the database design and management. It will
also add complexity to the backup management and recovery plan.
Testing
Testing the data warehouse is a complex and lengthy process. Adding security to the data
warehouse also affects the testing time complexity. It affects the testing in the following two
ways:
It will increase the time required for integration and system testing.
There is added functionality to be tested which will increase the size of the testing
suite.
OLAP (Online Analytical Processing) is the technology behind many Business Intelligence
(BI) applications. OLAP is a powerful technology for data discovery, including capabilities
for limitless report viewing, complex analytical calculations, and predictive “what if”
scenario (budget, forecast) planning.
TYPES OF OLAP
Gartner coined the term HTAP in a paper in early 2014 to describe new in-memory data
systems that do both online transaction processing (OLTP) and online analytical processing
(OLAP).
HTAP relies on newer and much more powerful, often distributed, processing: sometimes it
involves a new hardware “appliance”, and it almost always requires a new software platform.
Beyond this, the key point seems to be that all the technology is sited in the relational
database. And so, there’s no more data replication, and new transactional information
becomes part of an analytical model in as fast a time as is technologically possible.
HTAP represents a new way to tie data together in a way that hasn't been possible before: a
real uniting of relational data stored in tables with the data models that are used for decision
making by business leaders.
For an example of an HTAP product, check out Olation® from PARIS Tech, the sponsor of
OLAP.com. Olation can be categorized as an HTAP product — even the name Olation
implies the combination of “OLAP” and “relational” technologies.
The structure of a multidimensional model is not a series of tables (as exists in a relational
database) but what is generally referred to as a cube. Cubes modeled in a multidimensional
database extend the concept associated with spreadsheets: just as a cell in a spreadsheet
represents the intersection of two dimensions (sales of product by region), a cell in a cube
represents the intersection of any number of dimension members (e.g., Products,
Customers, Regions, Months, ... nth dimension). As in a spreadsheet, a cell might be
calculated by formulas involving other cells.
In short, multidimensional databases allow users to add extra dimensions, rather than
additional tables, as in a relational model. And the MOLAP cube structure allows for
particularly fast, flexible data-modeling and calculations. For one, locating cells is vastly
simplified—an application can identify a cell location by name (at the intersection of
dimension members) rather than by searching an index or the entire model (via SQL
SELECT statements), as in a relational database. Further, multidimensional models
incorporate advanced array-processing techniques and algorithms for managing data and
calculations. As a result, multidimensional databases can store data very efficiently and
process calculations in a fraction of the time required of relational-based products.
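A toy illustration of the point about locating cells: with a plain Python mapping standing in for the cube, a cell is addressed directly by its dimension-member names, with no index search or table scan, and calculated cells reference other cells as a spreadsheet formula would. All member names below are made up.

```python
# A 'cube' as a mapping from (product, region, month) coordinates to values:
# a cell is located directly by name rather than via SQL SELECT and indexes.
cube = {
    ("Widget", "West", "Jan"): 120.0,
    ("Widget", "East", "Jan"): 80.0,
    ("Gadget", "West", "Feb"): 45.0,
}

def cell(product, region, month):
    """Locate a cell by its dimension members; absent cells are empty (0)."""
    return cube.get((product, region, month), 0.0)

# A calculated cell, like a spreadsheet formula referencing other cells:
jan_total = cell("Widget", "West", "Jan") + cell("Widget", "East", "Jan")
```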
The MOLAP approach has drawbacks, however. For one, relevant data must be transferred
from relational systems, which is a potentially “redundant” re-creation of data in another
(multidimensional) database. Once data has been transferred, there may be no simple means
for updating the MOLAP “engine” as individual transactions are recorded by the RDBMS.
Also, MOLAP products are typically proprietary systems. For some IT departments,
introducing a new database system is anathema, even if it means significantly greater
productivity for the type of planning, analysis and reporting that end-users rely on the
(MOLAP) solution to perform.
For a good example of a fast, scalable MOLAP product, check out PowerOLAP® from
PARIS Tech, the sponsor of OLAP.com.
ROLAP products (for Relational OLAP) are credited with being able to directly access data
stored in relational databases. The notion is that they can readily retrieve transactional data,
although this becomes suspect when very large data sets are in play, or if more complex
calculations are to be delivered, based on the transactional data. ROLAP products enable
organizations to leverage their existing investments in RDBMS (relational database
management system) software.
ROLAP products access a relational database by using SQL (structured query language),
which is the standard language that is used to define and manipulate data in an RDBMS.
Subsequent processing may occur in the RDBMS or within a mid-tier server, which accepts
requests from clients, translates them into SQL statements, and passes them on to the
RDBMS.
ROLAP products provide GUIs and generate SQL execution plans that typically remove end-
users from the SQL writing process. However, this over-reliance on processing via SQL
statements—including processing for multidimensional analysis—is a drawback. Whether it
is generated “transparently” or not, SQL is the language of relational tables: SQL’s
vocabulary is limited and its grammar often inflexible, at least to accommodate the most
sophisticated modeling required for multidimensional analyses.
There are further drawbacks to structuring a multidimensional model solely within relational
tables: Before end-users can submit requests, the relevant dimension data must be extracted
and reformatted in de-normalized structures known as star schema or snowflakes (so-called
because of the way the tables are conjoined). These tabular structures are necessary to
provide acceptable analytical performance. Sophisticated ROLAP applications also require
that aggregate tables be pre-built and maintained, eliminating the need to process summary
data at runtime.
One advantage of ROLAP over the other styles of OLAP analytic tools is that it is deemed to
be more scalable in handling huge amounts of data. ROLAP sits on top of relational
databases therefore enabling it to leverage several functionalities that a relational database is
capable of.
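The ROLAP idea, a mid-tier server translating a multidimensional request into SQL that the RDBMS executes, can be sketched as follows. This is an illustration only: sqlite3 stands in for the RDBMS, and the sales_fact table and its columns are invented.

```python
import sqlite3

def rolap_query(measure, dimension):
    """Translate a multidimensional request ('measure by dimension')
    into the SQL a ROLAP mid-tier server would pass to the RDBMS."""
    return f"SELECT {dimension}, SUM({measure}) FROM sales_fact GROUP BY {dimension}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?)",
                 [("West", 100.0), ("East", 50.0), ("West", 25.0)])

# 'Sales by region' becomes a GROUP BY executed inside the RDBMS.
rows = sorted(conn.execute(rolap_query("amount", "region")))
```

The generated SQL is hidden from the end user behind the GUI, as the text describes; the limitation is that everything the model can express must be expressible in SQL.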
HOLAP is the product of the attempt to incorporate the best features of MOLAP and ROLAP
into a single architecture. This kind of tool tries to bridge the technology gap of both products
by enabling access to or use of both multidimensional database (MDDB) and Relational
Database Management System (RDBMS) data stores. HOLAP systems store larger quantities
of detailed data in the relational tables while the aggregations are stored in the pre-calculated
cubes. HOLAP also has the capacity to “drill through” from the cube down to the relational
tables for delineated data. Some of the advantages of this system are better scalability, quick
data processing and flexibility in accessing data sources. The issue with HOLAP systems
lies precisely in the fact that they are hybrids: at best they partake of the strengths of other
systems, but they also evince the weaknesses of each, in an attempted mashup of two
distinct technologies.
Characteristics of OLAP
The most distinctive characteristic of modern OLAP tools is their capacity for
multidimensional analysis. In multidimensional analysis, data are processed and viewed as
part of a multidimensional structure. This type of data analysis is particularly attractive to
business decision makers because they tend to view business data as data that are related to
other business data.
To better understand this view, let’s examine how a business data analyst might investigate
sales figures.
The tabular (operational) view of sales data shown above is not well suited to decision
support, because the relationship between INVOICE and LINE does not provide a business
perspective of the sales data. But consider the following figure.
The end user’s view of sales data from a business perspective is more closely represented by
the multidimensional view of sales than by the tabular view of separate tables. Note also that
the multidimensional view allows end users to consolidate or aggregate data at different
levels: total sales figures by customers and by date. Finally, the multidimensional view of
data allows a business data analyst to easily switch business perspectives (dimensions) from
sales by customer to sales by division, by region, and so on.
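As a rough illustration of consolidating a measure along different dimensions and switching
business perspectives, here is a minimal sketch; the customer names, regions, and amounts
are invented for the example.

```python
from collections import defaultdict

# Hypothetical sales facts: (customer, region, month, amount).
sales = [
    ("Acme", "East", "2024-01", 120.0),
    ("Acme", "East", "2024-02", 80.0),
    ("Bolt", "West", "2024-01", 200.0),
    ("Bolt", "East", "2024-02", 50.0),
]

def aggregate(facts, key_index):
    """Consolidate the amount measure along one dimension."""
    totals = defaultdict(float)
    for row in facts:
        totals[row[key_index]] += row[-1]
    return dict(totals)

by_customer = aggregate(sales, 0)  # total sales by customer
by_month = aggregate(sales, 2)     # total sales by month
by_region = aggregate(sales, 1)    # switch perspective: sales by region
print(by_customer)                 # {'Acme': 200.0, 'Bolt': 250.0}
```

Changing the business perspective is just a matter of aggregating the same facts along a
different dimension, which is the essence of the multidimensional view.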
OLAP presentation facilities include 3-D graphics, pivot tables, crosstabs, data rotation, and
three-dimensional cubes. Such facilities are compatible with desktop spreadsheets, statistical
packages, and query and report packages.
OLAP tools also supply advanced computational functions. These include business-oriented
variables (market share, period comparisons, sales margins, product margins, and percentage
changes), financial and accounting ratios (profitability, overhead, cost allocations, and
returns), and statistical and forecasting functions. These functions are provided
automatically, and the end user does not need to redefine their components each time they
are accessed.
Advanced data modeling facilities provide support for what-if scenarios, variable
assessment, variable contributions to outcome, linear programming, and other modeling
tools.
To deliver efficient decision support, OLAP tools must have advanced data access features.
Such features include:
• Access to many different kinds of DBMSs, flat files, and internal and external data sources.
• Access to aggregated data warehouse data as well as to the detail data found in operational
databases.
• The ability to map end-user requests, expressed in either business or model terms, to the
appropriate data source and then to the proper data access language (usually SQL). The query
code must be optimized to match the data source, regardless of whether the source is
operational or data warehouse data.
• Support for very large databases. As already explained, a data warehouse can easily and
quickly grow to multiple gigabytes and even terabytes.
Advanced OLAP features become more useful when access to them is kept simple. OLAP
tools have equipped their sophisticated data extraction and analysis tools with easy-to-use
graphical interfaces. Many of the interface features are “borrowed” from previous generations
of data analysis tools that are already familiar to end users. This familiarity makes OLAP
easily accepted and readily used.
Client/server architecture provides a framework within which new systems can be designed,
developed, and implemented. The client/server environment enables an OLAP system to be
divided into several components that define its architecture. Those components can then be
placed on the same computer, or they can be distributed among several computers. Thus,
OLAP is designed to meet ease-of-use requirements while keeping the system flexible.
An Online Analytical Processing (OLAP) server is based on the multidimensional data
model. It allows managers and analysts to gain insight into information through fast,
consistent, and interactive access. This section covers the types of OLAP servers, OLAP
operations, and the differences between OLAP, statistical databases, and OLTP.
Relational OLAP
ROLAP servers are placed between the relational back-end server and client front-end tools.
To store and manage warehouse data, ROLAP uses a relational or extended-relational
DBMS.
Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines to provide multidimensional
views of data.
Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability
of ROLAP and the faster computation of MOLAP. HOLAP servers can store large volumes
of detailed data in relational tables, while the aggregations are stored separately in a MOLAP
store.
Specialized SQL Servers
Specialized SQL servers provide advanced query language and query processing support for
SQL queries over star and snowflake schemas in a read-only environment.
OLAP Operations
Since OLAP servers are based on multidimensional view of data, we will discuss OLAP
operations in multidimensional data.
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube, either by climbing up a concept hierarchy for a
dimension or by dimension reduction.
Drill-down
Drill-down is the reverse operation of roll-up. It is performed in either of the following ways:
by stepping down a concept hierarchy for a dimension, or by introducing a new dimension.
Slice
The slice operation selects one particular dimension from a given cube and provides a new
sub-cube. Consider the following diagram that shows how slice works.
Here Slice is performed for the dimension "time" using the criterion time = "Q1".
It will form a new sub-cube by selecting one or more dimensions.
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Consider the following diagram that shows the dice operation.
The dice operation on the cube based on the following selection criteria involves three
dimensions.
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to
provide an alternative presentation of data. Consider the following diagram that shows the
pivot operation.
Here, the item and location axes of the 2-D slice are rotated.
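The operations above can be sketched on a toy cube held as a list of cells; the dimension
values and sales figures below are invented for the example.

```python
from collections import defaultdict

# A tiny cube as a list of cells; the dimension values are invented.
cells = [
    {"time": "Q1", "item": "phone", "location": "Delhi", "sales": 10},
    {"time": "Q1", "item": "modem", "location": "Pune", "sales": 7},
    {"time": "Q2", "item": "phone", "location": "Delhi", "sales": 12},
    {"time": "Q2", "item": "modem", "location": "Delhi", "sales": 5},
]

def slice_cube(cells, dim, value):
    """Slice: fix one dimension to a single value, yielding a sub-cube."""
    return [c for c in cells if c[dim] == value]

def dice_cube(cells, criteria):
    """Dice: restrict two or more dimensions at once, yielding a sub-cube."""
    return [c for c in cells
            if all(c[d] in allowed for d, allowed in criteria.items())]

def roll_up(cells, keep_dims):
    """Roll-up by dimension reduction: sum the measure over the kept dims."""
    totals = defaultdict(int)
    for c in cells:
        totals[tuple(c[d] for d in keep_dims)] += c["sales"]
    return dict(totals)

q1 = slice_cube(cells, "time", "Q1")                       # time = "Q1"
phones = dice_cube(cells, {"time": {"Q1", "Q2"}, "item": {"phone"}})
by_time = roll_up(cells, ["time"])                         # item/location rolled away
# Pivot is purely presentational: the same cells are shown with the
# item and location axes swapped, so no recomputation is needed.
```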
OLAP vs OLTP
OLTP systems support day-to-day transaction processing over current, detailed data, whereas
OLAP systems support analytical queries over historical, consolidated, multidimensional
data.
APPLICATION OF OLAP
OLAP is useful in helping to determine why the data appears the way it does. For instance, if
the sales for North America are way up or way down for a given quarter, it's easy to expand
the North American geography into the states to see which state or states may be responsible
for the difference.
OLAP tools also support breakdown structures, where you start with a single piece of data
and dissect it into a series of levels, looking for something interesting.
For instance, let's say that we're looking at a total sales number for the month of June.
Breaking down this number might first occur by country as a member of the geography
dimension. In seeing a larger than expected number for North America, you might choose to
break the number down into product lines. From the product lines breakdown you may notice
that your gold plated widget sales were up substantially. You decide to look at the sales by
state so you select the gold plated widget product line and expand the state level of the
geography hierarchy. Each level in the breakdown tells you more details about what makes
up the number you're interested in.
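The breakdown walkthrough above can be expressed as successive group-and-filter steps;
the countries, product lines, states, and amounts below are invented for the sketch.

```python
from collections import defaultdict

# Hypothetical June transactions: (country, product_line, state, amount).
june = [
    ("US", "gold widget", "CA", 900),
    ("US", "gold widget", "NY", 300),
    ("US", "basic widget", "CA", 100),
    ("Canada", "gold widget", "ON", 200),
]

def breakdown(rows, level):
    """Split a total into the members of one dimension level."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[level]] += row[-1]
    return dict(totals)

total = sum(r[-1] for r in june)            # the single number we start from
by_country = breakdown(june, 0)             # North America looks large...
us_rows = [r for r in june if r[0] == "US"]
by_product = breakdown(us_rows, 1)          # ...gold widgets stand out...
gold = [r for r in us_rows if r[1] == "gold widget"]
by_state = breakdown(gold, 2)               # ...and one state drives the number
```

Each step filters to the interesting member and re-aggregates one level deeper, which is
exactly the drill path described in the text.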
DATA MINING
Data mining is the computing process of discovering patterns in large data sets involving
methods at the intersection of artificial intelligence, machine learning, statistics, and database
systems.[1] It is an interdisciplinary subfield of computer science.[1][2][3] The overall goal of
the data mining process is to extract information from a data set and transform it into an
understandable structure for further use.[1] Aside from the raw analysis step, it involves
database and data management aspects, data pre-processing, model and inference
considerations, interestingness metrics, complexity considerations, post-processing of
discovered structures, visualization, and online updating.[1] Data mining is the analysis step of
the "knowledge discovery in databases" process, or KDD.[4]
Data mining can also be described as sifting through very large amounts of data for useful
information. It uses artificial intelligence techniques, neural networks, and advanced
statistical tools (such as cluster analysis) to reveal trends, patterns, and relationships which
might otherwise have remained undetected. In contrast to an expert system (which draws
inferences from the given data on the basis of a given set of rules), data mining attempts to
discover hidden rules underlying the data. It is also called data surfing.
Key difference: Data Mining is actually the analysis of data. It is the computer-assisted
process of digging through and analyzing enormous sets of data that have either been
compiled by the computer or have been inputted into the computer. Data warehousing
is the process of compiling information or data into a data warehouse. A data
warehouse is a database used to store data.
Data Mining is actually the analysis of data. It is the computer-assisted process of digging
through and analyzing enormous sets of data that have either been compiled by the computer
or have been inputted into the computer. In data mining, the computer will analyze the data
and extract the meaning from it. It will also look for hidden patterns within the data and try to
predict future behavior. Data Mining is mainly used to find and show relationships among the
data.
The purpose of data mining, also known as knowledge discovery, is to allow businesses to
view these behaviors, trends and/or relationships and to be able to factor them within their
decisions. This allows the businesses to make proactive, knowledge-driven decisions.
The term ‘data mining’ comes from the fact that the process of data mining, i.e. searching for
relationships between data, is similar to mining and searching for precious materials. Data
mining tools use artificial intelligence, machine learning, statistics, and database systems to
find correlations between the data. These tools can help answer business questions that
traditionally were too time consuming to resolve.
Data Mining includes various steps, including the raw analysis step, database and data
management aspects, data preprocessing, model and inference considerations, interestingness
metrics, complexity considerations, post-processing of discovered structures, visualization,
and online updating.
In contrast, data warehousing is completely different. However, data warehousing and data
mining are interrelated. Data warehousing is the process of compiling information or data
into a data warehouse. A data warehouse is a database used to store data. It is a central
repository of data in which data from various sources is stored. This data warehouse is then
used for reporting and data analysis. It can be used for creating trending reports for senior
management reporting such as annual and quarterly comparisons.
The purpose of a data warehouse is to provide flexible access to the data to the user. Data
warehousing generally refers to the combination of many different databases across an entire
enterprise.
The main difference between data warehousing and data mining is that data warehousing is
the process of compiling and organizing data into one common database, whereas data
mining is the process of extracting meaningful data from that database. Data mining can only
be done once data warehousing is complete.
Advantages and Disadvantages of Data
Mining
Data mining is an important part of the knowledge discovery process, through which we can
analyze an enormous set of data and extract hidden and useful knowledge. Data mining is
applied effectively not only in the business environment but also in other fields such as
weather forecasting, medicine, transportation, healthcare, insurance, and government. Data
mining has many advantages when used in a specific industry. Besides those advantages, it
also has its own disadvantages, e.g., privacy, security, and misuse of information. We will
examine these advantages and disadvantages of data mining in different industries in greater
detail.
Data mining helps marketing companies build models based on historical data to predict who
will respond to new marketing campaigns such as direct mail and online marketing. From
the results, marketers can choose an appropriate approach for selling profitable products to
targeted customers.
Data mining brings many benefits to retail companies, in much the same way as to
marketing. Through market basket analysis, a store can arrange its products so that
customers can conveniently buy frequently purchased items together. It also helps retail
companies offer discounts on particular products that will attract more customers.
Finance / Banking
Data mining gives financial institutions information about loans and credit reporting. By
building a model from historical customer data, a bank or financial institution can
distinguish good loans from bad ones. In addition, data mining helps banks detect fraudulent
credit card transactions to protect the cardholder.
Manufacturing
By applying data mining to operational engineering data, manufacturers can detect faulty
equipment and determine optimal control parameters. For example, semiconductor
manufacturers face the challenge that, even when the conditions of manufacturing
environments at different wafer production plants are similar, the quality of the wafers
varies, and some wafers have defects for unknown reasons. Data mining has been applied to
determine the ranges of control parameters that lead to the production of the golden wafer;
those optimal control parameters are then used to manufacture wafers of the desired quality.
Governments
Data mining helps government agencies by digging into and analyzing records of financial
transactions to build patterns that can detect money laundering or criminal activities.
Privacy issues
Concerns about personal privacy have been increasing enormously recently, especially now
that the internet is booming with social networks, e-commerce, forums, and blogs. Because
of privacy issues, people are afraid that their personal information is being collected and
used in an unethical way, potentially causing them a lot of trouble. Businesses collect
information about their customers in many ways in order to understand their purchasing
behavior trends. However, businesses do not last forever; some day they may be acquired or
disappear, and at that point the personal information they own may be sold on or leaked.
Security issues
Security is a big issue. Businesses hold information about their employees and customers,
including social security numbers, birthdays, payroll, and so on. However, how carefully this
information is handled is still in question. There have been many cases in which hackers
accessed and stole large customer data sets from big corporations such as Ford Motor Credit
Company and Sony. With so much personal and financial information available, stolen
credit cards and identity theft have become a big problem.
Misuse of information
Information collected through data mining for ethical purposes can be misused. It may be
exploited by unethical people or businesses to take advantage of vulnerable people or to
discriminate against a group of people.
If you are interested in using data mining, it is important to have an understanding of how it
works. The method that is used with data mining to make predictions is called modeling.
Modeling is the process of creating a model.
For example, suppose you have a database with information on the income, credit history,
age, sex, and occupation of your customers. You can use data mining tools to find the
common attributes of the customers who frequently make long-distance phone calls.
The use of data mining may allow you to learn that most of your high-value customers are
middle-aged women around 45 years of age. You may also find that these women have an
average income that is in excess of $50,000 a year. Now that you know a bit about your best
customers, you can now tailor your advertising efforts to suit their needs. By doing this, you
will greatly increase your chances of earning a profit. Computer algorithms are frequently
used in data mining programs. However, the factors which have led to the increasing
popularity of data mining technologies are the increase in both processing power and storage.
Another thing that has led to the rapid popularity of data mining technology is graphical
interfaces. These interfaces have made the programs easier to use, which has allowed them
to be adopted by a larger segment of the population. Artificial neural networks are a cutting-
edge technology that is being used more and more in data mining applications.
Unlike computer algorithms, neural networks are not linear, and are capable of learning.
Neural networks are modeled after the human mind, and have powerful applications in data
mining that have not been fully explored. In addition to this, decision trees play an important
role in the development of data mining programs.
As the name implies, decision trees are structures that contain a number of different
decisions. Each decision could be called a branch. The decisions define the rules for a given
set of data. The next element that makes up an important part of data mining is called rule
induction. Rule induction pulls rules from data based on an "if-then" scenario.
The next part that makes up data mining is the genetic algorithm. Genetic algorithms utilize
techniques based on mutation and natural selection. The last important part of data mining
tools is called the nearest neighbor. The nearest neighbor technique categorizes records by
grouping them with other similar records within a database.
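A minimal nearest-neighbor sketch follows; the customer records and segment labels are
invented, and plain Euclidean distance is a simplifying assumption (real tools support richer
similarity measures).

```python
def nearest_neighbor(record, database):
    """Classify a record with the label of its most similar stored record.
    Similarity here is plain Euclidean distance over numeric fields."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    _, label = min(database, key=lambda row: dist(record, row[0]))
    return label

# Toy customer records: (income in $k, age) -> invented segment labels.
db = [
    ((55, 45), "high value"),
    ((20, 23), "low value"),
    ((60, 50), "high value"),
]
print(nearest_neighbor((58, 47), db))  # high value
```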
There are a number of real-world applications of data mining programs. Generally, having
information which is highly detailed will allow you to make predictions that are equally
detailed. Using this detailed information to make predictions about the behavior of your
customers can allow you to make large profits.
Companies can use data mining tools to get answers to complex questions. For example, a
credit card company that wants to increase its revenues could use data mining to find out if
reducing the minimum payments would allow them to earn more interest. If the company has
detailed information related to their customers, they should be able to make accurate
predictions about how customers will react to policies.
Data mining is a popular technological innovation that converts piles of data into useful
knowledge that can help the data owners/users make informed choices and take smart actions
for their own benefit. In specific terms, data mining looks for hidden patterns amongst
enormous sets of data that can help to understand, predict, and guide future behavior. A more
technical explanation: Data Mining is the set of methodologies used in analyzing data from
various dimensions and perspectives, finding previously unknown hidden patterns,
classifying and grouping the data and summarizing the identified relationships.
The elements of data mining include extraction, transformation, and loading of data onto the
data warehouse system, managing data in a multidimensional database system, providing
access to business analysts and IT experts, analyzing the data with tools, and presenting the
data in a useful format, such as a graph or table. This is achieved by identifying relationships
using classes, clusters, associations, and sequential patterns through statistical analysis,
machine learning, and neural networks.
Data can generate revenue. It is a valuable financial asset of an enterprise. Businesses can use
data mining for knowledge discovery and exploration of available data. This can help them
predict future trends, understand customer’s preferences and purchase habits, and conduct a
constructive market analysis. They can then build models based on historical data patterns
and garner more from targeted market campaigns as well as strategize more profitable selling
approaches. Data mining helps enterprises to make informed business decisions, enhances
business intelligence, thereby improving the company’s revenue and reducing cost overheads.
Data mining is also useful in finding data anomaly patterns that are essential in fraud
detection and areas of weak or incorrect data collation/ modification. Getting the help of
experienced data entry service providers in the early stages of data management can make the
subsequent data mining easier.
The art of data mining has been constantly evolving. There are a number of innovative and
intuitive techniques that have emerged that fine-tune data mining concepts in a bid to give
companies more comprehensive insight into their own data with useful future trends. Many
techniques are employed by the data mining experts, some of which are listed below:
Handling Incomplete Data:
Data mining relies on the data actually present, so if data is incomplete, the results
can be completely off the mark. Hence, it is imperative to have the intelligence to sniff
out incomplete data where possible. Techniques such as Self-Organizing Maps (SOMs)
help to map missing data by visualizing a model of multi-dimensional complex data.
Multi-task learning for missing inputs, in which one existing and valid data set along
with its procedures is compared with another compatible but incomplete data set, is
one way to seek out such data. Multi-dimensional perceptrons using intelligent
algorithms to build imputation techniques can address incomplete attributes of data.
Database Analysis:
Databases hold key data in a structured format, so algorithms built using their own
language (such as SQL macros) to find hidden patterns within organized data is most
useful. These algorithms are sometimes inbuilt into the data flows, e.g. tightly coupled
with user-defined functions, and the findings presented in a ready-to-refer-to report
with meaningful analysis.
A good technique is to have the snapshot dump of data from a large database in a
cache file at any time and then analyze it further. Similarly, data mining algorithms
must be able to pull out data from multiple, heterogeneous databases and predict
changing trends.
Text Analysis:
This concept is very helpful to automatically find patterns within the text embedded in
hordes of text files, word-processed files, PDFs, and presentation files. The text-
processing algorithms can for instance, find out repeated extracts of data, which is
quite useful in the publishing business or universities for tracing plagiarism.
Interactive Data Mining:
A data warehouse or large data store must be supported with interactive, query-based
data mining for all sorts of data mining functions such as classification, clustering,
association, and prediction. OLAP (Online Analytical Processing) is one such useful
methodology. Other concepts that facilitate interactive data mining are graph analysis,
aggregate querying, image classification, meta-rule-guided mining, swap
randomization, and multidimensional statistical analysis.
Listed below are some of the popular multi-purpose data mining tools
that are leading the trends:
One of the most popular tools is ready-made, open source, requires no coding, and
provides advanced analytics. Written in Java, it incorporates multifaceted data mining
functions such as data preprocessing, visualization, and predictive analysis, and can be
easily integrated with WEKA and R-tool to directly run models from scripts written in
those two.
R-tool:
This is written in C and FORTRAN, and allows data miners to write scripts as in a
programming language/platform. It is used to build statistical and analytical software
for data mining, and supports graphical analysis, both linear and nonlinear modeling,
classification, clustering, and time-based data analysis.
Orange and NLTK:
Python is very popular due to its ease of use and powerful features. Orange is an open
source tool written in Python, with useful data analytics, text analysis, and
machine-learning features embedded in a visual programming interface. NLTK, also
written in Python, is a powerful language-processing data mining tool, which
combines data mining, machine learning, and data scraping features that can easily
be built up for customized needs.
Knime:
Primarily used for data preprocessing, i.e. data extraction, transformation and
loading, Knime is a powerful tool with a GUI that shows the network of data nodes.
Popular among financial data analysts, it has modular data pipelining and leverages
machine learning and data mining concepts liberally for building business
intelligence reports.
Data mining tools and techniques are now more important than ever for all businesses, big or
small, if they would like to leverage their existing data stores to make business decisions that
will give them a competitive edge. Such actions based on data evidence and advanced
analytics have better chances of increasing sales and facilitating growth. Adopting well-
established techniques and tools, and enlisting the help of data mining experts, will assist
companies in utilizing relevant and powerful data mining concepts to their fullest potential.
Problem definition
A data mining project starts with the understanding of the business problem. Data
mining experts, business experts, and domain experts work closely together to define
the project objectives and the requirements from a business perspective. The project
objective is then translated into a data mining problem definition.
In the problem definition phase, data mining tools are not yet required.
Data exploration
Domain experts understand the meaning of the metadata. They collect, describe, and
explore the data. They also identify quality problems of the data. A frequent exchange
with the data mining experts and the business experts from the problem definition
phase is vital.
In the data exploration phase, traditional data analysis tools, for example, statistics,
are used to explore the data.
Data preparation
Domain experts build the data model for the modeling process. They collect, cleanse,
and format the data because some of the mining functions accept data only in a certain
format. They also create new derived attributes, for example, an average value.
In the data preparation phase, data is tweaked multiple times in no prescribed order.
Preparing the data for the modeling tool by selecting tables, records, and attributes,
are typical tasks in this phase. The meaning of the data is not changed.
Modeling
Data mining experts select and apply various mining functions because you can use
different mining functions for the same type of data mining problem. Some of the
mining functions require specific data types. The data mining experts must assess
each model.
In the modeling phase, a frequent exchange with the domain experts from the data
preparation phase is required.
The modeling phase and the evaluation phase are coupled. They can be repeated
several times to change parameters until optimal values are achieved. When the final
modeling phase is completed, a model of high quality has been built.
Evaluation
Data mining experts evaluate the model. If the model does not satisfy their
expectations, they go back to the modeling phase and rebuild the model by changing
its parameters until optimal values are achieved. When they are finally satisfied with
the model, they can extract business explanations and evaluate the following
questions:
At the end of the evaluation phase, the data mining experts decide how to use the data
mining results.
Deployment
Data mining experts use the mining results by exporting the results into database
tables or into other applications, for example, spreadsheets.
The Intelligent Miner™ products assist you to follow this process. You can apply the
functions of the Intelligent Miner products independently, iteratively, or in combination.
The following figure shows the phases of the Cross Industry Standard Process for data
mining (CRISP DM) process model.
IM Modeling helps you to select the input data, explore the data, transform the data, and mine
the data. With IM Visualization you can display the data mining results to analyze and
interpret them. With IM Scoring, you can apply the model that you have created with IM
Modeling.
What is Market Basket Analysis?
In market basket analysis (also called association analysis or frequent itemset mining), you
analyze purchases that commonly happen together. For example, people who buy bread
and peanut butter also buy jelly. Or people who buy shampoo might also buy conditioner.
The relationships between items are the target of the analysis. Knowing what your
customers tend to buy together can help with marketing efforts and store/website layout.
Market Basket Analysis is a modelling technique based upon the theory that if you buy a
certain group of items, you are more (or less) likely to buy another group of items. For
example, if you are in an English pub and you buy a pint of beer and don't buy a bar meal,
you are more likely to buy crisps (US: chips) at the same time.
Market basket analysis can also be used to cross-sell products. Amazon famously uses an
algorithm to suggest items that you might be interested in, based on your browsing history or
what other people have purchased.
A popular urban legend is that a grocery store, after running market basket analysis, found
that men were likely to buy beer and diapers together. The store then increased sales by
placing the beer next to the diapers.
It sounds simple (and in many cases, it is). However, pitfalls to be aware of:
For large inventories (i.e. over 10,000), the combination of items may explode into
the billions, making the math almost impossible.
Data is often mined from large transaction histories, and the large amounts of data
involved are usually handled by specialized statistical software.
Basic Terminology
An itemset is the set of items a customer buys at the same time. It is typically stated as a
logic rule like IF {bread, peanut butter} THEN {jelly}. An itemset can range from no items
(the null itemset, which is usually ignored) to all items in the data set.
The support count is a count of how often the itemset appears in the transaction database. The
support is how often the item appears, stated as a probability. For example, if the support
count is 21 out of a possible 1,000 transactions, then the probability is 21/1,000 or 0.021.
The confidence is the conditional probability that the items will be purchased together.
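These definitions translate directly into code. The sketch below uses a tiny invented
transaction database to compute support count, support, and confidence for the rule
IF {bread, peanut butter} THEN {jelly}.

```python
# A hypothetical transaction database, one set of items per basket.
transactions = [
    {"bread", "peanut butter", "jelly"},
    {"bread", "peanut butter"},
    {"bread", "jelly"},
    {"milk"},
]

def support_count(itemset, transactions):
    """How often the itemset appears in the transaction database."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Support as a probability: count over total transactions."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Conditional probability of the consequent given the antecedent."""
    return (support_count(antecedent | consequent, transactions)
            / support_count(antecedent, transactions))

# Rule: IF {bread, peanut butter} THEN {jelly}
ante, cons = {"bread", "peanut butter"}, {"jelly"}
print(support(ante | cons, transactions))    # 0.25 (1 of 4 baskets)
print(confidence(ante, cons, transactions))  # 0.5  (1 of the 2 baskets with the antecedent)
```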
Calculations
Calculations are rarely performed by hand, due to the large number of combinations
possible from even relatively small datasets; specialized statistical software is typically used
instead.
Predictive modeling is a process that uses data mining and probability to forecast outcomes. Each
model is made up of a number of predictors, which are variables that are likely to influence future
results. Once data has been collected for relevant predictors, a statistical model is formulated.
The assumption is that the new data comes from a distribution similar to that of the data
used to build the decision tree. In many cases this is a correct assumption, and that is why
you can use the decision tree to build a predictive model.
In the book "Data Mining Concepts and Techniques", Han and Kamber's view is that
predicting class labels is classification, and predicting values (e.g. using regression
techniques) is prediction.
In decision analysis, a decision tree can be used to visually and explicitly represent decisions
and decision making. In data mining, a decision tree describes data (but the resulting
classification tree can be an input for decision making). This page deals with decision trees in
data mining.
Classification tree analysis is when the predicted outcome is the class to which the
data belongs.
Regression tree analysis is when the predicted outcome can be considered a real
number (e.g. the price of a house, or a patient's length of stay in a hospital).
The term Classification And Regression Tree (CART) analysis is an umbrella term used to
refer to both of the above procedures, first introduced by Breiman et al.[3] Trees used for
regression and trees used for classification have some similarities - but also some differences,
such as the procedure used to determine where to split.[3]
Some techniques, often called ensemble methods, construct more than one decision tree.
Decision tree learning is the construction of a decision tree from class-labeled training
tuples. A decision tree is a flow-chart-like structure, where each internal (non-leaf) node
denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or
terminal) node holds a class label. The topmost node in a tree is the root node.
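The flow-chart structure just described maps naturally onto code. In this illustrative sketch (the weather attributes and class labels are invented, not taken from the text), each internal node tests an attribute, each branch carries a test outcome, and each leaf holds a class label:

```python
# A decision tree as nested dicts: internal nodes test an attribute,
# branches map attribute values to subtrees, leaves are class labels.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity",
                  "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rain": {"attribute": "wind",
                 "branches": {"strong": "no", "weak": "yes"}},
    },
}

def classify(node, record):
    """Walk from the root node to a leaf, following attribute tests."""
    while isinstance(node, dict):          # internal (non-leaf) node
        value = record[node["attribute"]]  # test on an attribute
        node = node["branches"][value]     # follow the matching branch
    return node                            # leaf holds the class label

label = classify(tree, {"outlook": "sunny", "humidity": "normal"})
```

Decision tree learning algorithms (such as CART) build this structure automatically from class-labeled training tuples; the sketch only shows how a finished tree is applied.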
Regression analysis is an important tool for modelling and analyzing data. Here, we fit a
curve or line to the data points in such a manner that the distances of the data points from
the curve or line are minimized. I'll explain this in more detail in the coming sections.
There are multiple benefits to using regression analysis.
Regression analysis also allows us to compare the effects of variables measured on different
scales, such as the effect of price changes and the number of promotional activities. These
benefits help market researchers / data analysts / data scientists to eliminate and evaluate the
best set of variables to be used for building predictive models.
For the creative ones, you can even cook up new regressions, if you feel the need to use a
combination of the parameters above, which people haven’t used before. But before you start
that, let us understand the most commonly used regressions:
1. Linear Regression
It is one of the most widely known modeling techniques. Linear regression is usually among
the first few topics people pick up while learning predictive modeling. In this technique, the
dependent variable is continuous, the independent variable(s) can be continuous or discrete,
and the regression line is linear in nature.
Linear regression establishes a relationship between the dependent variable (Y) and one or
more independent variables (X) using a best-fit straight line (also known as the regression
line).
The difference between simple and multiple linear regression is that multiple linear
regression has more than one independent variable, whereas simple linear regression has
only one. Now, the question is: how do we obtain the best fit line?
This task can be accomplished with the Least Squares Method, the most common method
used for fitting a regression line. It calculates the best-fit line for the observed data by
minimizing the sum of the squares of the vertical deviations from each data point to the line.
Because the deviations are squared before being added, positive and negative values do not
cancel out.
We can evaluate model performance using the R-square metric. For more detail on this
metric, you can read: Model Performance metrics Part 1, Part 2.
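For the simple one-predictor case, the least squares fit and the R-square metric described above can be computed directly; the data points below are illustrative:

```python
def fit_line(xs, ys):
    """Least squares: minimize the sum of squared vertical deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx  # intercept: line passes through the mean point
    return a, b

def r_squared(xs, ys, a, b):
    """R-square: 1 - (residual sum of squares / total sum of squares)."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]
a, b = fit_line(xs, ys)
r2 = r_squared(xs, ys, a, b)
```

An R-square close to 1 indicates that the fitted line explains almost all of the variance in the observed values.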
Important Points:
2. Logistic Regression
Logistic regression is used to find the probability of an event's success or failure. We
should use logistic regression when the dependent variable is binary (0/1, True/False, Yes/
No) in nature. Here the value of Y ranges from 0 to 1, and it can be represented by the
following equation.
Since we are working here with a binomial distribution (dependent variable), we need to
choose the link function best suited to this distribution: the logit function. In the equation
above, the parameters are chosen to maximize the likelihood of observing the sample
values, rather than to minimize the sum of squared errors (as in ordinary regression).
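As a rough sketch of these ideas, the code below uses the inverse of the logit link to map a linear predictor into (0, 1), and chooses the parameters by iteratively increasing the likelihood (simple gradient ascent) rather than minimizing squared error. The data and learning rate are illustrative:

```python
import math

def sigmoid(z):
    """Inverse of the logit link: maps the linear predictor to (0, 1)."""
    return 1 / (1 + math.exp(-z))

def fit_logistic(xs, ys, steps=5000, lr=0.1):
    """Maximize the log-likelihood by simple gradient ascent."""
    a, b = 0.0, 0.0                # intercept and slope
    for _ in range(steps):
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            a += lr * (y - p)      # gradient of the log-likelihood
            b += lr * (y - p) * x  # w.r.t. intercept and slope
    return a, b

# Illustrative binary outcome: success becomes likely as x grows.
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]
a, b = fit_logistic(xs, ys)
p_low, p_high = sigmoid(a + b * 0), sigmoid(a + b * 5)
```

After fitting, the model assigns a low probability of success to small x and a high probability to large x, which is exactly the behavior a binary (0/1) dependent variable calls for.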
Important Points:
Natural language processing (NLP) is a field of computer science, artificial intelligence and
computational linguistics concerned with the interactions between computers and human (natural)
languages, and, in particular, concerned with programming computers to fruitfully process large
natural language corpora.
Text mining, also referred to as text data mining and roughly equivalent to text analytics, is the process
of deriving high-quality information from text. High-quality information is typically derived by
devising patterns and trends through means such as statistical pattern learning.
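At its simplest, deriving patterns from text starts with term statistics. A toy keyword extractor that ranks words by frequency after removing common stopwords (the stopword list and sample text are illustrative, not from any particular product):

```python
from collections import Counter

# A tiny, illustrative stopword list; real systems use much larger ones.
STOPWORDS = {"the", "of", "and", "to", "a", "in", "is",
             "from", "as", "for", "into"}

def keywords(text, top=3):
    """Rank words by raw frequency, ignoring common stopwords."""
    words = [w.strip(".,").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return [w for w, _ in counts.most_common(top)]

doc = ("Text mining derives information from text. "
       "Mining text for patterns turns text into information.")
top = keywords(doc)
```

Production text analytics layers far more on top of this (stemming, entity extraction, statistical models), but frequency counting is the usual starting point.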
Text mining can be used to make large quantities of unstructured data accessible and
useful, not only generating value but also delivering ROI from unstructured data
management, as we’ve seen with applications of text mining for Risk Management
Software and Cybercrime applications.
These 10 text mining examples can give you an idea of how this technology is helping
organizations today.
1 – Risk management
No matter the industry, insufficient risk analysis is often a leading cause of failure. This is
especially true in the financial industry, where the adoption of risk management software
based on text mining technology can dramatically increase the ability to mitigate risk,
enabling complete management of thousands of sources and petabytes of text documents,
and providing the ability to link information together and access the right information
at the right time.
2 – Knowledge management
Not being able to find important information quickly is always a challenge when managing
large volumes of text documents—just ask anyone in the healthcare industry. Here,
organizations are challenged with a tremendous amount of information—decades of
research in genomics and molecular techniques, for example, as well as volumes of clinical
patient data—that could potentially be useful for their largest profit center: new product
development. Here, knowledge management software based on text mining offers a clear
and reliable solution to the “info-glut” problem.
3 – Cybercrime prevention
The anonymous nature of the internet and the many communication features operated through
it contribute to the increased risk of internet-based crimes. Today, text mining intelligence
and anti-crime applications are making internet crime prevention easier for any
enterprise and law enforcement or intelligence agencies.
4 – Customer care
Text mining, as well as natural language processing, are frequent applications for customer
care. Today, text analytics software is frequently adopted to improve customer experience
using different sources of valuable information, such as surveys, trouble tickets, and customer
call notes, to improve the quality, effectiveness and speed of problem resolution. Text
analysis is used to provide a rapid, automated response to the customer, dramatically
reducing reliance on call center operators to solve problems.
5 – Fraud prevention
Text analytics is a tremendously effective technology in any domain where the majority of
information is collected as text. Insurance companies are taking advantage of text mining
technologies by combining the results of text analysis with structured data to prevent
fraud and swiftly process claims.
6 – Contextual Advertising
Digital advertising is a moderately new and growing field of application for text
analytics. Here, companies such as Admantx have made text mining the core engine for
contextual retargeting with great success. Compared to the traditional cookie-based
approach, contextual advertising provides better accuracy and completely preserves the
user’s privacy.
7 – Business intelligence
This process is used by large companies to uphold and support decision making. Here, text
mining really makes a difference, enabling the analyst to jump quickly to the answer
even when analyzing petabytes of internal and open-source data. Applications such as the
Cogito Intelligence Platform are able to monitor thousands of sources and
analyze large data volumes to extract from them only the relevant content.
8 – Content enrichment
While it’s true that working with text content still requires a bit of human effort, text
analytics techniques make a significant difference when it comes to being able to more
effectively manage large volumes of information. Text mining techniques enrich content,
providing a scalable layer to tag, organize and summarize the available content that makes
it suitable for a variety of purposes.
9 – Spam filtering
E-mail is an effective, fast and reasonably cheap way to communicate, but it comes with a
dark side: spam. Today, spam is a major issue for internet service providers, increasing
their costs for service management and hardware/software updating; for users, spam is an
entry point for viruses and impacts productivity. Text mining techniques can be
implemented to improve the effectiveness of statistical-based filtering methods.
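A common statistical-based filtering method of the kind mentioned above is a naive Bayes classifier over word counts. This toy sketch, trained on invented messages, scores a message by its log-odds of being spam:

```python
import math
from collections import Counter

def train(messages):
    """messages: list of (text, is_spam). Count words per class."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        counts[is_spam].update(text.lower().split())
        totals[is_spam] += 1
    return counts, totals

def spam_score(text, counts, totals):
    """Log-odds of spam, with add-one (Laplace) smoothing."""
    vocab = set(counts[True]) | set(counts[False])
    score = math.log(totals[True] / totals[False])  # class prior odds
    n_spam = sum(counts[True].values())
    n_ham = sum(counts[False].values())
    for w in text.lower().split():
        p_spam = (counts[True][w] + 1) / (n_spam + len(vocab))
        p_ham = (counts[False][w] + 1) / (n_ham + len(vocab))
        score += math.log(p_spam / p_ham)
    return score  # positive means more likely spam

msgs = [("win free money now", True),
        ("free prize claim now", True),
        ("meeting agenda for monday", False),
        ("lunch on monday", False)]
counts, totals = train(msgs)
score = spam_score("claim your free money", counts, totals)
```

Real filters combine many more signals (headers, sender reputation, phrase features), but this word-level scoring is the statistical core.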
10 – Social media analysis
Today, social media is one of the most prolific sources of unstructured data, and
organizations have taken notice. Social media is increasingly recognized as a valuable
source of market and customer intelligence, and companies are using it to analyze or predict
customer needs and understand the perception of their brand. Text analytics can address
both needs by analyzing large volumes of unstructured data, extracting opinions,
emotions and sentiment and their relations with brands and products.
Web structure mining, one of the three categories of web mining, is used to identify the
relationships between Web pages, whether linked by shared information or by direct hyperlink. This
structural data can be discovered by deriving a web structure schema through database techniques
for Web pages.
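One classic way to mine hyperlink structure is to treat pages as a directed graph and score each page by the links pointing to it. A minimal PageRank-style iteration over an invented link graph:

```python
def pagerank(links, damping=0.85, iters=50):
    """links: page -> list of pages it links to. Returns page scores."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    rank = {p: 1 / len(pages) for p in pages}
    for _ in range(iters):
        # Start each page with the "random jump" share...
        new = {p: (1 - damping) / len(pages) for p in pages}
        # ...then distribute each page's rank over its outgoing links.
        for page, outs in links.items():
            if outs:
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new[target] += share
        rank = new
    return rank

graph = {"home": ["about", "news"],
         "about": ["home"],
         "news": ["home", "about"]}
rank = pagerank(graph)
```

Pages that attract many incoming links (here, "home") end up with the highest scores, which is the basic intuition behind link-based ranking of Web pages.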
Web mining is a rapidly growing research area. It consists of Web usage mining, Web structure
mining, and Web content mining. Web usage mining refers to the discovery of user access
patterns from Web usage logs. Web structure mining tries to discover useful knowledge from
the structure of hyperlinks. Web content mining aims to extract/mine useful information or
knowledge from web page contents. This tutorial focuses on Web Content Mining.
Web content mining is related to, but different from, data mining and text mining. It is related to
data mining because many data mining techniques can be applied to Web content mining. It
is related to text mining because much of the Web's content is text. However, it is also quite
different from data mining because Web data are mainly semi-structured and/or unstructured,
while data mining deals primarily with structured data. Web content mining also differs
from text mining because of the semi-structured nature of the Web, whereas text mining focuses
on unstructured texts. Web content mining thus requires creative applications of data mining
and/or text mining techniques, as well as its own unique approaches. In the past few years,
there has been a rapid expansion of activity in the Web content mining area. This is not
surprising given the phenomenal growth of Web content and the significant economic
benefit of such mining. However, due to the heterogeneity and lack of structure of Web
data, automated discovery of targeted or unexpected knowledge still presents many
challenging research problems. In this tutorial, we will examine the following important
Web content mining problems and discuss existing techniques for solving them. Some
other emerging problems will also be surveyed.
All these tasks present major research challenges and their solutions also have immediate
real-life applications. The tutorial will start with a short motivation of the Web content
mining. We then discuss the difference between web content mining and text mining, and
between Web content mining and data mining. This is followed by presenting the above
problems and current state-of-the-art techniques. Various examples will also be given to help
participants better understand how this technology can be deployed and how it can help businesses.
All parts of the tutorial will have a mix of research and industry flavor, addressing seminal
research concepts and looking at the technology from an industry angle.
Let’s start with the size of the business intelligence market. Just a year ago, experts were
projecting a need for an additional 1.5 million data handling business intelligence
professionals, just in the U.S., with a projected business intelligence and analytics market of
over $20 billion within three years. Since then, we’ve seen some of these numbers borne out
in the rapid trend towards business intelligence adoption.
With business intelligence tools so front and center in today’s marketplace, what will we see
happen to this sort of popular business methodology as we move forward into 2017 and
beyond?
Whereas many of today’s tools are isolated in silos, independently operated by users, and not
connected to a broader network, there is somewhat of a consensus that the next generation of
business intelligence will be geared toward larger sets of users – and more connected to
greater systems.
“BI development has been focussed on small form-factor devices, but the locus will now shift
to very large touch devices,” McKenna wrote. “This will enable teams of colleagues to work
towards decisions through the side by side exploration of data in real thought time.”
This type of trend is now evident, with the expansion of some BI platforms toward more
evolved collaborative and machine learning systems.
Another key concept around the future of business intelligence is that as marketers glean
more and more data about users, those users will jump on the bandwagon and get proactive
about studying themselves. This can lead to more transparent data markets and a more
engaged community.
At the same time, a leveling out of advanced data-gathering technologies will provide more
self-service analysis. Technologies like JSON will bring more diverse data into the
common Internet, and allow it to be “matched,” linked and manipulated in new and
interesting ways. An article from Tableau at the end of 2016 suggests that this greater access
to data will drive more engagement.
“People will have business discussions using data, engaging each other with facts and
insights,” Francois Ajenstat wrote. “They’ll also be able to collaborate more easily with
experts in the organization and drive insights faster by leveraging the collective intelligence
of the organization.”
Another likely trend rests on common ideas about what happens when you deliver more
demographic and market data to business networks armed with ever-advancing analytical and
automation tools.
Eventually, you get to a place where business intelligence work doesn’t have to be jump-
started by human users anymore. Instead, you reach a tipping point where you’re more likely
to passively receive this intelligence than you are to go looking for it in a report, or even on a
dashboard.
Up to this point, companies have been widely trumpeting innovation in their visual dashboard
designs. Features like more sophisticated charting and graphing have become paramount.
Data visualization has become the watchword. But some suggest that with this type of
advance pretty well saturated, we are about to see tools get a lot more proactive in
automatically delivering BI to recipients.
A 2011 Quora post by Jason Kolb shows emphasis on this trend has been building for a
while.
“Relevant data will find you, and not vice versa,” Kolb wrote, talking about trends like real-
time deep analysis and more personalized data, several years prior to the current marketplace,
where we’re already seeing some of these “AI” types of advances deliver rich data to
consumers and business users. Now, as we enter 2017, sources like this Targit article are still
hinting at the self-service, user-friendly innovations and proactivity to come.
Network Advancements
At the same time, as software technologies are doing more with BI data, new networking
structures are emerging to handle these massive troves of data and their flow into and out of
business systems.
We’ve seen the cloud take over large swathes of the business world, where third-party
companies offer a raft of services to individual clients to help them manage business
intelligence and analytical data. But behind that vendor wall, you have advances like network
virtualization. Here, engineers are figuring out new ways to deploy hardware architectures,
and to automatically scale with the volume of data. New technologies like container
architectures are also helping this process along.
One major player in the data revolution is Apache — with a range of Hadoop and related
open source tools, the Apache foundation has played a critical role. However, paid services
like Amazon Web Services are also major participants in the rush to provide architectural
support for all of the new business intelligence methods hitting the executive’s desk.
All of this describes some of the modern advances in business intelligence as we continue to
progress through the era of big data. The future of business intelligence is likely to be much
more automated and aggressively utilized, with fewer bottlenecks in terms of interface
limitations and the free flow of data. It’s all part of a quickly evolving model that businesses
don’t want to lose out on in the years to come.
For our purpose we will reserve the term "moral" for use in a personal decision making
context. This means that we will use "moral" when dealing with personal or life decisions
with a focus on "right conduct" as the result of a personal choice. Ethical decision making
will be reserved for use in a group decision making context. Specifically, we will address
ethical decision making in business as providing the guiding requirements or goals for right
conduct. These requirements often come as the result of organizational definition, agreement,
or long-standing custom. There is clear recognition that ultimately a personal choice must be
made with respect to right conduct, but business ethics will provide the assessment
framework for correct behavior in the business organization.
Utilitarian
Moral rights
Universalist
Cost-benefit
Fairness or justice
Common good
Virtue
Deontological (based on study of moral obligation)
Theological
Contextualist
Principle-based
As well as others
The good news is that, in general, most approaches will lead to similar choices for most
decisions involving ethics. There are obvious and sometimes notable exceptions, but these
often involve ethical dilemmas that can only be addressed in the context of the specific
decision being made.
Clearly our thoughts affect our words and deeds, and in a group context, ethics in decision
making can be evaluated through the tangible evidence and outcomes from words and
actions. Again, thoughts and motivation are left to the personal realm. As a consequence,
evaluation of appropriate ethical behavior will have limitations. In all outcomes there are the
following possibilities:
Given the difficulty in exposing true motivation, ethical assessments will inherently be
limited to an evaluation emphasis on action or outcome.
Here are some criteria that can help ensure appropriate ethical considerations are part of the
decisions being made in the organization:
Compliance - Does it conform to the company's values and code of ethics? Does it
meet (or, better, exceed) legal requirements?
Promote good and reduce harm - What solution will be good to the most people while
minimizing any possible harm?
Responsibility - What alternative provides the most responsible response? Does the
solution ensure meeting our duties as a good corporate citizen?
Respects and preserves rights - Does the option negatively impact an individual's or
organization's rights?
Promotes trust - Does the solution lead to honest and open communication? Is it
truthful? Is there full disclosure?
Builds reputation - Would a headline of your decision generate pride or shame? Does
your solution add to or detract from the identity you want for the organization?
Given this understanding that should exist between the company and the individual, a change
to the company's values and code of conduct should be given careful consideration. Changing
the basis for the organization's ethics in decision making, in theory, requires a new agreement
with each individual to reconcile with their personal moral choices. In practice, this change
can lead to conflict as an individual's morals now lead to choices that violate the company's
decision making ethics.
What’s the difference between ERP software and BI software? These are two pieces of
software that are often confused, muddled up and perceived to be something they are not.
What they are however are two very distinct pieces of software that do very different things.
Enterprise Resource Planning (or ERP) software doesn’t really live up to its name. In fact,
does it at all? Planning, it doesn’t do much of that. And resource? Wasn’t that just another
buzzword from yesteryear? What matters most here is the term enterprise, so bear that in
mind. I’m all for renaming it. Answers and ideas on a postcard!
ERP systems are pretty much a way of integrating departments and functions across a
company into one single computer system. Every department manages information and data differently.
What ERP software aims to do is integrate all of this into one single computer system that
will serve everyone’s needs.
When used properly, ERP software can transform any business and generate a huge return,
both financially and operationally.
Let me give you an example. You take a customer order. Typically that order is jotted down
on an order sheet by your sales rep, before being keyed into their sales report and passed to
the next department to process. That one order then begins a long paper-based journey
through every arm of the business before reaching its end point. What is worrying however is
that no one really knows the status of the order until the customer calls in to grumble that it
hasn’t arrived.
ERP software vanquishes the archaic systems being used in HR, finance, manufacturing and
warehouses, replacing them with a single software solution. This means that from the
moment the order is taken to the time it lands with your customer, the entire business knows
where it is and what is happening with it.
What is BI software?
BI software draws heavily upon the information stored in your ERP system, yet has the
primary objective of analyzing this information. With a history dating back to the 1860s, the
very idea of BI software has been thrown around for over 150 years. While not the
impressive bits of kit we know today, BI has its roots firmly set in understanding how being
the first to hear news of impending battles impacted a banker named Sir Henry Furnese. You
can read more about the history of BI software in another of my posts here.
Feeding on the information stored in your ERP software system, a BI solution will take this
data, analyze it and present it back to you in an actionable and easy to interpret format.
Dashboards, reports and key sales information are fed back to the business in a manner that
can actually be used, rather than being locked away within the ERP software, never to see the
light of day.
While ERP software is traditionally a tool used to process orders, generate invoices and store
all your transactional data, it really wasn’t made for analysis unless you have the patience of
a Saint. While of course doable, trying to analyze the information stored in your ERP
software will lead to more Excel spreadsheets, pivot tables and macros than you can shake a
stick at. And if you’re anything like me, opening a hefty Excel document is like staring into
the pits of hell.
And this is where ERP and BI software really differ. BI software is your number cruncher.
Churning out trends, any movements in a customer’s buying patterns and even your best (and
worst!) performing line items, BI does all the legwork for you so you can put ‘Excel for
Dummies’ down.
We speak to so many people who think there’s no real difference between ERP and BI and
that their ERP software can do just what BI software can, but in short, it can’t. They are two
very different things, with different objectives and end games for your business. An ERP will
keep you organized and everyone reading off the same page, where BI software will make
you proactive in your selling and forward thinking in the way you approach your customers.
In comparison, Business Intelligence is not quite as archaic as some ERP software systems,
but is something relatively new to the tech arena, relying wholly on ERP and back office data
to work. You don’t need top spec ERP software with all the bells and whistles, but you do
need one, of sorts.
As you can see, the difference between BI and ERP systems is really rather large, so I hope
this has cleared up a few things.
We would love to hear what you think too. Have you considered BI software? What are your
thoughts on ERP software? Did you assume ERP and BI software were the same? Share your
thoughts in the comments section below.
Unlike many alternatives, the single solution for both reporting and analysis ensures minimal
administration and training overhead. Additionally, ZAP BI provides a number of reporting
components and tools to make the report creation process easy.
Real-time Reports – Gain all the benefits from Reporting Services reports while
providing users with a single reporting environment. Integrate real time reports and
geo maps from Reporting Services into ZAP reports and dashboards, and slice them
by any dimension.
Visualization – Add a chart to a report in a single click, choose from a wide range of
charts, and make charts 3D. See hotspots in the data at a glance with heat maps that
render conditional formatting of data points. Use inline KPI and chart resources in
pivot tables.
True List Reporting – Each member has member properties, which can be dragged
into an analysis. For example, the Customer resource can be dragged on as a row, and
the corresponding ‘member properties’, such as phone numbers and addresses can be
added. Users can also add a measure as another column for the same report.
Cell-level Calculations – Users can create cell-level calculations and formulas that
apply at the column level. Calculations can be created that won’t be affected by filters
on a report.
Downloading – Users can export analyses to several file types, including Excel and
PDF. Charts can also be downloaded as image files.
Drilling – Users can drill-through, drill-up and drill-down to obtain exactly the right
piece of data for a report.
Cell Annotations – Users can annotate cells and have these comments print within
the report pack.
Dynamic Named Sets – If a new item is added, it is added to the Named set and
therefore available to the analytic. For example, users can create a Dynamic range of
account numbers. If Finance created a new account within that range, the new account
would automatically be listed in the report.
Report Types
The following table provides a listing of report types. Additional reporting resources and
tools are also provided.
Analyses are the core tool of ZAP BI because they offer the maximum
flexibility in presenting information. For example, you can use an
analysis to: create a customer list; compare the sales for two products;
create a complex multiple-page report filtered by years; and summarized
by a multi-axis chart.
Analyses also allow you to drill through from the presented data (for
example, to a drill-through report detailing each individual sale making
up the sales total), and to highlight particular cells using conditional
formatting rules to make finding significant discrepancies easy.
By default, analyses are viewed as pivot tables, but they can also be
viewed as charts.
Charts
Charts, like analyses, display data from the Dimension Tree, but do so
graphically. They have filters and slicers like analysis reports. Any
analysis can be turned into a chart, and any chart can display the analysis
(pivot table) on which it is based. Charts can be formatted in many styles
and colors to give a powerful presentation of the data.
Dashboards
A dashboard is a group of resources combined into one unified visual
display. The resources' data can then be dynamically updated and
displayed together. For example, an analysis report, KPI, and a scorecard
might all be displayed in a dashboard. Once added to a dashboard,
resources can be filtered and sliced either as a group or individually.
Dashboards are powerful because they allow a consumer to access a
whole host of different information, grouped together by purpose. For
example, a 'Top 10' Dashboard might display an analysis table of the top
10 selling bicycles, a series of scorecards showing which models are
selling better or worse than expectations, and a group of KPIs showing
which models are selling better or worse than last month.
Drill-through reports show the detailed records from the OLAP cube that
are used to calculate the summary value displayed in a particular cell in
an analysis. Drill-through reports resemble analysis reports, but they
don’t have a Rows placeholder.
Users can navigate their data on the fly to find answers to even the most
complex questions, without having to write new reports. Users can drill
through to multiple cells at once and ZAP BI renders the complete list of
detailed records.
KPI
A KPI, or Key Performance Indicator, is a simple, visual representation
of the performance of a particular aspect of your business. Properly
configured KPIs allow you to dynamically track items such as sales,
returns, or order quantities in a format that is instantly and visually
appealing. KPIs can also be used as building blocks for other visual
reporting tools (for example, scorecards, dashboards and conditional
events).