BI Lab Manual
Dr. D. Y. Patil Educational Federation, Varale
Lab Practice-VI
Final Year Computer Engineering
Semester-VIII
PART-II 410253 : Elective VI : Business Intelligence
Subject Code: 410253(C)
Prepared By:
Mr. Sagar A. Dhanake
(Assistant Professor, Computer Engineering)
Assignment No. 1
Aim : Import legacy data from different sources (such as Excel, SQL Server, Oracle, etc.) and load it into the target system. (You can download a sample database such as AdventureWorks, Northwind, FoodMart, etc.)
Objective : To introduce the concepts and components of Business Intelligence (BI)
Prerequisite :
1. Basics of dataset extensions.
2. Concept of data import.
Theory :
Legacy Data:
Legacy data, according to Business Dictionary, is "information maintained in an old or out-of-date format or computer system that is consequently challenging to access or handle."
Sources of Legacy Data :
Where does legacy data come from? Virtually everywhere. Figure 1 indicates that there are many sources from which you may obtain legacy data. These include existing databases, often relational, although non-relational stores such as hierarchical, network, object, XML, object/relational, and NoSQL databases are also found. Files, such as XML documents or "flat files" like configuration files and comma-delimited text files, are also common sources of legacy data. Software, including legacy applications that have been wrapped (perhaps via CORBA) and legacy services such as web services or CICS transactions, can also provide access to existing information. The point is that gaining access to legacy data often involves far more than simply writing an SQL query against an existing relational database.
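The variety above can be sketched in code. The following is a minimal illustration, not part of the Power BI exercise: it reads the same logical records from a comma-delimited "flat file" and from a relational table. The file contents, table name, and column names are invented for the example; an in-memory SQLite database stands in for the legacy RDB.

```python
import csv
import io
import sqlite3

def extract_from_flat_file(text):
    """Read a comma-delimited legacy export into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def extract_from_relational(conn, table):
    """Read one table of a legacy relational database into a list of dicts."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute("SELECT * FROM " + table).fetchall()
    return [dict(r) for r in rows]

# Flat-file source (comma-delimited text, a typical legacy export)
flat = "cust_id,name\n101,Asha\n102,Ravi\n"

# Relational source (in-memory SQLite stands in for the legacy RDB)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, cust_id INTEGER)")
conn.execute("INSERT INTO orders VALUES (1, 101), (2, 102)")

customers = extract_from_flat_file(flat)
orders = extract_from_relational(conn, "orders")
print(customers[0])   # {'cust_id': '101', 'name': 'Asha'}
print(orders[0])      # {'order_id': 1, 'cust_id': 101}
```

Note that the flat file yields only strings while the database preserves column types; reconciling such differences is exactly the kind of work that tools like Power BI hide behind the Get Data dialog.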
Dr. D. Y. Patil Educational Federation’s
Dr. D. Y. PATIL COLLEGE OF ENGINEERING & INNOVATION
Department of Computer Engineering
A.Y. 2022-23
Step 2: Click on Get Data; the following list will be displayed → select Excel.
Step 3: Select the required file and click on Open; the Navigator screen appears.
Click on OK.
Note: If you just want to see a preview, you can click on the table name without selecting its checkbox. Click on Edit to view the table.
Conclusion: In this way, we imported legacy datasets using the Power BI tool.
Assignment No. 2
Aim : Perform the Extraction, Transformation and Loading (ETL) process to construct the database in SQL Server.
Theory:
Extraction Transformation and Loading (ETL): -
ETL, which stands for extract, transform and load, is a data integration process that combines data
from multiple data sources into a single, consistent data store that is loaded into a data warehouse or
other target system.
As the databases grew in popularity in the 1970s, ETL was introduced as a process for integrating
and loading data for computation and analysis, eventually becoming the primary method to process
data for data warehousing projects.
ETL provides the foundation for data analytics and machine learning work streams. Through a series
of business rules, ETL cleanses and organizes data in a way which addresses specific business
intelligence needs, like monthly reporting, but it can also tackle more advanced analytics, which can
improve back-end processes or end user experiences. ETL is often used by an organization to:
• Extract data from legacy systems
• Cleanse the data to improve data quality and establish consistency
• Load data into a target database
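The three bullets above can be sketched end to end. This is a minimal illustration under assumed data, not the SSIS procedure used later in this assignment: the raw rows, the cleansing rule (drop records missing a key), and the target table name `sales` are all invented for the example, and an in-memory SQLite database stands in for the SQL Server target.

```python
import csv
import io
import sqlite3

# Raw extract from a hypothetical legacy system: inconsistent spacing,
# one record missing its key.
RAW = """id, name , amount
1, Alice ,100
2,  Bob ,250
, Carol ,75
"""

def extract(text):
    """Extract: copy raw rows out of the source into a staging list."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: cleanse the data for quality and consistency."""
    clean = []
    for row in rows:
        row = {k.strip(): v.strip() for k, v in row.items()}
        if not row["id"]:            # business rule: drop rows without a key
            continue
        row["amount"] = float(row["amount"])
        clean.append(row)
    return clean

def load(rows, conn):
    """Load: write the cleansed rows into the target database."""
    conn.execute("CREATE TABLE sales (id TEXT, name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (:id, :name, :amount)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
total = conn.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
print(total)   # (2, 350.0)
```

The record for Carol is dropped during the transform step because it has no key, which is the "cleanse to improve data quality" bullet in miniature.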
Extract
During data extraction, raw data is copied or exported from source locations to a staging area. Data management teams can extract data from a variety of data sources, which can be structured or unstructured. Those sources include, but are not limited to, SQL and NoSQL servers, CRM and ERP systems, flat files, email, and web pages.
ETL solutions improve quality by performing data cleansing prior to loading the data into a different repository. A time-consuming batch operation, ETL is recommended more often for creating smaller target data repositories that require less frequent updating, while other data integration methods, including ELT (extract, load, transform), change data capture (CDC), and data virtualization, are used to integrate increasingly larger volumes of frequently changing data or real-time data streams.
What Is the ETL Process?
To put it simply, the ETL process includes extracting and compiling raw data, transforming it to make it intelligible, and loading it into a target system, such as a database or data warehouse, for easy access and analysis. ETL, short for Extract, Transform, Load, is an important component in the data ecosystem of any modern business. The ETL process is what helps break down data silos and makes data easier to access for decision-makers.
Since data coming from multiple sources has different schemas, every dataset must be transformed differently before it can be used for BI and analytics. For instance, if you are compiling data from source systems like SQL Server and Google Analytics, these two sources will need to be treated individually throughout the ETL process. The importance of this process has increased as big data analysis has become a necessary part of every organization.
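The per-source transformation described above often amounts to mapping each source's field names onto one shared target schema. Below is a small sketch of that idea; the field names stand in for a SQL Server export and a Google Analytics export and are hypothetical, not the actual schemas of those systems.

```python
# Each mapping converts one source's schema into the common
# target schema {key, date, revenue}.
def to_common(record, mapping):
    """Rename a source record's fields into the shared target schema."""
    return {target: record[source] for target, source in mapping.items()}

# Hypothetical rows from the two differently shaped sources
sql_row = {"OrderID": 7, "OrderDate": "2022-01-05", "Revenue": 120.0}
ga_row = {"sessionId": "a9", "date": "2022-01-05", "transactionRevenue": 80.0}

# One mapping per source: target field -> source field
sql_map = {"key": "OrderID", "date": "OrderDate", "revenue": "Revenue"}
ga_map = {"key": "sessionId", "date": "date", "revenue": "transactionRevenue"}

combined = [to_common(sql_row, sql_map), to_common(ga_row, ga_map)]
print(combined[0])   # {'key': 7, 'date': '2022-01-05', 'revenue': 120.0}
```

Once both sources share one schema, the rest of the pipeline (aggregation, loading) no longer needs to know where each record came from.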
ETL tools :-
In the past, organizations wrote their own ETL code. There are now many open source and
commercial ETL tools and cloud services to choose from. Typical capabilities of these products
include the following:
• Comprehensive automation and ease of use: Leading ETL tools automate the entire dataflow,
from data sources to the target data warehouse. Many tools recommend rules for extracting,
transforming and loading the data.
• A visual, drag-and-drop interface: This functionality can be used for specifying rules and data
flows.
• Support for complex data management: This includes assistance with complex calculations, data integrations, and string manipulations.
• Security and compliance: The best ETL tools encrypt data both in motion and at rest and are
certified compliant with industry or government regulations, like HIPAA and GDPR.
Step 4: Click on Add → Select the path of the backup files → Select both files and OK → OK → OK.
Step 5: Select File → New → Project → Business Intelligence → Integration Services Project, give an appropriate Project Name → OK.
Step 6: Right Click on Connection Managers in solution explorer and click on New Connection
Manager.
Add SSIS Connection Manager window appears → Select OLEDB Connection Manager and
Click on Add.
Step 8: Select the Server Name (as per your machine) and the database name and click on Test Connection → OK → OK → The connection is added to the Connection Manager.
Step 9: Drag and Drop Data Flow Task in Control Flow Tab.
Step 10: Drag OLE DB Source from Other Sources and drop into Data Flow tab.
Step 11: Double click on OLE DB Source → the OLE DB Source Editor appears → click on New to add a Connection Manager.
Step 12: Drag OLE DB Destination into the Data Flow tab and connect both.
Step 13: Double click on OLE DB Destination → Click on New next to "Name of the table or the view" to create the destination table → OK.
In Mappings, verify the column mappings → OK.
Conclusion: -
In this way, we performed the Extraction, Transformation and Loading (ETL) process to construct the database in SQL Server.
Assignment No. 3
Aim : Create the cube with suitable dimension and fact tables based on ROLAP, MOLAP and HOLAP
model.
Objective : To introduce the concepts and components of the ROLAP, MOLAP and HOLAP models.
Prerequisite:
1. Basics of dataset extensions.
2. Concept of data import.
Software Requirements : SQL Server 2012 Full version
Theory:
What is OLAP?
OLAP was introduced into the business intelligence (BI) space over 20 years ago, at a time when computer hardware and software technology weren't nearly as powerful as they are today. OLAP introduced a way for business users (typically analysts) to easily perform multidimensional analysis of large volumes of business data.
Aggregating, grouping, and joining data are the most difficult types of queries for a relational database
to process. The magic behind OLAP derives from its ability to pre-calculate and pre-aggregate data. Otherwise, end users would be spending most of their time waiting for query results to be returned by the
database. However, it is also what causes OLAP-based solutions to be extremely rigid and IT-intensive.
What is ROLAP?
ROLAP stands for Relational Online Analytical Processing. ROLAP stores data in columns and rows (also known as relational tables) and retrieves the information on demand through user-submitted queries. A ROLAP database can be accessed through complex SQL queries to calculate information. ROLAP can handle large data volumes, but the larger the data, the slower the processing times.
Because queries are made on demand, ROLAP does not require the storage and pre-computation of information. However, the disadvantage of ROLAP implementations is the potential performance constraint and scalability limitation that result from large and inefficient join operations between large tables. Examples of popular ROLAP products include Metacube by Stanford Technology Group, Red Brick Warehouse by Red Brick Systems, and AXSYS Suite by Information Advantage.
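The on-demand join-and-aggregate behaviour of ROLAP can be shown with a miniature star schema. This sketch uses SQLite through Python as a stand-in for any relational engine; the table and column names echo the Sales_DW example used later but the data is invented.

```python
import sqlite3

# Miniature star schema: one fact table, one dimension table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DimProduct (ProductKey INTEGER PRIMARY KEY, ProductName TEXT);
    CREATE TABLE FactSales (ProductKey INTEGER, Amount REAL);
    INSERT INTO DimProduct VALUES (1, 'Bike'), (2, 'Helmet');
    INSERT INTO FactSales VALUES (1, 100), (1, 250), (2, 40);
""")

# ROLAP style: nothing is precomputed; the aggregate is produced
# on demand by a join + GROUP BY at query time.
rows = conn.execute("""
    SELECT p.ProductName, SUM(f.Amount)
    FROM FactSales f
    JOIN DimProduct p ON p.ProductKey = f.ProductKey
    GROUP BY p.ProductName
    ORDER BY p.ProductName
""").fetchall()
print(rows)   # [('Bike', 350.0), ('Helmet', 40.0)]
```

Every query repeats the join and the scan of the fact table, which is exactly why ROLAP slows down as fact tables grow.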
What is MOLAP?
MOLAP stands for Multidimensional Online Analytical Processing. MOLAP uses a multidimensional cube that accesses stored data through various combinations. Data is pre-computed, pre-summarized, and stored (a difference from ROLAP, where queries are served on demand).
A multicube approach has proved successful in MOLAP products. In this approach, a series of dense,
small, precalculated cubes make up a hypercube. Tools that incorporate MOLAP include Oracle Essbase,
IBM Cognos, and Apache Kylin. Its simple interface makes MOLAP easy to use, even for inexperienced
users. Its speedy data retrieval makes it the best for “slicing and dicing” operations. One major
disadvantage of MOLAP is that it is less scalable than ROLAP, as it can handle a limited amount of data.
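The pre-computation trade-off described above can be sketched with a toy two-dimensional cube. This is an illustration of the idea only, not how any MOLAP product stores its cubes; the fact rows and the "ALL" rollup convention are invented for the example.

```python
from collections import defaultdict

# Fact rows: (product, quarter, amount).
facts = [("Bike", "Q1", 100.0), ("Bike", "Q1", 250.0), ("Helmet", "Q2", 40.0)]

# MOLAP trade-off: spend storage and load time precomputing every
# aggregation level, so later reads are simple lookups.
cube = defaultdict(float)
for product, quarter, amount in facts:
    cube[(product, quarter)] += amount   # base cell
    cube[(product, "ALL")] += amount     # rollup over time
    cube[("ALL", quarter)] += amount     # rollup over product
    cube[("ALL", "ALL")] += amount       # grand total

print(cube[("Bike", "ALL")])   # 350.0
print(cube[("ALL", "ALL")])    # 390.0
```

Reads are now dictionary lookups instead of scans, but the cube's size grows with every extra dimension and rollup level, which mirrors MOLAP's scalability limit.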
What is HOLAP?
HOLAP stands for Hybrid Online Analytical Processing. As the name suggests, the HOLAP storage mode combines attributes of both MOLAP and ROLAP. Since HOLAP involves storing part of your data in a ROLAP store and another part in a MOLAP store, developers get the benefits of both.
With this use of the two OLAPs, the data is stored in both multidimensional databases and relational
databases. The decision to access one of the databases depends on which is most appropriate for the
requested processing application or type. This setup allows much more flexibility for handling data. For
theoretical processing, the data is stored in a multidimensional database. For heavy processing, the data
is stored in a relational database.
Microsoft Analysis Services and SAP AG BI Accelerator are products that run off HOLAP.
Step 3: Click on Execute or press F5, selecting the queries one by one → After execution completes, save and close SQL Server Management Studio and reopen it to see Sales_DW in the Databases tab.
Step 4: Go to SQL Server Data Tools → Right click and Run as administrator.
Step 5: Right click on Data Sources in Solution Explorer → New Data Source.
Step 7: Select the Server Name (as per your machine) and the database name and click on Test Connection → OK → OK → The connection is added to the Connection Manager.
Step 8: Click Next in the Data Source Wizard → Select Inherit → Next → Finish → Sales_DW.ds is created under Data Sources in Solution Explorer.
Step 9: In Solution Explorer, right click on Data Source Views → Select New Data Source View → Next → Next.
Step 10: Select FactProductSales (dbo) from Available Objects and move it to Included Objects by clicking on >.
Step 11: Sales_DW.dsv appears under Data Source Views in Solution Explorer.
Step 12: Right click on Cubes → New Cube → Next → Select Use existing tables in Select
Creation Method → Next
Step 13: In Select Measure Group Tables → Select FactProductSales → Click Next
Step 18: Drag and drop Product Name from the table in the Data Source View and add it to the Attributes pane on the left side.
Double click on the Dim Date dimension → Drag and drop fields from the table shown in the Data Source View to Attributes → Drag and drop attributes from the leftmost Attributes pane to the middle Hierarchy pane.
Drag fields in sequence from Attributes to the Hierarchy window (Year, Quarter Name, Month Name, Week of the Month, Full Date UK).
Right click on the project name → Properties → Make the following changes and click on Apply & OK.
Step 22: To process the cube, right click on Sales_DW.cube → Process → Click Run.
Conclusion: In this way, we created a cube with dimension and fact tables based on the ROLAP, MOLAP and HOLAP models.
Assignment No. 4
Aim : Import the data warehouse data in Microsoft Excel and create the Pivot table and Pivot
Chart.
Objective : To introduce the concepts and components of data warehouse data in Microsoft Excel and create the Pivot Table and Pivot Chart.
Prerequisite:
1. Basics of dataset extensions.
2. Concept of data import.
Software Requirements : SQL Server 2012 Full version, Excel
Go to the Data tab → Get Data → Legacy Wizards → From Data Connection Wizard.
Step 4: In Select Database and Table → Select Sales_DW → Check all dimensions and "Import relationships between selected tables" → Next.
Step 5: In Save Data Connection File, browse the path and click on Finish.
Step 7: In Fields, put SalesDateKey in Filters, FullDateUK in Axis and Sum of ProductActualCost in Values.
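What the PivotTable in Step 7 computes can be expressed in a few lines of code: filter on SalesDateKey, group rows by FullDateUK (the axis), and sum ProductActualCost (the values). The sketch below is an illustration only; the field names follow the Sales_DW example, but the sample rows are invented.

```python
# Hypothetical fact rows standing in for the Sales_DW data.
rows = [
    {"SalesDateKey": 20220105, "FullDateUK": "05-01-2022", "ProductActualCost": 110.0},
    {"SalesDateKey": 20220105, "FullDateUK": "05-01-2022", "ProductActualCost": 42.5},
    {"SalesDateKey": 20220106, "FullDateUK": "06-01-2022", "ProductActualCost": 75.0},
]

def pivot(rows, date_key=None):
    """Group by FullDateUK and sum ProductActualCost,
    optionally filtering on SalesDateKey first."""
    out = {}
    for r in rows:
        if date_key is not None and r["SalesDateKey"] != date_key:
            continue
        out[r["FullDateUK"]] = out.get(r["FullDateUK"], 0.0) + r["ProductActualCost"]
    return out

print(pivot(rows))                      # {'05-01-2022': 152.5, '06-01-2022': 75.0}
print(pivot(rows, date_key=20220105))   # {'05-01-2022': 152.5}
```

Excel performs the same group-and-sum internally; the Filters, Axis, and Values boxes correspond to the filter condition, the grouping key, and the aggregated column respectively.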
Step 9: Choose From External Data Source → Choose Connection to select the existing connection with Sales_DW and click on Open → OK.
Conclusion: - In this way, we imported data warehouse data into Microsoft Excel and created the Pivot Table and Pivot Chart.
Assignment No. 5
Aim : Perform the data classification using classification algorithm. Or Perform the data
clustering using clustering algorithm.
Prerequisite:
1. Basics of dataset extensions.
2. Concept of data import.
Software Requirements : R Studio
Theory : A time series is a series of data points in which each data point is associated with a timestamp. A simple example is the price of a stock in the stock market at different points of time on a given day. Another example is the amount of rainfall in a region in different months of the year. The R language provides many functions to create, manipulate and plot time series data. The data for a time series is stored in an R object called a time-series object. It is also an R data object, like a vector or data frame.
Syntax : The basic syntax of the ts() function in time series analysis is:
timeseries.object.name <- ts(data, start, end, frequency)
• data is a vector or matrix containing the values used in the time series.
• start specifies the start time for the first observation in the time series.
• end specifies the end time for the last observation in the time series.
• frequency specifies the number of observations per unit time.
Consider the monthly rainfall details at a place starting from January 2012. We create an R time series object for a period of 12 months and plot it.
Code to run in R :
# Monthly rainfall values (in mm), January to December 2012
rainfall <- c(799, 1174.8, 865.1, 1334.6, 635.4, 918.5, 685.5, 998.6, 784.2, 985, 882.8, 1071)
# Convert the vector to a time series object starting January 2012,
# with 12 observations per year
rainfall.timeseries <- ts(rainfall, start = c(2012, 1), frequency = 12)
print(rainfall.timeseries)
# Save the chart to a file
png(file = "rainfall.png")
plot(rainfall.timeseries)
dev.off()
Output :
When we execute the above code, it produces the following result and chart –
        Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
2012  799.0 1174.8  865.1 1334.6  635.4  918.5  685.5  998.6  784.2  985.0  882.8 1071.0
Output : plot of the rainfall time series (saved as rainfall.png).
Conclusion: - In this way, we performed data classification using a classification algorithm.