CTS INTERNSHIP REPORT - Mohak
SUMMARY REPORT
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
Submitted by
MOHAK RASTOGI (17SCSE101366)
CERTIFICATE
I hereby certify that the work being presented in the internship project report entitled
“Cognizant BISQUAD-AIA Project Report”, in partial fulfillment of the requirements for the
award of the degree of Bachelor of Technology in the School of Computing Science and
Engineering of Galgotias University, Greater Noida, is an authentic record of my own work
carried out in the industry.
To the best of my knowledge, the matter embodied in the project report has not been submitted
to any other University/Institute for the award of any Degree.
TABLE OF CONTENTS
1 Abstract
2 Introduction
3 Target Specification
4 Methodology
5 Tools Required
6 Result Analysis
7 Conclusion
8 Technical References
ABSTRACT
BI informs business priorities, goals, and directions by tracking and publishing predefined key
performance indicators in the form of dashboards, drill-up/drill-down cubes, and reports for every
aspect of business operations.
As part of the BI process, organizations collect data from internal IT systems and external
sources, prepare it for analysis, run queries against the data and create data visualizations, BI
dashboards and reports to make the analytics results available to business users for operational
decision-making and strategic planning.
The ultimate goal of BI initiatives is to drive better business decisions that enable organizations
to increase revenue, improve operational efficiency and gain competitive advantages over
business rivals. To achieve that goal, BI incorporates a combination of analytics, data
management and reporting tools, plus various methodologies for managing and analyzing data.
Introduction
BI (business intelligence) encompasses the technologies and analytical processes that examine
data and present actionable information through, for example, reports, predictive analytics, data
and text mining, and business performance management, helping business leaders make
better-informed decisions. Enterprises use business intelligence to make a wide variety of
strategic and operational business decisions.
Objective of the work
On completion of the training, one should be able to:
● Describe what a database is
● Implement Structured Query Language (SQL)
● Implement queries using DDL, DML, and DCL
● Implement queries applying operators, functions, and clauses
● Implement queries using SQL joins, subqueries, and clauses
● Define the operational system and the data warehouse
● Describe the data warehouse and the data mart
● Describe the operational data store
● Describe the enterprise data warehouse (EDW)
● Describe the extract, transform and load (ETL) process
● Explain the ETL process to load the operational data store
● Explain the ETL process to load the data warehouse
● Explain the ETL process to load the data mart
● Explain advanced ETL practices
● Describe the operating system
● Explain the file system
● Demonstrate the editors
● Describe the architecture of Informatica PowerCenter and its uses
● List the Informatica PowerCenter components and objects
● Describe the core administrative tasks and configure the Informatica Administration tool
● Explain the creation and configuration of the Repository and Integration services
● Describe the client tool Repository Manager
● Demonstrate the creation of folders and access management
● Describe the client tool Designer
● List the Informatica Designer tools
● Demonstrate the creation of source definitions, target definitions, and mappings
● Describe the client tool Workflow Manager
● List the Workflow Manager tools
● Demonstrate the creation of sessions, workflows, and other tasks
● Demonstrate scheduling in Informatica PowerCenter
● Describe the client tool Workflow Monitor
● Demonstrate the monitoring of workflows and tasks
● Demonstrate the deployment of Informatica objects
● Explain Informatica PowerCenter performance bottlenecks
● Demonstrate best practices for the usage of Informatica PowerCenter objects
● Explain the usage of PowerExchange connectors for the cloud and CDC
● Apply Python scripts for various cloud-specific development tasks
● Explain various Python libraries
● Describe the basic concepts of Big Data, Hadoop, HDFS, MapReduce, Sqoop, Pig, Hive, and HBase
● Perform all related operations on HBase tables
● Implement datatypes, closures, traits, exception handling, collections, generics, and various functions
● Perform operations on RDDs using transformations and actions
● Implement the transformation or action needed for different problem statements
● Articulate when and where the appropriate memory levels are used
● Execute and debug Spark programs on different execution modes, assigning parameters such as memory allocation and executors
● Work with streaming data in Spark applications, reading from sources such as sockets, Kafka, Flume, and files
● Perform different operations on DStreams and persist results into HBase
● Create and operate on DataFrames using Spark SQL, loading data from sources such as JDBC, CSV, JSON, XML, and plain text
● Identify different ETL patterns in Spark, such as the Lambda architecture
● Define a test strategy and test plan and explain their importance
● Design test scenarios (both positive and negative) from the requirements gathered
● Design test cases from test scenarios
● Define a defect
● Explain the defect lifecycle workflow
Target Specifications
Business intelligence is an umbrella term that covers the processes and methods of collecting,
storing, and analyzing data from business operations or activities to optimize performance. All of
these come together to create a comprehensive view of a business and help people make better,
actionable decisions.
Over the past few years, business intelligence has evolved to include more processes and
activities to help improve performance. These processes include:
● Data mining: using databases, statistics, and machine learning to uncover trends in large datasets.
● Reporting: sharing data analysis with stakeholders so they can draw conclusions and make decisions.
● Performance metrics and benchmarking: comparing current performance data to historical data to track performance against goals, typically using customized dashboards.
● Descriptive analytics: using preliminary data analysis to find out what happened.
● Querying: asking the data specific questions, with BI tools pulling the answers from the datasets.
● Statistical analysis: taking the results from descriptive analytics and exploring the data further with statistics, such as how a trend happened and why.
● Data visualization: turning data analysis into visual representations such as charts, graphs, and histograms to make the data easier to consume.
● Visual analysis: exploring data through visual storytelling to communicate insights on the fly and stay in the flow of analysis.
● Data preparation: compiling multiple data sources, identifying the dimensions and measurements, and preparing the data for analysis.
Functional partitioning of project
1. Database Design
2. Data Warehouse Basics
3. ETL Concepts
4. Data Warehouse Testing
5. Informatica Power Center
6. Python
7. Big Data and Hadoop
8. Big Data - Hbase
9. Scala
10. Spark
11. Testing
Methodology
What is Data?
Data can be facts related to any object in consideration. For example, your name, age, height,
weight, etc. are some data related to you. A picture, image, file, pdf, etc. can also be considered
data.
What is a Database?
A database is a systematic collection of data. Databases support the electronic storage and
manipulation of data and make data management easy.
Consider two examples: an online telephone directory uses a database to store people's names,
phone numbers, and other contact details, and an electricity service provider uses a database to
manage billing, client-related issues, fault data, and so on.
Types of Databases
Distributed databases:
A distributed database is a database in which the data is not stored in one place: portions of a
common database, along with information captured by local computers, are distributed across
various sites or organizations.
Relational databases:
This type of database defines relationships among data in the form of tables. It is also called a
relational DBMS (RDBMS), the most popular DBMS type in the market. Examples of RDBMS
products include MySQL, Oracle, and Microsoft SQL Server.
Object-oriented databases:
This type of database supports the storage of all data types. The data is stored in the form of
objects, and the objects held in the database have attributes and methods that define what to do
with the data. PostgreSQL is an example of an object-relational DBMS.
Centralized databases:
A centralized database is stored at a single location, and users from different backgrounds can
access it. This type of database stores application procedures that help users access the data even
from a remote location.
Open-source databases:
This kind of database stores information related to operations and is mainly used in fields such
as marketing, employee relations, and customer service.
Cloud databases:
A cloud database is a database optimized or built for a virtualized environment. Cloud databases
have many advantages, such as paying only for the storage capacity and bandwidth consumed,
on-demand scalability, and high availability.
Data warehouses:
A data warehouse facilitates a single version of the truth for a company's decision-making and
forecasting. It is an information system that contains historical and cumulative data from single
or multiple sources, and it simplifies the reporting and analysis processes of the organization.
What is a Database Management System (DBMS)?
A Database Management System (DBMS) is a collection of programs that enables its users to
access databases, manipulate data, and report and represent data. It also helps control access to
the database. Database management systems are not a new concept; they were first implemented
in the 1960s.
Charles Bachman's Integrated Data Store (IDS) is said to be the first DBMS in history. Over
time, database technologies evolved considerably, while the usage and expected functionality of
databases increased immensely.
MySQL Database
MySQL is a popular relational database management system that can be used for anything from
small business applications to large enterprise applications.
● Supports large databases: MySQL works with large databases. The default file size limit for a table is 4 GB, which can be increased, depending on the operating system, to hold 50 million rows or more in a table.
● Multi-layered design: the MySQL server has a multi-layered design with independent modules. Because it is fully multithreaded using kernel threads, it can use multiple CPUs if they are available.
● Client/server environment: the MySQL server works in embedded or client/server systems.
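The SQL concepts covered above (DDL, DML, and joins) can be sketched with Python's built-in sqlite3 module standing in for a MySQL server; the table and column names here are invented for illustration:

```python
import sqlite3

# In-memory database stands in for a MySQL server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the schema
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")

# DML: insert rows
cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Asha"), (2, "Ravi")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 1, 250.0), (2, 1, 100.0), (3, 2, 75.0)])

# Join with an aggregate function and clauses: total order amount per customer
cur.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""")
totals = cur.fetchall()
conn.close()
```

The same statements, minus sqlite3-specific details, run against any RDBMS mentioned above.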
The Data Warehouse
The Data Warehouse is a collection of data in support of management decision processes, which
is:
● Subject oriented
● Integrated
● Time variant
● Non-volatile
A data warehouse is a relational database designed for query and analysis. It usually contains
historical data derived from transaction data and other sources. Typical business applications
include:
● Risk management
● Financial analysis
● Marketing programs
● Profit trends
● Procurement analysis
● Inventory analysis
● Statistical analysis
● Claims analysis
● Manufacturing optimization
● Customer relationship management
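The ETL process that feeds such a warehouse can be sketched in a few lines of Python; the source rows, field names, and aggregation here are invented for illustration:

```python
# Minimal extract-transform-load sketch over in-memory data.

def extract():
    # Extract: pull raw rows from an operational source (hard-coded here)
    return [
        {"customer": "Asha", "amount": "250.0", "region": "north"},
        {"customer": "Ravi", "amount": "75.0", "region": "south"},
        {"customer": "Asha", "amount": "100.0", "region": "north"},
    ]

def transform(rows):
    # Transform: cast types and aggregate by region, the kind of
    # subject-oriented summary a warehouse typically stores
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + float(row["amount"])
    return totals

def load(totals, warehouse):
    # Load: write the summarized facts into the target store
    warehouse.update(totals)

warehouse = {}
load(transform(extract()), warehouse)
```

A production ETL tool such as Informatica PowerCenter performs these same three phases, with connectivity, scheduling, and recovery handled by the platform.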
What is Informatica?
Informatica is a software development company that provides a complete data integration
solution and data management system, and it has launched multiple products focused mainly on
data integration.
Informatica is also the name of its data integration tool. The tool is based on ETL architecture
and provides data integration software and services for different industries, businesses, and
government organizations, including telecommunications, health care, insurance, and financial
services. It has the notable ability to connect to, process, and fetch data from many different
types of heterogeneous sources.
Informatica Architecture
The Informatica architecture is a service-oriented architecture (SOA), defined as a group of
services that communicate with each other. The communication can be a simple data transfer, or
two or more services coordinating the same activity.
Repository Service: It is responsible for maintaining Informatica metadata and provides access to
the same to other services.
Integration Service: This service helps in the movement of data from sources to the targets.
Reporting Service: This service generates the reports.
Nodes: These are the computing platforms that execute the above services.
Informatica Designer: It creates the mappings between sources and targets.
Workflow Manager: It is used to create workflows and other tasks and to manage their execution.
Workflow Monitor: It is used to monitor the execution of workflows.
Repository Manager: It is used to manage the objects in the repository.
Informatica PowerCenter
Informatica PowerCenter is an ETL tool used to extract, transform, and load enterprise data from
its sources. Enterprise data warehouses can be built with the help of Informatica PowerCenter,
which is a product of Informatica Corp.
Informatica PowerCenter extracts data from its source, transforms this data according to
requirements, and loads this data into a target data warehouse.
The main components of Informatica PowerCenter are its client tools, server, repository, and
repository server. The PowerCenter server and repository server together make up the ETL layer,
which completes the ETL processing.
Typical use cases of Informatica PowerCenter include:
● B2B exchange.
● Data governance.
● Data migration.
● Data warehousing.
● Data synchronization and replication.
● Integration Competency Centers (ICC).
● Master Data Management (MDM).
● Service-oriented architectures (SOA) and many more.
Informatica Transformations
Informatica transformations are repository objects that can create, read, modify, or pass data to
defined target structures such as tables, files, or other targets.
In Informatica, the purpose of transformation is to modify the source data according to the
requirement of the target system. It also ensures the quality of the data being loaded into the
target.
A Transformation is used to represent a set of rules, which define the data flow and how the data
is loaded into the targets.
Classification of Transformation
Transformations are classified into two categories: one based on connectivity, and the other
based on whether the number of rows changes. First, consider the transformations based on
connectivity.
1. There are two types of transformation based on connectivity:
● Connected transformations
● Unconnected transformations
In Informatica, transformations that are connected to other transformations within a mapping are
called connected transformations.
Transformations that are not linked to any other transformation are called unconnected
transformations.
2. There are two types of transformation based on whether the number of rows changes:
● Active Transformations
● Passive Transformations
Active transformations are those that can modify both the data rows and the number of input
rows passed through them. For example, if a transformation receives 10 rows as input and
returns 15 rows as output, it is an active transformation. In an active transformation, the data is
modified in the rows.
Passive transformations do not change the number of input rows. In a passive transformation,
the number of input and output rows remains the same, and data is modified at the row level only.
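The active/passive distinction can be illustrated with plain Python list operations (a toy stand-in, not the Informatica API):

```python
# Three input rows with invented field names.
rows = [{"amount": 50}, {"amount": 150}, {"amount": 300}]

# Active: a filter can change the row count (3 rows in, 2 rows out),
# like Informatica's Filter or Aggregator transformations.
active_out = [r for r in rows if r["amount"] > 100]

# Passive: one output row per input row; only the values change,
# like an Expression transformation applying a 10% markup.
passive_out = [{"amount": r["amount"] * 1.1} for r in rows]
```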
Mapping in Informatica
A mapping is a collection of source and target objects tied together through a set of
transformations. These transformations are built from a set of rules that define how the data
flows and how it is loaded into the targets.
Mapping in Informatica includes the following set of objects, such as:
● Source definition: The source definition defines the structure and characteristics of the
source, such as basic data types, type of the data source, and more.
● Transformation: It defines how the source data is changed, and various functions can be
applied during this process.
● Target Definition: The target definition defines where the data will be loaded finally.
● Links: Links connect the source definitions, transformations, and target tables, and they
show the flow of data between source and target.
What is Big Data?
Big Data is a collection of data that is huge in volume and growing exponentially with time. Its
size and complexity are so great that no traditional data management tool can store or process it
efficiently.
Types Of Big Data
1. Structured: any data that can be stored, accessed, and processed in a fixed format is
termed structured data.
2. Unstructured: any data with an unknown form or structure is classified as unstructured
data. In addition to its sheer size, unstructured data poses multiple challenges when processing it
to derive value from it.
3. Semi-structured: semi-structured data can contain both forms of data. It may look
structured but is not actually defined by, for example, a table definition in a relational DBMS.
Characteristics of Big Data
● Volume: the name Big Data itself is related to an enormous size. The size of data plays a crucial role in determining its value, so volume is one characteristic that must be considered when dealing with Big Data.
● Variety: variety refers to heterogeneous sources and the nature of the data, both structured and unstructured. In earlier days, spreadsheets and databases were the only data sources considered by most applications.
● Velocity: velocity refers to the speed of data generation. How fast data is generated and processed to meet demand determines its real potential.
● Variability: variability refers to the inconsistency that data can show at times, hampering the process of handling and managing the data effectively.
WHAT IS HADOOP?
Hadoop is a high-performance distributed data storage and processing system. Its two major
subsystems are:
● HDFS for storage
● MapReduce for parallel data processing
Hadoop can store any kind of data from any source, inexpensively and at very large scale, and it
can perform very sophisticated analysis of that data easily and quickly. It automatically detects
and recovers from hardware, software, and system failures, and it provides scalable, reliable, and
fault-tolerant services for data storage and analysis at very low cost.
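The MapReduce model Hadoop uses can be sketched in pure Python; a real Hadoop job distributes the same map, shuffle, and reduce phases across a cluster:

```python
from collections import defaultdict

# Word count, the classic MapReduce example, simulated locally.

def map_phase(lines):
    # Map: emit (word, 1) pairs for every word in every line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big ideas", "data storage"]
counts = reduce_phase(shuffle(map_phase(lines)))
```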
WHAT IS HADOOP USED FOR?
● Searching / text mining
● Log processing
● Recommendation systems
● Business intelligence / data warehousing
● Video and image analysis
● Archiving
● Graph creation and analysis
● Pattern recognition
● Risk assessment
● Sentiment analysis
WHAT IS HBASE?
HBase is a column-oriented, multi-dimensional, highly available, high-performance,
non-relational, distributed database. It runs on top of HDFS and is well suited to sparse data sets,
which are common in many big data use cases.
An HBase system comprises a set of tables. Each table contains rows and columns, much like a
traditional database.
Apache HBase scales linearly to handle huge data sets with billions of rows and millions of
columns, and it easily combines data sources that use a wide variety of different structures and
schemas. It provides a fault-tolerant way of storing large quantities of sparse data.
HBase is not a direct replacement for a classic SQL database.
WHAT IS HIVE?
Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data
summarization, query, and analysis. It supports analysis of large datasets stored in Hadoop's
HDFS and compatible file systems. All the data types in Hive are classified into four types,
given as follows:
Column Type
Literals
Null Values
Complex Types
Hive provides indexing for acceleration and has built-in user-defined functions (UDFs) to
manipulate dates, strings, and other data-mining tools. Hive supports extending the UDF set to
handle use cases not supported by built-in functions, and it supports SQL-like queries (HiveQL).
PIG
Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for
expressing data analysis programs, coupled with infrastructure for evaluating those programs.
Pig is made up of two pieces:
● The language used to express data flows, called Pig Latin.
● The execution environment to run Pig Latin programs.
Pig Latin is a data flow language: it allows users to describe how data from one or more inputs
should be read, processed, and then stored to one or more outputs in parallel.
SQOOP
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop
and structured datastores such as relational databases.
Sqoop can connect to databases such as Oracle, MySQL, and Teradata. It uses JDBC to connect
to them, so a JDBC driver for each database is required. Sqoop uses MapReduce to import and
export the data, which provides parallel operation as well as fault tolerance.
Sqoop can also import the result set of an arbitrary SQL query: instead of using the --table,
--columns, and --where arguments, you can specify a SQL statement with the --query argument.
What is Scala?
Scala, short for Scalable Language, is a modern multi-paradigm programming language designed
to express common programming patterns in a concise, elegant, and type-safe way.
● Scala smoothly integrates the features of object-oriented and functional languages.
● Scala programming = object-oriented programming + functional programming. From the
functional programming perspective, each function in Scala is a value; from the object-oriented
perspective, each value in Scala is an object.
● Scala can be found in use at some of the best tech companies, such as LinkedIn, Twitter,
and Foursquare.
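The "each function is a value" idea can be sketched by analogy with Python's first-class functions, keeping this report's examples in one language (an illustration of the concept, not Scala syntax):

```python
# Functions as values: stored, returned, and closed over.

def make_adder(n):
    # Returns a closure that captures n, like a Scala function value
    def add(x):
        return x + n
    return add

add_five = make_adder(5)
result = add_five(10)

# Higher-order usage: pass a function value to map, as in Scala collections
doubled = list(map(lambda x: x * 2, [1, 2, 3]))
```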
What Is Spark?
Apache Spark is a lightning-fast cluster computing technology designed for fast computation.
Spark is not a modified version of Hadoop: it builds on the Hadoop MapReduce model and
extends it to efficiently support more types of computation, including interactive queries and
stream processing. The main feature of Spark is its in-memory cluster computing, which
increases the processing speed of an application. Spark covers a wide range of workloads,
including streaming, with no need for separate tools.
Why Spark?
The Hadoop framework is based on a simple programming model (MapReduce), and the main
concern it leaves open is maintaining speed when processing large datasets, in terms of waiting
time between queries and waiting time to run a program. Spark addresses this:
● Speed: Spark helps run an application in a Hadoop cluster up to 100 times faster in
memory and 10 times faster on disk. This is possible by reducing the number of read/write
operations to disk; intermediate processing data is stored in memory.
● Supports multiple languages: Spark provides built-in APIs in Java, Scala, and Python, so
you can write applications in different languages.
● Advanced analytics: Spark supports SQL queries, streaming data, machine learning (ML),
and graph algorithms.
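Spark's split between lazy transformations and eager actions can be illustrated with a toy class; this is a deliberately simplified stand-in, not the PySpark API:

```python
# Transformations (map, filter) only record a plan; the action
# (collect) executes it, mirroring Spark's lazy-evaluation model.

class ToyRDD:
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []          # deferred transformations

    def map(self, fn):                # transformation: lazy
        return ToyRDD(self.data, self.ops + [("map", fn)])

    def filter(self, pred):           # transformation: lazy
        return ToyRDD(self.data, self.ops + [("filter", pred)])

    def collect(self):                # action: executes the recorded plan
        items = self.data
        for kind, fn in self.ops:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

rdd = ToyRDD([1, 2, 3, 4, 5])
pipeline = rdd.map(lambda x: x * x).filter(lambda x: x > 5)
result = pipeline.collect()   # only now does any work happen
```

In real Spark, the same laziness lets the engine optimize the whole plan and keep intermediate data in memory across a cluster.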
SOFTWARE TESTING
Software testing is a method to check whether the actual software product matches the expected
requirements and to ensure that the product is defect free. It involves executing software or
system components using manual or automated tools to evaluate one or more properties of
interest. The purpose of software testing is to identify errors, gaps, or missing requirements
relative to the actual requirements.
● System Testing: A level of the software testing process where a complete, integrated
system/software is tested. The purpose of this test is to evaluate the system’s compliance with the
specified requirements.
● Acceptance Testing: A level of the software testing process where a system is tested for
acceptability. The purpose of this test is to evaluate the system’s compliance with the business
requirements and assess whether it is acceptable for delivery.
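Positive and negative test cases derived from a requirement can be sketched as plain assertions; the discount function and its rule are invented for illustration:

```python
# Hypothetical requirement: "discount() applies 10% off orders of 100
# or more; negative amounts are invalid."

def discount(amount):
    if amount < 0:
        raise ValueError("amount must be non-negative")
    return round(amount * 0.9, 2) if amount >= 100 else amount

# Positive test cases: valid inputs produce the expected behavior
assert discount(200) == 180.0   # at/above threshold: discounted
assert discount(50) == 50       # below threshold: unchanged

# Negative test case: invalid input must be rejected
try:
    discount(-1)
    rejected = False
except ValueError:
    rejected = True
assert rejected
```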
Tools required
Results Analysis
Overall, the key benefits that businesses can get from BI applications include the ability to:
● speed up and improve decision-making;
● optimize internal business processes;
● increase operational efficiency and productivity;
● spot business problems that need to be addressed;
● identify emerging business and market trends;
● develop stronger business strategies;
● drive higher sales and new revenues; and
● gain a competitive edge over rival companies.
BI initiatives also provide narrower business benefits, among them making it easier for project
managers to track the status of business projects and for organizations to gather competitive
intelligence on their rivals. In addition, BI, data management, and IT teams themselves benefit
from business intelligence, using it to analyze various aspects of technology and analytics
operations.
Conclusions
BI platforms are increasingly being used as front-end interfaces for big data systems that contain
a combination of structured, unstructured and semi-structured data. Modern BI software typically
offers flexible connectivity options, enabling it to connect to a range of
data sources. This, along with the relatively simple user interface (UI) in most BI tools, makes it
a good fit for big data architectures.
Users of BI tools can access Hadoop and Spark systems, NoSQL databases and other big data
platforms, in addition to conventional data warehouses, and get a unified view of the diverse data
stored in them. That enables a broad number of potential users to get involved in analyzing sets
of big data, instead of highly skilled data scientists being the only ones with visibility into the
data.
Technical References