
DATA

VISUALISATION, ETL,
DATA ACQUISITION
Dr. Firoz Anwar
CONTENTS
 Introduction
 Understanding ETL
 Available Tools
 Understanding Data Visualisation
ETL
 ETL stands for Extract, Transform and Load.
 A generic process in which data is first acquired, then transformed or processed,
and finally loaded into a data warehouse, database, or other target such as a
PDF or Excel file.
 Data can be extracted from a variety of sources, such as files, RDBMS/NoSQL
databases, websites, or real-time user activity.
 Transformed data is loaded into a data warehouse for business uses such as
reporting or analytics.
WHY ETL:
 Visualising the entire data flow pipeline helps businesses make critical
decisions.
 Transactional databases cannot answer complex business questions that can be answered by
ETL.
 ETL provides a method of moving the data from various sources into a data warehouse.
 As data sources change, the Data Warehouse will automatically update.
WHY ETL:
 The ETL process can perform complex transformations, but requires a staging area to store
the data.
 ETL helps to migrate data into a data warehouse, converting it across formats and types to
adhere to one consistent system.
 ETL is a predefined process for accessing and manipulating source data into the target
database.
 ETL offers deep historical context for the business.
ETL PROCESS
 ETL is a three-step process:
 Extracting/acquiring data from single or multiple data sources.
 Transforming data as per business logic. Transformation is itself a two-step
process: data cleansing and data manipulation.
 Loading transformed data into the target data store or data warehouse.
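The three steps above can be sketched end-to-end in Python. This is a minimal illustration using only the standard library, with an in-memory SQLite database standing in for the data warehouse and a hard-coded list of rows standing in for the source data; all names and values are made up for the example.

```python
import sqlite3

# Extract: in a real pipeline this would read from files, APIs, or a
# source database; here a list of dicts stands in for the raw rows.
raw_rows = [
    {"name": " Alice ", "temp_c": "21.5"},
    {"name": "Bob", "temp_c": None},        # incomplete row -> dropped
    {"name": "Carol", "temp_c": "19.0"},
]

# Transform: cleansing (trim names, drop incomplete rows) followed by
# manipulation (type conversion, derived Fahrenheit column).
clean = []
for row in raw_rows:
    if row["temp_c"] is None:
        continue
    c = float(row["temp_c"])
    clean.append((row["name"].strip(), c, c * 9 / 5 + 32))

# Load: write the transformed rows into the target table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (name TEXT, temp_c REAL, temp_f REAL)")
db.executemany("INSERT INTO readings VALUES (?, ?, ?)", clean)
# The warehouse now holds only the two clean, typed rows.
```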
POPULAR SOFTWARE
 ArcGIS by Esri
 QGIS (Quantum GIS)
 ENVI by Harris Geospatial Solutions
 Global Mapper by Blue Marble Geographics
 ERDAS IMAGINE by Hexagon Geospatial
 Trimble TerraSync
 Leica Infinity
 GeoMedia by Hexagon Geospatial
 OpenDroneMap
 Google Earth Engine
PROGRAMMING LANGUAGES
AND LIBRARIES
 Python: Libraries such as GDAL, Fiona, Shapely, and GeoPandas provide powerful
tools for working with geospatial data formats, performing spatial analysis, and
creating custom data processing workflows.

 R: Packages such as sf, raster, rgdal, and leaflet enable users to import, manipulate,
and visualize geospatial data, as well as perform advanced spatial analysis and
modeling.

 JavaScript (with libraries like Leaflet and Mapbox): Libraries such as Leaflet and
Mapbox provide tools for creating interactive maps, overlaying geospatial data layers,
and implementing custom spatial analysis workflows.
PROGRAMMING LANGUAGES
AND LIBRARIES
 Java (with libraries like GeoTools): Libraries such as GeoTools provide
comprehensive geospatial data processing capabilities, including support for various
data formats, spatial operations, and visualization.

 MATLAB: Toolboxes such as Mapping Toolbox and Image Processing Toolbox
provide functions for importing, processing, and visualizing geospatial data, as well as
performing spatial analysis and modeling.
DATA ACQUISITION
 Sampling and Aliasing
 Sampling rate determines how frequently data is collected from sensors.
 Nyquist theorem states that the sampling rate should be at least twice the highest frequency
component of the signal to avoid aliasing.
 Anti-aliasing filters are used to remove high-frequency components before sampling.

Lyons, R. (2011). "Understanding Digital Signal Processing." Pearson.
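The aliasing effect described by the Nyquist theorem can be demonstrated numerically: a 9 Hz tone sampled at only 10 Hz (well below its Nyquist rate of 18 Hz) produces exactly the same samples as a 1 Hz tone. The frequencies here are illustrative.

```python
import math

fs = 10.0                # sampling rate (Hz)
f_signal = 9.0           # tone above Nyquist (fs/2 = 5 Hz), so it aliases
f_alias = fs - f_signal  # aliased frequency: 1 Hz

# Sample both tones at the same instants t = k / fs.
n = range(20)
hi = [math.sin(2 * math.pi * f_signal * k / fs) for k in n]
lo = [-math.sin(2 * math.pi * f_alias * k / fs) for k in n]

# After sampling, the 9 Hz tone is indistinguishable from a 1 Hz tone,
# which is why an anti-aliasing filter must remove it before sampling.
assert all(abs(a - b) < 1e-9 for a, b in zip(hi, lo))
```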


DATA ACQUISITION
 Noise and Filtering
 Noise in sensor data can arise from various sources, including environmental interference
and electronic components.
 Filtering techniques such as low-pass, high-pass, and band-pass filters are used to reduce
noise and extract relevant information.
 Adaptive filtering methods can dynamically adjust filter parameters based on the
characteristics of the signal.

Proakis, J., & Manolakis, D. (2006). "Digital Signal Processing: Principles, Algorithms,
and Applications." Pearson.
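As a minimal illustration of low-pass filtering, a moving-average filter (a simple FIR low-pass filter) applied to a noisy constant signal reduces the noise variance roughly in proportion to the window length. The signal level and noise amplitude below are made up for the example.

```python
import random
import statistics

random.seed(0)
# A constant "true" signal corrupted by zero-mean Gaussian noise.
true_value = 5.0
noisy = [true_value + random.gauss(0, 1.0) for _ in range(1000)]

# A moving average is a simple low-pass FIR filter: averaging over a
# window attenuates the high-frequency noise components.
def moving_average(x, window):
    return [sum(x[i:i + window]) / window for i in range(len(x) - window + 1)]

filtered = moving_average(noisy, 10)

# Averaging N independent samples cuts the noise variance by roughly N.
assert statistics.pvariance(filtered) < statistics.pvariance(noisy) / 5
```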
DATA ACQUISITION
 Calibration and Compensation
 Calibration involves adjusting sensor outputs to match known reference values.
 Compensation techniques account for sensor inaccuracies and drift over time.
 Regular calibration and compensation ensure the accuracy and reliability of sensor
measurements.

Cooper, J. W. (2019). "Introduction to the Theory and Design of Measurement Systems." McGraw Hill.
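A common calibration scheme is two-point linear calibration: record the raw sensor output at two known reference values, then solve for gain and offset so that outputs match the references. The ADC readings and reference temperatures below are hypothetical.

```python
# Two-point linear calibration (hypothetical numbers for illustration).
raw_points = (102.0, 518.0)    # raw ADC readings at the reference points
ref_points = (0.0, 100.0)      # known reference temperatures (deg C)

# Solve output = gain * raw + offset through the two reference points.
gain = (ref_points[1] - ref_points[0]) / (raw_points[1] - raw_points[0])
offset = ref_points[0] - gain * raw_points[0]

def calibrate(raw):
    return gain * raw + offset

# A raw reading midway between the calibration points maps to 50 deg C.
assert abs(calibrate(310.0) - 50.0) < 1e-9
```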
DATA ACQUISITION
 Data Logging and Storage
 Data logging systems record sensor data over time for analysis and visualization.
 Storage options include local storage on embedded systems, external memory devices, or
cloud-based storage solutions.
 Efficient data compression techniques can reduce storage requirements while preserving
data integrity.

Scargle, J. (2013). "Data Reduction and Error Analysis for the Physical Sciences."
Cambridge University Press.
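Lossless compression of a data log can be sketched with the standard library: slowly varying sensor data is highly repetitive, so it compresses well while remaining exactly recoverable, preserving data integrity. The log contents are synthetic.

```python
import json
import zlib

# A synthetic log of timestamped readings; slowly varying sensor data
# repeats heavily, which lossless compression exploits.
log = [{"t": i, "temp": 21.5 + (i % 3) * 0.1} for i in range(1000)]
raw = json.dumps(log).encode()

compressed = zlib.compress(raw, level=9)

assert len(compressed) < len(raw) // 4       # substantial size reduction
assert zlib.decompress(compressed) == raw    # data integrity preserved
```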
DATA ACQUISITION
 High-Speed Sampling
 High-speed sampling techniques capture data at rates exceeding conventional methods.
 Sampling rates of millions to billions of samples per second are achievable.
 Applications include high-frequency signal analysis, fast transient detection, and radar
systems.

Ibrahim, A. (2017). "High-Speed Devices and Circuits with THz Applications." CRC
Press.
DATA ACQUISITION
 Multi-Sensor Integration
 Multi-sensor integration combines data from diverse sensors to provide a comprehensive
view of the environment.
 Fusion techniques merge data from different modalities, such as vision, lidar, and inertial
sensors.
 Integration enhances perception accuracy and robustness in applications like autonomous
driving and robotics.

Koch, C. (2017). "Multisensory Integration and Attention in the Developing Brain." Academic Press.
DATA ACQUISITION
 Sensor Fusion for Contextual Awareness
 Sensor fusion integrates data from multiple sensors to infer contextual information.
 Context-aware systems adapt their behavior based on environmental cues.
 Fusion algorithms combine spatial, temporal, and semantic information for enhanced
situational awareness.

Durrant-Whyte, H., & Bailey, T. (2006). "Simultaneous Localization and Mapping: Part
I." IEEE Robotics & Automation Magazine.
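A classic fusion rule for combining two noisy measurements of the same quantity (say, a range estimate from lidar and one from vision) is inverse-variance weighting, which produces an estimate with lower variance than either input. The measurement values and variances below are illustrative.

```python
# Inverse-variance weighted fusion: a minimal sketch of combining two
# noisy measurements of the same quantity into one better estimate.
def fuse(m1, var1, m2, var2):
    w1, w2 = 1 / var1, 1 / var2          # weight = inverse variance
    fused = (w1 * m1 + w2 * m2) / (w1 + w2)
    fused_var = 1 / (w1 + w2)            # always below either input variance
    return fused, fused_var

est, var = fuse(10.2, 0.04, 9.8, 0.16)   # "lidar" more precise than "vision"
assert abs(est - 10.12) < 1e-9           # pulled toward the precise sensor
assert var < 0.04                        # fused variance below either input
```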
DATABASE
 Database operations are supported from a Python programming interface using
packages specific to the database in use.

 MySQL
 Oracle
 SQLite
OTHER ETL OPTION (GRAPH
DATABASE)
 Connect to Neo4j Database
REMOTE DATABASE
 Local database

# Open database connection
db = MySQLdb.connect("localhost", "user", "passwd", "TEST")

 Remote database
 Using the Python MySQL connector library:

pip3 install mysql-connector-python


REMOTE/LOCAL DATABASE
 Example:

import mysql.connector as mysql

# enter your server IP address/domain name
HOST = "x.x.x.x"  # or "domain.com"

# database name; to connect to the MySQL server only, leave it empty
DATABASE = "database"

# this is the user you created
USER = "python-user"

# user password
PASSWORD = "Password1$"

# connect to MySQL server
db_connection = mysql.connect(host=HOST, database=DATABASE, user=USER, password=PASSWORD)

print("Connected to:", db_connection.get_server_info())

# enter your code here!


DATABASE INSERT
#!/usr/bin/python
import MySQLdb

# Open database connection
db = MySQLdb.connect("localhost", "user", "passwd", "TEST")

# Prepare a cursor object using the cursor() method
cursor = db.cursor()

# Prepare SQL query to INSERT a record into the database
sql = """INSERT INTO STUDENT(
         NAME, SUR_NAME, ROLL_NO)
         VALUES ('Sayan', 'Mukhopadhyay', 1)"""

try:
    # Execute the SQL command
    cursor.execute(sql)
    # Commit your changes in the database
    db.commit()
except MySQLdb.Error:
    # Rollback in case there is any error
    db.rollback()

# Disconnect from server
db.close()
DATABASE SELECT
#!/usr/bin/python
import MySQLdb

# Open database connection
db = MySQLdb.connect("localhost", "user", "passwd", "TEST")

# Prepare a cursor object using the cursor() method
cursor = db.cursor()

# Prepare SQL query to SELECT records from the database
sql = "SELECT * FROM STUDENT"
try:
    # Execute the SQL command
    cursor.execute(sql)
    # Fetch all the rows in a list of tuples
    results = cursor.fetchall()
    for row in results:
        fname = row[0]
        lname = row[1]
        roll_no = row[2]
        # Now print the fetched result
        print("name=%s, surname=%s, id=%d" % (fname, lname, roll_no))
except MySQLdb.Error:
    print("Error: unable to fetch data")

# Disconnect from server
db.close()
DATABASE DELETE
#!/usr/bin/python
import MySQLdb

# Open database connection
db = MySQLdb.connect("localhost", "user", "passwd", "TEST")

# Prepare a cursor object using the cursor() method
cursor = db.cursor()

# Prepare SQL query to DELETE the required records
sql = "DELETE FROM STUDENT WHERE ROLL_NO = 1"
try:
    # Execute the SQL command
    cursor.execute(sql)
    # Commit your changes in the database
    db.commit()
except MySQLdb.Error:
    # Rollback in case there is any error
    db.rollback()

# Disconnect from server
db.close()
NEO4J REST CLIENT
 Connecting to Neo4j Server
 The goal of neo4j-rest-client is to enable Python programmers already using Neo4j locally
through python-embedded to use the Neo4j REST server. The syntax of neo4j-rest-client's
API is therefore fully compatible with python-embedded.
 The main class is GraphDatabase, similar to python-embedded:

from neo4jrestclient.client import GraphDatabase

gdb = GraphDatabase("http://localhost:7474/db/data/")
alice = gdb.nodes.create(name="Alice", age=30)
DOCUMENT DATABASE
(MONGO DB)
DOCUMENT DATABASE
 Import Data into the Collection:

mongoimport --db test --collection restaurants --drop --file ~/downloads/primer-dataset.json

 The mongoimport command connects to a MongoDB instance running on localhost on port
27017. The --file option specifies the file to import; here it is ~/downloads/primer-
dataset.json.
 Create a connection:

from pymongo import MongoClient

client11 = MongoClient()
 If no argument is passed to MongoClient, it defaults to the MongoDB instance running
on the localhost interface on port 27017.
DOCUMENT DATABASE
 Assign the database named primer to the local variable DB:

db11 = client11.primer
db11 = client11['primer']

 Collection objects can be accessed directly using the attribute style or the dictionary style:

coll11 = db11.dataset    OR
coll11 = db11['dataset']

 Insert Operation:

result = db11.address.insert_one({<<your json >>})

 Update Operation:

result = db11.address.update_one({"building": "129"},
                                 {"$set": {"address.street": "MG Road"}})
DATA MANAGEMENT
 Very large volumes of data are collected.
 Sometimes it may be impractical to store the entire raw data.
 Often the data is compressed, or portions of it are dropped.
 Errors and uncertainty in sensor data have spurred the development of algorithms for
uncertain database management.
VISUALISATION
 Sometimes more useful than raw tables
 Sometimes replaces traditional ETL
 Not always appropriate within a processing pipeline
DATA VISUALISATION FOR
SENSOR DATA
• Heatmap Representation
• Contour Plots
• 3D Surface Visualization
• Choropleth Maps
• Time-Series Animation
• Spatial Clustering
• Flow Maps
• Interactive Web Maps
• Geospatial Data
• Spatial Data Mining Visualization: Using advanced visualization techniques for spatial data mining, such
as parallel coordinates plots or multidimensional scaling, enables the exploration of complex relationships
and patterns in geospatial sensor data.
MODEL-BASED
SENSOR DATA
ACQUISITION,
CLEANING & QUERY
PROCESSING
Dr. Firoz Anwar
MODEL-BASED TECHNIQUES
 A large body of research has emerged in recent times on sensor data processing.
 These techniques use mathematical models for solving various problems pertaining to sensor
data acquisition and management.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer US
WHY MODEL-BASED
TECHNIQUES?
 It is well known that many physical attributes, such as ambient temperature or relative humidity,
vary smoothly.
 Sensor data typically exhibits the following properties:
 Continuous (although we only have a finite number of samples),
 Finite energy, or band-limited,
 Markovian: the value at a time instant depends only on the value at the
previous time instant.
 Most model-based techniques exploit these properties for efficiently performing various tasks
related to sensor data acquisition and management.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer US
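The smoothness and Markovian properties listed above can be illustrated with a first-order autoregressive (AR(1)) model, in which each value is generated from its immediate predecessor alone; the coefficient and values below are hypothetical.

```python
# AR(1) sketch of Markovian sensor behaviour: the next value depends
# only on the current one (hypothetical smoothness coefficient).
alpha = 0.9
ambient = 25.0        # value the signal drifts toward
x = 20.0              # initial sensor reading

series = []
for _ in range(5):
    x = alpha * x + (1 - alpha) * ambient   # next value from previous only
    series.append(round(x, 4))

# The trajectory drifts smoothly toward the ambient value.
assert series[0] == 20.5
assert series == sorted(series)
```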
MODEL-BASED TECHNIQUES
 Model-based techniques use various types of models:
 statistical,
 signal processing,
 regression-based,
 machine learning, probabilistic, and
 time series.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer US
MODEL-BASED TECHNIQUES

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer US
DATA ACQUISITION
 Sensor data acquisition is the task responsible for efficiently acquiring samples from the
sensors in a sensor network.
 The primary objective of the sensor data acquisition task is energy efficiency.
 Driver:
 Most sensors are battery-powered and are located in inaccessible locations (e.g.,
environmental monitoring sensors are sometimes located at high altitudes and are
surrounded by highly inaccessible terrains).

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer US
DATA ACQUISITION TYPES
 Two major types of acquisition approaches:
 Pull-based and
 Push-based.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer US
MODEL-BASED SENSOR DATA
ACQUISITION
 Driver:
 Energy Consumption:
 Obtaining values from a sensor requires a significant amount of energy.
 Minimise the number of samples obtained from the sensors.
 Models are used for selecting sensors, such that user queries can be answered with
reasonable accuracy using the data acquired from the selected sensors.

 Communication Cost:
 Another energy-intensive task is to communicate the sensed values to the base station.
 Model-based techniques have been proposed in the literature for reducing the communication
cost while maintaining the accuracy of the sensed values.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer US
SOME NOTATIONS

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer US
DATABASE ENTRY

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer US
DATA ACQUISITION TYPES
 Pull-based approach:
 Data is only acquired at a user-defined frequency of acquisition.

 Push-based approach:
 The sensors and the base station agree on an expected behaviour; sensors only send data to
the base station if the sensor values deviate from such expected behaviour.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer US
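The push-based approach above can be sketched in a few lines: the sensor and the base station share a simple expected-behaviour model (here, "the value stays near the last reported reading"), and the sensor transmits only when the deviation exceeds a threshold. The readings and threshold are made up for the example.

```python
# Push-based acquisition sketch: transmit only on deviation from the
# shared expected behaviour (hypothetical readings and threshold).
eps = 0.5
readings = [20.0, 20.1, 20.2, 21.0, 21.1, 22.5, 22.4]

transmitted = []
last_reported = None
for v in readings:
    if last_reported is None or abs(v - last_reported) > eps:
        transmitted.append(v)    # deviation too large: push to base station
        last_reported = v        # both sides update the shared model

# Only 3 of the 7 samples are communicated, saving radio energy.
assert transmitted == [20.0, 21.0, 22.5]
```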
SENSOR DATA ACQUISITION
QUERY
 Pull-Based Data Acquisition
 User defines the interval and frequency of data acquisition.
 Pull-based systems only follow the user’s requirements, and pull sensor values as defined
by the queries.
 For example, using the SAMPLE INTERVAL clause of a query, users can specify the
number of samples and the frequency at which the samples should be acquired.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer US
PULL-BASED DATA
ACQUISITION
 Techniques:

 In-Network Data Acquisition

 Multi-Dimensional Gaussian Distributions

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer US
Image Source: “The Multivariate Gaussian Distribution” – Chuong B. Do (2008)
IN-NETWORK DATA
ACQUISITION
 Proposed/Implemented by Databases:
 TinyDB,
 Cougar and
 TiNA.

 TinyDB refers to its in-network query processing paradigm as Acquisitional Query Processing
(ACQP).
 Limitation:
 May not work due to the limited range of radio communication between individual sensors
and the base station.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer US
DATA ACQUISITION USING
SEMANTIC OVERLAYS

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer US
Image Source: https://www.ircc.iitb.ac.in/IRCC-Webpage/patent3400.jsp
DATA ACQUISITION USING
SEMANTIC OVERLAYS
 A tree-based overlay is constructed over the sensors S.
 It is used for aggregating the query results from the leaf nodes to the root node.
 The overlay network is built especially for efficient data acquisition and query processing.
 Such tree-based overlay networks are known as Semantic Routing Trees (SRTs).
 An SRT is constructed by flooding the sensor network with an SRT build request. This request
includes the attribute (e.g., ambient temperature) over which the SRT should be constructed.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer US
DATA ACQUISITION USING
SEMANTIC OVERLAYS
 Each sensor sj, which receives the build
request, has several choices for choosing its
parent:
 if sj has no children, which is equivalent to
saying that no other sensor has chosen sj as its
parent, then sj chooses another sensor as its
parent and sends its current value vij to the
chosen parent in a parent selection message, or
 if sj has children, it sends a parent selection
message to its parent indicating the range of
ambient temperature values that its children are
covering.
 In addition, it locally stores the ambient
temperature values from its children along with
their sensor identifiers.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer US
DATA ACQUISITION USING
SEMANTIC OVERLAYS
 The query is presented to the root node of the SRT, which forwards it to its children
and prepares to receive the results.
 At the same time, the root node also starts processing the query locally.
 The same procedure is followed by all the intermediate sensors in the SRT.

 A sensor that does not have any children processes the query and forwards its value vij to
its parent.
 All the collected sensor values vij are finally forwarded to the root node, and then to the user,
as the result of the query.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer US
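The SRT behaviour described above can be sketched in miniature: each parent stores the value range covered by each child subtree (reported during the build phase) and forwards a range query only to children whose range overlaps it, while answering with its own value where it matches. The topology and values below are a hypothetical toy example, not the TinyDB implementation.

```python
# Minimal SRT-style sketch (hypothetical topology and readings).
children = {"root": ["s1", "s2"], "s1": ["s3"], "s2": [], "s3": []}
own_value = {"root": 21.0, "s1": 20.5, "s2": 25.0, "s3": 19.5}
# Value range covered by each subtree, reported during the SRT build.
subtree_range = {"root": (19.5, 25.0), "s1": (19.5, 20.5),
                 "s2": (25.0, 25.0), "s3": (19.5, 19.5)}

def query(node, lo, hi):
    # Each node answers locally, then aggregates its children's results.
    results = [own_value[node]] if lo <= own_value[node] <= hi else []
    for child in children[node]:
        c_lo, c_hi = subtree_range[child]
        if c_lo <= hi and c_hi >= lo:   # prune non-overlapping subtrees
            results.extend(query(child, lo, hi))
    return results

# Asking for temperatures in [19, 21] never visits s2's subtree.
assert sorted(query("root", 19.0, 21.0)) == [19.5, 20.5, 21.0]
```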
MULTI-DIMENSIONAL
GAUSSIAN DISTRIBUTIONS
 Known as the Barbie-Q (BBQ) system.
 Employs multivariate Gaussian distributions for sensor data acquisition.
 Maintains a multi-dimensional Gaussian probability distribution over all the sensors in S.
 Data is acquired only as much as it is required to maintain such a distribution.
 Sensor data acquisition queries specify a certain confidence that they require in the acquired
data.
 If the confidence requirement cannot be satisfied, then more data is acquired from the sensors,
and the Gaussian distribution is updated to satisfy the confidence requirements.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer US
MULTI-DIMENSIONAL
GAUSSIAN DISTRIBUTIONS
 BBQ uses a multivariate Gaussian probability density function (pdf) denoted p(Vi1, Vi2, ..., Vim),
where Vi1, Vi2, ..., Vim are the random variables associated with the sensor values vi1, vi2, ..., vim
respectively.

 In BBQ, the inferred value of sensor sj at each time ti is defined as the mean value of Vij,
and is denoted v̄ij.
 Two additional constraints are imposed: (i) an error bound ε for the values v̄ij, and (ii) the
confidence 1 − δ with which the error bound should be satisfied.
 These additional constraints control the quality of the query response.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer US
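The core mechanism behind BBQ, conditioning a joint Gaussian on the sensors actually read, can be sketched for two correlated sensors: acquire only sensor 1 and infer sensor 2's value as its conditional mean. The means and covariance entries below are hypothetical, and the general m-sensor case uses the matrix form of the same conditioning rule.

```python
# Sketch of BBQ-style inference with two correlated sensors
# (hypothetical numbers). Maintain a joint Gaussian over (V1, V2),
# acquire only sensor s1, and infer s2's value without reading it.
mu1, mu2 = 20.0, 22.0          # prior means of sensors s1 and s2
s11, s12, s22 = 1.0, 0.8, 1.0  # covariance entries: strongly correlated

v1 = 21.0                      # value actually acquired from sensor s1

# Gaussian conditioning: E[V2 | V1 = v1] = mu2 + (s12 / s11) * (v1 - mu1)
cond_mean = mu2 + (s12 / s11) * (v1 - mu1)
# Var[V2 | V1] = s22 - s12^2 / s11  (independent of the observed value)
cond_var = s22 - s12 ** 2 / s11

assert abs(cond_mean - 22.8) < 1e-9  # inferred value for the unread sensor
assert cond_var < s22                # reading s1 reduced uncertainty in s2
```

If the resulting conditional variance is too large to meet the query's 1 − δ confidence requirement, more sensors are read and the distribution is updated, exactly as the slide describes.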
Practise
