
Q.

Find four common sources of Big Data in the following sectors:

i) Science

ii) Technology/Engineering

Four common sources of Big Data in the Science sector are:

• Smart (Fitness) Watches
• Observational Sensor Data
• Genomic and Biological Data
• Healthcare and Medical Records

Four common sources of Big Data in the Technology/Engineering sector are:

• Internet of Things (IoT) Devices
• Network Traffic Logs
• Social Media Data
• Air Traffic Control Data

SOURCE- 1: Smart (Fitness) Watches

i. Wearable technology, which can continuously and remotely monitor physiological and
behavioural parameters when incorporated into clothing or worn as an accessory, introduces a
new era of ubiquitous healthcare. With Big Data technology, wearable data can be analysed
to support long-term cardiovascular care.
ii. Sources of Big data:
a. Biometric Sensors
b. GPS and Location Data
c. Environmental Sensors
d. Third-Party App Data
iii. Data is generated at high velocities, with sensors capturing information in real-time or near-
real-time, creating streams of continuous data flow. Additionally, the data comes in diverse
formats and types, ranging from numerical sensor readings to location coordinates and
textual inputs. As users wear their smartwatches throughout the day, the accumulation of
data over time results in massive datasets that require advanced analytics techniques to
process, analyse, and derive meaningful insights.
iv. The data exhibits the 3 V’s:
a. Volume: Smartwatches indeed generate a significant volume of data, especially
considering the continuous monitoring of biometric sensors, GPS tracking, and other
inputs throughout the day. The accumulation of data from multiple sources over
time results in large datasets, contributing to the volume aspect of Big Data.
b. Velocity: Data generated by smartwatches often arrives at high velocities,
particularly when considering real-time or near-real-time monitoring of physiological
parameters and location tracking. The constant stream of data from sensors and user
interactions requires rapid processing and analysis to derive timely insights.
c. Variety: Smartwatch data exhibits variety in terms of data types and formats. This
includes numerical sensor readings, geographical coordinates from GPS tracking,
textual inputs from user interactions, and potentially other data types from
integrated third-party apps or wearable ecosystem integration. The diversity of data
sources and formats adds to the complexity of managing and analysing the data.
v. Smartwatch applications generate a combination of structured, semi-structured, and
unstructured data, reflecting the diverse nature of the information collected from sensors,
user interactions, and integrated third-party sources.
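The mix of structured, semi-structured, and unstructured smartwatch data can be illustrated with a short Python sketch; the field names and values below are invented for illustration, not taken from any real device API:

```python
import json

# Structured: fixed-schema sensor sample, behaves like a table row
structured = ("2024-03-01T08:00:00", 72, 98)  # timestamp, heart rate (bpm), SpO2 (%)

# Semi-structured: JSON payload from a third-party app, with nested fields
semi_structured = json.loads('{"app": "runtrack", "route": {"km": 5.2}}')

# Unstructured: free-text note entered by the user
unstructured = "Felt dizzy after the morning run"

print(structured[1], semi_structured["route"]["km"], unstructured)
```

Each kind of record calls for different handling: the tuple can go straight into a relational table, the JSON needs flexible (schema-on-read) storage, and the free text needs text-analytics techniques.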
vi. Handling Big Data generated by smartwatch applications involves a sophisticated process of
data collection, transmission, storage, processing, and analysis. Initially, data is continuously
collected from various sensors and inputs within the smartwatch. This data is then
transmitted to cloud servers or storage infrastructure for efficient storage and further
processing. Utilizing techniques such as filtering, aggregation, and statistical analysis, the
data is processed to extract valuable insights, patterns, and trends. Real-time processing
techniques enable immediate feedback or alerts to users when necessary. Advanced
analytics methods, including machine learning and AI, are applied to derive deeper insights
and provide personalized recommendations.
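The filtering, aggregation, and real-time alerting steps described above can be sketched in pure Python. The window size, plausibility bounds, and alert threshold here are illustrative assumptions, not values from any real smartwatch:

```python
from collections import deque
from statistics import mean

def process_heart_rate(stream, window=5, alert_above=120):
    """Aggregate a stream of heart-rate readings with a sliding window
    and record an alert whenever the windowed average exceeds a threshold."""
    buffer = deque(maxlen=window)
    alerts = []
    for bpm in stream:
        # Filtering: discard physiologically implausible sensor readings
        if not 30 <= bpm <= 220:
            continue
        buffer.append(bpm)
        # Aggregation: rolling mean over the last `window` valid readings
        if len(buffer) == window and mean(buffer) > alert_above:
            alerts.append((bpm, round(mean(buffer), 1)))
    return alerts

# Simulated real-time sensor stream (the 500 is a faulty reading)
readings = [72, 75, 500, 130, 135, 140, 138, 142, 90, 85]
print(process_heart_rate(readings))
```

In a production pipeline the same logic would run in a stream-processing framework against data arriving from many devices, but the filter-then-aggregate pattern is the same.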
vii. Limitations of this approach:
a. Data Accuracy and Reliability
b. Privacy Concerns
c. Limited Battery Life and Connectivity
d. Data Integration
e. User Engagement
f. Algorithm Bias and Interpretation
viii. Two Relevant Organizations are:
a. Apple: The company's Apple Watch product line has become one of the leading
smartwatch brands globally, integrating advanced health and fitness tracking
features with its ecosystem of devices and services.
b. Fitbit: A leading manufacturer of wearable fitness trackers and smartwatches. The
company specializes in developing devices and software platforms aimed at helping
users track and improve their health and fitness.
ix. References:
a. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10228482/
b. https://www.sciencedirect.com/science/article/pii/S2667241322000246/
c. https://www.altexsoft.com/blog/big-data-healthcare/

SOURCE- 2: Internet of Things (IoT) Devices

i. Internet of Things (IoT) devices constitute a substantial source of Big Data, characterized by
diverse data types, high velocity, massive volume, and critical insights. These devices
continuously collect data from sensors, user interactions, and environmental factors,
generating a vast stream of information in real-time. Despite challenges related to data
veracity and security, IoT data holds immense value for businesses and industries, offering
insights into operational efficiency, predictive maintenance, and customer behaviour. Edge
computing has emerged as a solution to address the challenges associated with data volume
and velocity, enabling faster response times and reducing reliance on centralized data
processing. Overall, IoT Big Data drives innovation, improves decision-making, and is
transforming sectors across the economy.
ii. Sources of Big Data:
a. Geolocation and GPS Data
b. Public Sector Data
c. Research and Scientific Instruments
d. Transactional Records
iii. These sources generate immense amounts of data continuously, from geolocation streams
and IoT sensor readings to financial transactions and scientific research observations. This
data flows at high speed, often in real time or near real time, requiring rapid processing and
analysis.
iv. The data exhibits the 3 V’s:
a. Volume: Big Data sources generate vast amounts of data continuously. Whether it's
the billions of social media interactions happening every day, the multitude of
transactions processed by financial systems, or the continuous stream of sensor data
from IoT devices, the sheer volume of data generated overwhelms traditional data
processing systems.
b. Velocity: Data is generated at high velocity, often in real-time or near real-time.
Social media platforms, for example, process millions of posts, comments, and likes
per minute. IoT devices continuously collect data from sensors and transmit it in
real-time. The speed at which data is generated requires efficient processing and
analysis methods to derive insights and take timely actions.
c. Variety: Big Data comes in various forms, including structured, semi-structured, and
unstructured data. This includes text, images, videos, sensor readings, geospatial
data, and more. Managing and analyzing such diverse data types require flexible
data storage and processing solutions capable of handling different data formats.
v. Transaction records and customer interactions typically produce structured data, featuring
well-defined formats and organized fields that facilitate storage and analysis. In contrast,
social media data and information from websites and applications are characterized by semi-
structured elements alongside structured components, reflecting the mix of organized data
fields and flexible user-generated content like hashtags or JSON-formatted exchanges.
Conversely, data from IoT devices, social media user-generated content, and scientific
instruments often fall into the realm of unstructured data, lacking predefined schemas and
presenting challenges in analysis due to their raw, free-form nature.
vi. Big Data for e-commerce and retail applications is commonly managed through a
multifaceted approach that integrates various technologies and strategies. This typically
involves collecting and storing vast amounts of transactional data, customer interactions, and
inventory information using distributed storage solutions such as Hadoop Distributed File
System (HDFS) or cloud-based platforms like Amazon S3. Data processing and analysis are
conducted using distributed computing frameworks like Apache Spark or Apache Flink to
derive insights such as market trends, customer preferences, and demand forecasts. Real-
time analytics are facilitated through stream processing technologies such as Apache Kafka,
enabling retailers to react promptly to changing market dynamics and provide personalized
experiences. Additionally, recommendation systems powered by machine learning
algorithms enhance customer engagement by offering tailored product recommendations
based on individual preferences.
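As a rough, framework-free illustration of the windowed aggregation that systems like Apache Spark or Kafka Streams perform at scale, the following sketch totals transaction revenue per fixed one-minute window; the event format is an assumption for illustration:

```python
from collections import defaultdict

def revenue_per_window(events, window_seconds=60):
    """Tumbling-window aggregation: sum transaction amounts per
    fixed-size time window, keyed by the window's start timestamp."""
    totals = defaultdict(float)
    for ts, amount in events:
        window_start = ts - (ts % window_seconds)  # align to window boundary
        totals[window_start] += amount
    return dict(sorted(totals.items()))

# Simulated (timestamp-in-seconds, amount) transaction events
events = [(3, 19.99), (42, 5.00), (65, 12.50), (118, 7.25), (121, 3.00)]
print(revenue_per_window(events))
```

A real deployment would compute these windows incrementally over an unbounded stream rather than over a finished list, but the grouping logic is the same.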
vii. Limitations of these approaches:
a. Scalability and Infrastructure Costs
b. Data Integration and Siloed Systems
c. Algorithmic Bias and Fairness
d. Regulatory Compliance
e. Complexity of Analysis and Interpretation
f. Data Quality and Accuracy
g. Privacy and Security Concerns
viii. Two Relevant Organizations are:
a. Amazon: One of the world's largest e-commerce and technology companies, offering
a wide range of products and services, including online retail, cloud computing,
digital streaming, and artificial intelligence.
b. Alibaba Group: A leading Chinese multinational conglomerate specializing in e-
commerce, retail, internet, and technology. The company operates various online
marketplaces, digital payment platforms, cloud computing services, and logistics
networks.
ix. References:
a. https://www.ptc.com/en/blogs/iiot/how-is-iot-related-to-big-data-analytics/
b. https://www.simplilearn.com/how-big-data-powering-internet-of-things-iot-revolution-article
c. https://www.spiceworks.com/tech/big-data/guest-article/the-big-data-iot-relationship-how-they-help-each-other/

SOURCE- 3: Healthcare and Medical Records

i. Healthcare and medical records are a pivotal component of Big Data, encompassing a vast
array of patient information and clinical data generated within the healthcare ecosystem.
Electronic Health Records (EHRs) serve as comprehensive repositories containing patients'
medical history, diagnoses, treatments, and test results, while medical imaging data provides
crucial insights into anatomical structures and pathological conditions. Additionally, clinical
trials and research efforts generate substantial datasets elucidating drug efficacy, patient
outcomes, and biomarker analysis.
ii. Sources of Big Data:
a. Patient Interactions
b. Medical Procedures and Treatments
c. Health Monitoring Technologies
d. Clinical trials and Research Studies
e. Public Health Surveillance
iii. Electronic Health Records (EHRs), medical imaging data, clinical trials, health monitoring
technologies, insurance claims, and public health surveillance collectively generate massive
amounts of data on patient health, medical treatments, and healthcare operations. This data
comes in diverse formats, including structured, unstructured, and semi-structured data, and
is generated rapidly during patient encounters, procedures, and monitoring activities.
iv. The 3 V’s:
a. Volume: Healthcare and medical records generate vast amounts of data on a daily
basis. Electronic Health Records (EHRs), medical imaging files, and clinical trial
databases alone can generate terabytes or even petabytes of data. The sheer volume
of data overwhelms traditional data processing and storage systems.
b. Variety: Data in healthcare comes in various formats, including structured data (such
as patient demographics and lab results), unstructured data (such as clinical notes
and medical images), and semi-structured data (such as billing codes and medical
reports). Additionally, data from wearables and health monitoring devices add to the
diversity of data types.
c. Velocity: Healthcare data is generated rapidly, with new information continuously
added during patient encounters, medical procedures, and health monitoring
activities. Real-time monitoring systems and streaming data from wearable devices
contribute to the high velocity of data generation in healthcare.
v. Healthcare data sources encompass a mix of structured, semi-structured, and unstructured
data. Structured data, exemplified by Electronic Health Records (EHRs), contain predefined
fields such as patient demographics, medical history, and laboratory results, organized in
databases for easy retrieval and analysis. Semi-structured data, as seen in health insurance
and billing transactions, includes structured elements like procedure codes but also narrative
descriptions that lack standardized formats, necessitating some level of interpretation.
Unstructured data, including clinical notes and medical imaging data, presents the greatest
challenge, consisting of free-text narratives and image-based information that lack
predefined structures.
vi. Initially, data storage and management systems such as distributed file systems or cloud-
based data warehouses are utilized to store and organize large volumes of healthcare data
efficiently. Data integration and cleansing techniques are then applied to consolidate data
from diverse sources like EHRs, medical devices, and clinical trials, ensuring data quality and
consistency. Advanced analytics and machine learning algorithms are subsequently
employed to analyse healthcare data, identify patterns, predict outcomes, and derive
actionable insights for clinical decision-making and research purposes. Data visualization
tools enable healthcare professionals to visualize complex datasets intuitively, facilitating
interpretation and decision support.
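The integration and cleansing step described above can be sketched as follows; the record schemas and field names are hypothetical, chosen only to show the consolidation pattern:

```python
def integrate_patient_records(ehr_rows, lab_rows):
    """Consolidate EHR demographics with lab results on a shared
    patient ID, normalising inconsistent field formats along the way."""
    patients = {}
    for row in ehr_rows:
        pid = row["patient_id"].strip().upper()  # normalise identifiers
        patients[pid] = {
            "name": row["name"].strip().title(),  # normalise casing
            "labs": [],
        }
    for row in lab_rows:
        pid = row["patient_id"].strip().upper()
        if pid in patients:  # cleansing: drop orphaned lab rows
            patients[pid]["labs"].append((row["test"], float(row["value"])))
    return patients

ehr = [{"patient_id": " p001 ", "name": "jane DOE"}]
labs = [{"patient_id": "P001", "test": "HbA1c", "value": "5.6"},
        {"patient_id": "P999", "test": "LDL", "value": "130"}]
print(integrate_patient_records(ehr, labs))
```

At scale this join would run in a distributed engine over millions of rows, but the ID normalisation and orphan-dropping steps are exactly the kind of cleansing the text describes.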
vii. Limitations of these approaches:
a. Data Quality and Consistency
b. Privacy and Security Concerns
c. Interoperability Issues
d. Complexity of Data Analysis
e. Bias and Fairness
f. Costs and Resource Constraints
viii. Two Relevant Organizations are:
a. IBM Watson Health: IBM Watson Health is a division of IBM that focuses on applying
artificial intelligence (AI) and data analytics to healthcare challenges. They offer a
range of solutions leveraging Big Data analytics, cognitive computing, and machine
learning to assist healthcare organizations in improving patient care, population
health management, and clinical decision-making.
b. Google Health: Google Health is a division of Google (Alphabet Inc.) focused on
leveraging technology, data analytics, and AI to improve healthcare outcomes and
patient experiences. They collaborate with healthcare organizations, research
institutions, and technology partners to develop innovative solutions for healthcare
delivery and medical research.
ix. References:
a. https://www.coursera.org/articles/big-data-in-healthcare
b. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8733917/ and
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4287068/
c. https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0217-0

SOURCE- 4: Air Traffic Control

i. With the proliferation of flight tracking technologies and the surge in air travel, ATC systems
are inundated with vast volumes of data on aircraft positions, trajectories, and weather
conditions. Through sophisticated big data analytics, controllers can predict traffic patterns,
anticipate congestion, and optimize airspace usage to minimize delays and maximize
capacity. Moreover, big data enables real-time monitoring of aircraft performance,
identification of safety risks, and swift response to security threats.
ii. Sources of Big Data:
a. Sensors and IoT Devices
b. Human Interaction
c. Live Flight Tracking
d. Communication Networks
iii. These sources generate vast amounts of data at a rapid pace, encompassing various types
and formats, including structured, semi-structured, and unstructured data. From live flight
positions and sensor readings to communication logs and controller interactions, they
produce a continuous stream of information that exceeds the processing capabilities of
traditional database systems.
iv. The 3 V’s:
a. Volume: Big data sources generate enormous volumes of data at a rapid pace. This
includes massive datasets generated by human interactions on social media
platforms, continuous streams of data from sensors and IoT devices, and extensive
collections of digital documents and multimedia files. The sheer volume of data
exceeds the processing capacity of traditional database systems, necessitating
scalable storage and processing solutions.
b. Velocity: Big data is generated at high velocity, often in real-time or near-real-time.
For example, social media platforms produce a constant stream of updates and
interactions, while sensor data from IoT devices is continuously generated as events
occur. Handling data at such high velocities requires efficient data ingestion,
processing, and analysis capabilities to derive actionable insights in a timely manner.
c. Variety: Big data sources exhibit a diverse range of data types and formats, including
structured, semi-structured, and unstructured data. This encompasses text data,
multimedia content, sensor readings, transaction logs, and more. Managing and
analysing this variety of data requires flexible data storage and processing
techniques that can accommodate different data formats and structures.
v. Structured data, found in transactional records and some sensor readings, adheres to a
predefined format with clear organization. Semi-structured data, as seen in social media
interactions and certain IoT device data, lacks the strict schema of structured data but
retains some level of organization, often in formats like JSON or XML. Meanwhile,
unstructured data, prevalent in textual content from emails, documents, and social media
posts, as well as multimedia files such as images and videos, lacks a predefined structure
altogether.
vi. Handling big data for diverse applications typically involves a multifaceted approach tailored
to the specific characteristics of each data type. For structured data, traditional relational
database management systems (RDBMS) are commonly used, while NoSQL databases like
MongoDB and key-value stores like Redis are preferred for semi-structured data.
Unstructured data, such as text, images, and videos, requires specialized techniques like
natural language processing (NLP) for textual data and computer vision for multimedia
content. Data integration and ETL tools are employed to merge and transform data from
various sources before analysis, with technologies like Apache Kafka and Apache NiFi
facilitating real-time data streaming and flow management. Finally, analytics and machine
learning algorithms are applied to derive insights and make predictions, leveraging
frameworks like Apache Spark and TensorFlow.
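A minimal ETL-style sketch of the approach above: normalising structured (CSV) and semi-structured (JSON) inputs into one common record format. The flight-position schemas are invented for illustration:

```python
import csv
import io
import json

def normalise(raw, fmt):
    """Transform one structured (CSV) or semi-structured (JSON) input
    into a common {"source", "lat", "lon"} record, as a simple ETL step."""
    if fmt == "csv":
        row = next(csv.reader(io.StringIO(raw)))
        return {"source": row[0], "lat": float(row[1]), "lon": float(row[2])}
    if fmt == "json":
        obj = json.loads(raw)
        return {"source": obj["id"], "lat": obj["pos"]["lat"], "lon": obj["pos"]["lon"]}
    raise ValueError(f"unsupported format: {fmt}")

print(normalise("FL123,51.47,-0.45", "csv"))
print(normalise('{"id": "FL456", "pos": {"lat": 48.35, "lon": 11.79}}', "json"))
```

Tools like Apache NiFi apply this transform-to-common-schema idea across many feeds at once; the sketch only shows the per-record normalisation.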
vii. Limitations of these approaches:
a. Skill Gap and Talent Shortage
b. Cost and Infrastructure Requirements
c. Integration and Interoperability
d. Complexity and Scalability
e. Data Quality and Veracity
viii. Two Relevant Organizations are:
a. Federal Aviation Administration (FAA): The Federal Aviation Administration (FAA) is a
regulatory agency responsible for overseeing civil aviation in the United States. It
regulates and manages airspace, air traffic control systems, and aviation safety
standards. The FAA operates a network of air traffic control facilities, including
control towers, radar facilities, and en-route centres, to ensure safe and efficient air
travel across the country.
b. Eurocontrol: A pan-European organization responsible for coordinating and
harmonizing air traffic management across European airspace. It collaborates with
national air navigation service providers (ANSPs) and other stakeholders to ensure
seamless and efficient air traffic operations in the region. Eurocontrol provides
various services and tools to support air traffic control, including air traffic flow
management, airspace design, and air navigation system planning.
ix. References:
a. https://ieeexplore.ieee.org/document/8973192
b. https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID3592705_code4154652.pdf?abstractid=3592705&mirid=1
c. https://www.researchgate.net/publication/338938066_Big_Data_Platform_of_Air_Traffic_Management

Name: Ansh Kumar Nimboria


Roll Number: 11021210089
Branch: B.Tech CSE (DS&AI)
Section: E, 3rd Year 6th Sem
