
Machine Learning for Big Data

Syed Murtaza Haider Zaidi


Westcliff University
MSIT 690: Big Data Analytics
Professor Hemphill
May 29, 2021

Table of Contents

Title Page --------------------------------------------------------------------------------------

Table of Contents ------------------------------------------------------------------------------

Introduction ------------------------------------------------------------------------------------

Machine Learning for Big Data -----------------------------------------------------------------

Machine Learning Applications for Big Data ----------------------------------------------------

Neural Networks --------------------------------------------------------------------------------

How Does a Neural Network Work ----------------------------------------------------------------

Statistical Analysis Methods -------------------------------------------------------------------

Data Visualization -----------------------------------------------------------------------------

Types of Data Visualization --------------------------------------------------------------------

Data Visualization Tools -----------------------------------------------------------------------

Comparison of Data Visualization Tools ---------------------------------------------------------

Conclusion --------------------------------------------------------------------------------------

Machine Learning for Big Data



Introduction

Over the last several years, more data has been generated than in centuries of human history. In terms of commercial value, this data is a treasure trove, and it is also a fundamental resource for researchers and public authorities. In almost every case, though, the majority of this potential will go unused or, worse, misinterpreted, unless the technologies needed to analyze huge volumes of data are applied. Without substantial computational capacity, extracting meaningful insights from the trends, correlations, and patterns in big data is challenging. However, big data analytics methodologies and technologies make it possible to learn from enormous data sets, including by visualizing data regardless of its size or form.

Machine Learning for Big Data

Machine learning and big data are the IT sector's current blue chips. Big data systems store, gather, and analyze information drawn from huge volumes of data. Machine learning, on the other hand, is a system's ability to learn and improve from experience without being explicitly programmed.

The core of machine learning consists of self-learning algorithms that develop by constantly improving at their designated task. When they are structured properly and fed suitable data, these algorithms eventually produce results in the areas of pattern recognition and predictive modeling. Data is like exercise for machine learning algorithms: just as elite athletes sharpen their bodies and skills by practicing every day, algorithms adjust themselves based on the data they are trained on. Machine learning algorithms become more effective as training datasets grow larger. As a result, combining big data and machine learning benefits both sides: the algorithms help us keep up with the constant influx of data, while the volume and variety of that same data feeds the algorithms and helps them grow.
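As a minimal sketch of this idea (using scikit-learn and a synthetic dataset rather than any particular production system), the snippet below trains the same classifier on progressively larger slices of the data; test accuracy typically improves as the training set grows:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a large dataset (50,000 rows, 20 features)
X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train the same model on progressively larger slices of the training data
for n in (100, 1_000, 10_000, len(X_train)):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[:n], y_train[:n])
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"trained on {n:>6} rows -> test accuracy {accuracy:.3f}")
```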

When we feed big data to a machine learning algorithm, we can expect to see clearly delineated, analyzed outcomes such as hidden patterns and insights that support predictive analytics.

Machine Learning Applications for Big Data

The examples below illustrate how big data and machine learning can work together:

 Web Scraping: Assume a household appliance maker learns about market trends and customer satisfaction patterns only through a retailer's financial reports. The manufacturer decides to web-scrape the huge amount of relevant data available in online customer feedback and reviews to discover what those reports may have missed. By combining all of this data and feeding it into a suitable model, the company learns how to improve and better position its product lines, which leads to increased sales. Even though web scraping produces a large quantity of data, it is worth noting that one of the most important steps is selecting the right datasets (see the sketch after this list).

 Cloud Networks: A research organization has a huge quantity of data it wants to study, but it would need servers, networking, storage, and security assets to complete the task, all of which add up to a prohibitive expense. The organization instead decides to invest in Amazon EMR, a cloud service that offers data analysis models within a managed framework. GPU-accelerated image recognition and classification algorithms are examples of machine learning models of this type. Since these algorithms do not continue to learn after they have been deployed, they can be distributed and served through a content delivery network (CDN).

 Collaborative Filtering Systems: Clustering algorithms are used in the Netflix prediction model, which suggests titles on your homepage: big data is used to track your viewing history, and machine learning algorithms determine what to recommend next. In the same way, smart-car automakers use big data and machine learning in the predictive-analytics systems that power their vehicles.
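As a hedged illustration of the web-scraping scenario above (the URL, the CSS selector, and the keyword list are hypothetical placeholders, not a real retailer's site or any specific vendor's method), a simple scrape-and-flag pass might look like this:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical review page; the URL and the CSS selector are placeholders
URL = "https://example.com/products/blender-3000/reviews"

response = requests.get(URL, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Collect review text from elements assumed to use a "review-body" class
reviews = [div.get_text(strip=True) for div in soup.select("div.review-body")]

# Naive keyword tally as a stand-in for a trained sentiment model
negative_terms = ("broke", "leaks", "refund", "noisy")
flagged = [r for r in reviews if any(term in r.lower() for term in negative_terms)]

print(f"{len(reviews)} reviews scraped, {len(flagged)} flagged for follow-up")
```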

There are a few requirements for getting accurate results from machine learning: clean data, optimized tools, and a clear idea of what you want to achieve, in addition to a well-designed learning algorithm.

NEURAL NETWORKS

Neural networks are a class of algorithms that recognize patterns and are loosely modeled on the human brain. They use a kind of machine perception to interpret sensory data, labeling or clustering raw input. All real-world data, whether images, audio, text, or time series, must be translated into the numerical patterns they recognize, which are contained in vectors of numbers. Companies that provide neural network software include:

 BioComp Systems Inc.: An organization that focuses mainly on genetic algorithms and neural networks for consulting and software design.

 Attrasoft: Provides a variety of neural network based products used for recognition of sounds and images, data mining, and trend analysis.

 Applied Analytical Systems: A company that focuses mainly on neural networks, statistical analysis, and applied mathematics, as well as systems analysis and artificial intelligence consulting and development.

 Jurik Research: An Excel add-in that uses neural networks to strengthen predictions.

 NeuralWare: Offers neural network based analysis products and engineering services to companies, government agencies, industry, and academic institutions to help them solve data mining, classification, forecasting, and pattern-matching problems.

 Nonlinear Solutions Oy: Offers services based on nonlinear modeling, especially neural networks, including control systems and material behavior models, as well as custom applications, nonlinear model simulation software, know-how transfer, and industrial course materials.

How does a Neural Network work?

When an input is provided to a neural network, it returns an output. On the first attempt it cannot produce the correct output on its own, which is why, during the learning phase, each input comes with a label indicating which output the network should have guessed. If the guess is correct, the current settings are kept and the next input is supplied. If the resulting output does not match the label, the weights are modified; during learning, these are the only variables that can be changed. The procedure can be thought of as a set of knobs that are turned to different positions whenever an input is not guessed correctly. A method known as backpropagation is used to decide which weights should be modified: it involves walking back through the network, inspecting each connection, and seeing how the output would respond to a change in that weight. Furthermore, there is one more parameter that must be understood because it influences how the neural network learns: the learning rate. This variable controls how quickly the network learns, or, more precisely, how much it changes each weight on every update, whether in small increments or in larger steps.
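As a minimal sketch of this learning loop (a single artificial neuron trained with plain gradient descent on made-up labeled data; a real network has many layers and uses full backpropagation), the code below shows the comparison against the label, the weight update, and the role of the learning rate:

```python
import numpy as np

# Toy labeled inputs: the output should be 1 when the first feature is larger
X = np.array([[0.1, 0.9], [0.8, 0.2], [0.2, 0.7], [0.9, 0.1]])
y = np.array([0, 1, 0, 1])

rng = np.random.default_rng(0)
weights = rng.normal(size=2)
bias = 0.0
learning_rate = 0.5  # how far each weight moves on every update

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(1_000):
    for x_i, label in zip(X, y):
        prediction = sigmoid(weights @ x_i + bias)
        error = prediction - label                 # compare the guess with the label
        grad = error * prediction * (1 - prediction)
        weights -= learning_rate * grad * x_i      # turn the "knobs" a little
        bias -= learning_rate * grad

print("learned weights:", weights.round(2), "bias:", round(bias, 2))
```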

STATISTICAL ANALYSIS METHODS

Over the past ten years, the technology we work with has evolved dramatically. Few things look the way they did before, from the machines on our desks to the software that lets people interact. Another thing that is radically different is the amount of data we have at our disposal. What was previously scarce has become a seemingly insurmountable amount of information. That can be daunting if you do not understand how to examine your company's data to uncover real, meaningful insights. So how do you get from point A, where you have a lot of data, to point B, where you can effectively analyze it? It comes down to using the proper statistical analysis techniques, which are the means of gathering and processing data samples in order to find trends and patterns.

Five such techniques are covered here: the mean, standard deviation, regression, hypothesis testing, and sample size determination.

FIVE TECHNIQUES FOR IMPLEMENTING STATISTICAL ANALYSIS

Whether or not you are a data analyst, there is no denying that big data has captured the world's attention, so you need to figure out where to start. All five techniques below are simple but effective for making data-driven decisions.

1. MEAN: The mean, often known as the average, is the first technique used in statistical analysis. To calculate the mean, add up a list of numbers and divide the total by the number of items in the list. This technique makes it possible to identify a data set's overall trend and to get a quick, succinct view of the data, and it has the added benefit of being simple and fast to compute. The statistical mean locates the center point of the data being analyzed; the result is also known as the average of the data collected. In everyday situations, people regularly use the mean when discussing research, economics, and sports; consider how frequently a baseball player's strikeout rate is mentioned: that figure is a mean.

How to find it: Sum all of the numbers in the data set, then divide the total by the number of values in the set.
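A minimal sketch in Python (the values are invented purely for illustration):

```python
data = [23, 40, 31, 27, 34]

# Mean: sum of the values divided by how many values there are
mean = sum(data) / len(data)
print(mean)  # 31.0
```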

2. STANDARD DEVIATION: The standard deviation is a statistical tool for measuring how far data is spread around its mean. A high standard deviation means you are working with data that falls far from the mean. A low standard deviation, on the other hand, indicates that most of the data lies close to the mean, which can also be thought of as the set's expected value. Standard deviation is commonly used when determining the dispersion of data points. Suppose you are a salesperson who has just finished a market research survey. When you receive the results, you want to know how reliable the answers are so you can forecast whether a larger group of people would respond the same way. A low standard deviation indicates that the responses can be projected to a broader set of customers.

HOW TO FIND IT: σ² = Σ(x − μ)² / n, and the standard deviation σ is the square root of this variance.

σ = the standard deviation
Σ = the sum over the data set
x = each value in the data set
μ = the mean of the data
σ² = the variance
n = the total number of data points
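A minimal sketch using Python's standard library (population variance and standard deviation, matching the n in the denominator above; the data values are made up):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mu = statistics.mean(data)             # μ = 5
variance = statistics.pvariance(data)  # σ² = Σ(x − μ)² / n = 4
sigma = statistics.pstdev(data)        # σ = √4 = 2.0

print(mu, variance, sigma)
```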

3. REGRESSION: In statistics, regression models the relationship between a dependent variable and one or more predictor variables. It can also be expressed in terms of how one variable affects another, or how changes in one variable drive changes in another, as in cause and effect. It indicates that one or more variables have an impact on the outcome.

HOW TO FIND IT: y = a + b(x)

a = the y-intercept, or the value of y when x = 0
x = the independent (predictor) variable
y = the dependent variable
b = the slope of the line, or rise over run
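A minimal sketch of fitting a simple linear regression with NumPy (the x and y values are made-up illustration data):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])  # roughly y = 2x

# np.polyfit returns the slope (b) and intercept (a) of the least-squares line
b, a = np.polyfit(x, y, deg=1)
print(f"y = {a:.2f} + {b:.2f}x")

# Use the fitted line to predict y for a new x
print(a + b * 6)
```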

4. HYPOTHESIS TESTING: Hypothesis testing is used in statistical analysis to compare two sets of variables within a data collection. The technique is used to see whether a given thesis or result holds true for the data at hand, and it allows the data to be compared against alternative hypotheses and assumptions. It can also help predict how business actions will affect the company. A hypothesis test in statistics evaluates a quantity under a certain assumption. The test's outcome indicates whether the assumption holds or whether it has been violated; this assumption is called the null hypothesis, often written as hypothesis 0. Hypothesis 1, the alternative hypothesis, is any hypothesis that contradicts hypothesis 0. When you perform hypothesis testing, the results are statistically significant if they show that the outcome could not have happened by chance or random occurrence.

HOW TO FIND IT: The result of a statistical hypothesis test is interpreted through the p-value, the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true. The smaller the p-value, the stronger the evidence against the null hypothesis.
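A minimal sketch with SciPy (the two response samples and the 0.05 significance level are illustrative assumptions, not values from the text):

```python
from scipy import stats

# Hypothetical survey scores from two groups of customers
group_a = [72, 75, 78, 71, 74, 77, 73]
group_b = [68, 70, 65, 72, 66, 69, 71]

# Null hypothesis (H0): the two groups have the same mean response
t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```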



5. DETERMINING THE SAMPLE SIZE: In statistical analysis, the data set is often so large that collecting accurate data on every element of it is impractical. For this reason, analysts opt for sample size determination, which involves studying a smaller sample of the data. To do this successfully, we have to figure out how large the sample should be. If the sample size is too small, we will not get accurate results at the end of our analysis. We can use any of several data sampling strategies to arrive at the sample, for example by sending a survey to our customers and then randomly selecting a subset of the responses to evaluate. A sample that is excessively large, on the other hand, can waste time and resources. Cost, effort, and how easily the data can be gathered all factor into deciding the sample size.

HOW TO FIND IT: Unlike the other four statistical analysis methods, there is no one-size-fits-all formula for calculating sample size. However, there are several common guidelines to follow when determining it:

 Conduct a census when working with a small population.
 Use a sample size from a similar study; you may want to search academic databases for research comparable to your own.
 If you are performing a general study, you may be able to use an existing sample size table to your advantage.
 Use a sample size calculator to determine a representative sample.
 Just because there is no single formula that always works does not mean you cannot find one that fits your situation. Depending on what you know or do not know about the population in question, there are a variety of options; Slovin's and Cochran's formulas are two you might want to use (see the sketch after this list).
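A minimal sketch of Cochran's formula (the 95% confidence z-score, 5% margin of error, and p = 0.5 are common illustrative defaults, not values prescribed by the text):

```python
import math

def cochran_sample_size(z=1.96, p=0.5, e=0.05):
    """Cochran's formula: n0 = z^2 * p * (1 - p) / e^2."""
    return math.ceil((z ** 2) * p * (1 - p) / (e ** 2))

def finite_population_correction(n0, population):
    """Shrink n0 when the total population is small and known."""
    return math.ceil(n0 / (1 + (n0 - 1) / population))

n0 = cochran_sample_size()
print(n0)                                      # 385 for a very large population
print(finite_population_correction(n0, 2000))  # smaller sample for a population of 2,000
```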

DATA VISUALIZATION

As the "era of big data" accelerates, visualization becomes an ever more important tool for making sense of the billions of rows of data generated every day. Data visualization helps tell stories by transforming data into a more understandable form and highlighting trends and outliers. A good visualization tells a story by removing the noise from the data and emphasizing the most important facts. However, it is not as simple as dressing a graph up to make it look nicer. Effective data visualization requires a careful balance between form and function. The plainest graph may be too boring to attract any notice, or it may still make a powerful point; the most striking visualization may completely fail to convey the right message, or it may speak volumes. The data and the visuals must complement each other, and combining great analysis with great storytelling is an art.

TYPES OF DATA VISUALIZATION

Simple bar graphs or pie charts are typically the first things that come to mind when you think of data visualization. Although these are an important part of data visualization and a common starting point, the right visualization must be paired with the right set of data.

1. TEMPORAL: Data visualizations fall into the temporal category if they satisfy two conditions: they are linear and they are one-dimensional. Temporal visualizations commonly feature lines that stand alone or overlap one another, each with a start and an end time. Their advantage is that they are familiar from school and the workplace, which makes them easy to understand at a glance.

Examples:

 Scatter plots
 Polar area diagrams
 Time series sequences
 Timelines
 Line graphs
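A minimal sketch of a temporal visualization with Matplotlib (the monthly figures are invented for illustration):

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 150, 162, 171]  # made-up monthly figures

# A line graph is a classic temporal visualization: linear and one-dimensional
plt.plot(months, sales, marker="o")
plt.title("Monthly Sales")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.tight_layout()
plt.show()
```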



2. HIERARCHICAL: The hierarchical category includes data visualizations that order groups within larger groups. Hierarchical visualizations are the way to go if you want to display clusters of data, especially if they flow from a single origin point.

Examples:

 Tree diagrams
 Ring charts
 Sunburst diagrams

3. NETWORK: Datasets connect deeply with other datasets. Network visualizations show how nodes in a network relate to one another. To put it another way, they demonstrate relationships between datasets without relying on lengthy explanations.

Examples:

 Matrix charts
 Node-link diagrams
 Word clouds
 Alluvial diagrams

4. MULTIDIMENSIONAL: Multidimensional data visualizations, as the name implies, have multiple dimensions: there are always two or more variables in play. Because of their many concurrent layers and datasets, these tend to be the most vibrant and eye-catching visualizations, and they can help you distill a lot of information into a few key takeaways.

Examples:

 Scatter plots
 Pie charts
 Venn diagrams
 Stacked bar graphs
 Histograms

5. GEOSPATIAL: Geospatial or spatial data visualizations relate to real-world physical locations, overlaying familiar maps with different data points. These visualizations are often used to show sales or acquisitions over time, and they are best known for their use in political campaigns or for displaying market penetration in multinational corporations.

Examples:

 Flow maps
 Density maps
 Heat maps

DATA VISUALIZATION TOOLS

Data visualization tools make it simpler for designers to create visual representations of massive data sets. When working with data sets that contain thousands or millions of data points, automating at least part of the process makes a designer's job much easier.

These data visualizations can then be used in dashboards, annual reports, sales and marketing materials, investor slide decks, and almost anywhere else information needs to be digested quickly.

The best data visualization tools share a few features. The first is ease of use. Some data visualization programs are quite complicated to work with; the better ones pair robust features with good documentation and tutorials and are designed in a user-friendly way. Others, regardless of their other capabilities, are lacking in those areas, which excludes them from any list of the best tools.

COMPARISON OF DATA VISUALIZATION TOOLS



There are dozens, if not hundreds, of apps, tools, and programs available for visualizing large amounts of data. Most are fairly simple, and many of their features overlap. However, a few stand out, either because they are more capable in the kinds of visualizations they can produce or because they are substantially easier to use than the remaining choices.

 Tableau: Tableau offers a desktop application, server and hosted online versions, and a free public option, among other things. Data import options include CSV files, Google Ads and Analytics data, and Salesforce data.

 Infogram: Infogram is a drag-and-drop data visualization tool that even non-designers can use to create effective visualizations for marketing reports, infographics, social media posts, maps, dashboards, and much more. Finished visualizations can be exported in a variety of formats, including .PNG, .JPG, .GIF, .PDF, and .HTML. Interactive visualizations are also possible, making them well suited for embedding into websites and apps. Infogram additionally offers a WordPress plugin that simplifies embedding visualizations in WordPress sites.

 ChartBlocks: According to the company, data can be loaded from "anywhere" via ChartBlocks' API, including from live feeds. Although they claim data can be imported from any source "in just a few clicks," this is likely to be more involved than in applications that have automation modules or plugins for particular data sources.

 Google Charts: Google Charts is a free data visualization tool designed for creating interactive charts that can be embedded online. It works with dynamic data, and the outputs are pure HTML5 and SVG, so charts display on any device without additional plugins. Data sources include Google Spreadsheets, Google Fusion Tables, Salesforce, and other SQL databases.

 Polymaps: Polymaps is a JavaScript library dedicated to mapping. The outputs are dynamic, responsive maps in a variety of styles, from image overlays to symbol maps to density maps. Because the images are created with SVG, designers can style the map graphics using CSS. With so many options available, it can be difficult for designers to decide which visualization tool to use; they should weigh factors such as ease of use and whether a tool offers the functionality they require.

Conclusion

Big data expands our knowledge base, while machine learning improves our problem-solving abilities. Combined, the two offer the potential to transform entire businesses. To take advantage of this, organizations must also scale the rest of their tooling. People can make better decisions by programming machines to analyze data that is too large for humans to process alone.

It is no longer a secret that big data is a driving force behind many of the world's most successful technology companies. Yet as more businesses adopt it to collect, analyze, and extract value from massive amounts of data, it becomes increasingly difficult for them to make the best use of the information they gather.

That is where machine learning comes in handy. Machine learning systems thrive on data: the more data a system collects, the better it learns to serve the business. As a result, adopting artificial intelligence for advanced analytics is a logical next step for businesses looking to maximize the benefits of big data and cloud computing adoption.

