Machine Learning and Data Analytics - Edited 1

MACHINE LEARNING AND DATA ANALYTICS
Machine Learning, Data Analytics and APT 37
Name of student
Institutional Affiliation
Cybersecurity threats are rising alongside rapid advances in technology, posing a
risk to the people and organizations that adopt it. At the same time, technology
organizations are quickly developing countermeasures against cyber threats, including machine
learning and data analytics. These two data science concepts are central to any
organization that values the safety of stakeholder information and transactions.
Firms realize that they need to harness existing data and transactions when making critical
decisions in order to compete and become more efficient. Machine learning and big data analytics
are key drivers in delivering on the promise of digitalization. This essay presents the key
concepts of big data analytics and machine learning, focusing on the main machine learning
methods and their specific uses. It also describes the applications of the two
concepts and their future directions.
Machine learning
Machine learning is a branch of artificial intelligence within data science, based on the
idea that systems can learn from data, identify patterns, and make inferences with minimal human
supervision (Shalev-Shwartz & Ben-David, 2014). The concept arose from pattern recognition
and the theory that computer systems can learn to perform certain tasks without explicit human
guidance. Its iterative aspect is crucial: the more a model is exposed to new data, the more it
can adapt independently. Models use previous computations to produce repeatable and reliable
results.
Applications of machine learning
There are several examples of the accomplishments of machine learning in the current
technological world. They include the much-hyped self-driving Google car, fraud
detection systems, online recommendations offered by Netflix and Amazon, and gauging
customers' opinions about a product, for instance, on Twitter. “Machine learning
uses both Bayesian analysis and data mining to automatically and quickly produce models which
can analyze and deliver complex and big data more accurately” (Shalev-Shwartz & Ben-David,
2014). Therefore, organizations can identify viable opportunities while avoiding risks and
vulnerabilities. The fundamentals for developing machine learning systems include algorithms,
data preparation capabilities, iterative and automated processes, ensemble modeling, and
scalability.
Most industries that hold and process large amounts of data recognize the need for
machine learning. Organizations use the technology to work efficiently by gleaning insights from
data, thus gaining an advantage over other firms in the industry. A leading adopter of machine
learning is the finance industry, where banks and other firms identify crucial data insights and
prevent fraud (Shalev-Shwartz & Ben-David, 2014). These insights are important for identifying
opportunities, deciding when investors should trade, and choosing a strategy in a competitive
market. Data mining helps identify persons with high-risk profiles, and cyber surveillance helps
flag signs of fraud. Secondly, government agencies, including
utilities and public safety, use machine learning to mine insights from their many data
sources. For instance, analyzing sensor data identifies ways to improve efficiency and
reduce overall cost. Governments also use the technology to detect fraudulent
activity and minimize theft.
Thirdly, the health industry is a core beneficiary of machine learning. Sensors and
wearable devices supply data that can be used to assess a patient's health and support
medication decisions in real time. The technology also helps medical experts analyze data to
identify trends and red flags that could lead to improved diagnoses and treatment. Fourthly, the
retail industry uses machine learning to analyze data on goods and services in order to
personalize shopping experiences and implement price optimization, marketing campaigns,
merchandise supply planning, and customer insights. Websites recommend items based on clients'
previous purchases and their personal profiles, for instance, whether the client is a socialite
or a student.
Types of machine learning
While there are several machine learning methods, the two most widely adopted are
supervised and unsupervised learning. A supervised learning algorithm is trained on
labeled examples, that is, inputs for which the desired output is already known. For instance, a
piece of equipment could have data points labeled as either 'R' for running or 'F' for failing.
The algorithm receives inputs alongside the corresponding correct outputs, and it learns by
comparing its actual output with the correct output to find errors and then modifying the model
accordingly. Supervised learning uses methods such as classification, regression, prediction,
and gradient boosting to learn patterns and predict the label values of unlabeled data
(Shalev-Shwartz & Ben-David, 2014). Some of
its applications include sales and marketing, security, asset maintenance, IoT, people analytics,
and entertainment.
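The error-correcting loop described above can be illustrated with a perceptron, one of the simplest supervised learners. The following is a minimal sketch, not a production model: the two features (temperature and vibration, scaled between 0 and 1), the sample readings, and the function names are all hypothetical and chosen only to mirror the 'R'/'F' example.

```python
# Hypothetical, made-up sensor readings: (temperature, vibration) scaled to 0..1,
# each labeled 'R' (running) or 'F' (failing).
TRAINING = [
    ((0.2, 0.1), "R"),
    ((0.3, 0.2), "R"),
    ((0.1, 0.3), "R"),
    ((0.8, 0.9), "F"),
    ((0.9, 0.7), "F"),
    ((0.7, 0.8), "F"),
]


def predict(weights, bias, features):
    """Return 'F' when the weighted sum of the features is positive, else 'R'."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return "F" if score > 0 else "R"


def train(examples, epochs=20, learning_rate=0.1):
    """Perceptron rule: compare the output with the label and adjust on errors."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            target = 1 if label == "F" else 0
            output = 1 if predict(weights, bias, features) == "F" else 0
            error = target - output  # 0 when the prediction was already correct
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


weights, bias = train(TRAINING)
print(predict(weights, bias, (0.85, 0.85)))  # unseen reading near the failing examples
```

Once trained, the model can label readings it has never seen, which is the sense in which supervised learning predicts labels for unlabeled data.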
Unsupervised learning is used to analyze data with no historical labels; the system is not
given the right answer and has to work out what the data shows (Shalev-Shwartz & Ben-David,
2014). The technology works well on transactional data. Its primary goal is to explore the data
and identify its internal structure. For instance, it can identify customer segments with
similar attributes that can be treated alike in marketing campaigns, or find the key
attributes that separate customer segments. Popular techniques include k-means clustering,
self-organizing maps, singular value decomposition, and nearest-neighbor mapping. Its
applications include data clustering and identification of fraudulent transactions.
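To make the customer segmentation example concrete, the sketch below implements a bare-bones k-means clustering in plain Python. The customer figures (annual spend and visits per month) are invented for illustration, and the initial centroids are simply the first k points, which is a simplification; real implementations usually pick the starting centroids more carefully.

```python
import math

# Hypothetical customers: (annual spend, store visits per month). No labels are given.
CUSTOMERS = [(100, 2), (900, 20), (120, 3), (950, 22), (110, 1), (880, 18)]


def k_means(points, k, iterations=10):
    """Group points into k clusters by repeatedly assigning and re-averaging."""
    centroids = [points[i] for i in range(k)]  # simplification: first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


centroids, clusters = k_means(CUSTOMERS, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

The algorithm is never told which customers are low or high spenders; the two segments emerge from the structure of the data alone.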
Other machine learning techniques include semi-supervised and reinforcement learning.
Semi-supervised learning functions like supervised learning, except that it uses both labeled and
unlabeled data for training. It uses methods like regression, prediction, and classification. It is
useful when labeling is too costly to allow a fully labeled training process. One
application is identifying a person's face on a webcam. Reinforcement learning uses trial and
error to identify the actions that produce the greatest rewards. It is often used for navigation,
robotics, and gaming. It has three primary components: the agent (the learner or decision maker),
the actions (what the agent can do), and the environment (everything the agent interacts with).
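The three components can be seen in a small tabular Q-learning sketch. Everything here is an illustrative assumption: the environment is a made-up five-cell corridor with a reward at the right-hand end, the agent explores purely at random, and the learning rate and discount values are arbitrary choices.

```python
import random

N_STATES = 5        # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)  # the agent's actions: step left or step right


def step(state, action):
    """Environment: move within the corridor and reward reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def train(episodes=300, alpha=0.5, gamma=0.9, seed=0):
    """Agent: learn action values (Q-values) by trial and error."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        for _ in range(100):              # cap the length of an episode
            action = rng.choice(ACTIONS)  # explore by acting at random
            next_state, reward = step(state, action)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            # Q-learning update: nudge the estimate toward reward + discounted future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
            if state == N_STATES - 1:
                break
    return q


q_table = train()
# The learned policy picks, in each non-goal cell, the action with the highest value.
policy = [max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)]
print(policy)
```

The agent is never told that moving right is correct; it discovers this because actions that lead toward the goal accumulate higher estimated rewards.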
Data analytics
Data analytics is a branch of data science that analyzes raw data to draw
conclusions from it. With advances in technology, many data analytics
techniques have been automated into mechanical processes and algorithms (Russom, 2011).
The data analysis process involves several steps. The first is determining the data requirements,
or how the data is grouped; for example, data may be separated by age, gender, or income. The
second is collecting the data from sources including cameras, computers, personnel, and online
sources. The third is organizing the data for analysis, either in a
spreadsheet or in other software. Finally, the data is cleaned before analysis, which includes
scrubbing and cross-checking it to remove errors and duplication. Data analytics is crucial for
businesses in performance optimization.
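The cleaning and grouping steps can be sketched in a few lines of Python. The records, field names, validity rule (ages between 0 and 120), and age bands below are hypothetical and stand in for whatever a real data set and business question would require.

```python
# Hypothetical collected records, including one duplicate and one obvious error.
RAW_RECORDS = [
    {"customer": "A01", "age": 23, "income": 28000},
    {"customer": "A02", "age": 41, "income": 52000},
    {"customer": "A02", "age": 41, "income": 52000},  # duplicate entry
    {"customer": "A03", "age": -5, "income": 39000},  # impossible age
    {"customer": "A04", "age": 67, "income": 31000},
]


def clean(records):
    """Drop repeated customers and records whose age fails a sanity check."""
    seen, cleaned = set(), []
    for record in records:
        if record["customer"] in seen or not 0 <= record["age"] <= 120:
            continue
        seen.add(record["customer"])
        cleaned.append(record)
    return cleaned


def age_group(age):
    """Map an age to an assumed demographic band."""
    if age < 30:
        return "under 30"
    if age < 60:
        return "30 to 59"
    return "60 and over"


def group_by_age(records):
    """Organize cleaned records into demographic groups for analysis."""
    groups = {}
    for record in records:
        groups.setdefault(age_group(record["age"]), []).append(record["customer"])
    return groups


groups = group_by_age(clean(RAW_RECORDS))
print(groups)
```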
Types of data analytics
There are four basic types of data analytics. The first is descriptive analytics, which
describes what happened over a given period: for instance, whether the number of clients has
risen, or whether sales in a given month were higher or lower than in the previous month. The
second is diagnostic analytics, which focuses on why something happened. This type involves more
varied data inputs and some hypothesizing (Russom, 2011). For instance, did the latest marketing
campaign impact sales? Did the weather affect product sales? The third is predictive analytics,
which describes what is likely to happen in the future. What were the average sales of groceries
in previous quarters? Is an increase in sales likely in the forthcoming quarter? The fourth is
prescriptive analytics, which recommends a course of action. For instance, if
the Covid-19 pandemic is likely to end soon, we should expand our coffee branches across the city.
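The difference between describing, predicting, and prescribing can be shown on a single made-up sales series. In this sketch the figures, the straight-line trend used for the forecast, and the 5% growth threshold are all assumptions chosen for illustration; diagnostic analytics is left out because it needs additional inputs such as campaign or weather data.

```python
MONTHLY_SALES = [100.0, 110.0, 120.0, 130.0]  # hypothetical units sold per month


def descriptive(sales):
    """What happened: change in sales between the last two months."""
    return sales[-1] - sales[-2]


def predictive(sales):
    """What is likely: next month's sales from a least-squares straight line."""
    n = len(sales)
    mean_x = (n - 1) / 2
    mean_y = sum(sales) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in enumerate(sales))
             / sum((x - mean_x) ** 2 for x in range(n)))
    intercept = mean_y - slope * mean_x
    return intercept + slope * n


def prescriptive(sales, growth_threshold=0.05):
    """What to do: recommend expanding only if forecast growth beats a threshold."""
    growth = (predictive(sales) - sales[-1]) / sales[-1]
    return "expand" if growth > growth_threshold else "hold"


print(descriptive(MONTHLY_SALES), predictive(MONTHLY_SALES), prescriptive(MONTHLY_SALES))
```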
Applications of data analytics
A major sector that has adopted data analytics is the travel and hospitality
industry. The industry collects customer data to identify problems, ways to fix them,
and possible opportunities. The healthcare industry combines large volumes of structured and
unstructured data to draw inferences. Equally, the retail industry uses copious amounts of data
to meet evolving consumer demands by identifying trends and recommending products.
The application of machine learning and data analytics in cybersecurity is
crucial given the fast growth of cyber-attacks that target user data. Machine learning can
continually analyze large amounts of data to detect malware that could lead to a security
breach. Since current threats have evolved to overcome earlier cybersecurity
measures, machine learning uses data to build a better understanding of a system while protecting
sensitive data. Additionally, machine learning can be supplemented with behavioral analysis
to improve the visibility of threats. Likewise, data analytics boosts cybersecurity by
enhancing several data-driven functions. Insights from big data analytics tools are crucial in
threat detection. Because the technology helps visualize threats, experts can gauge the
intensity of cybersecurity threats by examining data sources, patterns, and trends.
Also, intelligent data analytics supports the development of predictive models,
which issue alerts when a risk appears at a possible entry point for a cyber-attack.
Cybersecurity companies using machine learning and data analytics

The cybersecurity industry is growing rapidly as organizations seek to protect their systems,
data, and networks from the rising cybercrime losses recorded annually. Since effective
information security demands fast detection, several cybersecurity organizations are raising the
bar using machine learning and data analytics, both components of artificial intelligence. Some
of these companies, located in major states in the United States, include the following:
Versive
The company is located in Seattle, Washington, and helps organizations identify critical
threats, saving time and resources that would otherwise be spent investigating alerts. Its
security system uses machine learning and artificial intelligence to separate critical risks
from routine network activity, identify activity chains that lead to attacks, and help the
response team counter the attacks.
Darktrace
The firm is located in San Francisco, California, and helps companies in diverse
industries detect and counter cyber threats in real time. Its machine learning capability
analyzes network data to make calculations and recognize patterns, detecting threats as
deviations from normal behavior.
Fortinet
The organization, based in Sunnyvale, California, provides security solutions for every part
of the IT infrastructure. Its web application firewall uses machine learning and
statistical probabilities to accurately detect threats (Schroer, 2019). Its offerings include web
and network application security, secure unified access, and threat protection.
Blue Hexagon
The company, located in California, is founded on the belief that deep learning will
change cybersecurity. It offers real-time network threat protection to its clients. The firm uses
machine learning and data analytics to create malware based on dark web and global threat
data in order to test its systems and push their capacity to the limit (Schroer, 2019). It works
in both networks and the cloud to cover a variety of threats across multiple platforms. Blue
Hexagon CTO Saumitra Das recognizes the need for data analytics and machine learning and says,
"When threats look different, our firewalls and controls can't keep up because they need to know
what bad looks like before they can block it."
Callsign
The company uses machine learning to validate an individual's identity from a swipe
on a screen, several keystrokes on a keyboard, or several locations. The company
combines fraud analytics and multi-factor authentication to fight fraudulent activities (Schroer,
2019). The firm's platform collects multiple data points, including location, telecoms, and
behavioral data, to correlate identity traits and combine them with threat analysis information
to ensure data safety.
I would recommend almost all of these organizations to the CTO, since their integration of
machine learning and data analytics makes cybersecurity more efficient. Additionally, they
reduce costs by automating breach detection at an early stage, so fewer costs are incurred
from a breach. Specifically, I recommend Callsign, since it analyzes
large amounts of data from diverse perspectives, minimizing the
likelihood of malware getting into systems undetected.
Using Machine Learning and Data Analytics to prevent APT 37

An advanced persistent threat (APT) is an attack in which an intruder gains access to a system
and remains there undetected for a long time. The key aim is data theft: the attacker accesses
the system and stays under the radar without damaging it. One notable APT group
is APT 37, which is attributed to North Korea and was exposed by FireEye after it used an Adobe
Flash zero-day vulnerability against South Korean computer systems to gather intelligence.
APT 37, also known as Reaper, is a cyberespionage group assessed to operate on behalf of the
North Korean government, giving the state a cyber capability that can be used for global and
political advantage. It has been active since at least 2012 and has focused
on public and private organizations in South Korea.
APT 37 used various tactics to distribute its malware, from compromising websites to sending
spear-phishing emails. In 2017, APT 37 sent a spear-phishing email
to a board member of a finance company. The email contained a Microsoft Office
attachment that redirected to a malicious website, which eventually installed a backdoor
known as SHUTTERSPEED. The backdoor enabled APT 37 to gain entry to the victim's
computer system, take screenshots of the computer, and install further malicious
applications. APT 37 used other channels, for instance file-sharing sites, to spread malware
including POORAIM and KARAE. APT 37 also targeted Korean reunification groups, taking
advantage of the willingness of these organizations to cooperate. Thirdly, APT 37 frequently
exploited Hangul Word Processor due to its wide use in South and North Korea. Another tactic
used by the cyberespionage group is exploiting applications like Flash by waiting for
vulnerabilities to be published and then attacking unpatched systems. Much of APT 37's malware
is built to keep its command-and-control traffic inconspicuous; for example, the DOGCALL
backdoor sends its call-home messages through multiple cloud providers. The group uses multiple
malware families in its initial intrusion and exfiltration.
The success of APT 37 was aided by the lack of measures to detect and counter the
malware at an early stage. However, advances in technology and cybersecurity have led
to the integration of artificial intelligence into the fight against cyber-attacks. If machine
learning and data analytics had been integrated into the cybersecurity of the targeted computer
systems, the affected organizations could have been spared some of the losses associated with
the cyber-espionage in several ways.
APT 37 mounts multi-faceted attacks that require several security tools to counter. It
is imperative to have endpoint protection, user and entity behavior analytics, traffic
monitoring, and email filtering. Some of the techniques used by APT 37 include audio capture,
code signing, commonly used ports, zero-day vulnerabilities, and spear-phishing attachments.
Audio capture is the capacity to record audio in order to collect information. Code signing was
used to make tools and malware appear to be legitimate binaries, while commonly used ports were
used to blend in with usual network activity and avoid detection. A zero-day vulnerability is a
security flaw for which the vendor has not yet released a fix, whereas a spear-phishing
attachment delivers malware as an email attachment in the form of a Microsoft Office document,
archive, executable, or PDF file. Common APT 37 malware includes DOGCALL, a
backdoor distributed as an encoded binary file, and RUHAPPY, a wiper tool that attempts to
overwrite the MBR.
Machine learning and APT 37
APT 37 frequently used zero-day exploits and malware to compromise computer
systems. Its success is ascribed to its multiple attack points and its ability to remain
unnoticed in a system for a long time. Machine learning and data analytics could have aided in
the detection and prevention of the malware. The two technologies give experts user behavior
analytics that can recognize anomalies and suspend accounts. The activity is then written
into the incident response report. These techniques could have limited APT 37 attacks, since the
malware would have changed the user's pattern of activity, leading to the suspension of the
account. The constantly changing malware of APT 37 gets into most computer systems because
it was designed to evade signature-based and heuristic detection. Integrating a
deep learning system with machine learning and data analytics could have detected the threats
even if they were previously unknown. Thus, if a user happened to click a malware-infected email
attachment from APT 37, the system would quickly detect and neutralize the threat.
First, machine learning would have improved accuracy in cybersecurity, since
combining more algorithms tends to improve accuracy. Targeted malware, for instance Project
Sauron, often displays diverse behaviors, so many different approaches are needed to identify
it (Eke, Petroviski, & Ahriz, 2019). Therefore, results from varied
algorithms need to be correlated to secure a high detection rate. Trained algorithms, including
neural networks, deep learning, binary decision trees, centroids, and perceptrons, improve the
detection capacity of a cybersecurity system. These algorithms are specialized by task:
some focus on new malicious files, some minimize false positives, and some focus on
specific malware families.
Secondly, machine learning could have detected suspicious network traffic at an early
stage. The detection of zero-days begins with the assumption that endpoint activity in an
enterprise setting is predictable (Eke, Petroviski, & Ahriz, 2019). Thus,
recognizing malicious activity should begin with an analysis of typical network
behavior. Machine learning helps process large amounts of data in real time to establish a
normal baseline, compare observed behavior against it, and identify abnormal deviations
over time (Eke, Petroviski, & Ahriz, 2019). Experts could use the technology to
analyze unusual activity in internet traffic, proxy logs, and IPS alerts.
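A very small version of the baseline-and-deviation idea is a z-score check. The sketch below is only an illustration under stated assumptions: the traffic figures (kilobytes sent per minute by one endpoint) are invented, and the threshold of three standard deviations is a common rule of thumb rather than a value taken from the cited sources.

```python
import statistics

# Hypothetical history of outbound traffic (KB per minute) for one endpoint.
BASELINE = [100, 102, 98, 101, 99, 100, 103, 97]


def find_anomalies(baseline, observations, threshold=3.0):
    """Return (index, value) pairs that deviate strongly from the baseline."""
    mean = statistics.mean(baseline)
    spread = statistics.stdev(baseline)
    anomalies = []
    for index, value in enumerate(observations):
        z_score = (value - mean) / spread  # distance from normal, in standard deviations
        if abs(z_score) > threshold:
            anomalies.append((index, value))
    return anomalies


# New measurements: the second one could indicate bulk data leaving the network.
print(find_anomalies(BASELINE, [101, 250, 99]))
```

Real systems learn far richer baselines across many features, but the principle is the same: normal behavior is modeled first, and alerts come from departures from it rather than from known signatures.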
Thirdly, machine learning could have helped in filtering malicious email. APT 37 often took
advantage of zero-day exploits in commonly used applications. In most cases, an unsuspecting
user would be redirected from the email to a malicious website. A trained machine learning model
sorts email and separates out the malware-infected messages based on the rules it learns from
pre-classified samples (Eke, Petroviski, & Ahriz, 2019). Systems that integrate machine learning
and artificial intelligence use both local and cloud-based detection procedures to analyze SMTP
connection information (that is, the sender IP address and domain name), header information,
URLs, phone numbers, and attachments. The filters used in the analysis interact to cover
the variants and scenarios of already-detected spam waves.
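Learning from pre-classified samples can be sketched with a naive Bayes text classifier, a classic technique for mail filtering. This is an assumed stand-in for whatever model a real product uses; the four training messages are invented, and a real filter would also use the header, URL, and attachment features mentioned above.

```python
import math
from collections import Counter

# Hypothetical pre-classified samples.
SAMPLES = [
    ("urgent invoice open attachment now", "malicious"),
    ("click link to verify account now", "malicious"),
    ("meeting agenda for monday", "clean"),
    ("lunch on monday with the team", "clean"),
]


def train(samples):
    """Count how often each word appears in each class of message."""
    word_counts = {"malicious": Counter(), "clean": Counter()}
    class_counts = Counter()
    for text, label in samples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocabulary = set(word_counts["malicious"]) | set(word_counts["clean"])
    return word_counts, class_counts, vocabulary


def classify(text, model):
    """Pick the class with the highest log-probability (add-one smoothing)."""
    word_counts, class_counts, vocabulary = model
    total_messages = sum(class_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        score = math.log(class_counts[label] / total_messages)
        denominator = sum(counts.values()) + len(vocabulary)
        for word in text.split():
            score += math.log((counts[word] + 1) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(SAMPLES)
print(classify("open the attachment to verify account", model))
```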
Additionally, neural networks, a branch of machine learning, could have been used to
filter URLs. APT 37 succeeded partly because malware can hide in a multi-layered proxy network
and use evasion measures such as serving new binaries at every access. Thus, there
was a need for network-based filtering, which is efficient when paired with a cloud-based
service. Because a malicious URL is accessible only for a short time, the list of malicious URLs
and domains must be pushed swiftly to clients' computers. The success of APT 37 is also
attributed to reliance on static URL blacklisting and whitelisting, which is susceptible to
sophisticated evasion techniques; blocking such attacks requires a more dynamic and complex
countermeasure (Eke, Petroviski, & Ahriz, 2019).
Security experts could have used unsupervised machine learning to cluster
domains or URLs in order to identify domain generation algorithms, which attackers build into
malware to generate domains that act as meeting points with command-and-control servers.
Clustering partitions a data set into homogeneous groups based on
certain features; it seeks to structure volumes of unlabeled data into sets sharing
certain properties. The clustering algorithm should be suitable for big data, specific, and able
to cope with complications in the data. Every URL is analyzed using a special hash calculated
from it, the URL fingerprint. During this process, the system classifies the fingerprint using
multiple diverse algorithms, each with its own detection rate, and each algorithm votes for one
of two labels, malicious or clean. Subsequently, a voting system determines
whether the fingerprint is malicious. The scores from each algorithm are
totaled to get the final score, and based on the result, the URL is labeled as either clean or
malicious.
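The fingerprint-and-vote scheme can be sketched as follows. The three detectors, their weights, and the decision threshold are hypothetical simplifications invented for this example; production systems would use trained models rather than hand-written rules, and the SHA-256 fingerprint here is simply one reasonable choice of hash.

```python
import hashlib
from urllib.parse import urlparse


def fingerprint(url):
    """Hash a normalized form of the URL so equivalent URLs share one fingerprint."""
    parsed = urlparse(url.lower())
    normalized = parsed.netloc + parsed.path.rstrip("/")
    return hashlib.sha256(normalized.encode()).hexdigest()


def host_is_ip(url):
    """Vote: the host is a raw IPv4 address instead of a domain name."""
    parts = (urlparse(url).hostname or "").split(".")
    return len(parts) == 4 and all(part.isdigit() for part in parts)


def many_subdomains(url):
    """Vote: the host has an unusually deep chain of subdomains."""
    return (urlparse(url).hostname or "").count(".") >= 4


def suspicious_keyword(url):
    """Vote: the URL contains words often used in credential-phishing lures."""
    return any(word in url.lower() for word in ("login", "verify", "update"))


# Each hypothetical detector carries a weight reflecting how much its vote counts.
DETECTORS = [(host_is_ip, 0.5), (many_subdomains, 0.3), (suspicious_keyword, 0.3)]


def classify(url, threshold=0.5):
    """Total the weighted votes and label the URL's fingerprint."""
    score = sum(weight for detector, weight in DETECTORS if detector(url))
    label = "malicious" if score >= threshold else "clean"
    return fingerprint(url), label


print(classify("http://192.168.10.5/verify")[1])
print(classify("https://example.com/docs")[1])
```

Keying the verdict to a fingerprint rather than the raw string means a verdict can be cached and shared with clients quickly, which matters when malicious URLs are short-lived.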
Data Analytics in countering APT 37
At the time of the APT 37 attacks, analytic tools could have been used to keep up with the
analysis of large data sets that possibly carried malware. During an incident investigation, the
cybersecurity incident response team ought to follow the thread of events through logged data
along a path interwoven through the Microsoft domain, edge devices, switches, security devices,
and routers (Rice & Ringold, 2015). In every security event, time is key to stopping the illegal
exfiltration of data from the network. An organization must be able to quickly query
large data sets to identify, investigate, and counter a cyber-event. Queries against large data
sets should be dynamic, and their results presented swiftly.
Every cybersecurity organization needs to manage big data for security analysis to be
successful. APT 37 succeeded partly because experts could not analyze the large data sets
that contained malicious attachments. A data analytics mechanism could have performed real-time
analysis of this data and separated malicious from legitimate data and emails (Rice &
Ringold, 2015). The volume of log data being collected grows with
the number of log sources. This growth is not linear: individual systems generate
more and more data, and new systems are also added. Thus, if all the systems send logs to a
centralized system, the data volume becomes hard to manage.
Big data analytics could have provided the capacity to correlate logged events by
user behavior and time across the entire spectrum of technologies and devices. By comparison,
traditional SIEM tools could not provide this capability because they organized data into
databases that eventually became too big and cumbersome to query and analyze. Big data
analytics tools that experts could have used include IBM QRadar Security Intelligence
Platform and Splunk Enterprise, which let a cybersecurity incident response team follow
the thread from active IP sessions on the network edge, to the behavior of compromised user
account credentials, to a known event such as a malware alert (Rice & Ringold, 2015).
None of these activities would be possible without logging.
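Correlating events by user and time can be illustrated without any particular product. The sketch below does not use the QRadar or Splunk interfaces; it is plain Python over invented log entries, and the five-minute window and the rule "a malware alert plus activity from at least one other log source" are assumptions chosen for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical events from different log sources: (timestamp, source, user, event).
EVENTS = [
    ("2024-01-01T09:00:10", "email_gateway", "user_a", "attachment_opened"),
    ("2024-01-01T09:01:30", "endpoint", "user_a", "malware_alert"),
    ("2024-01-01T09:03:00", "firewall", "user_a", "outbound_connection"),
    ("2024-01-01T09:02:00", "firewall", "user_b", "outbound_connection"),
    ("2024-01-01T14:00:00", "endpoint", "user_b", "malware_alert"),
]


def correlate(events, window=timedelta(minutes=5)):
    """For each user, gather events from all sources that surround a malware alert."""
    by_user = defaultdict(list)
    for timestamp, source, user, event in events:
        by_user[user].append((datetime.fromisoformat(timestamp), source, event))

    incidents = {}
    for user, items in by_user.items():
        items.sort()  # order each user's events by time
        alert_times = [time for time, _, event in items if event == "malware_alert"]
        for alert_time in alert_times:
            related = [item for item in items if abs(item[0] - alert_time) <= window]
            # Report only when more than one log source backs up the alert.
            if len({source for _, source, _ in related}) >= 2:
                incidents[user] = related
    return incidents


for user, thread in correlate(EVENTS).items():
    print(user, [(source, event) for _, source, event in thread])
```

Here the alert for the first user is tied to an opened attachment and an outbound connection minutes apart, which is the kind of thread an incident response team needs to follow; the second user's isolated alert is not escalated by this rule.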
Additionally, big data analytics would have made comprehensive logging practical.
Enterprises can generate close to a million log events in a day, depending on the organization's
activities. Most security programs keep logs from devices at the network edge, including
intrusion detection systems and firewalls (Rice & Ringold, 2015). The logging level should be
high enough to ensure the visibility of APT traffic entering or leaving the network
environment.
Conclusion
Machine learning and big data analytics play a crucial role in cybersecurity
in enterprises that work with big data. Planning, processes, and stakeholders are key in using
large data sets to win against APTs. Chief information security officers (CISOs) should use
novel analytical tools to correlate activities across various data sources and decide whether
they are malicious. The two technologies work together, since machine learning's decision-making
algorithms make sense of the big data gathered by analytics. Additionally, machine learning
categorizes the data, identifies patterns, and draws inferences from the data to support
essential decisions.
References
Dragos, G. (2016, December 13). How machine learning helps fight advanced persistent threats.
Retrieved from PCWorld:
https://www.pcworld.idg.com.au/article/611236/how-machine-learning-helps-fight-advance-persistent-threats
Eke, H. N., Petroviski, A., & Ahriz, H. (2019). The use of machine learning algorithms for
detecting advanced persistent threats. Security of Information and Networks, 1-8.
Rice, A., & Ringold, J. (2015, January 30). Defend against APTs with big data security analytics.
Retrieved from Information Security:
https://searchsecurity.techtarget.com/feature/Defend-against-APTs-with-big-data-security-analytics
Russom, P. (2011). Big data analytics. TDWI best practices report, fourth quarter, 19(4), 1-34.
Schroer, A. (2019, July 12). 30 companies merging AI and cybersecurity to keep us safe and
sound. Retrieved from Artificial intelligence cybersecurity:
https://builtin.com/artificial-intelligence/artificial-intelligence-cybersecurity
Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding machine learning: From theory to
algorithms. Cambridge University Press.
