Advancement of Data Science

DATASCIENCE 1
Name of the Class
Professor's Name
University
City and State of University
Date
Table of Contents
1.0 Advancement of Data Science
2.0 Spark in Data Science
3.0 Machine Learning Implementation
3.1 Dataset
3.2 Collaborative filtering
3.3 Logistic Regression
1.0 Advancement of Data Science

Data science is an interdisciplinary field; that is, it draws on knowledge from other fields to
process and extract knowledge from structured and unstructured data (Donoho 2017). Data science
has evolved, and continues to evolve, as one of the most promising technologies and most
in-demand career fields of the 21st century. It is therefore essential for data professionals to
stay up to date on current advancements in the field so that they can uncover useful insights
for their organizations or companies.
Over the last two years, data science has advanced tremendously, as shown by the numerous
packages and libraries that the worldwide data science community has developed to speed up data
analysis. In 2018, a new version of PyTorch was released. PyTorch is a machine learning
framework for deep neural networks that supports both modeling and deployment and speeds up the
building of sophisticated data science pipelines. It is straightforward to learn and lets users
manipulate its computation graphs on the go.
Another significant advancement is in deep learning for computer vision, natural language
processing, and audio signal processing, among many other areas (Xu 2019). With the improvement
in deep learning for vision, results on several once-challenging problems, such as image
classification, object identification, and face recognition, have improved tremendously, so
users can now achieve quality results. Advances in audio signal processing, in turn, have made
it possible to analyze audio data and understand audio signals. Automated machine learning is
another significant advancement in the field of data science: it reduces the time that data
scientists spend solving critical application challenges. It allows professionals to apply
various machine learning models easily, leaving data scientists more time to focus on other
complex problems. Automated machine learning also reduces production costs in an organization
while delivering the same results, since it provides a predefined structure that applies the
relevant algorithms; following that structure reduces the time needed to produce accurate
results. Artificial intelligence (AI), a field that overlaps heavily with data science, has also
grown tremendously over the last two years: businesses all over the world have adopted AI in
their various business systems, and many others are strategizing on how they will adopt it.
There has also been increased growth in the use of AI-driven applications that enhance the
performance of organizations and businesses. The integration of AI in data science has led to
the development of AI-enabled tools that let companies perform statistical analysis of their
data efficiently, and the field continues to grow.
Data science is ever-evolving: as the trends above illustrate, the field continues to grow and
is more active than ever. As more innovation is realized year after year, new analytic tools are
created to improve operations within organizations. However, as the field continues to advance,
the tools that are developed have to adhere to data privacy regulations, to ensure that personal
privacy is not invaded and that companies do not mishandle personal data. Automated systems also
have to be explainable, so that the machine learning algorithms they use are easy to interpret
and can be standardized to give accurate results.
2.0 Spark in Data Science

Spark is a general-purpose data processing engine that can perform a wide range of functions,
such as machine learning, graph processing, and data processing, among many others (Drabas
2017). Data scientists use Spark to run ETL processes and SQL jobs across massive datasets.
Spark can handle huge datasets that are distributed over thousands of virtual servers. It has
various libraries as well as APIs that support multiple languages, such as Java and Python,
among others. Spark fits into data science in that it is capable of handling
petabytes of data at a time, and its libraries and APIs support multiple languages that are
used in data analysis and statistical computation. Compared to Hadoop MapReduce, Spark is
generally faster because it can keep intermediate data in memory, and it is therefore often
preferred over Hadoop. Spark also provides an interactive mode and a low-latency computing
framework, whereas Hadoop has a high-latency computing framework and no interactive mode. Spark
processes data in memory and can also use disk when needed, whereas MapReduce in Hadoop uses
only disk to perform its operations.
3.0 Machine Learning Implementation

3.1 Dataset
The recommendation engine is implemented with ALS. A small MovieLens dataset obtained from
Kaggle is used to train and validate the alternating least squares model, which recommends
movies to users based on the ratings they gave to movies they have previously watched
(Singh 2019).
3.2 Collaborative filtering

Collaborative filtering is used to recommend a movie or a specific product to a user based on
the interests of other users and the user's own preference information. Movie recommendations
are produced by building a model from users' ratings of previously watched movies and then
recommending titles based on their preferences and tastes.
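The idea behind a least-squares ratings model of this kind can be sketched in plain Python. The
example below is a minimal illustration, not the PySpark implementation (PySpark provides ALS as
`pyspark.ml.recommendation.ALS`). It factorizes a tiny, made-up ratings table with a single
latent factor per user and per movie (rank 1), alternately solving for the user factors with the
movie factors held fixed and then the reverse. The ratings, the regularization value `reg`, and
the names `als_rank1` and `squared_error` are assumptions made for illustration.

```python
def als_rank1(ratings, n_users, n_items, reg=0.1, n_iters=20):
    """Fit rating(u, i) ~ user_f[u] * item_f[i] by alternating least squares.

    ratings: dict mapping (user, item) -> observed rating.
    """
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(n_iters):
        # Hold the item factors fixed and solve the regularized
        # least-squares problem for each user factor in closed form.
        for u in range(n_users):
            num = sum(r * item_f[i] for (uu, i), r in ratings.items() if uu == u)
            den = reg + sum(item_f[i] ** 2 for (uu, i) in ratings if uu == u)
            user_f[u] = num / den
        # Hold the user factors fixed and solve for each item factor.
        for i in range(n_items):
            num = sum(r * user_f[u] for (u, ii), r in ratings.items() if ii == i)
            den = reg + sum(user_f[u] ** 2 for (u, ii) in ratings if ii == i)
            item_f[i] = num / den
    return user_f, item_f


def squared_error(ratings, user_f, item_f):
    """Sum of squared differences between observed and predicted ratings."""
    return sum((r - user_f[u] * item_f[i]) ** 2 for (u, i), r in ratings.items())


# Hypothetical ratings: (user, movie) -> rating on a 1-5 scale.
ratings = {
    (0, 0): 5.0, (0, 1): 4.0,
    (1, 0): 4.0, (1, 2): 2.0,
    (2, 1): 3.0, (2, 2): 2.0,
}
user_f, item_f = als_rank1(ratings, n_users=3, n_items=3)

# Predict a rating that was never observed: user 0 on movie 2.
predicted = user_f[0] * item_f[2]
```

A recommendation step would then rank, for each user, the unseen movies by their predicted
rating. A practical model would normally use more than one latent factor and a held-out
validation set.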
3.3 Logistic Regression

The dataset used to build the logistic regression classification model is the Pima Indians
Diabetes Database, downloaded from Kaggle, the world's largest data science community. The data
is loaded into a data frame with 9 columns, including a binary outcome variable that is either 1
or 0. The data is divided into a training set containing 80% of the data and a testing set
containing the remaining 20%, using a random seed of 12345.
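A seeded 80/20 split of this kind can be sketched in plain Python as follows. In PySpark the
corresponding call would typically be `DataFrame.randomSplit([0.8, 0.2], seed=12345)`. The
stand-in below only illustrates the idea on a hypothetical list of row identifiers and does not
reproduce Spark's sampling. The function name `split_rows` and the row count of 100 are
assumptions made for illustration.

```python
import random


def split_rows(rows, train_fraction=0.8, seed=12345):
    """Shuffle rows with a fixed seed and cut them into train and test lists."""
    rng = random.Random(seed)   # a fixed seed makes the split reproducible
    shuffled = list(rows)       # copy so the caller's data is not reordered
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))         # hypothetical row identifiers
train, test = split_rows(rows)
```

Calling `split_rows` again with the same seed returns the same two lists, which is what makes
the evaluation repeatable. Note that Spark's `randomSplit` samples rows randomly, so the sizes
it produces are approximately, rather than exactly, 80% and 20%.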
In PySpark's ml package, logistic regression is used to predict the binary outcome of an
experiment using binomial regression, or a multiclass outcome using multinomial regression. The
data is loaded into a Spark data frame, the feature and label columns are added, and then a
logistic regression model is fitted (Singh 2019). The model is evaluated to ensure that it
gives correct predictions at a reliably high rate.
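What a binary logistic regression model does can be illustrated without Spark. The sketch below
fits the model by plain gradient descent on a small, made-up dataset with one standardized
feature and then measures accuracy on that same data. It is not the PySpark implementation
(which is provided by `pyspark.ml.classification.LogisticRegression`), and the data, learning
rate, iteration count, and function names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, n_iters=1000):
    """Fit P(y=1 | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iters):
        # Gradient of the mean log-loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(w, b, x, threshold=0.5):
    """Return label 1 when the predicted probability reaches the threshold."""
    return 1 if sigmoid(w * x + b) >= threshold else 0


# Hypothetical standardized feature: label 1 goes with larger values.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(xs, ys)
accuracy = sum(predict(w, b, x) == y for x, y in zip(xs, ys)) / len(xs)
```

In practice the accuracy would be computed on the held-out testing set rather than on the
training data, so that the reported figure reflects how the model behaves on unseen records.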

Reference List
Donoho, D., 2017. 50 years of data science. Journal of Computational and Graphical
Statistics, 26(4), pp.745-766.
Drabas, T. and Lee, D., 2017. Learning PySpark. Packt Publishing Ltd.
Singh, P., 2019. Logistic Regression. In Machine Learning with PySpark (pp. 65-98). Apress,
Berkeley, CA.
Singh, P., 2019. Recommender Systems. In Machine Learning with PySpark (pp. 123-157).
Apress, Berkeley, CA.
Xu, J., 2019, August. Advancement of Data Analysis and Mining, Decision Support System, and
Computing Science Based on the Thirteenth ICMSEM Proceedings. In International
Conference on Management Science and Engineering Management (pp. 1-10). Springer,
Cham.
