
DATASCIENCE 1

Name of the Class

Professor's Name

University

City and State of University

Date

Table of Contents
1.0 Advancement of Data science
2.0 Spark in Data Science
3.0 Machine Learning Implementation
3.1 Dataset
3.2 Collaborative filtering
3.3 Logistic Regression

1.0 Advancement of Data science


Data science is an interdisciplinary field: it draws on knowledge from other fields to process
and extract knowledge from structured and unstructured data (Donoho 2017). It has evolved, and
continues to evolve, into one of the most promising technologies and most in-demand career
fields of the 21st century. It is therefore essential for data professionals to stay up to date
with current advancements in the field so that they can uncover useful insights for their
organizations.

Over the last two years, data science has advanced tremendously, as shown by the numerous
packages and libraries that the worldwide data science community has developed to speed up
data analysis. In 2018, a new version of PyTorch was released. PyTorch is a machine learning
framework for deep neural networks that supports both modeling and deployment and speeds up
the building of sophisticated data science pipelines. It is straightforward to learn and lets
users build and modify computation graphs on the go.

Another significant advancement is in deep learning for computer vision, natural language
processing, and audio signal processing, among many other areas (Xu 2019). With the improvement
in deep learning for vision, results on several problems that were challenging in the past,
such as image classification, object identification, and face recognition, have improved
tremendously, and users can now achieve high-quality results. Advances in audio signal
processing, in turn, have made it possible to analyze audio data and understand audio signals.
Automated machine learning is another significant advancement in the field; it reduces the time
that data scientists spend solving critical application challenges. It lets professionals apply
various machine learning models easily, leaving them more time to focus on other complex
problems. Automated machine learning also lowers production costs in an organization while
delivering the same results, since it provides a predefined structure that applies the relevant
algorithms and so reduces the time needed to produce accurate results. Artificial intelligence,
which is closely related to data science, has also grown tremendously over the last two years;
businesses all over have adopted artificial intelligence in their various business systems, and
many others are strategizing on how they will adopt AI. There has also been increased growth in
the use of AI-driven applications that enhance the performance of organizations and businesses.
The integration of AI in data science has led to the development of AI-enabled tools that let
companies perform statistical analysis of their data efficiently, and the field continues to
grow.

Data science is ever-evolving: as the trends above illustrate, the field continues to grow and
is more active than ever. As more innovation is realized year after year, various analytic
tools are created to improve operations within organizations. However, as the field continues
to advance, the tools that are developed must adhere to data privacy regulations, to ensure
that personal privacy is not invaded and that companies do not mishandle personal data.
Automated systems also have to be explainable: the algorithms used in machine learning should
be easy to interpret and should allow for standardization in order to give accurate results.

2.0 Spark in Data Science


Spark is a general-purpose data processing engine that can perform a wide range of functions,
such as machine learning, graph processing, and data processing, among many others (Drabas and
Lee 2017). Data scientists use Spark to run ETL processes and SQL jobs across massive datasets.
Spark can handle huge datasets, up to petabytes of data, distributed over thousands of virtual
servers. It also has various libraries and APIs that support multiple languages used in data
analysis and statistical computation, such as Java and Python, which is why it fits well in
data science. Compared to Hadoop MapReduce, Spark is generally faster and takes less time to
execute because it keeps intermediate data in memory, and it is therefore often preferred over
Hadoop. Spark also provides an interactive mode and a low-latency computing framework, whereas
Hadoop MapReduce is a high-latency computing framework and does not have an interactive mode.
Spark processes data in memory and can also use disk, whereas MapReduce in Hadoop uses only
disk to perform its operations.

3.0 Machine Learning Implementation


3.1 Dataset
To implement the recommendation engine with ALS, a small MovieLens dataset obtained from Kaggle
is used to train and validate the alternating least squares (ALS) model, which recommends
movies to users based on the ratings they gave to movies they watched previously (Singh 2019).

3.2 Collaborative filtering


Collaborative filtering is used to recommend a movie, or a specific product, to a user based on
the interests of other users and the preference information of that user. Movie recommendations
are produced by first building a model of the users' ratings of previously watched movies and
then recommending titles that match their preferences and tastes.
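
The alternating idea behind ALS can be illustrated with a minimal plain-Python sketch. It is
not the Spark implementation used in this paper: it assumes a single latent factor per user and
per movie (rank 1), an arbitrary regularization value, and made-up toy ratings, and the names
`als_rank1` and `predict` are hypothetical. Each pass holds one set of factors fixed and solves
a small regularized least-squares problem for the other, which is what "alternating" refers to.

```python
from collections import defaultdict


def als_rank1(ratings, n_iters=50, reg=0.01):
    """Fit rating[user, movie] ~ user_f[user] * movie_f[movie] (rank-1 ALS).

    `ratings` maps (user, movie) -> observed rating; `reg` is an L2 penalty.
    """
    by_user = defaultdict(list)
    by_movie = defaultdict(list)
    for (user, movie), r in ratings.items():
        by_user[user].append((movie, r))
        by_movie[movie].append((user, r))

    user_f = {u: 1.0 for u in by_user}
    movie_f = {m: 1.0 for m in by_movie}

    for _ in range(n_iters):
        # Step 1: hold movie factors fixed; each user factor then has a
        # closed-form regularized least-squares solution.
        for u, obs in by_user.items():
            num = sum(r * movie_f[m] for m, r in obs)
            den = reg + sum(movie_f[m] ** 2 for m, _ in obs)
            user_f[u] = num / den
        # Step 2: hold user factors fixed and solve for each movie factor.
        for m, obs in by_movie.items():
            num = sum(r * user_f[u] for u, r in obs)
            den = reg + sum(user_f[u] ** 2 for u, _ in obs)
            movie_f[m] = num / den
    return user_f, movie_f


def predict(user_f, movie_f, user, movie):
    """Predicted rating for a (user, movie) pair, rated or not."""
    return user_f[user] * movie_f[movie]


# Hypothetical toy ratings (not the MovieLens data); three pairs are unobserved.
toy_ratings = {
    ("alice", "m1"): 4.0, ("alice", "m2"): 2.0,
    ("bob", "m1"): 2.0, ("bob", "m3"): 2.5,
    ("carol", "m2"): 1.6, ("carol", "m3"): 4.0,
}
user_f, movie_f = als_rank1(toy_ratings)
print(predict(user_f, movie_f, "alice", "m3"))  # estimate for a movie alice has not rated
```

In Spark, the same alternating scheme is available for a configurable rank through the `ALS`
estimator in `pyspark.ml.recommendation`; the sketch above only shows the principle on a scale
small enough to follow by hand.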

3.3 Logistic Regression


The dataset used to build the logistic regression classification model is the Pima Indians
Diabetes Database, downloaded from Kaggle, the world's largest data science community. The data
is loaded into a data frame with 9 columns, including a binary outcome variable that is either
1 or 0. The data is divided into a training set with 80% of the data and a testing set with the
remaining 20%, using a seed of 12345.
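
A seeded 80/20 split of this kind can be sketched in plain Python as follows. This is an
illustration under stated assumptions, not the code used in this paper: the helper name
`train_test_split` and the 100 stand-in row ids are hypothetical, and only the 80% fraction and
the seed 12345 come from the text.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=12345):
    """Shuffle `rows` reproducibly and cut them into (train, test) lists."""
    rng = random.Random(seed)  # private generator, so the seed alone fixes the split
    shuffled = list(rows)      # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # hypothetical stand-in for the dataset's row ids
train, test = train_test_split(rows)
print(len(train), len(test))  # prints: 80 20
```

Fixing the seed makes the split repeatable, so the same rows land in the training and testing
sets on every run. On a Spark data frame the usual counterpart is
`randomSplit([0.8, 0.2], seed=12345)`, which samples rows and therefore yields split sizes that
are only approximately 80% and 20%; that this exact call was used here is an assumption.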

In PySpark's ml package, logistic regression is used to predict a binary outcome using binomial
logistic regression, or a multiclass outcome using multinomial logistic regression. The data is
loaded into a Spark data frame, the feature and label columns are added, and then a logistic
regression model is fitted (Singh 2019). The model is evaluated to ensure that its predictions
are correct at a reliably high rate.
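
The fit-then-evaluate workflow described above can be sketched in plain Python. This is a
simplified stand-in rather than the PySpark model from this paper: it fits binomial logistic
regression by batch gradient descent, the learning rate and iteration count are arbitrary
choices, the toy rows are made up (they are not the Pima data), and all function names are
hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, b, x):
    """Estimated probability that the label of `x` is 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X, y, lr=0.5, n_iters=5000):
    """Binomial logistic regression fitted by batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(n_iters):
        errors = [predict_proba(w, b, xi) - yi for xi, yi in zip(X, y)]
        # Gradient of the mean log-loss with respect to each weight and the bias.
        grad_w = [sum(e * xi[j] for e, xi in zip(errors, X)) / n
                  for j in range(len(w))]
        grad_b = sum(errors) / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def accuracy(w, b, X, y, threshold=0.5):
    """Share of rows whose thresholded prediction matches the label."""
    hits = sum(int(predict_proba(w, b, xi) >= threshold) == yi
               for xi, yi in zip(X, y))
    return hits / len(y)


# Hypothetical, pre-scaled toy rows: two features and a 0/1 label.
X_train = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4],
           [0.7, 0.8], [0.8, 0.6], [0.9, 0.9], [0.6, 0.9]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]
X_test = [[0.15, 0.25], [0.85, 0.75]]
y_test = [0, 1]

w, b = fit_logistic(X_train, y_train)
print(accuracy(w, b, X_train, y_train), accuracy(w, b, X_test, y_test))
```

Accuracy on the held-out rows is the simplest check that the predictions are reliable. In
PySpark the corresponding pieces would typically be `VectorAssembler` for the feature column,
`LogisticRegression` from `pyspark.ml.classification` for the model, and an evaluator such as
`BinaryClassificationEvaluator`; which of these the original analysis used is not stated.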



Reference List

Donoho, D., 2017. 50 years of data science. Journal of Computational and Graphical Statistics,
26(4), pp. 745-766.

Drabas, T. and Lee, D., 2017. Learning PySpark. Packt Publishing Ltd.

Singh, P., 2019. Logistic Regression. In Machine Learning with PySpark (pp. 65-98). Apress,
Berkeley, CA.

Singh, P., 2019. Recommender Systems. In Machine Learning with PySpark (pp. 123-157). Apress,
Berkeley, CA.

Xu, J., 2019, August. Advancement of Data Analysis and Mining, Decision Support System, and
Computing Science Based on the Thirteenth ICMSEM Proceedings. In International Conference on
Management Science and Engineering Management (pp. 1-10). Springer, Cham.
