
Predicting App Prices using Machine Learning on Google Play Store Data


Eddison Bhatti
January 31, 2024

Abstract
This research aims to predict app prices on the Google Play Store from
app features using machine learning techniques. The study uses PySpark,
a tool for big data processing, and fits three regression models for
prediction: Random Forest, Linear Regression, and Gradient-Boosted
Trees. The research also covers data preprocessing, model training, and
evaluation techniques.

1 Introduction
The increasing number of mobile applications on the Google Play Store presents
an opportunity to understand the factors influencing app prices [1]. In this
study, we leverage machine learning models to predict app prices based on key
features, including rating, reviews, and installs.

2 Data Acquisition and Preprocessing


We start by loading the dataset with PySpark and performing initial exploratory
data analysis [2]. The dataset is then preprocessed by handling missing values,
dropping irrelevant columns, and converting data types, which prepares the
features for training the machine learning models.
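
The type conversion step can be illustrated with a minimal sketch in plain
Python. It assumes raw string values such as "10,000+" for installs and "$4.99"
for price, which this paper does not specify; the helper names to_number and
clean_row are hypothetical. Unit suffixes such as "M" are not handled here.

```python
import re
from typing import Optional


def to_number(raw: Optional[str]) -> Optional[float]:
    """Keep only digits and the decimal point, then parse as a float.

    Returns None for missing or non-numeric input (assumed formats only).
    """
    if raw is None:
        return None
    digits = re.sub(r"[^0-9.]", "", raw)
    try:
        return float(digits)
    except ValueError:
        # Raised for "" (nothing numeric left) or malformed values like "1.2.3".
        return None


def clean_row(row: dict) -> Optional[dict]:
    """Convert the four numeric columns; drop the row if any value is missing."""
    columns = ("Rating", "Reviews", "Installs", "Price")
    cleaned = {col: to_number(row.get(col)) for col in columns}
    if any(value is None for value in cleaned.values()):
        return None
    return cleaned
```

Rows for which clean_row returns None would be dropped, which is one simple way
to handle missing values; imputation is an alternative not shown here.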

3 Feature Engineering
We identify key features, including ’Rating,’ ’Reviews,’ and ’Installs,’ as predic-
tors for app prices [3]. These features are assembled into a vector to be used as
input for machine learning models. The dataset is split into training and testing
sets for model evaluation.
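
The assembly and split described above might look like the following PySpark
sketch. The label column name "Price", the 80/20 split ratio, and the seed are
assumptions, as is the helper name assemble_and_split; the input DataFrame is
assumed to hold numeric columns with no nulls.

```python
def assemble_and_split(df, feature_cols=("Rating", "Reviews", "Installs"),
                       label_col="Price", train_fraction=0.8, seed=42):
    """Pack the feature columns into one vector column and split the rows."""
    # Imported here so the helper can be defined without a Spark installation.
    from pyspark.ml.feature import VectorAssembler

    # VectorAssembler combines several numeric columns into a single vector.
    assembler = VectorAssembler(inputCols=list(feature_cols), outputCol="features")
    assembled = assembler.transform(df).select("features", label_col)

    # randomSplit takes relative weights; the seed makes the split repeatable.
    train, test = assembled.randomSplit(
        [train_fraction, 1.0 - train_fraction], seed=seed
    )
    return assembler, train, test
```

The assembler is returned alongside the two splits so that the same
transformation can later be applied to unseen rows.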

4 Model Training and Evaluation
Three regression models, namely Random Forest, Linear Regression, and Gradient-
Boosted Trees, are trained on the dataset [4]. The models are evaluated using
the Root Mean Squared Error (RMSE) metric, providing insights into their
predictive performance.
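
A minimal sketch of this training and evaluation loop in PySpark follows. It
assumes the training and test DataFrames carry a "features" vector column and a
"Price" label column, and it keeps every model at its default hyperparameters;
none of these details are stated in the paper, and train_and_score is a
hypothetical helper name.

```python
def train_and_score(train, test, label_col="Price", seed=42):
    """Fit the three regressors on `train` and return their RMSE on `test`."""
    # Imported here so the helper can be defined without a Spark installation.
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.regression import (
        GBTRegressor,
        LinearRegression,
        RandomForestRegressor,
    )

    estimators = {
        "Random Forest": RandomForestRegressor(
            featuresCol="features", labelCol=label_col, seed=seed),
        "Linear Regression": LinearRegression(
            featuresCol="features", labelCol=label_col),
        "Gradient-Boosted Trees": GBTRegressor(
            featuresCol="features", labelCol=label_col, seed=seed),
    }
    evaluator = RegressionEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="rmse")

    scores = {}
    for name, estimator in estimators.items():
        model = estimator.fit(train)              # train on the training split
        predictions = model.transform(test)       # adds a "prediction" column
        scores[name] = evaluator.evaluate(predictions)  # RMSE on held-out rows
    return scores
```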

5 Predictive Analysis
We demonstrate the application of the trained model by predicting the price of
a new app with specified features. The research explores how the models can
be used for real-world predictions and decision-making.
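
Scoring a single new app could be sketched as below. The function assumes an
active SparkSession, a fitted model, and the same VectorAssembler used during
training; the column names and the helper name predict_price are assumptions
for illustration.

```python
def predict_price(spark, assembler, model, rating, reviews, installs):
    """Return the predicted price for one app described by three features."""
    # Build a one-row DataFrame with the same column names used in training.
    row = spark.createDataFrame(
        [(float(rating), float(reviews), float(installs))],
        ["Rating", "Reviews", "Installs"],
    )
    # Assemble the feature vector, then let the model add a "prediction" column.
    scored = model.transform(assembler.transform(row))
    return float(scored.first()["prediction"])
```

A call might look like predict_price(spark, assembler, model, 4.5, 1200, 50000),
where the three feature values are made up for the example.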

6 Comparative Analysis of Models


The study compares the performance of different regression models, shedding
light on their strengths and weaknesses [5]. Each model’s RMSE is reported,
providing a quantitative measure of predictive accuracy.
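
For reference, RMSE is the square root of the mean squared difference between
actual and predicted prices, so a lower value indicates a better fit. The plain
Python sketch below computes it and orders models from best to worst; the
function names are hypothetical and no values from the study are reproduced.

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(sum(squared_errors) / len(actual))


def rank_models(scores):
    """Sort (model name, RMSE) pairs from lowest to highest error."""
    return sorted(scores.items(), key=lambda item: item[1])
```

Because RMSE squares each error before averaging, a few badly mispredicted
apps can dominate the score, which is worth keeping in mind when comparing
models on skewed price data.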

7 Visualization of Predictions
To enhance understanding, the predictions of the models are visualized using a
bar chart [6]. This visualization contrasts actual prices with predicted prices,
offering an intuitive view of model performance.

8 Conclusion
In conclusion, this research demonstrates the application of machine learning
techniques to predict app prices on the Google Play Store. The study emphasizes
the importance of data preprocessing, feature engineering, and model evaluation
in achieving accurate predictions. The comparative analysis of different regres-
sion models provides insights for future research and practical applications.

9 Future Work
Future research could explore additional features, hyperparameter tuning, and
more sophisticated modeling techniques to further enhance predictive accuracy.
Additionally, the study could be extended to analyze the impact of other factors
on app prices in a dynamic and evolving mobile app market.

References
[1] Roma, Paolo, and Daniele Ragaglia. (2016). Revenue models, in-app pur-
chase, and the app performance: Evidence from Apple’s App Store and
Google Play. Electronic commerce research and applications, 17(2016), 173-
190. https://doi.org/10.1016/j.elerap.2016.04.007

[2] Zelaya, Carlos Vladimiro González. (2019). Towards explaining the
effect of data preprocessing on machine learning. 2019 IEEE 35th in-
ternational conference on data engineering, ICDE(2019), 2086-2090.
https://doi.org/10.1109/ICDE.2019.00245

[3] Li, Zheng, Xianfeng Ma, and Hongliang Xin. (2017). Feature engineering of
machine-learning chemisorption models for catalyst design. Catalysis today,
280(2017), 232-238. https://doi.org/10.1016/j.cattod.2016.04.013
[4] Raschka, Sebastian. (2018). Model evaluation, model selection, and algo-
rithm selection in machine learning. arXiv preprint arXiv:1811.12808
(2018). https://doi.org/10.48550/arXiv.1811.12808
[5] Pollock, Michael L., Carl Foster, Donald Schmidt, Charles Hellman, A.
C. Linnerud, and Ann Ward. (1982). Comparative analysis of physio-
logic responses to three different maximal graded exercise test proto-
cols in healthy women. American heart journal, 103(1982), 363-373.
https://doi.org/10.1016/0002-8703(82)90275-7
[6] Hong, Jiayi, Ross Maciejewski, Alain Trubuil, and Tobias Isenberg.
(2023). Visualizing and comparing machine learning predictions to
improve Human-AI teaming on the example of cell lineage. IEEE
Transactions on Visualization and Computer Graphics, (2023).
https://doi.org/10.1109/TVCG.2023.3302308
