Screening Product Ideas Through User-Generated Content in Social Media To Assist Small and Medium Enterprises in New Product Development
by
Landley M. Bernardo
A Capstone Project
November 2021
ENDORSEMENT
This is to certify that the capstone project entitled SCREENING PRODUCT IDEAS
THROUGH USER-GENERATED CONTENT IN SOCIAL MEDIA TO ASSIST SMALL
AND MEDIUM ENTERPRISES IN NEW PRODUCT DEVELOPMENT prepared and
submitted by LANDLEY M. BERNARDO for the degree MASTER IN INFORMATION
TECHNOLOGY is recommended for oral examination.
This is to certify further that LANDLEY M. BERNARDO is ready for oral examination.
Accepted and approved in partial fulfillment of the requirements for the degree Master
in Information Technology.
I dedicate this capstone project to all the SME owners in the Philippines and their employees
who have worked so hard to continue their business operations during these trying times.
ACKNOWLEDGEMENTS
First and foremost, I would like to praise and thank God, the Almighty, who has granted me
countless blessings, knowledge, and opportunities so that I have finally been able to complete this capstone project.
Thank you to Dr. Cecilia Mercado for endorsing me to the Department of Science and
Technology (DOST) PROJECT STRAND scholarship. I cannot thank you enough for your
personal recommendation. Regardless of whether or not I am accepted, the fact that you
recommended me means so much. I would also like to thank Mr. Dalos Miguel and Ms. Maria
Concepcion Clemente for taking time out of your busy day to put in a good word for me. Your kind words mean so much to me.
Thank you to the DOST STRAND team for this life-changing opportunity you have given to
me. I was both thrilled and honored to hear that I had been named as a recipient of the DOST
PROJECT STRAND scholarship. By awarding me the said scholarship, you have given me
another chance to go back to school and create something awesome. I hope one day I will be able
to help students achieve their goals just as you have helped me.
Special thank you to my adviser, Maria Concepcion Clemente, for your patience, guidance,
and support. I have benefited greatly from your wealth of knowledge and meticulous advice. I
am extremely grateful that you took me on as an advisee and continued to have faith in me throughout this journey.
Thank you to my panel members, Dr. Beverly Ferrer, Dr. Randy Domantay, Dr. Cecilia
Mercado, and Ms. Elisabeth Calub. Your encouraging words and thoughtful, detailed feedback
have been very important to me. Thank you for the time you have allotted to accommodate my requests.
I sincerely appreciate the time you spent assisting me with this project. I am well aware that you
have put in a lot of hard work on the task assigned to both of you.
Thank you to Dr. Randy Domantay for suggestions and comments regarding the format and
content of my paper. Those comments are all valuable and very helpful for revising and
improving my writing.
Thank you to Dr. Gerry Paul Genove for teaching me a handful of techniques on smartly
conducting research and making one during a few of my classes with you. Thank you to my
classmates in MIT class, Richard, Nikki, Mehdi, Eddie, Chesca, Jerome, and JL to whom I had
the chance to get to know and learn new perspectives about our shared interest in Information
Thank you to Georgina and Anna Marie for helping me with my oral presentation. Your
feedback when I had my mock presentation prepared me well during my actual presentation.
Thank you to my friends and other people (too many to mention) who became part of my MIT
journey.
Finally, my deep and sincere gratitude to my family and relatives for their continuous and
unparalleled love, help, and support. I am grateful to my brother and cousin for always being
there for me as a friend and entertainer when I was down. I am forever indebted to my parents
for giving me the opportunities and experience that have made me who I am.
Abstract
This research aims to design and develop a tool to assist Philippines-based Small and
Medium Enterprises (SMEs) in screening new product ideas. The application utilized a Machine Learning (ML) model that is
based on User-Generated Content (UGC) on Twitter. Due to the Coronavirus disease 2019
(COVID-19) global pandemic, most businesses in the Philippines have struggled, and some have
gone out of business, especially SMEs. Previous studies suggest that conducting New Product Development (NPD) could
mitigate the impact of the COVID-19 on businesses. However, the process of NPD could be
costly and tedious for SMEs, considering the limited resources they possess. Since the pandemic
broke out, the volume of UGC produced in social media has increased dramatically.
Nevertheless, studies focusing on exploiting this massive volume of UGC in NPD are quite
limited. This study developed an application that utilizes UGC on Twitter to assist SMEs in
performing new product idea screening. The application is powered by a supervised Machine
Learning (ML) algorithm, Support Vector Classifier (SVC) text classification model. Over 5
million tweets were collected and preprocessed using a variety of libraries in Python to train the
model. The criteria for screening include the potential market, the trend of demand, stability of
demand, and market acceptance. At least 2,926 rows with tweets that express potential product
ideas based on common Philippine SMEs were extracted and vectorized using the Word2Vec
word embedding scheme. Consequently, the model achieved an accuracy rate of 84%. The
trained model was used to develop the proposed screening application. The application was
tested on different inputs, screens, and browsers to assess its quality. The output of this study
would help contribute to the limited literature that exploits social media data in developing
product ideas.
Keywords
SME, NPD, New Product Idea Screening, UGC, Machine Learning, NLP, Support Vector
Classifier, Word2Vec
Chapter 1: Introduction
A recent assessment shows that the Philippines' Luzon-wide lockdown that aims to contain
the COVID-19 has accumulated an output loss of 1.1 trillion pesos (NEDA, 2020). Furthermore,
the nation's highest unemployment rate of 17.7% has been recorded (PSA, 2020). The Inter-Agency
Task Force subsequently eased the lockdown to General Community Quarantine (GCQ) to prevent further loss and stabilize the economy, which resulted
in the resumption of most business operations and other economic activities (Exec. Order 112, s.
2020). However, it had little effect on improving the economy because the pandemic has already
influenced consumers' confidence in the market (Vancic & Pärson, 2020). In a study conducted
by MSC (2020), 23% of the Small and Medium Enterprises (SMEs) temporarily closed their
operations, while 28% reduced business operations, affecting thousands of Filipino workers
nationwide (See Figure 1.1). If the impact of the pandemic on the SMEs continues, it could cause
the economy to collapse, driving more Filipinos to the edge of poverty (Bouey, 2020).
These are enterprises with an asset size not exceeding 3,000,000 pesos (MSMEs, 2008). These include restaurants, parlors, and small-time
renting houses. SMEs play a significant role in providing goods and services to the masses,
creating jobs, and promoting innovation through competition (Leano, 2006). 82% of the
businesses in the Philippines fall under the category of SME (International Trade Centre, 2020).
The ability to take up a loan with a lower interest rate in a more extended period to pay is just
one of the programs that the government has rolled out to help small businesses get the financial
support they need to continue operations (Bayanihan to Heal As One Act, 2020; SBSW, 2020).
Despite these efforts, a survey says that 62% of the SMEs reported not receiving any financial
support from the Government, not even from non-government sources, such as families and friends (See Figure 1.2).
Figure 1.2 Level of financial support for Philippines’ SMEs during the implementation of
ECQ in Metro Manila
Studies suggest that New Product Development (NPD) could help businesses stay
competitive and achieve prosperity in a rapidly changing market (Booz, Allen & Hamilton,
1982; Hughes and Chaffin, 1996; Ford & Terris, 2017). NPD is a process that transforms market
opportunities into a product available in the market (Takeuchi & Nonaka, 1986). It is a seven-
stage process consisting of new product strategy development, idea generation, screening and
evaluation, business analysis, development, testing, and commercialization (Booz, & Allen &
Hamilton, 1982). The stages of NPD are illustrated in Figure 1.3. The screening is considered the
most critical stage because it selects and evaluates new product ideas from a pool of ideas
generated during new product ideation (Rochford, 1991). Its output is the deciding factor that
determines if an idea is fit for the next phase. Jespersen (2007) argues that screening is a
complex decision process highly influenced by market changes. Agrawal & Bhuiyan (2014)
created a critical success factors (CSF) framework that lists the metrics, CSF, and tools used in
each stage of NPD. In particular, the CSF for the screening phase is called Up-front homework.
It consists of numerous activities that aim to understand and analyze the current and future
market potential. Similarly, Baker & Albaum (1986) created a list of criteria for screening a new
product idea, including societal factors, business risk factors, demand analysis, and market
acceptance factors. Many researchers have likewise studied the screening of new product ideas (Cooper, 1979; Baker & Albaum, 1986; Debrentani, 1988; Verworn, Herstatt, & Nagahira,
2006; Mu, Peng, & Tan, 2007; Soukhoroukova, Spann, & Skiera, 2001; Onarheim &
Christensen, 2012; Albar, 2013). However, these studies used a traditional approach in obtaining
the datasets (e.g., giving out questionnaires, conducting interviews) to screen new product ideas.
Although conventional data collection has proven effective, a study emphasizes that it could
bring validity issues, as questions are prone to misinterpretation (Pribyl, 1994). Furthermore,
traditional product screening is a tedious, expensive, and lengthy process, yet only 20% of new
product ideas reached commercialization (Ford & Terris, 2017; Rodríguez-Ferradas & Alfaro-
Tanco, 2016; Akram, 2017). Given the limited resources of SMEs, screening new product ideas through these traditional means may be impractical.
Since the pandemic broke out, the number of people using social media (e.g., Facebook,
Twitter, and Instagram) has increased dramatically (Statista, 2020). In particular, businesses have
widely utilized these platforms to improve their brands and reach out to their target customers;
in fact, 84% of firms use social media to build their brands and increase awareness (See Figure 1.4).
Another advantage that social media brings to a business that is often underestimated is the
availability of UGC (Chu & Kim, 2011; Prantl & Mičík, 2019). UGC has been defined as
publicly available content, such as text, image, video, and even audio, created by users, rather
than brands, to express one's opinion (Krumm & Davies, 2008). It is referred to by other
literature as Electronic Word-of-Mouth or eWOM (Park & Lee, 2009; Lee & Youn, 2009; Jeong
& Jang, 2011). Examples of UGC are posts (e.g., Facebook), comments (e.g., Instagram), tweets
(e.g., Twitter), and ratings and reviews (e.g., Amazon). It is the main contributor to the enormous
digital information produced on the Web, often referred to as Big Data (Tufekci, 2014). Social
media users are expected to reach 4.41 billion in 2025 (Statista, 2020) (See Figure 1.5).
Therefore, the amount of UGC generated in social media will also see a tremendous increase. On
a recent projection made by MarketingResearch.com (2020), the global UGC software market
will reach 447 billion dollars in 2026, and blogging platforms such as Twitter will be at the top
of the competition.
Gathering consumer information is now far easier than it was a few decades ago (Pantano, Giglio, & Dennis, 2019). Nascimento & Da Silveira (2017) argue that
UGC could be an alternative source of information for screening new product ideas. Rathore &
Ilavarasan (2020) explain that it is inexpensive and can provide real-time consumer behaviors.
Specifically, Twitter has been the leading source of UGC to measure customers' satisfaction
towards a particular product or brand (Bhimani, Mention, & Barlatier, 2018; Kumar, Koolwal, &
Mohbey, 2019). The platform allows its users to post a concise tweet of up to 280 characters with
specific hashtags (e.g., #iphone12, #aegyocake, #talabykyla), making it searchable for
researchers (Kumar, Koolwal, & Mohbey, 2019). Newswire (2020) reports indicate that its Daily
Active Users improved by 34% in 2020, with 500 million tweets being sent every day.
Several studies have utilized Twitter data, such as engagements, sentiments, and hashtags, to
understand customers (Fischer & Reuber, 2011; Sultana, Paul, & Gavrilova, 2016; Anto et al., 2016; Ray &
Chakrabarti, 2017; Geetha et al., 2018; Costa et al., 2013). In particular, Fischer & Reuber (2011)
used Twitter engagements to reach more target customers. Accordingly, it would be
an avenue for a business and its customers to develop a rapport and build trust. Interestingly,
Sultana, Paul, and Gavrilova (2016) used engagements to identify the behaviors of selected users.
Sentiments are the opinion of people towards a product. Sentiment Analysis is the process of
getting the polarity scores from a piece of text. It is computed in three levels: document,
sentence, and aspect levels. Studies made by Anto et al. (2016), Ray & Chakrabarti (2017), and
Geetha et al. (2018) categorized the ratings of a known mobile phone brand using tweets. In
particular, Ray & Chakrabarti (2017) made an assessment of people’s opinions on iPhone 6 in
the document and aspect levels sentiment analysis. Finally, hashtags promote company visibility
and spread awareness of their mission and vision (Costa et al., 2013).
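As a toy illustration of document-level sentiment scoring, the sketch below counts positive and negative words against a tiny hand-made lexicon. The lexicon and word lists are purely illustrative; real analyses use full sentiment lexicons or trained models, as discussed next.

```python
# Toy document-level sentiment scorer using a tiny hand-made lexicon.
# The word lists are illustrative only; production pipelines use full
# sentiment lexicons or trained models.
POSITIVE = {"good", "great", "love", "tasty", "affordable"}
NEGATIVE = {"bad", "slow", "expensive", "hate", "stale"}

def polarity(text: str) -> float:
    """Return a score in [-1, 1]: positive minus negative word counts,
    normalized by the number of sentiment-bearing words."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

def label(score: float) -> str:
    """Map a polarity score to the three categories used in the text."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `label(polarity("I love this tasty cake"))` yields `"positive"`, while a text with no sentiment-bearing words scores neutral.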
In the past, sentiment analysis has been the most used methodology in measuring product
performance (Pang, Lee, & Vaithyanathan, 2002). It is a type of text classification technique that
assigns a label to particular text (Kowsari et al., 2019). It computes the polarity scores from a
piece of text to determine its sentiment. Its output could be categorized into positive, negative, or
neutral. There are two general approaches in text classification: lexicon-based and machine
learning-based (Dhaoui, Webster, & Tan 2017). The former relies on a dictionary or sentiment
lexicon to determine the emotions of a text (Pennebaker et al., 2015). The latter uses Machine
Learning (ML) algorithms to identify patterns on a given dataset and use the acquired knowledge
to perform the classification (Feldman, 2013). The ML approach requires a massive amount of
data for training and testing to achieve an accurate result. However, implementing sentiment
analysis using the ML approach is more accurate than the lexicon-based approach (Dhaoui, Webster, & Tan, 2017).
ML algorithms are generally divided into three types: supervised, unsupervised, and
reinforcement learning (Feldman, 2013). A supervised ML algorithm uses a labeled dataset for
training and testing. It is further subcategorized into regression and classification. The regression
deals with numbers such as mean/average prediction, while the classification tries to label a
particular input into a finite number of classes. Text classification is a domain in ML that assigns
predefined categories to pieces of text, such as categorizing news into sports, business, politics, or entertainment (Dadgar et al., 2016). One of
the most popular supervised ML algorithms is the Support Vector Machine (SVM)
(Vapnik, 1995), along with linear regression, k-nearest
neighbors (kNN), and Naïve-Bayes (Lee & Shin, 2020). SVM uses hyperplanes to
separate the observations into different classes (Wang et al., 2006). A good hyperplane is
achieved when it has the largest distance to the nearest data point. The structure of the SVM is
shown in Figure 1.6. Osisanwo et al. (2017) compared the performance of the different ML
algorithms, including decision trees, neural networks, Naïve-Bayes, kNN, SVM, and rule-
learners. The result shows that SVM supersedes all other algorithms in terms of accuracy and
tolerance to irrelevant attributes. Furthermore, Gharib, Habib, & Fayed (2009) confirmed the
effectiveness of SVM when dealing with a large text dataset. Similarly, SVM consistently
outperforms other alternative models (Joachims, 1998). However, the algorithm cannot process
raw text directly; texts must first be converted into numerical vectors before SVM can perform its
mathematical computations (Rong, 2014). Several word embedding approaches are available for
performing this task, including Word2Vec. Word2Vec encodes texts by considering their
semantic values. For instance, the numerical equivalents of men and women are more similar
than of men and horses. Recent studies have shown that incorporating Word2Vec in SVM when
building a text classifier yields a better performance (Zhang et al., 2015; Şahİn, 2017; Kurnia et
al., 2020). Specifically, Zhang et al. (2015) and Kurnia et al. (2020) utilized UGC via users’
comments regarding mobile applications and clothing products, respectively, to train a text
classification model.
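The Word2Vec-plus-SVC combination described above can be sketched as follows. The embedding table is a tiny stand-in for a trained Word2Vec model (the study trained one on tweets with a real embedding library), and the texts and labels are invented for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy word-embedding table standing in for a trained Word2Vec model.
# In practice, gensim's Word2Vec (or similar) supplies these vectors.
EMB = {
    "cake":  np.array([0.9, 0.1]),
    "tasty": np.array([0.8, 0.2]),
    "scam":  np.array([0.1, 0.9]),
    "fake":  np.array([0.2, 0.8]),
}

def tweet_vector(text: str) -> np.ndarray:
    """Average the embeddings of known words (a common Word2Vec
    sentence-encoding scheme); zero vector if no word is known."""
    vecs = [EMB[w] for w in text.lower().split() if w in EMB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

# Tiny labeled set: 1 = good product-idea signal, 0 = not good.
texts = ["tasty cake", "cake", "fake scam", "scam"]
X = np.array([tweet_vector(t) for t in texts])
y = np.array([1, 1, 0, 0])

# Fit an SVC, which finds a separating hyperplane in the embedding space.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
```

Averaging word vectors is only one encoding choice; the point is that SVC consumes fixed-length numeric vectors, never raw text.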
Previous studies have shown that UGC produced in social media could be an alternative
dataset for screening a particular product. However, products that have not been created (new
product ideas) are yet to be explored (Nascimento & Da Silveira, 2017). According to
Soukhoroukova, Spann & Skiera (2012), new product ideas that are not immediately captured
may fade away in an organization in no time. Similarly, Albar (2013) concluded that 90% of new
product ideas are rejected before they even reach formal evaluation. Therefore, it is high time to
utilize the abundance of available UGC for screening product ideas. A tool is developed to
perform screening of product ideas using tweets. The application would assist the Philippines'
SMEs in conducting NPD.
1.2. Objectives of the study
This study explored UGC produced in social media in developing new product ideas for SMEs. Specifically, it aimed:
1. To build an SVM text classification model for screening product ideas using UGC produced on Twitter.
2. To develop a web-based screening application using the trained SVM text classification
model.
1.3. Scope of the study
This study focuses on the screening phase of the NPD. The dataset used in the study was
limited to text-based UGC produced on Twitter; in particular, the dataset consists of tweets
posted within the Philippines from August to December 2020. The collected tweets went through
a series of data preprocessing using available Natural Language Processing (NLP) libraries in
Python. In addition, manual data annotation was conducted to verify the product ideas extracted
from the tweets. A list of common product ideas was used to determine whether the extracted
product ideas were valid. The screening considered four criteria: market potential, stability
of demand, the trend of demand, and market acceptance (Baker & Albaum, 1986). A static text
classification model was built using a supervised ML algorithm called SVC. The input variables,
which consist of the extracted product ideas, were transformed into their vector representations
using Word2Vec encoding scheme. A parameter tuning was performed to get the ideal values of
four parameters of the SVC model, namely kernel, degree, gamma, and C. The model's
performance was assessed in four metrics: accuracy, precision, recall, and f1-score. The trained
model was exported and used in the development of the screening application. The proposed
application was implemented using the Python-based web microframework called Flask. A
proof-of-concept on how UGC could be utilized to assist SMEs in conducting NPD is developed.
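The parameter tuning over the four SVC parameters (kernel, degree, gamma, and C) can be sketched with scikit-learn's GridSearchCV. The data below are synthetic stand-ins for the vectorized product ideas, and the candidate values are illustrative rather than the exact grid used in the study.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data; in the study, X holds Word2Vec tweet vectors
# and y holds the good / not-good screening labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Cross-validated search over the four SVC parameters tuned in the study.
grid = GridSearchCV(
    SVC(),
    param_grid={
        "kernel": ["linear", "rbf", "poly"],
        "degree": [2, 3],          # only used by the poly kernel
        "gamma": ["scale", "auto"],
        "C": [0.1, 1, 10],
    },
    cv=3,
)
grid.fit(X, y)
best = grid.best_params_   # ideal values found by the search
```

GridSearchCV refits the best combination on the full training data, so `grid` can be used directly as the tuned model afterwards.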
The output of this capstone project would help bridge the gap identified by Nascimento & Da
Silveira (2017) regarding the limited literature that utilizes data produced in social media to
develop new product ideas. The research also would prove how UGC could be an alternative
source of information when screening new product ideas (Owens, 2007). Moreover, the findings
would show a practical implication for SMEs by helping them to understand the importance of
consumer engagement in social media. Furthermore, the developed application would give the
rising number of SMEs putting up businesses in the online platform economy nationwide
insights into which product ideas to sell. One report shows that 77 percent of Filipinos consider the online presence
of SMEs a must (Villanueva, 2020). Most importantly, it would encourage the Philippines’
SMEs to consider developing new product ideas. Studies made by Booz, Allen & Hamilton
(1982), Hughes and Chaffin (1996), and Ford & Terris (2017) stated that NPD could help businesses stay competitive and achieve prosperity.
Chapter 2: Methodology
This section describes the steps taken to accomplish the objectives of this study. The
methodology is divided into two main sections: section 2.1) building the model and section 2.2)
implementation. Section 2.1 discusses the tasks involved in creating the ML model for
classifying product ideas into good or not good. This section is further divided into six
subsections, including data collection, data preparation, and performance evaluation. Section 2.2
shows a detailed discussion on how the proposed screening application is implemented using the
model built in section 2.1. Moreover, the technology stacks used in the implementation are
presented. The development is divided into two parts: front-end and back-end development. Each
section of the methodology is further explained in the succeeding paragraphs and is summarized
in Figure 2.1.
The primary tool used to build the model (in section 2.1) is Anaconda Navigator. It is a
GUI-based application containing tools and pre-installed packages in Python for ML, also known
as conda packages. In particular, Jupyter Lab is one of the ML tools available by default in
Anaconda Navigator. It is an interactive open-source web application used for data exploration,
visualization, and analysis. Jupyter Lab was used to create the notebooks that contain the
different utility functions to build the model. Anaconda Navigator's Environments feature was used to create a new
virtual environment (venv) to store the needed Python packages for building the model. A
package is a collection of modules. It is a file consisting of Python code that defines classes,
functions, and variables. Packages are used to build a powerful Python application. Creating a
venv would allow packages to be reinstalled easily when an unknown bug occurs in one of the
packages installed. Anaconda Navigator uses the channel named default as its default channel for
installing and adding packages to a venv. Additional channels were added to get more packages,
such as the conda-forge and pytorch channels. The packages stored in the venv and their roles are listed in Appendix A.
Section 2.2 discusses the implementation of the proposed screening application. Python is the
language of choice when performing analysis for big data (Oliphant, 2007). Specifically, the
application was developed using Flask. It is an open-source web Python microframework for
building data-driven and dynamic web applications quickly. It follows the Model-View-
Controller (MVC) architecture pattern that separates the application into three main components:
the model, view, and controller; these components represent the data or the database, the display,
and the application's logic, respectively. Flask is the most popular web framework for Python,
along with Django (Aslam, Mohammed, & Lokhande, 2015; Mufid et al., 2019). It has 52,400
stars and 13,800 forks on its Github page at the time of this writing. Unlike Django, which is a
full-stack and comes with pre-built dependencies, libraries, and layouts, Flask is lightweight. It
only offers suggestions for possible tools for developing the application, giving developers the
flexibility and freedom to select other technologies for implementation. The Flask framework is
relatively new. Its latest stable version (1.1.2) was released on April 3, 2020.
Nevertheless, its community has been growing. There are currently 621 contributors, and the framework is trusted
by more than 5,000 projects, including well-known brands such as Netflix, Reddit, and Lyft.
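A minimal sketch of how a Flask "controller" could expose the screening model is shown below. The route, function names, and placeholder classifier are hypothetical illustrations, not the study's actual code.

```python
# Minimal Flask sketch of a screening endpoint: the controller accepts a
# product-idea string and returns the screening label. In the real
# application, classify() would wrap the exported SVC model; here it is
# a placeholder so the sketch stays self-contained.
from flask import Flask, jsonify, request

app = Flask(__name__)

def classify(idea: str) -> str:
    # Placeholder for the exported text classification model.
    return "good"

@app.route("/screen", methods=["POST"])
def screen():
    idea = request.get_json().get("idea", "")
    return jsonify({"idea": idea, "label": classify(idea)})
```

In MVC terms, the route function is the controller, `classify()` stands in for the model, and the JSON response (or a rendered template) is the view.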
The Integrated Development Environment (IDE) used was PyCharm
Professional. Although the professional version comes with a price, JetBrains, the creator of
PyCharm, offers a free one-year subscription for students. An IDE is a software application that
provides comprehensive facilities for software development. PyCharm is a
recommended code editor for building Python-based projects. It is equipped with all the
necessary tools for modern development, including a command-line interface (CLI), features to
connect to the database, create a venv, and integrate with Github. Also, it offers features for
handling big data and developing data-driven applications like conda integration to manage
packages for Python, scientific libraries and plots for performing data analytics and visualization,
and coding assistance for Python frameworks like Flask. The same venv used in section 2.1 was
utilized in the development. Once the venv was created, and packages were installed, a new
Flask project was initiated in PyCharm. PyCharm used the created venv as its interpreter to
import all the packages needed for the development (see Appendix A).
A model is an ML algorithm that has been trained to identify patterns on a given dataset. It
uses its acquired knowledge to perform a prediction or classification (Feldman, 2013). This study
proposed a supervised text classification model that classifies a particular product idea as good or
not good. The SVM ML algorithm was considered in building the said model. According to
Joachims (1998), SVM consistently outperforms other alternative models in text classification.
This section is further broken down into five subsections: section 2.1.1) data collection, section
2.1.2) data preprocessing, section 2.1.3) constructing the criteria for screening, section 2.1.4)
data preparation, and section 2.1.5) performance evaluation. First, data collection explains how
the data needed for the study were gathered using an advanced scraping tool. Second, data
preprocessing provides a detailed look into how the collected data were cleaned and turned into
valuable data for modeling. It also shows the process of extracting product ideas from the actual
tweets. In the third step, the criteria for screening product ideas were constructed. Furthermore,
the results of the screening process of the extracted product ideas were labeled as good or not
good. Fourth, the dataset for modeling was prepared and divided into two subsets to train and test
the model. Finally, the model was trained, and its performance was assessed in different
performance metrics. The trained model was saved and used for the implementation of the
screening application.
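The final training-and-evaluation step can be sketched with scikit-learn, using synthetic vectors in place of the real dataset. The four metrics match those named in the scope; everything else here is an illustrative stand-in.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic feature vectors standing in for the vectorized product ideas.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)

# Hold out a test subset, train, then assess the four metrics.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=1)
model = SVC(kernel="linear").fit(X_tr, y_tr)
pred = model.predict(X_te)

metrics = {
    "accuracy": accuracy_score(y_te, pred),
    "precision": precision_score(y_te, pred),
    "recall": recall_score(y_te, pred),
    "f1": f1_score(y_te, pred),
}
```

Keeping the test subset untouched during training is what makes the reported metrics an honest estimate of performance on unseen tweets.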
A variety of scraping tools are available for mining tweets on Twitter. The microblogging
platform [Twitter] allows third-party applications to connect and collect tweets using an
approved and authenticated developer account through a secured channel called Twitter
Application Programming Interface (API) (Kumar, Koolwal, & Mohbey, 2019). However, the
number of tweets allowed to be collected is limited; specifically, the Twitter API v2 imposes a
monthly cap on the number of tweets an account can retrieve.
Alternatively, the Twitter Intelligence Tool (Twint) allows retrieval of tweets with no limits and no
API required (Dutch Osint Guy, 2018). It is an open-source advanced tweet scraping tool often
used for Open-Source Intelligence (OSINT) research. OSINT tools focus on collecting,
analyzing, and using publicly posted information (e.g., reviews, tweets, and Facebook feeds) for
research purposes. Hence, Twint was utilized to collect the data needed for the study. In
particular, these data were the tweets that were posted within the Philippines from August to
December 2020.
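Restricting collection to tweets posted within the Philippines amounts to a radius check around a central point; the study passes Marinduque's coordinates and a 768.54 km radius to Twint's geo parameter, as described below. A pure-Python haversine sketch of the equivalent check (the sample coordinates in the test are illustrative):

```python
# Haversine check: does a geotag fall within the 768.54 km radius around
# Marinduque (12.072862, 122.664139) used in the Twint geo filter?
from math import asin, cos, radians, sin, sqrt

CENTER = (12.072862, 122.664139)   # Marinduque, from the geo parameter
RADIUS_KM = 768.54

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points,
    assuming a spherical Earth of radius 6371 km."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(h))

def in_scope(point) -> bool:
    """True if the (lat, lon) point lies within the collection radius."""
    return haversine_km(CENTER, point) <= RADIUS_KM
```

For instance, Manila lies well inside the radius, while a city such as Tokyo falls far outside it, so its tweets would be excluded.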
To start off, a Linux-based Operating System (OS), Ubuntu, was loaded and configured in
Oracle Virtual Machine (VM) Virtual Box. The VM enables a machine to run more than one OS
at a time within the base OS. Since Twint is compatible with the Linux environment, the VM
was required to run Twint in Windows, the researcher’s default OS. The specifications set for the
virtual machine are as follows:

Hardware Specification
Operating System: Linux
Distro: Ubuntu 18.04.01
Processor: 2 CPUs
Base memory: 4608 MB
System type: 64-bit Operating System, x64-based processor
Storage: 40 GB
Furthermore, Twint's dependencies, including the Python language, Pip Installs Packages
(pip), and Git, were installed into the virtual OS. Notably, a new venv was created to store the
other dependencies using pip. A venv allows the developer to build multiple Python-based
applications in a single machine. Pip is a package manager for Python. To ensure that only
tweets posted within the Philippines are included, the coordinates of Marinduque
(12.072862, 122.664139) within a radius of 768.54 km were specified in Twint's geo
parameter. The code used in data collection is shown in Appendix B. Marinduque was selected
because of its central location in the archipelago, ensuring that only the needed data are collected. The places in the Philippines where tweets were collected are illustrated in the accompanying figure.
In predictive modeling (e.g., text classification), raw data cannot typically be used as it is. It
requires preprocessing to ensure that the dataset is fit to the model. Data preprocessing is a
process that aims to get rid of the noise in a given dataset. Some examples of noise include
special characters, unnecessary duplicate letters, and stopwords, such as “the,” “a,” and “can.”
The presence of these noises could compromise the performance of an ML model when
performing text classification (Yang, 2018). The process of data preprocessing varies
accordingly. It is highly dependent on the defined problem. This study utilized the programming
language Python to preprocess the collected data. Python is a multi-purpose language that serves
different needs, from basic web applications to training models for ML and Artificial Intelligence
(AI). It also offers a variety of packages from various sources to turn raw texts into useful data.
Data preprocessing comprises five subprocesses: (1) data wrangling, (2) data reduction, (3)
extracting the product ideas, (4) data annotation, and (5) sentiment analysis. These steps are
summarized in Figure 2.2. First, the collected data were transformed from a Tab Separated
Values (TSV) format into a dataframe which enables useful functions to aid with the data
preprocessing. Next, the rows and columns of the created dataframe were reduced to select only
the relevant data for the study. Third, numerous data cleansing techniques were carried out using
available NLP packages in Python to extract possible product ideas from the tweets. Fourth, the
extracted product ideas were manually annotated. Lastly, the polarity scores of the extracted product ideas were determined through sentiment analysis.
Figure 2.2 Steps of the data preprocessing
In the initial stages of data preprocessing, the collected UGC were moved from the
configured virtual OS to the Windows host machine via the shared folder feature of the Oracle
VM. Once the data were transferred, the data preprocessing began.
The first step of data preprocessing was data wrangling. Data wrangling restructures raw data
into the desired format to aid with the data preprocessing. The initial dataset was transformed
from its original file format [TSV] into a dataframe using the pandas package. A dataframe is a
2D data structure that provides streamlined forms of data representation. It is a table that consists
of columns as the labels and rows as the actual data. The pandas library was used because it
provides rich functions for data preprocessing. In addition, the encoding scheme UTF-8 was
specified in pandas’ encoding parameter to read and display the emojis of the tweets. The code used in data wrangling is shown in Appendix C.
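A sketch of this wrangling step with pandas, using a small in-memory TSV in place of the actual Twint export; the rows and column names are illustrative stand-ins.

```python
# Load a Twint-style TSV export into a pandas dataframe with UTF-8
# decoding so emojis in the tweets survive. An in-memory TSV stands in
# for the real export file; its contents are illustrative.
import io

import pandas as pd

raw_tsv = (
    "id\ttweet\tlanguage\n"
    "1\tI love aegyo cake \U0001F382\ten\n"
    "2\tmasarap ito\ttl\n"
)

df = pd.read_csv(io.StringIO(raw_tsv), sep="\t", encoding="utf-8")
```

With a file on disk, the same call takes the TSV path as its first argument; the `sep="\t"` and `encoding="utf-8"` arguments are the essential parts.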
The second step of data preprocessing is data reduction. Data reduction is a technique to
reduce the number of features of a particular dataset. In this step, the columns and rows that were
irrelevant to the study were removed. The selection considered the data that would be helpful
when screening new product ideas. Starting off with the rows, non-English tweets were removed
from the dataframe. This would limit the dataset to English words that work well with available
NLP libraries in Python. Also, it would lead to the improvement of the quality of the dataset by
removing nuisance words that are hard to interpret. The code used to reduce the rows of the
dataframe is shown in Appendix D, lines 10-12. In the case of column selection, attributes that
could be used in measuring the performance of the extracted product ideas using available
information in the scraped UGC were considered. These include attributes with numerical values
which would be beneficial in getting the engagements and sentiments (Fischer & Reuber, 2011;
Sultana, Paul; Gavrilova, 2016; Anto et al., 2016; Ray & Chakrabarti, 2017; Geetha, Rekha, &
Rarthika, 2018; Costa et al., 2013). The code used to reduce the columns of the dataframe is likewise provided in the appendices.
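The row and column reduction could look like the sketch below. The `language` column name follows Twint's output, the kept columns are those named later in the text (created_at, tweet, retweets_count, replies_count, likes_count), and the sample values are illustrative only.

```python
import pandas as pd

# Toy dataframe standing in for the wrangled Twint data.
df = pd.DataFrame({
    "created_at": ["2020-10-18", "2020-11-27"],
    "tweet": ["CHOCOLATE CAKE ROLLS", "gusto ko ng bagong gym"],
    "language": ["en", "tl"],
    "retweets_count": [0, 3],
    "replies_count": [0, 1],
    "likes_count": [1, 5],
    "user_id": [111, 222],  # example of an irrelevant column to drop
})

# Row reduction: keep English tweets only; column reduction: keep the
# attributes useful for engagement and sentiment computations.
keep = ["created_at", "tweet", "retweets_count", "replies_count", "likes_count"]
reduced = df[df["language"] == "en"][keep].reset_index(drop=True)
print(reduced.shape)
```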
The third step of data preprocessing was extracting the product ideas. According to
Nascimento & Da Silveira (2017), one way to find product ideas for business is in UGC
produced in social media. This information provides insights into consumers’ preferences and needs that can be uncovered with careful planning and the right tools at hand. This can be utilized by businesses for product
development (Rathore & Ilavarasan, 2020). Extracting product ideas consists of seven sub-steps,
including (1) data precleaning, (2) pulling out the nouns, (3) data cleaning, (4) removing the
stopwords, (5) lemmatization, (6) discarding the duplicate words, and (7) eliminating the short words.
The first step to extract the product ideas from the tweets was data precleaning. It was
intended to initially clean the tweet by discarding word patterns including mentions (@),
hashtags (#), and links (http and https). To do this task, a utility function was constructed using
the regex and numpy libraries. A new column named precleaned_tweet was created
separately from the original tweet to store the output of the applied operation. The code used to
preclean the tweets is shown in Appendix E. Rows with null values in the newly created
precleaned_tweet column were removed from the dataframe. The results of this step were used in the succeeding sub-steps.
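A minimal sketch of such a precleaning utility function is shown below, using the `re` (regex) and `numpy` modules as the text describes; the wiring of the function into a new dataframe column via pandas `apply` is omitted.

```python
import re
import numpy as np

def preclean(tweet: str):
    """Remove mentions (@), hashtags (#), and links (http/https);
    return np.nan when nothing is left, so empty rows can be dropped."""
    cleaned = re.sub(r"(@\w+|#\w+|https?://\S+)", "", tweet)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned if cleaned else np.nan

print(preclean("@jacob check this https://t.co/xyz #sweets great cake"))
```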
The next step to extract the product ideas was to pull out the nouns from the precleaned tweets. Part-of-speech (POS) tagging labels each word in a sentence based on the part of speech (e.g., noun, verb, adverb, adjective, pronoun, conjunction, and interjection) it belongs to. Specifically, a noun is a part of speech
that names a person, thing, idea, action, or quality. An example of POS tagging is shown in
Figure 2.3.
Figure 2.3 An example of a tweet where the words are tagged and labeled using a POS
tagger. In this example, the words yoga and mat are labeled as nouns.
This study used nouns to represent possible product ideas. Generally, product ideas are
expressed through nouns (Malmasi & Dras, 2015). By pulling out the nouns from the values of
the precleaned_tweet, it would help determine which tweets contain possible product ideas for
the Philippines’ SMEs. Rows with no extracted nouns were removed from the dataframe. Three
POS tagger libraries, namely, TextBlob, NLTK, and Spacy, were used, and their results were
manually evaluated to ensure the quality of the extracted nouns. The final result was stored in a
new column named pulled_out_noun. The code for extracting the nouns is shown in Appendix
F.
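The noun-selection logic can be sketched as below. The taggers themselves (TextBlob, NLTK, Spacy) need model data to run, so the sketch assumes the tokens have already been tagged with Penn Treebank tags, which is the scheme TextBlob and NLTK produce.

```python
# Sketch of the noun-selection step, given already-tagged tokens.
def pull_out_nouns(tagged_tokens):
    # Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]

# Illustrative tagged tweet, mirroring the yoga-mat example in Figure 2.3.
tagged = [("i", "PRP"), ("really", "RB"), ("need", "VBP"),
          ("a", "DT"), ("yoga", "NN"), ("mat", "NN")]
print(pull_out_nouns(tagged))  # ['yoga', 'mat']
```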
The third task to extract the product ideas was data cleaning. The results of the previous step
[pulling out the nouns] were cleansed to remove any unnecessary words and characters
mistakenly pulled out as nouns by the selected POS tagger. This step includes (1) converting
words to lower case, (2) expanding contractions, and (3) removing email addresses, HTML tags, and other unwanted characters. The preprocess_kgptalkie package was utilized to cleanse the nouns. A new column named
cleaned_noun was created to store the results of data cleaning. The code used to clean the
extracted nouns is shown in Appendix G. Rows with null values in the new column were removed from the dataframe.
The fourth task to extract the product ideas was the removal of the stopwords. The stopwords removed include both Tagalog (e.g., akin, ako, at, dapat) and English (e.g., can, might, by) words. The
complete list of the stopwords removed from the cleaned nouns are shown in Appendix H.
Stopwords are words that do not add much meaning to a sentence and can be safely removed
without compromising the meaning of a sentence. The stopwords were removed using the
Spacy library. By default, Spacy only contains the stopwords for the English language; the Tagalog stopwords were therefore added manually. A new column was added to the dataframe to store the new values of the applied operation. The code used to get rid of the stopwords is shown in Appendix I. Rows with null values in the newly created column were removed from the dataframe.
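The bilingual filtering idea can be shown with plain sets, without loading Spacy; the Tagalog and English lists here are tiny illustrative subsets of the full list in Appendix H.

```python
# Illustrative subsets of the stopword lists; the full lists are in Appendix H.
english_stopwords = {"can", "might", "by", "the", "a"}
tagalog_stopwords = {"akin", "ako", "at", "dapat", "ng"}
stopwords = english_stopwords | tagalog_stopwords

def remove_stopwords(words):
    """Drop words that carry little meaning, keeping candidate product terms."""
    return [w for w in words if w not in stopwords]

print(remove_stopwords(["ako", "can", "gym", "equipment", "at", "shop"]))
```

In Spacy itself, the same effect is achieved by extending `nlp.Defaults.stop_words` with the Tagalog entries before filtering.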
The fifth step to extract the product ideas was lemmatization. It is the process of converting a
word into its base form, removing endings such as -s, -ing, and -ed (e.g., from equipments to
equipment, from walked to walk). Sometimes a word has an irregular form, making its conversion quite different (e.g., from mice to mouse, from dove to dive). Lemmatization was
employed to get the base form of the nouns, which would help to further simplify the values of
the extracted nouns. The make_base method from the preprocess_kgptalkie package
was used to perform the lemmatization. A new column named lemmatized_noun was
introduced to save the result of lemmatization. The code used to lemmatize the words is shown
in Appendix J, line 3.
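A toy sketch of what lemmatization does is given below. The thesis uses make_base from preprocess_kgptalkie, which wraps a full lemmatizer; the irregular-form table and suffix rules here are illustrative only and far from complete.

```python
# Tiny illustrative table of irregular forms; a real lemmatizer covers far more.
IRREGULAR = {"mice": "mouse", "dove": "dive", "feet": "foot"}

def lemmatize(word: str) -> str:
    """Crude sketch: map known irregulars, otherwise strip common endings."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix, repl in (("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")):
        # Length guard keeps short words like "ring" or "bus" intact.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + repl
    return word

print([lemmatize(w) for w in ["equipments", "walked", "mice"]])
```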
Once the nouns were lemmatized, the next step was to discard the duplicates. A new column named unique_noun was added to the dataframe to keep the unique words. The code used in removing the duplicates is shown in Appendix J, line 14. Rows with null values in the unique_noun column were removed from the dataframe.
Finally, short words were eliminated from the unique_noun column. In particular, these are
the words that have fewer than three characters. There is a high possibility that words under the said category are merely a nuisance and do not represent a product idea; they were therefore discarded. A custom one-liner function was designed to apply this step. A new column named extracted_product_idea was created to store the results. The code used to remove words with fewer than three characters is shown in Appendix J, line 17. Rows with null values in the newly created extracted_product_idea column were removed from the
dataframe.
After the extraction, duplicate values of extracted_product_idea were combined by aggregating their retweets_count, likes_count, and replies_count and getting the average of their polarity_score. In addition, a new column named tweets_count was added to the dataframe to store the number of occurrences of each extracted_product_idea. After the values were aggregated, the duplicates were removed, leaving only the first occurrence of each extracted_product_idea. The code used to implement this step is shown in Appendix J, lines 24 and 26. The resulting dataframe was saved and used in section 2.1 to construct the criteria
for screening.
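The aggregation just described can be sketched with a pandas `groupby`: counts are summed, polarity scores averaged, and a tweets_count column records how often each product idea occurs. The sample values are illustrative only.

```python
import pandas as pd

# Toy stand-in for the extracted product ideas with their per-tweet counts.
df = pd.DataFrame({
    "extracted_product_idea": ["haircut", "haircut", "gym equipment"],
    "retweets_count": [0, 4, 1],
    "likes_count": [1, 12, 5],
    "replies_count": [0, 1, 2],
    "polarity_score": [0.1, 0.0544, 0.3884],
})

# Sum the engagement counts, average the polarity, and count occurrences.
agg = (df.groupby("extracted_product_idea", as_index=False)
         .agg(retweets_count=("retweets_count", "sum"),
              likes_count=("likes_count", "sum"),
              replies_count=("replies_count", "sum"),
              polarity_score=("polarity_score", "mean"),
              tweets_count=("extracted_product_idea", "size")))
print(agg)
```

The `groupby` collapses duplicates in one pass, so no separate first-occurrence filtering is needed in this sketch.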
The fourth step of data preprocessing was data annotation. It is a process of labeling
the data available in various formats so that an ML model can quickly and clearly understand the
input patterns (Schreiner, 2006). This study conducted a manual data annotation to ensure that
only valid product ideas are included in the final dataset. Data annotation was carried out with
the aid of two BSIT undergraduate students. The result of all the steps mentioned above is a dataset of extracted product ideas with corresponding attributes, including the number of retweets, likes, and replies. The said dataset was divided into five subsets based on the month when the tweets were posted. The first two subsets, August.csv and September.csv, were assigned to the first student, while the subsequent two subsets, October.csv and November.csv, were delegated to the second student.
The remaining subset (December.csv), which has the most number of rows, was given to the
researcher.
A new column named label was added to the dataframe to store the results of data annotation.
The basis for considering a product idea as valid is the presence of emotion (e.g., hate, want,
love, disgust) expressed towards the noun in a tweet. According to Nascimento & Da Silveira
(2017), people use social media platforms to express their emotions on any particular topic,
including product ideas. Moreover, since not all the values of extracted_product_idea are valid
product ideas, only those on the list of common product ideas for the Philippines' SMEs are
considered. For example, MoneyMax.com (2021) published a list of small businesses ideas with
small capital in 2021. A few of the product ideas in the said list are plant shop, beauty product
reselling business, and cake, dessert, and pastry business. The complete list of the category of the
product ideas used in categorizing the annotated product ideas is shown in Appendix K. Each valid product idea was matched to the category it belongs to, and the result was stored in a new column named annotated_product_idea. The output of both students was further validated by the researcher. Rows with 0 values in the label column were removed from the dataframe.
The fifth step of data preprocessing was sentiment analysis. Sentiment analysis is the process of determining the sentiments on a piece of
text based on its polarity scores (Pang, Lee, & Vaithyanathan, 2002). The sentiments were
computed by getting the polarity scores of the values of the precleaned_tweets column in a
sentence level using Valence Aware Dictionary and sEntiment Reasoner (VADER). VADER is a
Python package designed for complex social media data, as it considers words, punctuations, emojis, slang, and abbreviated words that commonly appear in social media texts (Hutto & Gilbert, 2014). The package measures sentiments by providing four valence scores: positive, negative, neutral, and compound, where the compound score ranges from -1 (extremely negative) to +1 (extremely positive). This study considered the results of the compound valence scores for the analysis. It is the most commonly used basis for sentiment analysis among researchers (Hutto & Gilbert, 2014). A new column named polarity_score was created and initialized using
the values of the computed compound valence scores. The code used to get the polarity scores is
shown in Appendix J, line 22. Rows with polarity_score values less than 0.05, that is, extracted product ideas with neutral or negative sentiments, were dropped. The resulting dataframe was exported as a CSV file using the pandas library to be used for the succeeding steps. The code used to export the preprocessed dataset is shown in Appendix J, line 28.
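The filtering applied after scoring can be sketched as follows. Computing the compound score itself requires the VADER package; here precomputed scores (illustrative values, not from the study) stand in for its output.

```python
# (product idea, compound valence score) pairs standing in for VADER output.
scored = [
    ("milk tea shop", 0.6369),   # positive
    ("haircut", 0.0772),         # positive (>= 0.05)
    ("traffic", -0.4404),        # negative
    ("umbrella", 0.0),           # neutral
]

# Keep only product ideas with positive sentiment (compound >= 0.05).
positive = [(idea, score) for idea, score in scored if score >= 0.05]
print(positive)
```

With the real library, each score would come from `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` in the vaderSentiment package.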
Screening is the third phase of NPD. It is a complex decision process highly influenced by
market changes (Jespersen, 2007). Screening aims to evaluate new product ideas to determine
which idea is worth investing in (Rochford, 1991). A set of criteria relevant to a particular
business must be designed (Agrawal & Bhuiyan, 2014; Baker & Albaum, 1986). In this section, the criteria for screening were constructed considering the factors identified by Baker & Albaum (1986). Accordingly, the said literature
used 33 criteria categorized into five factors: societal factor, business risk factor, demand
analysis, market acceptance factor, and competitive factor. The complete list of Baker &
Albaum’s criteria is shown in Appendix L. Out of the categories previously mentioned, only two
were chosen, namely, demand analysis and market acceptance. These factors were selected
because their values could be obtained using the available information in the gathered data. In
contrast, the remaining factors (societal, business risk, and competitive factors) require more
than just UGC. It involves information such as financials, skills, business values, and culture, which cannot be derived from the collected tweets.
The first factor was demand analysis. It is used to assess the number of possible customers for a particular product (Hart et al., 2003). The second factor, market acceptance, is
defined as the reaction of the possible customers to a specific product (Calatone & Cooper,
1979). First, the demand analysis was measured in three metrics: (1) potential market, (2) the
trend of demand, and (3) stability of demand. For the second factor, market acceptance, the
sentiment analysis was conducted. These factors and assigned criteria are summarized in Table
2.2. A product idea should satisfy all the criteria to pass the screening and be considered good.
The formula used in getting the scores of each criterion are shown in the succeeding paragraphs.
A new dataframe was created and filled with the computed values of each criterion mentioned
above using the preannotated version of the dataset; more particularly, the values of the potential_market, trend_of_demand, stability_of_demand, market_acceptance, and label columns. The label column holds the
result of the screening. It has two possible values, 1 or 0. A value of 1 means that a product idea
passes the screening (good), while a value of 0 says otherwise (not good). The dataframe was saved for
the next step, which is data preparation. The code used to screen the product ideas is shown in
Appendix M.
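How the four criteria could be combined into the binary label is sketched below; the threshold values follow the text (33% potential market, trend above 0, 15% stability, 0.05 market acceptance), while the function shape and sample inputs are assumptions for illustration.

```python
# Sketch of combining the Table 2.2 criteria into the screening label.
def screen(potential_market, trend_of_demand,
           stability_of_demand, market_acceptance):
    """Return 1 (good) only when every criterion is satisfied, else 0."""
    passed = (potential_market >= 33
              and trend_of_demand > 0
              and stability_of_demand >= 15
              and market_acceptance >= 0.05)
    return 1 if passed else 0

print(screen(41.0, 0.8, 20.0, 0.64))  # passes all criteria -> 1
print(screen(12.0, 0.8, 20.0, 0.64))  # fails potential market -> 0
```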
Table 2.2 Criteria for screening product ideas
The first criterion was the potential market. It is defined as the total market size
for a particular product idea. The engagement rate was used to measure the potential market. It is
the ratio of the total interactions to the total number of tweets. Muñoz-Expósito (2017) proposed
a formula for getting the engagements on Twitter called the ratio of interest. The formula is
shown below.
Accordingly, interactions consist of the total number of retweets, shares via email, replies, and likes, collectively called diffusion interactions. Since tweets shared via email are not available in the collected dataset, only retweets, replies, and likes were considered in the computation. An engagement rate of 33% or higher is considered very high (Mee, 2020). A potential market of at least 33% was therefore used as the threshold value to consider a product idea as good. A new column named potential_market was introduced to store the computed potential market. The code is shown in Appendix N.
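The computation can be sketched as below, following the text's description of the ratio (total interactions over total tweets, expressed as a percentage); the exact scaling in Muñoz-Expósito's original formula may differ, and the sample numbers are illustrative.

```python
# Sketch of the potential-market (engagement rate) computation; email shares
# are excluded because they are not in the collected dataset.
def potential_market(retweets, replies, likes, total_tweets):
    interactions = retweets + replies + likes
    return interactions / total_tweets * 100  # engagement rate in percent

print(potential_market(retweets=40, replies=25, likes=101, total_tweets=500))
```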
The second criterion was the trend of demand. It is the growth of demand over a period of
time. A study refers to this metric as a secular trend that describes data direction (e.g., upward or
downward) in the long term (Komlos, 1993). According to Trackmyhashtag, a firm that gives
analytics from raw tweets, tweet volume is the sum of the tweets, retweets, and replies
(Trackmyhashtag, 2020). The graphical method is used in measuring a secular trend (Komlos, 1993). Accordingly, a good trend is when the curve generated is smooth, which means that the
scores above the line should be greater than or equal to the score below it. Sample trendlines are
shown in Figure 2.4. The slope of the trendline from August to December was monitored to get
the trend of demand. A value greater than 0 was used as the threshold value to consider a product
idea as good. A new column named trend_of_demand was created to hold the values of the trend of demand.
The third criterion was the stability of demand, which describes how the demand holds from one period to another. The increase in the tweet volume was compared to measure the stability of demand. Usually, the standard is to assess the stability of demand annually, but since the collected data are tweets posted within a 6-month window, the increase in the tweet volume from November to December 2020 was used instead. The formula is shown below.

Stability of demand = ((previous − current) / previous) × 100

where previous = tweet volume in November 2020 and current = tweet volume in December 2020.
Baremetrics, a company that provides analytics and insights for business, reported that
companies should have 15%-45% stability of demand yearly. A threshold of at least 15% was set
as the value of the stability of demand to consider a product idea as good. A new column named
stability_of_demand was created to store the values of the stability of demand. The code is
shown in Appendix P.
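The stability-of-demand computation, exactly as the formula above writes it, can be sketched as follows; the sample tweet volumes are illustrative, not the study's actual counts.

```python
# Sketch of the stability-of-demand formula as written in the text:
# (previous - current) / previous * 100, with previous = November 2020
# tweet volume and current = December 2020 tweet volume.
def stability_of_demand(previous: float, current: float) -> float:
    return (previous - current) / previous * 100

print(stability_of_demand(previous=1000, current=800))  # 20.0
```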
Finally, market acceptance is described as the sentiments of the people towards a particular
product idea. The computed compound valence scores from the sentiment analysis step were
utilized to measure the sentiments. According to Hutto & Gilber (2014), compound valence
scores can be negative, positive, and neutral. The interpretation table is shown in Table 2.3.
Accordingly, a score value greater than or equal to 0.05 is considered positive. A threshold of at
least 0.05 was set as the value of the market acceptance to consider a product idea as good. A new column named market_acceptance was created to store the values of the market acceptance.

Table 2.3 Interpretation of the compound valence scores
compound score >= 0.05 — Positive sentiment
compound score > -0.05 and compound score < 0.05 — Neutral sentiment
compound score <= -0.05 — Negative sentiment
Data preparation aims to prepare the dataset for modeling (Zhang, Zhang, Yang, 2003). In
this step, the dataset to be used for building the model was finalized. In particular, the results
obtained from the screening were utilized. Specifically, two columns were considered,
namely product_idea and label. The first column contains the validated product ideas, while the
second column specifies the label for that product idea. A good product idea has a label 1;
otherwise, a value of 0 was specified. Since ML algorithms do not understand text values, word
embedding must be applied (Ge & Moh, 2017). Word embedding is the process of converting
text into numerical values. This study utilized the Word2Vec encoding scheme to convert the
values of the product_idea column to their numerical equivalent. Research has shown that using
the Word2Vec with SVC helps achieve a high-performing model for text classification
(Lilleberg, Zhu, & Zhang, 2015). The implementation of word embedding is shown in Appendix
R, line 20. Once the dataset was prepared, it was divided into two subsets. The first subset was
used to train the model. It consists of 80% of the entire dataset. The second subset was reserved
for testing. The remaining 20% was used to test the performance of the trained model. The
dataset was split using the train_test_split function from the sklearn package. To
refine the dataset further, an undersampling technique was carried out using the NearMiss
package. The code used to create the training and testing subsets is shown in Appendix R, line
26.
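The 80/20 split can be sketched as below; X stands in for the Word2Vec vectors and y for the good/not-good labels, both randomly generated here for illustration. The NearMiss undersampling from the imbalanced-learn package is omitted from the sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and balanced labels standing in for the prepared dataset.
X = np.random.rand(100, 8)
y = np.array([0, 1] * 50)

# 80% for training, 20% reserved for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
print(len(X_train), len(X_test))  # 80 20
```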
Parameter tuning was performed to get the ideal values of the parameters of the SVC
algorithm. A model is rarely at its best when using only its default parameters (Smit & Eiben, 2009). The parameters must be fine-tuned to get their optimal values and create a high-performing model. Parameter tuning was done using the RandomizedSearchCV class from sklearn, and the results of the data preparation were utilized. It is a technique that randomly searches for the optimal parameters for a given model. The parameters tuned
and their possible values are listed in Table 2.4. This study considered four parameters of the
SVC model: 1) kernel, 2) gamma, 3) degree and 4) C. The first parameter specifies the kernel
type to be used in the algorithm. Second, gamma is the value of the kernel coefficient. The third
parameter is the degree of the polynomial kernel function. Lastly, the C parameter is the regularization parameter.

Table 2.4 Parameters of the SVC model and their possible values
# | Parameter | Values
1 | kernel | linear, rbf, poly
2 | gamma | 0.1, 1, 10, 100, 1000
3 | degree | 0, 1, 2, 3, 4, 5, 6
4 | C | 0.1, 1, 10, 100, 1000
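The search over the grid in Table 2.4 can be sketched as follows, run on small synthetic data for illustration; the real search used the prepared Word2Vec features, and the number of sampled candidates here is kept deliberately small.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV

# Candidate values from Table 2.4.
param_distributions = {
    "kernel": ["linear", "rbf", "poly"],
    "gamma": [0.1, 1, 10, 100, 1000],
    "degree": [0, 1, 2, 3, 4, 5, 6],
    "C": [0.1, 1, 10, 100, 1000],
}

# Tiny synthetic classification problem standing in for the real features.
rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = (X[:, 0] > 0.5).astype(int)

# Randomly sample 5 parameter combinations, 3-fold cross-validated each.
search = RandomizedSearchCV(SVC(), param_distributions,
                            n_iter=5, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```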
Performance evaluation is one of the critical activities in ML (Korde & Mahender, 2012).
Accordingly, it is not enough that a model can do prediction or classification; measuring its
performance is the most effective and reliable way to assess how well a model performs. Four
performance metrics, including (1) accuracy, (2) precision, (3) recall, and (4) f1-score were
calculated using the sklearn libraries to evaluate the performance of the model. Specifically, a
classification report was generated to get the precision, recall, and f1-score values, and cross-
validation was conducted to determine the accuracy. In addition, a confusion matrix was
generated to examine the results of the classification. The trained and tested model was saved to
be used in implementing the proposed screening application. The metrics used in performance evaluation are discussed in the succeeding paragraphs.
The first performance metric was the accuracy. It is the ratio of correctly predicted values to
the total number of observations. The cross-validation was carried out using the
cross_val_score method from the sklearn module. It is a technique that evaluates the
model's performance by dividing a dataset into n number of subsets to get the consistency in the
accuracy of a model; more particularly, 10-fold cross-validation was performed. The average
score was used as the final value of the accuracy. The implementation is shown in Appendix T,
line 24. The formula for getting the accuracy is shown below.
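The standard accuracy formula, consistent with the definition above, is given below, with TP, TN, FP, and FN denoting the true positives, true negatives, false positives, and false negatives, respectively:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```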
The second performance metric was the precision. It is defined as the ratio of the correctly
predicted values of a particular class to the total correctly and incorrectly predicted values of that
specific class. The classification_report module was used to get the precision score of the model.
In particular, a classification report was generated. The weighted average precision score was
considered as the precision of the model. The implementation is shown in Appendix T, line 21.
The third performance metric was the recall. It is the ratio of the correctly predicted values of a particular class to the total number of actual instances of that class. To get the value of the recall, a classification report was generated. The weighted
average recall score was considered as the recall of the model. The implementation is shown in
Appendix T, line 21. The formula for getting the recall is shown below.
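The standard recall formula, in the same TP/FN notation used for the accuracy, is:

```latex
\mathrm{Recall} = \frac{TP}{TP + FN}
```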
The fourth performance metric was the f1-score. It is the weighted harmonic mean of precision and recall. A classification report was generated to get the value of the f1-score. The weighted average f1-score was
considered as the f1-score of the model. The implementation is shown in Appendix T, line 21.
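The four metrics can be computed by hand from a confusion matrix, as the sketch below shows; the thesis obtains the same values via sklearn's classification_report and cross_val_score, and the counts here are illustrative only.

```python
# Illustrative confusion-matrix counts: true/false positives and negatives.
TP, TN, FP, FN = 40, 45, 5, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)   # correct over all observations
precision = TP / (TP + FP)                    # correct positives over predicted positives
recall = TP / (TP + FN)                       # correct positives over actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```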
An application is developed to solve an existing challenge or improve a current process (Burns & Dennis, 1985). In particular, web
development aims to create an application that can be accessed across the web using any device
with a browser. Various ML and web technologies were utilized to implement the proposed
screening application. Furthermore, the persisted SVC model was integrated and used in the
development. The technology stack considered in the implementation is shown in Table 2.5.
The table has two parts: the former shows the front-end technologies used and the tools for designing the User Interface (UI)
of the application. The latter provides a list of the different functions of the application and the
various Python libraries and frameworks used to construct them. The packages used in the
implementation are stored in the same venv used in the former steps and are listed in Appendix
A. These steps are further discussed in the succeeding sections. The system architecture of the application is likewise presented.
Over the years, the number of devices capable of accessing the Internet has grown. Gardner
(2011) reports that a new standard for web design has been proposed. According to the report, it
aims to standardize responsive web design to reduce the developer's workload by developing a
single application that can adapt across devices and improve user experience. The fundamental web technologies, including HTML, CSS, and JS, were used in designing the front-end. In
addition, Bootstrap and icons from Font Awesome were utilized. In particular, Bootstrap is a
CSS framework designed to make a web application responsive in any computing device with a
browser, including desktop computers, mobiles, and tablets. The design of the UI was kept simple.
More specifically, three pages were designed: search page, result page, and dashboard. The
first page was created to allow users to enter a product idea. The result page shows the results of
the screening. Finally, the dashboard provides a list of the good product ideas. The pages were
launched on different devices such as desktop computers, mobiles, and tablets enabled by
Google Chrome developer tools to assess the UI quality. Furthermore, the application was
opened in three major browsers, including Google Chrome and Mozilla Firefox, to check its compatibility.
The choice of the back-end framework was guided by studies on the technology stacks behind web applications (Kaluža, 2019). In recent years, a Python-based
microframework, Flask, has been the preferred web development platform for ML applications,
mainly because of its simplicity and ease of use (Aslam, Mohammed, & Lokhande, 2015; Mufid
et al., 2019). Flask is an open-source web framework for building data-driven and dynamic web
applications quickly. Flask and various Python libraries were utilized to develop the functions of
the screening application. The list of each function and its corresponding libraries is presented in the succeeding table. Seven functions were developed for the application: (1) reading the input, (2) preprocessing the input, (3) validating the input and
showing the error, (4) translating the input to English, (5) converting input into vector
representations, (6) classifying the input, and (7) showing the viable product ideas. These
# | Function | Description
1 | To read the input from the user | Accepts input from the user on the search page.
2 | To preprocess the input (convert input to lower case; remove special characters, emails, links, HTML tags, accented characters, repeated letters, and numerical values) | Removes noise words/characters from the input.
3 | To validate the input and show error messages | Validates that the input is a string of at least three characters, in English, and a noun. Error message/s appear if the input does not satisfy any of the above conditions.
4 | To translate non-English input to English | Ensures that only English and Tagalog inputs are accepted.
5 | To convert input to its vector representation | Performs word embedding on the input using the Word2Vec encoding scheme.
6 | To classify the input | Classifies whether an input is good and displays the results on the result page. The output comes from the trained text classification model.
7 | To show the product ideas | Displays viable product ideas on the dashboard.
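The validation function (function 3) can be sketched as below. It checks the length and character rules; the real application additionally verifies that the input is an English noun, which needs a POS tagger and is omitted here, and the error messages are illustrative wording.

```python
# Sketch of the input-validation rules: at least three characters, letters only.
def validate(user_input: str):
    errors = []
    if len(user_input.strip()) < 3:
        errors.append("Input must have at least three characters.")
    if not user_input.replace(" ", "").isalpha():
        errors.append("Input must contain letters only.")
    return errors  # empty list means the input passed these checks

print(validate("tv"))        # too short -> one error
print(validate("yoga mat"))  # passes -> []
```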
The flow of the application is shown in Figure 2.6. Its quality was tested with different
inputs, including numerical, non-English, Tagalog strings, and other possible invalid inputs.
Figure 2.6 A Flowchart of the screening application
Chapter 3: Results and discussions
This chapter presents and discusses the results of the study. It is divided into two main
sections: section 3.1) building the model and section 3.2) implementation. The first section
presents the dataset used in the study. Moreover, the performance of the model is revealed. In the
second section, the result of the implementation is shown. Specifically, the UI and the different
functions are highlighted. These sections are further explained in the succeeding paragraphs.
A model is an ML algorithm that has been trained using a particular dataset and uses the
acquired knowledge to identify some patterns to make a classification (Feldman, 2013). This
study built an SVC model that classifies a product idea as good or not good. This section is
divided into five subsections: section 3.1.1) data collection, section 3.1.2) data preprocessing,
section 3.1.3) constructing the criteria for screening, section 3.1.4) data preparation, and section
3.1.5) performance evaluation. First, the gathered dataset is shown, and significant statistics are
highlighted. In the second subsection, the preprocessed dataset is presented. The result of each
subprocess is explained. Next, the results of the screening using the extracted product ideas from
the tweets are discussed. Fourth, the prepared datasets used to train and test the model are presented. Lastly, the results of the performance evaluation are reported.
The datasets were gathered using an advanced scraping tool called Twint. It is an open-
sourced advanced tweet scraping tool often used in research to exploit publicly posted
information, including tweets (Dutch Osint Guy, 2018). The said tool collected tweets posted
within the Philippines from August to December 2020. It uses the coordinates and radius values
when considering which tweets to collect. This study used the coordinates of Marinduque
(12.072862,122.664139), the geographical center of the Philippines, and 768.54KM radius,
respectively. The value of the radius was found to be ideal, covering almost the entire territory of the country. Using a lower radius value resulted in collecting fewer tweets. On the other hand,
using a higher value tends to include tweets posted outside the Philippines, including tweets sent
from Malaysia in the southwest, Indonesia in the south, Vietnam in the west, and Taiwan and Japan in the north. Sample tweets collected outside the Philippines are shown in Table 3.1.
Table 3.1 Sample collected tweets when using a high radius value
Tweet | Country
@oogiri_zamurai なんでトイレのドアが開かないの? もうおれ我慢の限界なんだよね。 (Why won't the toilet door open? I'm at the limit of my patience.) | Japan
나는 네가 웃는 것을 보는 것을 좋아한다. 너는 또한 나를 행복하게 만들어 줘 오빠. 너는 나의 행복의 이유야. (I like watching you smile. You make me happy too, oppa. You are the reason for my happiness.) #예성버블 https://t.co/0baNrlGCcp | Korea
A total of 5,193,417 tweets were scraped during the data collection process. A few of the
collected tweets are shown in Appendix U. Each tweet has 35 attributes that give more
information about a particular tweet, including the date when a tweet is posted, the author, the
place where the tweet is posted, the language, and the number of retweets. The complete list of
all the attributes is shown in Appendix V. In particular, most of the gathered tweets were posted
in December 2020 with 1,194,486 (23%). Conversely, only 882,880 (17%) were collected in August 2020. The complete distribution of the collected tweets based on the month when they were posted is also presented.
Although the tweets were only limited to areas within the Philippines, 54 languages were
observed from the collected data. Specifically, more than half of the tweets are Tagalog (tl) with
53%. It is followed by English (en) with 30%. 9% consists of the other languages, including
Ukrainian (uk) and Spanish (es). The remaining 9% has no defined language. The percentage distribution of the languages is likewise presented.
Data preprocessing is an essential part of building a model. It aims to eliminate the noise in a
given dataset. The presence of the noise is detrimental to the performance of a model (Yang,
2018). The collected tweets went through a series of preprocessing steps, including (1) data
wrangling, (2) data reduction, (3) extracting the product ideas, (4) data annotation, and (5)
sentiment analysis. As a result, the dataset was reduced from over 5M rows and 35 columns to
2,926 rows with 6 columns. In particular, the columns comprise the annotated_product_idea, retweets_count, likes_count, replies_count, tweets_count, and polarity_score. The tweets were preprocessed using various NLP Python libraries. A few of the preprocessed tweets
are shown in Table 3.2. The results of data preprocessing are discussed in the succeeding
paragraphs.
Table 3.2 Sample preprocessed tweets
gym equipment shop | 4 | 1 | 1 | 16 | 0.3884
haircut | 4 | 1 | 12 | 1 | 0.0772
The first step in data preprocessing was data wrangling. It restructures raw data into
the desired format for data preprocessing. The collected tweets were transformed from a TSV
file format into a pandas dataframe. The transformation would allow the useful features of pandas to be used in the succeeding steps. The applied operation created a dataframe with 5,193,417 rows and 36 columns. A few of the contents of the created
dataframe is shown in Appendix U. The complete list of the columns is shown in Appendix V.
Second, data reduction was conducted to get rid of irrelevant data in the dataframe. It is a
technique to reduce the number of features in a dataset. Data reduction was made by reducing the
number of rows and columns of the created dataframe. In particular, the row selection only
considered English tweets. In column selection, only the columns that would be useful in getting
the engagements and sentiments of a product idea were included (Fischer & Reuber, 2011;
Sultana, Paul; Gavrilova, 2016; Anto et al., 2016; Ray & Chakrabarti, 2017; Geetha, Rekha, &
Rarthika, 2018; Costa et al., 2013). These include created_at, tweet, retweets_count,
replies_count, and likes_count columns. This step reduced the dimension of the dataframe into
1,557,442 rows and 5 columns. A few of the rows in the modified dataframe are shown in Table
3.3.
created_at | tweet | retweets_count | replies_count | likes_count
2020-10-18 09:20:19 PST | TRIPLE CHOCOLATED CHOCOLATE CAKE ROLLS 🤤😭 #sweets | 0 | 0 | 1
2020-11-27 18:53:17 PST | suggest me sum affordable gym equipment shops, i really need to lift some weights🥺 | 0 | 0 | 2
2020-12-31 23:59:58 PST | parang i want a haircut na. | 0 | 15 | 1
The third step to preprocess the dataset was to extract the product ideas from the tweets.
UGC produced in social media, including tweets, could be used to find and develop product
ideas (Nascimento & Da Silveira, 2017). The process of extracting the product ideas is further
divided into seven sub-steps, including (1) data precleaning, (2) pulling out the nouns, (3) data
cleaning, (4) removing stopwords, (5) lemmatization, (6) discarding duplicate words, and (7)
eliminating short words. The results of each sub-step are further discussed in the succeeding
paragraphs. As a result, a total of 118,387 product ideas were extracted from the tweets. Sample extracted product ideas are shown in Table 3.4.
The first sub-step to extract the product ideas was data precleaning. The tweets were initially
cleaned using regex and NumPy modules to remove noise words such as mentions, hashtags, and
links. A new column named precleaned_tweet was added to the modified dataframe to hold the
results of the data precleaning. Rows with null values in the newly created column
precleaned_tweet were dropped. Sample results are shown in Table 3.5. In particular, in the first
row, the hashtag #sweets was removed. In the second and third rows, the mention @jacob and the
link https://t.co/Ggcb0cJe0B were removed, respectively. This step further reduced the rows of
the dataframe.
Table 3.5 Sample precleaned tweets

tweet | precleaned_tweet
TRIPLE CHOCOLATED CHOCOLATE CAKE ROLLS 🤤😭 #sweets | TRIPLE CHOCOLATED CHOCOLATE CAKE ROLLS 🤤😭
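The precleaning sub-step can be sketched as follows. The exact patterns used in the study are not shown in the text, so these regular expressions are illustrative assumptions covering the three noise types it names (mentions, hashtags, and links).

```python
import re

import numpy as np
import pandas as pd

def preclean(tweet: str) -> str:
    """Remove mentions, hashtags, and links from a tweet (illustrative patterns)."""
    tweet = re.sub(r"@\w+", "", tweet)          # mentions, e.g. @jacob
    tweet = re.sub(r"#\w+", "", tweet)          # hashtags, e.g. #sweets
    tweet = re.sub(r"https?://\S+", "", tweet)  # links, e.g. https://t.co/...
    return tweet.strip()

df = pd.DataFrame({"tweet": [
    "want lasagna and lasagna only. https://t.co/Ggcb0cJe0B",
    "@jacob craving for cheese pizza",
    "#sweets",
]})
df["precleaned_tweet"] = df["tweet"].apply(preclean)

# Treat empty results as nulls and drop them, as described in the text.
df["precleaned_tweet"] = df["precleaned_tweet"].replace("", np.nan)
df = df.dropna(subset=["precleaned_tweet"])
print(df["precleaned_tweet"].tolist())
```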
The second sub-step to extract the product ideas was to pull out the nouns from the
precleaned_tweet column. Its result was used to determine which tweets contain possible
product ideas for the Philippines' SMEs. The nouns were selected because, generally, product
ideas are expressed through nouns (Malmasi & Dras, 2015). Three POS tagger libraries,
TextBlob, NLTK, and Spacy, were used, and their outputs were evaluated to select the best
library for pulling out the nouns. A comparison of the extracted nouns using the three libraries is
shown in Table 3.6. Among the three libraries, Spacy was the slowest at pulling out nouns,
taking 1 hour and 48 minutes, compared to TextBlob, which did the same task in 52 minutes and
39 seconds, and NLTK, which took 44 minutes and 15 seconds. The results of TextBlob and
NLTK are almost identical. Specifically, both often treated emojis, adverbs, and even upper-case
words as nouns, which leads to inaccurate noun extraction. Overall, Spacy got the most accurate
result and was chosen as the library. Sample results are shown in Table 3.7. A new column
named pulled_out_noun was introduced to the dataframe to store the pulled-out nouns. Rows
with null values in the pulled_out_noun column were removed from the dataframe. This step
further reduced the rows of the dataframe.
Table 3.6 Sample nouns extracted from the precleaned tweets using the three POS tagger
Table 3.7 Sample pulled out nouns from the precleaned tweet
The third sub-step to extract the product ideas was data cleaning. The pulled-out nouns were
cleansed using the preprocess_kgptalkie package. The said package provides, among others,
functions to remove any remaining unnecessary words and characters that were mistakenly
pulled out as nouns by the Spacy library. Each noun went through a series of preprocessing
tasks, including (1) converting words to lower case, (2) expanding contractions, (3) removing
emails, (4) removing html tags, (5) removing accented characters, and (6) removing repeated
and special characters. A new column named cleaned_noun was created to hold the results of
data cleaning. Sample cleaned nouns are shown in Table 3.8. In the table, the extracted noun was
converted to lower case in the first row. In addition, emojis such as the "cheese" and "heart" in
rows 1 and 2 were removed. Furthermore, in the second row, the repeated letters from the word
"cheese" were discarded. Finally, rows with null values were dropped from the dataframe. This
step further reduced the rows of the dataframe.
Next, the Tagalog (e.g., akin, ako, at, dapat) and English (e.g., could, or, and, rather)
stopwords were removed from the values of the cleaned_noun column, still using the Spacy
library. Stopwords are words that do not add much meaning to a sentence. A new column named
stopwords_free_noun was created to hold the results. A sample output is shown in Table 3.9.
The Tagalog stopword "na" in the sixth row was removed from the phrase "haircut na." Rows
with null values in the stopwords_free_noun column were eliminated from the dataframe. This
step managed to bring the number of rows of the dataframe down to 1,081,021.
Table 3.9 Sample cleaned nouns without the stopwords
tweet | cleaned_noun | stopwords_free_noun
@jacob craving for cheeeeeeeeese pizza 🧀🍕 in SM Baguio | cheese pizza sm | cheese pizza sm
want stir-fried spinach 😫❤ | spinach | spinach
want lasagna and lasagna only. 😭 https://t.co/Ggcb0cJe0B | lasagna lasagna | lasagna lasagna
suggest me sum affordable gym equipment shops, i really need to lift some weights🥺 | gym equipments shop | gym equipments shop
parang i want a haircut na. | haircut na | haircut
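The stopword removal can be sketched with spaCy's bundled stopword lists. This assumes the installed spaCy version ships both the English and the Tagalog `stop_words` modules; the word-by-word filtering shown here is an illustrative reading of the procedure.

```python
from spacy.lang.en.stop_words import STOP_WORDS as EN_STOPWORDS
from spacy.lang.tl.stop_words import STOP_WORDS as TL_STOPWORDS

# Combined English + Tagalog stopword set.
STOPWORDS = EN_STOPWORDS | TL_STOPWORDS

def remove_stopwords(noun_phrase: str) -> str:
    """Drop English and Tagalog stopwords from a cleaned noun string."""
    kept = [w for w in noun_phrase.split() if w.lower() not in STOPWORDS]
    return " ".join(kept)

print(remove_stopwords("haircut na"))          # the particle "na" should be dropped
print(remove_stopwords("could rather pizza"))  # English stopwords dropped
```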
The fifth sub-step to extract the product ideas was lemmatization. It is used to convert a word
into its base form (e.g., from equipments to equipment, from walked to walk, from mice to
mouse, from dove to dive). The values of the stopwords_free_noun column were lemmatized,
and a new column named lemmatized_noun was created to hold the results. A sample output is
shown in Table 3.10. In particular, the word "chocolated" from the first row was converted to
"chocolate." Furthermore, "rolls" was simplified to its singular form, "roll." Similarly, in the 5th
row, "gym equipments shop" was transformed to "gym equipment shop."
Table 3.10 Sample lemmatized nouns

stopwords_free_noun | lemmatized_noun
cheese pizza sm | cheese pizza sm
spinach | spinach
lasagna lasagna | lasagna lasagna
gym equipments shop | gym equipment shop
haircut | haircut
Since the lemmatization converted words into their base forms, duplicates were introduced in
the lemmatized_noun column. In sub-step 6, these duplicates were taken care of. A custom
function was created to remove the repeated words. A new column named unique_noun was
created to hold the results of the operation. A sample output is shown in Table 3.11. In
particular, the second "chocolate" was removed in the first row. Rows with null values in the
unique_noun column were dropped from the dataframe. This step further shrank the rows of the
dataframe.
Finally, in sub-step 7, short words were eliminated from the values of the unique_noun
column. These words consist of fewer than three characters. This was done to exclude words
that may not carry meaning. A new column named extracted_product_idea was created to store
the results. A sample output is shown in Table 3.12. The word "sm" in the second row has two
characters and therefore was removed. Rows with null values in the extracted_product_idea
column were removed from the dataframe. This further reduced the rows of the dataframe.
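Sub-steps 6 and 7 can be sketched as two small pure-Python functions. The text describes a custom function for duplicate removal; this is one plausible implementation, not the study's exact code.

```python
def remove_duplicates(noun_phrase: str) -> str:
    """Sub-step 6: drop repeated words while preserving first-seen order."""
    seen, unique = set(), []
    for word in noun_phrase.split():
        if word not in seen:
            seen.add(word)
            unique.append(word)
    return " ".join(unique)

def remove_short_words(noun_phrase: str, min_len: int = 3) -> str:
    """Sub-step 7: drop words with fewer than three characters, e.g. "sm"."""
    return " ".join(w for w in noun_phrase.split() if len(w) >= min_len)

print(remove_short_words(remove_duplicates("cheese pizza sm")))  # cheese pizza
print(remove_short_words(remove_duplicates("lasagna lasagna")))  # lasagna
```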
Going back to data preprocessing, the fourth step to accomplish was data annotation. This
step involves manually removing tweets that do not express a product idea and are therefore not
valid. The basis for considering a product idea as valid is the presence of emotion (e.g., hate,
want, love, disgust) expressed towards the noun in a tweet. According to Nascimento & Da
Silveira (2017), people use social media platforms to express their emotions on any particular
topic, including product ideas. A new column named label was added to the existing dataframe
to store the output of data annotation. A value of 1 indicates that the extracted product idea is
valid, while 0 means otherwise. Furthermore, since not all the extracted nouns are valid product
ideas, only those on the list of the most common product ideas for the Philippines' SMEs were
considered. The list of the product ideas for SMEs is shown in Appendix K. Another column
named annotated_product_idea was added to hold the validated product ideas.
The data annotation was done with the aid of BSIT undergraduate students. The dataframe,
with 522,943 rows and 12 columns, was divided into five subsets using the month when the
tweets were posted. This resulted in five subsets, namely, August.csv, September.csv,
October.csv, November.csv, and December.csv. The number of rows per subset before and after
data annotation is shown in Table 3.14. The first two subsets were assigned to the first student,
while the next two were delegated to the second. The remaining subset was reserved for the
researcher. The results made by the students were further validated by the researcher. Rows with
0 values in the label column were dropped from the dataframe, which brought the rows down to
11,280.
Next, sentiment analysis was carried out using a package called VADER. It is a package
designed for complex social media data. When getting the polarity scores, it considers
punctuations, emojis, slang, and abbreviated words that commonly appear in social media texts
(Hutto & Gilbert, 2014). A new column named polarity_score was added to the dataframe to
store the computed polarity scores. Rows with a polarity score of less than 0.05, representing
neutral and negative sentiments, were dropped. This left the dataframe with 2,926 rows. Sample
results are shown in Table 3.15.
Screening aims to evaluate new product ideas to determine which idea is worth investing in
(Rochford, 1991). A set of criteria was constructed to screen the values of the
annotated_product_idea column from the preprocessed dataset. The created criteria were based
on the factors considered by Baker & Albaum (1986). These criteria include potential market,
trend of demand, stability of demand, and market acceptance (see Table 2.2). A new dataframe
was created using the aggregated values of the tweets_count, retweets_count, likes_count,
replies_count, and polarity_score columns of the annotated dataframe with 522,943 rows (see
Table 3.12). The resulting dataframe comprises 2,926 rows and 5 columns. These columns
served as the basis for computing the screening criteria.
A new column named label was added to the new dataframe to hold the result of the
screening. The said column has two possible values. A value of 1 means that a product idea
passes the screening or is considered good. On the other hand, 0 implies that a product idea did
not satisfy at least one of the four criteria and, therefore, is not a good idea. A few of the results
of the screening are shown in Table 3.17. Out of the 2,926 annotated product ideas that were
screened, 2,145 failed, while the remaining 781 made it. For instance, acrylic, adobo, antibiotic,
yema cake, and airpod are considered good ideas because they satisfied all of the criteria for
screening. Specifically, acrylic has a potential market of 19,797.3%, which far exceeds the 33%
acceptable value. Its trend of demand, computed from November to December 2020, has an
upward direction. Moreover, its stability of demand, considered from September to October and
November to December, remains positive. Finally, its market acceptance has a positive
sentiment. The newly created dataframe was saved for the next step, which is data preparation.
The results of the screening in each criterion are further discussed in the succeeding paragraphs.
The first criterion considered in screening was the potential market. It is defined as an
estimate of the total market size for a particular product. It was measured through the tweet
engagement rate, defined as the ratio of the diffusion interactions (retweets, replies, and likes
count) to the total number of tweets (Muñoz-Expósito, 2017). A potential market of at least 33%
was set as the minimum value for considering a product idea to be viable (Mee, 2020). The
potential market values are stored under the column named potential_market. Out of the 2,926
product ideas, 1,482 passed the screening in terms of their potential market values. These
product ideas showed enough traction from possible customers to be considered viable product
ideas for a business venture. A few of these product ideas are shown in Table 3.18. Most
notably, acrylic, which is under the art accessories shop category, got the highest potential
market of all the screened product ideas.
Table 3.18 Sample product ideas that passed the potential market criterion

extracted_product_idea | potential_market
acrylic | 19,797.3%
adobo | 2,527%
antibiotic | 1,377%
yema cake | 101%
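The engagement-rate computation behind the potential market criterion can be sketched directly from its definition; the counts in the example are hypothetical.

```python
def potential_market(retweets: int, replies: int, likes: int, tweets: int) -> float:
    """Tweet engagement rate in percent: diffusion interactions over tweet volume."""
    return (retweets + replies + likes) / tweets * 100

def passes_potential_market(rate: float, minimum: float = 33.0) -> bool:
    """A product idea is considered viable when its engagement rate reaches 33%."""
    return rate >= minimum

# Hypothetical aggregated counts for one product idea.
rate = potential_market(retweets=50, replies=30, likes=120, tweets=100)
print(rate, passes_potential_market(rate))  # 200.0 True
```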
The second criterion considered for screening was the trend of demand. It refers to the
secular movement that describes the long-term direction of the data (e.g., upward or downward)
(Komlos, 1993). The trend of demand was monitored through the trendline's slope from August
2020 to December 2020 using the graphical method. A trend of demand greater than 0 was set as
the minimum value for considering a product idea to be viable (Komlos, 1993). The trend of
demand values were stored under the column named trend_of_demand. Out of the 2,926 product
ideas subject to screening, 2,908 passed the screening using the computed values of the trend of
demand. A few of these product ideas are shown in Table 3.19. These product ideas have seen an
increase in engagements in the last five months, from August 2020 to December 2020, which is
another way of saying that the number of people talking about these product ideas has grown. In
particular, adobo, yema cake, abaca face mask, and accessory shop are among the product ideas
with an upward trend of demand.
Table 3.19 Sample product ideas that passed the trend of demand criterion

annotated_product_idea | trend_of_demand
acrylic | 4.9
adobo | 5.0
antibiotic | 3.25
yema cake | 5
airpod | 4.91
accessory shop | 5
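The trendline slope behind this criterion can be sketched numerically with a least-squares fit; the study used the graphical method, so this NumPy version is an equivalent illustration, and the monthly volumes are hypothetical.

```python
import numpy as np

def trend_of_demand(monthly_counts) -> float:
    """Slope of a least-squares trendline fitted to the monthly tweet volumes."""
    months = np.arange(len(monthly_counts))  # Aug 2020 .. Dec 2020 -> 0..4
    slope, _intercept = np.polyfit(months, monthly_counts, deg=1)
    return float(slope)

# Hypothetical monthly volumes for one product idea; a positive slope passes.
print(trend_of_demand([10, 14, 19, 25, 30]) > 0)  # True
```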
The third criterion used in screening was the stability of demand. It describes the fluctuation
in the market size over a particular period. It was measured by getting the changes in the tweet
volume from November to December 2020. A stability of demand of at least 15% was set as the
minimum value for considering a product idea to be viable (Baremetrics, 2021). The stability of
demand values are stored under the column named stability_of_demand. Of the 2,926 screened
product ideas, 864 got a stable demand. A few of these product ideas are shown in Table 3.20.
The list shows that product ideas such as acrylic, adobo, antibiotic, yema cake, and airpod have
a consistent demand. All items in the list got a 15% score, which is the minimum value needed
to pass the criterion.
Table 3.20 Sample product ideas that passed the stability of demand criterion

annotated_product_idea | stability_of_demand
acrylic | 15%
adobo | 15%
antibiotic | 15%
yema cake | 15%
airpod | 15%
Finally, market acceptance is the degree of acceptance of the possible customers towards a
particular product idea. It was assessed by getting the polarity scores of the extracted product
ideas using the VADER package. The compound valence scores were considered for the value
of the market acceptance (Hutto & Gilbert, 2014). A market acceptance of at least 0.05 was set
as the threshold to consider a product idea to be viable. The market acceptance values are stored
under the column named market_acceptance. In particular, out of the 2,926 evaluated product
ideas, 1,272 got a favorable result in terms of the market acceptance criterion. A few of these
product ideas are shown in Table 3.21. Most notably, people view yema cake and acrylic
positively.
Table 3.21 Sample product ideas that passed the market acceptance criterion

annotated_product_idea | market_acceptance
acrylic | 0.444
adobo | 0.05
antibiotic | 0.05
Data preparation is the last step before the modeling (Zhang, Zhang, & Yang, 2003). The
columns of the screened dataframe, namely potential_market, trend_of_demand,
stability_of_demand, market_acceptance, and the label, were prepared using various packages in
Python. The prepared dataset has 780 good product ideas and 2,146 not good ones. The
vectorized product ideas were used as the input/independent variables, while the label values
were the output/dependent variable. Moreover, the prepared dataset was divided into two
subsets in an 80:20 ratio to create the training and testing datasets, respectively. In particular, the
training dataset contains 1,248 inputs, while the testing dataset has 312. A few of the prepared
inputs and outputs are shown below.
input | output
[0.36797234, 0.042480964, 0.099437095, -0.1050...] | 1
The product ideas were transformed into numerical values through word embedding using the
Word2Vec package. The said package converts strings into their vector representations. Each
vectorized input comprises an array of 300 numerical values; the sample above shows only the
first four values of the array.
A model rarely performs at its best using only its default parameters (Smit & Eiben, 2009).
The ideal values of the parameters of the model were identified through parameter tuning using
the RandomizedSearchCV package. The prepared dataset shown in the previous step was fed to
the model. The parameters considered were kernel, gamma, degree, and C. A total of 10
candidates were found and tested out in threefold cross-validation, totaling 30 fits. The values
were obtained in 1.5 minutes, and the results are shown in Table 3.24.
Table 3.24 Results of the parameter tuning

parameter | result
kernel | rbf
gamma | 100
degree | 6
C | 100
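The tuning setup can be sketched with scikit-learn. The candidate value grids and the random stand-in data are assumptions for illustration; the 10 candidates and 3-fold cross-validation (30 fits) follow the text.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Hypothetical stand-in for the vectorized training inputs and labels.
X = rng.normal(size=(60, 300))
y = rng.integers(0, 2, size=60)

# Candidate values are illustrative assumptions, not the study's exact grids.
param_distributions = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "gamma": [0.001, 0.01, 0.1, 1, 10, 100],
    "degree": [2, 3, 4, 5, 6],
    "C": [0.1, 1, 10, 100],
}

# 10 candidates x 3-fold cross-validation = 30 fits, as described in the text.
search = RandomizedSearchCV(SVC(), param_distributions, n_iter=10, cv=3, random_state=1)
search.fit(X, y)
print(search.best_params_)
```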
Performance evaluation was performed to assess how well the model classifies a product
idea. The proposed model was evaluated in four performance metrics: (1) accuracy, (2)
precision, (3) recall, and (4) f1-score using the balanced prepared dataset. Specifically, a
confusion matrix was generated to examine the number of correctly classified values and verify
the cross-validation results. Moreover, a classification report was generated and used to assess
the model's precision, recall, and f1-score values. The performance of the model is further
discussed in the succeeding paragraphs.
Accuracy was the first performance metric used in measuring the performance of the model.
Accuracy is the ratio of correctly predicted values to the total number of observations.
Cross-validation was performed to determine the consistency in the accuracy of the model, and
the average accuracy was considered the final accuracy. The cross-validation was carried out
using the 2,926 extracted product ideas, comprising 1,248 training and 312 testing samples. The
accuracy of the model started low at 49%. The score jumped to almost twice that value on the
second fold. It continued to fluctuate until it reached its peak of 88% on the fourth fold. Overall,
the average accuracy rate of 81% was considered the final score.
fold | accuracy
1st | 0.48717949
2nd | 0.85897436
3rd | 0.80128205
4th | 0.88461538
5th | 0.84615385
6th | 0.85256410
7th | 0.80769231
8th | 0.81410256
9th | 0.82692308
10th | 0.87820513
Average | 0.805769
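The fold-by-fold evaluation can be sketched with scikit-learn's cross_val_score; the random stand-in data is an assumption, while the tuned parameter values follow Table 3.24.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Random stand-in for the vectorized product ideas; the real study used
# 2,926 ideas embedded as 300-value arrays.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
y = rng.integers(0, 2, size=100)

# 10-fold cross-validation; the mean of the fold accuracies is taken as the
# final score, mirroring how the ten fold scores were averaged in the text.
model = SVC(kernel="rbf", gamma=100, C=100)
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(len(scores), scores.mean())
```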
Moreover, a total of 312 product ideas were used to further examine the performance of the
model by generating a confusion matrix. In particular, 160 of these are not good, while 152 are
good. Based on the assessments, the model correctly classified 116 as good product ideas and
147 as not good product ideas. Furthermore, a minimal 13 inputs were false positives, or type 1
errors. However, a larger portion of the misclassifications, 36 inputs, were false negatives, or
type 2 errors.
Precision was the second performance metric used in measuring the performance of the
model. It is computed to get the ratio of the correctly predicted values of a particular class to the
total correctly and incorrectly predicted values of that particular class. The score of the precision
is shown in Table 3.27. The weighted average was considered in measuring the precision of the
model. The report shows the model's performance using the 312 inputs from the testing subset. It
consists of 160 not-good and 152 good product ideas. Accordingly, the model has a 90%
precision rate when classifying a product idea as not good and 80% when it is good. Overall, it
got an 85% precision rate and was considered as the final value.
Recall is the third performance metric used in measuring the performance of the model. It is
calculated as the ratio of the correctly predicted values of a particular class to the total number
of actual values of that class. The score of the recall is shown in Table 3.28. The weighted
average was considered in measuring the recall score. The report shows the model's
performance using the 312 inputs from the testing subset. It consists of 160 not-good and 152
good product ideas. Accordingly, the model has a 92% recall rate when classifying a product
idea as not good and 76% when it is good. Overall, it got an 84% recall rate, which was
considered as the final value.
Finally, the f1-score is simply the weighted average of precision and recall. The f1-score is
shown in Table 3.28. The weighted average was considered in measuring the f1-score. The
report shows the model's performance using the 312 inputs from the testing subset. It consists of
160 not-good and 152 good product ideas. Accordingly, the model has an 86% f1-score when
classifying a product idea as not good and 83% when it is good. Overall, it got an 84% f1-score,
which was considered as the final value.
Overall, the performance evaluation shows that the accuracy of the model is at 84%.
Furthermore, it got a precision rate of 85%, a recall rate of 84%, and an f1-score of 84%. These
results are summarized in Table 3.28. The trained model was exported and used in the developed
application.
3.2. Implementation
The application was designed and developed using various ML and web technologies,
including HTML, CSS, and JavaScript, and the micro web framework Flask. This section is
presented in two parts: section 3.2.1) front-end development and section 3.2.2) back-end
development. First, screenshots of the UIs are shown. Afterward, the different functions are
discussed.
The UI of the application was implemented using Bootstrap and Flask’s templating engine,
Jinja2. In particular, Bootstrap was used to make the design responsive to any computing device,
including desktop computers, mobiles, and tablets. Furthermore, icons from Fontawesome were
added to improve the aesthetics. The created UIs were tested out on the different sizes of devices
enabled by Google Chrome developer tools. Finally, UIs were viewed in three major browsers
including Google Chrome, Mozilla Firefox, and Microsoft Edge. In particular, three pages were
created for the application, namely, the screen, result, and dashboard pages. Sample screenshots
of these pages are shown in the succeeding figures.
The screen page is the first page of the application. This page would allow Philippines'
SMEs owners or representatives to see the application's different features, such as entering their
new product ideas to the search bar, a link to see good product ideas through the dashboard, and
others. A screenshot of the screen page is shown in Figure 3.3. Moreover, when a user enters
invalid or incorrect input, the user is redirected back to the screen page. The nature of the error/s
and tips to prevent them would pop up. Screenshots of the screen page with invalid and valid
inputs are also shown.
The result page is displayed when the entered input is valid and there are no errors found.
This is the page where the result of the screening is revealed. It indicates whether the entered
product idea is good or not. A screenshot of the result page for face mask is shown in Figure 3.6.
Other sample screening results are shown in the succeeding figures.
The dashboard page shows the list of good product ideas. The list in this page is sorted
alphabetically. Other information, including the category, potential market, stability of demand,
and market acceptance, is also shown. A screenshot of the dashboard page is also provided.
The application functions were developed using Flask and a set of Python packages. A total
of seven functions were constructed, including reading the input, preprocessing the input,
validating the input and showing the error, translating the input to English, converting the input
into vector representations, classifying the input, and showing the good product ideas. The
screening application was tested with different inputs, such as numerical, non-English, and
Tagalog strings, and other possible invalid inputs. Sample invalid inputs with their error
messages, and classified inputs, are shown in the following tables.
Input | Error
123123123 | Your input is not valid. Please enter a string.
Gahmsahhahmnida | Your language is not supported. Please enter a Tagalog/English string only.
as | Your input is too short. Try longer string.
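The validation behind these error messages can be sketched as a plain function; the numeric check and the three-character minimum are assumptions inferred from the sample errors, and the language check (which would need a language-detection library) is omitted here.

```python
from typing import Optional

def validate_input(user_input: str) -> Optional[str]:
    """Return an error message for an invalid product idea, or None when usable.

    The messages mirror the sample errors; the exact checks the study used
    are not shown in the text, so these rules are illustrative assumptions.
    """
    text = user_input.strip()
    if text.isdigit():
        return "Your input is not valid. Please enter a string."
    if len(text) < 3:
        return "Your input is too short. Try longer string."
    return None  # the input can proceed to translation and classification

print(validate_input("123123123"))
print(validate_input("as"))
print(validate_input("yema cake"))
```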
Input | Output
acrylic | Good idea
adobo | Not a good idea
antibiotic | Good idea
yema cake | Not a good idea
airpod | Not a good idea
abaca face mask | Good idea
accessory shop | Good idea
Chapter 4: Conclusions and recommendations
4.1. Conclusions
This study developed a tool that utilizes UGC produced in social media to assist
Philippine-based SMEs in developing product ideas. This project was done in two steps. First, a
supervised ML model was trained to classify product ideas into good or not good. Second, an
application that uses the trained model to screen product ideas was developed.
The model was built using various Python libraries (see Appendix A). A few of the packages
used are pandas, vader, spacy, and scikit-learn. The dataset used to train the said model consists
of UGC on Twitter. In particular, over 5 million tweets were collected over five months (August-
December 2020) using an advanced scraping tool, Twint. The data collection was performed in
a Linux environment which was configured using Oracle VM. The collected tweets went through
a series of preprocessing steps, including data precleaning, pulling out the nouns from the tweets,
and removing the stopwords, to prepare the datasets for modeling. In addition, data annotation
was conducted to ensure that only valid product ideas were included in the final dataset. The
resulting dataset was used to perform the screening; in particular, the extracted product ideas
from the tweets were evaluated through engagement and sentiment information obtained from
the tweets, based on four criteria: potential market, trend of demand, stability of demand, and
market acceptance (Baker & Albaum, 1986). The results are represented by two values. A value of 1
indicates that a product idea is good, while 0 says otherwise. These data were used as the dataset
to feed the model. Since an ML algorithm cannot process text, the product ideas were
transformed into vector representations using the Word2Vec encoding scheme. The vectorized
extracted product ideas were used as the input. On the other hand, the label (e.g., 1 or 0) was
used as the output. The dataset was divided with a ratio of 80:20 to train and test the model,
respectively. The model was then evaluated in terms of its accuracy, precision, recall,
and f1-score. The evaluation shows that the model achieves an 84% accuracy rate when
classifying a product idea (e.g., good or not good). The model was saved and used in the
developed application.
The application was implemented using Flask, and its UIs were designed using the CSS
framework, Bootstrap. The developed application would allow SMEs’ owners or any
representative to enter and screen their product ideas. Specifically, the tool consists of seven
features, including reading the input, validating the input and showing the error, converting input
into vector representations, and classifying the input. Three web pages were designed to show
the features of the application, namely, screen page, result page, and dashboard. The application
was tested by launching it on different devices, including phones, tablets, and computers, using
Google Chrome developer tools.
This study would help contribute to the limited literature that explores social media data in
developing new product ideas (Nascimento & Da Silveira, 2017). It supports the findings of
Rathore & Ilavarasan (2020) regarding obtaining insights on user preferences from social media.
It demonstrates how tweets on Twitter could be used in screening product ideas for Philippines’
SMEs. The screening process is part and parcel of the seven-stage process of NPD. Other factors
need to be considered when conducting NPD, such as new product strategy development,
business analysis, development, and commercialization (Booz, & Allen & Hamilton, 1982). The
proposed application does not intend to replace the screening process; instead, it would
complement it. Specifically, it would provide the Philippines' SMEs a platform to quickly evaluate
and select which product ideas to consider for their next business venture. Furthermore, the
application would also allow the rising numbers of SMEs putting up a business in the online
platform economy nationwide to look for product ideas to sell (Villanueva, 2020).
4.2. Recommendations
The researcher gathered over 5 million tweets posted within the Philippines that covers a
five-month period, from August to December 2020. Only the 1,557,442 English tweets from the
entire dataset were considered. Future studies could collect more tweets by allowing a broader
time period of data collection. Also, later research could include tweets with Tagalog language
and other dialects in the Philippines. Furthermore, collecting real-time tweets could also help
provide more up-to-date insights regarding product ideas that could be critical in the screening
process. Additionally, other sources of UGC such as Facebook, Instagram, and YouTube could
also be included. This study only considered text-based UGC. Considerations for UGC in other
formats such as images can be explored. Interestingly, Instagram has been the most preferred
social media platform by consumers because of its ease of use and less complicated way of
viewing feedback (Smith, Fischer, & Yongjian, 2012). It has also been recommended that future
research examine SM platforms that are driven by images for specific types of product ideas.
Moreover, out of the 1,557,442 tweets considered in the study, a total of 2,926 product ideas
were extracted through data annotation. It was performed with the aid of two undergraduate
BSIT students. The preprocessed nouns that were pulled out from the tweets were marked as 1
for valid and 0 for invalid. The basis for considering a product idea as valid is the presence of
emotion (e.g., hate, want, love, disgust) expressed towards the noun in a tweet. Furthermore,
since not all nouns are product ideas, only those on the list of the most common product ideas
particular to the Philippines' SMEs were considered. For example, MoneyMax.com (2021)
published a list of small business ideas with capital in 2021. The researcher further validated the
results in a limited period of time. Similar studies could incorporate marketing experts when
annotating the product ideas, and over a more considerable amount of time, to have a more
in-depth analysis.
Furthermore, the annotated product ideas were transformed into numerical values by getting
their vector representations using the Word2Vec encoding scheme. A study suggests that
combining Word2Vec with TF-IDF tends to yield better results (Lilleberg, Zhu, & Zhang, 2015).
TF-IDF is a bag-of-words technique that assigns a numerical value to each word based on how
frequently it appears in a given collection of documents; the lower the frequency of a word
across the collection, the higher its value. Future studies could further enhance the model's
performance by using the features of both TF-IDF and Word2Vec when transforming the
extracted product ideas into their vector representations.
In addition, the study was able to build a model that classifies a particular product idea into
good or not good using the collected UGC on Twitter. However, the model was only trained
once, and its classification depends solely on a limited training dataset. Meanwhile, consumer
preferences are constantly changing; what are considered good product ideas today may no
longer be so next month. Future studies can consider implementing a mechanism to allow the
model to continuously learn as it accepts input from users. This would give the model more
intelligence over time.
Finally, a supervised ML algorithm, SVC, was used to build the model for classifying
product ideas. The parameters of the model were identified using RandomizedSearchCV. The
model's accuracy could be improved by getting the optimal parameter values using
GridSearchCV (Batayev, 2019). It is a technique that searches all the possible combinations of
the parameters to get the optimal values. This study could also be extended by using
unsupervised learning; alternatively, the SVC model could be compared against an unsupervised
learning approach.
References

Agrawal, A., & Bhuiyan, N. (2014). Achieving success in NPD projects. International Journal of
Social, Behavioral, Educational, Economic, Business and Industrial Engineering, 8(2),
476-481.
Akram Afzal, M. (2017). Risks in new product development (NPD) projects.
Albar, F. M. (2013). An Investigation of Fast and Frugal Heuristics for New Product Project
Selection.
Bahtar, A. Z., & Muda, M. (2016). The impact of User–Generated Content (UGC) on product
reviews towards online purchasing–A conceptual framework. Procedia Economics and
Finance, 37, 337-342.
Baker, K. G., & Albaum, G. S. (1986). Modeling new product screening decisions. Journal of
Product Innovation Management, 3(1), 32-39.
Batayev, N. (2019). Gas turbine fault classification based on machine learning supervised
techniques. In 2018 14th International Conference on Electronics Computer and
Computation (ICECCO) (pp. 206-212). IEEE.
Bayanihan to Heal As One Act. RA 11469. 18th Cong. (2020). Retrieved from
https://www.officialgazette.gov.ph/downloads/2020/03mar/20200324-RA-11469-
RRD.pdf
Bhimani, H., Mention, A. L., & Barlatier, P. J. (2019). Social media and innovation: A
systematic literature review and future research directions. Technological Forecasting and
Social Change, 144, 251-269.
Booz, Allen & Hamilton. (1982). New products management for the 1980s. Booz, Allen &
Hamilton.
Brainkart. (2021). Methods of measuring secular trend. Retrieved from
https://www.brainkart.com/article/Methods-of-Measuring-Secular-Trend_39269/
Carbonell, P., Mayer, M. A., & Bravo, À. (2015, January). Exploring brand-name drug mentions
on Twitter for pharmacovigilance. In MIE (pp. 55-59).
Chenworth, M., Perrone, J., Love, J. S., Graves, R., Hogg-Bremer, W., & Sarker, A. (2021).
Methadone and Suboxone® mentions on Twitter: Thematic and sentiment analysis.
Clinical Toxicology, 1-10.
Chu, S. C., & Kim, Y. (2011). Determinants of consumer engagement in electronic word-of-
mouth (eWOM) in social networking sites. International Journal of Advertising, 30(1), 47-75.
CMO’s Use of Social Media During COVID-19: For what purpose has your firm used social
media during the pandemic? MarketingCharts.com. (May, 2020). Retrieved from
https://www.marketingcharts.com/charts/cmos-use-of-social-media-during-covid-
19/attachment/cmosurvey-cmo-use-of-social-media-during-covid-19-jul2020
Costa, J., Silva, C., Antunes, M., & Ribeiro, B. (2013, April). Defining semantic meta-hashtags
for twitter classification. In International Conference on Adaptive and Natural Computing
Algorithms (pp. 226-235). Springer, Berlin, Heidelberg.
Dadgar, S. M. H., Araghi, M. S., & Farahani, M. M. (2016, March). A novel text mining
approach based on TF-IDF and Support Vector Machine for news classification. In 2016
IEEE International Conference on Engineering and Technology (ICETECH) (pp. 112-116).
IEEE.
De Brentani, U., & Droge, C. (1988). Determinants of the new product screening decision: A
structural model analysis. International Journal of Research in Marketing, 5(2), 91-106.
Dhaoui, C., Webster, C. M., & Tan, L. P. (2017). Social media sentiment analysis: lexicon versus
machine learning. Journal of Consumer Marketing.
Employment Situation in April 2020. (2020, June 5). Philippines Statistics Authority. Retrieved
from https://psa.gov.ph/statistics/survey/labor-and-employment/labor-force-
survey/title/Employment%20Situation%20in%20April%202020
Effendi, M. I., Sugandini, D., & Istanto, Y. (2020). Social Media Adoption in SMEs Impacted by
COVID-19: The TOE Model. The Journal of Asian Finance, Economics, and Business,
7(11), 915-925.
Ford, P., & Terris, D. (2017). NPD, design and management for SME's. The Design Society.
Ge, L., & Moh, T. S. (2017, December). Improving text classification with word embedding.
In 2017 IEEE International Conference on Big Data (Big Data) (pp. 1796-1805). IEEE.
Gharib, T. F., Habib, M. B., & Fayed, Z. T. (2009). Arabic Text Classification Using Support
Vector Machines. Int. J. Comput. Their Appl., 16(4), 192-199.
Global User Generated Content (UGC) Software Market Size, Status and Forecast 2020-2026.
MarketResearch.com. (2020). Retrieved from
https://www.marketresearch.com/QYResearch-Group-v3531/Global-User-Generated-
Content-UGC-13615161/
Guntuku, S. C., Sherman, G., Stokes, D. C., Agarwal, A. K., Seltzer, E., Merchant, R. M., &
Ungar, L. H. (2020). Tracking mental health and symptom mentions on twitter during
covid-19. Journal of general internal medicine, 35(9), 2798-2800.
Hayon, S., Tripathi, H., Stormont, I. M., Dunne, M. M., Naslund, M. J., & Siddiqui, M. M.
(2019). Twitter mentions and academic citations in the urologic literature. Urology, 123,
28-33.
Hughes, G. D., & Chafin, D. C. (1996). Turning new product development into a continuous
learning process. Journal of Product Innovation Management, 13(2), 89-104.
Impact of COVID-19 pandemic on micro, small, and medium enterprises (MSMEs). MSC.
(June, 2020). Retrieved from https://www.microsave.net/wp-content/uploads/2020/08/
Impact-of-COVID-19-on-Micro-Small-and-Medium-Enterprises-MSMEs-1.pdf
Islam, M. S., Jubayer, F. E. M., & Ahmed, S. I. (2017, February). A support vector machine
mixed with TF-IDF algorithm to categorize Bengali document. In 2017 international conference
on electrical, computer and communication engineering (ECCE) (pp. 191-196). IEEE.
Jeong, E., & Jang, S. S. (2011). Restaurant experiences triggering positive electronic word-of-
mouth (eWOM) motivations. International Journal of Hospitality Management, 30(2),
356-366.
Jespersen, K. R. (2007). Is the screening of product ideas supported by the NPD process
design? European Journal of Innovation Management.
Jiang, J. X., & Shen, M. (2017, May 1). Traditional media, Twitter and business scandals.
Joachims, T. (1998, April). Text categorization with support vector machines: Learning with
many relevant features. In European conference on machine learning (pp. 137-142). Springer,
Berlin, Heidelberg.
Kaluža, M., Kalanj, M., & Vukelić, B. (2019). A comparison of back-end frameworks for web
application development. Zbornik veleučilišta u rijeci, 7(1), 317-332.
Komlos, J. (1993). The secular trend in the biological standard of living in the United Kingdom,
1730‐1860 1. The Economic History Review, 46(1), 115-144.
Korde, V., & Mahender, C. N. (2012). Text classification and classifiers: A survey. International
Journal of Artificial Intelligence & Applications, 3(2), 85.
Kowsari, K., Jafari Meimandi, K., Heidarysafa, M., Mendu, S., Barnes, L., & Brown, D. (2019).
Text classification algorithms: A survey. Information, 10(4), 150.
Krumm, J., Davies, N., & Narayanaswami, C. (2008). User-generated content. IEEE Pervasive
Computing, 7(4), 10-11.
Kumar, S., Koolwal, V., & Mohbey, K. K. (2019). Sentiment analysis of electronic product
tweets using big data framework. Jordanian Journal of Computers and Information Technology
(JJCIT), 5(01).
Kurnia, R., Tangkuman, Y., & Girsang, A. (2020). Classification of User Comment Using
Word2vec and SVM Classifier. Int. J. Adv. Trends Comput. Sci. Eng, 9, 643-648.
Lee, I., & Shin, Y. J. (2020). Machine learning for enterprises: Applications, algorithm selection,
and challenges. Business Horizons, 63(2), 157-170.
Lee, M., & Youn, S. (2009). Electronic word of mouth (eWOM): How eWOM platforms
influence consumer product judgement. International Journal of Advertising, 28(3), 473-
499.
Lilleberg, J., Zhu, Y., & Zhang, Y. (2015, July). Support vector machines and word2vec for text
classification with semantic features. In 2015 IEEE 14th International Conference on
Cognitive Informatics & Cognitive Computing (ICCI* CC) (pp. 136-140). IEEE.
Magna Carta for Micro, Small and Medium Enterprises (MSMEs). RA 9501. 14th Cong. (2008).
Retrieved from https://www.officialgazette.gov.ph/2008/05/23/republic-act-no-9501/
Money Max. (2021, June 23). 32 Micro and Small Business Ideas You Can Start with Low
Capital in 2021. Retrieved from
https://www.moneymax.ph/personal-finance/articles/small-business-ideas-philippines
Mu, J., Peng, G., & Tan, Y. (2007). New product development in Chinese SMEs: Key success
factors from a managerial perspective. International Journal of Emerging Markets, 2(2),
123-143.
De Vera, B. (2020, May 23). Neda: PH economy lost P1.1T during lockdown. Retrieved
from https://business.inquirer.net/298037/neda-ph-economy-lost-p1-1t
Nascimento, A. M., & Da Silveira, D. S. (2017). A systematic mapping study on using social
media for business process improvement. Computers in Human Behavior, 73, 670-675.
Onarheim, B., & Christensen, B. T. (2012). Distributed idea screening in stage–gate
development processes. Journal of Engineering Design, 23(9), 660-673.
Osisanwo, F. Y., Akinsola, J. E. T., Awodele, O., Hinmikaiye, J. O., Olakanmi, O., & Akinjobi,
J. (2017). Supervised machine learning algorithms: classification and
comparison. International Journal of Computer Trends and Technology (IJCTT), 48(3),
128-138.
Owens, J. D. (2007). Why do some UK SMEs still find the implementation of a new product
development process problematical? Management Decision.
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using
machine learning techniques. arXiv preprint cs/0205070.
Pantano, E., Giglio, S., & Dennis, C. (2019). Making sense of consumers’ tweets. International
Journal of Retail & Distribution Management.
Park, C., & Lee, T. M. (2009). Information direction, website reputation and eWOM effect: A
moderating role of product type. Journal of Business research, 62(1), 61-67.
Pennebaker, J. W., Boyd, R. L., Jordan, K., & Blackburn, K. (2015). The development and
psychometric properties of LIWC2015.
Prantl, D., & Mičík, M. (2019). Analysis of the significance of eWOM on social media for
companies.
Schreiner, C., Torkkola, K., Gardner, M., & Zhang, K. (2006, October). Using machine learning
techniques to reduce data annotation time. In Proceedings of the Human Factors and
Ergonomics Society Annual Meeting (Vol. 50, No. 22, pp. 2438-2442). Sage CA: Los
Angeles, CA: SAGE Publications.
Mee, G. (2020, November 19). What is a good engagement rate on Twitter? Scrunch. Retrieved
from https://scrunch.com/blog/what-is-a-good-engagement-rate-on-twitter
Small Business Wage Subsidy Program. Department of Finance. (April, 2020). Retrieved from
https://sites.google.com/dof.gov.ph/small-business-wage-subsidy
Soukhoroukova, A., Spann, M., & Skiera, B. (2012). Sourcing, filtering, and evaluating new
product ideas: An empirical exploration of the performance of idea markets. Journal of
product innovation management, 29(1), 100-112.
Sultana, M., Paul, P. P., & Gavrilova, M. (2016). Identifying users from online interactions in
Twitter. In Transactions on Computational Science XXVI (pp. 111-124). Springer,
Berlin, Heidelberg.
Takeuchi, H., & Nonaka, I. (1986). The new new product development game. Harvard Business
Review, 64(1), 137-146.
Tankova, H. (2021). Number of global social media network users 2017-2025. Statista.
Retrieved from https://www.statista.com/statistics/278414/number-of-worldwide-social-
network-users/
Trackmyhashtag. (2021). Twitter hashtag statistics. Retrieved from
https://www.trackmyhashtag.com/hashtag-stats
Tufekci, Z. (2014, May). Big questions for social media big data: Representativeness, validity
and other methodological pitfalls. In Proceedings of the International AAAI Conference on
Web and Social Media (Vol. 8, No. 1).
Vaishnav, M., Dalal, P. K., & Javed, A. (2020). When will the pandemic end?. Indian Journal of
Psychiatry, 62(Suppl 3), S330.
Vancic, A., & Pärson, G. F. A. (2020). Changed Buying Behavior in the COVID-19 pandemic:
the influence of Price Sensitivity and Perceived Quality.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory (2nd ed.). Springer Verlag. Pp.
1-20. Retrieved from website:
https://www.andrew.cmu.edu/user/kk3n/simplicity/vapnik2000.pdf
Verworn, B., Herstatt, C., & Nagahira, A. (2006). The impact of the fuzzy front end on new
product development success in Japanese NPD projects (No. 39). Working Paper.
Villanueva, J. (2020). Rise of online shopping nationwide, not just from Metro. Philippine News
Agency. Retrieved from https://www.pna.gov.ph/articles/1112078
Wang, Z. Q., Sun, X., Zhang, D. X., & Li, X. (2006, August). An optimal SVM-based text
classification algorithm. In 2006 International Conference on Machine Learning and
Cybernetics (pp. 1378-1381). IEEE.
Weeg, C., Schwartz, H. A., Hill, S., Merchant, R. M., Arango, C., & Ungar, L. (2015). Using
Twitter to measure public discussion of diseases: a case study. JMIR public health and
surveillance, 1(1), e6.
Yin, Z., Fabbri, D., Rosenbloom, S. T., & Malin, B. (2015). A scalable framework to detect
personal health mentions on Twitter. Journal of medical Internet research, 17(6), e138.
Zhang, D., Xu, H., Su, Z., & Xu, Y. (2015). Chinese comments sentiment classification based on
word2vec and SVMperf. Expert Systems with Applications, 42(4), 1857-1863.
Zhang, S., Zhang, C., & Yang, Q. (2003). Data preparation for data mining. Applied artificial
intelligence, 17(5-6), 375-381.
Appendix A: List of Python packages
Appendix L: Baker and Albaum’s (1986) criteria for screening new product ideas
Criterion Examples
Sample rows from the collected tweets:

tweet | language | replies_count | retweets_count | likes_count | hashtags | link
"Happy new year Philippines #HappyNewYear2021" | en | 0 | 0 | 0 | ['happynewyear2021'] | https://twitter.com/hanswee3/status/1344674669266411521
"🙃" | und | 1 | 0 | 0 | [] | https://twitter.com/_rjtamayo/status/1344674665554472963
"Welcome 2021🥳❤️" | en | 0 | 0 | 0 | [] | https://twitter.com/Mty_Bautista/status/1344674664790908936
"@mystogan4u pakita ka muna ng balls 😁" | tl | 1 | 0 | 1 | [] | https://twitter.com/MissyLeeVixen/status/1344674657962577921
"Thank you @BTS_twt @BigHitEnt @TXT_members 💜 Happy New Year 🥳✨🥂" | en | 0 | 0 | 2 | [] | https://twitter.com/chinitalampas/status/1344674657580929028

The mentions column is [] for all rows except the last, which holds
[{'screen_name': 'bts_twt', 'name': '방탄소년단', 'id': '335141638'},
{'screen_name': 'bighitent', 'name': 'bighit entertainment', 'id': '168683422'}].

For every row above: urls, photos, cashtags, and reply_to are []; retweet is False;
video is 0; geo is 12.072862,122.664139,768536k; and quote_url, thumbnail, near,
source, user_rt_id, user_rt, retweet_id, retweet_date, translate, trans_src, and
trans_dest are NaN.
Appendix V: List of all the columns and the number of null values
# Column Null
1 id 0
2 conversation_id 0
3 created_at 0
4 date 0
5 time 0
6 timezone 0
7 user_id 0
8 username 0
9 name 3,251
10 place 5,014,984
11 tweet 0
12 language 0
13 mentions 0
14 urls 0
15 photos 0
16 replies_count 0
17 retweets_count 0
18 likes_count 0
19 hashtags 0
20 cashtags 0
21 link 0
22 retweet 0
23 quote_url 4,552,147
24 video 0
25 thumbnail 4,556,780
26 near 5,193,417
27 geo 0
28 source 5,193,417
29 user_rt_id 5,193,417
30 user_rt 5,193,417
31 retweet_id 5,193,417
32 reply_to 0
33 retweet_date 5,193,417
34 translate 5,193,417
35 trans_src 5,193,417
36 trans_dest 5,193,417
Appendix W: Content Editing Certification
Appendix X: English Editing Certification
Curriculum Vitae
Landley Bernardo
Address: Baguio City, Philippines, 2600
Phone: +639752826318
Email: lmbernardo@slu.edu.ph
WORK EXPERIENCE
02/2018 – 05/2018
Part-time faculty, Saint Louis University, Baguio City, Philippines
08/2017 – 02/2018
Management Information Specialist, Martha Property Management Inc., Baguio City, Philippines
ACHIEVEMENTS
Finalist, Philippine Startup Challenge 2016