Screening Product Ideas Through User-Generated Content in Social Media To Assist Small and Medium Enterprises in New Product Development
by
Landley M. Bernardo
A Capstone Project
November 2021
ENDORSEMENT
This is to certify that the capstone project entitled SCREENING PRODUCT IDEAS
THROUGH USER-GENERATED CONTENT IN SOCIAL MEDIA TO ASSIST SMALL
AND MEDIUM ENTERPRISES IN NEW PRODUCT DEVELOPMENT prepared and
submitted by LANDLEY M. BERNARDO for the degree MASTER IN INFORMATION
TECHNOLOGY is recommended for oral examination.
This is to certify further that LANDLEY M. BERNARDO is ready for oral examination.
Accepted and approved in partial fulfillment of the requirements for the degree Master
in Information Technology.
I dedicate this capstone project to all the SME owners in the Philippines and their employees
who have worked so hard to continue their business operations during these trying times.
ACKNOWLEDGEMENTS
First and foremost, I would like to praise and thank God, the Almighty, who has granted me
countless blessings, knowledge, and opportunities so that I have finally been able to complete this capstone project.
Thank you to Dr. Cecilia Mercado for endorsing me to the Department of Science and
Technology (DOST) PROJECT STRAND scholarship. I cannot thank you enough for your
personal recommendation. Regardless of whether or not I am accepted, the fact that you
recommended me means so much. I would also like to thank Mr. Dalos Miguel and Ms. Maria
Concepcion Clemente for taking time out of your busy day to put in a good word for me. Your kind words mean so much to me.
Thank you to the DOST STRAND team for this life-changing opportunity you have given to
me. I was both thrilled and honored to hear that I had been named as a recipient of the DOST
PROJECT STRAND scholarship. By awarding me the said scholarship, you have given me
another chance to go back to school and create something awesome. I hope one day I will be able
to help students achieve their goals just as you have helped me.
Special thank you to my adviser, Maria Concepcion Clemente, for your patience, guidance,
and support. I have benefited greatly from your wealth of knowledge and meticulous advice. I
am extremely grateful that you took me on as an advisee and continued to have faith in me throughout this journey.
Thank you to my panel members, Dr. Beverly Ferrer, Dr. Randy Domantay, Dr. Cecilia
Mercado, and Ms. Elisabeth Calub. Your encouraging words and thoughtful, detailed feedback
have been very important to me. Thank you for the time you have allotted to accommodate my requests.
I sincerely appreciate the time you spent assisting me with this project. I am well aware that you
have put in a lot of hard work on the task assigned to both of you.
Thank you to Dr. Randy Domantay for suggestions and comments regarding the format and
content of my paper. Those comments are all valuable and very helpful for revising and
improving my writing.
Thank you to Dr. Gerry Paul Genove for teaching me a handful of techniques on smartly
conducting research and making one during a few of my classes with you. Thank you to my
classmates in MIT class, Richard, Nikki, Mehdi, Eddie, Chesca, Jerome, and JL to whom I had
the chance to get to know and learn new perspectives about our shared interest in Information
Thank you to Georgina and Anna Marie for helping me with my oral presentation. Your
feedback when I had my mock presentation prepared me well during my actual presentation.
Thank you to my friends and other people (too many to mention) who became part of my MIT
journey.
Finally, my deep and sincere gratitude to my family and relatives for their continuous and
unparalleled love, help, and support. I am grateful to my brother and cousin for always being
there for me as a friend and entertainer when I was down. I am forever indebted to my parents
for giving me the opportunities and experience that have made me who I am.
Abstract
This research aims to design and develop a tool to assist Philippines-based Small and
Medium Enterprises (SMEs) in screening new product ideas. The application utilized a Machine Learning (ML) model that is
based on User-Generated Content (UGC) on Twitter. Due to the Coronavirus disease 2019
(COVID-19) global pandemic, most businesses in the Philippines have struggled, and some have
gone out of business, especially SMEs. Previous studies suggest that conducting New Product Development (NPD) could
mitigate the impact of the COVID-19 on businesses. However, the process of NPD could be
costly and tedious for SMEs, considering the limited resources they possess. Since the pandemic
broke out, the volume of UGC produced in social media has increased dramatically.
Nevertheless, studies focusing on exploiting this massive volume of UGC in NPD are quite
limited. This study developed an application that utilizes UGC on Twitter to assist SMEs in
performing new product idea screening. The application is powered by a supervised Machine
Learning (ML) algorithm, Support Vector Classifier (SVC) text classification model. Over 5
million tweets were collected and preprocessed using a variety of libraries in Python to train the
model. The criteria for screening include the potential market, the trend of demand, stability of
demand, and market acceptance. At least 2,926 rows with tweets that express potential product
ideas based on common Philippine SMEs were extracted and vectorized using the Word2Vec
word embedding scheme. Consequently, the model achieved an accuracy rate of 84%. The
trained model was used to develop the proposed screening application. The application was
tested on different inputs, screens, and browsers to assess its quality. The output of this study
would help contribute to the limited literature that exploits social media data in developing
product ideas.
Keywords
SME, NPD, New Product Idea Screening, UGC, Machine Learning, NLP, Support Vector
Classifier, Word2Vec
Chapter 1: Introduction
A recent assessment shows that the Philippines' Luzon-wide lockdown that aims to contain
the COVID-19 has accumulated an output loss of 1.1 trillion pesos (NEDA, 2020). Furthermore,
the nation's highest unemployment rate of 17.7% has been recorded (PSA, 2020). The Inter-Agency
Task Force subsequently eased the lockdown to General Community Quarantine (GCQ) to prevent further loss and stabilize the economy, which resulted
in the resumption of most business operations and other economic activities (Exec. Order 112, s.
2020). However, it had little effect on improving the economy because the pandemic has already
influenced consumers' confidence in the market (Vancic & Pärson, 2020). In a study conducted
by MSC (2020), 23% of the Small and Medium Enterprises (SMEs) temporarily closed their
operations, while 28% reduced business operations, affecting thousands of Filipino workers
nationwide (See Figure 1.1). If the impact of the pandemic on the SMEs continues, it could cause
the economy to collapse, driving more Filipinos to the edge of poverty (Bouey, 2020).
These are enterprises with an asset size not exceeding 3,000,000 pesos (MSMEs, 2008). These include restaurants, parlors, and small-time
renting houses. SMEs play a significant role in providing goods and services to the masses,
creating jobs, and promoting innovation through competition (Leano, 2006). 82% of the
businesses in the Philippines fall under the category of SME (International Trade Centre, 2020).
The ability to take up a loan with a lower interest rate in a more extended period to pay is just
one of the programs that the government has rolled out to help small businesses get the financial
support they need to continue operations (Bayanihan to Heal As One Act, 2020; SBSW, 2020).
Despite these efforts, a survey says that 62% of the SMEs reported not receiving any financial
support from the Government, not even from non-government sources, such as families and friends (See Figure 1.2).
Figure 1.2 Level of financial support for Philippines’ SMEs during the implementation of
ECQ in Metro Manila
Studies suggest that New Product Development (NPD) could help businesses stay
competitive and achieve prosperity in a rapidly changing market (Booz, Allen & Hamilton,
1982; Hughes and Chaffin, 1996; Ford & Terris, 2017). NPD is a process that transforms market
opportunities into a product available in the market (Takeuchi & Nonaka, 1986). It is a seven-
stage process consisting of new product strategy development, idea generation, screening and
evaluation, business analysis, development, testing, and commercialization (Booz, & Allen &
Hamilton, 1982). The stages of NPD are illustrated in Figure 1.3. The screening is considered the
most critical stage because it selects and evaluates new product ideas from a pool of ideas
generated during new product ideation (Rochford, 1991). Its output is the deciding factor that
determines if an idea is fit for the next phase. Jespersen (2007) argues that screening is a
complex decision process highly influenced by market changes. Agrawal & Bhuiyan (2014)
created a critical success factors (CSF) framework that lists the metrics, CSF, and tools used in
each stage of NPD. In particular, the CSF for the screening phase is called Up-front homework.
It consists of numerous activities that aim to understand and analyze the current and future
market potential. Similarly, Baker & Albaum (1986) created a list of criteria for screening a new
product idea, including societal factors, business risk factors, demand analysis, and market
acceptance factors. Many researchers have likewise studied the screening of new product ideas (Cooper, 1979; Baker & Albaum, 1986; Debrentani, 1988; Verworn, Herstatt, & Nagahira,
2006; Mu, Peng, & Tan, 2007; Soukhoroukova, Spann, & Skiera, 2001; Onarheim &
Christensen, 2012; Albar, 2013). However, these studies used a traditional approach in obtaining
the datasets (e.g., giving out questionnaires, conducting interviews) to screen new product ideas.
Although conventional data collection has proven effective, a study emphasizes that it could
bring validity issues, as questions are prone to misinterpretation (Pribyl, 1994). Furthermore,
traditional product screening is a tedious, expensive, and lengthy process, yet only 20% of new
product ideas reached commercialization (Ford & Terris, 2017; Rodríguez-Ferradas & Alfaro-
Tanco, 2016; Akram, 2017). Given the limited resources of SMEs, screening new product ideas through these traditional means may be impractical.
Since the pandemic broke out, the number of people using social media (e.g., Facebook,
Twitter, and Instagram) has increased dramatically (Statista, 2020). In particular, businesses have
widely utilized these platforms to improve their brands and reach out to their target customers;
in fact, 84% of firms use social media to build their brands and increase awareness (See Figure 1.4).
Another advantage that social media brings to a business that is often underestimated is the
availability of UGC (Chu & Kim, 2011; Prantl & Mičík, 2019). UGC has been defined as
publicly available content, such as text, image, video, and even audio, created by users, rather
than brands, to express one's opinion (Krumm & Davies, 2008). It is referred to by other
literature as Electronic Word-of-Mouth or eWOM (Park & Lee, 2009; Lee & Youn, 2009; Jeong
& Jang, 2011). Examples of UGC are posts (e.g., Facebook), comments (e.g., Instagram), tweets
(e.g., Twitter), and ratings and reviews (e.g., Amazon). It is the main contributor to the enormous
digital information produced on the Web, often referred to as Big Data (Tufekci, 2014). Social
media users are expected to reach 4.41 billion in 2025 (Statista, 2020) (See Figure 1.5).
Therefore, the amount of UGC generated in social media will also see a tremendous increase. On
a recent projection made by MarketingResearch.com (2020), the global UGC software market
will reach 447 billion dollars in 2026, and blogging platforms such as Twitter will be at the top
of the competition.
Gathering consumer information is now far easier than it was a few decades ago (Pantano, Giglio, & Dennis, 2019). Nascimento & Da Silveira (2017) argue that
UGC could be an alternative source of information for screening new product ideas. Rathore &
Ilavarasan (2020) explain that it is inexpensive and can provide real-time consumer behaviors.
Specifically, Twitter has been the leading source of UGC to measure customers' satisfaction
towards a particular product or brand (Bhimani, Mention, & Barlatier, 2018; Kumar, Koolwal, &
Mohbey, 2019). The platform allows its users to post a concise tweet of up to 280 characters with
specific hashtags (e.g., #iphone12, #aegyocake, #talabykyla), making it searchable for
researchers (Kumar, Koolwal, & Mohbey, 2019). Newswire (2020) reports indicate that its Daily
Active Users improved by 34% in 2020, with 500 million tweets being sent every day.
Several studies have utilized Twitter data, such as engagements, sentiments, and hashtags, to
understand customers (Fischer & Reuber, 2011; Sultana, Paul, & Gavrilova, 2016; Anto et al., 2016; Ray &
Chakrabarti, 2017; Geetha et al., 2018; Costa et al., 2013). In particular, Fischer & Reuber (2011)
used Twitter engagements to reach more target customers. Accordingly, it would be
an avenue for a business and its customers to develop a rapport and build trust. Interestingly,
Sultana, Paul, and Gavrilova (2016) used engagements to identify the behaviors of selected users.
Sentiments are the opinion of people towards a product. Sentiment Analysis is the process of
getting the polarity scores from a piece of text. It is computed in three levels: document,
sentence, and aspect levels. Studies made by Anto et al. (2016), Ray & Chakrabarti (2017), and
Geetha et al. (2018) categorized the ratings of a known mobile phone brand using tweets. In
particular, Ray & Chakrabarti (2017) made an assessment of people’s opinions on iPhone 6 in
the document and aspect levels sentiment analysis. Finally, hashtags promote company visibility
and spread awareness of their mission and vision (Costa et al., 2013).
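As a toy illustration of document-level sentiment scoring, the sketch below counts positive and negative words against a tiny hand-made lexicon. The lexicon and word lists are purely illustrative; real analyses use full sentiment lexicons or trained models, as discussed next.

```python
# Toy document-level sentiment scorer using a tiny hand-made lexicon.
# The word lists are illustrative only; production pipelines use full
# sentiment lexicons or trained models.
POSITIVE = {"good", "great", "love", "tasty", "affordable"}
NEGATIVE = {"bad", "slow", "expensive", "hate", "stale"}

def polarity(text: str) -> float:
    """Return a score in [-1, 1]: positive minus negative word counts,
    normalized by the number of sentiment-bearing words."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

def label(score: float) -> str:
    """Map a polarity score to the three categories used in the text."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `label(polarity("I love this tasty cake"))` yields `"positive"`, while a text with no sentiment-bearing words scores neutral.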
In the past, sentiment analysis has been the most used methodology in measuring product
performance (Pang, Lee, & Vaithyanathan, 2002). It is a type of text classification technique that
assigns a label to particular text (Kowsari et al., 2019). It computes the polarity scores from a
piece of text to determine its sentiment. Its output could be categorized into positive, negative, or
neutral. There are two general approaches in text classification: lexicon-based and machine
learning-based (Dhaoui, Webster, & Tan 2017). The former relies on a dictionary or sentiment
lexicon to determine the emotions of a text (Pennebaker et al., 2015). The latter uses Machine
Learning (ML) algorithms to identify patterns on a given dataset and use the acquired knowledge
to perform the classification (Feldman, 2013). The ML approach requires a massive amount of
data for training and testing to achieve an accurate result. However, implementing sentiment
analysis using the ML approach is more accurate than the lexicon-based approach (Dhaoui, Webster, & Tan, 2017).
ML algorithms are generally divided into three types: supervised, unsupervised, and
reinforcement learning (Feldman, 2013). A supervised ML algorithm uses a labeled dataset for
training and testing. It is further subcategorized into regression and classification. The regression
deals with numbers such as mean/average prediction, while the classification tries to label a
particular input into a finite number of classes. Text classification is a domain in ML that assigns
predefined categories to pieces of text, such as categorizing news into sports, business, politics, or entertainment (Dadgar et al., 2016). One of
the most popular supervised ML algorithms is the Support Vector Machine (SVM)
(Vapnik, 1995), along with linear regression, k-nearest
neighbors (kNN), and Naïve-Bayes (Lee & Shin, 2020). SVM uses hyperplanes to
separate the observations into different classes (Wang et al., 2006). A good hyperplane is
achieved when it has the largest distance to the nearest data point. The structure of the SVM is
shown in Figure 1.6. Osisanwo et al. (2017) compared the performance of the different ML
algorithms, including decision trees, neural networks, Naïve-Bayes, kNN, SVM, and rule-
learners. The result shows that SVM supersedes all other algorithms in terms of accuracy and
tolerance to irrelevant attributes. Furthermore, Gharib, Habib, & Fayed (2009) confirmed the
effectiveness of SVM when dealing with a large text dataset. Similarly, SVM consistently
outperforms other alternative models (Joachims, 1998). However, the algorithm cannot process
raw text directly; texts must first be converted into numerical vectors before SVM can perform its
mathematical computations (Rong, 2014). Several word embedding approaches are available for
performing this task, including Word2Vec. Word2Vec encodes texts by considering their
semantic values. For instance, the numerical equivalents of men and women are more similar
than of men and horses. Recent studies have shown that incorporating Word2Vec in SVM when
building a text classifier yields a better performance (Zhang et al., 2015; Şahİn, 2017; Kurnia et
al., 2020). Specifically, Zhang et al. (2015) and Kurnia et al. (2020) utilized UGC via users’
comments regarding mobile applications and clothing products, respectively, to train a text
classification model.
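The Word2Vec-plus-SVC combination described above can be sketched as follows. The embedding table is a tiny stand-in for a trained Word2Vec model (the study trained one on tweets with a real embedding library), and the texts and labels are invented for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy word-embedding table standing in for a trained Word2Vec model.
# In practice, gensim's Word2Vec (or similar) supplies these vectors.
EMB = {
    "cake":  np.array([0.9, 0.1]),
    "tasty": np.array([0.8, 0.2]),
    "scam":  np.array([0.1, 0.9]),
    "fake":  np.array([0.2, 0.8]),
}

def tweet_vector(text: str) -> np.ndarray:
    """Average the embeddings of known words (a common Word2Vec
    sentence-encoding scheme); zero vector if no word is known."""
    vecs = [EMB[w] for w in text.lower().split() if w in EMB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

# Tiny labeled set: 1 = good product-idea signal, 0 = not good.
texts = ["tasty cake", "cake", "fake scam", "scam"]
X = np.array([tweet_vector(t) for t in texts])
y = np.array([1, 1, 0, 0])

# Fit an SVC, which finds a separating hyperplane in the embedding space.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
```

Averaging word vectors is only one encoding choice; the point is that SVC consumes fixed-length numeric vectors, never raw text.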
Previous studies have shown that UGC produced in social media could be an alternative
dataset for screening a particular product. However, products that have not been created (new
product ideas) are yet to be explored (Nascimento & Da Silveira, 2017). According to
Soukhoroukova, Spann & Skiera (2012), new product ideas that are not immediately captured
may fade away in an organization in no time. Similarly, Albar (2013) concluded that 90% of new
product ideas are rejected before they even reach formal evaluation. Therefore, it is high time to
utilize the abundance of available UGC for screening product ideas. A tool is developed to
perform screening of product ideas using tweets. The application would assist the Philippines'
SMEs in conducting NPD.
1.2. Objectives of the study
This study explored UGC produced in social media in developing new product ideas for SMEs. Specifically, it aimed:
1. To build an SVM text classification model for screening product ideas using UGC produced on Twitter.
2. To develop a web-based screening application using the trained SVM text classification
model.
1.3. Scope of the study
This study focuses on the screening phase of the NPD. The dataset used in the study was
limited to text-based UGC produced on Twitter; in particular, the dataset consists of tweets
posted within the Philippines from August to December 2020. The collected tweets went through
a series of data preprocessing using available Natural Language Processing (NLP) libraries in
Python. In addition, manual data annotation was conducted to verify the product ideas extracted
from the tweets. A list of common product ideas was used to determine whether the extracted
product ideas were valid. The screening considered four criteria: market potential, stability
of demand, the trend of demand, and market acceptance (Baker & Albaum, 1986). A static text
classification model was built using a supervised ML algorithm called SVC. The input variables,
which consist of the extracted product ideas, were transformed into their vector representations
using Word2Vec encoding scheme. A parameter tuning was performed to get the ideal values of
four parameters of the SVC model, namely kernel, degree, gamma, and C. The model's
performance was assessed in four metrics: accuracy, precision, recall, and f1-score. The trained
model was exported and used in the development of the screening application. The proposed
application was implemented using the Python-based web microframework called Flask. A
proof-of-concept on how UGC could be utilized to assist SMEs in conducting NPD is developed.
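The parameter tuning over the four SVC parameters (kernel, degree, gamma, and C) can be sketched with scikit-learn's GridSearchCV. The data below are synthetic stand-ins for the vectorized product ideas, and the candidate values are illustrative rather than the exact grid used in the study.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data; in the study, X holds Word2Vec tweet vectors
# and y holds the good / not-good screening labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Cross-validated search over the four SVC parameters tuned in the study.
grid = GridSearchCV(
    SVC(),
    param_grid={
        "kernel": ["linear", "rbf", "poly"],
        "degree": [2, 3],          # only used by the poly kernel
        "gamma": ["scale", "auto"],
        "C": [0.1, 1, 10],
    },
    cv=3,
)
grid.fit(X, y)
best = grid.best_params_   # ideal values found by the search
```

GridSearchCV refits the best combination on the full training data, so `grid` can be used directly as the tuned model afterwards.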
The output of this capstone project would help bridge the gap identified by Nascimento & Da
Silveira (2017) regarding the limited literature that utilizes data produced in social media to
develop new product ideas. The research also would prove how UGC could be an alternative
source of information when screening new product ideas (Owens, 2007). Moreover, the findings
would show a practical implication for SMEs by helping them to understand the importance of
consumer engagement in social media. Furthermore, the developed application would give the
rising number of SMEs putting up businesses in the online platform economy nationwide
insights into which product ideas to sell. One report shows that 77 percent of Filipinos consider the online presence
of SMEs a must (Villanueva, 2020). Most importantly, it would encourage the Philippines’
SMEs to consider developing new product ideas. Studies made by Booz, Allen & Hamilton
(1982), Hughes and Chaffin (1996), and Ford & Terris (2017) stated that NPD could help businesses stay competitive and achieve prosperity.
Chapter 2: Methodology
This section describes the steps taken to accomplish the objectives of this study. The
methodology is divided into two main sections: section 2.1) building the model and section 2.2)
implementation. Section 2.1 discusses the tasks involved in creating the ML model for
classifying product ideas into good or not good. This section is further divided into six
subsections, including data collection, data preparation, and performance evaluation. Section 2.2
shows a detailed discussion on how the proposed screening application is implemented using the
model built in section 2.1. Moreover, the technology stacks used in the implementation are
presented. The development is divided into two parts: front-end and back-end development. Each
section of the methodology is further explained in the succeeding paragraphs and is summarized
in Figure 2.1.
The primary tool used to build the model (in section 2.1) is Anaconda Navigator. It is a
GUI-based application containing tools and pre-installed packages in Python for ML, also known
as conda packages. In particular, Jupyter Lab is one of the ML tools available by default in
Anaconda Navigator. It is an interactive open-source web application used for data exploration,
visualization, and analysis. Jupyter Lab was used to create the notebooks that contain the
different utility functions to build the model. Anaconda Navigator's Environments feature was used to create a new
virtual environment (venv) to store the needed Python packages for building the model. A
package is a collection of modules. It is a file consisting of Python code that defines classes,
functions, and variables. Packages are used to build a powerful Python application. Creating a
venv would allow packages to be reinstalled easily when an unknown bug occurs in one of the
packages installed. Anaconda Navigator uses the channel named default as its default channel for
installing and adding packages to a venv. Additional channels were added to get more packages,
such as the conda-forge and pytorch channels. The packages stored in the venv and their roles are listed in Appendix A.
Section 2.2 discusses the implementation of the proposed screening application. Python is the
language of choice when performing analysis for big data (Oliphant, 2007). Specifically, the
application was developed using Flask. It is an open-source web Python microframework for
building data-driven and dynamic web applications quickly. It follows the Model-View-
Controller (MVC) architecture pattern that separates the application into three main components:
the model, view, and controller; these components represent the data or the database, the display,
and the application's logic, respectively. Flask is the most popular web framework for Python,
along with Django (Aslam, Mohammed, & Lokhande, 2015; Mufid et al., 2019). It has 52,400
stars and 13,800 forks on its Github page at the time of this writing. Unlike Django, which is a
full-stack and comes with pre-built dependencies, libraries, and layouts, Flask is lightweight. It
only offers suggestions for possible tools for developing the application, giving developers the
flexibility and freedom to select other technologies for implementation. The Flask framework is
relatively new. Its latest stable version (1.1.2) was released on April 3, 2020.
Nevertheless, its community has been growing. There are currently 621 contributors, and the framework is trusted
by more than 5,000 projects, including well-known brands such as Netflix, Reddit, and Lyft.
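A minimal sketch of how a Flask "controller" could expose the screening model is shown below. The route, function names, and placeholder classifier are hypothetical illustrations, not the study's actual code.

```python
# Minimal Flask sketch of a screening endpoint: the controller accepts a
# product-idea string and returns the screening label. In the real
# application, classify() would wrap the exported SVC model; here it is
# a placeholder so the sketch stays self-contained.
from flask import Flask, jsonify, request

app = Flask(__name__)

def classify(idea: str) -> str:
    # Placeholder for the exported text classification model.
    return "good"

@app.route("/screen", methods=["POST"])
def screen():
    idea = request.get_json().get("idea", "")
    return jsonify({"idea": idea, "label": classify(idea)})
```

In MVC terms, the route function is the controller, `classify()` stands in for the model, and the JSON response (or a rendered template) is the view.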
The Integrated Development Environment (IDE) used was PyCharm
Professional. Although the professional version comes with a price, JetBrains, the creator of
PyCharm, offers a free one-year subscription for students. An IDE is a software application that
provides comprehensive facilities for software development. PyCharm is a
recommended code editor for building Python-based projects. It is equipped with all the
necessary tools for modern development, including a command-line interface (CLI), features to
connect to the database, create a venv, and integrate with Github. Also, it offers features for
handling big data and developing data-driven applications like conda integration to manage
packages for Python, scientific libraries and plots for performing data analytics and visualization,
and coding assistance for Python frameworks like Flask. The same venv used in section 2.1 was
utilized in the development. Once the venv was created, and packages were installed, a new
Flask project was initiated in PyCharm. PyCharm used the created venv as its interpreter to
import all the packages needed for the development (see Appendix A).
A model is an ML algorithm that has been trained to identify patterns on a given dataset. It
uses its acquired knowledge to perform a prediction or classification (Feldman, 2013). This study
proposed a supervised text classification model that classifies a particular product idea as good or
not good. The SVM ML algorithm was considered in building the said model. According to
Joachims (1998), SVM consistently outperforms other alternative models in text classification.
This section is further broken down into five subsections: section 2.1.1) data collection, section
2.1.2) data preprocessing, section 2.1.3) constructing the criteria for screening, section 2.1.4)
data preparation, and section 2.1.5) performance evaluation. First, data collection explains how
the data needed for the study were gathered using an advanced scraping tool. Second, data
preprocessing provides a detailed look into how the collected data were cleaned and turned into
valuable data for modeling. It also shows the process of extracting product ideas from the actual
tweets. In the third step, the criteria for screening product ideas were constructed. Furthermore,
the results of the screening process of the extracted product ideas were labeled as good or not
good. Fourth, the dataset for modeling was prepared and divided into two subsets to train and test
the model. Finally, the model was trained, and its performance was assessed in different
performance metrics. The trained model was saved and used for the implementation of the
screening application.
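The final training-and-evaluation step can be sketched with scikit-learn, using synthetic vectors in place of the real dataset. The four metrics match those named in the scope; everything else here is an illustrative stand-in.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic feature vectors standing in for the vectorized product ideas.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)

# Hold out a test subset, train, then assess the four metrics.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=1)
model = SVC(kernel="linear").fit(X_tr, y_tr)
pred = model.predict(X_te)

metrics = {
    "accuracy": accuracy_score(y_te, pred),
    "precision": precision_score(y_te, pred),
    "recall": recall_score(y_te, pred),
    "f1": f1_score(y_te, pred),
}
```

Keeping the test subset untouched during training is what makes the reported metrics an honest estimate of performance on unseen tweets.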
A variety of scraping tools are available for mining tweets on Twitter. The microblogging
platform [Twitter] allows third-party applications to connect and collect tweets using an
approved and authenticated developer account through a secured channel called Twitter
Application Programming Interface (API) (Kumar, Koolwal, & Mohbey, 2019). However, the
number of tweets allowed to be collected is limited; specifically, the Twitter API v2 imposes a
monthly cap on the number of tweets an account can retrieve.
Alternatively, the Twitter Intelligence Tool (Twint) allows retrieval of tweets with no limits and no
API required (Dutch Osint Guy, 2018). It is an open-source advanced tweet scraping tool often
used for Open-Source Intelligence (OSINT) research. OSINT tools focus on collecting,
analyzing, and using publicly posted information (e.g., reviews, tweets, and Facebook feeds) for
research purposes. Hence, Twint was utilized to collect the data needed for the study. In
particular, these data were the tweets that were posted within the Philippines from August to
December 2020.
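Restricting collection to tweets posted within the Philippines amounts to a radius check around a central point; the study passes Marinduque's coordinates and a 768.54 km radius to Twint's geo parameter, as described below. A pure-Python haversine sketch of the equivalent check (the sample coordinates in the test are illustrative):

```python
# Haversine check: does a geotag fall within the 768.54 km radius around
# Marinduque (12.072862, 122.664139) used in the Twint geo filter?
from math import asin, cos, radians, sin, sqrt

CENTER = (12.072862, 122.664139)   # Marinduque, from the geo parameter
RADIUS_KM = 768.54

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points,
    assuming a spherical Earth of radius 6371 km."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(h))

def in_scope(point) -> bool:
    """True if the (lat, lon) point lies within the collection radius."""
    return haversine_km(CENTER, point) <= RADIUS_KM
```

For instance, Manila lies well inside the radius, while a city such as Tokyo falls far outside it, so its tweets would be excluded.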
To start off, a Linux-based Operating System (OS), Ubuntu, was loaded and configured in
Oracle Virtual Machine (VM) Virtual Box. The VM enables a machine to run more than one OS
at a time within the base OS. Since Twint is compatible with the Linux environment, the VM
was required to run Twint in Windows, the researcher’s default OS. The specifications set for the
virtual machine are as follows:

Hardware Specification
Operating System: Linux
Distro: Ubuntu 18.04.01
Processor: 2 CPUs
Base memory: 4608 MB
System type: 64-bit Operating System, x64-based processor
Storage: 40 GB
Furthermore, Twint's dependencies, including the Python language, Pip Installs Packages
(pip), and Git, were installed into the virtual OS. Notably, a new venv was created to store the
other dependencies using pip. A venv allows the developer to build multiple Python-based
applications in a single machine. Pip is a package manager for Python. To ensure that only
tweets posted within the Philippines are included, the coordinates of Marinduque
(12.072862, 122.664139) within a radius of 768.54 km were specified in Twint's geo
parameter. The code used in data collection is shown in Appendix B. Marinduque was selected
because of its central location in the archipelago, ensuring that only the needed data are collected. The places in the Philippines where tweets were collected are illustrated in the accompanying figure.
In predictive modeling (e.g., text classification), raw data cannot typically be used as it is. It
requires preprocessing to ensure that the dataset is fit to the model. Data preprocessing is a
process that aims to get rid of the noise in a given dataset. Some examples of noise include
special characters, unnecessary duplicate letters, and stopwords, such as “the,” “a,” and “can.”
The presence of these noises could compromise the performance of an ML model when
performing text classification (Yang, 2018). The process of data preprocessing varies
accordingly. It is highly dependent on the defined problem. This study utilized the programming
language Python to preprocess the collected data. Python is a multi-purpose language that serves
different needs, from basic web applications to training models for ML and Artificial Intelligence
(AI). It also offers a variety of packages from various sources to turn raw texts into useful data.
Data preprocessing comprises five subprocesses: (1) data wrangling, (2) data reduction, (3)
extracting the product ideas, (4) data annotation, and (5) sentiment analysis. These steps are
summarized in Figure 2.2. First, the collected data were transformed from a Tab Separated
Values (TSV) format into a dataframe which enables useful functions to aid with the data
preprocessing. Next, the rows and columns of the created dataframe were reduced to select only
the relevant data for the study. Third, numerous data cleansing techniques were carried out using
available NLP packages in Python to extract possible product ideas from the tweets. Fourth, the
extracted product ideas were manually annotated. Lastly, the polarity scores of the extracted product ideas were determined through sentiment analysis.
Figure 2.2 Steps of the data preprocessing
In the initial stages of data preprocessing, the collected UGC were moved from the
configured virtual OS to the Windows host machine via the shared folder feature of the Oracle
VM. Once the data were transferred, the data preprocessing began.
The first step of data preprocessing was data wrangling. Data wrangling restructures raw data
into the desired format to aid with the data preprocessing. The initial dataset was transformed
from its original file format [TSV] into a dataframe using the pandas package. A dataframe is a
2D data structure that provides streamlined forms of data representation. It is a table that consists
of columns as the labels and rows as the actual data. The pandas library was used because it
provides rich functions for data preprocessing. In addition, the encoding scheme UTF-8 was
specified in pandas’ encoding parameter to read and display the emojis of the tweets. The code used in data wrangling is shown in Appendix C.
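A sketch of this wrangling step with pandas, using a small in-memory TSV in place of the actual Twint export; the rows and column names are illustrative stand-ins.

```python
# Load a Twint-style TSV export into a pandas dataframe with UTF-8
# decoding so emojis in the tweets survive. An in-memory TSV stands in
# for the real export file; its contents are illustrative.
import io

import pandas as pd

raw_tsv = (
    "id\ttweet\tlanguage\n"
    "1\tI love aegyo cake \U0001F382\ten\n"
    "2\tmasarap ito\ttl\n"
)

df = pd.read_csv(io.StringIO(raw_tsv), sep="\t", encoding="utf-8")
```

With a file on disk, the same call takes the TSV path as its first argument; the `sep="\t"` and `encoding="utf-8"` arguments are the essential parts.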
The second step of data preprocessing is data reduction. Data reduction is a technique to
reduce the number of features of a particular dataset. In this step, the columns and rows that were
irrelevant to the study were removed. The selection considered the data that would be helpful
when screening new product ideas. Starting off with the rows, non-English tweets were removed
from the dataframe. This would limit the dataset to English words that work well with available
NLP libraries in Python. Also, it would lead to the improvement of the quality of the dataset by
removing nuisance words that are hard to interpret. The code used to reduce the rows of the
dataframe is shown in Appendix D, lines 10-12. In the case of column selection, attributes that
could be used in measuring the performance of the extracted product ideas using available
information in the scraped UGC were considered. These include attributes with numerical values
which would be beneficial in getting the engagements and sentiments (Fischer & Reuber, 2011;
Sultana, Paul; Gavrilova, 2016; Anto et al., 2016; Ray & Chakrabarti, 2017; Geetha, Rekha, &
Rarthika, 2018; Costa et al., 2013). The code used to reduce the columns of the dataframe is likewise provided in the appendices.
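The row and column reduction could look like the sketch below. The `language` column name follows Twint's output, the kept columns are those named later in the text (created_at, tweet, retweets_count, replies_count, likes_count), and the sample values are illustrative only.

```python
import pandas as pd

# Toy dataframe standing in for the wrangled Twint data.
df = pd.DataFrame({
    "created_at": ["2020-10-18", "2020-11-27"],
    "tweet": ["CHOCOLATE CAKE ROLLS", "gusto ko ng bagong gym"],
    "language": ["en", "tl"],
    "retweets_count": [0, 3],
    "replies_count": [0, 1],
    "likes_count": [1, 5],
    "user_id": [111, 222],  # example of an irrelevant column to drop
})

# Row reduction: keep English tweets only; column reduction: keep the
# attributes useful for engagement and sentiment computations.
keep = ["created_at", "tweet", "retweets_count", "replies_count", "likes_count"]
reduced = df[df["language"] == "en"][keep].reset_index(drop=True)
print(reduced.shape)
```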
The third step of data preprocessing was extracting the product ideas. According to
Nascimento & Da Silveira (2017), one way to find product ideas for business is in UGC
produced in social media. This information provides insights into consumers’ preferences and needs that can be uncovered with careful planning and the right tools at hand. This can be utilized by businesses for product
development (Rathore & Ilavarasan, 2020). Extracting product ideas consists of seven sub-steps,
including (1) data precleaning, (2) pulling out the nouns, (3) data cleaning, (4) removing the
stopwords, (5) lemmatization, (6) discarding the duplicate words, and (7) eliminating the short words.
The first step to extract the product ideas from the tweets was data precleaning. It was
intended to initially clean the tweet by discarding word patterns including mentions (@),
hashtags (#), and links (http and https). To do this task, a utility function was constructed using
the regex and numpy libraries. A new column named precleaned_tweet was created
separately from the original tweet to store the output of the applied operation. The code used to
preclean the tweets is shown in Appendix E. Rows with null values in the newly created
precleaned_tweet column were removed from the dataframe. The results of this step were used in the succeeding sub-steps.
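A minimal sketch of such a precleaning utility function is shown below, using the `re` (regex) and `numpy` modules as the text describes; the wiring of the function into a new dataframe column via pandas `apply` is omitted.

```python
import re
import numpy as np

def preclean(tweet: str):
    """Remove mentions (@), hashtags (#), and links (http/https);
    return np.nan when nothing is left, so empty rows can be dropped."""
    cleaned = re.sub(r"(@\w+|#\w+|https?://\S+)", "", tweet)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned if cleaned else np.nan

print(preclean("@jacob check this https://t.co/xyz #sweets great cake"))
```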
The next step to extract the product ideas was to pull out the nouns from the precleaned tweets. Part-of-speech (POS) tagging labels each word in a sentence based on the part of speech (e.g., noun, verb, adverb, adjective, pronoun, conjunction, and interjection) it belongs to. Specifically, a noun is a part of speech
that names a person, thing, idea, action, or quality. An example of POS tagging is shown in
Figure 2.3.
Figure 2.3 An example of a tweet where the words are tagged and labeled using a POS
tagger. In this example, the words yoga and mat are labeled as nouns.
This study used nouns to represent possible product ideas. Generally, product ideas are
expressed through nouns (Malmasi & Dras, 2015). By pulling out the nouns from the values of
the precleaned_tweet, it would help determine which tweets contain possible product ideas for
the Philippines’ SMEs. Rows with no extracted nouns were removed from the dataframe. Three
POS tagger libraries, namely, TextBlob, NLTK, and Spacy, were used, and their results were
manually evaluated to ensure the quality of the extracted nouns. The final result was stored in a
new column named pulled_out_noun. The code for extracting the nouns is shown in Appendix
F.
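The noun-selection logic can be sketched as below. The taggers themselves (TextBlob, NLTK, Spacy) need model data to run, so the sketch assumes the tokens have already been tagged with Penn Treebank tags, which is the scheme TextBlob and NLTK produce.

```python
# Sketch of the noun-selection step, given already-tagged tokens.
def pull_out_nouns(tagged_tokens):
    # Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]

# Illustrative tagged tweet, mirroring the yoga-mat example in Figure 2.3.
tagged = [("i", "PRP"), ("really", "RB"), ("need", "VBP"),
          ("a", "DT"), ("yoga", "NN"), ("mat", "NN")]
print(pull_out_nouns(tagged))  # ['yoga', 'mat']
```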
The third task to extract the product ideas was data cleaning. The results of the previous step
[pulling out the nouns] were cleansed to remove any unnecessary words and characters
mistakenly pulled out as nouns by the selected POS tagger. This step includes (1) converting
words to lower case, (2) expanding contractions, and (3) removing email addresses, HTML tags, and other unwanted characters. The preprocess_kgptalkie package was utilized to cleanse the nouns. A new column named
cleaned_noun was created to store the results of data cleaning. The code used to clean the
extracted nouns is shown in Appendix G. Rows with null values in the new column were removed from the dataframe.
The fourth task to extract the product ideas was the removal of the stopwords. The stopwords removed include both Tagalog (e.g., akin, ako, at, dapat) and English (e.g., can, might, by) words. The
complete list of the stopwords removed from the cleaned nouns are shown in Appendix H.
Stopwords are words that do not add much meaning to a sentence and can be safely removed
without compromising the meaning of a sentence. The stopwords were removed using the
Spacy library. By default, Spacy only contains the stopwords for the English language; the Tagalog stopwords were therefore added manually. A new column was added to the dataframe to store the new values of the applied operation. The code used to get rid of the stopwords is shown in Appendix I. Rows with null values in the newly created column were removed from the dataframe.
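The bilingual filtering idea can be shown with plain sets, without loading Spacy; the Tagalog and English lists here are tiny illustrative subsets of the full list in Appendix H.

```python
# Illustrative subsets of the stopword lists; the full lists are in Appendix H.
english_stopwords = {"can", "might", "by", "the", "a"}
tagalog_stopwords = {"akin", "ako", "at", "dapat", "ng"}
stopwords = english_stopwords | tagalog_stopwords

def remove_stopwords(words):
    """Drop words that carry little meaning, keeping candidate product terms."""
    return [w for w in words if w not in stopwords]

print(remove_stopwords(["ako", "can", "gym", "equipment", "at", "shop"]))
```

In Spacy itself, the same effect is achieved by extending `nlp.Defaults.stop_words` with the Tagalog entries before filtering.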
The fifth step to extract the product ideas was lemmatization. It is the process of converting a
word into its base form, removing endings such as -s, -ing, and -ed (e.g., from equipments to
equipment, from walked to walk). Sometimes a word has an irregular form, making its conversion quite different (e.g., from mice to mouse, from dove to dive). Lemmatization was
employed to get the base form of the nouns, which would help to further simplify the values of
the extracted nouns. The make_base method from the preprocess_kgptalkie package
was used to perform the lemmatization. A new column named lemmatized_noun was
introduced to save the result of lemmatization. The code used to lemmatize the words is shown
in Appendix J, line 3.
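A toy sketch of what lemmatization does is given below. The thesis uses make_base from preprocess_kgptalkie, which wraps a full lemmatizer; the irregular-form table and suffix rules here are illustrative only and far from complete.

```python
# Tiny illustrative table of irregular forms; a real lemmatizer covers far more.
IRREGULAR = {"mice": "mouse", "dove": "dive", "feet": "foot"}

def lemmatize(word: str) -> str:
    """Crude sketch: map known irregulars, otherwise strip common endings."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix, repl in (("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")):
        # Length guard keeps short words like "ring" or "bus" intact.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + repl
    return word

print([lemmatize(w) for w in ["equipments", "walked", "mice"]])
```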
Once the nouns were lemmatized, the next step was to discard the duplicates. A new column named unique_noun was added to the dataframe to keep the unique words. The code used in removing the duplicates is shown in Appendix J, line 14. Rows with null values in the unique_noun column were removed from the dataframe.
Finally, short words were eliminated from the unique_noun column. In particular, these are
the words that have fewer than three characters. There is a high possibility that words under the said category are merely a nuisance and do not represent a product idea; they were therefore discarded. A custom one-liner function was designed to apply this step. A new column named extracted_product_idea was created to store the results. The code used to remove words with fewer than three characters is shown in Appendix J, line 17. Rows with null values in the newly created extracted_product_idea column were removed from the
dataframe.
After the extraction, duplicate values of extracted_product_idea were combined by aggregating their retweets_count, likes_count, and replies_count and getting the average of their polarity_score. In addition, a new column named tweets_count was added to the dataframe to store the number of occurrences of each extracted_product_idea. After the values were aggregated, the duplicates were removed, leaving only the first occurrence of each extracted_product_idea. The code used to implement this step is shown in Appendix J, lines 24 and 26. The resulting dataframe was saved and used in section 2.1 to construct the criteria
for screening.
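The aggregation just described can be sketched with a pandas `groupby`: counts are summed, polarity scores averaged, and a tweets_count column records how often each product idea occurs. The sample values are illustrative only.

```python
import pandas as pd

# Toy stand-in for the extracted product ideas with their per-tweet counts.
df = pd.DataFrame({
    "extracted_product_idea": ["haircut", "haircut", "gym equipment"],
    "retweets_count": [0, 4, 1],
    "likes_count": [1, 12, 5],
    "replies_count": [0, 1, 2],
    "polarity_score": [0.1, 0.0544, 0.3884],
})

# Sum the engagement counts, average the polarity, and count occurrences.
agg = (df.groupby("extracted_product_idea", as_index=False)
         .agg(retweets_count=("retweets_count", "sum"),
              likes_count=("likes_count", "sum"),
              replies_count=("replies_count", "sum"),
              polarity_score=("polarity_score", "mean"),
              tweets_count=("extracted_product_idea", "size")))
print(agg)
```

The `groupby` collapses duplicates in one pass, so no separate first-occurrence filtering is needed in this sketch.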
The fourth step of data preprocessing was data annotation. It is a process of labeling
the data available in various formats so that an ML model can quickly and clearly understand the
input patterns (Schreiner, 2006). This study conducted a manual data annotation to ensure that
only valid product ideas are included in the final dataset. Data annotation was carried out with
the aid of two BSIT undergraduate students. The result of all the steps mentioned above is a dataset of extracted product ideas with corresponding attributes, including the number of retweets, likes, and replies. The said dataset was divided into five subsets based on the month when the tweets were posted. The first two subsets, August.csv and September.csv, were assigned to the first student, while the subsequent two subsets, October.csv and November.csv, were delegated to the second student.
The remaining subset (December.csv), which has the most number of rows, was given to the
researcher.
A new column named label was added to the dataframe to store the results of data annotation.
The basis for considering a product idea as valid is the presence of emotion (e.g., hate, want,
love, disgust) expressed towards the noun in a tweet. According to Nascimento & Da Silveira
(2017), people use social media platforms to express their emotions on any particular topic,
including product ideas. Moreover, since not all the values of extracted_product_idea are valid
product ideas, only those on the list of common product ideas for the Philippines' SMEs are
considered. For example, MoneyMax.com (2021) published a list of small businesses ideas with
small capital in 2021. A few of the product ideas in the said list are plant shop, beauty product
reselling business, and cake, dessert, and pastry business. The complete list of the category of the
product ideas used in categorizing the annotated product ideas is shown in Appendix K. Each valid product idea was matched to the category it belongs to, and the result was stored in a new column named annotated_product_idea. The output of both students was further validated by the researcher. Rows with 0 values in the label column were removed from the dataframe.
The fifth step of data preprocessing was sentiment analysis. Sentiment analysis is the process of determining the sentiments on a piece of
text based on its polarity scores (Pang, Lee, & Vaithyanathan, 2002). The sentiments were
computed by getting the polarity scores of the values of the precleaned_tweets column in a
sentence level using Valence Aware Dictionary and sEntiment Reasoner (VADER). VADER is a
Python package designed for complex social media data, as it considers words, punctuations, emojis, slang, and abbreviated words that commonly appear in social media texts (Hutto & Gilbert, 2014). The package measures sentiments by providing four valence scores: positive, negative, neutral, and compound, where the compound score ranges from -1 (extremely negative) to +1 (extremely positive). This study considered the results of the compound valence scores for the analysis. It is the most commonly used basis for sentiment analysis among researchers (Hutto & Gilbert, 2014). A new column named polarity_score was created and initialized using
the values of the computed compound valence scores. The code used to get the polarity scores is
shown in Appendix J, line 22. Rows with polarity_score values less than 0.05, that is, extracted product ideas with neutral or negative sentiments, were dropped. The resulting dataframe was exported as a CSV file using the pandas library to be used for the succeeding steps. The code used to export the preprocessed dataset is shown in Appendix J, line 28.
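The filtering applied after scoring can be sketched as follows. Computing the compound score itself requires the VADER package; here precomputed scores (illustrative values, not from the study) stand in for its output.

```python
# (product idea, compound valence score) pairs standing in for VADER output.
scored = [
    ("milk tea shop", 0.6369),   # positive
    ("haircut", 0.0772),         # positive (>= 0.05)
    ("traffic", -0.4404),        # negative
    ("umbrella", 0.0),           # neutral
]

# Keep only product ideas with positive sentiment (compound >= 0.05).
positive = [(idea, score) for idea, score in scored if score >= 0.05]
print(positive)
```

With the real library, each score would come from `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` in the vaderSentiment package.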
Screening is the third phase of NPD. It is a complex decision process highly influenced by
market changes (Jespersen, 2007). Screening aims to evaluate new product ideas to determine
which idea is worth investing in (Rochford, 1991). A set of criteria relevant to a particular
business must be designed (Agrawal & Bhuiyan, 2014; Baker & Albaum, 1986). In this section, the criteria for screening were constructed considering the factors identified by Baker & Albaum (1986). Accordingly, the said literature
used 33 criteria categorized into five factors: societal factor, business risk factor, demand
analysis, market acceptance factor, and competitive factor. The complete list of Baker &
Albaum’s criteria is shown in Appendix L. Out of the categories previously mentioned, only two
were chosen, namely, demand analysis and market acceptance. These factors were selected
because their values could be obtained using the available information in the gathered data. In
contrast, the remaining factors (societal, business risk, and competitive factors) require more
than just UGC. It involves information such as financials, skills, business values, and culture, which cannot be derived from the collected tweets.
The first factor was demand analysis. It is used to assess the number of possible customers for a particular product (Hart et al., 2003). The second factor, market acceptance, is
defined as the reaction of the possible customers to a specific product (Calatone & Cooper,
1979). First, the demand analysis was measured in three metrics: (1) potential market, (2) the
trend of demand, and (3) stability of demand. For the second factor, market acceptance, the
sentiment analysis was conducted. These factors and assigned criteria are summarized in Table
2.2. A product idea should satisfy all the criteria to pass the screening and be considered good.
The formula used in getting the scores of each criterion are shown in the succeeding paragraphs.
A new dataframe was created and filled with the computed values of each criterion mentioned
above using the preannotated version of the dataset; more particularly, the values of the potential_market, trend_of_demand, stability_of_demand, market_acceptance, and label columns. The label column holds the
result of the screening. It has two possible values, 1 or 0. A value of 1 means that a product idea
passes the screening (good), while a value of 0 says otherwise (not good). The dataframe was saved for
the next step, which is data preparation. The code used to screen the product ideas is shown in
Appendix M.
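How the four criteria could be combined into the binary label is sketched below; the threshold values follow the text (33% potential market, trend above 0, 15% stability, 0.05 market acceptance), while the function shape and sample inputs are assumptions for illustration.

```python
# Sketch of combining the Table 2.2 criteria into the screening label.
def screen(potential_market, trend_of_demand,
           stability_of_demand, market_acceptance):
    """Return 1 (good) only when every criterion is satisfied, else 0."""
    passed = (potential_market >= 33
              and trend_of_demand > 0
              and stability_of_demand >= 15
              and market_acceptance >= 0.05)
    return 1 if passed else 0

print(screen(41.0, 0.8, 20.0, 0.64))  # passes all criteria -> 1
print(screen(12.0, 0.8, 20.0, 0.64))  # fails potential market -> 0
```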
Table 2.2 Criteria for screening product ideas
The first criterion was the potential market. It is defined as the total market size
for a particular product idea. The engagement rate was used to measure the potential market. It is
the ratio of the total interactions to the total number of tweets. Muñoz-Expósito (2017) proposed
a formula for getting the engagements on Twitter called the ratio of interest. The formula is
shown below.
Accordingly, interactions consist of the total number of retweets, shares via email, replies, and likes, collectively called diffusion interactions. Since tweets shared via email are not available in the collected dataset, only retweets, replies, and likes were considered in the computation. An engagement rate of 33% or higher is considered very high (Mee, 2020). A potential market of at least 33% was therefore used as the threshold value to consider a product idea as good. A new column named potential_market was introduced to store the computed potential market. The code is shown in Appendix N.
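The computation can be sketched as below, following the text's description of the ratio (total interactions over total tweets, expressed as a percentage); the exact scaling in Muñoz-Expósito's original formula may differ, and the sample numbers are illustrative.

```python
# Sketch of the potential-market (engagement rate) computation; email shares
# are excluded because they are not in the collected dataset.
def potential_market(retweets, replies, likes, total_tweets):
    interactions = retweets + replies + likes
    return interactions / total_tweets * 100  # engagement rate in percent

print(potential_market(retweets=40, replies=25, likes=101, total_tweets=500))
```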
The second criterion was the trend of demand. It is the growth of demand over a period of
time. A study refers to this metric as a secular trend that describes data direction (e.g., upward or
downward) in the long term (Komlos, 1993). According to Trackmyhashtag, a firm that gives
analytics from raw tweets, tweet volume is the sum of the tweets, retweets, and replies
(Trackmyhashtag, 2020). The graphical method is used in measuring a secular trend (Komlos, 1993). Accordingly, a good trend is when the curve generated is smooth, which means that the
scores above the line should be greater than or equal to the score below it. Sample trendlines are
shown in Figure 2.4. The slope of the trendline from August to December was monitored to get
the trend of demand. A value greater than 0 was used as the threshold value to consider a product
idea as good. A new column named trend_of_demand was created to hold the values of the trend of demand.
The third criterion was the stability of demand, which describes how the demand holds from one period to another. The increase in the tweet volume was compared to measure the stability of demand. Usually, the standard is to assess the stability of demand annually, but since the collected data are tweets posted within a 6-month window, the increase in the tweet volume from November to December 2020 was used instead. The formula is shown below.

Stability of demand = ((previous − current) / previous) × 100

where previous = tweet volume in November 2020 and current = tweet volume in December 2020.
Baremetrics, a company that provides analytics and insights for business, reported that
companies should have 15%-45% stability of demand yearly. A threshold of at least 15% was set
as the value of the stability of demand to consider a product idea as good. A new column named
stability_of_demand was created to store the values of the stability of demand. The code is
shown in Appendix P.
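The stability-of-demand computation, exactly as the formula above writes it, can be sketched as follows; the sample tweet volumes are illustrative, not the study's actual counts.

```python
# Sketch of the stability-of-demand formula as written in the text:
# (previous - current) / previous * 100, with previous = November 2020
# tweet volume and current = December 2020 tweet volume.
def stability_of_demand(previous: float, current: float) -> float:
    return (previous - current) / previous * 100

print(stability_of_demand(previous=1000, current=800))  # 20.0
```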
Finally, market acceptance is described as the sentiments of the people towards a particular
product idea. The computed compound valence scores from the sentiment analysis step were
utilized to measure the sentiments. According to Hutto & Gilber (2014), compound valence
scores can be negative, positive, and neutral. The interpretation table is shown in Table 2.3.
Accordingly, a score value greater than or equal to 0.05 is considered positive. A threshold of at
least 0.05 was set as the value of the market acceptance to consider a product idea as good. A new column named market_acceptance was created to store the values of the market acceptance.

Table 2.3 Interpretation of the compound valence scores
compound score >= 0.05 — Positive sentiment
compound score > -0.05 and compound score < 0.05 — Neutral sentiment
compound score <= -0.05 — Negative sentiment
Data preparation aims to prepare the dataset for modeling (Zhang, Zhang, Yang, 2003). In
this step, the dataset to be used for building the model was finalized. In particular, the results
obtained from the screening were utilized. Specifically, two columns were considered,
namely product_idea and label. The first column contains the validated product ideas, while the
second column specifies the label for that product idea. A good product idea has a label 1;
otherwise, a value of 0 was specified. Since ML algorithms do not understand text values, word
embedding must be applied (Ge & Moh, 2017). Word embedding is the process of converting
text into numerical values. This study utilized the Word2Vec encoding scheme to convert the
values of the product_idea column to their numerical equivalent. Research has shown that using
the Word2Vec with SVC helps achieve a high-performing model for text classification
(Lilleberg, Zhu, & Zhang, 2015). The implementation of word embedding is shown in Appendix
R, line 20. Once the dataset was prepared, it was divided into two subsets. The first subset was
used to train the model. It consists of 80% of the entire dataset. The second subset was reserved
for testing. The remaining 20% was used to test the performance of the trained model. The
dataset was split using the train_test_split function from the sklearn package. To
refine the dataset further, an undersampling technique was carried out using the NearMiss
package. The code used to create the training and testing subsets is shown in Appendix R, line
26.
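The 80/20 split can be sketched as below; X stands in for the Word2Vec vectors and y for the good/not-good labels, both randomly generated here for illustration. The NearMiss undersampling from the imbalanced-learn package is omitted from the sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and balanced labels standing in for the prepared dataset.
X = np.random.rand(100, 8)
y = np.array([0, 1] * 50)

# 80% for training, 20% reserved for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
print(len(X_train), len(X_test))  # 80 20
```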
Parameter tuning was performed to get the ideal values of the parameters of the SVC
algorithm. A model is rarely at its best when using only its default parameters (Smit & Eiben, 2009). The parameters must be fine-tuned to get their optimal values and create a high-performing model. Parameter tuning was done using the RandomizedSearchCV class from sklearn, and the results of the data preparation were utilized. It is a technique that randomly searches for the optimal parameters for a given model. The parameters tuned
and their possible values are listed in Table 2.4. This study considered four parameters of the
SVC model: 1) kernel, 2) gamma, 3) degree and 4) C. The first parameter specifies the kernel
type to be used in the algorithm. Second, gamma is the value of the kernel coefficient. The third
parameter is the degree of the polynomial kernel function. Lastly, the C parameter is the regularization parameter.

Table 2.4 Parameters of the SVC model and their possible values
# | Parameter | Values
1 | kernel | linear, rbf, poly
2 | gamma | 0.1, 1, 10, 100, 1000
3 | degree | 0, 1, 2, 3, 4, 5, 6
4 | C | 0.1, 1, 10, 100, 1000
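The search over the grid in Table 2.4 can be sketched as follows, run on small synthetic data for illustration; the real search used the prepared Word2Vec features, and the number of sampled candidates here is kept deliberately small.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV

# Candidate values from Table 2.4.
param_distributions = {
    "kernel": ["linear", "rbf", "poly"],
    "gamma": [0.1, 1, 10, 100, 1000],
    "degree": [0, 1, 2, 3, 4, 5, 6],
    "C": [0.1, 1, 10, 100, 1000],
}

# Tiny synthetic classification problem standing in for the real features.
rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = (X[:, 0] > 0.5).astype(int)

# Randomly sample 5 parameter combinations, 3-fold cross-validated each.
search = RandomizedSearchCV(SVC(), param_distributions,
                            n_iter=5, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```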
Performance evaluation is one of the critical activities in ML (Korde & Mahender, 2012).
Accordingly, it is not enough that a model can do prediction or classification; measuring its
performance is the most effective and reliable way to assess how well a model performs. Four
performance metrics, including (1) accuracy, (2) precision, (3) recall, and (4) f1-score were
calculated using the sklearn libraries to evaluate the performance of the model. Specifically, a
classification report was generated to get the precision, recall, and f1-score values, and cross-
validation was conducted to determine the accuracy. In addition, a confusion matrix was
generated to examine the results of the classification. The trained and tested model was saved to
be used in implementing the proposed screening application. The metrics used in performance evaluation are discussed in the succeeding paragraphs.
The first performance metric was the accuracy. It is the ratio of correctly predicted values to
the total number of observations. The cross-validation was carried out using the
cross_val_score method from the sklearn module. It is a technique that evaluates the
model's performance by dividing a dataset into n number of subsets to get the consistency in the
accuracy of a model; more particularly, 10-fold cross-validation was performed. The average
score was used as the final value of the accuracy. The implementation is shown in Appendix T,
line 24. The formula for getting the accuracy is shown below.
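The standard accuracy formula, consistent with the definition above, is given below, with TP, TN, FP, and FN denoting the true positives, true negatives, false positives, and false negatives, respectively:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```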
The second performance metric was the precision. It is defined as the ratio of the correctly
predicted values of a particular class to the total correctly and incorrectly predicted values of that
specific class. The classification_report module was used to get the precision score of the model.
In particular, a classification report was generated. The weighted average precision score was
considered as the precision of the model. The implementation is shown in Appendix T, line 21.
The third performance metric was the recall. It is the ratio of the correctly predicted values of a particular class to the total number of actual instances of that class. To get the value of the recall, a classification report was generated. The weighted
average recall score was considered as the recall of the model. The implementation is shown in
Appendix T, line 21. The formula for getting the recall is shown below.
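The standard recall formula, in the same TP/FN notation used for the accuracy, is:

```latex
\mathrm{Recall} = \frac{TP}{TP + FN}
```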
The fourth performance metric was the f1-score. It is the weighted harmonic mean of precision and recall. A classification report was generated to get the value of the f1-score. The weighted average f1-score was
considered as the f1-score of the model. The implementation is shown in Appendix T, line 21.
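The four metrics can be computed by hand from a confusion matrix, as the sketch below shows; the thesis obtains the same values via sklearn's classification_report and cross_val_score, and the counts here are illustrative only.

```python
# Illustrative confusion-matrix counts: true/false positives and negatives.
TP, TN, FP, FN = 40, 45, 5, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)   # correct over all observations
precision = TP / (TP + FP)                    # correct positives over predicted positives
recall = TP / (TP + FN)                       # correct positives over actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```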
An application is developed to solve an existing challenge or improve a current process (Burns & Dennis, 1985). In particular, web
development aims to create an application that can be accessed across the web using any device
with a browser. Various ML and web technologies were utilized to implement the proposed
screening application. Furthermore, the persisted SVC model was integrated and used in the
development. The technology stack considered in the implementation is shown in Table 2.5.
The table has two parts: the former shows the front-end technologies used and the tools for designing the User Interface (UI)
of the application. The latter provides a list of the different functions of the application and the
various Python libraries and frameworks used to construct them. The packages used in the
implementation are stored in the same venv used in the former steps and are listed in Appendix
A. These steps are further discussed in the succeeding sections. The system architecture of the application is likewise presented.
Over the years, the number of devices capable of accessing the Internet has grown. Gardner
(2011) reports that a new standard for web design has been proposed. According to the report, it
aims to standardize responsive web design to reduce the developer's workload by developing a
single application that can adapt across devices and improve user experience. The fundamental web technologies, including HTML, CSS, and JS, were used in designing the front-end. In
addition, Bootstrap and icons from Font Awesome were utilized. In particular, Bootstrap is a
CSS framework designed to make a web application responsive in any computing device with a
browser, including desktop computers, mobiles, and tablets. The design of the UI was kept simple.
More specifically, three pages were designed: search page, result page, and dashboard. The
first page was created to allow users to enter a product idea. The result page shows the results of
the screening. Finally, the dashboard provides a list of the good product ideas. The pages were
launched on different devices such as desktop computers, mobiles, and tablets enabled by
Google Chrome developer tools to assess the UI quality. Furthermore, the application was
opened in three major browsers, including Google Chrome and Mozilla Firefox, to check its compatibility.
The choice of the back-end framework was guided by studies on the technology stacks behind web applications (Kaluža, 2019). In recent years, a Python-based
microframework, Flask, has been the preferred web development platform for ML applications,
mainly because of its simplicity and ease of use (Aslam, Mohammed, & Lokhande, 2015; Mufid
et al., 2019). Flask is an open-source web framework for building data-driven and dynamic web
applications quickly. Flask and various Python libraries were utilized to develop the functions of
the screening application. The list of each function and its corresponding libraries is presented in the succeeding table. Seven functions were developed for the application: (1) reading the input, (2) preprocessing the input, (3) validating the input and
showing the error, (4) translating the input to English, (5) converting input into vector
representations, (6) classifying the input, and (7) showing the viable product ideas. These
# | Function | Description
1 | To read the input from the user | Accepts input from the user on the search page.
2 | To preprocess the input (convert input to lower case; remove special characters, emails, links, HTML tags, accented characters, repeated letters, and numerical values) | Removes noise words/characters from the input.
3 | To validate the input and show error messages | Validates that the input is a string of at least three characters, in English, and a noun. Error message/s appear if the input does not satisfy any of the above conditions.
4 | To translate non-English input to English | Ensures that only English and Tagalog inputs are accepted.
5 | To convert input to its vector representation | Performs word embedding on the input using the Word2Vec encoding scheme.
6 | To classify the input | Classifies whether an input is good and displays the results on the result page. The output comes from the trained text classification model.
7 | To show the product ideas | Displays viable product ideas on the dashboard.
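The validation function (function 3) can be sketched as below. It checks the length and character rules; the real application additionally verifies that the input is an English noun, which needs a POS tagger and is omitted here, and the error messages are illustrative wording.

```python
# Sketch of the input-validation rules: at least three characters, letters only.
def validate(user_input: str):
    errors = []
    if len(user_input.strip()) < 3:
        errors.append("Input must have at least three characters.")
    if not user_input.replace(" ", "").isalpha():
        errors.append("Input must contain letters only.")
    return errors  # empty list means the input passed these checks

print(validate("tv"))        # too short -> one error
print(validate("yoga mat"))  # passes -> []
```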
The flow of the application is shown in Figure 2.6. Its quality was tested with different
inputs, including numerical, non-English, Tagalog strings, and other possible invalid inputs.
Figure 2.6 A Flowchart of the screening application
Chapter 3: Results and discussions
This chapter presents and discusses the results of the study. It is divided into two main
sections: section 3.1) building the model and section 3.2) implementation. The first section
presents the dataset used in the study. Moreover, the performance of the model is revealed. In the
second section, the result of the implementation is shown. Specifically, the UI and the different
functions are highlighted. These sections are further explained in the succeeding paragraphs.
A model is an ML algorithm that has been trained using a particular dataset and uses the
acquired knowledge to identify some patterns to make a classification (Feldman, 2013). This
study built an SVC model that classifies a product idea as good or not good. This section is
divided into five subsections: section 3.1.1) data collection, section 3.1.2) data preprocessing,
section 3.1.3) constructing the criteria for screening, section 3.1.4) data preparation, and section
3.1.5) performance evaluation. First, the gathered dataset is shown, and significant statistics are
highlighted. In the second subsection, the preprocessed dataset is presented. The result of each
subprocess is explained. Next, the results of the screening using the extracted product ideas from
the tweets are discussed. Fourth, the prepared datasets used to train and test the model are presented. Lastly, the results of the performance evaluation are reported.
The datasets were gathered using an advanced scraping tool called Twint. It is an open-
sourced advanced tweet scraping tool often used in research to exploit publicly posted
information, including tweets (Dutch Osint Guy, 2018). The said tool collected tweets posted
within the Philippines from August to December 2020. It uses the coordinates and radius values
when considering which tweets to collect. This study used the coordinates of Marinduque
(12.072862,122.664139), the geographical center of the Philippines, and 768.54KM radius,
respectively. The value of the radius was found to be ideal, covering almost the entire territory of the country. Using a lower radius value resulted in collecting fewer tweets. On the other hand,
using a higher value tends to include tweets posted outside the Philippines, including tweets sent
from Malaysia in the southwest, Indonesia in the south, Vietnam in the west, and Taiwan and Japan in the north. Sample tweets collected outside the Philippines are shown in Table 3.1.
Table 3.1 Sample collected tweets when using a high radius value
Tweet | Country
@oogiri_zamurai なんでトイレのドアが開かないの? もうおれ我慢の限界なんだよね。 (Why won't the toilet door open? I'm at the limit of my patience.) | Japan
나는 네가 웃는 것을 보는 것을 좋아한다. 너는 또한 나를 행복하게 만들어 줘 오빠. 너는 나의 행복의 이유야. (I like watching you smile. You make me happy too, oppa. You are the reason for my happiness.) #예성버블 https://t.co/0baNrlGCcp | Korea
A total of 5,193,417 tweets were scraped during the data collection process. A few of the
collected tweets are shown in Appendix U. Each tweet has 35 attributes that give more
information about a particular tweet, including the date when a tweet is posted, the author, the
place where the tweet is posted, the language, and the number of retweets. The complete list of
all the attributes is shown in Appendix V. In particular, most of the gathered tweets were posted
in December 2020 with 1,194,486 (23%). Conversely, only 882,880 (17%) were collected in August 2020. The complete distribution of the collected tweets based on the month when they were posted is also presented.
Although the tweets were only limited to areas within the Philippines, 54 languages were
observed from the collected data. Specifically, more than half of the tweets are Tagalog (tl) with
53%. It is followed by English (en) with 30%. 9% consists of the other languages, including
Ukrainian (uk) and Spanish (es). The remaining 9% has no defined language. The percentage distribution of the languages is likewise presented.
Data preprocessing is an essential part of building a model. It aims to eliminate the noise in a
given dataset. The presence of the noise is detrimental to the performance of a model (Yang,
2018). The collected tweets went through a series of preprocessing steps, including (1) data
wrangling, (2) data reduction, (3) extracting the product ideas, (4) data annotation, and (5)
sentiment analysis. As a result, the dataset was reduced from over 5M rows and 35 columns to
2,926 rows with 6 columns. In particular, the columns comprise the annotated_product_idea, retweets_count, likes_count, replies_count, tweets_count, and polarity_score. The tweets were preprocessed using various NLP Python libraries. A few of the preprocessed tweets
are shown in Table 3.2. The results of data preprocessing are discussed in the succeeding
paragraphs.
Table 3.2 Sample preprocessed tweets
gym equipment shop | 4 | 1 | 1 | 16 | 0.3884
haircut | 4 | 1 | 12 | 1 | 0.0772
The first step in data preprocessing was data wrangling. It restructures raw data into
the desired format for data preprocessing. The collected tweets were transformed from a TSV
file format into a pandas dataframe. The transformation would allow the useful features of pandas to be used in the succeeding steps. The applied operation created a dataframe with 5,193,417 rows and 36 columns. A few of the contents of the created
dataframe is shown in Appendix U. The complete list of the columns is shown in Appendix V.
Second, data reduction was conducted to get rid of irrelevant data in the dataframe. It is a
technique to reduce the number of features in a dataset. Data reduction was made by reducing the
number of rows and columns of the created dataframe. In particular, the row selection only
considered English tweets. In column selection, only the columns that would be useful in getting
the engagements and sentiments of a product idea were included (Fischer & Reuber, 2011;
Sultana, Paul; Gavrilova, 2016; Anto et al., 2016; Ray & Chakrabarti, 2017; Geetha, Rekha, &
Rarthika, 2018; Costa et al., 2013). These include created_at, tweet, retweets_count,
replies_count, and likes_count columns. This step reduced the dimension of the dataframe into
1,557,442 rows and 5 columns. A few of the rows in the modified dataframe are shown in Table
3.3.
created_at | tweet | retweets_count | replies_count | likes_count
2020-10-18 09:20:19 PST | TRIPLE CHOCOLATED CHOCOLATE CAKE ROLLS 🤤😭 #sweets | 0 | 0 | 1
2020-11-27 18:53:17 PST | suggest me sum affordable gym equipment shops, i really need to lift some weights🥺 | 0 | 0 | 2
2020-12-31 23:59:58 PST | parang i want a haircut na. | 0 | 15 | 1
The third step to preprocess the dataset was to extract the product ideas from the tweets.
UGC produced in social media, including tweets, could be used to find and develop product
ideas (Nascimento & Da Silveira, 2017). The process of extracting the product ideas is further
divided into seven sub-steps, including (1) data precleaning, (2) pulling out the nouns, (3) data
cleaning, (4) removing stopwords, (5) lemmatization, (6) discarding duplicate words, and (7)
eliminating short words. The results of each sub-step are further discussed in the succeeding
paragraphs. As a result, a total of 118,387 product ideas were extracted from the tweets. Sample extracted product ideas are shown in Table 3.4.
The first sub-step to extract the product ideas was data precleaning. The tweets were initially
cleaned using regex and NumPy modules to remove noise words such as mentions, hashtags, and
links. A new column named precleaned_tweet was added to the modified dataframe to hold the
results of the data precleaning. Rows with null values in the newly created column
precleaned_tweet were dropped. Sample results are shown in Table 3.5. In particular, in the first
row, the hashtag #sweets was removed. In the second and third rows, the mention @jacob and the
link https://t.co/Ggcb0cJe0B were removed, respectively. This step further reduced the rows of
the dataframe.
Table 3.5 Sample precleaned tweets

tweet | precleaned_tweet
TRIPLE CHOCOLATED CHOCOLATE CAKE ROLLS 🤤😭 #sweets | TRIPLE CHOCOLATED CHOCOLATE CAKE ROLLS 🤤😭
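The precleaning sub-step can be sketched as follows. The exact patterns used in the study are not shown in the text, so these regular expressions are illustrative assumptions covering the three noise types it names (mentions, hashtags, and links).

```python
import re

import numpy as np
import pandas as pd

def preclean(tweet: str) -> str:
    """Remove mentions, hashtags, and links from a tweet (illustrative patterns)."""
    tweet = re.sub(r"@\w+", "", tweet)          # mentions, e.g. @jacob
    tweet = re.sub(r"#\w+", "", tweet)          # hashtags, e.g. #sweets
    tweet = re.sub(r"https?://\S+", "", tweet)  # links, e.g. https://t.co/...
    return tweet.strip()

df = pd.DataFrame({"tweet": [
    "want lasagna and lasagna only. https://t.co/Ggcb0cJe0B",
    "@jacob craving for cheese pizza",
    "#sweets",
]})
df["precleaned_tweet"] = df["tweet"].apply(preclean)

# Treat empty results as nulls and drop them, as described in the text.
df["precleaned_tweet"] = df["precleaned_tweet"].replace("", np.nan)
df = df.dropna(subset=["precleaned_tweet"])
print(df["precleaned_tweet"].tolist())
```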
The second sub-step to extract the product ideas was to pull out the nouns from the
precleaned_tweet column. Its result was used to determine which tweets contain possible
product ideas for the Philippines' SMEs. The nouns were selected because, generally, product
ideas are expressed through nouns (Malmasi & Dras, 2015). Three POS tagger libraries,
TextBlob, NLTK, and Spacy, were used, and their outputs were evaluated to select the best
library for pulling out the nouns. A comparison of the extracted nouns using the three libraries is
shown in Table 3.6. Among the three libraries, Spacy was the slowest at pulling out nouns,
taking 1 hour and 48 minutes, compared to TextBlob, which did the same task in 52 minutes and
39 seconds, and NLTK, which took 44 minutes and 15 seconds. The results of TextBlob and
NLTK are almost identical. Specifically, both often treated emojis, adverbs, and even upper-case
words as nouns, which leads to inaccurate noun extraction. Overall, Spacy got the most accurate
result and was chosen as the library. Sample results are shown in Table 3.7. A new column
named pulled_out_noun was introduced to the dataframe to store the pulled-out nouns. Rows
with null values in the pulled_out_noun column were removed from the dataframe. This step
further reduced the rows of the dataframe.
Table 3.6 Sample nouns extracted from the precleaned tweets using the three POS tagger
Table 3.7 Sample pulled out nouns from the precleaned tweet
The third sub-step to extract the product ideas was data cleaning. The pulled-out nouns were
cleansed using the preprocess_kgptalkie package. The said package provides, among others,
functions to remove any remaining unnecessary words and characters that were mistakenly
pulled out as nouns by the Spacy library. Each noun went through a series of preprocessing
tasks, including (1) converting words to lower case, (2) expanding contractions, (3) removing
emails, (4) removing html tags, (5) removing accented characters, and (6) removing repeated
and special characters. A new column named cleaned_noun was created to hold the results of
data cleaning. Sample cleaned nouns are shown in Table 3.8. In the table, the extracted noun was
converted to lower case in the first row. In addition, emojis such as the "cheese" and "heart" in
rows 1 and 2 were removed. Furthermore, in the second row, the repeated letters from the word
"cheese" were discarded. Finally, rows with null values were dropped from the dataframe. This
step further reduced the rows of the dataframe.
Next, the Tagalog (e.g., akin, ako, at, dapat) and English (e.g., could, or, and, rather)
stopwords were removed from the values of the cleaned_noun column, still using the Spacy
library. Stopwords are words that do not add much meaning to a sentence. A new column named
stopwords_free_noun was created to hold the results. A sample output is shown in Table 3.9.
The Tagalog stopword "na" in the sixth row was removed from the phrase "haircut na." Rows
with null values in the stopwords_free_noun column were eliminated from the dataframe. This
step managed to bring the number of rows of the dataframe down to 1,081,021.
Table 3.9 Sample cleaned nouns without the stopwords
tweet | cleaned_noun | stopwords_free_noun
@jacob craving for cheeeeeeeeese pizza 🧀🍕 in SM Baguio | cheese pizza sm | cheese pizza sm
want stir-fried spinach 😫❤ | spinach | spinach
want lasagna and lasagna only. 😭 https://t.co/Ggcb0cJe0B | lasagna lasagna | lasagna lasagna
suggest me sum affordable gym equipment shops, i really need to lift some weights🥺 | gym equipments shop | gym equipments shop
parang i want a haircut na. | haircut na | haircut
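The stopword removal can be sketched with spaCy's bundled stopword lists. This assumes the installed spaCy version ships both the English and the Tagalog `stop_words` modules; the word-by-word filtering shown here is an illustrative reading of the procedure.

```python
from spacy.lang.en.stop_words import STOP_WORDS as EN_STOPWORDS
from spacy.lang.tl.stop_words import STOP_WORDS as TL_STOPWORDS

# Combined English + Tagalog stopword set.
STOPWORDS = EN_STOPWORDS | TL_STOPWORDS

def remove_stopwords(noun_phrase: str) -> str:
    """Drop English and Tagalog stopwords from a cleaned noun string."""
    kept = [w for w in noun_phrase.split() if w.lower() not in STOPWORDS]
    return " ".join(kept)

print(remove_stopwords("haircut na"))          # the particle "na" should be dropped
print(remove_stopwords("could rather pizza"))  # English stopwords dropped
```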
The fifth sub-step to extract the product ideas was lemmatization. It is used to convert a word
into its base form (e.g., from equipments to equipment, from walked to walk, from mice to
mouse, from dove to dive). The values of the stopwords_free_noun column were lemmatized,
and a new column named lemmatized_noun was created to hold the results. A sample output is
shown in Table 3.10. In particular, the word "chocolated" from the first row was converted to
"chocolate." Furthermore, "rolls" was simplified to its singular form, "roll." Similarly, in the 5th
row, "gym equipments shop" was transformed to "gym equipment shop."
Table 3.10 Sample lemmatized nouns

stopwords_free_noun | lemmatized_noun
cheese pizza sm | cheese pizza sm
spinach | spinach
lasagna lasagna | lasagna lasagna
gym equipments shop | gym equipment shop
haircut | haircut
Since the lemmatization converted words into their base forms, duplicates were introduced in
the lemmatized_noun column. In sub-step 6, these duplicates were taken care of. A custom
function was created to remove the repeated words. A new column named unique_noun was
created to hold the results of the operation. A sample output is shown in Table 3.11. In
particular, the second "chocolate" was removed in the first row. Rows with null values in the
unique_noun column were dropped from the dataframe. This step further shrank the rows of the
dataframe.
Finally, in sub-step 7, short words were eliminated from the values of the unique_noun
column. These words consist of fewer than three characters. This was done to exclude words
that may not carry meaning. A new column named extracted_product_idea was created to store
the results. A sample output is shown in Table 3.12. The word "sm" in the second row has two
characters and therefore was removed. Rows with null values in the extracted_product_idea
column were removed from the dataframe. This further reduced the rows of the dataframe.
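Sub-steps 6 and 7 can be sketched as two small pure-Python functions. The text describes a custom function for duplicate removal; this is one plausible implementation, not the study's exact code.

```python
def remove_duplicates(noun_phrase: str) -> str:
    """Sub-step 6: drop repeated words while preserving first-seen order."""
    seen, unique = set(), []
    for word in noun_phrase.split():
        if word not in seen:
            seen.add(word)
            unique.append(word)
    return " ".join(unique)

def remove_short_words(noun_phrase: str, min_len: int = 3) -> str:
    """Sub-step 7: drop words with fewer than three characters, e.g. "sm"."""
    return " ".join(w for w in noun_phrase.split() if len(w) >= min_len)

print(remove_short_words(remove_duplicates("cheese pizza sm")))  # cheese pizza
print(remove_short_words(remove_duplicates("lasagna lasagna")))  # lasagna
```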
Going back to data preprocessing, the fourth step to accomplish was data annotation. This
step involves manually removing tweets that do not express a product idea and are therefore not
valid. The basis for considering a product idea as valid is the presence of emotion (e.g., hate,
want, love, disgust) expressed towards the noun in a tweet. According to Nascimento & Da
Silveira (2017), people use social media platforms to express their emotions on any particular
topic, including product ideas. A new column named label was added to the existing dataframe
to store the output of data annotation. A value of 1 indicates that the extracted product idea is
valid, while 0 means otherwise. Furthermore, since not all the extracted nouns are valid product
ideas, only those on the list of the most common product ideas for the Philippines' SMEs were
considered. The list of the product ideas for SMEs is shown in Appendix K. Another column
named annotated_product_idea was added to hold the validated product ideas.
The data annotation was done with the aid of BSIT undergraduate students. The dataframe,
with 522,943 rows and 12 columns, was divided into five subsets using the month when the
tweets were posted. This resulted in five subsets, namely, August.csv, September.csv,
October.csv, November.csv, and December.csv. The number of rows per subset before and after
data annotation is shown in Table 3.14. The first two subsets were assigned to the first student,
while the next two were delegated to the second. The remaining subset was reserved for the
researcher. The results made by the students were further validated by the researcher. Rows with
0 values in the label column were dropped from the dataframe, which brought the rows down to
11,280.
Next, sentiment analysis was carried out using a package called VADER. It is a package
designed for complex social media data. When getting the polarity scores, it considers
punctuations, emojis, slang, and abbreviated words that commonly appear in social media texts
(Hutto & Gilbert, 2014). A new column named polarity_score was added to the dataframe to
store the computed polarity scores. Rows with a polarity score of less than 0.05, representing
neutral and negative sentiments, were dropped. This left the dataframe with 2,926 rows. Sample
results are shown in Table 3.15.
Screening aims to evaluate new product ideas to determine which idea is worth investing in
(Rochford, 1991). A set of criteria was constructed to screen the values of the
annotated_product_idea column from the preprocessed dataset. The created criteria were based
on the factors considered by Baker & Albaum (1986). These criteria include potential market,
trend of demand, stability of demand, and market acceptance (see Table 2.2). A new dataframe
was created using the aggregated values of the tweets_count, retweets_count, likes_count,
replies_count, and polarity_score columns of the annotated dataframe with 522,943 rows (see
Table 3.12). The resulting dataframe comprises 2,926 rows and 5 columns. These columns
served as the basis for computing the screening criteria.
A new column named label was added to the new dataframe to hold the result of the
screening. The said column has two possible values. A value of 1 means that a product idea
passes the screening or is considered good. On the other hand, 0 implies that a product idea did
not satisfy at least one of the four criteria and, therefore, is not a good idea. A few of the results
of the screening are shown in Table 3.17. Out of the 2,926 annotated product ideas that were
screened, 2,145 failed, while the remaining 781 made it. For instance, acrylic, adobo, antibiotic,
yema cake, and airpod are considered good ideas because they satisfied all of the criteria for
screening. Specifically, acrylic has a potential market of 19,797.3%, which far exceeds the 33%
acceptable value. Its trend of demand, computed from November to December 2020, has an
upward direction. Moreover, its stability of demand, considered from September to October and
November to December, remains positive. Finally, its market acceptance has a positive
sentiment. The newly created dataframe was saved for the next step, which is data preparation.
The results of the screening in each criterion are further discussed in the succeeding paragraphs.
The first criterion considered in screening was the potential market. It is defined as an
estimate of the total market size for a particular product. It was measured through the tweet
engagement rate, defined as the ratio of the diffusion interactions (retweets, replies, and likes
count) to the total number of tweets (Muñoz-Expósito, 2017). A potential market of at least 33%
was set as the minimum value for considering a product idea to be viable (Mee, 2020). The
potential market values are stored under the column named potential_market. Out of the 2,926
product ideas, 1,482 passed the screening in terms of their potential market values. These
product ideas showed enough traction from possible customers to be considered viable product
ideas for a business venture. A few of these product ideas are shown in Table 3.18. Most
notably, acrylic, which is under the art accessories shop category, got the highest potential
market of all the screened product ideas.
Table 3.18 Sample product ideas that passed the potential market criterion

extracted_product_idea | potential_market
acrylic | 19,797.3%
adobo | 2,527%
antibiotic | 1,377%
yema cake | 101%
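The engagement-rate computation behind the potential market criterion can be sketched directly from its definition; the counts in the example are hypothetical.

```python
def potential_market(retweets: int, replies: int, likes: int, tweets: int) -> float:
    """Tweet engagement rate in percent: diffusion interactions over tweet volume."""
    return (retweets + replies + likes) / tweets * 100

def passes_potential_market(rate: float, minimum: float = 33.0) -> bool:
    """A product idea is considered viable when its engagement rate reaches 33%."""
    return rate >= minimum

# Hypothetical aggregated counts for one product idea.
rate = potential_market(retweets=50, replies=30, likes=120, tweets=100)
print(rate, passes_potential_market(rate))  # 200.0 True
```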
The second criterion considered for screening was the trend of demand. It refers to the
secular movement that describes the long-term direction of the data (e.g., upward or downward)
(Komlos, 1993). The trend of demand was monitored through the trendline's slope from August
2020 to December 2020 using the graphical method. A trend of demand greater than 0 was set as
the minimum value for considering a product idea to be viable (Komlos, 1993). The trend of
demand values were stored under the column named trend_of_demand. Out of the 2,926 product
ideas subject to screening, 2,908 passed the screening using the computed values of the trend of
demand. A few of these product ideas are shown in Table 3.19. These product ideas have seen an
increase in engagements in the last five months, from August 2020 to December 2020, which is
another way of saying that the number of people talking about these product ideas has grown. In
particular, adobo, yema cake, abaca face mask, and accessory shop are among the product ideas
with an upward trend of demand.
Table 3.19 Sample product ideas that passed the trend of demand criterion

annotated_product_idea | trend_of_demand
acrylic | 4.9
adobo | 5.0
antibiotic | 3.25
yema cake | 5
airpod | 4.91
accessory shop | 5
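The trendline slope behind this criterion can be sketched numerically with a least-squares fit; the study used the graphical method, so this NumPy version is an equivalent illustration, and the monthly volumes are hypothetical.

```python
import numpy as np

def trend_of_demand(monthly_counts) -> float:
    """Slope of a least-squares trendline fitted to the monthly tweet volumes."""
    months = np.arange(len(monthly_counts))  # Aug 2020 .. Dec 2020 -> 0..4
    slope, _intercept = np.polyfit(months, monthly_counts, deg=1)
    return float(slope)

# Hypothetical monthly volumes for one product idea; a positive slope passes.
print(trend_of_demand([10, 14, 19, 25, 30]) > 0)  # True
```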
The third criterion used in screening was the stability of demand. It describes the fluctuation
in the market size over a particular period. It was measured by getting the changes in the tweet
volume from November to December 2020. A stability of demand of at least 15% was set as the
minimum value for considering a product idea to be viable (Baremetrics, 2021). The stability of
demand values are stored under the column named stability_of_demand. Of the 2,926 screened
product ideas, 864 got a stable demand. A few of these product ideas are shown in Table 3.20.
The list shows that product ideas such as acrylic, adobo, antibiotic, yema cake, and airpod have
a consistent demand. All items in the list got a 15% score, which is the minimum value needed
to pass the criterion.
Table 3.20 Sample product ideas that passed the stability of demand criterion

annotated_product_idea | stability_of_demand
acrylic | 15%
adobo | 15%
antibiotic | 15%
yema cake | 15%
airpod | 15%
Finally, market acceptance is the degree of acceptance of the possible customers towards a
particular product idea. It was assessed by getting the polarity scores of the extracted product
ideas using the VADER package. The compound valence scores were considered for the value
of the market acceptance (Hutto & Gilbert, 2014). A market acceptance of at least 0.05 was set
as the threshold to consider a product idea to be viable. The market acceptance values are stored
under the column named market_acceptance. In particular, out of the 2,926 evaluated product
ideas, 1,272 got a favorable result in terms of the market acceptance criterion. A few of these
product ideas are shown in Table 3.21. Most notably, people view yema cake and acrylic
positively.
Table 3.21 Sample product ideas that passed the market acceptance criterion

annotated_product_idea | market_acceptance
acrylic | 0.444
adobo | 0.05
antibiotic | 0.05
Data preparation is the last step before the modeling (Zhang, Zhang, & Yang, 2003). The
columns of the screened dataframe, namely potential_market, trend_of_demand,
stability_of_demand, market_acceptance, and the label, were prepared using various packages in
Python. The prepared dataset has 780 good product ideas and 2,146 not good ones. The
vectorized product ideas were used as the input/independent variables, while the label values
were the output/dependent variable. Moreover, the prepared dataset was divided into two
subsets in an 80:20 ratio to create the training and testing datasets, respectively. In particular, the
training dataset contains 1,248 inputs, while the testing dataset has 312. A few of the prepared
inputs and outputs are shown below.
input | output
[0.36797234, 0.042480964, 0.099437095, -0.1050...] | 1
The product ideas were transformed into numerical values through word embedding using the
Word2Vec package. The said package converts strings into their vector representations. Each
vectorized input comprises an array of 300 numerical values; the sample above shows only the
first four values of the array.
A model rarely performs at its best using only its default parameters (Smit & Eiben, 2009).
The ideal values of the parameters of the model were identified through parameter tuning using
the RandomizedSearchCV package. The prepared dataset shown in the previous step was fed to
the model. The parameters considered were kernel, gamma, degree, and C. A total of 10
candidates were found and tested out in threefold cross-validation, totaling 30 fits. The values
were obtained in 1.5 minutes, and the results are shown in Table 3.24.
Table 3.24 Results of the parameter tuning

parameter | result
kernel | rbf
gamma | 100
degree | 6
C | 100
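The tuning setup can be sketched with scikit-learn. The candidate value grids and the random stand-in data are assumptions for illustration; the 10 candidates and 3-fold cross-validation (30 fits) follow the text.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Hypothetical stand-in for the vectorized training inputs and labels.
X = rng.normal(size=(60, 300))
y = rng.integers(0, 2, size=60)

# Candidate values are illustrative assumptions, not the study's exact grids.
param_distributions = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "gamma": [0.001, 0.01, 0.1, 1, 10, 100],
    "degree": [2, 3, 4, 5, 6],
    "C": [0.1, 1, 10, 100],
}

# 10 candidates x 3-fold cross-validation = 30 fits, as described in the text.
search = RandomizedSearchCV(SVC(), param_distributions, n_iter=10, cv=3, random_state=1)
search.fit(X, y)
print(search.best_params_)
```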
Performance evaluation was performed to assess how well the model classifies a product
idea. The proposed model was evaluated in four performance metrics: (1) accuracy, (2)
precision, (3) recall, and (4) f1-score using the balanced prepared dataset. Specifically, a
confusion matrix was generated to examine the number of correctly classified values and verify
the cross-validation results. Moreover, a classification report was generated and used to assess
the model's precision, recall, and f1-score values. The performance of the model is further
discussed in the succeeding paragraphs.
Accuracy was the first performance metric used in measuring the performance of the model.
Accuracy is the ratio of correctly predicted values to the total number of observations.
Cross-validation was performed to determine the consistency in the accuracy of the model, and
the average accuracy was considered the final accuracy. The cross-validation was carried out
using the 2,926 extracted product ideas, comprising 1,248 training and 312 testing samples. The
accuracy of the model started low at 49%. The score jumped to almost twice that value on the
second fold. It continued to fluctuate until it reached its peak of 88% on the fourth fold. Overall,
the average accuracy rate of 81% was considered the final score.
fold | accuracy
1st | 0.48717949
2nd | 0.85897436
3rd | 0.80128205
4th | 0.88461538
5th | 0.84615385
6th | 0.85256410
7th | 0.80769231
8th | 0.81410256
9th | 0.82692308
10th | 0.87820513
Average | 0.805769
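The fold-by-fold evaluation can be sketched with scikit-learn's cross_val_score; the random stand-in data is an assumption, while the tuned parameter values follow Table 3.24.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Random stand-in for the vectorized product ideas; the real study used
# 2,926 ideas embedded as 300-value arrays.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
y = rng.integers(0, 2, size=100)

# 10-fold cross-validation; the mean of the fold accuracies is taken as the
# final score, mirroring how the ten fold scores were averaged in the text.
model = SVC(kernel="rbf", gamma=100, C=100)
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(len(scores), scores.mean())
```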
Moreover, a total of 312 product ideas were used to further examine the performance of the
model by generating a confusion matrix. In particular, 160 of these are not good, while 152 are
good. Based on the assessments, the model correctly classified 116 as good product ideas and
147 as not good product ideas. Furthermore, a minimal 13 inputs were false positives, or type 1
errors. However, a larger portion of the misclassifications, 36 inputs, were false negatives, or
type 2 errors.
Precision was the second performance metric used in measuring the performance of the
model. It is computed to get the ratio of the correctly predicted values of a particular class to the
total correctly and incorrectly predicted values of that particular class. The score of the precision
is shown in Table 3.27. The weighted average was considered in measuring the precision of the
model. The report shows the model's performance using the 312 inputs from the testing subset. It
consists of 160 not-good and 152 good product ideas. Accordingly, the model has a 90%
precision rate when classifying a product idea as not good and 80% when it is good. Overall, it
got an 85% precision rate and was considered as the final value.
Recall is the third performance metric used in measuring the performance of the model. It is
calculated as the ratio of the correctly predicted values of a particular class to the total number
of actual values of that class. The score of the recall is shown in Table 3.28. The weighted
average was considered in measuring the recall score. The report shows the model's
performance using the 312 inputs from the testing subset. It consists of 160 not-good and 152
good product ideas. Accordingly, the model has a 92% recall rate when classifying a product
idea as not good and 76% when it is good. Overall, it got an 84% recall rate, which was
considered as the final value.
Finally, the f1-score is simply the weighted average of precision and recall. The f1-score is
shown in Table 3.28. The weighted average was considered in measuring the f1-score. The
report shows the model's performance using the 312 inputs from the testing subset. It consists of
160 not-good and 152 good product ideas. Accordingly, the model has an 86% f1-score when
classifying a product idea as not good and 83% when it is good. Overall, it got an 84% f1-score,
which was considered as the final value.
Overall, the performance evaluation shows that the accuracy of the model is at 84%.
Furthermore, it got a precision rate of 85%, a recall rate of 84%, and an f1-score of 84%. These
results are summarized in Table 3.28. The trained model was exported and used in the developed
application.
3.2. Implementation
The application was designed and developed using various ML and web technologies,
including HTML, CSS, and JavaScript, and the micro web framework Flask. This section is
presented in two parts: section 3.2.1) front-end development and section 3.2.2) back-end
development. First, screenshots of the UIs are shown. Afterward, the different functions are
discussed.
The UI of the application was implemented using Bootstrap and Flask’s templating engine,
Jinja2. In particular, Bootstrap was used to make the design responsive to any computing device,
including desktop computers, mobiles, and tablets. Furthermore, icons from Fontawesome were
added to improve the aesthetics. The created UIs were tested out on the different sizes of devices
enabled by Google Chrome developer tools. Finally, UIs were viewed in three major browsers
including Google Chrome, Mozilla Firefox, and Microsoft Edge. In particular, three pages were
created for the application, namely, the screen, result, and dashboard pages. Sample screenshots
of these pages are shown in the succeeding figures.
The screen page is the first page of the application. This page would allow Philippines'
SMEs owners or representatives to see the application's different features, such as entering their
new product ideas to the search bar, a link to see good product ideas through the dashboard, and
others. A screenshot of the screen page is shown in Figure 3.3. Moreover, when a user enters
invalid or incorrect input, the user is redirected back to the screen page. The nature of the error/s
and tips to prevent them would pop up. Screenshots of the screen page with invalid and valid
inputs are also shown.
The result page is displayed when the entered input is valid and there are no errors found.
This is the page where the result of the screening is revealed. It indicates whether the entered
product idea is good or not. A screenshot of the result page for face mask is shown in Figure 3.6.
Other sample screening results are shown in the succeeding figures.
The dashboard page shows the list of good product ideas. The list in this page is sorted
alphabetically. Other information, including the category, potential market, stability of demand,
and market acceptance, is also shown. A screenshot of the dashboard page is also provided.
The application functions were developed using Flask and a set of Python packages. A total
of seven functions were constructed, including reading the input, preprocessing the input,
validating the input and showing the error, translating the input to English, converting the input
into vector representations, classifying the input, and showing the good product ideas. The
screening application was tested with different inputs, such as numerical, non-English, and
Tagalog strings, and other possible invalid inputs. Sample invalid inputs with their error
messages, and classified inputs, are shown in the following tables.
Input | Error
123123123 | Your input is not valid. Please enter a string.
Gahmsahhahmnida | Your language is not supported. Please enter a Tagalog/English string only.
as | Your input is too short. Try longer string.
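The validation behind these error messages can be sketched as a plain function; the numeric check and the three-character minimum are assumptions inferred from the sample errors, and the language check (which would need a language-detection library) is omitted here.

```python
from typing import Optional

def validate_input(user_input: str) -> Optional[str]:
    """Return an error message for an invalid product idea, or None when usable.

    The messages mirror the sample errors; the exact checks the study used
    are not shown in the text, so these rules are illustrative assumptions.
    """
    text = user_input.strip()
    if text.isdigit():
        return "Your input is not valid. Please enter a string."
    if len(text) < 3:
        return "Your input is too short. Try longer string."
    return None  # the input can proceed to translation and classification

print(validate_input("123123123"))
print(validate_input("as"))
print(validate_input("yema cake"))
```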
Input | Output
acrylic | Good idea
adobo | Not a good idea
antibiotic | Good idea
yema cake | Not a good idea
airpod | Not a good idea
abaca face mask | Good idea
accessory shop | Good idea
Chapter 4: Conclusions and recommendations
4.1. Conclusions
This study developed a tool that utilizes UGC produced in social media to assist
Philippine-based SMEs in developing product ideas. This project was done in two steps. First, a
supervised ML model was trained to classify product ideas into good or not good. Second, an
application that uses the trained model to screen product ideas was developed.
The model was built using various Python libraries (see Appendix A). A few of the packages
used are pandas, vader, spacy, and scikit-learn. The dataset used to train the said model consists
of UGC on Twitter. In particular, over 5 million tweets were collected over five months (August-
December 2020) using an advanced scraping tool, Twint. The data collection was performed in
a Linux environment which was configured using Oracle VM. The collected tweets went through
a series of preprocessing steps, including data precleaning, pulling out the nouns from the tweets,
and removing the stopwords, to prepare the datasets for modeling. In addition, data annotation
was conducted to ensure that only valid product ideas were included in the final dataset. The
resulting dataset was used to perform the screening; in particular, the extracted product ideas
from the tweets were evaluated through engagement and sentiment information obtained from
the tweets, based on four criteria: potential market, trend of demand, stability of demand, and
market acceptance (Baker & Albaum, 1986). The results are represented by two values. A value of 1
indicates that a product idea is good, while 0 says otherwise. These data were used as the dataset
to feed the model. Since an ML algorithm cannot process text, the product ideas were
transformed into vector representations using the Word2Vec encoding scheme. The vectorized
extracted product ideas were used as the input. On the other hand, the label (e.g., 1 or 0) was
used as the output. The dataset was divided with a ratio of 80:20 to train and test the model,
respectively. The model was then evaluated in terms of its accuracy, precision, recall,
and f1-score. The evaluation shows that the model achieves an 84% accuracy rate when
classifying a product idea (e.g., good or not good). The model was saved and used in the
developed application.
The application was implemented using Flask, and its UIs were designed using the CSS
framework, Bootstrap. The developed application would allow SMEs’ owners or any
representative to enter and screen their product ideas. Specifically, the tool consists of seven
features, including reading the input, validating the input and showing the error, converting input
into vector representations, and classifying the input. Three web pages were designed to show
the features of the application, namely, screen page, result page, and dashboard. The application
was tested by launching it on different devices, including phones, tablets, and computers, using
Google Chrome developer tools.
This study would help contribute to the limited literature that explores social media data in
developing new product ideas (Nascimento & Da Silveira, 2017). It supports the findings of
Rathore & Ilavarasan (2020) regarding obtaining insights on user preferences from social media.
It demonstrates how tweets on Twitter could be used in screening product ideas for Philippines’
SMEs. The screening process is part and parcel of the seven-stage process of NPD. Other factors
need to be considered when conducting NPD, such as new product strategy development,
business analysis, development, and commercialization (Booz, & Allen & Hamilton, 1982). The
proposed application does not intend to replace the screening process; instead, it would
complement it. Specifically, it would provide the Philippines' SMEs a platform to quickly evaluate
and select which product ideas to consider for their next business venture. Furthermore, the
application would also allow the rising numbers of SMEs putting up a business in the online
platform economy nationwide to look for product ideas to sell (Villanueva, 2020).
4.2. Recommendations
The researcher gathered over 5 million tweets posted within the Philippines that covers a
five-month period, from August to December 2020. Only the 1,557,442 English tweets from the
entire dataset were considered. Future studies could collect more tweets by allowing a broader
time period of data collection. Also, later research could include tweets with Tagalog language
and other dialects in the Philippines. Furthermore, collecting real-time tweets could also help
provide more up-to-date insights regarding product ideas that could be critical in the screening
process. Additionally, other sources of UGC such as Facebook, Instagram, and YouTube could
also be included. This study only considered text-based UGC. Considerations for UGC in other
formats such as images can be explored. Interestingly, Instagram has been the most preferred
social media platform by consumers because of its ease of use and less complicated way of
viewing feedback (Smith, Fischer, & Yongjian, 2012). It has also been recommended that future
research examine SM platforms that are driven by images for specific types of product ideas.
Moreover, out of the 1,557,442 tweets considered in the study, a total of 2,926 product ideas
were extracted through data annotation. It was performed with the aid of two undergraduate
BSIT students. The preprocessed nouns that were pulled out from the tweets were marked as 1
for valid and 0 for invalid. The basis for considering a product idea as valid is the presence of
emotion (e.g., hate, want, love, disgust) expressed towards the noun in a tweet. Furthermore,
since not all nouns are product ideas, only those on the list of the most common product ideas
particular to the Philippines' SMEs were considered. For example, MoneyMax.com (2021)
published a list of small business ideas with capital in 2021. The researcher further validated the
results in a limited period of time. Similar studies could incorporate marketing experts when
annotating the product ideas, and over a more considerable amount of time, to have a more
in-depth analysis.
Furthermore, the annotated product ideas were transformed into numerical values by getting
their vector representations using the Word2Vec encoding scheme. A study suggests that
combining Word2Vec with TF-IDF tends to yield better results (Lilleberg, Zhu, & Zhang, 2015).
TF-IDF is a bag-of-words technique that assigns a numerical value to each word based on how
frequently it appears in a given collection of documents; the lower the frequency of a word
across the collection, the higher its value. Future studies could further enhance the model's
performance by using the features of both TF-IDF and Word2Vec when transforming the
extracted product ideas into their vector representations.
In addition, the study was able to build a model that classifies a particular product idea into
good or not good using the collected UGC on Twitter. However, the model was only trained
once, and its classification depends solely on a limited training dataset. Meanwhile, consumer
preferences are constantly changing; what are considered good product ideas today may no
longer be so next month. Future studies can consider implementing a mechanism to allow the
model to continuously learn as it accepts input from users. This would give the model more
intelligence over time.
Finally, a supervised ML algorithm, SVC, was used to build the model for classifying
product ideas. The parameters of the model were identified using RandomizedSearchCV. The
model's accuracy could be improved by getting the optimal parameter values using
GridSearchCV (Batayev, 2019). It is a technique that searches all the possible combinations of
the parameters to get the optimal values. This study could also be extended by using
unsupervised learning; alternatively, the SVC model could be compared against an unsupervised
learning approach.
References

Agrawal, A., & Bhuiyan, N. (2014). Achieving success in NPD projects. International Journal of
Social, Behavioral, Educational, Economic, Business and Industrial Engineering, 8(2),
476-481.
Akram Afzal, M. (2017). Risks in new product development (NPD) projects.
Albar, F. M. (2013). An Investigation of Fast and Frugal Heuristics for New Product Project
Selection.
Bahtar, A. Z., & Muda, M. (2016). The impact of User–Generated Content (UGC) on product
reviews towards online purchasing–A conceptual framework. Procedia Economics and
Finance, 37, 337-342.
Baker, K. G., & Albaum, G. S. (1986). Modeling new product screening decisions. Journal of
Product Innovation Management, 3(1), 32-39.
Batayev, N. (2019). Gas turbine fault classification based on machine learning supervised
techniques. In 2018 14th International Conference on Electronics Computer and
Computation (ICECCO) (pp. 206-212). IEEE.
Bayanihan to Heal As One Act. RA 11469. 18th Cong. (2020). Retrieved from
https://www.officialgazette.gov.ph/downloads/2020/03mar/20200324-RA-11469-
RRD.pdf
Bhimani, H., Mention, A. L., & Barlatier, P. J. (2019). Social media and innovation: A
systematic literature review and future research directions. Technological Forecasting and
Social Change, 144, 251-269.
Booz, Allen & Hamilton. (1982). New products management for the 1980s. Booz, Allen &
Hamilton.
Brainkart. (2021). Methods of measuring secular trend. Retrieved from
https://www.brainkart.com/article/Methods-of-Measuring-Secular-Trend_39269/
Carbonell, P., Mayer, M. A., & Bravo, À. (2015, January). Exploring brand-name drug mentions
on Twitter for pharmacovigilance. In MIE (pp. 55-59).
Chenworth, M., Perrone, J., Love, J. S., Graves, R., Hogg-Bremer, W., & Sarker, A. (2021).
Methadone and Suboxone® mentions on Twitter: Thematic and sentiment analysis.
Clinical Toxicology, 1-10.
Chu, S. C., & Kim, Y. (2011). Determinants of consumer engagement in electronic word-of-
mouth (eWOM) in social networking sites. International Journal of Advertising, 30(1), 47-75.
CMO’s Use of Social Media During COVID-19: For what purpose has your firm used social
media during the pandemic? MarketingCharts.com. (May, 2020). Retrieved from
https://www.marketingcharts.com/charts/cmos-use-of-social-media-during-covid-
19/attachment/cmosurvey-cmo-use-of-social-media-during-covid-19-jul2020
Costa, J., Silva, C., Antunes, M., & Ribeiro, B. (2013, April). Defining semantic meta-hashtags
for twitter classification. In International Conference on Adaptive and Natural Computing
Algorithms (pp. 226-235). Springer, Berlin, Heidelberg.
Dadgar, S. M. H., Araghi, M. S., & Farahani, M. M. (2016, March). A novel text mining
approach based on TF-IDF and Support Vector Machine for news classification. In 2016
IEEE International Conference on Engineering and Technology (ICETECH) (pp. 112-116).
IEEE.
De Brentani, U., & Droge, C. (1988). Determinants of the new product screening decision: A
structural model analysis. International Journal of Research in Marketing, 5(2), 91-106.
Dhaoui, C., Webster, C. M., & Tan, L. P. (2017). Social media sentiment analysis: lexicon versus
machine learning. Journal of Consumer Marketing.
Employment Situation in April 2020. (2020, June 5). Philippines Statistics Authority. Retrieved
from https://psa.gov.ph/statistics/survey/labor-and-employment/labor-force-
survey/title/Employment%20Situation%20in%20April%202020
Effendi, M. I., Sugandini, D., & Istanto, Y. (2020). Social Media Adoption in SMEs Impacted by
COVID-19: The TOE Model. The Journal of Asian Finance, Economics, and Business,
7(11), 915-925.
Ford, P., & Terris, D. (2017). NPD, design and management for SME's. The Design Society.
Ge, L., & Moh, T. S. (2017, December). Improving text classification with word embedding.
In 2017 IEEE International Conference on Big Data (Big Data) (pp. 1796-1805). IEEE.
Gharib, T. F., Habib, M. B., & Fayed, Z. T. (2009). Arabic Text Classification Using Support
Vector Machines. Int. J. Comput. Their Appl., 16(4), 192-199.
Global User Generated Content (UGC) Software Market Size, Status and Forecast 2020-2026.
MarketResearch.com. (2020). Retrieved from
https://www.marketresearch.com/QYResearch-Group-v3531/Global-User-Generated-
Content-UGC-13615161/
Guntuku, S. C., Sherman, G., Stokes, D. C., Agarwal, A. K., Seltzer, E., Merchant, R. M., &
Ungar, L. H. (2020). Tracking mental health and symptom mentions on twitter during
covid-19. Journal of general internal medicine, 35(9), 2798-2800.
Hayon, S., Tripathi, H., Stormont, I. M., Dunne, M. M., Naslund, M. J., & Siddiqui, M. M.
(2019). Twitter mentions and academic citations in the urologic literature. Urology, 123,
28-33.
Hughes, G. D., & Chafin, D. C. (1996). Turning new product development into a continuous
learning process. Journal of Product Innovation Management, 13(2), 89-104.
Impact of COVID-19 pandemic on micro, small, and medium enterprises (MSMEs). MSC.
(June, 2020). Retrieved from https://www.microsave.net/wp-content/uploads/2020/08/
Impact-of-COVID-19-on-Micro-Small-and-Medium-Enterprises-MSMEs-1.pdf
Islam, M. S., Jubayer, F. E. M., & Ahmed, S. I. (2017, February). A support vector machine
mixed with TF-IDF algorithm to categorize Bengali document. In 2017 international conference
on electrical, computer and communication engineering (ECCE) (pp. 191-196). IEEE.
Jeong, E., & Jang, S. S. (2011). Restaurant experiences triggering positive electronic word-of-
mouth (eWOM) motivations. International Journal of Hospitality Management, 30(2),
356-366.
Jespersen, K. R. (2007). Is the screening of product ideas supported by the NPD process
design? European Journal of Innovation Management.
Jiang, J. X., & Shen, M. (2017, May 1). Traditional media, Twitter and business scandals.
Joachims, T. (1998, April). Text categorization with support vector machines: Learning with
many relevant features. In European conference on machine learning (pp. 137-142). Springer,
Berlin, Heidelberg.
Kaluža, M., Kalanj, M., & Vukelić, B. (2019). A comparison of back-end frameworks for web
application development. Zbornik veleučilišta u rijeci, 7(1), 317-332.
Komlos, J. (1993). The secular trend in the biological standard of living in the United Kingdom,
1730‐1860 1. The Economic History Review, 46(1), 115-144.
Korde, V., & Mahender, C. N. (2012). Text classification and classifiers: A survey. International
Journal of Artificial Intelligence & Applications, 3(2), 85.
Kowsari, K., Jafari Meimandi, K., Heidarysafa, M., Mendu, S., Barnes, L., & Brown, D. (2019).
Text classification algorithms: A survey. Information, 10(4), 150.
Krumm, J., Davies, N., & Narayanaswami, C. (2008). User-generated content. IEEE Pervasive
Computing, 7(4), 10-11.
Kumar, S., Koolwal, V., & Mohbey, K. K. (2019). Sentiment analysis of electronic product
tweets using big data framework. Jordanian Journal of Computers and Information Technology
(JJCIT), 5(01).
Kurnia, R., Tangkuman, Y., & Girsang, A. (2020). Classification of User Comment Using
Word2vec and SVM Classifier. Int. J. Adv. Trends Comput. Sci. Eng, 9, 643-648.
Lee, I., & Shin, Y. J. (2020). Machine learning for enterprises: Applications, algorithm selection,
and challenges. Business Horizons, 63(2), 157-170.
Lee, M., & Youn, S. (2009). Electronic word of mouth (eWOM): How eWOM platforms
influence consumer product judgement. International Journal of Advertising, 28(3), 473-
499.
Lilleberg, J., Zhu, Y., & Zhang, Y. (2015, July). Support vector machines and word2vec for text
classification with semantic features. In 2015 IEEE 14th International Conference on
Cognitive Informatics & Cognitive Computing (ICCI* CC) (pp. 136-140). IEEE.
Magna Carta for Micro, Small and Medium Enterprises (MSMEs). RA 9501. 14th Cong. (2008).
Retrieved from https://www.officialgazette.gov.ph/2008/05/23/republic-act-no-9501/
Money Max. (2021, June 23). 32 Micro and Small Business Ideas You Can Start with Low
Capital in 2021. Retrieved from
https://www.moneymax.ph/personal-finance/articles/small-business-ideas-philippines
Mu, J., Peng, G., & Tan, Y. (2007). New product development in Chinese SMEs: Key success
factors from a managerial perspective. International Journal of Emerging Markets, 2(2),
123-143.
De Vera, B. (2020, May 23). Neda: PH economy lost P1.1T during lockdown. Retrieved
from https://business.inquirer.net/298037/neda-ph-economy-lost-p1-1t
Nascimento, A. M., & Da Silveira, D. S. (2017). A systematic mapping study on using social
media for business process improvement. Computers in Human Behavior, 73, 670-675.
Onarheim, B., & Christensen, B. T. (2012). Distributed idea screening in stage–gate
development processes. Journal of Engineering Design, 23(9), 660-673.
Osisanwo, F. Y., Akinsola, J. E. T., Awodele, O., Hinmikaiye, J. O., Olakanmi, O., & Akinjobi,
J. (2017). Supervised machine learning algorithms: classification and
comparison. International Journal of Computer Trends and Technology (IJCTT), 48(3),
128-138.
Owens, J. D. (2007). Why do some UK SMEs still find the implementation of a new product
development process problematical? Management Decision.
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using
machine learning techniques. arXiv preprint cs/0205070.
Pantano, E., Giglio, S., & Dennis, C. (2019). Making sense of consumers’ tweets. International
Journal of Retail & Distribution Management.
Park, C., & Lee, T. M. (2009). Information direction, website reputation and eWOM effect: A
moderating role of product type. Journal of Business research, 62(1), 61-67.
Pennebaker, J. W., Boyd, R. L., Jordan, K., & Blackburn, K. (2015). The development and
psychometric properties of LIWC2015.
Prantl, D., & Mičík, M. (2019). Analysis of the significance of eWOM on social media for
companies.
Schreiner, C., Torkkola, K., Gardner, M., & Zhang, K. (2006, October). Using machine learning
techniques to reduce data annotation time. In Proceedings of the Human Factors and
Ergonomics Society Annual Meeting (Vol. 50, No. 22, pp. 2438-2442). Sage CA: Los
Angeles, CA: SAGE Publications.
Mee, G. (2020, November 19). What is a good engagement rate on Twitter? Scrunch. Retrieved
from https://scrunch.com/blog/what-is-a-good-engagement-rate-on-twitter
Small Business Wage Subsidy Program. Department of Finance. (April, 2020). Retrieved from
https://sites.google.com/dof.gov.ph/small-business-wage-subsidy
Soukhoroukova, A., Spann, M., & Skiera, B. (2012). Sourcing, filtering, and evaluating new
product ideas: An empirical exploration of the performance of idea markets. Journal of
product innovation management, 29(1), 100-112.
Sultana, M., Paul, P. P., & Gavrilova, M. (2016). Identifying users from online interactions in
Twitter. In Transactions on Computational Science XXVI (pp. 111-124). Springer,
Berlin, Heidelberg.
Takeuchi, H., & Nonaka, I. (1986). The new new product development game. Harvard Business
Review, 64(1), 137-146.
Tankova, H. (2021). Number of global social media network users 2017-2025. Statista.
Retrieved from https://www.statista.com/statistics/278414/number-of-worldwide-social-
network-users/
Trackmyhashtag. (2021). Twitter hashtag statistics. Retrieved from
https://www.trackmyhashtag.com/hashtag-stats
Tufekci, Z. (2014, May). Big questions for social media big data: Representativeness, validity
and other methodological pitfalls. In Proceedings of the International AAAI Conference on
Web and Social Media (Vol. 8, No. 1).
Vaishnav, M., Dalal, P. K., & Javed, A. (2020). When will the pandemic end?. Indian Journal of
Psychiatry, 62(Suppl 3), S330.
Vancic, A., & Pärson, G. F. A. (2020). Changed Buying Behavior in the COVID-19 pandemic:
the influence of Price Sensitivity and Perceived Quality.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory (2nd ed.). Springer Verlag. Pp.
1-20. Retrieved from website:
https://www.andrew.cmu.edu/user/kk3n/simplicity/vapnik2000.pdf
Verworn, B., Herstatt, C., & Nagahira, A. (2006). The impact of the fuzzy front end on new
product development success in Japanese NPD projects (No. 39). Working Paper.
Villanueva, J. (2020). Rise of online shopping nationwide, not just from Metro. Philippine News
Agency. Retrieved from https://www.pna.gov.ph/articles/1112078
Wang, Z. Q., Sun, X., Zhang, D. X., & Li, X. (2006, August). An optimal SVM-based text
classification algorithm. In 2006 International Conference on Machine Learning and
Cybernetics (pp. 1378-1381). IEEE.
Weeg, C., Schwartz, H. A., Hill, S., Merchant, R. M., Arango, C., & Ungar, L. (2015). Using
Twitter to measure public discussion of diseases: a case study. JMIR public health and
surveillance, 1(1), e6.
Yin, Z., Fabbri, D., Rosenbloom, S. T., & Malin, B. (2015). A scalable framework to detect
personal health mentions on Twitter. Journal of medical Internet research, 17(6), e138.
Zhang, D., Xu, H., Su, Z., & Xu, Y. (2015). Chinese comments sentiment classification based on
word2vec and SVMperf. Expert Systems with Applications, 42(4), 1857-1863.
Zhang, S., Zhang, C., & Yang, Q. (2003). Data preparation for data mining. Applied artificial
intelligence, 17(5-6), 375-381.
Appendix A: List of Python packages
Appendix L: Baker and Albaum’s (1986) criteria for screening new product ideas
Criterion Examples
Sample rows from the collected tweets:

tweet | language | replies_count | retweets_count | likes_count | hashtags | link
"Happy new year Philippines #HappyNewYear2021" | en | 0 | 0 | 0 | ['happynewyear2021'] | https://twitter.com/hanswee3/status/1344674669266411521
"🙃" | und | 1 | 0 | 0 | [] | https://twitter.com/_rjtamayo/status/1344674665554472963
"Welcome 2021🥳❤️" | en | 0 | 0 | 0 | [] | https://twitter.com/Mty_Bautista/status/1344674664790908936
"@mystogan4u pakita ka muna ng balls 😁" | tl | 1 | 0 | 1 | [] | https://twitter.com/MissyLeeVixen/status/1344674657962577921
"Thank you @BTS_twt @BigHitEnt @TXT_members 💜 Happy New Year 🥳✨🥂" | en | 0 | 0 | 2 | [] | https://twitter.com/chinitalampas/status/1344674657580929028

The mentions column is [] for all rows except the last, which holds
[{'screen_name': 'bts_twt', 'name': '방탄소년단', 'id': '335141638'},
{'screen_name': 'bighitent', 'name': 'bighit entertainment', 'id': '168683422'}].

For every row above: urls, photos, cashtags, and reply_to are []; retweet is False;
video is 0; geo is 12.072862,122.664139,768536k; and quote_url, thumbnail, near,
source, user_rt_id, user_rt, retweet_id, retweet_date, translate, trans_src, and
trans_dest are NaN.
Appendix V: List of all the columns and the number of null values
# Column Null
1 id 0
2 conversation_id 0
3 created_at 0
4 date 0
5 time 0
6 timezone 0
7 user_id 0
8 username 0
9 name 3,251
10 place 5,014,984
11 tweet 0
12 language 0
13 mentions 0
14 urls 0
15 photos 0
16 replies_count 0
17 retweets_count 0
18 likes_count 0
19 hashtags 0
20 cashtags 0
21 link 0
22 retweet 0
23 quote_url 4,552,147
24 video 0
25 thumbnail 4,556,780
26 near 5,193,417
27 geo 0
28 source 5,193,417
29 user_rt_id 5,193,417
30 user_rt 5,193,417
31 retweet_id 5,193,417
32 reply_to 0
33 retweet_date 5,193,417
34 translate 5,193,417
35 trans_src 5,193,417
36 trans_dest 5,193,417
Appendix W: Content Editing Certification
Appendix X: English Editing Certification
Curriculum Vitae
Landley Bernardo
Address: Baguio City, Philippines, 2600
Phone: +639752826318
Email: lmbernardo@slu.edu.ph
WORK EXPERIENCE
02/2018 – 05/2018
Part-time faculty, Saint Louis University, Baguio City, Philippines
08/2017 – 02/2018
Management Information Specialist, Martha Property Management Inc., Baguio City, Philippines
ACHIEVEMENTS
Finalist, Philippine Startup Challenge 2016