
SCREENING PRODUCT IDEAS THROUGH USER-GENERATED

CONTENT IN SOCIAL MEDIA TO ASSIST SMALL AND MEDIUM

ENTERPRISES IN NEW PRODUCT DEVELOPMENT

by

Landley M. Bernardo

A Capstone Project

Submitted to the Faculty of the School of Advanced Studies

In Partial Fulfillment of the Requirements for the degree of

Master in Information Technology

Computing and Information Technology Programs

School of Advanced Studies

Saint Louis University

November 2021
ENDORSEMENT

The capstone project entitled SCREENING PRODUCT IDEAS THROUGH USER-


GENERATED CONTENT IN SOCIAL MEDIA TO ASSIST SMALL AND MEDIUM
ENTERPRISES IN NEW PRODUCT DEVELOPMENT prepared and submitted by
LANDLEY M. BERNARDO for the degree MASTER IN INFORMATION
TECHNOLOGY has been examined and is recommended for acceptance and approval for oral
examination.

This is to certify further that LANDLEY M. BERNARDO is ready for oral examination.

MARIA CONCEPCION CLEMENTE


Adviser

This is to certify that the capstone project entitled SCREENING PRODUCT IDEAS
THROUGH USER-GENERATED CONTENT IN SOCIAL MEDIA TO ASSIST SMALL
AND MEDIUM ENTERPRISES IN NEW PRODUCT DEVELOPMENT prepared and
submitted by LANDLEY M. BERNARDO for the degree MASTER IN INFORMATION
TECHNOLOGY is recommended for oral examination.

Beverly Estephany P. Ferrer, DIT, Member
Cecilia A. Mercado, PhD Ed., Member
Elisabeth D. Calub, MSIT, Member
Randy B. Domantay, DIT, Member

Beverly Estephany P. Ferrer, DIT
GPC for Computing and Information Technology Programs
School of Advanced Studies, Saint Louis University

Faridah Kristi Wetherick, PhD
Dean
School of Advanced Studies, Saint Louis University
APPROVAL SHEET

Approved by the Committee on Oral Examination as ____PASSED____ on __June 25, 2021_.

Beverly Estephany P. Ferrer, DIT, Member
Cecilia A. Mercado, PhD Ed., Member
Elisabeth D. Calub, MSIT, Member
Randy B. Domantay, DIT, Member

Accepted and approved in partial fulfillment of the requirements for the degree Master
in Information Technology.

Beverly Estephany P. Ferrer, DIT


Computing and Information Technology Programs
School of Advanced Studies
Saint Louis University

This is to certify that LANDLEY M. BERNARDO has completed all academic


requirements and PASSED the Public Lecture on FEBRUARY 18, 2021 for the degree of
Master in Information Technology.

Beverly Estephany P. Ferrer, DIT
GPC for Computing and Information Technology Programs
School of Advanced Studies, Saint Louis University

Faridah Kristi Wetherick, PhD
Dean
School of Advanced Studies, Saint Louis University
DEDICATION

I dedicate this capstone project to all SME owners in the Philippines and their employees

who have been working so hard to continue their business operations during these trying times.
ACKNOWLEDGEMENTS

First and foremost, I would like to praise and thank God, the Almighty, who has granted

countless blessings, knowledge, and opportunity to me so that I have been finally able to

accomplish this capstone project.

Thank you to Dr. Cecilia Mercado for endorsing me to the Department of Science and

Technology (DOST) PROJECT STRAND scholarship. I cannot thank you enough for your

personal recommendation. Regardless of whether or not I am accepted, the fact that you

recommended me means so much. I would also like to thank Mr. Dalos Miguel and Ms. Maria

Concepcion Clemente for taking time out of your busy day to put in a good word for me. Your

endorsement letters mean so much to me.

Thank you to the DOST STRAND team for this life-changing opportunity you have given to

me. I was both thrilled and honored to hear that I had been named as a recipient of the DOST

PROJECT STRAND scholarship. By awarding me the said scholarship, you have given me

another chance to go back to school and create something awesome. I hope one day I will be able

to help students achieve their goals just as you have helped me.

Special thank you to my adviser, Maria Concepcion Clemente, for your patience, guidance,

and support. I have benefited greatly from your wealth of knowledge and meticulous advice. I

am extremely grateful that you took me on as an advisee and continued to have faith in me over

the course of a year.

Thank you to my panel members, Dr. Beverly Ferrer, Dr. Randy Domantay, Dr. Cecilia

Mercado, and Ms. Elisabeth Calub. Your encouraging words and thoughtful, detailed feedback

have been very important to me. Thank you for the time you have allotted to accommodate my

oral presentation despite your hectic schedule.


Thank you to Mark Enriquez and James Botigan III for helping me with the data annotation.

I sincerely appreciate the time you spent assisting me with this project. I am well aware that you

have put in a lot of hard work on the task assigned to both of you.

Thank you to Dr. Randy Domantay for suggestions and comments regarding the format and

content of my paper. Those comments are all valuable and very helpful for revising and

improving my writing.

Thank you to Dr. Gerry Paul Genove for teaching me a handful of techniques on smartly

conducting research and making one during a few of my classes with you. Thank you to my

classmates in MIT class, Richard, Nikki, Mehdi, Eddie, Chesca, Jerome, and JL to whom I had

the chance to get to know and learn new perspectives about our shared interest in Information

Technology and life.

Thank you to Georgina and Anna Marie for helping me with my oral presentation. Your

feedback when I had my mock presentation prepared me well during my actual presentation.

Thank you to my friends and other people (too many to mention) who became part of my MIT

journey.

Finally, my deep and sincere gratitude to my family and relatives for their continuous and

unparalleled love, help, and support. I am grateful to my brother and cousin for always being

there for me as a friend and entertainer when I was down. I am forever indebted to my parents

for giving me the opportunities and experience that have made me who I am.
Abstract

This research aims to design and develop a tool to assist Philippines-based Small and

Medium Enterprises (SMEs) in New Product Development (NPD), more specifically, in

screening new product ideas. The application utilized a Machine Learning (ML) model that is

based on User-Generated Content (UGC) on Twitter. Due to the Coronavirus disease 2019

(COVID-19) global pandemic, most businesses in the Philippines have struggled, and some have

gone out of business, especially SMEs. Previous studies suggest that conducting NPD could

mitigate the impact of COVID-19 on businesses. However, the process of NPD could be

costly and tedious for SMEs, considering the limited resources they possess. Since the pandemic

broke out, the volume of UGC produced in social media has increased dramatically.

Nevertheless, studies focusing on exploiting this massive volume of UGC in NPD are quite

limited. This study developed an application that utilizes UGC on Twitter to assist SMEs in

performing new product idea screening. The application is powered by a text classification

model built with a supervised ML algorithm, the Support Vector Classifier (SVC). Over 5

million tweets were collected and preprocessed using a variety of libraries in Python to train the

model. The criteria for screening include the potential market, the trend of demand, stability of

demand, and market acceptance. At least 2,926 tweets expressing potential product

ideas relevant to common Philippine SMEs were extracted and vectorized using the Word2Vec

word embedding scheme. The resulting model achieved an accuracy of 84%. The

trained model was used to develop the proposed screening application. The application was

tested on different inputs, screens, and browsers to assess its quality. The output of this study

would contribute to the limited literature that exploits social media data in developing

product ideas.
Keywords

SME, NPD, New Product Idea Screening, UGC, Machine Learning, NLP, Support Vector
Classifier, Word2Vec
Chapter 1: Introduction

1.1. Background of the study

A recent assessment shows that the Philippines' Luzon-wide lockdown aimed at containing

COVID-19 resulted in an accumulated output loss of 1.1 trillion pesos (NEDA, 2020). Furthermore,

the nation recorded its highest unemployment rate of 17.7% (PSA, 2020). The Inter-

Agency Task Force on Emerging Infectious Diseases (IATF-EID) recommended General

Community Quarantine (GCQ) to prevent further losses and stabilize the economy, resulting in

the resumption of most business operations and other economic activities (Exec. Order 112, s.

2020). However, this had little effect on improving the economy because the pandemic had already

eroded consumers' confidence in the market (Vancic & Pärson, 2020). In a study conducted

by MSC (2020), 23% of the Small and Medium Enterprises (SMEs) temporarily closed their

operations, while 28% reduced business operations, affecting thousands of Filipino workers

nationwide (See Figure 1.1). If the impact of the pandemic on the SMEs continues, it could cause

the economy to collapse, driving more Filipinos to the edge of poverty (Bouey, 2020).

Figure 1.1 Status of the Philippines' SMEs during GCQ


SMEs are non-subsidiary, independent firms that employ 10-199 people and have total assets

not exceeding 100 million pesos (MSMEs, 2008). These include restaurants, parlors, and small-time

renting houses. SMEs play a significant role in providing goods and services to the masses,

creating jobs, and promoting innovation through competition (Leano, 2006). 82% of the

businesses in the Philippines fall under the category of SME (International Trade Centre, 2020).

Loans with lower interest rates and extended repayment periods are just

one of the programs that the government has rolled out to help small businesses get the financial

support they need to continue operations (Bayanihan to Heal As One Act, 2020; SBSW, 2020).

Despite these efforts, a survey says that 62% of the SMEs reported not receiving any financial

support from the government or even from non-government sources, such as families and

other investors (MSC, 2020) (See Figure 1.2).

Figure 1.2 Level of financial support for Philippines’ SMEs during the implementation of
ECQ in Metro Manila
Studies suggest that New Product Development (NPD) could help businesses stay

competitive and achieve prosperity in a rapidly changing market (Booz, Allen & Hamilton,

1982; Hughes and Chaffin, 1996; Ford & Terris, 2017). NPD is a process that transforms market
opportunities into a product available in the market (Takeuchi & Nonaka, 1986). It is a seven-

stage process consisting of new product strategy development, idea generation, screening and

evaluation, business analysis, development, testing, and commercialization (Booz, Allen &

Hamilton, 1982). The stages of NPD are illustrated in Figure 1.3. Screening is considered the

most critical stage because it selects and evaluates new product ideas from a pool of ideas

generated during new product ideation (Rochford, 1991). Its output is the deciding factor that

determines if an idea is fit for the next phase. Jespersen (2007) argues that screening is a

complex decision process highly influenced by market changes. Agrawal & Bhuiyan (2014)

created a critical success factors (CSF) framework that lists the metrics, CSF, and tools used in

each stage of NPD. In particular, the CSF for the screening phase is called Up-front homework.

It consists of numerous activities that aim to understand and analyze the current and future

market potential. Similarly, Baker & Albaum (1986) created a list of criteria for screening a new

product idea, including societal factors, business risk factors, demand analysis, market

acceptance, and competitive factors.

Figure 1.3 Stages of NPD


Over the years, various techniques were developed to perform screening of new product

ideas (Cooper, 1979; Baker & Albaum, 1986; Debrentani, 1988; Verworn, Herstatt, & Nagahira,

2006; Mu, Peng, & Tan, 2007; Soukhoroukova, Spann, & Skiera, 2012; Onarheim &

Christensen, 2012; Albar, 2013). However, these studies used a traditional approach in obtaining

the datasets (e.g., giving out questionnaires, conducting interviews) to screen new product ideas.

Although conventional data collection has proven effective, a study emphasizes that it could

bring validity issues, as questions are prone to misinterpretation (Pribyl, 1994). Furthermore,

traditional product screening is a tedious, expensive, and lengthy process, yet only 20% of new

product ideas reached commercialization (Ford & Terris, 2017; Rodríguez-Ferradas & Alfaro-

Tanco, 2016; Akram, 2017). Given the limited resources of SMEs, screening new product ideas

remains a major challenge (Owens, 2007).

Since the pandemic broke out, the number of people using social media (e.g., Facebook,

Twitter, and Instagram) has increased dramatically (Statista, 2020). In particular, businesses have

widely utilized these platforms to improve their brands and reach out to their target customers

(Effendi, Sugandini, & Istanto, 2020). Based on a survey conducted by MarketingCharts.com,

84% of firms use social media to build their brands and increase awareness (See Figure 1.4).

Another advantage that social media brings to a business that is often underestimated is the

availability of UGC (Chu & Kim, 2011; Prantl & Mičík, 2019). UGC has been defined as

publicly available content, such as text, image, video, and even audio, created by users, rather

than brands, to express one's opinion (Krumm & Davies, 2008). It is referred to by other

literature as Electronic Word-of-Mouth or eWOM (Park & Lee, 2009; Lee & Youn, 2009; Jeong

& Jang, 2011). Examples of UGC are posts (e.g., Facebook), comments (e.g., Instagram), tweets

(e.g., Twitter), and ratings and reviews (e.g., Amazon). It is the main contributor to the enormous
digital information produced on the Web, often referred to as Big Data (Tufekci, 2014). Social

media users are expected to reach 4.41 billion in 2025 (Statista, 2020) (See Figure 1.5).

Therefore, the amount of UGC generated in social media will also see a tremendous increase. In

a recent projection made by MarketingResearch.com (2020), the global UGC software market

will reach 447 billion dollars in 2026, and blogging platforms such as Twitter will be at the top

of the competition.

Figure 1.4 Applications of social media to SMEs


The emergence of UGC opens doors to possibilities that seemed unattainable a

few decades ago (Pantano, Giglio, & Dennis, 2019). Nascimento & Da Silveira (2017) argue that

UGC could be an alternative source of information for screening new product ideas. Rathore &

Ilavarasan (2020) explain that it is inexpensive and can provide real-time consumer behaviors.

Specifically, Twitter has been the leading source of UGC to measure customers' satisfaction

towards a particular product or brand (Bhimani, Mention, & Barlatier, 2018; Kumar, Koolwal, &

Mohbey, 2019). The platform allows its users to post concise tweets of up to 280 characters with
specific hashtags (e.g., #iphone12, #aegyocake, #talabykyla), making it searchable for

researchers (Kumar, Koolwal, & Mohbey, 2019). Newswire (2020) reports indicate that its Daily

Active Users grew by 34% in 2020, with 500 million tweets being sent every day.

Figure 1.5 Projection of social media users from 2020-2025.


Several studies have utilized tweets to measure products, services, or brand performances

(Fischer & Reuber, 2011; Sultana, Paul, & Gavrilova, 2016; Anto et al., 2016; Ray &

Chakrabarti, 2017; Geetha et al., 2018; Costa et al., 2013). In particular, Fischer and Reuber (2011)

used Twitter engagement to reach more target customers. Accordingly, it would be

an avenue for a business and its customers to develop a rapport and build trust. Interestingly,

Sultana, Paul, and Gavrilova (2016) used engagement data to identify the behaviors of selected users.

Sentiments are people's opinions toward a product. Sentiment analysis is the process of

deriving polarity scores from a piece of text, and it can be performed at three levels: document,

sentence, and aspect. Studies by Anto et al. (2016), Ray & Chakrabarti (2017), and

Geetha et al. (2018) categorized the ratings of a known mobile phone brand using tweets. In

particular, Ray & Chakrabarti (2017) assessed people's opinions of the iPhone 6 at both
the document and aspect levels. Finally, hashtags promote company visibility

and spread awareness of their mission and vision (Costa et al., 2013).

In the past, sentiment analysis has been the most used methodology in measuring product

performance (Pang, Lee, & Vaithyanathan, 2002). It is a type of text classification technique that

assigns a label to particular text (Kowsari et al., 2019). It computes the polarity scores from a

piece of text to determine its sentiment. Its output could be categorized into positive, negative, or

neutral. There are two general approaches in text classification: lexicon-based and machine

learning-based (Dhaoui, Webster, & Tan 2017). The former relies on a dictionary or sentiment

lexicon to determine the emotions of a text (Pennebaker et al., 2015). The latter uses Machine

Learning (ML) algorithms to identify patterns on a given dataset and use the acquired knowledge

to perform the classification (Feldman, 2013). The ML approach requires a massive amount of

data for training and testing to achieve an accurate result. However, implementing sentiment

analysis using the ML approach is more accurate than the Lexicon-based approach (Dhaoui,

Webster, & Tan 2017).
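To make the lexicon-based approach described above concrete, the following is a minimal sketch of a dictionary-based polarity scorer. The lexicon here is a tiny hand-made illustration, not a published resource such as the one by Pennebaker et al. (2015):

```python
# Minimal lexicon-based sentiment scorer. The word scores below are
# illustrative only, not taken from any published sentiment lexicon.
LEXICON = {"good": 1.0, "great": 2.0, "love": 1.5,
           "bad": -1.0, "terrible": -2.0, "hate": -1.5}

def polarity(text: str) -> str:
    # Sum the scores of all lexicon words that appear in the text.
    score = sum(LEXICON.get(word, 0.0) for word in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("I love this great product"))    # positive
print(polarity("terrible service, I hate it"))  # negative
```

As the sketch suggests, a lexicon-based scorer needs no training data, but its coverage is limited to the words in its dictionary, which is one reason ML-based classifiers tend to be more accurate.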

ML algorithms are generally divided into three types: supervised, unsupervised, and

reinforcement learning (Feldman, 2013). A supervised ML algorithm uses a labeled dataset for

training and testing. It is further subcategorized into regression and classification. The regression

deals with numbers such as mean/average prediction, while the classification tries to label a

particular input into a finite number of classes. Text classification (also called text

categorization) is the ML task of assigning labels to texts. A typical classification problem is

categorizing news into sports, business, politics, or entertainment (Dadgar et al., 2016). One of

the most well-established supervised ML algorithms is the Support Vector Machine (SVM)

(Vapnik, 1995). It is also considered one of the most popular, along with linear regression, k-
nearest neighbors (kNN), and Naïve-Bayes (Lee & Shin, 2020). SVM uses hyperplanes to

separate the observations into different classes (Wang et al., 2006). A good hyperplane is

achieved when it has the largest distance to the nearest data point. The structure of the SVM is

shown in Figure 1.6. Osisanwo et al. (2017) compared the performance of the different ML

algorithms, including decision trees, neural networks, Naïve-Bayes, kNN, SVM, and rule-

learners. The result shows that SVM supersedes all other algorithms in terms of accuracy and

tolerance to irrelevant attributes. Furthermore, Gharib, Habib, & Fayed (2009) confirmed the

effectiveness of SVM when dealing with a large text dataset. Similarly, SVM consistently

outperforms other alternative models (Joachims, 1998). Accordingly, the algorithm has the

potential to be used for more complex text classification problems.

Figure 1.6 SVM algorithm
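The hyperplane idea described above can be illustrated with scikit-learn's SVC on toy two-dimensional data. The points and parameters below are illustrative; the study's actual model was trained on Word2Vec-encoded tweets:

```python
import numpy as np
from sklearn.svm import SVC

# Two toy classes in 2-D; SVC finds the separating hyperplane with the
# largest margin to the nearest points (the support vectors).
X = np.array([[0, 0], [1, 1], [1, 0], [8, 8], [9, 9], [8, 9]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.predict([[0.5, 0.5], [8.5, 8.5]]))  # [0 1]
print(clf.support_vectors_)  # the boundary points that define the margin
```

Points near each cluster are assigned that cluster's class, and only the support vectors (the points closest to the boundary) determine where the hyperplane lies.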


One of the critical tasks in building an ML model for text classification is word embedding

(Church, 2017). It is a process of associating numerical or vector representations to a text for

mathematical computations (Rong, 2014). Several word embedding approaches are available in

performing this task, including Word2Vec. Word2Vec encodes texts by considering their
semantic values. For instance, the numerical equivalents of men and women are more similar

than of men and horses. Recent studies have shown that incorporating Word2Vec in SVM when

building a text classifier yields a better performance (Zhang et al., 2015; Şahİn, 2017; Kurnia et

al., 2020). Specifically, Zhang et al. (2015) and Kurnia et al. (2020) utilized UGC via users’

comments regarding mobile applications and clothing products, respectively, to train a text

classification model.
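The semantic property of Word2Vec described above (men and women being closer than men and horses) can be illustrated with cosine similarity. The vectors below are hand-made for illustration; a real Word2Vec model, such as gensim's, learns such vectors from a large corpus:

```python
import numpy as np

# Hand-made toy word vectors (illustrative only; a trained Word2Vec
# model learns such vectors from co-occurrence patterns in a corpus).
vectors = {
    "men":    np.array([0.90, 0.80, 0.10]),
    "women":  np.array([0.85, 0.75, 0.20]),
    "horses": np.array([0.10, 0.30, 0.90]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: close to 1 for similar directions, near 0 for unrelated ones.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically related words get closer vectors, hence higher similarity.
print(cosine(vectors["men"], vectors["women"]))   # ~0.996
print(cosine(vectors["men"], vectors["horses"]))  # ~0.364
```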

Previous studies have shown that UGC produced in social media could be an alternative

dataset for screening a particular product. However, products that have not been created (new

product ideas) are yet to be explored (Nascimento & Da Silveira, 2017). According to

Soukhoroukova, Spann & Skiera (2012), new product ideas that are not immediately captured

may fade away in an organization in no time. Similarly, Albar (2013) concluded that 90% of new

product ideas are rejected before they even reach formal evaluation. Therefore, it is high time to

utilize the abundance of available UGC for screening product ideas. In this study, a tool was

developed to screen product ideas using tweets. The application would assist the Philippines’

SMEs in selecting which product ideas to consider and invest in.

1.2. Statement of objectives

This study explored UGC produced in social media in developing new product ideas for

Philippine SMEs. Specifically, the objectives are the following:

1. To build a supervised text classification model using SVM.

2. To develop a web-based screening application using the trained SVM text classification
model.
1.3. Scope of the study

This study focuses on the screening phase of the NPD. The dataset used in the study was

limited to text-based UGC produced on Twitter; in particular, the dataset consists of tweets

posted within the Philippines from August to December 2020. The collected tweets went through

a series of data preprocessing using available Natural Language Processing (NLP) libraries in

Python. In addition, manual data annotation was conducted to verify the product ideas extracted

from the tweets. A list of common product ideas was used to determine whether the extracted

product ideas were valid. The screening considered four criteria: market potential, stability

of demand, the trend of demand, and market acceptance (Baker & Albaum, 1986). A static text

classification model was built using a supervised ML algorithm called SVC. The input variables,

which consist of the extracted product ideas, were transformed into their vector representations

using the Word2Vec encoding scheme. Parameter tuning was performed to get the ideal values of

four parameters of the SVC model, namely kernel, degree, gamma, and C. The model's

performance was assessed using four metrics: accuracy, precision, recall, and f1-score. The trained

model was exported and used in the development of the screening application. The proposed

application was implemented using the Python-based web microframework called Flask. A

proof-of-concept on how UGC could be utilized to assist SMEs in conducting NPD is developed.
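The parameter tuning and evaluation summarized above can be sketched with scikit-learn's GridSearchCV. The synthetic data and grid values below are illustrative stand-ins, not the study's actual dataset or search space:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for the Word2Vec-encoded product ideas: two separable clusters.
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(4, 1, (50, 5))])
y = np.array([0] * 50 + [1] * 50)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Grid over the four SVC parameters named in the text (values are illustrative).
grid = GridSearchCV(
    SVC(),
    {"kernel": ["linear", "rbf", "poly"], "degree": [2, 3],
     "gamma": ["scale", "auto"], "C": [0.1, 1, 10]},
    cv=3,
)
grid.fit(X_train, y_train)
pred = grid.predict(X_test)

# The four evaluation metrics named in the text.
print(grid.best_params_)
print("accuracy:", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall:", recall_score(y_test, pred))
print("f1:", f1_score(y_test, pred))
```

GridSearchCV cross-validates every parameter combination on the training split and keeps the best-scoring one, which is then evaluated on the held-out test split.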

1.4. Significance of the study

The output of this capstone project would help bridge the gap identified by Nascimento & Da

Silveira (2017) regarding the limited literature that utilizes data produced in social media to

develop new product ideas. The research also would prove how UGC could be an alternative

source of information when screening new product ideas (Owens, 2007). Moreover, the findings

would show a practical implication for SMEs by helping them to understand the importance of
consumer engagement in social media. Furthermore, the developed application would provide

product ideas to sell for the rising number of SMEs putting up businesses in the online platform

economy nationwide. One report shows that 77 percent of Filipinos consider the online presence

of SMEs a must (Villanueva, 2020). Most importantly, it would encourage the Philippines’

SMEs to consider developing new product ideas. Studies made by Booz, Allen & Hamilton

(1982), Hughes and Chaffin (1996), and Ford & Terris (2017) stated that NPD could help

businesses stay competitive and achieve prosperity in a rapidly changing market.


Chapter 2: Methodology

This section describes the steps taken to accomplish the objectives of this study. The

methodology is divided into two main sections: section 2.1) building the model and section 2.2)

implementation. Section 2.1 discusses the tasks involved in creating the ML model for

classifying product ideas as good or not good. This section is further divided into five

subsections, including data collection, data preparation, and performance evaluation. Section 2.2

shows a detailed discussion on how the proposed screening application is implemented using the

model built in section 2.1. Moreover, the technology stacks used in the implementation are

presented. The development is divided into two parts: front-end and back-end development. Each

section of the methodology is further explained in the succeeding paragraphs and is summarized

in Figure 2.1.

Figure 2.1 Overview of the methodology

The primary tool used to build the model (in section 2.1), is Anaconda Navigator. It is a

GUI-based application containing tools and pre-installed packages in Python for ML, also known

as conda packages. In particular, Jupyter Lab is one of the ML tools available by default in
Anaconda Navigator. It is an interactive open-source web application used for data exploration,

visualization, and analysis. Jupyter Lab was used to create the notebooks that contain the

different utility functions to build the model. A new virtual environment (venv) was created

through Anaconda Navigator's Environments tab to store the Python packages needed for building the model. A

package is a collection of modules. It is a file consisting of Python code that defines classes,

functions, and variables. Packages are used to build a powerful Python application. Creating a

venv would allow packages to be reinstalled easily when an unknown bug occurs in one of the

packages installed. Anaconda Navigator uses the channel named defaults for

installing and adding packages to a venv. Additional channels were added to access more packages,

such as the conda-forge and pytorch channels. The packages stored in the venv and their roles

are listed in Appendix A.

Section 2.2 discusses the implementation of the proposed screening application. Python is the

language of choice when performing analysis for big data (Oliphant, 2007). Specifically, the

application was developed using Flask. It is an open-source web Python microframework for

building data-driven and dynamic web applications quickly. It follows the Model-View-

Controller (MVC) architecture pattern that separates the application into three main components:

the model, view, and controller; these components represent the data or the database, the display,

and the application's logic, respectively. Flask is the most popular web framework for Python,

along with Django (Aslam, Mohammed, & Lokhande, 2015; Mufid et al., 2019). It has 52,400

stars and 13,800 forks on its Github page at the time of this writing. Unlike Django, which is a

full-stack and comes with pre-built dependencies, libraries, and layouts, Flask is lightweight. It

only offers suggestions for possible tools for developing the application, giving developers the
flexibility and freedom to select other technologies for implementation. The Flask framework is

relatively new. Its latest stable version at the time of writing (1.1.2) was released on April 3, 2020.

Nevertheless, its community has been growing. It currently has 621 contributors and is trusted

by more than 5,000 projects, including well-known brands such as Netflix, Reddit, and Lyft.
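As a sketch of how such a Flask application might be structured, the minimal example below exposes one route that returns a screening label. The route name and the stand-in classifier are hypothetical, not the study's actual code:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical stand-in for the exported SVC model; the actual application
# would load the trained classifier at startup (e.g., with joblib.load).
def classify_idea(idea: str) -> str:
    return "good" if "milk tea" in idea.lower() else "not good"

# Controller: receives a product idea and returns the model's label.
@app.route("/screen", methods=["POST"])
def screen():
    idea = request.get_json()["idea"]
    return jsonify({"idea": idea, "label": classify_idea(idea)})

# app.run(debug=True) would start Flask's development server.
```

In MVC terms, the route function acts as the controller, the loaded classifier plays the role of the model, and the JSON response (or a rendered template) is the view.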

In addition, the Integrated Development Environment (IDE) of choice was PyCharm

Professional. Although the professional version comes with a price, Jetbrains, the creator of

PyCharm, offers a free 1-year subscription for students. An IDE is a software application that

provides comprehensive facilities for software development. PyCharm is a widely

recommended code editor for building Python-based projects. It is equipped with all the

necessary tools for modern development, including a command-line interface (CLI), features to

connect to the database, create a venv, and integrate with Github. Also, it offers features for

handling big data and developing data-driven applications like conda integration to manage

packages for Python, scientific libraries and plots for performing data analytics and visualization,

and coding assistance for Python frameworks like Flask. The same venv used in section 2.1 was

utilized in the development. Once the venv was created and packages were installed, a new

Flask project was initiated in PyCharm. PyCharm used the created venv as its interpreter to

import all the packages needed for the development (see Appendix A).

2.1. Building the model

A model is an ML algorithm that has been trained to identify patterns on a given dataset. It

uses its acquired knowledge to perform a prediction or classification (Feldman, 2013). This study

proposed a supervised text classification model that classifies a particular product idea as good or

not good. The SVM ML algorithm was considered in building the said model. According to

Joachims (1998), SVM consistently outperforms other alternative models in text classification.
This section is further broken down into five subsections: section 2.1.1) data collection, section

2.1.2) data preprocessing, section 2.1.3) constructing the criteria for screening, section 2.1.4)

data preparation, and section 2.1.5) performance evaluation. First, data collection explains how

the data needed for the study was gathered using an advanced scraping tool. Second, data

preprocessing provides a detailed look into how the collected data were cleaned and turned into

valuable data for modeling. It also shows the process of extracting product ideas from the actual

tweets. In the third step, the criteria for screening product ideas were constructed. Furthermore,

the results of the screening process of the extracted product ideas were labeled as good or not

good. Fourth, the dataset for modeling was prepared and divided into two subsets to train and test

the model. Finally, the model was trained, and its performance was assessed in different

performance metrics. The trained model was saved and used for the implementation of the

screening application.

2.1.1. Data collection

A variety of scraping tools are available for mining tweets on Twitter. The microblogging

platform [Twitter] allows third-party applications to connect and collect tweets using an

approved and authenticated developer account through a secured channel called Twitter

Application Programming Interface (API) (Kumar, Koolwal, & Mohbey, 2019). However, the

number of tweets allowed to be collected is limited. Specifically, the Twitter API v2 has a

threshold of 500,000 tweets per month.

Alternatively, the Twitter Intelligence Tool (Twint) allows retrieval of tweets with no limits and no

API required (Dutch Osint Guy, 2018). It is an open-source advanced tweet scraping tool often

used for Open-Source Intelligence (OSINT) research. OSINT tools focus on collecting,

analyzing, and using publicly posted information (e.g., reviews, tweets, and Facebook feeds) for
research purposes. Hence, Twint was utilized to collect the data needed for the study. In

particular, these data were the tweets that were posted within the Philippines from August to

December 2020.

To start off, a Linux-based Operating System (OS), Ubuntu, was loaded and configured in

Oracle Virtual Machine (VM) Virtual Box. The VM enables a machine to run more than one OS

at a time within the base OS. Since Twint is compatible with the Linux environment, the VM

was required to run Twint in Windows, the researcher’s default OS. The specifications set for the

virtual OS are shown in Table 2.1.

Table 2.1 Ubuntu’s configurations

Hardware             Specification
Operating System     Linux
Distro               Ubuntu 18.04.01
Processor            2 CPUs
Base memory          4608 MB
System type          64-bit Operating System, x64-based processor
Storage              40 GB

Furthermore, Twint's dependencies, including the Python language, Pip Installs Packages (pip), and Git, were installed into the virtual OS. Notably, a new virtual environment (venv) was created, into which the remaining dependencies were installed using pip. A venv allows a developer to maintain multiple Python-based applications on a single machine; pip is the standard package manager for Python. To ensure that only

tweets posted within the Philippines are included, the coordinates of Marinduque

(12.072862,122.664139) within the radius of 768.54KM were specified in the Twint's geo

parameter. The code used in data collection is shown in Appendix B. Marinduque was selected

as the coordinates because it is considered to be the geographical center of the Philippines. In


addition, the specified radius is an ideal estimation value that covers the entire Philippines,

ensuring that only the needed data are collected. The places in the Philippines where tweets were

collected are shown in Appendix C.
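As a hedged sketch of this step (the helper name make_geo and the commented Twint options are illustrative, not the study's exact Appendix B code), the geo parameter Twint expects can be assembled like this:

```python
def make_geo(lat: float, lon: float, radius_km: float) -> str:
    """Build the geo string Twint expects: '<lat>,<lon>,<radius>km'."""
    return f"{lat},{lon},{radius_km}km"

geo = make_geo(12.072862, 122.664139, 768.54)
print(geo)  # 12.072862,122.664139,768.54km

# With Twint installed, the collection itself looks roughly like this
# (attribute names come from Twint's Config; dates match the study's window):
# import twint
# c = twint.Config()
# c.Geo = geo
# c.Since = "2020-08-01"
# c.Until = "2020-12-31"
# twint.run.Search(c)
```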

2.1.2. Data preprocessing

In predictive modeling (e.g., text classification), raw data cannot typically be used as it is. It

requires preprocessing to ensure that the dataset is fit for modeling. Data preprocessing is a process that aims to remove the noise in a given dataset. Examples of noise include special characters, unnecessarily duplicated letters, and stopwords, such as “the,” “a,” and “can.” The presence of such noise could compromise the performance of an ML model when performing text classification (Yang, 2018). The process of data preprocessing varies

accordingly. It is highly dependent on the defined problem. This study utilized the programming

language Python to preprocess the collected data. Python is a multi-purpose language that serves

different needs, from basic web applications to training models for ML and Artificial Intelligence

(AI). It also offers a variety of packages from various sources to turn raw texts into useful

information (Karczewski, 2021).

Data preprocessing comprises five subprocesses: (1) data wrangling, (2) data reduction, (3) extracting the product ideas, (4) data annotation, and (5) sentiment analysis. These steps are summarized in Figure 2.2. First, the collected data were transformed from a Tab-Separated Values (TSV) format into a dataframe, which enables useful functions to aid with the data preprocessing. Next, the rows and columns of the created dataframe were reduced to select only the relevant data for the study. Third, numerous data cleansing techniques were carried out using available NLP packages in Python to extract possible product ideas from the tweets. Fourth, the extracted product ideas were manually annotated. Lastly, the polarity scores of the extracted product ideas were determined through sentiment analysis.
Figure 2.2 Steps of the data preprocessing
In the initial stages of data preprocessing, the collected UGC were moved from the

configured virtual OS to the Windows host machine via the shared folder feature of the Oracle

VM. Once the data were transferred, the data preprocessing began.

The first step of data preprocessing was data wrangling. Data wrangling restructures raw data

into the desired format to aid with the data preprocessing. The initial dataset was transformed

from its original file format [TSV] into a dataframe using the pandas package. A dataframe is a

2D data structure that provides streamlined forms of data representation. It is a table that consists

of columns as the labels and rows as the actual data. The pandas library was used because it
provides rich functions for data preprocessing. In addition, the UTF-8 encoding scheme was specified in pandas' encoding parameter so that the emojis in the tweets could be read and displayed. The code used in data preprocessing is shown in Appendix D, lines 4-8.
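A minimal sketch of this wrangling step follows; the two-row TSV sample and its columns are invented for illustration, as the real Twint export contains many more fields:

```python
import io
import pandas as pd

# Tiny stand-in for the scraped Twint output.
tsv_data = "id\ttweet\tlanguage\n1\tI want a yoga mat \U0001F9D8\ten\n2\tMasarap ang kape\ttl\n"

# UTF-8 is specified so emojis in the tweets survive the read.
df = pd.read_csv(io.BytesIO(tsv_data.encode("utf-8")), sep="\t", encoding="utf-8")
print(df.shape)  # (2, 3)
```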

The second step of data preprocessing is data reduction. Data reduction is a technique to

reduce the number of features of a particular dataset. In this step, the columns and rows that were irrelevant to the study were removed. The selection considered the data that would be helpful when screening new product ideas. Starting with the rows, non-English tweets were removed

from the dataframe. This would limit the dataset to English words that work well with available

NLP libraries in Python. Also, it would lead to the improvement of the quality of the dataset by

removing nuisance words that are hard to interpret. The code used to reduce the rows of the

dataframe is shown in Appendix D, lines 10-12. In the case of column selection, attributes that

could be used in measuring the performance of the extracted product ideas using available

information in the scraped UGC were considered. These include attributes with numerical values

which would be beneficial in getting the engagements and sentiments (Fischer & Reuber, 2011; Sultana, Paul, & Gavrilova, 2016; Anto et al., 2016; Ray & Chakrabarti, 2017; Geetha, Rekha, &

Rarthika, 2018; Costa et al., 2013). The code used to reduce the columns of the dataframe is

shown in Appendix D, lines 14-22.
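The row and column reduction can be sketched as follows; the toy dataframe and its column names are assumed for illustration and only approximate Twint's export:

```python
import pandas as pd

# Toy dataframe with a subset of Twint-style columns (names assumed).
df = pd.DataFrame({
    "tweet": ["I want a yoga mat", "Gusto ko ng halaman", "Need a new planner"],
    "language": ["en", "tl", "en"],
    "retweets_count": [3, 1, 0],
    "likes_count": [10, 2, 5],
    "replies_count": [1, 0, 2],
    "link": ["...", "...", "..."],
})

# Row reduction: keep only English tweets.
df = df[df["language"] == "en"]

# Column reduction: keep only attributes useful for engagements and sentiments.
keep = ["tweet", "retweets_count", "likes_count", "replies_count"]
df = df[keep].reset_index(drop=True)
print(df.shape)  # (2, 4)
```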

The third step of data preprocessing was extracting the product ideas. According to

Nascimento & Da Silveira (2017), one way to find product ideas for business is in UGC produced in social media. This information provides insights into consumers’ preferences and, with careful planning and the right tools at hand, can be utilized by businesses for product development (Rathore & Ilavarasan, 2020). Extracting product ideas consists of seven sub-steps,

including (1) data precleaning, (2) pulling out the nouns, (3) data cleaning, (4) removing the
stopwords, (5) lemmatization, (6) discarding the duplicate words, and (7) eliminating the short words. These steps are explained in detail in the succeeding paragraphs.

The first step to extract the product ideas from the tweets was data precleaning. It was

intended to initially clean the tweet by discarding word patterns including mentions (@),

hashtags (#), and links (http and https). To do this task, a utility function was constructed using

the regex and numpy libraries. A new column named precleaned_tweet was created

separately from the original tweet to store the output of the applied operation. The code used to

preclean the tweets is shown in Appendix E. Rows with null values in the newly created

precleaned_tweet column were removed from the dataframe. The results of this step were used

in a later step, sentiment analysis.
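The precleaning described above can be sketched with the standard re module; the function name and exact patterns are illustrative of, not identical to, the Appendix E code:

```python
import re
import numpy as np

def preclean(tweet: str):
    """Strip mentions, hashtags, and links; return NaN when nothing is left."""
    text = re.sub(r"@\w+|#\w+|https?://\S+", "", tweet)
    text = re.sub(r"\s+", " ", text).strip()
    return text if text else np.nan

print(preclean("@shop I want a yoga mat #fitness https://t.co/abc"))  # I want a yoga mat
```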

The next step to extract the product ideas was to pull out the nouns from

the precleaned_tweets using a Part-of-speech (POS) tagger. POS tagging is a process that labels

each word on a sentence based on the part of speech (e.g., noun, verb, adverb, adjective,

pronoun, conjunction, and interjection) they belong to. Specifically, a noun is a part of speech

that names a person, thing, idea, action, or quality. An example of POS tagging is shown in

Figure 2.3.

Figure 2.3 An example of a tweet where the words are tagged and labeled using a POS tagger. In this example, the words yoga and mat are labeled as nouns.
This study used nouns to represent possible product ideas. Generally, product ideas are

expressed through nouns (Malmasi & Dras, 2015). By pulling out the nouns from the values of

the precleaned_tweet, it would help determine which tweets contain possible product ideas for

the Philippines’ SMEs. Rows with no extracted nouns were removed from the dataframe. Three

POS tagger libraries, namely, TextBlob, NLTK, and Spacy, were used, and their results were

manually evaluated to ensure the quality of the extracted nouns. The final result was stored in a

new column named pulled_out_noun. The code for extracting the nouns is shown in Appendix

F.

The third task to extract the product ideas was data cleaning. The results of the previous step

[pulling out the nouns] were cleansed to remove any unnecessary words and characters

mistakenly pulled out as nouns by the selected POS tagger. This step includes (1) converting

words to lower case, (2) expanding contractions, and (3) removing email addresses, html tags,

accented characters, special characters, and unnecessary repeated letters. The

preprocess_kgptalkie package was utilized to cleanse the nouns. A new column named

cleaned_noun was created to store the results of data cleaning. The code used to clean the

extracted nouns is shown in Appendix G. Rows with null values in the new column

cleaned_noun were removed from the dataframe.

The fourth task to extract the product ideas was the removal of the stopwords. The stopwords removed include both Tagalog (e.g., akin, ako, at, dapat) and English (e.g., can, might, by) words. The complete list of the stopwords removed from the cleaned nouns is shown in Appendix H.

Stopwords are words that do not add much meaning to a sentence and can be safely removed

without compromising the meaning of a sentence. The stopwords were removed using the
Spacy library. By default, Spacy only contains the stopwords for the English language; the

Tagalog stopwords were added to the list. A new column named stopwords_free_noun was

added to the dataframe to store the new values of the applied operation. The code used to get rid

of the stopwords is shown in Appendix I. Rows with null values in the newly added stopwords_free_noun column were removed from the dataframe.
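A plain-Python sketch of the bilingual stopword removal follows; the word sets below are abbreviated stand-ins for the full lists in Appendix H, and the study itself used spaCy's English list extended with Tagalog entries:

```python
# Abbreviated stopword sets for illustration only.
english_stopwords = {"can", "might", "by", "the", "a"}
tagalog_stopwords = {"akin", "ako", "at", "dapat", "ng"}
stopwords = english_stopwords | tagalog_stopwords

def remove_stopwords(nouns: str) -> str:
    """Drop any token found in the combined stopword set."""
    return " ".join(w for w in nouns.split() if w not in stopwords)

print(remove_stopwords("ako yoga mat at planner"))  # yoga mat planner
```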

The fifth step to extract the product ideas was lemmatization. It is the process of converting a

word into its base form, removing endings such as -s, -ing, and -ed (e.g., from equipments to

equipment, from walked to walk). Sometimes a word could be an irregular verb, making its

conversion quite different (e.g., from mice to mouse, from dove to dive). Lemmatization was

employed to get the base form of the nouns, which would help to further simplify the values of

the extracted nouns. The make_base method from the preprocess_kgptalkie package was used to perform the lemmatization. A new column named lemmatized_noun was

introduced to save the result of lemmatization. The code used to lemmatize the words is shown

in Appendix J, line 3.

Once the nouns were lemmatized, the next step was to discard the duplicates. The application of lemmatization to the nouns resulted in word duplication, which needed to be handled. A

custom function was constructed to remove the duplicates. A new column

named unique_noun was added to the dataframe to keep the unique words. The code used in

removing the duplicates is shown in Appendix J, line 14. Rows with null values in the

new unique_noun column were removed from the dataframe.

Finally, short words were eliminated from the unique_noun column. In particular, these are

the words that have fewer than three characters. There is a high possibility that words in this category are merely noise and do not represent a product idea; they were therefore discarded.
A custom one-liner function was designed to apply this step. A new column

named extracted_product_idea was introduced to store the result of the applied operation. The

code used to remove words with fewer than three characters is shown in Appendix J, line 17. Rows with null values in the newly created extracted_product_idea column were removed from the dataframe.
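The last two sub-steps, discarding duplicate words and eliminating short words, can be sketched as follows (the function names are illustrative):

```python
def drop_duplicates(nouns: str) -> str:
    """Keep only the first occurrence of each word, preserving order."""
    return " ".join(dict.fromkeys(nouns.split()))

def drop_short_words(nouns: str) -> str:
    """Discard words with fewer than three characters."""
    return " ".join(w for w in nouns.split() if len(w) >= 3)

ideas = drop_short_words(drop_duplicates("plant plant uv uv lamp"))
print(ideas)  # plant lamp
```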

Moreover, rows with duplicate values in the extracted_product_idea column were

combined by aggregating their retweets_count, likes_count, and replies_count and getting the

average of their polarity_score. In addition, a new column named tweets_count was added to

the dataframe to store the number of occurrences of each extracted_product_idea. After the

values of retweets_count, likes_count, replies_count, polarity_score, and tweets_count were

aggregated, the duplicates were removed, leaving only the first occurrence of each

extracted_product_idea. The code used to implement this step is shown in Appendix J, lines 24 and 26. The resulting dataframe was saved and used in section 2.1.3 to construct the criteria for screening.
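The aggregation described above can be sketched with a pandas groupby; the toy values are invented, and the study's Appendix J code may differ in detail:

```python
import pandas as pd

# Toy rows where the same extracted idea appears in several tweets.
df = pd.DataFrame({
    "extracted_product_idea": ["plant", "plant", "cake"],
    "retweets_count": [2, 4, 1],
    "likes_count": [10, 6, 3],
    "replies_count": [1, 1, 0],
    "polarity_score": [0.6, 0.2, 0.5],
})

# Sum the engagement counts, average the polarity, and count occurrences.
agg = (df.groupby("extracted_product_idea", as_index=False)
         .agg(retweets_count=("retweets_count", "sum"),
              likes_count=("likes_count", "sum"),
              replies_count=("replies_count", "sum"),
              polarity_score=("polarity_score", "mean"),
              tweets_count=("extracted_product_idea", "size")))
print(agg)
```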

The fourth step of data preprocessing was data annotation. It is the process of labeling data available in various formats so that an ML model can quickly and clearly understand the

input patterns (Schreiner, 2006). This study conducted a manual data annotation to ensure that

only valid product ideas are included in the final dataset. Data annotation was carried out with

the aid of two BSIT undergraduate students. The result of all the steps mentioned above is a

dataset comprised of possible product ideas represented by the extracted_product_idea column

with corresponding attributes, including the number of retweets, likes, and replies. The said

dataset was divided into five subsets based on the month when the tweets were posted

(August.csv, September.csv, October.csv, November.csv, and December.csv). The first two


datasets were assigned to the first student, namely August.csv and September.csv. The

subsequent two subsets, October.csv and November.csv, were delegated to the second student.

The remaining subset (December.csv), which has the largest number of rows, was given to the researcher.

The objective of the data annotation is to mark the values of

the extracted_product_idea column as valid or invalid by labeling them as 1 or 0, respectively.

A new column named label was added to the dataframe to store the results of data annotation.

The basis for considering a product idea as valid is the presence of emotion (e.g., hate, want,

love, disgust) expressed towards the noun in a tweet. According to Nascimento & Da Silveira

(2017), people use social media platforms to express their emotions on any particular topic,

including product ideas. Moreover, since not all the values of extracted_product_idea are valid

product ideas, only those on the list of common product ideas for the Philippines' SMEs are

considered. For example, MoneyMax.com (2021) published a list of small business ideas with small capital in 2021. A few of the product ideas in the said list are plant shop, beauty product

reselling business, and cake, dessert, and pastry business. The complete list of the category of the

product ideas used in categorizing the annotated product ideas is shown in Appendix K. The

values in the extracted_product_idea column were modified based on the closest category they belong to, and the results were stored in a new column named annotated_product_idea. The

output of both students was further validated by the researcher. Rows with 0 values in

the label column were dropped from the dataframe.

Finally, the polarity scores of the extracted_product_ideas were computed through

sentiment analysis. Sentiment analysis is the process of determining the sentiments on a piece of

text based on its polarity scores (Pang, Lee, & Vaithyanathan, 2002). The sentiments were
computed by getting the polarity scores of the values of the precleaned_tweets column in a

sentence level using Valence Aware Dictionary and sEntiment Reasoner (VADER). VADER is a Python package designed for complex social media data, as it considers words, punctuation, emojis, slang, and abbreviated words that commonly appear in social media texts (Hutto & Gilbert, 2014). The package measures sentiments by providing four valence scores:

positive, negative, neutral, and compound; the compound score ranges from –1 (extremely negative) to +1 (extremely positive). This study considered the results of the compound valence scores for the

analysis. It is the basis most commonly used for sentiment analysis by researchers (Hutto & Gilbert, 2014). A new column named polarity_score was created and initialized using the values of the computed compound valence scores. The code used to get the polarity scores is shown in Appendix J, line 22. Rows with polarity_score values less than 0.05, i.e., extracted_product_ideas with neutral or negative sentiments, were dropped. The resulting dataframe was exported as a CSV file using the pandas library to be used in the succeeding steps. The code used to export the preprocessed dataset is shown in Appendix J, line 28.

2.1.3. Constructing the criteria for screening

Screening is the third phase of NPD. It is a complex decision process highly influenced by

market changes (Jespersen, 2007). Screening aims to evaluate new product ideas to determine

which idea is worth investing in (Rochford, 1991). A set of criteria relevant to a particular

business must be designed (Agrawal & Bhuiyan, 2014; Baker & Albaum, 1986). In this section, the criteria for screening the values of the extracted_product_idea column were constructed based on the factors proposed by Baker & Albaum (1986). Accordingly, the said literature

used 33 criteria categorized into five factors: societal factor, business risk factor, demand

analysis, market acceptance factor, and competitive factor. The complete list of Baker &
Albaum’s criteria is shown in Appendix L. Out of the categories previously mentioned, only two

were chosen, namely, demand analysis and market acceptance. These factors were selected

because their values could be obtained using the available information in the gathered data. In

contrast, the remaining factors (societal, business risk, and competitive factors) require more

than just UGC. It involves information such as financials, skills, business values, and culture,

which simply are not available.

The first factor for screening was demand analysis. It is used to assess the number of possible customers for a particular product (Hart et al., 2003). The second factor, market acceptance, is defined as the reaction of possible customers to a specific product (Calantone & Cooper, 1979). Demand analysis was measured with three metrics: (1) potential market, (2) trend of demand, and (3) stability of demand. For market acceptance, sentiment analysis was conducted. These factors and their assigned criteria are summarized in Table

2.2. A product idea should satisfy all the criteria to pass the screening and be considered good.

The formula used in getting the scores of each criterion are shown in the succeeding paragraphs.

A new dataframe was created and filled with the computed values of each criterion mentioned

above using the preannotated version of the dataset; more particularly, the values of the

tweets_count, retweets_count, likes_count, replies_count, and polarity_score columns. The

said dataframe consists of the following columns: product_idea, potential_market, trend_of_demand, stability_of_demand, market_acceptance, and label. In particular, the label column holds the

result of the screening. It has two possible values, 1 or 0. A value of 1 means that a product idea passed the screening and is considered good, while 0 means otherwise. The dataframe was saved for

the next step, which is data preparation. The code used to screen the product ideas is shown in

Appendix M.
Table 2.2 Criteria for screening product ideas

Category                   Criterion            Measurement                    Threshold
Demand Analysis            Potential market     engagement rate                >= 33%
                           Trend of demand      slope of the trendline         > 0
                           Stability of demand  increase in the tweet volume   >= 15%
Market Acceptance Factor   Market acceptance    sentiment score                >= 0.05 (positive sentiment)
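As a sketch, the screening rules in Table 2.2 reduce to a small function; the function name and the fraction-vs-percent convention are assumed here, and the study's Appendix M code remains the authoritative version:

```python
def screen(potential_market, trend_of_demand, stability_of_demand, market_acceptance):
    """Return 1 (good) when all four criteria pass, else 0 (not good).

    potential_market and stability_of_demand are fractions (0.33 = 33%).
    """
    passed = (potential_market >= 0.33
              and trend_of_demand > 0
              and stability_of_demand >= 0.15
              and market_acceptance >= 0.05)
    return int(passed)

print(screen(0.50, 1.2, 0.20, 0.60))   # 1
print(screen(0.50, -0.3, 0.20, 0.60))  # 0
```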

The first criterion was the potential market. It is defined as the total market size for a particular product idea. The engagement rate was used to measure the potential market. It is

the ratio of the total interactions to the total number of tweets. Muñoz-Expósito (2017) proposed

a formula for getting the engagements on Twitter called the ratio of interest. The formula is

shown below.

Potential market = (retweets count + replies count + likes count) / total no. of tweets

Accordingly, interactions consist of the total number of retweets, shared via email, replies,

and likes, collectively called diffusion interactions. Since tweets shared via email are not

available on the collected dataset, only retweets, replies, and likes are considered on the

computation. Scrunch, an influencer marketing platform, considered a 33% - 1% engagement

rate to be very high (Mee, 2020). A potential market of at least 33% was used as the threshold
value to consider a product idea as good. A new column named potential_market was introduced to store the computed potential market. The code is shown in Appendix N.

The second criterion was the trend of demand. It is the growth of demand over a period of

time. A study refers to this metric as a secular trend that describes data direction (e.g., upward or

downward) in the long term (Komlos, 1993). According to Trackmyhashtag, a firm that gives

analytics from raw tweets, tweet volume is the sum of the tweets, retweets, and replies

(Trackmyhashtag, 2020). The graphical method is used in measuring a secular trend (Komlos,

1993). The formula is shown below.

Slope = (v(December) − v(August)) / (12 − 8)

Where: tweet volume = v = no. of tweets + no. of retweets + no. of replies

Accordingly, a good trend is when the curve generated is smooth, which means that the

scores above the line should be greater than or equal to the score below it. Sample trendlines are

shown in Figure 2.4. The slope of the trendline from August to December was monitored to get

the trend of demand. A value greater than 0 was used as the threshold value to consider a product

idea as good. A new column named trend_of_demand was created to hold the values of the

trend of demand. The code is shown in Appendix O.


Figure 2.4 Sample trends of demand
The third was the stability of demand. It is the fluctuation in the market size from one year to

another. The increase in the tweet volume was compared to measure the stability of demand.

Usually, the standard is to assess the stability of demand annually, but since the collected data are tweets posted over a six-month period, the change in the tweet volume from November to December 2020 was calculated. The formula is shown below.

Stability of demand = ((previous − current) / previous) × 100

Where: previous = tweet volume in November 2020
current = tweet volume in December 2020

Baremetrics, a company that provides analytics and insights for business, reported that

companies should have 15%-45% stability of demand yearly. A threshold of at least 15% was set as the value of the stability of demand to consider a product idea as good. A new column named

stability_of_demand was created to store the values of the stability of demand. The code is

shown in Appendix P.

Finally, market acceptance is described as the sentiments of the people towards a particular

product idea. The computed compound valence scores from the sentiment analysis step were
utilized to measure the sentiments. According to Hutto & Gilbert (2014), compound valence scores can be negative, positive, or neutral. The interpretation table is shown in Table 2.3. Accordingly, a score value greater than or equal to 0.05 is considered positive. A threshold of at least 0.05 was set as the value of the market acceptance to consider a product idea as good. A new

column named market_acceptance was created to store the values of the market acceptance.

The code is shown in Appendix Q.

Table 2.3 Interpretation table of the polarity scores

Polarity score                                       Interpretation
compound score >= 0.05                               Positive sentiment
compound score > -0.05 and compound score < 0.05     Neutral sentiment
compound score <= -0.05                              Negative sentiment

2.1.4. Data preparation

Data preparation aims to prepare the dataset for modeling (Zhang, Zhang, Yang, 2003). In

this step, the dataset to be used for building the model was finalized. In particular, the results

obtained from the screening were utilized. Specifically, two columns were considered,

namely product_idea and label. The first column contains the validated product ideas, while the

second column specifies the label for that product idea. A good product idea has a label of 1; otherwise, a value of 0 was specified. Since ML algorithms do not understand text values, word embedding must be applied (Ge & Moh, 2017). Word embedding is the process of converting
text into numerical values. This study utilized the Word2Vec encoding scheme to convert the

values of the product_idea column to their numerical equivalent. Research has shown that using

the Word2Vec with SVC helps achieve a high-performing model for text classification

(Lilleberg, Zhu, & Zhang, 2015). The implementation of word embedding is shown in Appendix

R, line 20. Once the dataset was prepared, it was divided into two subsets. The first subset was

used to train the model. It consists of 80% of the entire dataset. The second subset was reserved

for testing. The remaining 20% was used to test the performance of the trained model. The

dataset was split using the train_test_split function from the sklearn package. To

refine further the dataset, a down sampling technique was carried out using the NearMiss

package. The code used to create the training and testing subsets is shown in Appendix R, line

26.

2.1.5. Parameter tuning

Parameter tuning was performed to get the ideal values of the parameters of the SVC algorithm. A model's performance is limited when only its default parameters are used (Smit & Eiben, 2009). The parameters must be fine-tuned to their optimal values to create a high-performing model. Parameter tuning was done using the RandomizedSearchCV class from sklearn;

and the results of the data preparation were utilized. It is a technique that randomly searches for the optimal parameters of a given model. The tuned parameters of the model and their possible values are listed in Table 2.4. This study considered four parameters of the

SVC model: 1) kernel, 2) gamma, 3) degree and 4) C. The first parameter specifies the kernel

type to be used in the algorithm. Second, gamma is the value of the kernel coefficient. The third

parameter is the degree of the polynomial kernel function. Lastly, the C parameter is the

regularization parameter. The code used in parameter tuning is shown in Appendix S.


Table 2.4 Parameters of the model

#   Parameter   Values
1   kernel      linear, rbf, poly
2   gamma       0.1, 1, 10, 100, 1000
3   degree      0, 1, 2, 3, 4, 5, 6
4   C           0.1, 1, 10, 100, 1000
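A runnable sketch of the search follows; synthetic data replaces the embedded product ideas, and the grid from Table 2.4 is trimmed so the example finishes quickly:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic data stands in for the embedded product ideas.
X, y = make_classification(n_samples=120, n_features=8, random_state=0)

# Search space following Table 2.4 (values trimmed for speed).
param_distributions = {
    "kernel": ["linear", "rbf", "poly"],
    "gamma": [0.1, 1, 10],
    "degree": [2, 3],
    "C": [0.1, 1, 10],
}

search = RandomizedSearchCV(SVC(), param_distributions, n_iter=5, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```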

2.1.6. Performance evaluation

Performance evaluation is one of the critical activities in ML (Korde & Mahender, 2012). Accordingly, it is not enough that a model can perform prediction or classification; its performance must also be measured to reliably assess how well it performs. Four

performance metrics, including (1) accuracy, (2) precision, (3) recall, and (4) f1-score were

calculated using the sklearn libraries to evaluate the performance of the model. Specifically, a

classification report was generated to get the precision, recall, and f1-score values, and cross-

validation was conducted to determine the accuracy. In addition, a confusion matrix was

generated to examine the results of the classification. The trained and tested model was saved to

be used in implementing the proposed screening application. The metrics used in performance

evaluation are further explained in the succeeding paragraphs.

The first performance metric was the accuracy. It is the ratio of correctly predicted values to

the total number of observations. The cross-validation was carried out using the

cross_val_score method from the sklearn module. It is a technique that evaluates the

model's performance by dividing a dataset into n subsets to assess the consistency of the model's accuracy; more particularly, 10-fold cross-validation was performed. The average
score was used as the final value of the accuracy. The implementation is shown in Appendix T,

line 24. The formula for getting the accuracy is shown below.

Accuracy = (True Positive + True Negative) / Total no. of observations

The second performance metric was the precision. It is defined as the ratio of the correctly

predicted values of a particular class to the total correctly and incorrectly predicted values of that

specific class. The classification_report function was used to get the precision score of the model.

In particular, a classification report was generated. The weighted average precision score was

considered as the precision of the model. The implementation is shown in Appendix T, line 21.

The formula for getting the precision is shown below.

Precision = [True Negative / (True Negative + False Negative) + True Positive / (True Positive + False Positive)] / No. of classes

The third performance metric was the recall. It is the ratio of correctly predicted values of a

particular class to correctly predicted values and the total incorrectly predicted values of all other

classes. To get the value of the recall, a classification report was generated. The weighted

average recall score was considered as the recall of the model. The implementation is shown in

Appendix T, line 21. The formula for getting the recall is shown below.

Recall = [True Negative / (True Negative + False Positive) + True Positive / (True Positive + False Negative)] / No. of classes
Finally, the f1-score is the harmonic mean of precision and recall. A classification report was generated to get the value of the f1-score. The weighted average f1-score was

considered as the f1-score of the model. The implementation is shown in Appendix T, line 21.

The formula for getting the f1-score is shown below.


F1-score = 2 × (Precision × Recall) / (Precision + Recall)
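The four metrics can be reproduced with sklearn on a toy set of labels; the values below are invented, and the study's Appendix T code operates on the real test subset:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Toy true and predicted labels standing in for the test-set results.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))        # 0.75
print(confusion_matrix(y_true, y_pred))      # rows: actual, columns: predicted
print(classification_report(y_true, y_pred)) # precision, recall, f1 per class
```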
2.2. Implementation

Application development is the process of creating a program or set of programs to solve an

existing challenge or improve a current process (Burns & Dennis, 1985). In particular, web

development aims to create an application that can be accessed across the web using any device

with a browser. Various ML and web technologies were utilized to implement the proposed

screening application. Furthermore, the persisted SVC model was integrated and used in the

development. The technology stack considered in the implementation is shown in Table 2.5.

Table 2.5 Technology stacks

Development category   Technologies
Front-end              HTML 5, CSS 3, JS (ES2015), Bootstrap v5.1, Font Awesome v5.15.4
Back-end               Python libraries (see Appendix A), Python 3.8, Flask v2.0.2

Moreover, the implementation comprises front-end and back-end development. The

former shows the front-end technologies used and the tools for designing the User Interface (UI)

of the application. The latter provides a list of the different functions of the application and the

various Python libraries and frameworks used to construct them. The packages used in the

implementation are stored in the same venv used in the former steps and are listed in Appendix

A. These steps are further discussed in the succeeding sections. The system architecture of the

application is presented in Figure 2.5.


Figure 2.5 System Architecture of the screening application
2.2.1. Front-end development

Over the years, the devices capable of accessing the Internet have grown in number and variety. Gardner

(2011) reports that a new standard for web design has been proposed. According to the report, it

aims to standardize responsive web design to reduce the developer's workload by developing a

single application that can adapt across devices and improve user experience. The web

fundamental technologies, including HTML, CSS, JS, were used in designing the front-end. In

addition, Bootstrap and icons from Font Awesome were utilized. In particular, Bootstrap is a

CSS framework designed to make a web application responsive in any computing device with a

browser, including desktop computers, mobiles, and tablets. The design of the UI was simplified

to allow more time for the back-end development.

More specifically, three pages were designed: search page, result page, and dashboard. The

first page was created to allow users to enter a product idea. The result page shows the results of

the screening. Finally, the dashboard provides a list of the viable product ideas. The pages were

rendered on simulated devices such as desktop computers, mobiles, and tablets using

Google Chrome developer tools to assess the UI quality. Furthermore, the application was
opened in three major browsers to check its compatibility, including Google Chrome, Mozilla

Firefox, and Microsoft Edge.

2.2.2. Back-end development

Back-end development refers to server-side development. It focuses on maintaining the

technology stacks behind the application (Kaluža, 2019). In recent years, a Python-based

microframework, Flask, has been the preferred web development platform for ML applications,

mainly because of its simplicity and ease of use (Aslam, Mohammed, & Lokhande, 2015; Mufid

et al., 2019). Flask is an open-source web framework for building data-driven and dynamic web

applications quickly. Flask and various Python libraries were utilized to develop the functions of

the screening application. The list of each function and its corresponding libraries is presented in

Appendix A. In particular, seven functions were constructed to implement the proposed

application: (1) reading the input, (2) preprocessing the input, (3) validating the input and

showing the error, (4) translating the input to English, (5) converting input into vector

representations, (6) classifying the input, and (7) showing the viable product ideas. These

functions are listed in Table 2.6.

Table 2.6 Functions of the screening application

# Function Description
1 To read the input from the user The function is used to accept input from
the user on the screen page.
2 To preprocess the input The function is used to remove noise
● To convert input to lower case words/characters from the input.
● To remove special characters
● To remove emails
● To remove links
● To remove HTML tags
● To remove accented characters
● To remove repeated letters
● To remove numerical values
3 To validate the input and show error The function is used to validate that the
message input is a string of at least three
characters, in English, and a noun. Error
message/s appear if the input does not
satisfy any of the above conditions.
4 To translate non-English input to English The function is used to ensure that only
English and Tagalog are accepted.
5 To convert input to its vector The function is used to perform word
representations embedding to an input using the Word2Vec
encoding scheme.
6 To classify the input The function is used to classify whether an
input is good and display the results on the
result page. The output would come from
the trained text classification model.
7 To show the product ideas The function is used to display viable
product ideas in the dashboard.

The flow of the application is shown in Figure 2.6. Its quality was tested with different

inputs, including numerical, non-English, Tagalog strings, and other possible invalid inputs.
Figure 2.6 A Flowchart of the screening application
Chapter 3: Results and discussions

This chapter presents and discusses the results of the study. It is divided into two main

sections: section 3.1) building the model and section 3.2) implementation. The first section

presents the dataset used in the study. Moreover, the performance of the model is revealed. In the

second section, the result of the implementation is shown. Specifically, the UI and the different

functions are highlighted. These sections are further explained in the succeeding paragraphs.

3.1. Building the model

A model is an ML algorithm that has been trained using a particular dataset and uses the

acquired knowledge to identify some patterns to make a classification (Feldman, 2013). This

study built an SVC model that classifies a product idea as good or not good. This section is

divided into five subsections: section 3.1.1) data collection, section 3.1.2) data preprocessing,

section 3.1.3) constructing the criteria for screening, section 3.1.4) data preparation, and section

3.1.5) performance evaluation. First, the gathered dataset is shown, and significant statistics are

highlighted. In the second subsection, the preprocessed dataset is presented. The result of each

subprocess is explained. Next, the results of the screening using the extracted product ideas from

the tweets are discussed. Fourth, the prepared datasets used to train and test the model are

introduced. Finally, the performance of the model is revealed and examined.

3.1.1. Data collection

The datasets were gathered using an advanced scraping tool called Twint. It is an open-

source advanced tweet scraping tool often used in research to exploit publicly posted

information, including tweets (Dutch Osint Guy, 2018). The said tool collected tweets posted

within the Philippines from August to December 2020. It uses the coordinates and radius values

when considering which tweets to collect. This study used the coordinates of Marinduque
(12.072862,122.664139), the geographical center of the Philippines, and a 768.54 km radius,

respectively. The value of the radius was found to be ideal, covering almost all the islands of

the country. Using a lower radius value resulted in collecting fewer tweets. On the other hand,

using a higher value tends to include tweets posted outside the Philippines, including tweets sent

from Malaysia in the southwest, Indonesia in the south, Vietnam in the west, and Taiwan and

mainland China in the north (See Table 3.1).

Table 3.1 Sample collected tweets when using a high radius value

Tweet Country
@oogiri_zamurai なんでトイレのドア Japan
が開かないの? もうおれ我慢の限界な
んだよね。

Tou amoré 😍 Portugal

나는 네가 웃는 것을 보는 것을 좋아한다. Korea
너는 또한 나를 행복하게 만들어 줘 오빠.
너는 나의 행복의 이유야. #예성버블
https://t.co/0baNrlGCcp

@taIktaIk2O2O เดี๋ยวเงินหมด ก็มาหาหลอกกะเทย Thailand


ใหมค
่ ะ่
20K फॉलोवर्स की बधाई और शुभकामनाएं आप निरंतर आगे बढ़ते India
रहिये और ऐसे ही निडर होकर लिखते रहिये स्वस्थ रहिये खुश
रहिये🙏🏻 https://t.co/PKd9MFiruI
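The radius-based filtering described above can be illustrated with a great-circle distance check. The sketch below is an assumption about how such a filter behaves, not Twint's internal code; it tests whether a pair of coordinates falls within the 768.54 km radius of the Marinduque center point.

```python
from math import radians, sin, cos, asin, sqrt

# The center point (Marinduque) and radius used in the study.
CENTER_LAT, CENTER_LON, RADIUS_KM = 12.072862, 122.664139, 768.54

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))   # Earth's mean radius of ~6,371 km

def within_radius(lat, lon):
    """Check whether a tweet's coordinates fall inside the search radius."""
    return haversine_km(CENTER_LAT, CENTER_LON, lat, lon) <= RADIUS_KM

print(within_radius(16.4023, 120.5960))   # Baguio City: inside
print(within_radius(35.6762, 139.6503))   # Tokyo: outside
```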

A total of 5,193,417 tweets were scraped during the data collection process. A few of the

collected tweets are shown in Appendix U. Each tweet has 35 attributes that give more

information about a particular tweet, including the date when a tweet is posted, the author, the

place where the tweet is posted, the language, and the number of retweets. The complete list of

all the attributes is shown in Appendix V. In particular, most of the gathered tweets were posted
in December 2020 with 1,194,486 (23%). Conversely, only 882,880 (19%) were collected in

August 2020. The complete distribution of the collected tweets based on the month when they

were posted is shown in Figure 3.1.

Figure 3.1 Tweet Month Distribution

Although the tweets were only limited to areas within the Philippines, 54 languages were

observed from the collected data. Specifically, more than half of the tweets are Tagalog (tl) with

53%. It is followed by English (en) with 30%. 9% consists of the other languages, including

Ukrainian (uk) and Spanish (es). The remaining 9% has no defined language. The percentage of

the languages of the tweets is summarized in Figure 3.2.


Figure 3.2 Tweet Language Distribution

3.1.2. Data preprocessing

Data preprocessing is an essential part of building a model. It aims to eliminate the noise in a

given dataset. The presence of noise is detrimental to the performance of a model (Yang,

2018). The collected tweets went through a series of preprocessing steps, including (1) data

wrangling, (2) data reduction, (3) extracting the product ideas, (4) data annotation, and (5)

sentiment analysis. As a result, the dataset was reduced from over 5M rows and 35 columns to

2,926 rows with 6 columns. In particular, the columns comprise the annotated_product_idea,

tweets_count, replies_count, retweets_count, likes_count, and polarity_score. The original

tweets were preprocessed using various NLP Python libraries. A few of the preprocessed tweets

are shown in Table 3.2. The results of data preprocessing are discussed in the succeeding

paragraphs.
Table 3.2 Sample preprocessed tweets

extracted_ tweets_ retweets_ replies_ likes_ polarity_


product_idea count count count count score
chocolate 64 12 19 52 0.2462
cake roll

cheese pizza 5 4 6 6 0.0454

spinach 43 12 22 137 0.2240


lasagna 90 117 133 217 0.2576

gym 4 1 1 16 0.3884
equipment
shop
haircut 4 1 12 1 0.0772

The first step in data preprocessing was data wrangling. It restructures raw data into

the desired format for data preprocessing. The collected tweets were transformed from a TSV

file format into a pandas dataframe. The transformation would allow useful features of

the pandas package to be used in data preprocessing. Consequently, the process was able to

create a dataframe with 5,193,417 rows and 36 columns. A few rows of the created

dataframe are shown in Appendix U. The complete list of the columns is shown in Appendix V.
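A minimal sketch of this transformation is shown below, using a tiny inline TSV in place of the scraped file. The column names here are illustrative, not the full 36-attribute schema.

```python
import io
import pandas as pd

# A tiny stand-in for the scraped Twint output; the real file has 36 columns.
tsv_data = (
    "created_at\ttweet\tlanguage\n"
    "2020-10-18 09:20:19\twant lasagna\ten\n"
    "2020-12-18 12:00:37\tTou amore\tpt\n"
)

# Restructure the raw TSV into a pandas dataframe.
df = pd.read_csv(io.StringIO(tsv_data), sep="\t")
print(df.shape)   # (2, 3)
```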

Second, data reduction was conducted to get rid of irrelevant data in the dataframe. It is a

technique to reduce the number of features in a dataset. Data reduction was made by reducing the

number of rows and columns of the created dataframe. In particular, the row selection only

considered English tweets. In column selection, only the columns that would be useful in getting

the engagements and sentiments of a product idea were included (Fischer & Reuber, 2011;
Sultana, Paul; Gavrilova, 2016; Anto et al., 2016; Ray & Chakrabarti, 2017; Geetha, Rekha, &

Rarthika, 2018; Costa et al., 2013). These include created_at, tweet, retweets_count,

replies_count, and likes_count columns. This step reduced the dimension of the dataframe into

1,557,442 rows and 5 columns. A few of the rows in the modified dataframe are shown in Table

3.3.

Table 3.3 Sample rows from the modified dataframe

created_at tweet retweets_count replies_count likes_count

TRIPLE
2020-10-18 CHOCOLATED
0 0 1
09:20:19 PST CHOCOLATE CAKE
ROLLS 🤤😭 #sweets

@jacob craving for


2020-12-18
cheeeeeeeeese pizza 0 4 0
12:00:37 PST
🧀🍕 in SM Baguio

2020-10-07  want stir-fried spinach


0 0 1
17:06:05 PST 😫❤

 want lasagna and


2020-10-27
lasagna only. 😭 1 0 19
21:57:19 PST
https://t.co/Ggcb0cJe0B

suggest me sum
affordable gym
2020-11-27
equipment shops, i 0 0 2
18:53:17 PST
really need to lift some
weights🥺
2020-12-31
parang i want a haircut
23:59:58 PST 0 15 1
na.
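The row and column selection described above can be sketched as follows, using a small synthetic dataframe in place of the multi-million-row one. The user_id column is a hypothetical stand-in for the attributes that were dropped.

```python
import pandas as pd

# Illustrative rows; the real dataframe had 5,193,417 rows and 36 columns.
df = pd.DataFrame({
    "created_at": ["2020-10-18", "2020-12-18", "2020-11-02"],
    "tweet": ["want lasagna", "Tou amore", "craving for pizza"],
    "language": ["en", "pt", "en"],
    "retweets_count": [1, 0, 3],
    "replies_count": [0, 4, 1],
    "likes_count": [19, 0, 7],
    "user_id": [111, 222, 333],   # hypothetical stand-in for dropped columns
})

# Row selection: keep only English tweets.
df = df[df["language"] == "en"]

# Column selection: keep only the engagement-related columns.
keep = ["created_at", "tweet", "retweets_count", "replies_count", "likes_count"]
df = df[keep].reset_index(drop=True)
print(df.shape)   # (2, 5)
```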
The third step to preprocess the dataset was to extract the product ideas from the tweets.

UGC produced in social media, including tweets, could be used to find and develop product

ideas (Nascimento & Da Silveira, 2017). The process of extracting the product ideas is further

divided into seven sub-steps, including (1) data precleaning, (2) pulling out the nouns, (3) data
cleaning, (4) removing stopwords, (5) lemmatization, (6) discarding duplicate words, and (7)

eliminating short words. The results of each sub-step are further discussed in the succeeding

paragraphs. As a result, a total of 118,387 product ideas were extracted from the tweets. Sample

extracted product ideas are shown in Table 3.4.

Table 3.4 Sample product ideas extracted from the tweets

tweet extracted_product_idea

TRIPLE CHOCOLATED CHOCOLATE chocolate cake


CAKE ROLLS 🤤😭 #sweets

@jacob craving for cheeeeeeeeese pizza 🧀🍕 cheese pizza


in SM Baguio
want stir-fried spinach 😫❤ spinach

want lasagna and lasagna only. 😭 lasagna


https://t.co/Ggcb0cJe0B
suggest me sum affordable gym equipment gym equipment
shops, i really need to lift some weights🥺

parang i want a haircut na. barber shop

The first sub-step to extract the product ideas was data precleaning. The tweets were initially

cleaned using regex and NumPy modules to remove the noise words such as mentions,

hashtags, retweets, and links. A new column named precleaned_tweet was added to the

modified dataframe to hold the results of the data precleaning. Rows with null values in the

newly created column precleaned_tweet were dropped. Sample results are shown in Table 3.5.

In particular, in the first row, the hashtag #sweet was removed. On the second and third rows, the

mention @jacob and the link, https://t.co/Ggcb0cJe0B, were removed, respectively. This step

managed to bring the number of rows of the dataframe down to 1,501,123.


Table 3.5 Sample precleaned tweets

tweet precleaned_tweet
TRIPLE CHOCOLATED CHOCOLATE TRIPLE CHOCOLATED CHOCOLATE
CAKE ROLLS 🤤😭 #sweets CAKE ROLLS 🤤😭

@jacob craving for cheeeeeeeeese pizza 🧀🍕 craving for cheeeeeeeeese pizza 🧀🍕


in SM Baguio
want stir-fried spinach 😫❤ want stir-fried spinach 😫❤
want lasagna and lasagna only. 😭 want lasagna and lasagna only. 😭
https://t.co/Ggcb0cJe0B
suggest me sum affordable gym equipment suggest me sum affordable gym equipment
shops, i really need to lift some weights🥺 shops, i really need to lift some weights🥺
parang i want a haircut na. parang i want a haircut na.
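The precleaning step can be sketched with the regex patterns below. These patterns are illustrative assumptions, not the exact expressions used in the study; rows left empty after cleaning become null and are dropped.

```python
import re
import numpy as np
import pandas as pd

def preclean(tweet):
    """Strip retweet markers, mentions, hashtags, and links from a tweet."""
    tweet = re.sub(r"\bRT\b", " ", tweet)          # retweet markers
    tweet = re.sub(r"@\w+", " ", tweet)            # mentions
    tweet = re.sub(r"#\w+", " ", tweet)            # hashtags
    tweet = re.sub(r"https?://\S+", " ", tweet)    # links
    tweet = re.sub(r"\s+", " ", tweet).strip()
    return tweet or np.nan                         # empty results become null

df = pd.DataFrame({"tweet": [
    "TRIPLE CHOCOLATED CHOCOLATE CAKE ROLLS #sweets",
    "@jacob craving for cheeeeeeeeese pizza in SM Baguio",
    "#ad https://t.co/Ggcb0cJe0B",                 # nothing left after cleaning
]})
df["precleaned_tweet"] = df["tweet"].apply(preclean)
df = df.dropna(subset=["precleaned_tweet"])        # drop rows with null values
print(len(df))   # 2
```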

The second sub-step to extract the product ideas was to pull out the nouns from the

precleaned_tweet column. Its result was used to determine which tweets contain possible

product ideas for the Philippines' SMEs. The nouns were selected because, generally, product ideas

are expressed through nouns (Malmasi & Dras, 2015). Three POS tagger libraries, TextBlob,

NLTK, and Spacy, were used, and their outputs were evaluated to select the best library for

pulling out the nouns. A comparison of the extracted nouns using the three libraries is shown in

Table 3.6. Among the three libraries, Spacy got the slowest time when pulling out nouns with 1

hour and 48 minutes compared to TextBlob, which did the same task in 52 minutes and 39

seconds, and NLTK with 44 minutes and 15 seconds. The results of TextBlob and NLTK are

almost identical. Specifically, both often treated emojis, adverbs, and even upper-case words as

nouns, which leads to inaccurate noun extraction. Overall, Spacy got the most accurate result and

was chosen as the library. Sample results are shown in Table 3.7. A new column
named pulled_out_noun was introduced to the dataframe to store the pulled-out nouns. Rows

with null values in the pulled_out_noun column were removed from the dataframe. This step

further reduced the dataframe into 1,481,864 rows.

Table 3.6 Sample nouns extracted from the precleaned tweets using the three POS tagger

precleaned_tweet textblob_noun nltk_noun spacy_noun


TRIPLE CHOCOLATE CHOCOLATE CHOCOLATE CHOCOLATED
CAKE ROLLS 🤤😭 CAKE CAKE ROLLS CHOCOLATE
CAKE ROLLS
Craving for cheeeeeeeeese pizza 🧀🍕 pizza 🧀🍕 cheeeeeeeeese pizza
pizza 🧀🍕 🧀 sm

want stir-fried spinach 😫 spinach 😫❤ spinach 😫❤ spinach ❤



want lasagna and lasagna lasagna lasagna. lasagna lasagna 😭 lasagna lasagna
only. 😭 😭
suggest me sum gym equipment gym equipment gym equipments
affordable gym equipment weights🥺 shops weights🥺 shop
shops, i really need to lift
some weights🥺
parang i want a haircut na. parang i haircut parang i haircut na.. haircut na
na.

Table 3.7 Sample pulled out nouns from the precleaned tweet

tweet precleaned_tweet pulled_out_noun


TRIPLE CHOCOLATED TRIPLE CHOCOLATED CHOCOLATED
CHOCOLATE CAKE ROLLS CHOCOLATE CAKE CHOCOLATE CAKE
🤤😭 #sweets ROLLS 🤤😭 ROLLS

@jacob craving for craving for cheeeeeeeeese cheeeeeeeeese pizza 🧀 sm


cheeeeeeeeese pizza 🧀🍕 in pizza 🧀🍕
SM Baguio
want stir-fried spinach 😫❤ want stir-fried spinach 😫 spinach ❤

want lasagna and lasagna only. want lasagna and lasagna lasagna lasagna
😭 https://t.co/Ggcb0cJe0B only. 😭
suggest me sum affordable suggest me sum affordable gym equipments shop
gym equipment shops, i really gym equipment shops, i
need to lift some weights🥺 really need to lift some
weights🥺
parang i want a haircut na. parang i want a haircut na. haircut na
Once the nouns were pulled out of the precleaned tweets, the pulled_out_noun values were

further cleansed using the preprocess_kgptalkie package. The said package provides,

among others, functions to remove any remaining unnecessary words and characters that were

mistakenly pulled out as nouns by the Spacy library. Each noun went through a series of

preprocessing tasks, including (1) converting words to lower case, (2) expanding contractions,

(3) removing emails, (4) removing html tags, (5) removing accented characters, (6) removing

special characters, (7) discarding unnecessary repeated letters. A new column

named cleaned_noun was created to hold the results of data cleaning. Sample cleaned nouns are

shown in Table 3.8. In the table, the extracted noun was converted to lower case in the first row.

In addition, emojis such as the "cheese" and "heart" in rows 1 and 2 were removed. Furthermore,

in the second row, the repeated letters from the word "cheese" were discarded. Finally, rows with

null values were dropped from the dataframe. This step managed to bring the number of rows of

the dataframe down to 1,401,264 rows.


Table 3.8 Sample cleaned nouns

tweet precleaned_tweet pulled_out_noun cleaned_noun


TRIPLE TRIPLE CHOCOLATED chocolate cake rolls
CHOCOLATED CHOCOLATED CHOCOLATE
CHOCOLATE CAKE CHOCOLATE CAKE CAKE ROLLS
ROLLS 🤤😭 #sweets ROLLS 🤤😭

@jacob craving for craving for cheeeeeeeeese cheese pizza sm


cheeeeeeeeese pizza cheeeeeeeeese pizza pizza 🧀 sm
🧀🍕 in SM Baguio 🧀🍕
want stir-fried spinach want stir-fried spinach spinach ❤ spinach
😫❤ 😫❤
want lasagna and want lasagna and lasagna lasagna lasagna lasagna
lasagna only. 😭 lasagna only. 😭
https://t.co/Ggcb0cJe0B
suggest me sum suggest me sum gym equipments gym equipments
affordable gym affordable gym shop shop
equipment shops, i equipment shops, i
really need to lift some really need to lift some
weights🥺 weights🥺
parang i want a haircut parang i want a haircut haircut na haircut na
na. na.

Next, the Tagalog (e.g., akin, ako, at, dapat) and English (e.g., could, or, and, rather)

stopwords were removed from the values of the cleaned_noun column still using the Spacy

library. Stopwords are words that do not add much meaning to a sentence. A new column

named stopwords_free_noun was introduced to hold the results of the operation. The sample

output is shown in Table 3.9. The Tagalog stopword “na” in the sixth row was removed from the

word “haircut na.” Rows with null values in the stopwords_free_noun column were eliminated

from the dataframe. This step managed to bring the number of rows of the dataframe down to

1,081,021.
Table 3.9 Sample cleaned nouns without the stopwords

tweet precleaned_ pulled_out_ cleaned_ stopwords_


tweet noun noun free_noun
TRIPLE TRIPLE CHOCOLATED chocolate chocolated
CHOCOLATED CHOCOLATED CHOCOLATE cake rolls chocolate
CHOCOLATE CHOCOLATE CAKE ROLLS cake rolls
CAKE ROLLS CAKE ROLLS
🤤😭 #sweets 🤤😭

@jacob craving for craving for cheeeeeeeeese cheese pizza cheese pizza
cheeeeeeeeese cheeeeeeeeese pizza pizza 🧀 sm sm sm
pizza 🧀🍕 in SM 🧀🍕
Baguio
want stir-fried want stir-fried spinach ❤ spinach spinach
spinach 😫❤ spinach 😫❤
want lasagna and want lasagna and lasagna lasagna lasagna lasagna
lasagna only. 😭 lasagna only. 😭 lasagna lasagna
https://t.co/Ggcb0cJ
e0B
suggest me sum suggest me sum gym equipments gym gym
affordable gym affordable gym shop equipments equipment
equipment shops, i equipment shops, i shop shops
really need to lift really need to lift
some weights🥺 some weights🥺
parang i want a parang i want a haircut na haircut na haircut
haircut na. haircut na.
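The stopword removal can be sketched as below. The study used Spacy's stopword lists; the small English and Tagalog sets here are illustrative assumptions standing in for the full lists.

```python
import numpy as np
import pandas as pd

# Small illustrative stopword sets; the study used Spacy's full English
# list plus Tagalog stopwords.
ENGLISH_STOPWORDS = {"could", "or", "and", "rather", "the"}
TAGALOG_STOPWORDS = {"akin", "ako", "at", "dapat", "na"}
STOPWORDS = ENGLISH_STOPWORDS | TAGALOG_STOPWORDS

def remove_stopwords(text):
    """Drop stopwords; return null when nothing remains."""
    kept = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(kept) or np.nan

df = pd.DataFrame({"cleaned_noun": ["haircut na", "spinach", "na at"]})
df["stopwords_free_noun"] = df["cleaned_noun"].apply(remove_stopwords)
df = df.dropna(subset=["stopwords_free_noun"])
print(list(df["stopwords_free_noun"]))   # ['haircut', 'spinach']
```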

The fifth sub-step to extract the product ideas was lemmatization. It is used to convert a word

into its base form (e.g., from equipments to equipment, from walked to walk, from mice to

mouse, from dove to dive). The values of the stopwords_free_noun column were lemmatized

using the make_base method available in the preprocess_kgptalkie package. The

lemmatization was a necessary step to remove the duplicates. A new column

named lemmatized_noun was added to the dataframe to store the lemmatized nouns. The

sample output is shown in Table 3.10. In particular, the word “chocolated” from the first row

was converted to “chocolate.” Furthermore, “rolls” was simplified to its singular form, “roll.”
Similarly, in the 5th row, the “gym equipments shop” was transformed to “gym equipment

shop.”

Table 3.10 Sample lemmatized nouns

tweet precleaned_ pulled_out_ cleaned_ stopwords_ lemmatized_


tweet noun noun free_noun noun
TRIPLE TRIPLE CHOCOLAT chocolate chocolated chocolate
CHOCOLATE CHOCOLAT ED cake rolls chocolate chocolate
D ED CHOCOLAT cake rolls cake roll
CHOCOLATE CHOCOLAT E CAKE
CAKE ROLLS E CAKE ROLLS
🤤😭 #sweets ROLLS 🤤😭

@jacob craving craving for cheeeeeeeeese cheese cheese pizza cheese pizza
for cheeeeeeeees pizza 🧀 sm pizza sm sm sm
cheeeeeeeeese e pizza 🧀🍕
pizza 🧀🍕 in
SM Baguio
want stir-fried want stir- spinach ❤ spinach spinach spinach
spinach 😫❤ fried spinach
😫❤
want lasagna want lasagna lasagna lasagna lasagna lasagna
and lasagna and lasagna lasagna lasagna lasagna lasagna
only. 😭 only. 😭
https://t.co/Ggc
b0cJe0B
suggest me suggest me gym gym gym gym
sum affordable sum equipments equipments equipments equipment
gym equipment affordable shop shop shops shop
shops, i really gym
need to lift equipment
some weights🥺 shops, i really
need to lift
some
weights🥺
parang i want a parang i want haircut na haircut na haircut haircut
haircut na. a haircut na.

Since the lemmatization converted words into their base forms, duplicates were introduced in

the lemmatized_noun column. In sub-step 6, these duplicates were taken care of. A custom

function was created to remove the repeated words. A new column named unique_noun was
created to hold the results of the operation. The sample output is shown in Table 3.11. In

particular, the second “chocolate” was removed in the first row. Rows with null values in the

unique_noun column were dropped from the dataframe. This step was able to shrink the rows of

the dataframe to 1,079,001.

Table 3.11 Sample unique nouns

tweet precleaned_ pulled_out_ cleaned_ stopwords_ lemmatized_ unique_


tweet noun noun free_noun noun noun
TRIPLE TRIPLE CHOCOLA chocolate chocolated chocolate chocolate
CHOCOLA CHOCOL TED cake rolls chocolate chocolate cake roll
TED ATED CHOCOLA cake rolls cake roll
CHOCOLA CHOCOL TE CAKE
TE CAKE ATE ROLLS
ROLLS CAKE
🤤😭 ROLLS
#sweets 🤤😭

@jacob craving for cheeeeeeee cheese cheese pizza cheese cheese


craving for cheeeeeeee ese pizza 🧀 pizza sm sm pizza sm pizza sm
cheeeeeeeees ese pizza sm
e pizza 🧀🍕 🧀🍕
in SM
Baguio
want stir- want stir- spinach ❤ spinach spinach spinach spinach
fried spinach fried
😫❤ spinach 😫

want lasagna want lasagna lasagna lasagna lasagna lasagna
and lasagna lasagna and lasagna lasagna lasagna lasagna
only. 😭 lasagna
https://t.co/G only. 😭
gcb0cJe0B
suggest me suggest me gym gym gym gym gym
sum sum equipments equipme equipment equipment equipment
affordable affordable shop nts shop shops shop shop
gym gym
equipment equipment
shops, i shops, i
really need really need
to lift some to lift some
weights🥺 weights🥺
parang i parang i haircut na haircut haircut haircut haircut
want a want a na
haircut na. haircut na.
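A custom function of the kind described above might look like the following sketch. The implementation details are assumptions; any order-preserving deduplication of words would serve.

```python
def remove_repeated_words(text):
    """Keep only the first occurrence of each word, preserving order."""
    seen = set()
    unique = []
    for word in text.split():
        if word not in seen:
            seen.add(word)
            unique.append(word)
    return " ".join(unique)

print(remove_repeated_words("chocolate chocolate cake roll"))   # chocolate cake roll
print(remove_repeated_words("lasagna lasagna"))                 # lasagna
```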

Finally, short words were eliminated from the values of the unique_noun column. These

words consist of fewer than three characters. This was done to exclude words that may not have

value in the unique_noun column. A new column named extracted_product_idea was created

to store the results. The sample output is shown in Table 3.12. The word “sm” in the second row

has two characters and therefore was removed. Rows with null values in

the extracted_product_idea column were removed from the dataframe. This reduced the rows

of the dataframe to 522,943.

Table 3.12 Sample nouns without short words

tweet precleaned_ pulled_out_ cleaned_ stopwords_ lemmatized_ unique_ extracted_


tweet noun noun free_noun noun noun product_idea
TRIPLE TRIPLE CHOCOL chocola chocolate chocolat chocolat chocolat
CHOCOL CHOCO ATED te cake d e e cake e cake
ATED LATED CHOCOL rolls chocolate chocolat roll roll
CHOCOL CHOCO ATE cake rolls e cake
ATE LATE CAKE roll
CAKE CAKE ROLLS
ROLLS ROLLS
🤤😭 🤤😭
#sweets

@jacob craving cheeeeeee cheese cheese cheese cheese cheese


craving for for eese pizza pizza pizza sm pizza sm pizza sm pizza
cheeeeeeee cheeeeeee 🧀 sm sm
ese pizza eese pizza
🧀🍕 in 🧀🍕
SM Baguio
want stir- want stir- spinach spinach spinach spinach spinach spinach
fried fried ❤
spinach 😫 spinach
❤ 😫❤
want want lasagna lasagna lasagna lasagna lasagna lasagna
lasagna lasagna lasagna lasagna lasagna lasagna
and and
lasagna lasagna
only. 😭 only. 😭
https://t.co/
Ggcb0cJe0
B
suggest me suggest gym gym gym gym gym gym
sum me sum equipment equipm equipment equipme equipme equipme
affordable affordabl s shop ents shops nt shop nt shop nt shop
gym e gym shop
equipment equipmen
shops, i t shops, i
really need really
to lift some need to
weights🥺 lift some
weights🥺
parang i parang i haircut na haircut haircut haircut haircut haircut
want a want a na
haircut na. haircut
na.
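The short-word elimination can be sketched as a simple length filter; the helper name and threshold handling below are illustrative.

```python
import numpy as np
import pandas as pd

def drop_short_words(text):
    """Remove words with fewer than three characters; null if nothing remains."""
    kept = [w for w in text.split() if len(w) >= 3]
    return " ".join(kept) or np.nan

df = pd.DataFrame({"unique_noun": ["cheese pizza sm", "haircut", "sm tv"]})
df["extracted_product_idea"] = df["unique_noun"].apply(drop_short_words)
df = df.dropna(subset=["extracted_product_idea"])
print(list(df["extracted_product_idea"]))   # ['cheese pizza', 'haircut']
```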

Going back to data preprocessing, the fourth step to accomplish was data annotation. This

step involves manually removing tweets that do not express a product idea and are therefore not valid. The

basis for considering a product idea as valid is the presence of emotion (e.g., hate, want, love,

disgust) expressed towards the noun in a tweet. According to Nascimento & Da Silveira (2017),

people use social media platforms to express their emotions on any particular topic, including

product ideas. A new column named label was added to the existing dataframe to store the

output of data annotation. A value of 1 indicates that the extracted product idea is valid, while 0

means otherwise. Furthermore, since not all the extracted nouns are valid product ideas, only

those on the list of the most common product ideas for the Philippines' SMEs are considered.

The list of product ideas for SMEs is shown in Appendix K. Another column

named category was added to the dataframe to store the modified value of the


extracted_product_idea column based on the closest product ideas’ category they belong to.

Sample output is shown in Table 3.13.

Table 3.13 Sample annotated product ideas

extracted_product_idea label category 


acrylic 1 Art accessories shop

adobo 1 Home cooked meals


antibiotic 1 Drug store
yema cake 1 Cake, dessert, and pastry business
airpod 1 Tech and accessories shop
abaca face mask 1 Face mask/face shield business
accessory shop 1 Online fashion boutique
circle year 0 None
disappointment today 0 None

The data annotation was done with the aid of BSIT undergraduate students. The dataframe

with 522,943 rows and 12 columns was divided into five subsets using the month when the

tweets were posted. This resulted in subsets, namely, August.csv, September.csv, October.csv,

November.csv, and December.csv. The number of rows per subset before and after data

annotation is shown in Table 3.14. The first two subsets were assigned to the first student, while

the next two were delegated to the second. The remaining subset was reserved for the researcher.

The results made by the students were further validated by the researcher. Rows with 0 values in

the label column were dropped from the dataframe, which brings the rows down to 11,280. In

particular, the label column holds the result of the screening. It has two possible values, 1 or 0. A

value of 1 means that a product idea passes the screening or good, while 0 says otherwise or not

good. The dataframe was saved for the next step, which is data preparation. The code used to

screen the product ideas is shown in Appendix M.


Table 3.14 Comparison of each subset before and after data annotation

Subset No of rows of the pre- No of rows of rows of the


annotated subset post-annotated subset
August.csv 100,185 1,951

September.csv 105,587 6,949

October.csv 96,172 947

November.csv 104,203 123

December.csv 116,796 1,310

Total 522,943 11,280

Finally, sentiment analysis was done to compute the polarity scores of

the annotated_product_idea using the values of the precleaned_tweet column. Sentiment

analysis was carried out using a package called VADER. It is a package designed for complex

social media data. When getting the polarity scores, it considers punctuations, emojis, slang, and

abbreviated words that commonly appear in social media texts (Hutto & Gilbert, 2014). A new

column named polarity_score was added to the dataframe to store the computed polarity scores.

Rows with polarity scores less than 0.05, representing neutral and negative sentiments, were

dropped. This left the dataframe with 2,926 rows. Sample results are shown in Table 3.15. The

resulting dataframe was exported and used in the succeeding steps.


Table 3.15 Sample sentiment scores obtained from the precleaned tweets

precleaned_tweet extracted_product_idea polarity_score 

TRIPLE CHOCOLATE chocolate cake roll 0.2462


CAKE ROLLS 🤤😭
Craving for cheeeeeeeeese cheese pizza 0.0454
pizza 🧀🍕
want stir-fried spinach 😫❤ spinach 0.2240

want lasagna and lasagna lasagna 0.2576


only. 😭
suggest me sum affordable gym equipment shop 0.3884
gym equipment shops, i
really need to lift some
weights🥺
parang i want a haircut na. haircut 0.0772
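The 0.05 cutoff can be sketched as a simple dataframe filter. The positive scores below are taken from Table 3.15, while the negative example is hypothetical; computing the scores themselves would require the VADER package, so they are treated as precomputed here.

```python
import pandas as pd

# Polarity scores as they would come out of VADER's compound score.
df = pd.DataFrame({
    "annotated_product_idea": ["chocolate cake roll", "cheese pizza",
                               "spinach", "broken gadget"],
    "polarity_score": [0.2462, 0.0454, 0.2240, -0.4100],
})

# Keep only product ideas with positive sentiment (compound >= 0.05).
df = df[df["polarity_score"] >= 0.05].reset_index(drop=True)
print(len(df))   # 2
```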

3.1.3. Constructing the criteria for screening

Screening aims to evaluate new product ideas to determine which idea is worth investing in

(Rochford, 1991). A set of criteria was constructed to screen the values of the

annotated_product_idea column from the preprocessed dataset. The created criteria were based

on the factors considered by Baker & Albaum (1986). These criteria include potential market,

trend of demand, stability of demand, and market acceptance. (See Table 2.2). A new dataframe

was created using the aggregated values of the tweets_count, retweets_count, likes_count,

replies_count, and polarity_score columns of the annotated dataframe with 522,943 rows (See

Table 3.12). The resulting dataframe comprises 2,926 rows and 5 columns. These columns

include extracted_product_idea, potential_market, trend_of_demand, stability_of_demand, and

market_acceptance. In addition, a variety of utility functions were programmed to automate the


computation of the values of each criterion discussed above. A few of the rows used in screening

are shown in Table 3.16.

Table 3.16 Sample data from the new dataframe

annotated_product_ potential_ trend_of_ stability_of_ market_


idea market demand demand acceptance

acrylic 19,797.3% 4.9 15% 0.333


adobo 2,527% 5.0 15% 0.444
antibiotic 1,377% 3.25 15% 0.05
yema cake 101% 5 15% 0.05
airpod 4.5% 4.91 15% 0.87

abaca face mask 8% 5 14% 0.04

accessory shop 8% 5 14% 0.04

A new column named label was added to the new dataframe to hold the values of the result

of the screening. The said column has two possible values. A value of 1 means that a product

idea passes the screening or is considered as good. On the other hand, 0 implies that a product

idea did not satisfy at least one of the four criteria and, therefore, is not a good idea. A few of the

results of the screening are shown in Table 3.17. Out of the 2,926 annotated product ideas that

were screened, 2,145 failed, while the remaining 781 passed. For instance, acrylic, adobo,

antibiotic, yema cake, and airpod are considered good ideas because they satisfied all of the

criteria for screening. Specifically, acrylic has a potential market of 19,797.3%, which is far

more than the 33% acceptable value. Its trend of demand, which is computed from November to

December 2020, has an upward direction. Moreover, the stability of demand, considered from

September to October and November to December, remains optimistic. Finally, its market

acceptance has a positive sentiment. The newly created dataframe was saved for the next step,
which is data preparation. The results of the screening in each criterion are further discussed in

the succeeding paragraphs.

Table 3.17 Sample results of the screening

annotated_product_idea  potential_market  trend_of_demand  stability_of_demand  market_acceptance  label
acrylic                 19,797.3%         4.9              15%                  0.333              1
adobo                   2,527%            5.0              15%                  0.444              1
antibiotic              1,377%            3.25             15%                  0.05               1
yema cake               101%              5                15%                  0.05               1
airpod                  4.5%              4.91             15%                  0.87               1
abaca face mask         8%                5                14%                  0.04               0
accessory shop          8%                5                14%                  0.04               0
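The pass/fail rule described above can be sketched as a small pandas operation. The thresholds follow the four criteria discussed in this section; the helper function and sample rows below are illustrative, not the study's actual utility functions:

```python
import pandas as pd

# Thresholds taken from the four screening criteria described in this section:
# potential market >= 33%, trend of demand > 0, stability of demand >= 15%,
# and market acceptance >= 0.05 (VADER compound score).
def screen_idea(row):
    passed = (
        row["potential_market"] >= 33.0         # tweet engagement rate, percent
        and row["trend_of_demand"] > 0          # upward trendline slope
        and row["stability_of_demand"] >= 15.0  # change in tweet volume, percent
        and row["market_acceptance"] >= 0.05    # VADER compound threshold
    )
    return 1 if passed else 0  # 1 = good idea, 0 = not good

# Two sample rows mirroring Table 3.17 (percent columns as plain numbers):
df = pd.DataFrame({
    "annotated_product_idea": ["acrylic", "abaca face mask"],
    "potential_market": [19797.3, 8.0],
    "trend_of_demand": [4.9, 5.0],
    "stability_of_demand": [15.0, 14.0],
    "market_acceptance": [0.333, 0.04],
})
df["label"] = df.apply(screen_idea, axis=1)
```

Here acrylic satisfies all four thresholds and is labeled 1, while abaca face mask fails the potential market and stability criteria and is labeled 0.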

The first criterion considered in screening was the potential market, defined as the total

market size for a particular product. It was measured through the tweet engagement rate, defined

as the ratio of the diffusion interactions (retweet, reply, and like counts) to the total number of

tweets (Muñoz-Expósito, 2017). A potential market of at least 33% was set as the minimum

value for considering a product idea viable (Mee, 2020). The potential market values are stored

under the column named potential_market. Out of the 2,926 product ideas, 1,482 passed the

screening in terms of their potential market values. These product ideas show enough traction

from possible customers to be considered viable for a business venture. A few of these product

ideas are shown in Table 3.18. Most notably, acrylic, which falls under art accessories, got the

highest potential market among the samples below at 19,797.3%.


Table 3.18 Sample product ideas with potential market

extracted_product_idea potential_market

acrylic 19,797.3%

adobo 2,527%
antibiotic 1,377%
yema cake 101%
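The engagement-rate computation behind this criterion can be sketched as follows; the counts are hypothetical:

```python
def potential_market(retweets, replies, likes, n_tweets):
    """Tweet engagement rate, in percent: the ratio of diffusion interactions
    (retweets + replies + likes) to the total number of tweets mentioning
    the product idea (Munoz-Exposito, 2017)."""
    return 100.0 * (retweets + replies + likes) / n_tweets

# Hypothetical counts for a single product idea:
rate = potential_market(retweets=120, replies=40, likes=340, n_tweets=200)

# An engagement rate of at least 33% marks the idea as viable on this criterion.
is_viable = rate >= 33.0
```

With 500 interactions over 200 tweets the rate is 250%, well above the 33% cutoff.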

The second criterion considered for screening was the trend of demand. It refers to the

secular movement that describes the long-term direction of the data (e.g., upward or downward)

(Komlos, 1993). The trend of demand was monitored through the trendline's slope from August

2020 to December 2020 using the graphical method. A trend of demand greater than 0 was set

as the minimum value for considering a product idea viable (Komlos, 1993). The trend of

demand values were stored under the column named trend_of_demand. Out of the 2,926 product

ideas subject to screening, 2,908 passed the screening using the computed values of the trend of

demand. A few of these product ideas are shown in Table 3.19. These product ideas saw an

increase in engagements in the last five months, from August 2020 to December 2020, which is

another way of saying that the number of people talking about them has grown. In particular,

adobo, yema cake, abaca face mask, and accessory shop are among the product ideas with a

high chance of getting more demand in the succeeding months.



Table 3.19 Sample product ideas with trend of demand

annotated_product_idea trend_of_demand

acrylic 4.9

adobo 5.0

antibiotic 3.25

yema cake 5

airpod 4.91

abaca face mask 5

accessory shop 5
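A slope-based trend of demand can be sketched with NumPy's polyfit; the monthly engagement counts below are hypothetical, and the study's graphical method may differ in detail:

```python
import numpy as np

# Hypothetical monthly engagement counts, August-December 2020:
months = np.arange(5)  # 0 = Aug 2020 ... 4 = Dec 2020
engagements = np.array([110.0, 130.0, 125.0, 160.0, 180.0])

# Fit a straight trendline; its slope is taken as the trend of demand.
slope, intercept = np.polyfit(months, engagements, 1)

# A slope greater than 0 means demand for the idea is trending upward.
upward = slope > 0
```

For these counts the fitted slope is 17 engagements per month, so the idea passes the criterion.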

The third criterion used in screening was the stability of demand. It describes the fluctuation

in the market size over a particular period. It was measured by getting the change in tweet

volume from November to December 2020. A stability of demand of at least 15% was set as the

minimum value for considering a product idea viable (Baremetrics, 2021). The stability of

demand values are stored under the column named stability_of_demand. Of the 2,926 screened

product ideas, 864 showed a stable demand. A few of these product ideas are shown in Table

3.20. The list shows that product ideas such as acrylic, adobo, antibiotic, yema cake, and airpod

have a consistent demand. All items in the list got a 15% score, the minimum value for

considering a product idea's market stable.


Table 3.20 Sample product ideas with stability of demand

annotated_product_idea stability_of_demand

acrylic 15%
adobo 15%
antibiotic 15%
yema cake 15%
airpod 15%
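The period-over-period change underlying this criterion can be sketched as follows, with hypothetical volumes:

```python
def stability_of_demand(volume_prev, volume_curr):
    """Percent change in tweet volume between two consecutive periods
    (here, November to December 2020)."""
    return 100.0 * (volume_curr - volume_prev) / volume_prev

# Hypothetical tweet volumes for one product idea:
change = stability_of_demand(volume_prev=400, volume_curr=460)

# A change of at least 15% marks the idea's demand as stable.
is_stable = change >= 15.0
```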

Finally, market acceptance is the degree of acceptance of possible customers towards a

particular product idea. It was assessed by getting the polarity scores of the extracted product

ideas using the VADER package. The compound valence scores were taken as the value of the

market acceptance (Hutto & Gilbert, 2014). A market acceptance of at least 0.05 was set as the

threshold for considering a product idea viable. The market acceptance values are stored under

the column named market_acceptance. In particular, out of the 2,926 evaluated product ideas,

1,272 got a favorable result in terms of market acceptance. A few of these product ideas are

shown in Table 3.21. Most notably, yema cake and acrylic are among the product ideas

perceived most positively.

Table 3.21 Sample product ideas with market acceptance

annotated_product_idea market_acceptance

acrylic 0.444

adobo 0.05
antibiotic 0.05

yema cake 0.87


accessory shop 0.333
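Assuming the compound score for each idea has already been computed (e.g., with vaderSentiment's SentimentIntensityAnalyzer), the 0.05 cutoff convention of Hutto & Gilbert (2014) can be sketched as:

```python
# Assumes the compound score for each idea was already computed, e.g. with
# vaderSentiment: SentimentIntensityAnalyzer().polarity_scores(text)["compound"].
def market_acceptance_label(compound_score, threshold=0.05):
    """Classify a VADER compound valence score using the conventional
    cutoffs: >= +0.05 positive, <= -0.05 negative, otherwise neutral."""
    if compound_score >= threshold:
        return "positive"
    if compound_score <= -threshold:
        return "negative"
    return "neutral"

labels = [market_acceptance_label(s) for s in (0.87, 0.0, -0.4)]
```

Only ideas whose compound score reaches the positive band pass this criterion.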

3.1.4. Data preparation

Data preparation is the last step before modeling (Zhang, Zhang, & Yang, 2003). The

screening results, which consist of the extracted_product_idea, potential_market, trend_of_

demand, stability_of_demand, market_acceptance, and label columns, were prepared using

various packages in Python. The prepared dataset has 780 good product ideas and 2,146 not-

good product ideas. The extracted_product_idea column served as the input/independent

variable, while the label values served as the output/dependent variable. Moreover, the prepared

dataset was balanced and then divided into two subsets in an 80:20 ratio to create the training

and testing datasets, respectively. In particular, the training dataset contains 1,248 inputs, while

the testing dataset has 312. A few of the rows are shown in Table 3.22.

Table 3.22 Sample prepared dataset

input output
[0.36797234, 0.042480964, 0.099437095, -0.1050...] 1

[-0.25395, -0.021710003, 0.29691, -0.065814994...] 1

[0.15996951, 0.2001715, -0.17681801, -0.004529...] 1

[0.009951855, 0.24370115, 0.094669, -0.3137708...] 1

[0.2085925, 0.10699375, -0.0006571803, -0.0292...] 1

[-0.116972834, -0.11884499, 0.013468166, 0.224...] 0

[0.073656656, 0.22216666, 0.068681, -0.0890286...] 0
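The 80:20 split into 1,248 training and 312 testing inputs can be reproduced with scikit-learn's train_test_split; the random vectors below merely stand in for the actual 1,560 balanced rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random vectors standing in for the 1,560 balanced rows
# (780 good + 780 not good), each a 300-dimensional embedding.
rng = np.random.default_rng(0)
X = rng.normal(size=(1560, 300))
y = np.array([1] * 780 + [0] * 780)

# 80:20 split into training and testing subsets, as in the study.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
```

Stratifying on the label keeps the two classes in proportion across the subsets.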


The initial text values of the extracted_product_idea column were transformed into

numerical values through word embedding using the Word2Vec package, which converts strings

into their vector representations. Each vectorized input comprises an array of 300 numerical

values; the arrays below are truncated to their first four values. Sample datasets are shown in

Table 3.23.

Table 3.23 Sample vectorized inputs

input vectorized input


acrylic [0.36797234, 0.042480964, 0.099437095, -0.1050...]

adobo [-0.25395, -0.021710003, 0.29691, -0.065814994...]

antibiotic [0.15996951, 0.2001715, -0.17681801, -0.004529...]

yema cake [0.009951855, 0.24370115, 0.094669, -0.3137708...]

airpod [0.2085925, 0.10699375, -0.0006571803, -0.0292...]

abaca face mask [-0.116972834, -0.11884499, 0.013468166, 0.224...]

accessory shop [0.073656656, 0.22216666, 0.068681, -0.0890286...]
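One common way to turn a multi-word product idea into a single fixed-length vector is to average its word vectors. The 4-dimensional embeddings below are toy stand-ins for the study's 300-dimensional Word2Vec vectors (e.g., from gensim), and averaging is an assumption rather than the study's documented method:

```python
import numpy as np

# Toy 4-dimensional embeddings standing in for the study's 300-dimensional
# Word2Vec vectors (e.g., from gensim's Word2Vec / KeyedVectors).
embeddings = {
    "yema": np.array([0.1, -0.2, 0.4, 0.0]),
    "cake": np.array([0.3, 0.2, 0.0, 0.2]),
}

def vectorize(product_idea):
    """Average the word vectors of a (possibly multi-word) product idea
    into one fixed-length input vector."""
    vectors = [embeddings[word] for word in product_idea.split()]
    return np.mean(vectors, axis=0)

vec = vectorize("yema cake")
```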

3.1.5. Parameter tuning

The performance of a model is rarely at its best when using its default parameters (Smit &

Eiben, 2009). The ideal values of the model's parameters were identified through parameter

tuning using the RandomizedSearchCV package. The prepared dataset from the previous step

was fed to the model. The parameters considered were kernel, gamma, degree, and C. A total of

10 candidates were tested in three-fold cross-validation, totaling 30 fits. The values were

obtained in 1.5 minutes, and the results are shown in Table 3.24.

Table 3.24 Parameters of the model

parameter result

kernel rbf
gamma 100

degree 6
C 100
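The tuning setup (10 candidates, three-fold cross-validation, 30 fits) can be sketched with scikit-learn; the toy data and candidate value lists below are illustrative, not the study's exact search space:

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Toy data standing in for the vectorized product ideas and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] > 0).astype(int)

# Candidate values for the four parameters considered in the study.
param_distributions = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "gamma": [0.001, 0.01, 0.1, 1, 10, 100],
    "degree": [2, 3, 4, 5, 6],
    "C": [0.1, 1, 10, 100],
}

# 10 candidates x 3-fold cross-validation = 30 fits, as in the study.
search = RandomizedSearchCV(
    SVC(), param_distributions, n_iter=10, cv=3, random_state=42
)
search.fit(X, y)
best = search.best_params_
```

The best_params_ attribute then reports the winning value for each of the four parameters, as in Table 3.24.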

3.1.6. Performance evaluation

Performance evaluation was performed to assess how well the model classifies a product

idea. The proposed model was evaluated in four performance metrics: (1) accuracy, (2)

precision, (3) recall, and (4) f1-score using the balanced prepared dataset. Specifically, the

accuracy was measured in ten-fold cross-validation. In addition, a confusion matrix was

generated to examine the number of correctly classified values and verify the cross-validation

results. Moreover, a classification report was generated and used to assess the model's precision,

recall, and f1-score values. The performance of the model is further discussed in the succeeding

paragraphs.

Accuracy was the first performance metric used in measuring the performance of the model.

Cross-validation was performed to determine the consistency in the accuracy of the model.

Accuracy is the ratio of correctly predicted values to the total number of observations. The

average accuracy was considered the final accuracy. The cross-validation was carried out using
the dataset prepared from the 2,926 extracted product ideas, which comprises a training subset of 1,248 inputs and a testing subset of 312 inputs.

The results of the cross-validation are shown in Table 3.25.

The accuracy of the model started low at 49%. The score then jumped to 86% on the second

fold, nearly double the first. It continued to fluctuate, peaking at 88% on the fourth fold.

Overall, the average accuracy of 81% was considered the final cross-validation score.

Table 3.25 Accuracy of the model in ten-fold cross-validation

Fold       Accuracy
1st fold   0.48717949
2nd fold   0.85897436
3rd fold   0.80128205
4th fold   0.88461538
5th fold   0.84615385
6th fold   0.85256410
7th fold   0.80769231
8th fold   0.81410256
9th fold   0.82692308
10th fold  0.87820513
Average    0.80576900
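The ten-fold procedure can be sketched with scikit-learn's cross_val_score on toy data standing in for the vectorized product ideas:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy data standing in for the vectorized product ideas and labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)

# Ten-fold cross-validation; the mean of the per-fold accuracies
# is taken as the final accuracy, as in Table 3.25.
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=10, scoring="accuracy")
mean_accuracy = scores.mean()
```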

Moreover, a total of 312 product ideas were used to further examine the performance of the

model by generating a confusion matrix. In particular, 160 of them are not good, while 152 are

good. Based on the assessment, the model correctly classified 116 as good product ideas and

147 as not-good product ideas. Furthermore, only 13 inputs are false positives, or type 1 errors.

A larger share of the misclassifications are false negatives, or type 2 errors, at 36 items (the 152

actual good ideas minus the 116 classified correctly). The confusion matrix is shown in Table 3.26.

Table 3.26 Confusion matrix

n = 312     Predicted 0   Predicted 1
Actual 0    147           13
Actual 1    36            116

Precision was the second performance metric used in measuring the performance of the

model. It is computed as the ratio of the correctly predicted values of a particular class to all

values, correct and incorrect, predicted as that class. The precision scores are shown in Table

3.27. The weighted average was considered in measuring the precision of the model. The report

shows the model's performance on the 312 inputs from the testing subset, consisting of 160

not-good and 152 good product ideas. Accordingly, the model has an 80% precision rate when

classifying a product idea as not good and 90% when it is good. Overall, it got an 85% weighted

precision rate, which was considered the final value.

Table 3.27 Classification report

              Precision  Recall  F1-score
0             80%        92%     86%
1             90%        76%     83%
Weighted avg  85%        84%     84%
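The reported precision and recall for the good class follow directly from the confusion-matrix counts: 147 true negatives, 13 false positives, 116 true positives, and, since 116 of the 152 actual good ideas were caught, 36 false negatives. A sketch with scikit-learn:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Rebuild predictions matching the reported counts: 147 true negatives,
# 13 false positives, 36 false negatives, 116 true positives.
y_true = [0] * 147 + [0] * 13 + [1] * 36 + [1] * 116
y_pred = [0] * 147 + [1] * 13 + [0] * 36 + [1] * 116

cm = confusion_matrix(y_true, y_pred)
precision_good = precision_score(y_true, y_pred)  # 116 / (116 + 13), about 0.90
recall_good = recall_score(y_true, y_pred)        # 116 / (116 + 36), about 0.76
```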

Recall was the third performance metric used in measuring the performance of the model. It

is calculated as the ratio of the correctly predicted values of a particular class to the total

number of actual values of that class. The recall scores are shown in Table 3.27. The weighted

average was considered in measuring the recall of the model. Accordingly, the model has a

92% recall rate when classifying a product idea as not good and 76% when it is good. Overall,

it got an 84% weighted recall rate, which was considered the final value.

Finally, the f1-score is the harmonic mean of precision and recall. The f1-scores are shown

in Table 3.27. The weighted average was considered in measuring the f1-score of the model.

Accordingly, the model has an 86% f1-score when classifying a product idea as not good and

83% when it is good. Overall, it got an 84% weighted f1-score, which was considered the final

value.

Overall, the performance evaluation shows that the accuracy of the model on the testing

subset is 84%. Furthermore, it got a precision rate of 85%, a recall rate of 84%, and an f1-score

of 84%. These results are summarized in Table 3.28. The trained model was exported and used

in the implementation of the proposed screening application.

Table 3.28 Performance evaluation

Accuracy Precision Recall F1-score

84% 85% 84% 84%

3.2. Implementation

The application was designed and developed using ML and various web technologies,

including HTML, CSS, and JS, together with the micro web framework Flask. This section is

presented in two parts: section 3.2.1) front-end development and section 3.2.2) back-end

development. First, screenshots of the UIs are shown. Afterward, the different functions are

presented with sample inputs.

3.2.1. Front-end development

The UI of the application was implemented using Bootstrap and Flask’s templating engine,

Jinja2. In particular, Bootstrap was used to make the design responsive to any computing device,

including desktop computers, mobiles, and tablets. Furthermore, icons from Fontawesome were

added to improve the aesthetics. The created UIs were tested out on the different sizes of devices

enabled by Google Chrome developer tools. Finally, UIs were viewed in three major browsers

including Google Chrome, Mozilla Firefox, and Microsoft Edge. In particular, three pages were
created for the application, namely the screen, result, and dashboard pages. Sample screenshots
of these pages are shown below.

The screen page is the first page of the application. It allows Philippine SME owners or

their representatives to access the application's features, such as entering a new product idea

into the search bar and following a link to view good product ideas through the dashboard. A

screenshot of the screen page is shown in Figure 3.3. Moreover, when a user enters invalid or

incorrect input, the user is redirected back to the screen page, where the nature of the error and

tips to prevent it pop up. Screenshots of the screen page with invalid and valid inputs are shown

in Figures 3.4 and 3.5, respectively.

Figure 3.3 Screen page


Figure 3.4 Screen page with invalid input

Figure 3.5 Screen page with valid input


Second, the result page is where the user is redirected when the screening button is pressed

and no errors are found. This is the page where the result of the screening is revealed: it

indicates whether the entered product idea is good or not. A screenshot of the result page for

face mask is shown in Figure 3.6. Other sample screening results are shown in Figures 3.7 and

3.8.


Figure 3.6 Result page for face mask

Figure 3.7 Result page for cake


Figure 3.8 Result page for car wash
Ultimately, the dashboard is where the good product ideas are listed. The product ideas on

this page are sorted alphabetically. Other information, including the category, potential market,

stability of demand, and market acceptance, is also shown. A screenshot of the dashboard page

is shown in Figure 3.9.

Figure 3.9 Dashboard


3.2.2. Back-end development

The application functions were developed using Flask and a set of Python packages. A total

of seven functions were constructed: reading the input, preprocessing the input, validating the

input and showing errors, translating the input to English, converting the input into vector

representations, classifying the input, and showing the good product ideas. The screening

application was tested with different inputs, such as numerical strings, non-English strings,

Tagalog strings, and other possibly invalid inputs. Sample preprocessed, invalid, and classified

inputs are shown in Tables 3.29, 3.30, and 3.31.

Table 3.29 Sample inputs and their preprocessed equivalents

Input                  Preprocessed input  Function
Bread                  bread               Converted input to lowercase
https://www.bread.com                      Removed links
<html>bread</html>     bread               Removed HTML tags
Bread!@                bread               Removed special characters
Breeeeeaaaad           bread               Removed extra characters
Bread123               bread               Removed numerical characters
Tinapay                bread               Translated input to English
Breads                 bread               Lemmatized the input
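The preprocessing steps in Table 3.29 can be sketched with a few regular expressions; this is a simplified version that omits the translation and lemmatization steps:

```python
import re

def preprocess(text):
    """Simplified version of the normalization steps in Table 3.29
    (translation and lemmatization are omitted here)."""
    text = text.lower()                       # convert input to lowercase
    text = re.sub(r"https?://\S+", "", text)  # remove links
    text = re.sub(r"<[^>]+>", "", text)       # remove HTML tags
    text = re.sub(r"(.)\1{2,}", r"\1", text)  # collapse repeated characters
    text = re.sub(r"[^a-z\s]", "", text)      # remove digits/special characters
    return text.strip()

examples = [preprocess(s)
            for s in ("Bread!@", "<html>bread</html>", "Breeeeeaaaad", "Bread123")]
```

Each sample input normalizes to the same token, "bread", matching the table.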

Table 3.30 Sample invalid inputs and their corresponding errors

Input Error
123123123 Your input is not valid. Please enter a string.
Gahmsahhahmnida Your language is not supported. Please enter a
Tagalog/English string only.
as Your input is too short. Try longer string.

cheerful Your input is not a product idea. Please enter a noun.

accident Your input is not a valid product idea. See sample


product ideas in the dashboard.

aaaaaaaaaaaaaaaaaaaaa Your input is too short. Try longer string.

Table 3.31 Sample product ideas and their labels

Input Output
acrylic Good idea
adobo Not a good idea
antibiotic Good idea
yema cake Not a good idea
airpod Not a good idea
abaca face mask Good idea
accessory shop Good idea
Chapter 4: Conclusions and recommendations

4.1. Conclusions

This study developed a tool that utilizes UGC produced in social media to assist the

Philippine-based SMEs in developing product ideas. This project was done in two steps. First, a

supervised ML model was trained to classify product ideas into good or not good. Second, an

application was implemented using the created model.

The model was built using various Python libraries (see Appendix A). A few of the packages

used are pandas, vaderSentiment, spaCy, and scikit-learn. The dataset used to train the model

consists of UGC on Twitter. In particular, over 5 million tweets were collected over five months

(August-December 2020) using an advanced scraping tool, Twint. The data collection was

performed in a Linux environment configured using Oracle VM. The collected tweets went

through a series of preprocessing steps, including data precleaning, pulling out the nouns from

the tweets, and removing stopwords, to prepare the datasets for modeling. In addition, data annotation was

conducted to ensure that only valid product ideas were included in the final dataset. The

resulting dataset was used to perform the screening: the extracted product ideas were evaluated

through the engagement and sentiment information obtained from the scraped tweets, with four

factors considered in the screening: potential market, trend of demand, stability of demand, and

market acceptance (Baker & Albaum, 1986). The results are represented by two values. A value of 1

indicates that a product idea is good, while 0 says otherwise. These data were used as the dataset

to feed the model. Since an ML algorithm cannot process text, the product ideas were

transformed into vector representations using the Word2Vec encoding scheme. The vectorized

extracted product ideas were used as the input. On the other hand, the label (e.g., 1 or 0) was
used as the output. The dataset was divided with a ratio of 80:20 to train and test the model,

respectively. The model's performance was assessed in four metrics: accuracy, precision, recall,

and f1-score. The evaluation shows that the model achieves an 84% accuracy rate when

classifying a product idea (e.g., good or not good). The model was saved and used in the

implementation of the application.

The application was implemented using Flask, and its UIs were designed using the CSS

framework, Bootstrap. The developed application would allow SMEs’ owners or any

representative to enter and screen their product ideas. Specifically, the tool consists of seven

features, including reading the input, validating the input and showing the error, converting input

into vector representations, and classifying the input. Three web pages were designed to show

the features of the application, namely, screen page, result page, and dashboard. The application

was tested by launching it on different devices, including phones, tablets, and computers, using

the Google developer tools.

This study would help contribute to the limited literature that explores social media data in

developing new product ideas (Nascimento & Da Silveira, 2017). It supports the findings of

Rathore & Ilavarasan (2020) regarding obtaining insights on user preferences from social media.

It demonstrates how tweets on Twitter could be used in screening product ideas for Philippines’

SMEs. The screening process is part and parcel of the seven-stage process of NPD. Other factors

need to be considered when conducting NPD, such as new product strategy development,

business analysis, development, and commercialization (Booz, Allen & Hamilton, 1982). The

proposed application does not intend to replace the screening process; instead, it would

complement it. Specifically, it would provide Philippine SMEs a platform to quickly evaluate

and select which product ideas to consider for their next business venture. Furthermore, the
application would also allow the rising numbers of SMEs putting up a business in the online

platform economy nationwide to look for product ideas to sell (Villanueva, 2020).

4.2. Recommendations

The researcher gathered over 5 million tweets posted within the Philippines that covers a

five-month period, from August to December 2020. Only the 1,557,442 English tweets from the

entire dataset were considered. Future studies could collect more tweets by allowing a broader

time period of data collection. Also, later research could include tweets with Tagalog language

and other dialects in the Philippines. Furthermore, collecting real-time tweets could also help

provide more up-to-date insights regarding product ideas that could be critical in the screening

process. Additionally, other sources of UGC such as Facebook, Instagram, and YouTube could

also be included. This study only considered text-based UGC. Considerations for UGC in other

formats such as images can be explored. Interestingly, Instagram has been the most preferred

social media platform among consumers because of its ease of use and less complicated way of

viewing feedback (Smith, Fischer, & Yongjian, 2012). It has also been recommended that future

research can examine SM platforms that are driven by images for specific type of product ideas

(Rathore & Ilavarasan, 2020).

Moreover, out of the 1,557,442 tweets considered in the study, a total of 2,926 product ideas

were extracted through data annotation. It was performed with the aid of two undergraduate

BSIT students. The preprocessed nouns that were pulled out from the tweets were marked as 1

for valid and 0 for invalid. The basis for considering a product idea as valid is the presence of

emotion (e.g., hate, want, love, disgust) expressed towards the noun in a tweet. Furthermore,

since not all nouns are product ideas, only those on the list of the most common product ideas

particular to the Philippines' SMEs were considered. For example, MoneyMax.com (2021)

published a list of small business ideas and their required capital for 2021. The researcher

further validated the results in a limited period of time. Similar studies could incorporate

marketing experts when annotating the product ideas and allot a more considerable amount of

time to have a more in-depth analysis and further validation of the dataset.

Furthermore, the annotated product ideas were transformed into numerical values by getting

their vector representations using the Word2Vec encoding scheme. A study suggests that

combining Word2Vec with TF-IDF consistently yields better results (Lilleberg, Zhu, & Zhang,

2015). TF-IDF is a bag-of-words technique that assigns each word a numerical weight based on

how frequently it appears in a document, offset by how frequently it appears across the whole

collection; the rarer a word is across the collection, the higher its weight. Future studies could

further enhance the model's performance by using the features of both TF-IDF and

Word2Vec when transforming the extracted product ideas into their vector representations.
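A minimal sketch of this suggested combination, with a toy corpus and 3-dimensional one-hot embeddings standing in for real Word2Vec vectors: weight each word vector by its smoothed IDF before averaging, so rarer, more informative words dominate the phrase vector.

```python
import math
import numpy as np

# Toy corpus and one-hot embeddings standing in for Word2Vec vectors.
corpus = [["yema", "cake"], ["cake"], ["acrylic"]]
embeddings = {
    "yema": np.array([1.0, 0.0, 0.0]),
    "cake": np.array([0.0, 1.0, 0.0]),
    "acrylic": np.array([0.0, 0.0, 1.0]),
}

def idf(word):
    """Smoothed inverse document frequency: rarer words score higher."""
    n_docs = len(corpus)
    doc_freq = sum(1 for doc in corpus if word in doc)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf_weighted_vector(words):
    """Average the word vectors of a phrase, weighted by IDF, so rare,
    informative words contribute more than common ones."""
    weights = np.array([idf(w) for w in words])
    vectors = np.array([embeddings[w] for w in words])
    return (weights[:, None] * vectors).sum(axis=0) / weights.sum()

vec = tfidf_weighted_vector(["yema", "cake"])
```

Here "yema" appears in fewer documents than "cake", so its vector receives the larger weight in the combined representation.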

In addition, the study was able to build a model that classifies a particular product idea into

good or not using the collected UGC on Twitter. However, the model was only trained once, and

its classification depends solely on a limited training dataset. Meanwhile, consumer preferences

are constantly changing: what are considered good product ideas today may no longer be so

next month. Future studies can consider implementing a mechanism that allows the model to

continuously learn as it accepts input from users. This would make the model more intelligent

and reliable, as it would enable it to analyze current trends in the market.

Finally, a supervised ML algorithm, SVC, was used to build the model for classifying

product ideas. The parameters of the model were identified using RandomizedSearchCV. The

model's accuracy could be improved by getting the optimal parameter values using

GridSearchCV (Batayev, 2019). It is a technique that exhaustively searches all possible

combinations of the specified parameter values to find the optimal ones. This study could also be extended by using
unsupervised learning. Alternatively, the SVC model could be trained using an unsupervised

clustering approach for text classification (Shafiabady et al., 2016).


Definition of terms

Small and medium enterprises It is a non-subsidiary, independent firm in the


Philippines that employs 10-199 people and has total
assets not exceeding 3,000,000 pesos.
New product development It is a process that transforms market opportunities
into a product available for sale in the market. It
consists of seven stages: new product strategy
development, idea generation, screening and
evaluation, business analysis, development, testing,
and development.
Screening It is a phase of the new product development that
selects and evaluates new product ideas from a pool
of ideas generated during new product ideation.
Social media It is a website and application that enable users to
create and share content or participate in social
networking (e.g., Facebook, Instagram, and Twitter).
User-generated content It is publicly available content on the Internet, usually
posted on social media sites, such as text, image,
video, and audio, created by users rather than brands.
Twitter It is a social media application that allows its users to
post a concise tweet of up to 280 characters with the
option to include hashtags.
Machine learning It is a branch of AI based on the idea that systems can
learn from data, identify patterns and make decisions
with minimal human intervention.
Supervised machine learning It is a machine learning task of learning a function
algorithm that maps an input to an output based on input-output
pairs.
Support vector machine It is a set of supervised learning methods used for
classification and regression.
Sentiment analysis It is a process of getting the polarity scores from a
piece of text to determine its sentiment. The
sentiment is categorized into positive, negative, or
neutral.
Text classification It is a machine learning technique that assigns a label
to a particular text. A typical text classification
problem is categorizing news into sports, business,
politics, or entertainment.
Word embedding It is a process of associating numerical or vector
representations to a text for mathematical
computations. One typical technique used in word
embedding is Word2Vec.
Market potential It is the criterion used to measure the total market
size of a particular new product idea.
Stability of demand It is the criterion used to measure the fluctuation in
the market size from one period to another for a
particular new product idea.
Trend of demand It is the criterion used to measure the growth of
demand over a period of time for a particular new
product idea.
Market acceptance It is the criterion used to determine the sentiment
(positive, negative, or neutral) of potential
customers towards a particular product idea.
References

Agrawal, A., & Bhuiyan, N. (2014). Achieving success in NPD projects. International Journal of
Social, Behavioral, Educational, Economic, Business and Industrial Engineering, 8(2),
476-481.
Akram Afzal, M. (2017). Risks in new product development (NPD) projects.

Albar, F. M. (2013). An Investigation of Fast and Frugal Heuristics for New Product Project
Selection.

Bahtar, A. Z., & Muda, M. (2016). The impact of User–Generated Content (UGC) on product
reviews towards online purchasing–A conceptual framework. Procedia Economics and
Finance, 37, 337-342.

Baker, K. G., & Albaum, G. S. (1986). Modeling new product screening decisions. Journal of
Product Innovation Management: AN INTERNATIONAL PUBLICATION OF THE
PRODUCT DEVELOPMENT & MANAGEMENT ASSOCIATION, 3(1), 32-39.

Baremetrics. (2021). Growth rate. Retrieved from https://baremetrics.com/academy/growth-rate

Batayev, N. (2019). Gas turbine fault classification based on machine learning supervised
techniques. In 2018 14th International Conference on Electronics Computer and
Computation (ICECCO) (pp. 206-212). IEEE.

Bayanihan to Heal As One Act. RA 11469. 18th Cong. (2020). Retrieved from
https://www.officialgazette.gov.ph/downloads/2020/03mar/20200324-RA-11469-
RRD.pdf

Bhimani, H., Mention, A. L., & Barlatier, P. J. (2019). Social media and innovation: A
systematic literature review and future research directions. Technological Forecasting and
Social Change, 144, 251-269.
Brainkart. (2021). Methods for measuring secular trend. Retrieved from
https://www.brainkart.com/article/Methods-of-Measuring-Secular-Trend_39269/

Booz, Allen & Hamilton. (1982). New products management for the 1980s. Booz, Allen &
Hamilton.

Bouey, J. (2020). Assessment of COVID-19's Impact on Small and Medium-Sized Enterprises:


Implications from China.

Carbonell, P., Mayer, M. A., & Bravo, À. (2015, January). Exploring brand-name drug mentions
on Twitter for pharmacovigilance. In MIE (pp. 55-59).

Chenworth, M., Perrone, J., Love, J. S., Graves, R., Hogg-Bremer, W., & Sarker, A. (2021).
Methadone and Suboxone® mentions on Twitter: Thematic and sentiment analysis.
Clinical Toxicology, 1-10.
Chu, S. C., & Kim, Y. (2011). Determinants of consumer engagement in electronic word-of-
mouth (eWOM) in social networking sites. International journal of Advertising, 30(1), 47-75

Church, K. W. (2017). Word2Vec. Natural Language Engineering, 23(1), 155-162.

CMO’s Use of Social Media During COVID-19. For what purpose has your firm used social
media during the pandemic. MarketingCharts.com. (May, 2020). Retrieved from
https://www.marketingcharts.com/charts/cmos-use-of-social-media-during-covid-
19/attachment/cmosurvey-cmo-use-of-social-media-during-covid-19-jul2020

Cooper, R. G. (1979). Identifying industrial new product success: Project NewProd. Industrial


Marketing Management, 8(2), 124-135.

Costa, J., Silva, C., Antunes, M., & Ribeiro, B. (2013, April). Defining semantic meta-hashtags
for twitter classification. In International Conference on Adaptive and Natural Computing
Algorithms (pp. 226-235). Springer, Berlin, Heidelberg.

Dadgar, S. M. H., Araghi, M. S., & Farahani, M. M. (2016, March). A novel text mining
approach based on TF-IDF and Support Vector Machine for news classification. In 2016
IEEE International Conference on Engineering and Technology (ICETECH) (pp. 112-116).
IEEE.

De Brentani, U., & Droge, C. (1988). Determinants of the new product screening decision A
structural model analysis. International Journal of Research in Marketing, 5(2), 91-106.

Dhaoui, C., Webster, C. M., & Tan, L. P. (2017). Social media sentiment analysis: lexicon versus
machine learning. Journal of Consumer Marketing.

Employment Situation in April 2020. (2020, June 5). Philippines Statistics Authority. Retrieved
from https://psa.gov.ph/statistics/survey/labor-and-employment/labor-force-
survey/title/Employment%20Situation%20in%20April%202020

Exec. Order 112, s. 2020. (April 30, 2020). Retrieved from
https://www.officialgazette.gov.ph/downloads/2020/10oct/20201008-IATF-Omnibus-Guidelines-RRD.pdf

Effendi, M. I., Sugandini, D., & Istanto, Y. (2020). Social Media Adoption in SMEs Impacted by
COVID-19: The TOE Model. The Journal of Asian Finance, Economics, and Business,
7(11), 915-925.

Feldman, R. (2013). Techniques and applications for sentiment analysis. Communications of the
ACM, 56(4), 82-89.

Fischer, E., & Reuber, A. R. (2011). Social interaction via new social media:(How) can
interactions on Twitter affect effectual thinking and behavior?. Journal of business
venturing, 26(1), 1-18.

Ford, P., & Terris, D. (2017). NPD, design and management for SME's. The Design Society.

Ge, L., & Moh, T. S. (2017, December). Improving text classification with word embedding.
In 2017 IEEE International Conference on Big Data (Big Data) (pp. 1796-1805). IEEE.

Gharib, T. F., Habib, M. B., & Fayed, Z. T. (2009). Arabic Text Classification Using Support
Vector Machines. Int. J. Comput. Their Appl., 16(4), 192-199.

Global User Generated Content (UGC) Software Market Size, Status and Forecast 2020-2026.
MarketResearch.com. (2020). Retrieved from
https://www.marketresearch.com/QYResearch-Group-v3531/Global-User-Generated-
Content-UGC-13615161/

Guntuku, S. C., Sherman, G., Stokes, D. C., Agarwal, A. K., Seltzer, E., Merchant, R. M., &
Ungar, L. H. (2020). Tracking mental health and symptom mentions on twitter during
covid-19. Journal of general internal medicine, 35(9), 2798-2800.

Hughes, G. D., & Chafin, D. C. (1996). Turning new product development into a continuous
learning process. Journal of Product Innovation Management, 13(2), 89-104.

Hayon, S., Tripathi, H., Stormont, I. M., Dunne, M. M., Naslund, M. J., & Siddiqui, M. M.
(2019). Twitter mentions and academic citations in the urologic literature. Urology, 123,
28-33.

Impact of COVID-19 pandemic on micro, small, and medium enterprises (MSMEs). MSC.
(June, 2020). Retrieved from https://www.microsave.net/wp-content/uploads/2020/08/Impact-of-COVID-19-on-Micro-Small-and-Medium-Enterprises-MSMEs-1.pdf

Islam, M. S., Jubayer, F. E. M., & Ahmed, S. I. (2017, February). A support vector machine
mixed with TF-IDF algorithm to categorize Bengali document. In 2017 international conference
on electrical, computer and communication engineering (ECCE) (pp. 191-196). IEEE.

Jeong, E., & Jang, S. S. (2011). Restaurant experiences triggering positive electronic word-of-
mouth (eWOM) motivations. International Journal of Hospitality Management, 30(2),
356-366.

Jespersen, K. R. (2007). Is the screening of product ideas supported by the NPD process
design? European Journal of Innovation Management.

Jiang, J. X., & Shen, M. (2017). Traditional media, twitter and business scandals. Michael,
Traditional Media, Twitter and Business Scandals (May 1, 2017).

Joachims, T. (1998, April). Text categorization with support vector machines: Learning with
many relevant features. In European conference on machine learning (pp. 137-142). Springer,
Berlin, Heidelberg.

Kaluža, M., Kalanj, M., & Vukelić, B. (2019). A comparison of back-end frameworks for web
application development. Zbornik veleučilišta u rijeci, 7(1), 317-332.

Komlos, J. (1993). The secular trend in the biological standard of living in the United Kingdom,
1730‐1860 1. The Economic History Review, 46(1), 115-144.

Korde, V., & Mahender, C. N. (2012). Text classification and classifiers: A survey. International
Journal of Artificial Intelligence & Applications, 3(2), 85.

Kowsari, K., Jafari Meimandi, K., Heidarysafa, M., Mendu, S., Barnes, L., & Brown, D. (2019).
Text classification algorithms: A survey. Information, 10(4), 150.

Krumm, J., Davies, N., & Narayanaswami, C. (2008). User-generated content. IEEE Pervasive
Computing, 7(4), 10-11.

Kumar, S., Koolwal, V., & Mohbey, K. K. (2019). Sentiment analysis of electronic product
tweets using big data framework. Jordanian Journal of Computers and Information Technology
(JJCIT), 5(01).

Kurnia, R., Tangkuman, Y., & Girsang, A. (2020). Classification of User Comment Using
Word2vec and SVM Classifier. Int. J. Adv. Trends Comput. Sci. Eng, 9, 643-648.

Leano, R. M. (2006). SMEs in the Philippines. Cacci Journal, 3, 1-10.


Promoting SME competitiveness in the Philippines. International Trade Centre.
(November, 2020). Retrieved from
https://www.intracen.org/uploadedFiles/intracenorg/Content/Publications/Philippines_SME_v6.pdf

Lee, I., & Shin, Y. J. (2020). Machine learning for enterprises: Applications, algorithm selection,
and challenges. Business Horizons, 63(2), 157-170.

Lee, M., & Youn, S. (2009). Electronic word of mouth (eWOM) How eWOM platforms
influence consumer product judgement. International Journal of Advertising, 28(3), 473-
499.

Lilleberg, J., Zhu, Y., & Zhang, Y. (2015, July). Support vector machines and word2vec for text
classification with semantic features. In 2015 IEEE 14th International Conference on
Cognitive Informatics & Cognitive Computing (ICCI* CC) (pp. 136-140). IEEE.
Magna Carta for Micro, Small and Medium Enterprises (MSMEs). RA 9501. 14th Cong. (2008).
https://www.officialgazette.gov.ph/2008/05/23/republic-act-no-9501/

Money Max. (23 June, 2021). 32 Micro and Small Business Ideas You Can Start with Low
Capital in 2021. Retrieved from
https://www.moneymax.ph/personal-finance/articles/small-business-ideas-philippines

Muñoz-Expósito, M., Oviedo-García, M. Á., & Castellanos-Verdugo, M. (2017). How to
measure engagement in Twitter: advancing a metric. Internet Research.

Mu, J., Peng, G., & Tan, Y. (2007). New product development in Chinese SMEs: Key success
factors from a managerial perspective. International Journal of Emerging Markets, 2(2),
123-143.

De Vera, B. (2020, May 23). NEDA: PH economy lost P1.1T during lockdown. Retrieved
from https://business.inquirer.net/298037/neda-ph-economy-lost-p1-1t

Nascimento, A. M., & Da Silveira, D. S. (2017). A systematic mapping study on using social
media for business process improvement. Computers in Human Behavior, 73, 670-675.
Onarheim, B., & Christensen, B. T. (2012). Distributed idea screening in stage–gate
development processes. Journal of Engineering Design, 23(9), 660-673.

Osisanwo, F. Y., Akinsola, J. E. T., Awodele, O., Hinmikaiye, J. O., Olakanmi, O., & Akinjobi,
J. (2017). Supervised machine learning algorithms: classification and
comparison. International Journal of Computer Trends and Technology (IJCTT), 48(3),
128-138.

Owens, J. D. (2007). Why do some UK SMEs still find the implementation of a new product
development process problematical? Management Decision.

Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using
machine learning techniques. arXiv preprint cs/0205070.

Pantano, E., Giglio, S., & Dennis, C. (2019). Making sense of consumers’ tweets. International
Journal of Retail & Distribution Management.

Park, C., & Lee, T. M. (2009). Information direction, website reputation and eWOM effect: A
moderating role of product type. Journal of Business research, 62(1), 61-67.

Pennebaker, J. W., Boyd, R. L., Jordan, K., & Blackburn, K. (2015). The development and
psychometric properties of LIWC2015.

Prantl, D., & Mičík, M. (2019). Analysis of the significance of eWOM on social media for
companies.

Pribyl, J. R. (1994). Using surveys and questionnaires.


Rathore, A. K., & Ilavarasan, P. V. (2020). Pre-and post-launch emotions in new product
development: Insights from twitter analytics of three products. International Journal of
Information Management, 50, 111-127.

Rochford, L. (1991). Generating and screening new product ideas. Industrial Marketing
Management, 20(4), 287-296.

Rodríguez-Ferradas, M. I., & Alfaro-Tanco, J. A. (2016). Open innovation in automotive SMEs
suppliers: an opportunity for new product development. Universia Business Review, (50),
142-157.

Rong, X. (2014). word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.


Şahin, G. (2017, May). Turkish document classification based on Word2Vec and SVM
classifier. In 2017 25th Signal Processing and Communications Applications Conference
(SIU) (pp. 1-4). IEEE.

Schreiner, C., Torkkola, K., Gardner, M., & Zhang, K. (2006, October). Using machine learning
techniques to reduce data annotation time. In Proceedings of the Human Factors and
Ergonomics Society Annual Meeting (Vol. 50, No. 22, pp. 2438-2442). Sage CA: Los
Angeles, CA: SAGE Publications.

Mee, G. (2020, November 19). What is a good engagement rate on Twitter? Scrunch. Retrieved
from https://scrunch.com/blog/what-is-a-good-engagement-rate-on-twitter.

Small Business Wage Subsidy Program. Department of Finance. (April, 2020). Retrieved from
https://sites.google.com/dof.gov.ph/small-business-wage-subsidy

Soukhoroukova, A., Spann, M., & Skiera, B. (2012). Sourcing, filtering, and evaluating new
product ideas: An empirical exploration of the performance of idea markets. Journal of
product innovation management, 29(1), 100-112.

Sultana, M., Paul, P. P., & Gavrilova, M. (2016). Identifying users from online interactions in
Twitter. In Transactions on Computational Science XXVI (pp. 111-124). Springer,
Berlin, Heidelberg.

Takeuchi, H., & Nonaka, I. (1986). The new new product development game. Harvard Business
Review, 64(1), 137-146.

Tankova, H. (2021). Number of global social media network users 2017-2025. Statista.
Retrieved from
https://www.statista.com/statistics/278414/number-of-worldwide-social-network-users/
Trackmyhashtag. (2021). Twitter hashtag statistics. Retrieved from
https://www.trackmyhashtag.com/hashtag-stats
Tufekci, Z. (2014, May). Big questions for social media big data: Representativeness, validity
and other methodological pitfalls. In Proceedings of the International AAAI Conference on
Web and Social Media (Vol. 8, No. 1).

Vaishnav, M., Dalal, P. K., & Javed, A. (2020). When will the pandemic end?. Indian Journal of
Psychiatry, 62(Suppl 3), S330.

Vancic, A., & Pärson, G. F. A. (2020). Changed Buying Behavior in the COVID-19 pandemic:
the influence of Price Sensitivity and Perceived Quality.

Vapnik, V. N. (1995). The Nature of Statistical Learning Theory (2nd ed.). Springer Verlag, pp.
1-20. Retrieved from
https://www.andrew.cmu.edu/user/kk3n/simplicity/vapnik2000.pdf

Verworn, B., Herstatt, C., & Nagahira, A. (2006). The impact of the fuzzy front end on new
product development success in Japanese NPD projects (No. 39). Working Paper.

Villanueva, J. (2020). Rise of online shopping nationwide, not just from Metro. Philippine News
Agency. Retrieved from https://www.pna.gov.ph/articles/1112078

Wang, Z. Q., Sun, X., Zhang, D. X., & Li, X. (2006, August). An optimal SVM-based text
classification algorithm. In 2006 International Conference on Machine Learning and
Cybernetics (pp. 1378-1381). IEEE.

Weeg, C., Schwartz, H. A., Hill, S., Merchant, R. M., Arango, C., & Ungar, L. (2015). Using
Twitter to measure public discussion of diseases: a case study. JMIR public health and
surveillance, 1(1), e6.

Yin, Z., Fabbri, D., Rosenbloom, S. T., & Malin, B. (2015). A scalable framework to detect
personal health mentions on Twitter. Journal of medical Internet research, 17(6), e138.

Zhang, D., Xu, H., Su, Z., & Xu, Y. (2015). Chinese comments sentiment classification based on
word2vec and SVMperf. Expert Systems with Applications, 42(4), 1857-1863.

Zhang, S., Zhang, C., & Yang, Q. (2003). Data preparation for data mining. Applied artificial
intelligence, 17(5-6), 375-381.
Appendix A: List of Python packages

Methodology Package Usage


Building the model
Data collection twint ● To gather the dataset.
Data preprocessing pandas ● To read and convert the
collected tweets into a
dataframe.
● To reduce the rows and
columns of the dataframe.
● To discard the duplicates from
the nouns.
● To eliminate the short words
from the nouns.
● To verify the validity of the
extracted product ideas from
the tweets (data annotation).
regex, numpy ● To remove the mentions,
hashtags, emails, urls, html
tags, retweets from the tweets.
nltk, spacy, textblob ● To pull out the nouns from the
precleaned tweets.
preprocess_kgptalkie ● To convert the pulled-out
nouns to lower case
● To expand the contractions of
the pulled-out nouns.
● To remove the accented
characters, special characters,
and repeated letters from the
pulled-out nouns.
● To get the base forms of the
pulled-out nouns
(lemmatization).
spacy ● To remove the stopwords from
the pulled-out nouns.
vader ● To compute the polarity scores
of the precleaned tweets
(sentiment analysis).
Data preparation svc ● To create the model
spacy ● To get the vector
representations of the extracted
product ideas.
nearmiss ● To balance the dataset
train_test_split ● To divide the dataset into
training and testing sets
Parameter tuning randomizedsearchcv ● To get the value of the
parameters of the model
Performance evaluation cross_val_score, stratifiedkfold ● To compute the accuracy score
of the model.
confusion_matrix ● To examine the number of
correctly and incorrectly
classified product ideas.
classification_report ● To get the precision, recall,
and f1-scores of the model.
pickle ● To persist the model.
Implementation
flask ● To create the back-end of the
(flask, request, application.
render_template) ● To read the input.
● To display the error.
pickle ● To load the SVC model and
classify the input.
preprocess_kgptalkie, ● To convert the input into
re vector values.
spacy ● To preprocess and validate the
input
pandas ● To show the list of the good
product ideas in the dashboard.
googletrans ● To validate the language of the
input and convert non-English
input to English.
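The modeling stages listed above (vectorization, train/test splitting, SVC fitting, and scoring) can be sketched with scikit-learn alone. TF-IDF stands in here for the spaCy word vectors used in the actual build, and the labeled ideas are illustrative placeholders, not the study's annotated dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Illustrative candidate product ideas labeled 1 (good) or 0 (not viable).
ideas = ["milk tea business", "coffee shop", "random rant about traffic",
         "online fashion boutique", "weather complaint", "food cart franchise",
         "celebrity gossip thread", "laundry shop"] * 5
labels = [1, 1, 0, 1, 0, 1, 0, 1] * 5

# Vectorize the text (the capstone uses spaCy vectors; TF-IDF is a stand-in).
X = TfidfVectorizer().fit_transform(ideas)

# Divide the dataset into training and testing sets, then fit the SVC model.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels)
model = SVC(kernel="linear").fit(X_train, y_train)

acc = accuracy_score(y_test, model.predict(X_test))
print(acc)
```

The fitted model object is what the pickle package then persists for the Flask application to load at classification time.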

Appendix B: Sample code: data collection


This screenshot from the Ubuntu terminal shows the command used to collect tweets posted in
September 2020 within the Philippines using Twint. twint indicates that a Twint command is
being run. -g <latitude,longitude,radius> stands for geography; it specifies the
geographical location of the tweets using three values: latitude, longitude, and radius.
--since <date> and --until <date> indicate the date range of the tweets to be
collected. -o <filename> tells Twint to save the retrieved tweets in a file with the specified
filename, and --csv saves the output in CSV format. Lastly, --count reports the number of
scraped tweets.
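Written out, the command in the screenshot takes roughly the following form; the output filename is illustrative, and the coordinates match the geo values shown in the sample tweets of Appendix U:

```shell
# Scrape tweets geotagged within 768.54 km of the Philippines' geographical
# center for September 2020 and save them as CSV.
twint -g="12.072862,122.664139,768.54km" \
      --since "2020-09-01" --until "2020-09-30" \
      -o tweets_sep2020.csv --csv --count
```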

Appendix C: Geographical sources of the tweets


This screenshot from mapdevelopers.com shows the geographical sources of the collected
tweets. The center of the circle is located in Marinduque, which is known as the geographical
center of the Philippines. From Marinduque, tweets sent within a 768.54 km radius were
scraped; this radius was chosen to cover the entire Philippines.
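As a rough sanity check on that radius, the great-circle distance from the Marinduque center point (12.072862, 122.664139) to any city can be computed with the haversine formula; the Davao City coordinates below are approximate:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Scraping center (Marinduque) and an approximate location of Davao City.
center = (12.072862, 122.664139)
davao = (7.19, 125.45)

d = haversine_km(*center, *davao)
print(round(d, 1))
```

Davao City, in southeastern Mindanao, comes out at roughly 620-630 km from the center, comfortably inside the 768.54 km scraping radius.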

Appendix D: Source code: Data reduction


Appendix E: Source code: data precleaning
Appendix F: Source code: noun extraction
Appendix G: Source code: data cleaning
Appendix H: List of Tagalog and English stopwords
Tagalog: akin, aking, ako, alin, am, amin, aming, ang, ano, anumang, apat, at, atin, ating, ay, bababa, bago, bakit, bawat, bilang, dahil, dalawa, dapat, din, dito, doon, gagawin, gayunman, ginagawa, ginawa, ginawang, gumawa, gusto, habang, hanggang, hindi, huwag, iba, ibaba, ibabaw, ibig, ikaw, ilagay, ilalim, ilan, inyong, isa, isang, itaas, ito, iyo, iyon, iyong, ka, kahit, kailangan, kailanman, kami, kanila, kanilang, kanino, kanya, kanyang, kapag, kapwa, karamihan, katiyakan, katulad, kaya, kaysa, ko, kong, kulang, kumuha, kung, laban, lahat, lamang, likod, lima, maaari, maaaring, maging, mahusay, makita, marami, marapat, masyado, may, mayroon, mga, minsan, mismo, mula, muli, na, nabanggit, naging, nagkaroon, nais, nakita, namin, napaka, narito, nasaan, ng, ngayon, ni, nila, nilang, nito, niya, niyang, noon, o, pa, paano, pababa, paggawa, pagitan, pagkakaroon, pagkatapos, palabas, pamamagitan, panahon, pangalawa, para, paraan, pareho, pataas, pero, pumunta, pumupunta, sa, saan, sabi, sabihin, sarili, sila, sino, siya, tatlo, tayo, tulad, tungkol, una, walang

English: a, about, above, after, again, against, all, also, am, an, and, any, are, aren't, as, at, be, because, been, before, being, below, between, both, but, by, can, can't, cannot, com, could, couldn't, did, didn't, do, does, doesn't, doing, don't, down, during, each, else, ever, few, for, from, further, get, had, hadn't, has, hasn't, have, haven't, having, he, he'd, he'll, he's, hence, her, here, here's, hers, herself, him, himself, his, how, how's, however, http, i, i'd, i'll, i'm, i've, if, in, into, is, isn't, it, it's, its, itself, just, k, let's, like, me, more, most, mustn't, my, myself, no, nor, not, of, off, on, once, only, or, other, otherwise, ought, our, ours, ourselves, out, over, own, r, same, shall, shan't, she, she'd, she'll, she's, should, shouldn't, since, so, some, such, than, that, that's, the, their, theirs, them, themselves, then, there, there's, therefore, these, they, they'd, they'll, they're, they've, this, those, through, to, too, under, until, up, very, was, wasn't, we, we'd, we'll, we're, we've, were, weren't, what, what's, when, when's, where, where's, which, while, who, who's, whom, why, why's, with, won't, would, wouldn't, www, you, you'd, you'll, you're, you've, your, yours, yourself, yourselves

Appendix I: Source code: stopwords removal


Appendix J: Source code: lemmatization
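A minimal sketch of the stopword-removal step (Appendix I) in plain Python is shown below; the actual build filters with spaCy and loads the full Tagalog and English lists from Appendix H, of which only a few entries are reproduced here:

```python
# A small subset of the Appendix H stopword lists (Tagalog and English);
# the real implementation loads the complete lists.
STOPWORDS = {"ang", "ng", "sa", "mga", "ito", "the", "a", "an", "and", "of"}

def remove_stopwords(nouns):
    """Keep only the nouns whose lowercase form is not a stopword."""
    return [n for n in nouns if n.lower() not in STOPWORDS]

print(remove_stopwords(["ang", "milk", "tea", "business", "the", "shop"]))
# ['milk', 'tea', 'business', 'shop']
```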
Appendix K: List of common product ideas for Philippines' SMEs
1 agricultural shop
2 appliance store
3 art accessories shop
4 baby products
5 baking supply store
6 ballet lesson business
7 bar business
8 barbeque stand/inihaw business
9 beauty products reselling business
10 bicycle shop business
11 bills payment business
12 book store
13 cake, dessert, and pastry business
14 call center business
15 candle shop
17 car parts and services
18 car rental business
20 car wash
21 cart parts and services
22 Catering/buffet business
23 cellphone loading business
24 chocolate/candy store
26 cigarette/vape shop
27 classic cheese burger
28 coffee shop
29 community store
30 construction supply
31 cooking equipment business
32 cosplay costume shop
33 crochet tutorial business
34 cryptocurrency
35 culinary class business
36 dance school business
37 delivery/courier services
38 dental clinic
39 department store
40 derma clinic
41 driving lesson business
42 drug store
43 event planning business
44 eyewear shop
45 face mask/face shield business
46 fireworks/firecrackers shop
47 flower shop
48 food cart franchise
49 funeral service
50 furniture shop
51 gas station
52 graphic design and video editing business
53 gun shop
54 hair and grooming home services
55 hardware store
56 home cleaning and repair services
57 home cooked meals
58 home essentials shop
59 home rental business
60 homemade beverage or palamig business
61 hygiene products
62 ice cream store
63 internet shop
64 piso wi-fi vending machine
65 jewelry shop
66 kitchen supply store
67 laundry shop
68 lingerie shop
69 liquor store
70 loan business
71 local grocery store business
72 lugaw/arroz caldo/goto business
73 mask face shield
74 meal plan and delivery
75 meat shop
76 medical supply store
77 medicine
78 milk tea business
79 money changer/pawn shop business
80 motorcycle shop
81 music store
82 online fashion boutique
83 online gambling business
84 online tutoring business
85 Organic/vegan shop
86 parking business
87 pawn shop
88 personal shopping services
89 personalized gift business
90 pest control services
92 pet shop
93 plant shop
94 plastic manufacture company
95 poultry shop
96 Recreational drug store
97 restaurant
98 school supply business
99 seafood shop
100 sex toy store
101 shoe business
102 souvenir shop
103 spa
104 sports store
105 swimming lesson business
106 t-shirt printing business
107 tailor shop
108 tattoo studio
109 tech and accessories shop
110 thrift shop
111 toy store
112 travel products and accessories
113 vegetables and fruits shop
114 vending machine
115 veterinary shop
116 vintage shop
117 vitamins and supplements reselling business
118 vlogging
119 water refilling station
120 web design business
121 workout equipment shop

Appendix L: Baker & Albaum (1986)’s criteria for screening new product ideas

Criterion Examples

Societal Factor ● legality: product liability


● safety: usage hazards
● environment impact: pollution potential
● societal impact: benefit to society
Business Risk Factor ● functional feasibility: work as intended
● production feasibility: technical feasible
● stage of development: prototype development
● investment costs: development costs
● payback period: time to recover investment
● profitability: profit potential
● marketing research: necessary market information
● research and development: production development
Demand Analysis ● potential market: size of total market size
● potential sales: economies of sale
● trend of demand: growth of demand
● stability of demand: demand fluctuation
● product life cycle: expected length of cycle
● product line potential
Market Acceptance Factor ● compatibility
● learning
● need
● dependence
● visibility
● promotion
● distribution
● service
Competitive Factor ● appearance
● function
● durability
● price
● existing competition
● new competition
● protection

Appendix M: Source code: screening product ideas


Appendix N: Source code: potential market
Appendix O: Source code: trend of demand
Appendix P: Source code: stability of demand
Appendix Q: Source code: market acceptance
Appendix R: Source code: data preparation
Appendix S: Source code: parameter tuning
Appendix T: Source code: performance evaluation
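Appendices S and T correspond to the parameter-tuning and performance-evaluation packages listed in Appendix A (RandomizedSearchCV, StratifiedKFold, cross_val_score, and confusion_matrix). A scikit-learn sketch on synthetic placeholder data, with a parameter distribution that is illustrative rather than the one used in the study, looks like:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import (RandomizedSearchCV, StratifiedKFold,
                                     cross_val_score)
from sklearn.svm import SVC

# Placeholder binary dataset standing in for the vectorized product ideas.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Randomized search over an illustrative SVC parameter distribution.
search = RandomizedSearchCV(
    SVC(),
    {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"], "gamma": ["scale", "auto"]},
    n_iter=5, cv=3, random_state=0)
search.fit(X, y)

# Stratified 5-fold cross-validated accuracy of the tuned model.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(search.best_estimator_, X, y, cv=cv)

# Confusion matrix (rows: actual classes, columns: predicted classes).
cm = confusion_matrix(y, search.best_estimator_.fit(X, y).predict(X))
print(search.best_params_, round(scores.mean(), 3), cm.tolist())
```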
Appendix U: Sample collected tweets

All five sample tweets were posted on December 31, 2020 between 23:59:55 and 23:59:58 PST (timezone +0800) and carry the geo value 12.072862,122.664139,768.536km. For every row, retweet is False, video is 0, and the place, urls, photos, cashtags, quote_url, thumbnail, near, source, user_rt_id, user_rt, retweet_id, retweet_date, translate, trans_src, and trans_dest columns are empty (NaN or []).

Tweet 1
id / conversation_id: 1344674669266411521
user: hanswee3 ("ʎnsɐʎʎnsuɐɥ", user_id 18521876)
tweet: Happy new year Philippines #HappyNewYear2021 (language: en)
replies: 0, retweets: 0, likes: 0, hashtags: ['happynewyear2021']
link: https://twitter.com/hanswee3/status/1344674669266411521

Tweet 2
id / conversation_id: 1344674665554472963
user: _rjtamayo ("rjrjrj", user_id 803573207324192774)
tweet: 🙃 (language: und)
replies: 1, retweets: 0, likes: 0
link: https://twitter.com/_rjtamayo/status/1344674665554472963

Tweet 3
id / conversation_id: 1344674664790908936
user: mty_bautista ("rawr", user_id 1339520174307770369)
tweet: Welcome 2021🥳❤️ (language: en)
replies: 0, retweets: 0, likes: 0
link: https://twitter.com/Mty_Bautista/status/1344674664790908936

Tweet 4
id: 1344674657962577921, conversation_id: 1344656853154856964 (reply to @mystogan4u)
user: missyleevixen ("Bearbie 💞", user_id 741689327873232896)
tweet: @mystogan4u pakita ka muna ng balls 😁 (language: tl)
replies: 1, retweets: 0, likes: 1
link: https://twitter.com/MissyLeeVixen/status/1344674657962577921

Tweet 5
id / conversation_id: 1344674657580929028
user: chinitalampas ("ᴮᴱ𝕔𝕙𝕚𝕟𝕚 𝕖𝕝𝕒𝕫𝕖𝕘𝕦𝕚-𝕥𝕒𝕝𝕒𝕞𝕡𝕒𝕤⁷", user_id 70416647)
tweet: Thank you @BTS_twt @BigHitEnt @TXT_members 💜 Happy New Year 🥳✨🥂 (language: en)
replies: 0, retweets: 0, likes: 2, mentions: bts_twt (방탄소년단), bighitent (bighit entertainment)
link: https://twitter.com/chinitalampas/status/1344674657580929028
Appendix V: List of all the columns and the number of null values

# Column Null

1 id 0
2 conversation_id 0
3 created_at 0
4 date 0
5 time 0
6 timezone 0
7 user_id 0
8 username 0
9 name 3,251
10 place 5,014,984
11 tweet 0
12 language 0
13 mentions 0
14 urls 0
15 photos 0
16 replies_count 0
17 retweets_count 0
18 likes_count 0
19 hashtags 0
20 cashtags 0
21 link 0
22 retweet 0
23 quote_url 4,552,147
24 video 0
25 thumbnail 4,556,780
26 near 5,193,417
27 geo 0
28 source 5,193,417
29 user_rt_id 5,193,417
30 user_rt 5,193,417
31 retweet_id 5,193,417
32 reply_to 0
33 retweet_date 5,193,417
34 translate 5,193,417
35 trans_src 5,193,417
36 trans_dest 5,193,417
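Null counts like those above come from pandas' isnull().sum(); the sketch below, on a toy frame whose column names echo a subset of the Twint output, shows how fully null columns can then be dropped during data reduction:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking a few Twint columns, with 'near' entirely null.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "tweet": ["Happy new year", "Welcome 2021", "🙃"],
    "near": [np.nan, np.nan, np.nan],
})

nulls = df.isnull().sum()          # per-column null counts, as in this appendix
reduced = df.drop(columns=nulls[nulls == len(df)].index)  # drop all-null columns
print(list(reduced.columns))
# ['id', 'tweet']
```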
Appendix W: Content Editing Certification

Appendix X: English Editing Certification

Curriculum Vitae

Landley Bernardo
Address: Baguio City, Philippines, 2600
Phone: +639752826318
Email: lmbernardo@slu.edu.ph

WORK
EXPERIENCE 02/2018 – 05/2018
Part-time faculty, Saint Louis University, Baguio City,
Philippines

08/2017 - 02/2018
Management Information Specialist, Martha Property
Management Inc., Baguio City, Philippines

EDUCATION 2014 - 2017


Bachelor's Degree in Computer Science, Saint Louis
University

ADDITIONAL Microsoft Office package: Microsoft Word, Excel, PowerPoint


SKILLS Video Editing: VDSC
Database operation: MySQL
Machine Learning: Python
Programming: Java, Python
Web technologies/framework: HTML, CSS, JS, Laravel
Front-end framework: Bootstrap

ACHIEVEMENTS Finalist in the Philippine Startup Challenge 2016

