
The following query retrieves each question's title and tags, along with the number of answers the question received:

SELECT
  title,
  answer_count,
  REPLACE(tags, "|", ",") AS tags
FROM
  `bigquery-public-data.stackoverflow.posts_questions`
WHERE
  REGEXP_CONTAINS(tags, r"(?:keras|matplotlib|pandas)")

The query results in the following output:

Row | title | answer_count | tags
1 | Building a new column in a pandas dataframe by matching string values in a list | 6 | python,python-2.7,pandas,replace,nested-loops
2 | Extracting specific selected columns to new DataFrame as a copy | 6 | python,pandas,chained-assignment
3 | Where do I call the BatchNormalization function in Keras? | 7 | python,keras,neural-network,data-science,batch-normalization
4 | Using Excel like solver in Python or SQL | 8 | python,sql,numpy,pandas,solver

When representing text using the bag of words (BOW) approach, we imagine each text input to our model as a bag of Scrabble tiles, with each tile containing a single word instead of a letter. BOW does not preserve the order of our text, but it does detect the presence or absence of certain words in each piece of text we send to our model. This approach is a type of multi-hot encoding where each text input is converted into an array of 1s and 0s. Each index in this BOW array corresponds to a word from our vocabulary.

HOW BAG OF WORDS WORKS


The first step in BOW encoding is choosing our vocabulary size, which will include the top N most
frequently occurring words in our text corpus. In theory, our vocabulary size could be equal to the number
of unique words in our entire dataset. However, this would lead to very large input arrays of mostly zeros,
since many words could be unique to a single question. Instead, we’ll want to choose a vocabulary size large enough to include the key, recurring words that convey meaning for our prediction task, but small enough that it isn’t filled with words that appear in only one or two questions, and not so limited that it contains only words found in nearly every question (like “the,” “is,” “and,” etc.).

Each input to our model will then be an array the size of our vocabulary. This BOW representation
therefore entirely disregards words that aren’t included in our vocabulary. There isn’t a magic number or
percentage for choosing vocabulary size—it’s helpful to try a few and see which performs best on our
model.

To understand BOW encoding, let’s first look at a simplified example. For this example, let’s say we’re
predicting the tag of a Stack Overflow question from a list of three possible tags: “pandas,” “keras,” and
“matplotlib.” To keep things simple, assume our vocabulary consists of only the 10 words listed below:

dataframe
layer
series
graph
column
plot
color
axes
read_csv
activation

This list is our word index, and every input we feed into our model will be a 10-element array where each
index corresponds with one of the words listed above. For example, a 1 in the first index of an input array
means a particular question contains the word dataframe. To understand BOW encoding from the
perspective of our model, imagine we’re learning a new language and the 10 words above are the only
words we know. Every “prediction” we make will be based solely on the presence or absence of these 10
words and will disregard any words outside this list.

Therefore, given the question title “How to plot dataframe bar graph,” how do we transform it into a BOW
representation? First, let’s take note of the words in this sentence that appear in our vocabulary: plot,
dataframe, and graph. The other words in this sentence will be ignored by the bag of words approach.
Using our word index above, this sentence becomes:

[ 1 0 0 1 0 1 0 0 0 0 ]

Note that the 1s in this array correspond with the indices of dataframe, graph, and plot, respectively. To
summarize, Figure 2-21 shows how we transformed our input from raw text to a BOW-encoded array
based on our vocabulary.
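To make that transformation concrete, here is a minimal sketch in plain Python that reproduces the encoding above. The bow_encode helper is an illustrative name of our own, not a library function:

# Multi-hot BOW encoding against the 10-word vocabulary from the text.
vocab = ["dataframe", "layer", "series", "graph", "column",
         "plot", "color", "axes", "read_csv", "activation"]

def bow_encode(title, vocab):
    """Return a multi-hot list: 1 if the vocabulary word appears in the title."""
    words = set(title.lower().split())
    return [1 if word in words else 0 for word in vocab]

print(bow_encode("How to plot dataframe bar graph", vocab))
# [1, 0, 0, 1, 0, 1, 0, 0, 0, 0]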

Keras has some utility methods for encoding text as a bag of words, so we don’t need to write the code for
identifying the top words from our text corpus and encoding raw text into multi-hot arrays from scratch.
Figure 2-21. Raw input text → identifying words present in this text from our vocabulary → transforming to a multi-hot BOW encoding.
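One such utility is the TextVectorization layer (older Keras versions expose a similar Tokenizer.texts_to_matrix(..., mode="binary") method). The sketch below is only an illustration: the tiny corpus is made up, and one slot of the output is typically reserved for out-of-vocabulary words.

import tensorflow as tf

# Toy corpus standing in for the Stack Overflow question titles.
titles = ["How to plot dataframe bar graph",
          "Keras batch normalization layer",
          "Read csv column into a pandas series"]

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=10,               # vocabulary size N
    output_mode="multi_hot")     # 1/0 per vocabulary word; word order is discarded
vectorizer.adapt(titles)         # builds the top-N vocabulary from the corpus

bow = vectorizer(tf.constant(["How to plot dataframe bar graph"]))
print(vectorizer.get_vocabulary())
print(bow.numpy())               # a (1, 10) multi-hot array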

Given that there are two different approaches for representing text
(Embedding and BOW), which approach should we choose for a given
task? As with many aspects of machine learning, this depends on our
dataset, the nature of our prediction task, and the type of model we’re
planning to use.

Embeddings add an extra layer to our model and provide extra information
about word meaning that is not available from the BOW encoding.
However, embeddings require training (unless we can use a pre-trained
embedding for our problem). While a deep learning model may achieve
higher accuracy, we can also try using BOW encoding in a linear
regression or decision-tree model using frameworks like scikit-learn or
XGBoost. Using BOW encoding with a simpler model type can be useful
for fast prototyping or to verify that the prediction task we’ve chosen will
work on our dataset. Unlike embeddings, BOW doesn’t take into account
the order or meaning of words in a text document. If either of these is
important to our prediction task, embeddings may be the best approach.
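As a rough illustration of the fast-prototyping option, the sketch below uses scikit-learn's CountVectorizer with binary=True for a multi-hot BOW encoding and logistic regression (the classification counterpart of the linear models mentioned above). The tiny dataset is made up purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy question titles and their tags.
titles = ["How to plot dataframe bar graph",
          "Where do I call the BatchNormalization function in Keras",
          "Extracting specific selected columns to new DataFrame as a copy",
          "Setting the color of axes in a matplotlib plot"]
tags = ["pandas", "keras", "pandas", "matplotlib"]

# binary=True gives presence/absence (multi-hot) rather than word counts.
model = make_pipeline(
    CountVectorizer(max_features=1000, binary=True),
    LogisticRegression(max_iter=1000))
model.fit(titles, tags)

print(model.predict(["plot a pandas dataframe column"]))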

There may also be benefits to building a deep model that combines both
bag of words and text embedding representations to extract more patterns
from our data. To do this, we can use the Multimodal Input approach,
except that instead of concatenating text and tabular features, we can
concatenate the Embedding and BOW representations (see code on GitHub; a rough sketch also follows the list below). Here, the shape of our Input layer would be the vocabulary size of the BOW representation. Some benefits of representing text in multiple ways include:

BOW encoding provides strong signals for the most significant words present in our vocabulary, while embeddings can identify relationships between words in a much larger vocabulary.
If we have text that switches between languages, we can build
embeddings (or BOW encodings) for each one and concatenate
them.
Embeddings can encode the frequency of words in text, whereas BOW treats the presence of each word as a Boolean value. Both
representations are valuable.
BOW encoding can identify patterns between reviews that all
contain the word “amazing,” while an embedding can learn to
correlate the phrase “not amazing” with a below-average review.
Again, both of these representations are valuable.
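To make the combined approach concrete, here is a minimal sketch (not the GitHub code referenced above) of a Keras model that concatenates a BOW input, whose shape is the vocabulary size, with a pooled embedding of the same titles. VOCAB_SIZE, MAX_LEN, EMBED_DIM, and the layer sizes are arbitrary assumptions:

import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 400    # BOW vocabulary size; also used for the embedding lookup
MAX_LEN = 30        # number of word indices kept per title for the embedding
EMBED_DIM = 16

# Input 1: multi-hot BOW array, shaped by the vocabulary size.
bow_input = layers.Input(shape=(VOCAB_SIZE,), name="bow")

# Input 2: a padded sequence of word indices for the embedding layer.
seq_input = layers.Input(shape=(MAX_LEN,), name="word_ids")
embedded = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(seq_input)
embedded = layers.GlobalAveragePooling1D()(embedded)

# Concatenate the two representations and classify into the three tags.
merged = layers.Concatenate()([bow_input, embedded])
hidden = layers.Dense(32, activation="relu")(merged)
output = layers.Dense(3, activation="softmax")(hidden)   # pandas / keras / matplotlib

model = tf.keras.Model(inputs=[bow_input, seq_input], outputs=output)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])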

Extracting tabular features from text

In addition to encoding raw text data, there are often other characteristics
of text that can be represented as tabular features. Let’s say we are
building a model to predict whether or not a Stack Overflow question will
get a response. Various characteristics of the text, unrelated to the exact words themselves, may be relevant to training a model on this task. For
example, maybe the length of a question or the presence of a question
mark influences the likelihood of an answer. However, when we create an
embedding, we usually truncate the words to a certain length. The actual
length of a question is lost in that data representation. Similarly,
punctuation is often removed. We can use the Multimodal Input design
pattern to bring back this lost information to the model.

In the following query, we’ll extract some tabular features from the
title field of the Stack Overflow dataset to predict whether or not a
question will get an answer:

SELECT
  LENGTH(title) AS title_len,
  ARRAY_LENGTH(SPLIT(title, " ")) AS word_count,
  ENDS_WITH(title, "?") AS ends_with_q_mark,
  IF(answer_count > 0, 1, 0) AS is_answered
FROM
  `bigquery-public-data.stackoverflow.posts_questions`

This results in the following:

Row | title_len | word_count | ends_with_q_mark | is_answered
1 | 84 | 14 | true | 0
2 | 104 | 16 | false | 0
3 | 85 | 19 | true | 1
4 | 88 | 14 | false | 1
5 | 17 | 3 | false | 1

In addition to these features extracted directly from a question’s title, we could also represent metadata about the question as features. For example, we could add features representing the number of tags the question had and the day of the week it was posted. We could then combine these tabular features with our encoded text and feed both representations into our model, using Keras’s Concatenate layer to combine the BOW-encoded text array with the tabular metadata describing our text.
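A minimal sketch of that combination might look like the following, where the three tabular features come from the query above (title_len, word_count, ends_with_q_mark) and VOCAB_SIZE is an arbitrary assumption:

import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 400

# Input 1: BOW-encoded question title.
bow_input = layers.Input(shape=(VOCAB_SIZE,), name="title_bow")
# Input 2: tabular features extracted from the title.
tabular_input = layers.Input(shape=(3,), name="title_tabular")

# Concatenate both representations and predict whether the question is answered.
combined = layers.Concatenate()([bow_input, tabular_input])
hidden = layers.Dense(32, activation="relu")(combined)
output = layers.Dense(1, activation="sigmoid")(hidden)

model = tf.keras.Model(inputs=[bow_input, tabular_input], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])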

MULTIMODAL REPRESENTATION OF IMAGES

Similar to our analysis of embeddings and BOW encoding for text, there
are many ways to represent image data when preparing it for an ML
model. Like raw text, images cannot be fed directly into a model and need
to be transformed into a numerical format that the model can understand.
We’ll start by discussing some common approaches to representing image
data: as pixel values, as sets of tiles, and as sets of windowed sequences.
The Multimodal Input design pattern provides a way to use more than one
representation of an image in our model.

Images as pixel values

At their core, images are arrays of pixel values. A black and white image,
for example, contains pixel values ranging from 0 to 255. We could
therefore represent a 28×28-pixel black-and-white image in a model as a
28×28 array with integer values ranging from 0 to 255. In this section,
we’ll be referencing the MNIST dataset, a popular ML dataset that
includes images of handwritten digits.
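As a quick illustration, the MNIST images that ship with Keras can be inspected directly as pixel arrays:

import tensorflow as tf

# Each MNIST image is a 28x28 array of integer pixel values in [0, 255].
(train_images, train_labels), _ = tf.keras.datasets.mnist.load_data()

print(train_images.shape)       # (60000, 28, 28)
print(train_images[0].dtype)    # uint8
print(train_images[0].min(), train_images[0].max())  # within the 0-255 range

# Models usually expect floats, so pixel values are often scaled to [0, 1].
scaled = train_images / 255.0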
