Large-Scale Multi-Label Text Classification

Author: Sayak Paul, Soumik Rakshit
Date created: 2020/09/25
Last modified: 2020/12/23
Description: Implementing a large-scale multi-label text classification model.

ⓘ This example uses Keras 2

View in Colab • GitHub source
Introduction

In this example, we will build a multi-label text classifier to predict the subject areas of arXiv papers
from their abstract bodies. This type of classifier can be useful for conference submission portals
like OpenReview. Given a paper abstract, the portal could provide suggestions for which areas the
paper would best belong to.

The dataset was collected using the arXiv Python library that provides a wrapper around the
original arXiv API. To learn more about the data collection process, please refer to this notebook.
You can also find the dataset on Kaggle.
Imports

from ast import literal_eval

import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

Perform exploratory data analysis

arxiv_data = pd.read_csv(
    "https://github.com/soumik12345/multi-label-text-classification/releases/download/v0.2/arxiv_data.csv"
)
arxiv_data.head()
Real-world data is noisy. One of the most commonly observed sources of noise is data duplication. Here we notice that our initial dataset has about 13k duplicate entries.

total_duplicate_titles = sum(arxiv_data["titles"].duplicated())
print(f"There are {total_duplicate_titles} duplicate titles.")

arxiv_data = arxiv_data[~arxiv_data["titles"].duplicated()]
print(f"There are {len(arxiv_data)} rows in the deduplicated dataset.")
As observed above, out of 3,157 unique combinations of terms, 2,321 entries have the lowest
occurrence. To prepare our train, validation, and test sets with stratification, we need to drop these
terms.
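The filtering step itself is also not shown here. A minimal sketch, assuming we keep only the term combinations that occur more than once, could look like this (the resulting dataframe shape is printed below):

# Hypothetical reconstruction: drop term combinations with only a single occurrence
# so that stratified splitting becomes possible.
arxiv_data_filtered = arxiv_data.groupby("terms").filter(lambda x: len(x) > 1)
arxiv_data_filtered.shape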
(36651, 3)
arxiv_data_filtered["terms"] = arxiv_data_filtered["terms"].apply(
lambda x: literal_eval(x)
)
arxiv_data_filtered["terms"].values[:5]
array([list(['cs.CV', 'cs.LG']), list(['cs.CV', 'cs.AI', 'cs.LG']),
list(['cs.CV', 'cs.AI']), list(['cs.CV']),
list(['cs.CV', 'cs.LG'])], dtype=object)
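Use stratified splits because of class imbalance

The split code that produces the train_df, val_df, and test_df frames used below is not included in this extract. A minimal sketch, assuming scikit-learn's train_test_split and a 10% test fraction (both assumptions), might look like:

from sklearn.model_selection import train_test_split

# Hypothetical reconstruction of the stratified split; the fractions are assumptions.
test_split = 0.1

# Initial train and test split, stratified on the label combinations.
train_df, test_df = train_test_split(
    arxiv_data_filtered,
    test_size=test_split,
    stratify=arxiv_data_filtered["terms"].values,
)

# Split the test set further into validation and new test sets.
val_df = test_df.sample(frac=0.5)
test_df.drop(val_df.index, inplace=True)

print(f"Number of rows in training set: {len(train_df)}")
print(f"Number of rows in validation set: {len(val_df)}")
print(f"Number of rows in test set: {len(test_df)}")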
Multi-label binarization
Now we preprocess our labels using the StringLookup layer.
terms = tf.ragged.constant(train_df["terms"].values)
lookup = tf.keras.layers.StringLookup(output_mode="multi_hot")
lookup.adapt(terms)
vocab = lookup.get_vocabulary()
def invert_multi_hot(encoded_labels):
    """Reverse a single multi-hot encoded label to a tuple of vocab terms."""
    hot_indices = np.argwhere(encoded_labels == 1.0)[..., 0]
    return np.take(vocab, hot_indices)
print("Vocabulary:\n")
print(vocab)
Vocabulary:
['[UNK]', 'cs.CV', 'cs.LG', 'stat.ML', 'cs.AI', 'eess.IV', 'cs.RO', 'cs.CL', 'cs.NE',
 'cs.CR', 'math.OC', 'eess.SP', 'cs.GR', 'cs.SI', 'cs.MM', 'cs.SY', 'cs.IR', 'cs.MA',
 'eess.SY', 'cs.HC', 'math.IT', 'cs.IT', 'cs.DC', 'cs.CY', 'stat.AP', 'stat.TH',
 'math.ST', 'stat.ME', 'eess.AS', 'cs.SD', 'q-bio.QM', 'q-bio.NC', 'cs.DS', 'cs.GT',
 'cs.CG', 'cs.SE', 'cs.NI', 'I.2.6', 'stat.CO', 'math.NA', 'cs.NA', 'physics.chem-ph',
 'cs.DB', 'q-bio.BM', 'cs.PL', 'cs.LO', 'cond-mat.dis-nn', '68T45', 'math.PR',
 'physics.comp-ph', 'I.2.10', 'cs.CE', 'cs.AR', 'q-fin.ST', 'cond-mat.stat-mech',
 '68T05', 'quant-ph', 'math.DS', 'physics.data-an', 'cs.CC', 'I.4.6', 'physics.soc-ph',
 'physics.ao-ph', 'cs.DM', 'econ.EM', 'q-bio.GN', 'physics.med-ph', 'astro-ph.IM',
 'I.4.8', 'math.AT', 'cs.PF', 'cs.FL', 'I.4', 'q-fin.TR', 'I.5.4', 'I.2', '68U10',
 'hep-ex', 'cond-mat.mtrl-sci', '68T10', 'physics.optics', 'physics.geo-ph',
 'physics.flu-dyn', 'math.CO', 'math.AP', 'I.4; I.5', 'I.4.9', 'I.2.6; I.2.8', '68T01',
 '65D19', 'q-fin.CP', 'nlin.CD', 'cs.MS', 'I.2.6; I.5.1', 'I.2.10; I.4; I.5',
 'I.2.0; I.2.6', '68T07', 'q-fin.GN', 'cs.SC', 'cs.ET', 'K.3.2', 'I.2.8', '68U01',
 '68T30', 'q-fin.EC', 'q-bio.MN', 'econ.GN', 'I.4.9; I.5.4', 'I.4.5', 'I.2; I.5',
 'I.2; I.4; I.5', 'I.2.6; I.2.7', 'I.2.10; I.4.8', '68T99', '68Q32', '68', '62H30',
 'q-fin.RM', 'q-fin.PM', 'q-bio.TO', 'q-bio.OT', 'physics.bio-ph', 'nlin.AO', 'math.LO',
 'math.FA', 'hep-ph', 'cond-mat.soft', 'I.4.6; I.4.8', 'I.4.4', 'I.4.3', 'I.4.0',
 'I.2; J.2', 'I.2; I.2.6; I.2.7', 'I.2.7', 'I.2.6; I.5.4', 'I.2.6; I.2.9',
 'I.2.6; I.2.7; H.3.1; H.3.3', 'I.2.6; I.2.10', 'I.2.6, I.5.4', 'I.2.1; J.3',
 'I.2.10; I.5.1; I.4.8', 'I.2.10; I.4.8; I.5.4', 'I.2.10; I.2.6', 'I.2.1',
 'H.3.1; I.2.6; I.2.7', 'H.3.1; H.3.3; I.2.6; I.2.7', 'G.3', 'F.2.2; I.2.7',
 'E.5; E.4; E.2; H.1.1; F.1.1; F.1.3', '68Txx', '62H99', '62H35',
 '14J60 (Primary) 14F05, 14J26 (Secondary)']

Here we separate the individual unique classes available in the label pool and then use this information to represent a given label set with 0s and 1s. Below is an example.
sample_label = train_df["terms"].iloc[0]
print(f"Original label: {sample_label}")
label_binarized = lookup([sample_label])
print(f"Label-binarized representation: {label_binarized}")
count 32985.000000
mean 156.497105
std 41.528225
min 5.000000
25% 128.000000
50% 154.000000
75% 183.000000
max 462.000000
Name: summaries, dtype: float64
Notice that 50% of the abstracts have a length of 154 (you may get a different number based on the
split). So, any number close to that value is a good enough approximation for the maximum sequence
length.
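The code that builds the tf.data.Dataset objects and previews a few samples (the "Dataset preview" step in the outline) is not shown in this extract. A minimal sketch, in which the batch size, the shuffle buffer, and the helper name make_dataset are assumptions, might be:

# Hypothetical reconstruction of the dataset pipeline; batch size and shuffling are assumptions.
batch_size = 128
auto = tf.data.AUTOTUNE


def make_dataset(dataframe, is_train=True):
    # Multi-hot encode the label lists with the adapted StringLookup layer.
    labels = tf.ragged.constant(dataframe["terms"].values)
    label_binarized = lookup(labels).numpy()
    dataset = tf.data.Dataset.from_tensor_slices(
        (dataframe["summaries"].values, label_binarized)
    )
    dataset = dataset.shuffle(batch_size * 10) if is_train else dataset
    return dataset.batch(batch_size)


train_dataset = make_dataset(train_df, is_train=True)
validation_dataset = make_dataset(val_df, is_train=False)
test_dataset = make_dataset(test_df, is_train=False)

# Preview a few raw samples; one such abstract and its labels are reproduced below.
text_batch, label_batch = next(iter(train_dataset))

for i, text in enumerate(text_batch[:5]):
    label = label_batch[i].numpy()[None, ...]
    print(f"Abstract: {text}")
    print(f"Label(s): {invert_multi_hot(label[0])}")
    print(" ")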
Abstract: b'Dense reconstructions often contain errors that prior work has so
far\nminimised using high quality sensors and regularising the output.
Nevertheless,\nerrors still persist. This paper proposes a machine learning technique
to\nidentify errors in three dimensional (3D) meshes. Beyond simply
identifying\nerrors, our method quantifies both the magnitude and the direction of
depth\nestimate errors when viewing the scene. This enables us to improve
the\nreconstruction accuracy.\n We train a suitably deep network architecture with two
3D meshes: a\nhigh-quality laser reconstruction, and a lower quality stereo
image\nreconstruction. The network predicts the amount of error in the lower
quality\nreconstruction with respect to the high-quality one, having only view
the\nformer through its input. We evaluate our approach by correcting\ntwo-dimensional
(2D) inverse-depth images extracted from the 3D model, and show\nthat our method
improves the quality of these depth reconstructions by up to a\nrelative 10% RMSE.'
Label(s): ['cs.CV' 'cs.RO']
Vectorization
Before we feed the data to our model, we need to vectorize it (represent it in a numerical form). For
that purpose, we will use the TextVectorization layer. It can operate as part of your main model so
that the preprocessing logic ships with the model. This greatly reduces the chances of
training/serving skew during inference.
We first calculate the number of unique words present in the abstracts.

# Source: https://stackoverflow.com/a/18937309/7636462
vocabulary = set()
train_df["summaries"].str.lower().str.split().apply(vocabulary.update)
vocabulary_size = len(vocabulary)
print(vocabulary_size)

153338

We now create our vectorization layer and map() it to the tf.data.Datasets created earlier.
text_vectorizer = layers.TextVectorization(
    max_tokens=vocabulary_size, ngrams=2, output_mode="tf_idf"
)

# `TextVectorization` layer needs to be adapted as per the vocabulary from our
# training set.
with tf.device("/CPU:0"):
    text_vectorizer.adapt(train_dataset.map(lambda text, label: text))

train_dataset = train_dataset.map(
    lambda text, label: (text_vectorizer(text), label), num_parallel_calls=auto
).prefetch(auto)
validation_dataset = validation_dataset.map(
    lambda text, label: (text_vectorizer(text), label), num_parallel_calls=auto
).prefetch(auto)
test_dataset = test_dataset.map(
    lambda text, label: (text_vectorizer(text), label), num_parallel_calls=auto
).prefetch(auto)
A batch of raw text will first go through the TextVectorization layer, which will generate their
numerical representations. Internally, the TextVectorization layer first creates bi-grams out of the
sequences and then represents them using TF-IDF. The output representations are then passed to
the shallow model responsible for text classification.

To learn more about other possible configurations of TextVectorization, please consult the official
documentation.
Note: Setting the max_tokens argument to a pre-calculated vocabulary size is not a requirement.
Create a text classification model

def make_model():
    shallow_mlp_model = keras.Sequential(
        [
            layers.Dense(512, activation="relu"),
            layers.Dense(256, activation="relu"),
            layers.Dense(lookup.vocabulary_size(), activation="sigmoid"),
        ]  # More on why "sigmoid" has been used here in a moment.
    )
    return shallow_mlp_model
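Train the model

The compilation and training code, along with the plot_result helper called below, is not included in this extract. A minimal sketch, assuming binary crossentropy loss, the Adam optimizer, 20 epochs, and a matplotlib-based plotting helper (all assumptions), could be:

import matplotlib.pyplot as plt

# Hypothetical reconstruction of the training step; loss, optimizer, and epoch count are assumptions.
epochs = 20

shallow_mlp_model = make_model()
shallow_mlp_model.compile(
    loss="binary_crossentropy", optimizer="adam", metrics=["binary_accuracy"]
)

history = shallow_mlp_model.fit(
    train_dataset, validation_data=validation_dataset, epochs=epochs
)


def plot_result(item):
    # Plot the training and validation curves for a given metric.
    plt.plot(history.history[item], label=item)
    plt.plot(history.history["val_" + item], label="val_" + item)
    plt.xlabel("Epochs")
    plt.ylabel(item)
    plt.title(f"Train and Validation {item} Over Epochs", fontsize=14)
    plt.legend()
    plt.grid()
    plt.show()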
plot_result("loss")
plot_result("binary_accuracy")
Epoch 1/20
258/258 [==============================] - 87s 332ms/step - loss: 0.0326 -
binary_accuracy: 0.9893 - val_loss: 0.0189 - val_binary_accuracy: 0.9943
Epoch 2/20
258/258 [==============================] - 100s 387ms/step - loss: 0.0033 -
binary_accuracy: 0.9990 - val_loss: 0.0271 - val_binary_accuracy: 0.9940
Epoch 3/20
258/258 [==============================] - 99s 384ms/step - loss: 7.8393e-04 -
binary_accuracy: 0.9999 - val_loss: 0.0328 - val_binary_accuracy: 0.9939
Epoch 4/20
258/258 [==============================] - 109s 421ms/step - loss: 3.0132e-04 -
binary_accuracy: 1.0000 - val_loss: 0.0366 - val_binary_accuracy: 0.9939
Epoch 5/20
258/258 [==============================] - 105s 405ms/step - loss: 1.6006e-04 -
binary_accuracy: 1.0000 - val_loss: 0.0399 - val_binary_accuracy: 0.9939
Epoch 6/20
258/258 [==============================] - 107s 414ms/step - loss: 1.2400e-04 -
binary_accuracy: 1.0000 - val_loss: 0.0412 - val_binary_accuracy: 0.9939
Epoch 7/20
258/258 [==============================] - 110s 425ms/step - loss: 7.7131e-05 -
binary_accuracy: 1.0000 - val_loss: 0.0439 - val_binary_accuracy: 0.9940
Epoch 8/20
258/258 [==============================] - 105s 405ms/step - loss: 5.5611e-05 -
binary_accuracy: 1.0000 - val_loss: 0.0446 - val_binary_accuracy: 0.9940
Epoch 9/20
258/258 [==============================] - 103s 397ms/step - loss: 4.5994e-05 -
binary_accuracy: 1.0000 - val_loss: 0.0454 - val_binary_accuracy: 0.9940
Epoch 10/20
258/258 [==============================] - 105s 405ms/step - loss: 3.5126e-05 -
binary_accuracy: 1.0000 - val_loss: 0.0472 - val_binary_accuracy: 0.9939
Epoch 11/20
258/258 [==============================] - 109s 422ms/step - loss: 2.9927e-05 -
binary_accuracy: 1.0000 - val_loss: 0.0466 - val_binary_accuracy: 0.9940
Epoch 12/20
258/258 [==============================] - 133s 516ms/step - loss: 2.5748e-05 -
binary_accuracy: 1.0000 - val_loss: 0.0484 - val_binary_accuracy: 0.9940
Epoch 13/20
258/258 [==============================] - 129s 497ms/step - loss: 4.3529e-05 -
binary_accuracy: 1.0000 - val_loss: 0.0500 - val_binary_accuracy: 0.9940
Epoch 14/20
258/258 [==============================] - 158s 611ms/step - loss: 8.1068e-04 -
binary_accuracy: 0.9998 - val_loss: 0.0377 - val_binary_accuracy: 0.9936
Epoch 15/20
258/258 [==============================] - 144s 558ms/step - loss: 0.0016 -
binary_accuracy: 0.9995 - val_loss: 0.0418 - val_binary_accuracy: 0.9935
Epoch 16/20
258/258 [==============================] - 131s 506ms/step - loss: 0.0018 -
binary_accuracy: 0.9995 - val_loss: 0.0479 - val_binary_accuracy: 0.9931
Epoch 17/20
258/258 [==============================] - 127s 491ms/step - loss: 0.0012 -
binary_accuracy: 0.9997 - val_loss: 0.0521 - val_binary_accuracy: 0.9931
Epoch 18/20
258/258 [==============================] - 153s 594ms/step - loss: 6.3144e-04 -
binary_accuracy: 0.9998 - val_loss: 0.0549 - val_binary_accuracy: 0.9934
Epoch 19/20
258/258 [==============================] - 142s 550ms/step - loss: 3.1753e-04 -
binary_accuracy: 0.9999 - val_loss: 0.0589 - val_binary_accuracy: 0.9934
Epoch 20/20
258/258 [==============================] - 153s 594ms/step - loss: 2.0258e-04 -
binary_accuracy: 1.0000 - val_loss: 0.0585 - val_binary_accuracy: 0.9933
While training, we notice an initial sharp fall in the loss followed by a gradual decay.
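Evaluate the model

The evaluation step on the test set is also missing from this extract; a minimal sketch might be:

# Hypothetical reconstruction of the evaluation step on the held-out test set.
_, binary_acc = shallow_mlp_model.evaluate(test_dataset)
print(f"Binary accuracy on the test set: {round(binary_acc * 100, 2)}%.")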
Inference
An important feature of the preprocessing layers provided by Keras is that they can be included
inside a tf.keras.Model. We will export an inference model by including the text_vectorizer
layer on top of shallow_mlp_model. This will allow our inference model to directly operate on raw
strings.
Note that during training it is always preferable to use these preprocessing layers as part of the
data input pipeline rather than inside the model, to avoid creating bottlenecks for the hardware
accelerators. This also allows for asynchronous data processing.
# Create a model for inference.
model_for_inference = keras.Sequential([text_vectorizer, shallow_mlp_model])
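The loop that produced the sample predictions below is not shown in this extract. A minimal sketch, reusing the hypothetical make_dataset helper sketched earlier and reporting the three highest-probability labels (both assumptions), might be:

# Hypothetical reconstruction of the inference loop; the sample size and top-3 choice are assumptions.
inference_dataset = make_dataset(test_df.sample(100), is_train=False)
text_batch, label_batch = next(iter(inference_dataset))
predicted_probabilities = model_for_inference.predict(text_batch)

for i, text in enumerate(text_batch[:5]):
    label = label_batch[i].numpy()[None, ...]
    print(f"Abstract: {text}")
    print(f"Label(s): {invert_multi_hot(label[0])}")
    # Sort the vocabulary by predicted probability and keep the top 3 labels.
    top_3_labels = [
        x
        for _, x in sorted(
            zip(predicted_probabilities[i], lookup.get_vocabulary()),
            key=lambda pair: pair[0],
            reverse=True,
        )
    ][:3]
    print(f"Predicted Label(s): ({', '.join(top_3_labels)})")
    print(" ")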
Abstract: b'We propose to jointly learn multi-view geometry and warping between views
of\nthe same object instances for robust cross-view object detection. What
makes\nmulti-view object instance detection difficult are strong changes in
viewpoint,\nlighting conditions, high similarity of neighbouring objects, and
strong\nvariability in scale. By turning object detection and instance\nre-
identification in different views into a joint learning task, we are able
to\nincorporate both image appearance and geometric soft constraints into a
single,\nmulti-view detection process that is learnable end-to-end. We validate
our\nmethod on a new, large data set of street-level panoramas of urban objects
and\nshow superior performance compared to various baselines. Our contribution
is\nthreefold: a large-scale, publicly available data set for multi-view
instance\ndetection and re-identification; an annotation tool custom-tailored
for\nmulti-view instance detection; and a novel, holistic multi-view
instance\ndetection and re-identification method that jointly models geometry
and\nappearance across views.'
Label(s): ['cs.CV' 'cs.LG' 'stat.ML']
Predicted Label(s): (cs.CV, cs.RO, cs.MM)
Abstract: b'Recurrent meta reinforcement learning (meta-RL) agents are agents that
employ\na recurrent neural network (RNN) for the purpose of "learning a
learning\nalgorithm". After being trained on a pre-specified task distribution,
the\nlearned weights of the agent\'s RNN are said to implement an efficient
learning\nalgorithm through their activity dynamics, which allows the agent to
quickly\nsolve new tasks sampled from the same distribution. However, due to
the\nblack-box nature of these agents, the way in which they work is not yet
fully\nunderstood. In this study, we shed light on the internal working mechanisms
of\nthese agents by reformulating the meta-RL problem using the Partially\nObservable
Markov Decision Process (POMDP) framework. We hypothesize that the\nlearned activity
dynamics is acting as belief states for such agents. Several\nillustrative experiments
suggest that this hypothesis is true, and that\nrecurrent meta-RL agents can be viewed
as agents that learn to act optimally in\npartially observable environments consisting
of multiple related tasks. This\nview helps in understanding their failure cases and
some interesting\nmodel-based results reported in the literature.'
Label(s): ['cs.LG' 'cs.AI']
Predicted Label(s): (stat.ML, cs.LG, cs.AI)
The prediction results are not that great but not below par for a simple model like ours. We can improve this performance with models that consider word order, like LSTMs, or even models that use Transformers (Vaswani et al.).

Acknowledgements

We would like to thank Matt Watson for helping us tackle the multi-label binarization part and inverse-transforming the processed labels to the original form.

Thanks to Cingis Kratochvil for suggesting and extending this code example with binary accuracy.