
TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for

Multilingual Tweet Representations

Xinyang Zhang1*, Yury Malkov2, Omar Florez2, Serim Park2,
Brian McWilliams2, Jiawei Han1, Ahmed El-Kishky2
1 The University of Illinois at Urbana-Champaign, 2 Twitter Cortex
{xz43,hanj}@illinois.edu, {ymalkov,oflorez,serimp,brimcwilliams,aelkishky}@twitter.com
* Work was conducted while the author was working at Twitter.

arXiv:2209.07562v1 [cs.CL] 15 Sep 2022

Abstract

We present TwHIN-BERT, a multilingual language model trained on in-domain data from the popular social network Twitter. TwHIN-BERT differs from prior pre-trained language models as it is trained with not only text-based self-supervision, but also with a social objective based on the rich social engagements within a Twitter heterogeneous information network (TwHIN). Our model is trained on 7 billion Tweets covering over 100 distinct languages, providing a valuable representation to model short, noisy, user-generated text. We evaluate our model on a variety of multilingual social recommendation and semantic understanding tasks and demonstrate significant metric improvement over established pre-trained language models. We will freely open-source TwHIN-BERT and our curated hashtag prediction and social engagement benchmark datasets to the research community (https://github.com/xinyangz/TwHIN-BERT).

Figure 1: (a) This mock-up shows a short-text Tweet and social engagements such as Faves, Retweets, Replies, and Follows that create a social context for Tweets and signify Tweet appeal to engaging users. (b) Co-engagement is a strong indicator of Tweet similarity.

1 Introduction

The proliferation of pre-trained language models (PLMs) (Devlin et al., 2019; Conneau et al., 2020) based on the Transformer architecture (Vaswani et al., 2017) has pushed the state of the art across many tasks in natural language processing (NLP). As an application of transfer learning, these models are typically trained on massive text corpora and, when fine-tuned on downstream tasks, have demonstrated state-of-the-art performance.

Despite the success of PLMs in general-domain NLP, fewer attempts have been made in language model pre-training for user-generated text on social media. In this work, we pre-train a language model for Twitter, a prominent social media platform where users post short messages called Tweets. Tweets contain informal diction, abbreviations, emojis, and topical tokens such as hashtags. As a result, PLMs designed for general text corpora may struggle to understand Tweet semantics accurately. Existing works (Nguyen et al., 2020; Barbieri et al., 2021) on Twitter LM pre-training do not address these challenges and simply replicate general-domain pre-training on Twitter corpora.

A distinctive feature of Twitter social media is user interaction through Tweet engagements. As seen in Figure 1, when a user visits Twitter, in addition to posting Tweets, they can perform a variety of social actions such as "Favoriting", "Replying", and "Retweeting" Tweets. The wealth of such engagement information is invaluable to Tweet content understanding. For example, the post "bottom of the ninth, two outs, and down by one!!" would be connected to baseball topics by its co-engaged Tweets, such as "three strikes and you're out!!!". Without the social context, a conventional text-only PLM objective would struggle to build this connection. As an additional benefit, a socially-enriched language model will also vastly benefit common applications on social media,
such as social recommendations (Ying et al., 2018) and information diffusion prediction (Cheng et al., 2014; Sankar et al., 2020).

We introduce TwHIN-BERT, a multilingual language model for Twitter pre-trained with social engagements. The key idea of our method is to leverage socially similar Tweets for pre-training. Building on this idea, TwHIN-BERT has the following features. (1) We construct a Twitter Heterogeneous Information Network (TwHIN) (El-Kishky et al., 2022) to unify the multi-typed user engagement logs. Then, we run scalable embedding and approximate nearest neighbor search to sift through hundreds of billions of engagement records and mine socially similar Tweet pairs. (2) In conjunction with masked language modeling, we introduce a contrastive social objective that requires the model to tell whether a pair of Tweets is socially similar or not. Our model is trained on 7 billion Tweets from over 100 distinct languages, of which 1 billion have social engagement logs.

We evaluate the TwHIN-BERT model on both social recommendation and semantic understanding downstream evaluation tasks. To comprehensively evaluate on many languages, we curate two large-scale datasets: a social engagement prediction dataset focused on social aspects and a hashtag prediction dataset focused on language aspects. In addition to these two curated datasets, we also evaluate on established benchmark datasets to draw direct comparisons to other available pre-trained language models. TwHIN-BERT achieves state-of-the-art performance in our evaluations, with a major advantage in the social tasks.

In summary, our contributions are as follows:
• We build the first socially-enriched pre-trained language model for noisy user-generated text on Twitter.
• Our model is the strongest multilingual Twitter PLM so far, covering 100 distinct languages.
• Our model has a major advantage in capturing the social appeal of Tweets.
• We open-source TwHIN-BERT as well as two new Tweet benchmark datasets: (1) hashtag prediction and (2) social engagement prediction.

2 TwHIN-BERT

In this section, we outline how we construct training examples for our social objectives and subsequently train TwHIN-BERT with social and text pre-training objectives. As seen in Figure 2, we first construct and embed a user-Tweet engagement network. The resultant Tweet embeddings are then used to mine pairs of socially similar Tweets. These Tweet pairs and others are then used to pre-train TwHIN-BERT, which can then be fine-tuned for various downstream tasks.

2.1 Mining Socially Similar Tweets

With abundant social engagement logs, we (informally) define socially similar Tweets as Tweets that are co-engaged by a similar set of users. The challenge lies in how to implement this notion of social similarity by (1) fusing heterogeneous engagement types, such as "Favorite", "Reply", and "Retweet", and (2) efficiently mining similar Tweet pairs at the scale of billions.

To address these challenges, TwHIN-BERT first constructs a Twitter Heterogeneous Information Network (TwHIN) from the engagement logs, then runs a scalable heterogeneous network embedding method to capture co-engagement and map Tweets and users into a vector space. With this, social similarity translates to embedding space similarity. Subsequently, we mine similar Tweet pairs by distributed approximate nearest neighbor search on the embeddings.

2.1.1 Constructing TwHIN

Formally, our TwHIN is defined and constructed as follows:

Definition 1 (TwHIN) Our Twitter Heterogeneous Information Network is a directed bipartite graph G = (U, T, E, φ), where U is the set of user nodes, T is the set of Tweet nodes, and E ⊆ U × T is the set of engagement edges. φ : E → R is an edge type mapping function, and each edge e ∈ E belongs to a type of engagement in R.

Our curated TwHIN consists of approximately 200 million distinct users, 1 billion Tweets, and over 100 billion edges. We posit that our TwHIN encodes not only user preferences but also Tweet social appeal, and we perform scalable network embedding to derive a social similarity metric from TwHIN. The network embedding fuses the heterogeneous engagements into a unified vector space that is easy to operate on.
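To make Definition 1 concrete, the minimal sketch below collects raw engagement logs into the typed user-Tweet edge set of a TwHIN. The record layout, field names, and the build_twhin helper are illustrative assumptions for exposition, not the production pipeline.

```python
from collections import namedtuple

# Hypothetical engagement-log record: (user_id, tweet_id, engagement_type).
Engagement = namedtuple("Engagement", ["user_id", "tweet_id", "etype"])

ENGAGEMENT_TYPES = {"favorite", "reply", "retweet"}  # the relation set R

def build_twhin(logs):
    """Collect user nodes, Tweet nodes, and typed edges from raw engagement logs.

    Returns (users, tweets, edges), where edges is a set of
    (user_id, tweet_id, etype) triples, i.e. phi(e) is stored with each edge.
    """
    users, tweets, edges = set(), set(), set()
    for rec in logs:
        if rec.etype not in ENGAGEMENT_TYPES:
            continue  # skip engagement types outside R
        users.add(rec.user_id)
        tweets.add(rec.tweet_id)
        edges.add((rec.user_id, rec.tweet_id, rec.etype))
    return users, tweets, edges

# Toy usage with two engagement records on the same Tweet.
logs = [Engagement("u1", "t1", "favorite"), Engagement("u2", "t1", "retweet")]
users, tweets, edges = build_twhin(logs)
```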
Figure 2: We outline the end-to-end TwHIN-BERT process. This three-step process involves (1) mining socially similar Tweet pairs by embedding a Twitter Heterogeneous Information Network, (2) training TwHIN-BERT using a joint social and MLM objective, and finally (3) fine-tuning TwHIN-BERT on downstream tasks.

Figure 3: Twitter Heterogeneous Information Network (TwHIN) capturing social engagements between users and Tweets.

2.1.2 Embedding TwHIN Nodes

While our approach is agnostic to the exact methodology used to embed TwHIN, we follow the approach outlined in (El-Kishky et al., 2022; El-Kishky et al., 2022b). We perform training with
the TransE embedding objective (Bordes et al., 2013) to co-embed users and Tweets, using the PyTorch-BigGraph (Lerer et al., 2019) framework for scalability. Following previous approaches, we train for 10 epochs and perform negative sampling both uniformly and proportional to entity prevalence in TwHIN (Bordes et al., 2013; Lerer et al., 2019). Optimization is performed via Adagrad. Upon learning dense representations of nodes in TwHIN, we utilize the learned Tweet representations to mine socially similar Tweets.
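As a rough, single-machine illustration of this objective, the sketch below scores (user, engagement-type, Tweet) triples in TransE fashion and trains against uniformly sampled negative Tweets. It stands in for what PyTorch-BigGraph does at scale; the class name, dimensions, and exact loss form are simplified assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTransE(nn.Module):
    # Toy analogue of the TransE objective trained over TwHIN; sizes are made up.
    def __init__(self, n_users, n_tweets, n_relations, dim=256):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.tweet_emb = nn.Embedding(n_tweets, dim)
        self.rel_emb = nn.Embedding(n_relations, dim)

    def score(self, u, r, t):
        # TransE: a true (user, relation, Tweet) triple should satisfy u + r ≈ t,
        # so the score is the negative L2 distance (higher is better).
        return -(self.user_emb(u) + self.rel_emb(r) - self.tweet_emb(t)).norm(dim=-1)

def transe_loss(model, u, r, t, n_tweets, num_neg=10):
    # Logistic loss with uniformly sampled negative Tweets for each positive edge.
    pos = F.logsigmoid(model.score(u, r, t))
    neg_t = torch.randint(n_tweets, (u.size(0), num_neg))
    neg = F.logsigmoid(-model.score(u.unsqueeze(1), r.unsqueeze(1), neg_t))
    return -(pos.mean() + neg.mean())
```

Under these assumptions, the parameters would be optimized with torch.optim.Adagrad, mirroring the optimizer noted above.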
2.1.3 Mining Similar Tweet Pairs

Given the learned TwHIN Tweet embeddings, we seek to identify pairs of Tweets with similar social appeal, that is, Tweets that appeal to (i.e., are likely to be engaged with by) similar users. We will use these socially similar Tweet pairs as self-supervision when training TwHIN-BERT. To identify these pairs, we perform an approximate nearest neighbor (ANN) search in the TwHIN embedding space. To efficiently perform the search over 1B+ Tweets, we use the optimized FAISS toolkit (Johnson et al., 2019; https://github.com/facebookresearch/faiss) to create a compact index of Tweets keyed by their engagement-based TwHIN embeddings. As each Tweet embedding is 256-dimensional, storing billion-scale Tweet embeddings would require more than one TB of memory. To reduce the size of the index such that it can fit on a 16 A100 GPU node with each GPU possessing 40GB of memory, we apply product quantization (Jegou et al., 2010) to discretize and reduce the embedding size. The resultant index corresponds to OPQ64,IVF65536,PQ64 in the FAISS index factory terminology.

After creating the FAISS index and populating it with TwHIN Tweet embeddings, we search the index using Tweet embedding queries to find pairs of similar Tweets (t_i, t_j) such that t_i and t_j are close in the embedding space as defined by their cosine distance. To ensure high recall, we query the FAISS index with 2000 probes. Finally, we select the k closest Tweets defined by the cosine distance between the query Tweet and the retrieved Tweets' embeddings. These Tweet pairs are used in our social objective when pre-training TwHIN-BERT.
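A scaled-down FAISS sketch of this mining step is shown below. The toy index string and probe count are shrunk so the example runs quickly; the production settings above (OPQ64,IVF65536,PQ64 with 2000 probes) are noted in comments, and normalizing embeddings so that L2 ranking matches cosine ranking is our assumption about how the cosine criterion is realized.

```python
import faiss
import numpy as np

d = 256
# Assumed: xb holds the 256-d TwHIN Tweet embeddings; L2-normalizing them makes
# L2 nearest neighbors equivalent to cosine nearest neighbors.
xb = np.random.rand(100_000, d).astype("float32")
faiss.normalize_L2(xb)

# Toy factory string; the paper's billion-scale index is "OPQ64,IVF65536,PQ64".
index = faiss.index_factory(d, "OPQ64,IVF256,PQ64")
index.train(xb)                           # train OPQ rotation, IVF lists, and PQ codebooks
index.add(xb)

faiss.extract_index_ivf(index).nprobe = 32  # the paper probes 2000 of 65536 lists
k = 10
_, neighbors = index.search(xb, k + 1)      # +1 because each Tweet retrieves itself
pairs = [(i, int(j)) for i, row in enumerate(neighbors) for j in row if j != i]
```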
2.2 Pre-training Objectives

Given the mined socially similar Tweets, we describe our language model training process. To train TwHIN-BERT, we first run the Tweets through the language model and then train the model with a joint contrastive social loss and masked language model loss.

Tweet Encoding with LM. We use a Transformer language model to encode each Tweet. Similar to BERT (Devlin et al., 2019), given the tokenized text w_t = [w_1, w_2, ..., w_n] of a Tweet t, we add special tokens to mark the start and end of the Tweet: ŵ_t = [CLS] w_t [SEP]. As Tweets are usually shorter than the maximum sequence length of a language model, we group multiple Tweets and feed them together into the language model when possible. We then apply CLS-pooling, which takes the [CLS] token embedding of each Tweet. These Tweet embeddings are passed through an MLP projection head for the social loss computation.

$[e_{t_1}, e_{t_2}, \ldots] = \mathrm{Pool}(\mathrm{LM}([\hat{w}_{t_1}, \hat{w}_{t_2}, \ldots]))$   (1)

$z_t = \mathrm{MLP}(e_t)$   (2)
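The sketch below illustrates Equations (1) and (2) with a generic Hugging Face encoder standing in for TwHIN-BERT; for clarity it encodes each Tweet separately rather than packing several Tweets into one sequence, and the checkpoint name and the ReLU in the projection head are assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any BERT-style multilingual encoder works here; xlm-roberta-base is a stand-in.
tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")
proj = torch.nn.Sequential(          # [768, 768] projection head, as described above
    torch.nn.Linear(768, 768), torch.nn.ReLU(), torch.nn.Linear(768, 768))

tweets = ["bottom of the ninth, two outs, and down by one!!!",
          "three strikes and you're out!!!"]
batch = tok(tweets, padding=True, truncation=True, max_length=128, return_tensors="pt")
hidden = encoder(**batch).last_hidden_state   # (batch, seq_len, 768)
e = hidden[:, 0]                              # CLS-pooling: first-token embedding (Eq. 1)
z = proj(e)                                   # projection used by the social loss (Eq. 2)
```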
Contrastive Social Loss. We use a contrastive loss to allow our model to learn whether two Tweets are socially similar or not. For each batch of B socially similar Tweet pairs {(t_i, t_j)}_B, we compute the NT-Xent loss (Chen et al., 2020) with in-batch negatives:

$\mathcal{L}_{\mathrm{social}}(i, j) = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k \in N_B(i)} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$   (3)

Here, N_B(i) ranges over the (2B − 1) other Tweets in the batch, i.e., every Tweet other than t_i itself, so the 2B − 2 unpaired Tweets act as in-batch negatives for t_i. We use cosine similarity for the function sim(·, ·), and τ is the loss temperature.

Our pre-training objective is a combination of the contrastive social loss and the masked language model loss (Devlin et al., 2019):

$\mathcal{L} = \mathcal{L}_{\mathrm{social}} + \lambda \mathcal{L}_{\mathrm{MLM}}$   (4)

where λ is a hyperparameter that balances the social and language loss.
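A compact PyTorch rendering of Equation (3) with the in-batch negative construction described above might look as follows; the implementation details beyond the equation (self-masking, tensor layout) are our own.

```python
import torch
import torch.nn.functional as F

def social_nt_xent(z_a, z_b, tau=0.1):
    """NT-Xent over B socially similar pairs (Eq. 3); z_a[i] and z_b[i] are a positive pair."""
    B = z_a.size(0)
    z = F.normalize(torch.cat([z_a, z_b], dim=0), dim=-1)   # cosine similarity via dot products
    sim = z @ z.T / tau                                      # (2B, 2B) similarity matrix
    sim = sim.masked_fill(torch.eye(2 * B, dtype=torch.bool), float("-inf"))  # exclude self
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(B)])  # column index of each positive
    return F.cross_entropy(sim, targets)

# Joint objective of Eq. 4; lambda = 0.1 in the second training stage described below.
# loss = social_nt_xent(z_a, z_b) + 0.1 * mlm_loss
```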
2.3 Pre-training Setup

Model Architecture. We use the same Transformer architecture as BERT (Devlin et al., 2019) for our language model. We adopt the XLM-R (Conneau et al., 2020) tokenizer, which offers good capacity and coverage in all languages. The model has a vocabulary size of 250K. The max sequence length is set to 128 tokens. We use a [768, 768]-dimensional projection head for the contrastive loss. Note that although we have chosen this specific architecture, our social objective can be used in conjunction with a wide range of language model architectures.

Pre-training Data. We collect 7 billion Tweets in 100 languages from Jan. 2020 to Jun. 2022. Additionally, we collect 100 billion user-Tweet social engagement records covering 1 billion of our Tweets.

Training Procedure. Our training has two stages. In the first stage, we train the model from scratch using the 6 billion Tweets without user engagement. The model is trained for 500K steps on 16 Nvidia A100 GPUs (a2-megagpu-16g) with a total batch size of 6K. We set the peak learning rate to 2e-4 and use the first 30K steps for warm-up. In the second stage, the model is trained for another 500K steps on the 1 billion Tweets with the joint MLM and engagement loss. We set the peak learning rate to 1e-4, the contrastive loss temperature to 0.1, and the loss balancing factor λ to 0.1 for the second stage. We follow RoBERTa (Liu et al., 2019) for the rest of our hyperparameters. We use mixed precision during training. Overall pre-training takes approximately five days.

3 Experiments

In this section, we discuss baseline model specifications, the evaluation setup, and results from two families of downstream evaluation tasks.

3.1 Baselines

We evaluate TwHIN-BERT against the following four baselines:

Multilingual BERT: mBERT (Devlin et al., 2019) is a widely used pre-trained language model trained on a general-domain web corpus. We use the base-cased version of the model.

BERTweet: BERTweet (Nguyen et al., 2020) is the previous state-of-the-art monolingual language model trained on Tweets; however, this model is trained on only English Tweets.

XLM-T: XLM-T (Barbieri et al., 2021) is a multilingual language model based on XLM-R (Conneau et al., 2020) with continued training on a Tweet corpus.

TwHIN-BERT-MLM: This is an ablation of our model which has the same architecture as ours and is trained on the same corpus. However, this variant is trained with only the masked language model objective, without the additional social objective.

We utilize the base variant of each model (which has around 135M to 278M parameters depending on the size of the tokenizer) in our experiments.

3.2 Social Engagement Prediction

Our first benchmark task is social engagement prediction. This task aims to evaluate how well the pre-trained language models capture social aspects of user-generated text. In our task, we predict whether
users modeled via a user embedding vector will perform a certain social engagement on a given Tweet. We use different pre-trained language models to generate representations for Tweets, and then feed these representations into a simple prediction model alongside the corresponding user representation. The model is trained to predict whether a user will engage with a specific Tweet.

Table 1: Engagement prediction results on high-, mid-, and low-resource languages (MRR / HITS@10).

lang.           mBERT          XLM-T          TwHIN-BERT-MLM  TwHIN-BERT
English (en)    .0321 / .0710  .0670 / .1120  .0790 / .1454   .0810 / .1607
Japanese (ja)   .0144 / .0236  .0424 / .0924  .0578 / .1188   .1260 / .2370
Arabic (ar)     .0331 / .0724  .0638 / .1368  .0610 / .1401   .0915 / .2275
Greek (el)      .0264 / .0497  .0320 / .0575  .0326 / .0611   .0343 / .0617
Urdu (ur)       .0184 / .0400  .0175 / .0371  .0229 / .0467   .0265 / .0589
Dutch (nl)      .0318 / .0587  .0373 / .0706  .0433 / .0883   .0511 / .0963
Norwegian (no)  .0350 / .0696  .0654 / .1431  .0588 / .1279   .0774 / .1617
Danish (da)     .0477 / .1000  .0510 / .1117  .0564 / .1247   .0576 / .1187
Pashto (ps)     .0260 / .0524  .0334 / .0693  .0341 / .0678   .0456 / .0997
Average         .0294 / .0597  .0455 / .0923  .0495 / .1023   .0657 / .1358
Dataset. To curate our Tweet-Engagement dataset, we select the 50 most popular languages on Twitter and sample 10,000 (or all, if the total number is less than 10,000) Tweets for each language. All Tweets are available via the Twitter public API. We then collect the user-Tweet engagement records associated with these Tweets. There are, on average, 29K engagement records per language. We ensure that there is no overlap between the evaluation and pre-training datasets.

Each engagement record consists of a pre-trained 256-dimensional user embedding (El-Kishky et al., 2022) and a Tweet ID that indicates the user has engaged with the given Tweet. To ensure privacy, each user embedding appears only once; however, each Tweet may be engaged by multiple users. We split the Tweets into train, development, and test sets with a 0.8/0.1/0.1 ratio, and then collect the respective engagement records for each subset.

Prediction Model. Given a pre-trained language model, we use it to generate an embedding for each Tweet t given its content w_t: e_t = Pool(LM(w_t)). We apply the following pooling strategies to calculate the Tweet embedding from the language model. First, we take the [CLS] token embedding as the first part of the overall embedding. Then, we take the average token embedding of the non-special tokens as the second part. The two parts are concatenated to form the Combined embedding of a Tweet.

With the LM-derived Tweet embeddings, pre-trained user embeddings, and the user-Tweet engagement records, we build an engagement prediction model Θ = (W_t, W_u). Given a user u and a Tweet t, the model projects the user embedding e_u and the Tweet embedding e_t into the same space, and then calculates the probability of engagement:

$h_u = W_u^\top e_u, \quad h_t = W_t^\top e_t$

$P(t \mid u) = \sigma(h_u^\top h_t)$

The model is trained by optimizing a negative sampling loss on the training engagement records R. For each engagement pair (u, t) ∈ R, the loss is defined as:

$\log \sigma(h_u^\top h_t) + \mathbb{E}_{t' \sim P_n(R)} \log \sigma(-h_u^\top h_{t'})$

where P_n(R) is a negative sampling distribution. We use the frequency of each Tweet in R raised to the power of 3/4 for this distribution.

Our prediction model closely resembles classical link prediction models such as LINE (Tang et al., 2015b). We keep the model simple, making sure it will not overpower the language model embeddings.
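A minimal sketch of this prediction model, under the assumption of a 256-dimensional user embedding and a 768-dimensional [CLS]-pooled Tweet embedding, is shown below; the negative Tweets are assumed to be pre-sampled from P_n(R) outside the function.

```python
import torch
import torch.nn as nn

class EngagementScorer(nn.Module):
    # Theta = (W_t, W_u): project both embeddings into a shared space and score
    # an engagement with a dot product passed through a sigmoid.
    def __init__(self, user_dim=256, tweet_dim=768, hidden=128):
        super().__init__()
        self.W_u = nn.Linear(user_dim, hidden, bias=False)
        self.W_t = nn.Linear(tweet_dim, hidden, bias=False)

    def forward(self, e_u, e_t):
        h_u, h_t = self.W_u(e_u), self.W_t(e_t)
        return torch.sigmoid((h_u * h_t).sum(-1))   # P(t | u)

def negative_sampling_loss(model, e_u, e_t, neg_e_t):
    # One observed (u, t) pair plus negative Tweets of shape (batch, K, tweet_dim),
    # assumed drawn from Tweet frequency in R raised to the 3/4 power.
    pos = torch.log(model(e_u, e_t) + 1e-9)
    neg = torch.log(1 - model(e_u.unsqueeze(1), neg_e_t) + 1e-9).sum(1)
    return -(pos + neg).mean()
```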
Evaluation Setup and Metrics. We conduct a hyperparameter search on the English validation dataset and use these hyperparameters for the other languages. The prediction model projects user and Tweet embeddings to 128 dimensions. We set the batch size to 512 and the learning rate to 1e-3. The best model on the validation set is selected for test set evaluation.
In the test set, we pair each user with 1,000 Tweets. The user is engaged with one of the Tweets; the rest are randomly sampled negatives. The model ranks the Tweets by the predicted probability of engagement. We use mean reciprocal rank (MRR) and HITS@10 as evaluation metrics. We report average results from 20 runs with different initializations.
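For reference, both metrics can be computed from the per-user candidate scores as follows; the array shapes mirror the 1,000-candidate setup described above.

```python
import numpy as np

def mrr_and_hits_at_10(scores, true_idx):
    """scores: (num_users, 1000) predicted engagement probabilities per candidate Tweet;
    true_idx: index of the single engaged Tweet for each user."""
    # Rank of the true Tweet = 1 + number of candidates scored strictly higher.
    true_scores = scores[np.arange(len(scores)), true_idx][:, None]
    ranks = 1 + (scores > true_scores).sum(1)
    return (1.0 / ranks).mean(), (ranks <= 10).mean()

# Toy usage: 5 users, the engaged Tweet is candidate 0 for each.
scores = np.random.rand(5, 1000)
print(mrr_and_hits_at_10(scores, np.zeros(5, dtype=int)))
```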
Results. We show results for high-, mid-, and low-resource languages (determined by language frequency on Twitter) in Table 1. Our TwHIN-BERT model demonstrates significant improvement over the baselines on the social engagement task. Comparing our model to the ablation without the social loss, we can see that contrastive social pre-training provides a significant lift over MLM-only pre-training for social engagement prediction. Further analysis shows that TwHIN-BERT representations result in better predictive performance on higher-resource languages, where engagement data is more abundant during pre-training. Additionally, we observe that our method yields the most improvement when using the Combined [CLS] token and average non-special token embedding. We believe the [CLS] token embedding from our model captures the social aspects of a Tweet, while averaging the other token embeddings captures its semantic aspects. Naturally, utilizing both aspects is essential to better model a Tweet's appeal and a user's inclination to engage with it.

3.3 Tweet Classification

Our second collection of downstream tasks is Tweet classification. In these tasks, we take as input the Tweet text and predict discrete labels corresponding to the label space of each task.

Datasets. We curate a multilingual Tweet hashtag prediction dataset (available via the Twitter public API) to comprehensively cover the popular languages on Twitter. In addition, we evaluate on five external benchmark datasets for tasks such as sentiment classification, emoji prediction, and topic classification in selected languages. We show the dataset statistics in Table 2.

Table 2: Text classification dataset statistics. *Statistics for Hashtag show the numbers for each language.

Dataset     Lang.   Labels  Train     Dev     Test
SE2017      en      3       45,389    2,000   11,906
SE2018-en   en      20      45,000    5,000   50,000
SE2018-es   es      19      96,142    2,726   9,970
ASAD        ar      3       137,432   15,153  16,842
COVID-JA    ja      6       147,806   16,394  16,394
SE2020-hi   hi+en   3       14,000    3,000   3,000
SE2020-es   es+en   3       10,800    1,200   3,000
Hashtag     multi   500*    16,000*   2,000*  2,000*

• Tweet Hashtag Prediction is a multilingual hashtag prediction dataset we collected from Tweets. It contains Tweets in the 50 most popular languages on Twitter. For each language, the 500 most popular hashtags were selected and 100K Tweets containing those hashtags were sampled. We made sure each Tweet contains only one of the 500 candidate hashtags. Similar to the work proposed in Mireshghallah et al. (2022), the task is to predict the hashtag used in the Tweet.
• SemEval2017 task 4A (Rosenthal et al., 2019) is an English Tweet sentiment analysis dataset. The labels are three-point sentiments: "positive", "negative", and "neutral".
• SemEval2018 task 2 (Barbieri et al., 2018) is an emoji prediction dataset in both English and Spanish. The objective is to predict the emoji most likely used in a Tweet.
• ASAD (Alharbi et al., 2020) is an Arabic Tweet sentiment dataset with three-point labels.
• COVID-JA (Suzuki, 2019) is a Japanese Tweet topic classification dataset. The objective is to classify each Tweet into one of six pre-defined topics around COVID-19.
• SemEval2020 task 9 (Patwa et al., 2020) contains code-mixed Tweets in Hindi+English and Spanish+English. We use the three-point sentiment analysis part of the dataset for evaluation.
Setup and Evaluation Metrics. We use the standard language model fine-tuning method described in (Devlin et al., 2019) and apply a linear prediction layer on top of the pooled output of the last Transformer layer. Each model is fine-tuned for 30 epochs, and we evaluate the best model from the training epochs on the test set based on development set performance. The learning rate is set to 5e-5. We report average results from 3 fine-tuning runs with different random seeds. Results use the evaluation metrics recommended for each benchmark dataset or challenge. For the hashtag prediction dataset, we report macro-F1 scores.
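A generic Hugging Face fine-tuning recipe matching this setup is sketched below; the checkpoint name, batch size, and dataset objects are placeholders rather than the authors' exact configuration.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder checkpoint; the released TwHIN-BERT checkpoint would be dropped in here.
tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=500)   # linear head on pooled output; 500 hashtag classes

args = TrainingArguments(
    output_dir="finetune-out",
    learning_rate=5e-5,                   # learning rate from the setup above
    num_train_epochs=30,                  # 30 fine-tuning epochs
    per_device_train_batch_size=32,       # batch size is an assumption, not stated above
)

# trainer = Trainer(model=model, args=args, train_dataset=train_ds,
#                   eval_dataset=dev_ds, tokenizer=tok)
# trainer.train()  # evaluate each epoch on dev_ds and keep the best checkpoint
```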
Multilingual Hashtag Prediction. In Table 3, we show macro-F1 scores on our multilingual hashtag prediction dataset. We can see that TwHIN-BERT significantly outperforms the multilingual baseline
methods. Again, we see a larger performance gain on higher-resource languages. On the English dataset, our model almost matches the BERTweet monolingual language model, which is trained exclusively on English Tweets and with a dedicated English tokenizer. Comparing TwHIN-BERT to the ablation TwHIN-BERT-MLM with no social loss, the two models demonstrate similar performance. These results show that while our model has a major advantage on social tasks, it retains high performance on semantic understanding applications.

Table 3: Multilingual hashtag prediction results on high-, mid-, and low-resource languages (macro-F1).

Method            en     ja     ar     el     ur     nl     no     da     ps     Avg.
mBERT             54.23  68.43  38.48  43.46  33.62  39.98  46.73  59.36  29.41  45.96
BERTweet          58.99  -      -      -      -      -      -      -      -      -
XLM-T             55.23  70.50  42.17  44.14  39.22  41.21  49.00  59.90  33.27  48.29
TwHIN-BERT-MLM    58.62  72.87  43.16  45.89  41.41  42.51  49.34  60.69  35.23  49.97
TwHIN-BERT        58.68  72.53  43.55  46.79  41.21  42.61  49.47  60.49  35.60  50.10
External Classification Benchmarks. As shown in Table 4, TwHIN-BERT matches or outperforms the multilingual baselines on the established classification benchmarks. The monolingual BERTweet has the advantage on the two English datasets due to its more capable English tokenizer and its large, English-only training corpus. Similar to hashtag prediction, TwHIN-BERT performs on par with the MLM-only ablation. These additional benchmark results show the versatility of TwHIN-BERT in language-based applications.

Table 4: External classification benchmark results.

                  SE2017  SE2018        ASAD   COVID-JA  SE2020          Avg.
Method            en      en     es     ar     ja        hi+en  es+en
mBERT             66.44   27.73  32.14  68.98  41.87     66.44  45.55    49.88
BERTweet          72.95   33.44  -      -      -         -      -        -
XLM-T             71.50   31.99  35.88  80.87  42.09     70.96  51.03    54.90
TwHIN-BERT-MLM    71.98   31.73  36.06  80.36  43.09     71.09  50.83    55.02
TwHIN-BERT        72.40   31.38  36.00  81.17  43.79     72.01  51.67    55.49
4 Related Works

Pre-trained Language Models: Since their introduction (Peters et al., 2018; Devlin et al., 2019), pre-trained language models have enjoyed tremendous success in all aspects of natural language processing. Follow-up research has advanced PLMs by further scaling them with respect to the number of parameters and the training data. As a result, PLMs have grown considerably in size, from millions of parameters (Devlin et al., 2019; Yang et al., 2019) to billions (Micheli and Fleuret, 2021; Raffel et al., 2020; Shoeybi et al., 2019) and even trillions (Fedus et al., 2021). Another avenue of improvement has been the training objectives used to train PLMs. A broad spectrum of pre-training objectives has been explored with different levels of success; notable examples include masked language modeling (Devlin et al., 2019), auto-regressive causal language modeling (Yang et al., 2019), model-based denoising (Clark et al., 2020), and corrective language modeling (Meng et al., 2021). Despite these innovations in scaling and pre-training objectives, the vast majority of this work has focused on text-only training objectives applied to general-domain corpora, e.g., Wikipedia and CommonCrawl. In this paper, we deviate from most previous works by exploring PLM training using solely Twitter in-domain data and training our model with both text-based and social-based objectives.

Enriching PLMs with Additional Information: Several existing works introduce additional information for language model pre-training. ERNIE (Zhang et al., 2019) and K-BERT (Liu et al., 2020) inject entities and their relations from
knowledge graphs to augment the pre-training corpus. OAG-BERT (Liu et al., 2022) appends the metadata of a document to its raw text and designs objectives to jointly predict text and metadata. These works focus on bringing in additional metadata and knowledge by injecting training instances, while our work leverages the rich social engagements embedded in the social media platform for text relevance.

Network Embedding: Network embedding has emerged as a valuable tool for transferring information from relational data to other tasks (El-Kishky et al., 2022a). Early network embedding methods such as DeepWalk (Perozzi et al., 2014) and node2vec (Grover and Leskovec, 2016) embed homogeneous graphs by performing random walks and applying SkipGram modeling. With the introduction of heterogeneous information networks (Shi et al., 2016; Sun and Han, 2013) as a formalism to model rich multi-typed, multi-relational networks, many heterogeneous network embedding approaches were developed (Chang et al., 2015; Tang et al., 2015a; Xu et al., 2017; Chen and Sun, 2017; Dong et al., 2017). However, many of these techniques are difficult to scale to very large networks. In this work, we apply knowledge graph embeddings (Wang et al., 2017; Bordes et al., 2013; Trouillon et al., 2016), which have been shown to be both highly scalable and flexible enough to model multiple node and edge types.

Tweet Language Models: While the majority of PLMs are trained on general-domain corpora, a few language models have been proposed specifically for Twitter and other social media platforms. BERTweet (Nguyen et al., 2020) mirrors BERT training on 850 million English Tweets. TimeLMs (Loureiro et al., 2022) trains a set of RoBERTa (Liu et al., 2019) models for English Tweets over different time ranges. XLM-T (Barbieri et al., 2021) continues the pre-training process from an XLM-R (Conneau et al., 2020) checkpoint on 198 million multilingual Tweets. These methods mostly replicate existing general-domain PLM methods and simply substitute the training data with Tweets. Our approach, in contrast, utilizes additional social engagement signals to enhance the pre-trained Tweet representations.

5 Conclusions

In this work we introduce TwHIN-BERT, a multilingual BERT-style language model trained on a large Tweet corpus. Unlike previous BERT-style language models, TwHIN-BERT is trained using two objectives: (1) a standard MLM pre-training objective and (2) a contrastive social objective. We utilize our pre-trained language model to perform a variety of downstream tasks on Tweet data. Our experiments demonstrate that TwHIN-BERT outperforms previously released language models not only on semantic tasks, but also on social engagement prediction tasks. We release this model to the academic community to further research in social media NLP.

References

Basma Alharbi, Hind Alamro, Manal Abdulaziz Alshehri, Zuhair Khayyat, Manal Kalkatawi, Inji Ibrahim Jaber, and Xiangliang Zhang. 2020. ASAD: A Twitter-based benchmark Arabic sentiment analysis dataset. ArXiv, abs/2011.00578.

Francesco Barbieri, Luis Espinosa Anke, and José Camacho-Collados. 2021. XLM-T: A multilingual language model toolkit for Twitter. ArXiv, abs/2104.12250.

Francesco Barbieri, Jose Camacho-Collados, Francesco Ronzano, Luis Espinosa-Anke, Miguel Ballesteros, Valerio Basile, Viviana Patti, and Horacio Saggion. 2018. SemEval 2018 task 2: Multilingual emoji prediction. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 24-33, New Orleans, Louisiana. Association for Computational Linguistics.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. Advances in Neural Information Processing Systems, 26.

S. Chang, W. Han, J. Tang, G. Qi, C. Aggarwal, and T. Huang. 2015. Heterogeneous network embedding via deep architectures. In SIGKDD, pages 119-128.

T. Chen and Y. Sun. 2017. Task-guided and path-augmented heterogeneous network embedding for author identification. In WSDM, pages 295-304.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1597-1607. PMLR.

Justin Cheng, Lada A. Adamic, P. Alex Dow, Jon M. Kleinberg, and Jure Leskovec. 2014. Can cascades be predicted? In 23rd International World Wide Web Conference, WWW '14, Seoul, Republic of Korea, April 7-11, 2014, pages 925-936. ACM.
Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In ACL.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171-4186. Association for Computational Linguistics.

Y. Dong, N. Chawla, and A. Swami. 2017. metapath2vec: Scalable representation learning for heterogeneous networks. In SIGKDD, pages 135-144.

Ahmed El-Kishky, Michael Bronstein, Ying Xiao, and Aria Haghighi. 2022a. Graph-based representation learning for web-scale recommender systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4784-4785.

Ahmed El-Kishky, Thomas Markovich, Kenny Leung, Frank Portman, and Aria Haghighi. 2022b. kNN-Embed: Locally smoothed embedding mixtures for multi-interest candidate retrieval. arXiv preprint arXiv:2205.06205.

Ahmed El-Kishky, Thomas Markovich, Serim Park, Chetan Verma, Baekjin Kim, Ramy Eskander, Yury Malkov, Frank Portman, Sofía Samaniego, Ying Xiao, and Aria Haghighi. 2022. TwHIN: Embedding the Twitter heterogeneous information network for personalized recommendation. In KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14-18, 2022, pages 2842-2850. ACM.

William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. CoRR, abs/2101.03961.

A. Grover and J. Leskovec. 2016. node2vec: Scalable feature learning for networks. In SIGKDD, pages 855-864.

Herve Jegou, Matthijs Douze, and Cordelia Schmid. 2010. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117-128.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535-547.

Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, and Alex Peysakhovich. 2019. PyTorch-BigGraph: A large-scale graph embedding system. arXiv preprint arXiv:1903.12287.

Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. 2020. K-BERT: Enabling language representation with knowledge graph. In AAAI.

Xiao Liu, Da Yin, Jingnan Zheng, Xingjian Zhang, P. Zhang, Hongxia Yang, Yuxiao Dong, and Jie Tang. 2022. OAG-BERT: Towards a unified backbone language model for academic knowledge services. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Daniel Loureiro, Francesco Barbieri, Leonardo Neves, Luis Espinosa Anke, and José Camacho-Collados. 2022. TimeLMs: Diachronic language models from Twitter. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, ACL 2022 - System Demonstrations, Dublin, Ireland, May 22-27, 2022, pages 251-260. Association for Computational Linguistics.

Yu Meng, Chenyan Xiong, Payal Bajaj, Saurabh Tiwary, Paul Bennett, Jiawei Han, and Xia Song. 2021. COCO-LM: Correcting and contrasting text sequences for language model pretraining. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 23102-23114.

Vincent Micheli and François Fleuret. 2021. Language models are few-shot butlers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 9312-9318. Association for Computational Linguistics.

Fatemehsadat Mireshghallah, Nikolai Vogler, Junxian He, Omar Florez, Ahmed El-Kishky, and Taylor Berg-Kirkpatrick. 2022. Non-parametric temporal adaptation for social media topic classification. arXiv preprint arXiv:2209.05706.

Dat Quoc Nguyen, Thanh Vu, and Anh Tuan Nguyen. 2020. BERTweet: A pre-trained language model for English tweets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2020 - Demos, Online, November 16-20, 2020, pages 9-14. Association for Computational Linguistics.
Parth Patwa, Gustavo Aguilar, Sudipta Kar, Suraj Pandey, Srinivas PYKL, Björn Gambäck, Tanmoy Chakraborty, Thamar Solorio, and Amitava Das. 2020. SemEval-2020 task 9: Overview of sentiment analysis of code-mixed tweets. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, SemEval@COLING 2020, Barcelona (online), December 12-13, 2020, pages 774-790. International Committee for Computational Linguistics.

B. Perozzi, R. Al-Rfou, and S. Skiena. 2014. DeepWalk: Online learning of social representations. In SIGKDD, pages 701-710.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2227-2237. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1-140:67.

Sara Rosenthal, Noura Farra, and Preslav Nakov. 2019. SemEval-2017 task 4: Sentiment analysis in Twitter. CoRR, abs/1912.00741.

Aravind Sankar, Xinyang Zhang, Adit Krishnan, and Jiawei Han. 2020. Inf-VAE: A variational autoencoder framework to integrate homophily and influence in diffusion prediction. In WSDM '20: The Thirteenth ACM International Conference on Web Search and Data Mining, Houston, TX, USA, February 3-7, 2020, pages 510-518. ACM.

C. Shi, Y. Li, J. Zhang, Y. Sun, and Y. Philip. 2016. A survey of heterogeneous information network analysis. TKDE, 29(1):17-37.

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053.

Y. Sun and J. Han. 2013. Mining heterogeneous information networks: a structural analysis approach. ACM SIGKDD Explorations Newsletter.

Yu Suzuki. 2019. Filtering method for Twitter streaming data using human-in-the-loop machine learning. J. Inf. Process., 27:404-410.

J. Tang, M. Qu, and Q. Mei. 2015a. PTE: Predictive text embedding through large-scale heterogeneous text networks. In SIGKDD, pages 1165-1174.

Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015b. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, WWW 2015, Florence, Italy, May 18-22, 2015, pages 1067-1077. ACM.

T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard. 2016. Complex embeddings for simple link prediction. In ICML, pages 2071-2080. PMLR.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Q. Wang, Z. Mao, B. Wang, and L. Guo. 2017. Knowledge graph embedding: A survey of approaches and applications. TKDE, 29(12):2724-2743.

L. Xu, X. Wei, J. Cao, and P. Yu. 2017. Embedding of embedding (EOE): Joint embedding for coupled heterogeneous networks. In WSDM, pages 741-749.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 5754-5764.

Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. 2018. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, pages 974-983. ACM.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28-August 2, 2019, Volume 1: Long Papers, pages 1441-1451. Association for Computational Linguistics.