Figure 2: We outline the end-to-end TwHIN-BERT process. This three-step process involves (1) mining socially similar Tweet pairs by embedding a Twitter Heterogeneous Information Network, (2) training TwHIN-BERT using a joint social and MLM objective, and finally (3) fine-tuning TwHIN-BERT on downstream tasks.
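The joint objective in step (2) pairs the standard masked-language-modeling loss with a contrastive loss over socially similar Tweet pairs. A minimal sketch, assuming an InfoNCE-style contrastive term (the authors' exact formulation may differ; the function names and the weight `alpha` are ours, introduced only for illustration):

```python
import math

def info_nce(anchor, positive, negatives, temperature=0.1):
    # Contrastive social loss: pull a Tweet's projected embedding toward its
    # socially similar partner, push it away from other Tweets in the batch.
    # (Illustrative InfoNCE form; the paper's exact loss may differ.)
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    m = max(logits)  # max-shift for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]  # -log softmax probability of the positive pair

def joint_loss(mlm_loss, social_loss, alpha=1.0):
    # Step (2): masked language modeling plus the social objective.
    # The weight alpha is a hypothetical knob, not taken from the paper.
    return mlm_loss + alpha * social_loss
```

When the "negative" is identical to the positive pair, the loss degenerates to log 2, and it approaches zero as negatives move far from the anchor, which is the behavior a contrastive term needs.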
Model Architecture. We use the same Transformer architecture as BERT (Devlin et al., 2019) for our language model. We adopt the XLM-R (Conneau et al., 2020) tokenizer, which offers good capacity and coverage in all languages. The model has a vocabulary size of 250K. The max sequence length is set to 128 tokens. We use a [768, 768]-dimensional projection head for the contrastive loss. Note that although we have chosen this specific architecture, our social objective can be used in conjunction with a wide range of language model architectures.

Pre-training Data. We collect 7 billion Tweets in 100 languages from Jan. 2020 to Jun. 2022. Additionally, we collect 100 billion user-Tweet social engagement records covering 1 billion of our Tweets.

Training Procedure. Our training has two stages. In the first stage, we train the model from scratch

XLM-T: XLM-T (Barbieri et al., 2021) is a multilingual language model based on XLM-R (Conneau et al., 2020) with continued training on a Tweet corpus.

TwHIN-BERT-MLM: This is an ablation of our model which has the same architecture as ours and is trained on the same corpus. However, this variant is trained with only the masked language modeling objective, without the additional social objective.

We utilize the base variant of each model (which has around 135M to 278M parameters, depending on the size of the tokenizer) in our experiments.

3.2 Social Engagement Prediction

Our first benchmark task is social engagement prediction. This task aims to evaluate how well the pre-trained language models capture social aspects of user-generated text. In our task, we predict whether
users modeled via a user embedding vector will perform a certain social engagement on a given Tweet.

Table 1: Engagement prediction results on high-, mid-, and low-resource languages.

We use different pre-trained language models to generate representations for Tweets, and then feed these representations into a simple prediction model alongside the corresponding user representation. The model is trained to predict whether a user will engage with a specific Tweet.

Dataset. To curate our Tweet-Engagement dataset, we select the 50 most popular languages on Twitter and sample 10,000 Tweets of each language (or all of them if the total number is less than 10,000). All Tweets are available via the Twitter public API. We then collect the user-Tweet engagement records associated with these Tweets. There are, on average, 29K engagement records per language. We ensure that there is no overlap between the evaluation and pre-training datasets.

Each engagement record consists of a pre-trained 256-dimensional user embedding (El-Kishky et al., 2022) and a Tweet ID that indicates the user has engaged with the given Tweet. To ensure privacy, each user embedding appears only once; however, each Tweet may be engaged with by multiple users. We split the Tweets into train, development, and test sets with a 0.8/0.1/0.1 ratio, and then collect the respective engagement records for each subset.

Prediction Model. Given a pre-trained language model, we use it to generate an embedding for each Tweet $t$ from its content $w_t$: $e_t = \mathrm{Pool}(\mathrm{LM}(w_t))$. We apply the following pooling strategies to calculate the Tweet embedding from the language model. First, we take the [CLS] token embedding as the first part of the overall embedding. Then, we take the average token embedding of the non-special tokens as the second part. The two parts are concatenated to form the Combined embedding of a Tweet.

With LM-derived Tweet embeddings, pre-trained user embeddings, and the user-Tweet engagement records, we build an engagement prediction model $\Theta = (W_t, W_u)$. Given a user $u$ and a Tweet $t$, the model projects the user embedding $e_u$ and the Tweet embedding $e_t$ into the same space, and then calculates the probability of engagement:

$$h_u = W_u^\top e_u, \quad h_t = W_t^\top e_t$$
$$P(t \mid u) = \sigma(h_u^\top h_t)$$

The model is trained by optimizing a negative sampling loss on the training engagement records $R$. For each engagement pair $(u, t) \in R$, the loss is defined as:

$$\log \sigma(h_u^\top h_t) + \mathbb{E}_{t' \sim P_n(R)}\left[\log \sigma(-h_u^\top h_{t'})\right]$$

where $P_n(R)$ is a negative sampling distribution. We use the frequency of each Tweet in $R$ raised to the power of 3/4 for this distribution.

Our prediction model closely resembles classical link prediction models such as LINE (Tang et al., 2015b). We keep the model simple, making sure it will not overpower the language model embeddings.

Evaluation Setup and Metrics. We conduct a hyperparameter search on the English validation dataset and use these hyperparameters for the other languages. The prediction model projects user and Tweet embeddings to 128 dimensions. We set the batch size to 512 and the learning rate to 1e-3. The best model on the validation set is selected for test-set evaluation.
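The scoring function and loss above can be sketched in plain Python (dimensions shrunk for illustration; the helper names are ours, not from a released implementation):

```python
import math

def project(W, x):
    # h = W^T x: project an embedding into the shared low-dimensional space.
    return [sum(W[i][j] * x[i] for i in range(len(x))) for j in range(len(W[0]))]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def engagement_prob(W_u, W_t, e_u, e_t):
    # P(t | u) = sigma(h_u . h_t)
    h_u, h_t = project(W_u, e_u), project(W_t, e_t)
    return sigmoid(sum(a * b for a, b in zip(h_u, h_t)))

def neg_sampling_loss(W_u, W_t, e_u, e_t_pos, e_t_negs):
    # Negative-sampling objective for one engagement pair (u, t), negated so
    # that minimizing it maximizes
    #   log sigma(h_u . h_t) + sum_{t'} log sigma(-h_u . h_t')
    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))
    h_u = project(W_u, e_u)
    loss = -math.log(sigmoid(dot(h_u, project(W_t, e_t_pos))))
    for e_neg in e_t_negs:
        loss -= math.log(sigmoid(-dot(h_u, project(W_t, e_neg))))
    return loss

def noise_distribution(tweet_freqs):
    # P_n(R): each Tweet's frequency in R raised to the 3/4 power, normalized.
    weights = {t: f ** 0.75 for t, f in tweet_freqs.items()}
    z = sum(weights.values())
    return {t: w / z for t, w in weights.items()}
```

In training, negatives `e_t_negs` would be drawn from `noise_distribution` and the projections `W_u`, `W_t` learned by gradient descent; keeping the model to two linear projections, as the paper notes, prevents it from overpowering the language model embeddings being evaluated.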
Table 2: Text classification dataset statistics. ∗Statistics for Hashtag show the numbers for each language.

Dataset     Lang.   Label   Train     Dev      Test
SE2017      en      3       45,389    2,000    11,906
SE2018-en   en      20      45,000    5,000    50,000
SE2018-es   es      19      96,142    2,726    9,970
ASAD        ar      3       137,432   15,153   16,842
COVID-JA    ja      6       147,806   16,394   16,394
SE2020-hi   hi+en   3       14,000    3,000    3,000
SE2020-es   es+en   3       10,800    1,200    3,000
Hashtag     multi   500∗    16,000∗   2,000∗   2,000∗

In the test set, we pair each user with 1,000 Tweets. The user is engaged with one of the Tweets; the rest are randomly sampled negatives. The model ranks the Tweets by the predicted probability of engagement. We use mean reciprocal rank (MRR) and HITS@10 as evaluation metrics. We report average results from 20 runs with different initializations.

Results. We show results for high-, mid-, and low-resource languages (determined by language frequency on Twitter) in Table 1. Our TwHIN-BERT model demonstrates significant improvement over the baselines on the social engagement task. Comparing our model to the ablation without the social loss, we can see that the contrastive social pre-training provides a significant lift over MLM-only pre-training for social engagement prediction. Further analysis shows that TwHIN-BERT representations result in better predictive performance on higher-resource languages, where engagement data is more abundant during pre-training. Additionally, we observe that our method yields the most improvement when using the Combined embedding of the [CLS] token and the average non-special-token embedding. We believe the [CLS] token embedding from our model captures social aspects of the Tweet, while averaging the other token embeddings captures its semantic aspects. Naturally, utilizing both aspects is essential to better model a Tweet's appeal and a user's inclination to engage with it.

3.3 Tweet Classification

Our second collection of downstream tasks is Tweet classification. In these tasks, we take as input the Tweet text and predict discrete labels corresponding to the label space of each task.

Datasets. We curate a multilingual Tweet hashtag prediction dataset (available via the Twitter public API) to comprehensively cover the popular languages on Twitter. In addition, we evaluate on five external benchmark datasets for tasks such as sentiment classification, emoji prediction, and topic classification in selected languages. We show the dataset statistics in Table 2.

• Tweet Hashtag Prediction is a multilingual hashtag prediction dataset we collected from Tweets. It contains Tweets in the 50 most popular languages on Twitter. For each language, the 500 most popular hashtags were selected and 100k Tweets containing those hashtags were sampled. We made sure each Tweet contains only one of the 500 candidate hashtags. Similar to the work proposed in Mireshghallah et al. (2022), the task is to predict the hashtag used in the Tweet.

• SemEval2017 task 4A (Rosenthal et al., 2019) is an English Tweet sentiment analysis dataset. The labels are the three-point sentiments "positive", "negative", and "neutral".

• SemEval2018 task 2 (Barbieri et al., 2018) is an emoji prediction dataset in both English and Spanish. The objective is to predict the emoji most likely used in a Tweet.

• ASAD (Alharbi et al., 2020) is an Arabic Tweet sentiment dataset with three-point labels.

• COVID-JA (Suzuki, 2019) is a Japanese Tweet topic classification dataset. The objective is to classify each Tweet into one of six pre-defined topics around COVID-19.

• SemEval2020 task 9 (Patwa et al., 2020) contains code-mixed Tweets of Hindi+English and Spanish+English. We use the three-point sentiment analysis part of the dataset for evaluation.

Setup and Evaluation Metrics. We use the standard language model fine-tuning method described in (Devlin et al., 2019) and apply a linear prediction layer on top of the pooled output of the last transformer layer. Each model is fine-tuned for 30 epochs, and we evaluate the best model from the training epochs on the test set based on development-set performance. The learning rate is set to 5e-5. We report average results from 3 fine-tuning runs with different random seeds. Results are reported using the evaluation metrics recommended for each benchmark dataset or challenge. For hashtag prediction datasets, we report macro-F1 scores.

Multilingual Hashtag Prediction. In Table 3, we show macro-F1 scores on our multilingual hashtag prediction dataset. We can see that TwHIN-BERT significantly outperforms the multilingual baseline
methods. Again, we see a larger performance gain on higher-resource languages. On the English dataset, our model almost matches the BERTweet monolingual language model, which was trained exclusively on English Tweets and with a dedicated English tokenizer. Comparing TwHIN-BERT to the ablation TwHIN-BERT-MLM with no social loss, the two models demonstrate similar performance. These results show that while our model has a major advantage on social tasks, it retains high performance on semantic understanding applications.

Table 3: Multilingual hashtag prediction results on high-, mid-, and low-resource languages.

External Classification Benchmarks. As shown in Table 4, TwHIN-BERT matches or outperforms the multilingual baselines on the established classification benchmarks. The monolingual BERTweet has the advantage on the two English datasets due to its more capable English tokenizer and its large, English-only training corpora. Similar to hashtag prediction, TwHIN-BERT performs on par with the MLM-only ablation. These additional benchmark results show the versatility of TwHIN-BERT in language-based applications.

4 Related Works

Pre-trained Language Models: Since their introduction (Peters et al., 2018; Devlin et al., 2019), pre-trained language models have enjoyed tremendous success in all aspects of natural language processing. Follow-up research has advanced PLMs by further scaling them with respect to the number of parameters and the amount of training data. As a result, PLMs have grown considerably in size, from millions of parameters (Devlin et al., 2019; Yang et al., 2019) to billions (Micheli and Fleuret, 2021; Raffel et al., 2020; Shoeybi et al., 2019) and even trillions (Fedus et al., 2021). Another avenue of improvement has been the training objectives used to train PLMs. A broad spectrum of pre-training objectives has been explored with different levels of success; notable examples include masked language modeling (Devlin et al., 2019), auto-regressive causal language modeling (Yang et al., 2019), model-based denoising (Clark et al., 2020), and corrective language modeling (Meng et al., 2021). Despite these innovations in scaling and pre-training objectives, the vast majority of this work has focused on text-only training objectives applied to general-domain corpora, e.g., Wikipedia and CommonCrawl. In this paper, we deviate from most previous works by exploring PLM training using solely Twitter in-domain data and training our model with both text-based and social-based objectives.

Enriching PLMs with Additional Information: Several existing works introduce additional information for language model pre-training. ERNIE (Zhang et al., 2019) and K-BERT (Liu et al., 2020) inject entities and their relations from knowledge graphs to augment the pre-training corpus. OAG-BERT (Liu et al., 2022) appends a document's metadata to its raw text and designs objectives to jointly predict text and metadata. These works focus on bringing in additional metadata and knowledge by injecting training instances, while our work leverages the rich social engagements embedded in the social media platform for text relevance.

Network Embedding: Network embedding has emerged as a valuable tool for transferring information from relational data to other tasks (El-Kishky et al., 2022a). Early network embedding methods such as DeepWalk (Perozzi et al., 2014) and node2vec (Grover and Leskovec, 2016) embed homogeneous graphs by performing random walks and applying SkipGram modeling. With the introduction of heterogeneous information networks (Shi et al., 2016; Sun and Han, 2013) as a formalism to model rich multi-typed, multi-relational networks, many heterogeneous network embedding approaches were developed (Chang et al., 2015; Tang et al., 2015a; Xu et al., 2017; Chen and Sun, 2017; Dong et al., 2017). However, many of these techniques are difficult to scale to very large networks. In this work, we apply knowledge graph embeddings (Wang et al., 2017; Bordes et al., 2013; Trouillon et al., 2016), which have been shown to be both highly scalable and flexible enough to model multiple node and edge types.

Tweet Language Models: While a majority of PLMs are trained on general-domain corpora, a few language models have been proposed specifically for Twitter and other social media platforms. BERTweet (Nguyen et al., 2020) mirrors BERT training on 850 million English Tweets. TimeLMs (Loureiro et al., 2022) trains a set of RoBERTa (Liu et al., 2019) models for English Tweets over different time ranges. XLM-T (Barbieri et al., 2021) continues the pre-training process from an XLM-R (Conneau et al., 2020) checkpoint on 198 million multilingual Tweets. These methods mostly replicate existing general-domain PLM methods and simply substitute the training data with Tweets. Our approach, however, utilizes additional social engagement signals to enhance the pre-trained Tweet representations.

5 Conclusions

In this work we introduce TwHIN-BERT, a multilingual BERT-style language model trained on a large Tweet corpus. Unlike previous BERT-style language models, TwHIN-BERT is trained using two objectives: (1) a standard MLM pre-training objective and (2) a contrastive social objective. We utilize our pre-trained language model to perform a variety of downstream tasks on Tweet data. Our experiments demonstrate that TwHIN-BERT outperforms previously released language models not only on semantic tasks but also on social engagement prediction tasks. We release this model to the academic community to further research in social media NLP.

References

Basma Alharbi, Hind Alamro, Manal Abdulaziz Alshehri, Zuhair Khayyat, Manal Kalkatawi, Inji Ibrahim Jaber, and Xiangliang Zhang. 2020. ASAD: A Twitter-based benchmark Arabic sentiment analysis dataset. ArXiv, abs/2011.00578.

Francesco Barbieri, Luis Espinosa Anke, and José Camacho-Collados. 2021. XLM-T: A multilingual language model toolkit for Twitter. ArXiv, abs/2104.12250.

Francesco Barbieri, Jose Camacho-Collados, Francesco Ronzano, Luis Espinosa-Anke, Miguel Ballesteros, Valerio Basile, Viviana Patti, and Horacio Saggion. 2018. SemEval 2018 task 2: Multilingual emoji prediction. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 24–33, New Orleans, Louisiana. Association for Computational Linguistics.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. Advances in Neural Information Processing Systems, 26.

S. Chang, W. Han, J. Tang, G. Qi, C. Aggarwal, and T. Huang. 2015. Heterogeneous network embedding via deep architectures. In SIGKDD, pages 119–128.

T. Chen and Y. Sun. 2017. Task-guided and path-augmented heterogeneous network embedding for author identification. In WSDM, pages 295–304.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1597–1607. PMLR.

Justin Cheng, Lada A. Adamic, P. Alex Dow, Jon M. Kleinberg, and Jure Leskovec. 2014. Can cascades be predicted? In 23rd International World Wide Web Conference, WWW '14, Seoul, Republic of Korea, April 7-11, 2014, pages 925–936. ACM.
Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In ACL.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.

Y. Dong, N. Chawla, and A. Swami. 2017. metapath2vec: Scalable representation learning for heterogeneous networks. In SIGKDD, pages 135–144.

Ahmed El-Kishky, Michael Bronstein, Ying Xiao, and Aria Haghighi. 2022a. Graph-based representation learning for web-scale recommender systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4784–4785.

Ahmed El-Kishky, Thomas Markovich, Kenny Leung, Frank Portman, and Aria Haghighi. 2022b. kNN-Embed: Locally smoothed embedding mixtures for multi-interest candidate retrieval. arXiv preprint arXiv:2205.06205.

Ahmed El-Kishky, Thomas Markovich, Serim Park, Chetan Verma, Baekjin Kim, Ramy Eskander, Yury Malkov, Frank Portman, Sofía Samaniego, Ying Xiao, and Aria Haghighi. 2022. TwHIN: Embedding the Twitter heterogeneous information network for personalized recommendation. In KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14-18, 2022, pages 2842–2850. ACM.

William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. CoRR, abs/2101.03961.

A. Grover and J. Leskovec. 2016. node2vec: Scalable feature learning for networks. In SIGKDD, pages 855–864.

Herve Jegou, Matthijs Douze, and Cordelia Schmid. 2010. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547.

Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, and Alex Peysakhovich. 2019. PyTorch-BigGraph: A large-scale graph embedding system. arXiv preprint arXiv:1903.12287.

Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. 2020. K-BERT: Enabling language representation with knowledge graph. In AAAI.

Xiao Liu, Da Yin, Jingnan Zheng, Xingjian Zhang, P. Zhang, Hongxia Yang, Yuxiao Dong, and Jie Tang. 2022. OAG-BERT: Towards a unified backbone language model for academic knowledge services. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Daniel Loureiro, Francesco Barbieri, Leonardo Neves, Luis Espinosa Anke, and José Camacho-Collados. 2022. TimeLMs: Diachronic language models from Twitter. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, ACL 2022 - System Demonstrations, Dublin, Ireland, May 22-27, 2022, pages 251–260. Association for Computational Linguistics.

Yu Meng, Chenyan Xiong, Payal Bajaj, Saurabh Tiwary, Paul Bennett, Jiawei Han, and Xia Song. 2021. COCO-LM: Correcting and contrasting text sequences for language model pretraining. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 23102–23114.

Vincent Micheli and François Fleuret. 2021. Language models are few-shot butlers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 9312–9318. Association for Computational Linguistics.

Fatemehsadat Mireshghallah, Nikolai Vogler, Junxian He, Omar Florez, Ahmed El-Kishky, and Taylor Berg-Kirkpatrick. 2022. Non-parametric temporal adaptation for social media topic classification. arXiv preprint arXiv:2209.05706.

Dat Quoc Nguyen, Thanh Vu, and Anh Tuan Nguyen. 2020. BERTweet: A pre-trained language model for English Tweets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2020 - Demos, Online, November 16-20, 2020, pages 9–14. Association for Computational Linguistics.

Parth Patwa, Gustavo Aguilar, Sudipta Kar, Suraj Pandey, Srinivas PYKL, Björn Gambäck, Tanmoy Chakraborty, Thamar Solorio, and Amitava Das. 2020. SemEval-2020 task 9: Overview of sentiment analysis of code-mixed tweets. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, SemEval@COLING 2020, Barcelona (online), December 12-13, 2020, pages 774–790. International Committee for Computational Linguistics.

B. Perozzi, R. Al-Rfou, and S. Skiena. 2014. DeepWalk: Online learning of social representations. In SIGKDD, pages 701–710.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.

Sara Rosenthal, Noura Farra, and Preslav Nakov. 2019. SemEval-2017 task 4: Sentiment analysis in Twitter. CoRR, abs/1912.00741.

Aravind Sankar, Xinyang Zhang, Adit Krishnan, and Jiawei Han. 2020. Inf-VAE: A variational autoencoder framework to integrate homophily and influence in diffusion prediction. In WSDM '20: The Thirteenth ACM International Conference on Web Search and Data Mining, Houston, TX, USA, February 3-7, 2020, pages 510–518. ACM.

C. Shi, Y. Li, J. Zhang, Y. Sun, and Y. Philip. 2016. A survey of heterogeneous information network analysis. TKDE, 29(1):17–37.

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053.

Y. Sun and J. Han. 2013. Mining heterogeneous information networks: a structural analysis approach. ACM SIGKDD Explorations Newsletter.

Yu Suzuki. 2019. Filtering method for Twitter streaming data using human-in-the-loop machine learning. J. Inf. Process., 27:404–410.

J. Tang, M. Qu, and Q. Mei. 2015a. PTE: Predictive text embedding through large-scale heterogeneous text networks. In SIGKDD, pages 1165–1174.

Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015b. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, WWW 2015, Florence, Italy, May 18-22, 2015, pages 1067–1077. ACM.

T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard. 2016. Complex embeddings for simple link prediction. In ICML, pages 2071–2080. PMLR.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Q. Wang, Z. Mao, B. Wang, and L. Guo. 2017. Knowledge graph embedding: A survey of approaches and applications. TKDE, 29(12):2724–2743.

L. Xu, X. Wei, J. Cao, and P. Yu. 2017. Embedding of embedding (EOE): Joint embedding for coupled heterogeneous networks. In WSDM, pages 741–749.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 5754–5764.

Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. 2018. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, pages 974–983. ACM.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 1441–1451. Association for Computational Linguistics.