FINT: Field-aware INTeraction Neural Network For CTR Prediction
CIKM '21, 1-5 November, 2021, Gold Coast, Queensland, Australia

Zhishan Zhao, iQIYI Inc., Beijing, China
Sen Yang, National University of Defense Technology, Changsha, China
Guohui Liu, iQIYI Inc., Beijing, China (liuguohui@qiyi.com)
[Figure 1: The FINT architecture. Embedders map the input fields x0, x1, … to field vectors; each field-aware interaction layer turns V_{l-1} into V_l via a Hadamard product with a weighted sum over (V^0)^T, with residual connections; pooling and a channel-wise FNN produce the sigmoid output ŷ.]
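The field-aware interaction layer in the figure can be sketched in NumPy. The exact form of Equation 3 is not shown in this excerpt, so the update V_l = V_{l-1} + (W_l V_0) ⊙ V_{l-1} below is an assumed reading of the Hadamard-product, weighted-sum, and residual-connection labels, chosen so that one layer costs O(M²D) as the text states; the function name and weight shapes are illustrative, not the authors' code.

```python
import numpy as np

def field_aware_interaction(V_prev, V0, W):
    """One field-aware interaction layer (assumed reading of Eq. 3 / Fig. 1).

    V_prev: (M, D) field vectors from layer l-1
    V0:     (M, D) original embedded field vectors
    W:      (M, M) learnable field-interaction weights for this layer

    The weighted sum W @ V0 costs O(M^2 D); the Hadamard product and the
    residual connection keep one D-dimensional vector per field, so field
    boundaries are never collapsed into a single scalar or vector.
    """
    mixed = W @ V0                   # weighted sum of the original field vectors
    return V_prev + V_prev * mixed   # Hadamard product + residual connection

M, D = 4, 16
rng = np.random.default_rng(0)
V0 = rng.normal(size=(M, D))         # embedded input fields
V = V0
for W in rng.normal(size=(3, M, M)) * 0.1:   # K = 3 layers, as in the experiments
    V = field_aware_interaction(V, V0, W)
print(V.shape)                       # each field still owns its own vector
```

Note how the output keeps the (M, D) shape: the following DNN can still see per-field vectors, which is the property the text credits for FINT's distinguishability.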
In the training stage, we use binary cross entropy as the training loss for FINT. In the inference stage, we take the output ŷ as the probability of the user clicking the given item.

Time Complexity. Equation 3 shows that each feature interaction layer can be efficiently computed in O(M²D). For the DNN layer, the vector-matrix multiplication is the main operation, which can be done in O(MDD_F + D_F²). Since there are K field-aware interaction layers (K is usually small), the overall time complexity of FINT is O(KM²D + MDD_F + D_F²). Although the feature interaction layer's complexity is higher than that of the traditional machine-learning-based FM [18] and NFM [9], which is O(KMD), it surpasses a variety of deep-learning-based peers, such as xDeepFM [8, 13], whose complexity is O(KM²DT), where T is the number of pooling operations. Moreover, as FINT conducts most of its operations on matrices and requires no sequential operations, it can achieve a higher GPU acceleration ratio.

3.2 Relationship with Other Models
FINT shares a similar paradigm with several factorization-based models [9, 13, 18, 20]: they explore feature interactions in the vector space and exploit pooling operations to reduce dimensionality and support the final classification. NFM [11] has shown the advantage of integrating linear feature combination with nonlinear high-order feature combination; therefore, FINT also exploits nonlinear feature interaction through the DNN layer. On the other hand, the field-aware interaction layer makes FINT distinguishable from the others. Unlike previous approaches that cast all feature representations into a single scalar or vector during feature interaction [9, 13, 18], the field-aware interaction layer maintains a vector for each field, retaining field boundaries and allowing the following DNN module to further mine nonlinear interactions. AutoInt [20] is the work most closely related to FINT in paradigm, because it also retains feature boundaries in linear feature interaction and exploits nonlinear high-order feature interactions. However, it is based on the Transformer model [22] and uses the self-attention mechanism to learn feature weights, while FINT uses the Hadamard product, which is a more general and effective method in recommendation systems.

4 EXPERIMENTS
In this section, we provide experimental results for FINT from the perspectives of efficiency and effectiveness. The prototype of FINT is implemented in Python 3.7 + TensorFlow 1.14.0 and run on an Nvidia Tesla P40 GPU. Two metrics, Logloss and AUC (Area Under the ROC Curve), are used in our experimental studies; a smaller Logloss or a larger AUC represents better CTR prediction performance. For training the FINT model, we set the learning rate to 1e-3 and employ the Adam optimizer. The batch size is set to 1024 and the embedding size to 16. We use 3 field-aware interaction layers. For the DNN layers, the hidden layer sizes are set to [300, 300, 300].
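The training settings above can be condensed into a shape-level sketch. The interaction-layer form, the ReLU activation, and the flattening before the DNN are assumptions (this excerpt does not specify them), so the snippet is illustrative rather than the authors' implementation; only the hyperparameters (embedding size 16, K = 3, DNN sizes [300, 300, 300], sigmoid output, binary cross entropy) come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
M, D = 10, 16                       # M input fields, embedding size 16
K = 3                               # 3 field-aware interaction layers
hidden = [300, 300, 300]            # DNN hidden layer sizes from the settings

V0 = rng.normal(size=(M, D))        # embedded input fields
V = V0
for _ in range(K):                  # assumed layer form (cf. Fig. 1 labels)
    W = rng.normal(size=(M, M)) * 0.1
    V = V + (W @ V0) * V            # weighted sum + Hadamard product + residual

h = V.reshape(-1)                   # flatten the M field vectors for the DNN
for width in hidden:                # vector-matrix multiplies: O(M*D*D_F + D_F^2)
    Wh = rng.normal(size=(h.size, width)) * 0.01
    h = np.maximum(Wh.T @ h, 0.0)   # ReLU is an assumption, not stated here

w_out = rng.normal(size=h.size) * 0.01
y_hat = 1.0 / (1.0 + np.exp(-(w_out @ h)))   # sigmoid: click probability

# binary cross entropy, the training loss named in the text (label y = 1 here)
y = 1.0
bce = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(f"y_hat={y_hat:.3f}, bce={bce:.3f}")
```

At inference time only ŷ is needed; the loss is used during training with Adam at learning rate 1e-3, as described above.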
We replace the features that appear less than 10 times with "unknown". Numerical features are normalized with z* = log(z + 1) + 1. The settings of the experimental part are basically kept the same as those of AutoInt.

4.1 Dataset
We evaluate the proposed method on three publicly available datasets: KDD12 (https://www.kaggle.com/c/kddcup2012-track2), Criteo (https://www.kaggle.com/c/criteo-display-ad-challenge), and Avazu (https://www.kaggle.com/c/avazu-ctr-prediction). Criteo and Avazu contain chronologically ordered click-through records from Criteo and Avazu, two online advertisement companies. For the Avazu and Criteo datasets, we randomly split the data into training (80%), validation (10%), and test (10%) sets, while for KDD12 we follow the official public and private split.

4.2 Baselines
We implemented 9 widely used CTR prediction approaches using TensorFlow and compared them in the experiments. Below is a brief introduction of these models:
• LR: logistic regression with only basic features, as our first baseline.
• FM: the original FM model, which has demonstrated its effectiveness in many CTR prediction tasks.
• NFM: encodes all feature interactions through a multi-layer neural network coupled with a bit-wise bi-interaction pooling layer.
• PNN: applies a product layer and multiple fully connected layers to explore high-order feature interactions.
• Wide & Deep: models low- and high-order feature interactions simultaneously.
• DeepFM: integrates FM and deep neural networks (DNN), modeling low-order feature interactions like FM and high-order feature interactions like DNN.
• AutoInt: employs a multi-head self-attentive neural network as the core module and automatically learns high-order interactions of the input features.
• DCN: makes use of the deep cross network and takes the outer product of concatenated feature embeddings to explicitly model feature interactions.
• xDeepFM: uses a compressed interaction network to model vector-wise feature interactions for CTR prediction.

4.3 Performance Evaluation
We show the performance of all models on CTR prediction in Table 1. As can be observed from Table 1, the proposed FINT model provides better performance than the other models on Criteo, KDD2012, and Avazu on both metrics. FINT obtains different degrees of improvement on the three datasets. For example, on the Criteo dataset, FINT surpasses the previous best model (PNN) by over 0.08 percentage points on AUC, which is already a considerable improvement in the CTR task. On the Avazu dataset, FINT achieves an AUC comparable with xDeepFM, superior by only 0.01 percentage points; on the other hand, it beats xDeepFM on the Logloss metric. The improvement on the KDD2012 dataset is similar to that on Criteo. Overall, these results illustrate the effectiveness of FINT for the CTR task. FINT employs deep FC layers to capture complex implicit information and improve performance. We leave the exploration of proper feature representation and parameterization schemes to future work.

Figure 2: Time comparison (in seconds).

4.4 Efficiency Comparison
In Figure 2, we conduct a quantitative comparison of the runtime of FINT and seven state-of-the-art models with GPU implementations on Criteo and Avazu. In the figure, the y-axis gives the average runtime per epoch over five training epochs, after which all models start to converge observably. The hardware settings are identical to those described in the experiment settings above. From the figure, we observe that FINT displays superior efficiency, spending the minimum time per epoch among the compared models while retaining the best prediction performance. The main property of FINT that enables this speedup is that the Hadamard product operations across the features reduce the problem scale from exponential to linear; thus, FINT avoids enumerating all possible feature combinations up to the k-th order.

4.5 Results from Online A/B Testing
We conducted careful online A/B testing in the advertising display system of a large video app, which has more than 100 million users watching videos every day. Ads are distributed in multiple locations of the app, including video pre- and post-roll ads (video general roll), startup screen ads shown when the app is opened (open screen), short video feed flow ads (infeed), and long video feed flow ads (semi-feed). During almost a month of testing, the model trained with the proposed FINT contributed up to a 2.92% CTR and 3.18% RPM (Revenue Per Mille) promotion compared with the baseline model (Wide & Deep). This is a significant improvement and demonstrates the effectiveness of our proposed approach. FINT has now been deployed online and serves the main traffic. Details of the A/B testing results
conducted at different advertising positions of the app are shown in Table 2.

Table 2: A/B testing on different advertising positions.

position             revenue   click rate
overall              +2.72%    +2.92%
video general roll   +1.53%    +0.41%
open screen          +2.67%    +4.11%
infeed               +3.39%    +5.38%
semi-feed            +4.81%    +4.69%

5 CONCLUSION
In this paper, we proposed FINT, an efficient and effective CTR predictor. FINT learns high-order feature interactions by employing the field-aware interaction layer, which captures high-order feature interactions without losing low-order field information. We have conducted extensive experiments on public realistic datasets and an A/B test on a large online system. The obtained results suggest that FINT can learn effective high-order feature interactions while running faster than state-of-the-art models, meaning high efficiency on CTR prediction, and achieves comparable or even better performance.

REFERENCES
[1] Mathieu Blondel, Akinori Fujino, Naonori Ueda, and Masakazu Ishihata. 2016. Higher-order factorization machines. arXiv preprint arXiv:1607.07195 (2016).
[2] Chen Cheng, Fen Xia, Tong Zhang, Irwin King, and Michael R Lyu. 2014. Gradient boosting factorization machines. In Proceedings of the 8th ACM Conference on Recommender Systems. 265–272.
[3] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. 7–10.
[4] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. 191–198.
[5] Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics (2001), 1189–1232.
[6] Antonio Ginart, Maxim Naumov, Dheevatsa Mudigere, Jiyan Yang, and James Zou. 2019. Mixed dimension embeddings with application to memory-efficient recommendation systems. arXiv preprint arXiv:1909.11810 (2019).
[7] Huifeng Guo, Bo Chen, Ruiming Tang, Zhenguo Li, and Xiuqiang He. 2020. AutoDis: Automatic Discretization for Embedding Numerical Features in CTR Prediction. arXiv preprint arXiv:2012.08986 (2020).
[8] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. 1725–1731.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[10] K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.
[11] Xiangnan He and Tat-Seng Chua. 2017. Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 355–364.
[12] Manas R Joglekar, Cong Li, Mei Chen, Taibai Xu, Xiaoming Wang, Jay K Adams, Pranav Khaitan, Jiahui Liu, and Quoc V Le. 2020. Neural input search for large scale recommendation models. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2387–2397.
[13] Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xDeepFM: Combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1754–1763.
[14] Ze Lyu, Yu Dong, Chengfu Huo, and Weijun Ren. 2020. Deep Match to Rank Model for Personalized Click-Through Rate Prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 156–163.
[15] Junwei Pan, Jian Xu, Alfonso Lobos Ruiz, Wenliang Zhao, Shengjun Pan, Yu Sun, and Quan Lu. 2018. Field-weighted factorization machines for click-through rate prediction in display advertising. In Proceedings of the 2018 World Wide Web Conference. 1349–1357.
[16] Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang. 2016. Product-based neural networks for user response prediction. In 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 1149–1154.
[17] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9.
[18] Steffen Rendle. 2010. Factorization machines. In 2010 IEEE International Conference on Data Mining. IEEE, 995–1000.
[19] Matthew Richardson, Ewa Dominowska, and Robert Ragno. 2007. Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web. 521–530.
[20] Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2019. AutoInt: Automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 1161–1170.
[21] Yixin Su, Rui Zhang, Sarah Erfani, and Zhenghua Xu. 2021. Detecting Beneficial Feature Interactions for Recommender Systems. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI).
[22] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[23] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & cross network for ad click predictions. In Proceedings of the ADKDD'17. 1–7.
[24] Kai Zhang, Hao Qian, Qing Cui, Qi Liu, Longfei Li, Jun Zhou, Jianhui Ma, and Enhong Chen. 2021. Multi-Interactive Attention Network for Fine-grained Feature Learning in CTR Prediction. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 984–992.
[25] Weinan Zhang, Tianming Du, and Jun Wang. 2016. Deep learning over multi-field categorical data. In European Conference on Information Retrieval. Springer, 45–57.
[26] Weinan Zhang, Jiarui Qin, Wei Guo, Ruiming Tang, and Xiuqiang He. 2021. Deep Learning for Click-Through Rate Estimation. arXiv preprint arXiv:2104.10584 (2021).
[27] Xiangyu Zhao, Chong Wang, Ming Chen, Xudong Zheng, Xiaobing Liu, and Jiliang Tang. 2020. AutoEmb: Automated embedding dimensionality search in streaming recommendations. arXiv preprint arXiv:2002.11252 (2020).
[28] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1059–1068.