Download as pdf or txt
Download as pdf or txt
You are on page 1of 5

FINT: Field-aware INTeraction Neural Network For CTR

Prediction
Zhishan Zhao Sen Yang Guohui Liu
iQIYI Inc. National University of Defense liuguohui@qiyi.com
Beijing, China Technology iQIYI Inc.
Changsha, China Beijing, China

Dawei Feng Kele Xu∗


davyfeng.c@gmail.com kelele.xu@Gmail.com
National University of Defense National University of Defense
Technology Technology
arXiv:2107.01999v2 [cs.IR] 30 Jul 2021

Changsha, China Changsha, China

ABSTRACT ACM Reference Format:


As a critical component for online advertising and marking, click- Zhishan Zhao, Sen Yang, Guohui Liu, Dawei Feng, and Kele Xu. 2021. FINT:
Field-aware INTeraction Neural Network For CTR Prediction. In CIKM ’21:
through rate (CTR) prediction has draw lots of attentions from both
ACM International Conference on Information and Knowledge Management
industry and academia field. Recently, the deep learning has become 1-5 November, 2021, Gold Coast, Queensland, Australia. ACM, New York, NY,
the mainstream methodological choice for CTR. Despite of sustain- USA, 5 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
able efforts have been made, existing approaches still pose several
challenges. On the one hand, high-order interaction between the
features is under-explored. On the other hand, high-order inter- 1 INTRODUCTION
actions may neglect the semantic information from the low-order Click-through rate (CTR) prediction aims to forecast the probability
fields. In this paper, we proposed a novel prediction method, named that a user will click a particular recommended item or an adver-
FINT, that employs the Field-aware INTeraction layer which cap- tisement on a web page [19]. The applications of CTR seem to be
tures high-order feature interactions while retaining the low-order evident in several different fields, such as recommendation systems,
field information. To empirically investigate the effectiveness and online advertising and product search. During last decades, CTR
robustness of the FINT, we perform extensive experiments on the has drawn dramatically interests, due to its important roles in both
three realistic databases: KDD2012, Criteo and Avazu. The obtained the academic and industry. Unlike other data types, such as images
results demonstrate that the FINT can significantly improve the and texts, data used in CTR are usually of high sparsity and large
performance compared to the existing methods, without increas- scale. Making an accurate and robust prediction is still far from been
ing the amount of computation required. Moreover, the proposed solved. In the early years, logistic regression (LR) and factorization
method brought about 2.72% increase to the advertising revenue machines (FM) are widely explored. Recently, deep learning-based
of a big online video app through A/B testing. To better promote approaches have been the mainstream methodological choice for
the research in CTR field, we released our code as well as reference CTR, such as, Wide&Deep [3], DeepFM [8], DCN [23] and xDeepFM
implementation at: https://github.com/zhishan01/FINT. [13]. Based on previous studies, how to fully utilizing both the low-
and high-order feature interactions simultaneously can bring extra
CCS CONCEPTS performance improvements, compared to the cases of considering
• Information systems → Retrieval models and ranking; In- either alone.
formation retrieval. By leveraging the high-order feature interaction, the perfor-
mance can be further improved. Despite of sustainable efforts
KEYWORDS have been made, existing high-order feature interaction based ap-
proaches (e.g. Wide&Deep [3], DeepFM [8] and xDeepFM [13])
CTR, Factorization Machines, Feature Interaction
still confront with a significant challenge. The field-level semantic
∗ Corresponding information is lost during the high-order feature interaction. Con-
author.
sequently, the subsequent online deep models cannot fully employ
the explicit features.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed In this paper, we propose a novel framework to capture high-
for profit or commercial advantage and that copies bear this notice and the full citation order feature interactions while preserving the low-order field se-
on the first page. Copyrights for components of this work owned by others than ACM mantic information by introducing a field-aware interaction layer in
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a the model. Specifically, a unique feature interaction vector is main-
fee. Request permissions from permissions@acm.org. tained for each field, which can further facilitate the subsequent
CIKM ’21, 1-5 November, 2021, Gold Coast, Queensland, Australia DNN model to better explore the non-linear high-order relation-
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-XXXX-X/18/06. . . $15.00 ship between fields. We employed this method to improve the CTR
https://doi.org/10.1145/nnnnnnn.nnnnnnn prediction accuracy on an online recommendation service, a system
CIKM ’21, 1-5 November, 2021, Gold Coast, Queensland, Australia Zhishan, et al.

recommends ads to users at a video app. Extensive experiments 3.1 FINT


on both public datasets and online A/B testing demonstrate the Figure 1 shows the overall architecture of our proposed model FINT.
effectiveness and robustness of the proposed method. It mainly contains three modules: embedding layer, field-aware
The remainder of this paper is structured as follows. Firstly, we interaction layer, and DNN layer (dense neural network). The first
brief related works for CTR in Sections 2. After that, in Section 3, we one aims to embed features into dense vectors. The second one
elaborately describe the proposed methodology in detail. Then, in maintains a unique interaction vector for each field, it captures high-
Section 4, we perform comprehensive experiments on three widely order field-aware feature interactions while retaining the low-order
used databases, to assess the performance of the proposed approach field information. The last module targets to exploit non-linear high
in for CTR task. Finally, in Section 5, we draw conclusions and point order feature interaction and predict the user’s clicking behavior.
out some promising related research directions. Embedding Layer The FINT first embeds each feature 𝑥𝑖 ∈ 𝑋
into a dense vector 𝑣𝑖0 of dimension 𝐷 as its initial representation.
If the field is multivalent, the sum of feature embedding is used
2 RELATED WORK as the field embedding. Here, "0" in the subscript denotes the ini-
Due to its evident applications, CTR has drawn great attention. tial one. We denote the output of embedding layer, i.e., the initial
With the success application of deep learning in computer vision representation of 𝑋 , as:
and natural language processing [9, 17], the mainstream methods 𝑉0 = [𝑣 00, 𝑣 10, · · · , 𝑣 𝑀−
0 ⊺
(1)
1] .
for CTR transforms from earlier linear [19] or non-linear [5, 18] to
Field-aware Interaction Layer This layer aims to promote
current deep neural model based [3, 4, 21, 24, 28].
field-aware interaction between features to explore more possible
In earlier approaches, logistic regression [19] is the most basic
combinations. Field-awareness means maintaining a unique inter-
one that models the linear combination of features. Factorization
action vector for each field, making further non-linear interaction
machine (FM) [18] makes use of the advantages of Support Vector
among field possible. We stack 𝐾 field-aware interaction layers in
Machines (SVM), and model all the interactions between variables
the FINT to achieve high-order feature interaction. As shown in
using factorized parameters. Multiple variants of FM including
Figure 1, to achieve field-aware interaction, each interaction layer
Gradient Boosting FM [2], Higher-Order FM [1], and Field-weighted
conducts two steps, computing Hadamard product and channel-
FM [15] have been proposed after that. By enumerating possible
wise weighted sum pooling. The first step takes the initial repre-
feature interactions, these FM based models reduce the cost of
feature engineering [26]. sentation 𝑉 0 and representation of the last step 𝑉 𝑙−1 as input and
The deep model has become the mainstream methodological computes the Hadamard product between all pairs of 𝑣𝑖𝑙−1 ∈ 𝑉 𝑙−1
choice, as it is more powerful for feature interaction learning. and 𝑣 0𝑗 ∈ 𝑉 0 . Then channel-wise weighted sum pooling is con-
More specifically, current deep learning approaches typically fol- ducted along features with learnable parameter 𝑊 𝑙 ∈ R𝐷×𝐷 . To
lows an embedding and feature interaction paradigm. In most ap- avoid training collapse, we also add residual connection [10] in
proaches, the embedding module allocates variable-length embed- each field-aware interaction layer. Specifically, representation 𝑣𝑖𝑙 is
dings or multi embedding to different features. Such methods in- computed as:
cludes NIS [12], Mixed-dimension [6], and AutoEmb [27]. More
∑︁1
𝑀−
models pay attention to model interaction between features in an 𝑣𝑖𝑙 = 𝑙
𝑤𝑖,𝑗 × (𝑣𝑖𝑙−1 ⊙ 𝑣 0𝑗 ) + 𝑢𝑖𝑙 𝑣𝑖𝑙−1, (2)
implicit or an explicit manner based on the basic embedding al- 𝑗=0
gorithms. One classical instance of the former is FNN [25], which
models feature interaction directly based on learn able embeddings. where, 𝑤𝑖,𝑗
𝑙 is the element of 𝑊 𝑙 , 𝑢𝑙 is a scalar, ⊙ denotes Hadamard
𝑖
More researches focus on modeling both explicit and implicit fea- product. Equation 2 indicates 𝑣𝑖𝑙 contains combination information
ture interactions. PNN [16] applies a product layer and multiple of field 𝑖 and other fields in any order within 𝑙. If we consider
fully connected layers to exploring the high-order feature interac- all representations, Equation 2 can be re-organized in the matrix
tions. NFM [11] stacks multiple non-linear layers over their pro- format:
posed bi-linear interaction layer to capture feature interactions. 𝑉 𝑙 = 𝑉 𝑙−1 ⊙ (𝑊 𝑙 ⊗ 𝑉 0 ) + 𝑈 𝑙 × 𝑉 𝑙−1, (3)
DIN [28] introduces an attention mechanism to activate the his- in which 𝑈 𝑙 = [𝑢𝑙0, · · · , 𝑢𝑙𝑀−1 ] ⊺ is a learnable vector parameter for
torical behaviors and capture characteristic of user interests. In residual connection, ⊗ indicates matrix multiplication.
addition, AutoDis [7] and DMR [14] also belong to this category. DNN Layer As the last module of FINT, the DNN layer is to
explore non-linear interactions and predict the clicking probability
according to the final representations. Concretely, the DNN layer
3 APPROACH works as:
We take CTR as a binary classification task to predict whether a 𝑦^ = (𝜎 ◦ FFN)(𝑣 0𝐾 ∥ · · · ∥𝑣 𝑀−
𝐾
1 ), (4)
user will click a given item. Specifically, we denote the model input where ∥ indicates vector concatenation, ◦ indicates function com-
as 𝑋 = {𝑥 0, · · · , 𝑥 𝑀−1 } which contains 𝑀 features. 𝑋 includes not position, 𝜎 is the sigmoid function, FFN is a feed-forward network
only user features but also item features and context features. Such that contains multiple fully-connection layers with hidden size 𝐷 𝐹
features could be categorical, such as age and gender, or continuous, and active function RELU. Such a architecture allows the DNN layer
such as item price. The gold target of CTR model is a scalar 𝑦 (𝑦 = 1 to explore high-order non-linear feature interaction in the semantic
means the user will click the given item, otherwise 𝑦 = 0). space.
FINT: Field-aware INTeraction Neural Network For CTR Prediction CIKM ’21, 1-5 November, 2021, Gold Coast, Queensland, Australia

y
v00 v10
<latexit sha1_base64="nbvLxOJ15Bd4TOXKH3mcuQPGAak=">AAACyHicjVHLSsNAFD2Nr1pfVZdugkVwVRIRdVl0I64qmLZQa0mm0zo0L5JJpRQ3/oBb/TLxD/QvvDOmoBbRCUnOnHvPmbn3erEvUmlZrwVjbn5hcam4XFpZXVvfKG9uNdIoSxh3WORHSctzU+6LkDtSSJ+34oS7gefzpjc8U/HmiCepiMIrOY55J3AHoegL5kqinFHXurG65YpVtfQyZ4GdgwryVY/KL7hGDxEYMgTgCCEJ+3CR0tOGDQsxcR1MiEsICR3nuEeJtBllccpwiR3Sd0C7ds6GtFeeqVYzOsWnNyGliT3SRJSXEFanmTqeaWfF/uY90Z7qbmP6e7lXQKzELbF/6aaZ/9WpWiT6ONE1CKop1oyqjuUume6Kurn5pSpJDjFxCvconhBmWjnts6k1qa5d9dbV8TedqVi1Z3luhnd1Sxqw/XOcs6BxULWPqgeXh5XaaT7qInawi32a5zFqOEcdDnkLPOIJz8aFERt3xvgz1Sjkmm18W8bDB6hykMc=</latexit>

<latexit sha1_base64="y3P9SkaYhh7sTIEg37vAqHSNzLY=">AAACyHicjVHLSsNAFD2Nr1pfVZdugkVwVRIRdVl0I64qmLZQa0mm0zo0L5JJpRQ3/oBb/TLxD/QvvDOmoBbRCUnOnHvPmbn3erEvUmlZrwVjbn5hcam4XFpZXVvfKG9uNdIoSxh3WORHSctzU+6LkDtSSJ+34oS7gefzpjc8U/HmiCepiMIrOY55J3AHoegL5kqinFHXvrG65YpVtfQyZ4GdgwryVY/KL7hGDxEYMgTgCCEJ+3CR0tOGDQsxcR1MiEsICR3nuEeJtBllccpwiR3Sd0C7ds6GtFeeqVYzOsWnNyGliT3SRJSXEFanmTqeaWfF/uY90Z7qbmP6e7lXQKzELbF/6aaZ/9WpWiT6ONE1CKop1oyqjuUume6Kurn5pSpJDjFxCvconhBmWjnts6k1qa5d9dbV8TedqVi1Z3luhnd1Sxqw/XOcs6BxULWPqgeXh5XaaT7qInawi32a5zFqOEcdDnkLPOIJz8aFERt3xvgz1Sjkmm18W8bDB6rUkMg=</latexit>

Residual Connection
Vl 1
<latexit sha1_base64="YhHGCvAEiRRdBhH5dUJG0FE7hJk=">AAACy3icjVHLSsNAFD2Nr1pfVZdugkVwY0mKqMuiGzdCBfuAWiVJp3VoXiQTodYu/QG3+l/iH+hfeGecglpEJyQ5c+45d+be68Y+T4VlveaMmdm5+YX8YmFpeWV1rbi+0UijLPFY3Yv8KGm5Tsp8HrK64MJnrThhTuD6rOkOTmS8ecuSlEfhhRjGrBM4/ZD3uOcIolqNq5G/Z48L18WSVbbUMqeBrUEJetWi4gsu0UUEDxkCMIQQhH04SOlpw4aFmLgORsQlhLiKM4xRIG9GKkYKh9gBffu0a2s2pL3MmSq3R6f49CbkNLFDnoh0CWF5mqnimcos2d9yj1ROebch/V2dKyBW4IbYv3wT5X99shaBHo5UDZxqihUjq/N0lkx1Rd7c/FKVoAwxcRJ3KZ4Q9pRz0mdTeVJVu+yto+JvSilZufe0NsO7vCUN2P45zmnQqJTtg3LlfL9UPdajzmML29ileR6iilPUUFdzfMQTno0zIzXujPtPqZHTnk18W8bDB1M1kdI=</latexit>

Sigmoid

Embedder0 Embedder1
Channel-wise FNN
Hadamard
Product Weighted Sum

(V 0 )|
<latexit sha1_base64="DYTx2UNQ1hf5K5igdJGC9/jvQVE=">AAAC2XicjVHLSsNAFD2Nr1pf9bFzEyyCbkpaRF0W3bisYB9QrSTjWIemSZhMhFq6cCdu/QG3+kPiH+hfeGdMQS2iE5KcOfeeM3Pv9SJfxMpxXjPWxOTU9Ex2Njc3v7C4lF9eqcdhIhmvsdAPZdNzY+6LgNeUUD5vRpK7Pc/nDa97qOONay5jEQYnqh/xs57bCcSlYK4i6jy/tlVvD5zhdntwKgLFJXP9Ye48X3CKjln2OCiloIB0VcP8C05xgRAMCXrgCKAI+3AR09NCCQ4i4s4wIE4SEibOMUSOtAllccpwie3St0O7VsoGtNeesVEzOsWnV5LSxiZpQsqThPVptoknxlmzv3kPjKe+W5/+XurVI1bhiti/dKPM/+p0LQqX2Dc1CKopMoyujqUuiemKvrn9pSpFDhFxGl9QXBJmRjnqs200sald99Y18TeTqVm9Z2lugnd9Sxpw6ec4x0G9XCztFsvHO4XKQTrqLNaxgS2a5x4qOEIVNfK+wSOe8Gy1rFvrzrr/TLUyqWYV35b18AEiG5c2</latexit>

Pooling
x0 x1
Vl
<latexit sha1_base64="gGyFlbBKimkkD15RfFrL8pGVzd4=">AAACxnicjVHLSsNAFD2Nr1pfVZdugkVwVZIi6rLopsuK9gG1SjKd1tA0CZOJUorgD7jVTxP/QP/CO+MU1CI6IcmZc+85M/dePwmDVDrOa86am19YXMovF1ZW19Y3iptbzTTOBOMNFoexaPteysMg4g0ZyJC3E8G9kR/ylj88VfHWLRdpEEcXcpzw7sgbREE/YJ4k6rx5FV4XS07Z0cueBa4BJZhVj4svuEQPMRgyjMARQRIO4SGlpwMXDhLiupgQJwgFOs5xjwJpM8rilOERO6TvgHYdw0a0V56pVjM6JaRXkNLGHmliyhOE1Wm2jmfaWbG/eU+0p7rbmP6+8RoRK3FD7F+6aeZ/daoWiT6OdQ0B1ZRoRlXHjEumu6Jubn+pSpJDQpzCPYoLwkwrp322tSbVtaveejr+pjMVq/bM5GZ4V7ekAbs/xzkLmpWye1iunB2Uqidm1HnsYBf7NM8jVFFDHQ3yHuART3i2alZkZdbdZ6qVM5ptfFvWwwdJEJBA</latexit>

Embedding Layer Field-Aware Interaction Layer xK DNN Layer

Figure 1: The architecture of FINT.

Table 1: Effectiveness comparison of different algorithms.

Criteo Avazu KDD12


Model
AUC Logloss AUC Logloss AUC Logloss
LR 0.7846 0.4670 0.7616 0.3901 0.7352 0.1385
FM 0.7912 0.4627 0.7753 0.3826 0.7419 0.1383
NFM 0.7991 0.4541 0.7761 0.3820 0.7419 0.1378
PNN 0.8069 0.4473 0.7793 0.3802 0.7571 0.1357
DeepFM 0.8014 0.4524 0.7785 0.3806 0.7517 0.1366
Wide & Deep 0.8042 0.4495 0.7776 0.3811 0.7509 0.1366
AutoInt 0.8053 0.4482 0.7770 0.3813 0.7613 0.1356
DCN 0.8053 0.4483 0.7777 0.3811 0.7546 0.1360
XDeepFM 0.8055 0.4484 0.7796 0.3801 0.7531 0.1360
FINT 0.8077 0.4461 0.7795 0.3800 0.7618 0.1355

In the training stage, we use binary cross entropy as the training hand, the field-aware interaction layer makes FINT distinguishable
loss for FINT. In the inference stage, we take the output 𝑦^ as the from others. Unlike with previous approaches that cast all feature
probability of a user clicking the given item. representations into a single scalar or vector during feature interac-
Time Complexity Equation 3 shows that each feature interac- tion [9, 13, 18], the field-aware interaction layer maintains a vector
tion layer can be efficiently computed in O (𝑀 2 𝐷). For the DNN for each field, retaining their boundaries, to allows the following
layer, the vector-matrix multiplication is the main operation which DNN module to further mine nonlinear interaction. AutoINT [20]
can be done in O (𝑀𝐷𝐷 𝐹 + 𝐷 𝐹2 ). Since there are 𝐾 field-aware in- is the most related work to FINT in paradigm because it also re-
teraction layers(𝐾 is usually small), the overall time complexity of tains the feature boundary in linear feature interaction and exploits
FINT is O (𝐾𝑀 2 𝐷 + 𝑀𝐷𝐷 𝐹 + 𝐷 𝐹2 ). Although the feature interaction nonlinear high-order feature interaction. However, it is based on
layer complexity is inferior to the one of traditional machine learn- the Transformer model [22] and uses the self-attention mechanism
ing based FM [18] and NFM [9], which is O (𝐾𝑀𝐷), it surpasses to learn feature weights, while FINT uses the Hadamard product,
a variety of deep learning based peers, such as xDeepFM [8, 13] which is a more general and effective method in recommendation
which is O (𝐾𝑀 2 𝐷𝑇 ), where 𝑇 is the number of multiple pooling systems.
operations. Moreover, as FINT conducts most operations on ma-
trixes and requires no sequence operations, it can achieve a higher
GPU acceleration ratio. 4 EXPERIMENTS
In this section, we aim to provide the experimental results of FINT,
from the perspective of efficiency and effectiveness. The prototype
3.2 Relationship with Other Models of FINT is implemented by Python 3.7 + TensorFlow 1.14.0 and run
FINT shares a similar paradigm with several factorization based with a Nvidia Tesla P40 GPU. Two metrics Logloss and AUC (Area
models [9, 13, 18, 20]. For example, they explore feature interaction Under the ROC Curve) are used in our experimental studies. Here,
in the vector space and exploit pooling operation to reduce dimen- a smaller Logloss or a larger AUC represents better CTR prediction
sion and promote further classification. NFM [11] has shown the performance. For the training of FINT model, we set the learning
advantage of integrating linear feature combination and nonlin- rate as 1e-3 and Adam optimizer is employed in our experiments.
ear high-order feature combination. Therefore, FINT also exploits The batch size is set 1024, while the embedding size is setting as 16.
nonlinear feature interaction through the DNN layer. On the other We use 3 field-aware interaction layers. For the DNN layers, the
CIKM ’21, 1-5 November, 2021, Gold Coast, Queensland, Australia Zhishan, et al.

number of hidden layers is set as [300, 300, 300]. We replace the KDD2012 dataset is similar to the one of Criteo. Overall, these
features that appear less than 10 times as “unknown”. Numerical results illustrate the effectiveness of the FINT for the CTR task.
features be normalized with 𝑧 ∗ = 𝑙𝑜𝑔(𝑧 + 1) + 1. The settings of the FINT employs the deep FC layers, to capture the complex implicit
experimental part are basically kept the same as AutoINT. information and improve the performance. We leave the exploration
towards proper feature representation and parameterization scheme
4.1 Dataset for future work.
We evaluate the proposed method on three publicly available datasets
including the KDD12 1 , Criteo 2 , and Avazu 3 . Criteo and Avazu
contain chronologically ordered click-through records from Criteo
and Avazu which are two online advertisement companies. For
Avazu and Cretio dataset, we We randomly split the dataset into
training (80%), validation (10%), and test (10%) sets. While for the
KDD12, we follow the official public and private split.

4.2 Baselines
We implemented 9 widely used CTR prediction approaches using
TensorFlow, and compared them in the experiment. Below is a brief
introduction of these models:
• LR, we employ the LR only with basic features as our first
baseline.
• FM, we employ the original FM model, which has demon-
strated its effectiveness in many CTR prediction tasks.
• NFM, which aims to encode all feature interactions, through
a multi-layer neural network coupled with a bit-wise bi-
interaction pooling layer. Figure 2: Time comparison (in seconds).
• PNN, which applies a product layer and multiple fully con-
nected layers to explore the high-order feature interactions.
• Wide & Deep, which aims to model low- and high-order
feature interactions simultaneously. 4.4 Effectiveness Comparison
• DeepFM, which explores the integration of the of FM and In Figure 2, we conducted a quantitative comparison on the run-
deep neural networks (DNN). Through the modeling of low- time between FINT and seven state-of-the-art models with GPU
order feature interactions like FM and models high-order implementations on Criteo and Avazu. In the Figure, the y-axis
feature interactions like DNN. provides an average runtime per epoch over five training epochs
• AutoInt, which employs a multi-head self-attentive neural after which all models start to converge observably. We keep the
network as the core module and can automatically learn the hardware settings as identical to aforementioned in the experiment
high-order interactions of input features. setting session. From the figure, we observe that FINT displays an
• DCN, which makes use of the deep cross network and takes superior efficiency by spending the minimum time for each epoch
the outer product of concatenated feature embeddings to among the ten models, while retaining best prediction performance.
explicitly model feature interaction. The main property of FINT enable the huge speedup: the Hadamard
• xDeepFM, which has a compressed interaction network to product operations across the features can reduce the problem scale
model vector-wise feature interactions for CTR prediction. (from exponential to linear). Thus, FINT can avoid the enumeration
of all possible feature combinations in the k orders.
4.3 Performance Evaluation
We show the performance of all models on CTR prediction with 4.5 Result from online A/B testing
Table 1. As can be observed from Table 1: the proposed FINT model Careful online A/B testing in the advertising display system was
can provide better performance compared to other models on Criteo, conducted. Being a large video app, it has more than 100 million
KDD2012, and Avazu on both metrics. The FINT obtains different users using it to watch videos every day. Ads are distributed in mul-
improvements on three datasets. For example, on the Criteo dataset, tiple locations of the app, including video pre-post ads(video general
FINT surpasses the previous best model (PNN) over 0.08% point roll), startup screen ads when the app is opened(open screen), short
on AUC, which is already a considerable improvement in the CTR video feed flow ads(infeed), and long video feed flow ads(semi-feed).
task. On the Avazu dataset, the FINT achieves a comparable AUC During almost a month’s testing, FINT trained with the proposed
with XDeepFM, only superior with 0.01% point. On the other hand, FINT contributes up to 2.92% CTR and 3.18% RPM(Revenue Per
it wins XDeepFM on the Logloss metric. The improvement on the Mille) promotion compared with the baseline models (Wide & Deep).
1 https://www.kaggle.com/c/kddcup2012-track2 This is a significant improvement and demonstrates the effective-
2 https://www.kaggle.com/c/criteo-display-ad-challenge ness of our proposed approaches. Now FINT has been deployed
3 https://www.kaggle.com/c/avazu-ctr-prediction online and serves the main traffic. Details of A/B testing results
FINT: Field-aware INTeraction Neural Network For CTR Prediction CIKM ’21, 1-5 November, 2021, Gold Coast, Queensland, Australia

conducted at different advertising positions of the app are shown [14] Ze Lyu, Yu Dong, Chengfu Huo, and Weijun Ren. 2020. Deep Match to Rank
in Table 2. Model for Personalized Click-Through Rate Prediction. In Proceedings of the AAAI
Conference on Artificial Intelligence, Vol. 34. 156–163.
[15] Junwei Pan, Jian Xu, Alfonso Lobos Ruiz, Wenliang Zhao, Shengjun Pan, Yu Sun,
Table 2: A/B testing on different advertising positions. and Quan Lu. 2018. Field-weighted factorization machines for click-through
rate prediction in display advertising. In Proceedings of the 2018 World Wide Web
Conference. 1349–1357.
position revenue click rate [16] Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang.
overall +2.72% +2.92% 2016. Product-based neural networks for user response prediction. In 2016 IEEE
16th International Conference on Data Mining (ICDM). IEEE, 1149–1154.
video general roll +1.53% +0.41% [17] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya
open screen +2.67% +4.11% Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI
blog 1, 8 (2019), 9.
infeed +3.39% +5.38% [18] Steffen Rendle. 2010. Factorization machines. In 2010 IEEE International Confer-
semi-feed +4.81% +4.69% ence on Data Mining. IEEE, 995–1000.
[19] Matthew Richardson, Ewa Dominowska, and Robert Ragno. 2007. Predicting
clicks: estimating the click-through rate for new ads. In Proceedings of the 16th
international conference on World Wide Web. 521–530.
[20] Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang,
5 CONCLUSION and Jian Tang. 2019. Autoint: Automatic feature interaction learning via self-
attentive neural networks. In Proceedings of the 28th ACM International Conference
In this paper, we proposed FINT, an efficient and effective CTR on Information and Knowledge Management. 1161–1170.
predictor. FINT aims to learn the high-order feature interactions [21] Yixin Su, Rui Zhang, Sarah Erfani, and Zhenghua Xu. 2021. Detecting Beneficial
by employing the Field-aware interaction layer, which captures Feature Interactions for Recommender Systems. In Proceedings of the 34th AAAI
Conference on Artificial Intelligence (AAAI).
high-order feature interactions without losing the low-order field [22] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
information. We have conducted extensive experiments on public Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. In Advances in neural information processing systems. 5998–6008.
realistic datasets and an A/B testing on a large online systems. [23] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & cross network
The obtained results suggest that FINT can learn effective high- for ad click predictions. In Proceedings of the ADKDD’17. 1–7.
order feature interactions, while runs faster than state-of-the-art [24] Kai Zhang, Hao Qian, Qing Cui, Qi Liu, Longfei Li, Jun Zhou, Jianhui Ma, and
Enhong Chen. 2021. Multi-Interactive Attention Network for Fine-grained Fea-
models, meaning a high efficiency on CTR prediction and achieves ture Learning in CTR Prediction. In Proceedings of the 14th ACM International
comparable or even better performances. Conference on Web Search and Data Mining. 984–992.
[25] Weinan Zhang, Tianming Du, and Jun Wang. 2016. Deep learning over multi-field
categorical data. In European conference on information retrieval. Springer, 45–57.
REFERENCES [26] Weinan Zhang, Jiarui Qin, Wei Guo, Ruiming Tang, and Xiuqiang He. 2021. Deep
[1] Mathieu Blondel, Akinori Fujino, Naonori Ueda, and Masakazu Ishihata. 2016. Learning for Click-Through Rate Estimation. arXiv preprint arXiv:2104.10584
Higher-order factorization machines. arXiv preprint arXiv:1607.07195 (2016). (2021).
[2] Chen Cheng, Fen Xia, Tong Zhang, Irwin King, and Michael R Lyu. 2014. Gradient [27] Xiangyu Zhao, Chong Wang, Ming Chen, Xudong Zheng, Xiaobing Liu, and
boosting factorization machines. In Proceedings of the 8th ACM Conference on Jiliang Tang. 2020. Autoemb: Automated embedding dimensionality search in
Recommender systems. 265–272. streaming recommendations. arXiv preprint arXiv:2002.11252 (2020).
[3] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, [28] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui
Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through
2016. Wide & deep learning for recommender systems. In Proceedings of the 1st rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference
workshop on deep learning for recommender systems. 7–10. on Knowledge Discovery & Data Mining. 1059–1068.
[4] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks
for youtube recommendations. In Proceedings of the 10th ACM conference on
recommender systems. 191–198.
[5] Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting
machine. Annals of statistics (2001), 1189–1232.
[6] Antonio Ginart, Maxim Naumov, Dheevatsa Mudigere, Jiyan Yang, and James
Zou. 2019. Mixed dimension embeddings with application to memory-efficient
recommendation systems. arXiv preprint arXiv:1909.11810 (2019).
[7] Huifeng Guo, Bo Chen, Ruiming Tang, Zhenguo Li, and Xiuqiang He. 2020.
AutoDis: Automatic Discretization for Embedding Numerical Features in CTR
Prediction. arXiv preprint arXiv:2012.08986 (2020).
[8] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017.
DeepFM: a factorization-machine based neural network for CTR prediction. In
Proceedings of the 26th International Joint Conference on Artificial Intelligence.
1725–1731.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual
learning for image recognition. In Proceedings of the IEEE conference on computer
vision and pattern recognition. 770–778.
[10] K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep Residual Learning for Image
Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR). 770–778.
[11] Xiangnan He and Tat-Seng Chua. 2017. Neural factorization machines for sparse
predictive analytics. In Proceedings of the 40th International ACM SIGIR conference
on Research and Development in Information Retrieval. 355–364.
[12] Manas R Joglekar, Cong Li, Mei Chen, Taibai Xu, Xiaoming Wang, Jay K Adams,
Pranav Khaitan, Jiahui Liu, and Quoc V Le. 2020. Neural input search for large
scale recommendation models. In Proceedings of the 26th ACM SIGKDD Interna-
tional Conference on Knowledge Discovery & Data Mining. 2387–2397.
[13] Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and
Guangzhong Sun. 2018. xdeepfm: Combining explicit and implicit feature in-
teractions for recommender systems. In Proceedings of the 24th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining. 1754–1763.

You might also like