A Neural Words Encoding Model
Dayiheng Liu, Jiancheng Lv, Xiaofeng Qi, and Jiangshu Wei
Abstract—This paper proposes a neural network model and a learning algorithm that can be applied to encode words. The model realizes the function of word encoding and decoding, which can be applied to text encryption/decryption and word-based compression. The model is based on Deep Belief Networks (DBNs) and differs from traditional DBNs in that it is asymmetrically structured and its output is a binary vector. With pre-training of multi-layer Restricted Boltzmann Machines (RBMs) and fine-tuning to reconstruct the word set, the output of the code layer can be used as a kind of representation code for words. We can change the number of neurons in the code layer to control the length of the representation code for different applications. This paper reports on experiments using English words of the American National Corpus to train a neural words encoding model which can be used to encode/decode English words, realizing text encryption and data compression.

I. INTRODUCTION
Text encryption and compression have a wide range of applications. An interesting approach to text compression is to treat the textual data not as sequences of characters or bytes but as sequences of words; these words may be real words from spoken language, but also sequences of characters [1]. Word-based compression algorithms are a new type of text compression algorithm, while most traditional algorithms perform compression at the character level [2]. They can compress and decompress natural language texts quickly and allow fast and flexible searching of words [3].

We propose a neural network model called the neural words encoding model, which can be applied to encode words. Like word-based compression algorithms, it has advantages in fast compression/decompression of natural language text and in word searching; besides, through the nonlinear transformation of data in the neural network, the model can also encrypt and decrypt words quickly.

The neural words encoding model is based on Deep Belief Networks. The Deep Belief Network (DBN) is a typical deep learning structure which was first proposed by Hinton [4]. It is a probabilistic model consisting of multiple layers of stochastic hidden variables. DBNs can be viewed as a composition of several Restricted Boltzmann Machines (RBMs). A Restricted Boltzmann Machine is a parameterized generative model representing a probability distribution, which consists of two types of units called visible and hidden neurons [5]. Given some observations in the visible neurons, the training data, learning an RBM means adjusting the RBM parameters such that the probability distribution represented by the hidden neurons of the RBM fits the training data as well as possible. The hidden neurons can be viewed as non-linear feature detectors [6]. By applying contrastive divergence [7] to the layer-by-layer unsupervised training procedure of each RBM in turn, a DBN is able to generate the training data with maximum probability [8].

The neural words encoding model differs from traditional DBNs in that it is asymmetrically structured and its output is a binary vector. The model consists of two parts, an encoder and a decoder. The encoder part can encode and encrypt words quickly by nonlinear transformation, while the decoder part can decode and decrypt words rapidly. The encoder and decoder are connected by the code layer, whose outputs are the representation codes of words: m-dimensional binary vectors, so the code of one word needs only m bits to store. Because the neuron number of the code layer is flexible, we can change the length of the word code for different applications. In this paper, approximately 60,000 English words of the American National Corpus are used to train a neural words encoding model. The learning procedure of the model can be divided into two stages: pre-training with unlabelled samples and fine-tuning the whole network with labelled samples. After training, the English words of the training data were encoded uniformly; the outputs of the code layer are the representation codes of the English words, which are 64-dimensional binary vectors. Besides, many English words which are not in the training data, as well as many random strings, can also be encoded and decoded by the model.

II. A NEURAL WORDS ENCODING MODEL

The training set V is a set of words, for example, the vocabulary of the American National Corpus, which includes approximately 60,000 different English words. The goal is to learn a good model to
1) Reconstruct the word set: To obtain a unified encoding and decoding method for the word set, reconstructing every word is necessary. After training a model to learn the features of the word set, the output of the code layer is a kind of unified representation code.
2) Output binary code: The output of the code layer is the encoding of the word set, which consists of m-dimensional vectors. As binary code can be converted into integers by binary conversion, if the output vectors are binary vectors the code of one word needs only m bits to store (a minimal packing sketch is given below). Compared
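The binary-to-integer conversion in goal 2 is straightforward to make concrete. The following Python sketch (our own illustration with hypothetical function names, not the authors' implementation) packs an m-dimensional binary code vector into a single integer of m bits and recovers it again:

```python
def pack_code(bits):
    """Pack a list of 0/1 values (a code-layer output) into one integer,
    so an m-dimensional binary vector needs only m bits of storage."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

def unpack_code(value, m):
    """Recover the m-dimensional binary vector from its packed integer."""
    return [(value >> i) & 1 for i in range(m - 1, -1, -1)]

# A 64-dimensional code like the one used in the paper fits in 8 bytes.
code = [1, 0, 1, 1] + [0] * 60          # toy 64-bit code
packed = pack_code(code)
assert unpack_code(packed, 64) == code
assert packed.bit_length() <= 64
```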
Fig. 2. Layer-wise pre-training
progressively combined from loose low-level representations into more compact high-level representations.
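As a concrete rendering of this layer-wise procedure (Fig. 2), the sketch below implements one contrastive-divergence (CD-1) update for a single binary RBM in NumPy. This is our own minimal illustration of the standard CD-1 rule cited earlier [7], under the usual binary-unit assumptions, not the authors' code; training one RBM this way and feeding its hidden probabilities to the next RBM as data yields the stack of progressively more compact representations.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.1):
    """One CD-1 step for a binary RBM.

    v0 -- (batch, n_visible) binary inputs
    W  -- (n_visible, n_hidden) weights; b, c are visible/hidden biases
    """
    # Positive phase: hidden probabilities and samples driven by the data.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step back to a reconstruction.
    pv1 = sigmoid(h0 @ W.T + b)
    ph1 = sigmoid(pv1 @ W + c)
    # Update: data statistics minus reconstruction statistics.
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
    b += lr * (v0 - pv1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)

# Toy usage: a 60-dimensional visible layer, as for 12 letters x 5 bits.
W = 0.01 * rng.standard_normal((60, 128))
b, c = np.zeros(60), np.zeros(128)
batch = (rng.random((32, 60)) < 0.5).astype(float)
cd1_update(batch, W, b, c)
```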
in which $y$ is the input itself which we want to reconstruct, $\hat{y}$ is the real output of our model, and $|Y|$ is the neuron number of the target layer.

At last, the whole neural network is refined using the back-propagation algorithm via a global gradient-based optimization strategy. We use the steepest descent method to update the weights $\theta(W^1, W^2, \ldots, W^N)$ of our model:
$$ w_{j,i}^{l} \leftarrow w_{j,i}^{l} - \alpha \frac{\partial J}{\partial w_{j,i}^{l}} . \qquad (8) $$

Here $\alpha$ is the learning rate, which can be a constant or dynamically set [13], [14]. To compute $\partial J / \partial w_{j,i}^{l}$ we define $z_k^{l}$ and $\delta_k^{l}$ as follows:

$$ z_k^{l} = \sum_{j=1}^{|H^{l-1}|} w_{kj}^{l-1} a_j^{l-1} + c_k^{l-1} , \qquad (9) $$

$$ \delta_k^{l} = \frac{\partial J}{\partial z_k^{l}} . \qquad (10) $$

$$ \delta_i^{l} = \sigma'(z_i^{l}) \sum_{j=1}^{|H^{l+1}|} \delta_j^{l+1} \, w_{j,i}^{l} . \qquad (12) $$
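To make equations (8)-(10) and (12) concrete, here is a minimal NumPy sketch of the steepest-descent backward pass. It is our own illustration, not the authors' code, and it assumes sigmoid activations (so $\sigma'(z) = \sigma(z)(1-\sigma(z))$) and a loss whose output-layer delta reduces to $\hat{y} - y$, as with cross-entropy over sigmoid outputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def steepest_descent_step(a0, y, Ws, cs, alpha=0.01):
    """One update of every weight matrix, following eqs. (8)-(10) and (12).

    Ws[l] has shape (|H^{l+1}|, |H^l|), so z^{l+1} = Ws[l] @ a^l + cs[l].
    """
    # Forward pass, keeping each activation a^l = sigmoid(z^l); eq. (9).
    activations = [a0]
    for W, c in zip(Ws, cs):
        activations.append(sigmoid(W @ activations[-1] + c))
    # Output-layer delta, eq. (10): dJ/dz at the target layer.
    delta = activations[-1] - y
    # Backward pass: propagate deltas (eq. (12)), update weights (eq. (8)).
    for l in range(len(Ws) - 1, -1, -1):
        grad_W = np.outer(delta, activations[l])   # dJ/dw^l_{j,i}
        grad_c = delta.copy()
        if l > 0:
            sp = activations[l] * (1.0 - activations[l])  # sigma'(z^l)
            delta = sp * (Ws[l].T @ delta)                # eq. (12)
        Ws[l] -= alpha * grad_W                           # eq. (8)
        cs[l] -= alpha * grad_c
```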
Fig. 3. Distribution of English word lengths

in concordance with the Gaussian distribution; besides, the proportion of words whose length is less than thirteen is 96.37% in this vocabulary. We take all the words of the vocabulary whose length is less than thirteen as training data, which includes 58,194 different English words.

A. Vectorizing Input

Before training the proposed model, since English word strings cannot be the input of the network, preprocessing to vectorize the English words is essential. In the stage of pre-training, the first group (V, H^1) is modeled as an RBM, and we need binary vectors as its input.

Ignoring case, there are twenty-six letters ('a'-'z') in all, so that each letter can be represented by 5 bits ($2^5 > 26$). For example, letter 'a' can be represented as [00001] while letter 'z' can be represented as [11010]. Because the maximum
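A minimal sketch of this vectorization follows (our own illustrative code, not the paper's): each letter maps to its 5-bit index, 'a' → 00001 up to 'z' → 11010, and, since the training data keeps only words shorter than thirteen letters, we assume words are padded with all-zero 5-bit slots up to twelve letters, giving a 60-dimensional binary input vector.

```python
def letter_bits(ch):
    """Map 'a'..'z' to a 5-bit code: 'a' -> [0,0,0,0,1], 'z' -> [1,1,0,1,0]."""
    n = ord(ch.lower()) - ord('a') + 1            # 'a' -> 1, ..., 'z' -> 26
    return [(n >> i) & 1 for i in range(4, -1, -1)]

def vectorize_word(word, max_len=12):
    """Turn a word of at most max_len letters into a 5 * max_len binary vector.

    Padding with all-zero slots is our assumption; the paper's exact
    padding convention is not visible in this excerpt.
    """
    assert len(word) <= max_len and word.isalpha()
    bits = []
    for ch in word:
        bits.extend(letter_bits(ch))
    bits.extend([0] * 5 * (max_len - len(word)))
    return bits

print(letter_bits('a'))              # [0, 0, 0, 0, 1]
print(letter_bits('z'))              # [1, 1, 0, 1, 0]
print(len(vectorize_word('word')))   # 60
```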
TABLE I
PERFORMANCE OF MODELS