
Sentence Reconstruction

The purpose of this project is to take as input a sequence of words corresponding to a random
permutation of a given English sentence, and reconstruct the original sentence.

The output can be produced either in a single shot, or through an iterative (autoregressive) loop
generating a single token at a time.
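To make the task concrete, here is a toy illustration (a minimal sketch, not part of the assignment code) of what an input/output pair looks like, using one of the dataset sentences shown later in the notebook:

import random

original = "ranchers clear large areas of rainforest to become pastures for their cattle"
words = original.split()

# The model receives a random permutation of the words...
shuffled = words[:]
random.seed(0)
random.shuffle(shuffled)

# ...and must reconstruct the original order.
print("input :", " ".join(shuffled))
print("target:", original)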

CONSTRAINTS:

• No pretrained model can be used.


• The neural network models should have fewer than 20M parameters.
• No postprocessing should be done (e.g., no beam search).
• You cannot use additional training data.

BONUS PARAMETERS:

A bonus of 0-2 points will be awarded to incentivize the adoption of models with a low number
of parameters.

Dataset
The dataset is composed of sentences taken from the generics_kb dataset on Hugging Face. We
restricted the vocabulary to the 10K most frequent words, and only kept sentences making use
of this vocabulary.

!pip install datasets

Collecting datasets
Downloading datasets-2.19.2-py3-none-any.whl (542 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 542.1/542.1 kB 7.5 MB/s eta
0:00:00
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-
packages (from datasets) (3.14.0)
Requirement already satisfied: numpy>=1.17 in
/usr/local/lib/python3.10/dist-packages (from datasets) (1.25.2)
Requirement already satisfied: pyarrow>=12.0.0 in
/usr/local/lib/python3.10/dist-packages (from datasets) (14.0.2)
Requirement already satisfied: pyarrow-hotfix in
/usr/local/lib/python3.10/dist-packages (from datasets) (0.6)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.3/116.3 kB 6.5 MB/s eta
0:00:00
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-
packages (from datasets) (2.0.3)
Collecting requests>=2.32.1 (from datasets)
Downloading requests-2.32.3-py3-none-any.whl (64 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64.9/64.9 kB 5.0 MB/s eta
0:00:00
Requirement already satisfied: tqdm>=4.62.1 in /usr/local/lib/python3.10/dist-
packages (from datasets) (4.66.4)
Collecting xxhash (from datasets)
Downloading xxhash-3.4.1-cp310-cp310-
manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 194.1/194.1 kB 7.3 MB/s eta
0:00:00
Collecting multiprocess (from datasets)
Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.8/134.8 kB 8.1 MB/s eta
0:00:00
Requirement already satisfied: fsspec[http]<=2024.3.1,>=2023.1.0 in
/usr/local/lib/python3.10/dist-packages (from datasets) (2023.6.0)
Requirement already satisfied: aiohttp in
/usr/local/lib/python3.10/dist-packages (from datasets) (3.9.5)
Requirement already satisfied: huggingface-hub>=0.21.2 in
/usr/local/lib/python3.10/dist-packages (from datasets) (0.23.3)
Requirement already satisfied: packaging in
/usr/local/lib/python3.10/dist-packages (from datasets) (24.1)
Requirement already satisfied: pyyaml>=5.1 in
/usr/local/lib/python3.10/dist-packages (from datasets) (6.0.1)
Requirement already satisfied: aiosignal>=1.1.2 in
/usr/local/lib/python3.10/dist-packages (from aiohttp->datasets)
(1.3.1)
Requirement already satisfied: attrs>=17.3.0 in
/usr/local/lib/python3.10/dist-packages (from aiohttp->datasets)
(23.2.0)
Requirement already satisfied: frozenlist>=1.1.1 in
/usr/local/lib/python3.10/dist-packages (from aiohttp->datasets)
(1.4.1)
Requirement already satisfied: multidict<7.0,>=4.5 in
/usr/local/lib/python3.10/dist-packages (from aiohttp->datasets)
(6.0.5)
Requirement already satisfied: yarl<2.0,>=1.0 in
/usr/local/lib/python3.10/dist-packages (from aiohttp->datasets)
(1.9.4)
Requirement already satisfied: async-timeout<5.0,>=4.0 in
/usr/local/lib/python3.10/dist-packages (from aiohttp->datasets)
(4.0.3)
Requirement already satisfied: typing-extensions>=3.7.4.3 in
/usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.21.2-
>datasets) (4.12.2)
Requirement already satisfied: charset-normalizer<4,>=2 in
/usr/local/lib/python3.10/dist-packages (from requests>=2.32.1-
>datasets) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in
/usr/local/lib/python3.10/dist-packages (from requests>=2.32.1-
>datasets) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in
/usr/local/lib/python3.10/dist-packages (from requests>=2.32.1-
>datasets) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in
/usr/local/lib/python3.10/dist-packages (from requests>=2.32.1-
>datasets) (2024.6.2)
Requirement already satisfied: python-dateutil>=2.8.2 in
/usr/local/lib/python3.10/dist-packages (from pandas->datasets)
(2.8.2)
Requirement already satisfied: pytz>=2020.1 in
/usr/local/lib/python3.10/dist-packages (from pandas->datasets)
(2023.4)
Requirement already satisfied: tzdata>=2022.1 in
/usr/local/lib/python3.10/dist-packages (from pandas->datasets)
(2024.1)
Requirement already satisfied: six>=1.5 in
/usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2-
>pandas->datasets) (1.16.0)
Installing collected packages: xxhash, requests, dill, multiprocess,
datasets
Attempting uninstall: requests
Found existing installation: requests 2.31.0
Uninstalling requests-2.31.0:
Successfully uninstalled requests-2.31.0
ERROR: pip's dependency resolver does not currently take into account
all the packages that are installed. This behaviour is the source of
the following dependency conflicts.
google-colab 1.0.0 requires requests==2.31.0, but you have requests
2.32.3 which is incompatible.
Successfully installed datasets-2.19.2 dill-0.3.8 multiprocess-0.70.16
requests-2.32.3 xxhash-3.4.1

!pip install --upgrade tensorflow


!pip install --upgrade keras

Requirement already satisfied: tensorflow in
/opt/conda/lib/python3.10/site-packages (2.15.0)
Collecting tensorflow
Downloading tensorflow-2.16.1-cp310-cp310-
manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.3 kB)
Requirement already satisfied: absl-py>=1.0.0 in
/opt/conda/lib/python3.10/site-packages (from tensorflow) (1.4.0)
Requirement already satisfied: astunparse>=1.6.0 in
/opt/conda/lib/python3.10/site-packages (from tensorflow) (1.6.3)
Requirement already satisfied: flatbuffers>=23.5.26 in
/opt/conda/lib/python3.10/site-packages (from tensorflow) (23.5.26)
Requirement already satisfied: gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1
in /opt/conda/lib/python3.10/site-packages (from tensorflow) (0.5.4)
Requirement already satisfied: google-pasta>=0.1.1 in
/opt/conda/lib/python3.10/site-packages (from tensorflow) (0.2.0)
Requirement already satisfied: h5py>=3.10.0 in
/opt/conda/lib/python3.10/site-packages (from tensorflow) (3.10.0)
Requirement already satisfied: libclang>=13.0.0 in
/opt/conda/lib/python3.10/site-packages (from tensorflow) (16.0.6)
Collecting ml-dtypes~=0.3.1 (from tensorflow)
Downloading ml_dtypes-0.3.2-cp310-cp310-
manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (20 kB)
Requirement already satisfied: opt-einsum>=2.3.2 in
/opt/conda/lib/python3.10/site-packages (from tensorflow) (3.3.0)
Requirement already satisfied: packaging in
/opt/conda/lib/python3.10/site-packages (from tensorflow) (21.3)
Requirement already satisfied: protobuf!=4.21.0,!=4.21.1,!=4.21.2,!
=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3 in
/opt/conda/lib/python3.10/site-packages (from tensorflow) (3.20.3)
Requirement already satisfied: requests<3,>=2.21.0 in
/opt/conda/lib/python3.10/site-packages (from tensorflow) (2.32.3)
Requirement already satisfied: setuptools in
/opt/conda/lib/python3.10/site-packages (from tensorflow) (69.0.3)
Requirement already satisfied: six>=1.12.0 in
/opt/conda/lib/python3.10/site-packages (from tensorflow) (1.16.0)
Requirement already satisfied: termcolor>=1.1.0 in
/opt/conda/lib/python3.10/site-packages (from tensorflow) (2.4.0)
Requirement already satisfied: typing-extensions>=3.6.6 in
/opt/conda/lib/python3.10/site-packages (from tensorflow) (4.9.0)
Requirement already satisfied: wrapt>=1.11.0 in
/opt/conda/lib/python3.10/site-packages (from tensorflow) (1.14.1)
Requirement already satisfied: grpcio<2.0,>=1.24.3 in
/opt/conda/lib/python3.10/site-packages (from tensorflow) (1.59.3)
Collecting tensorboard<2.17,>=2.16 (from tensorflow)
Downloading tensorboard-2.16.2-py3-none-any.whl.metadata (1.6 kB)
Requirement already satisfied: keras>=3.0.0 in
/opt/conda/lib/python3.10/site-packages (from tensorflow) (3.3.3)
Requirement already satisfied: tensorflow-io-gcs-filesystem>=0.23.1 in
/opt/conda/lib/python3.10/site-packages (from tensorflow) (0.35.0)
Requirement already satisfied: numpy<2.0.0,>=1.23.5 in
/opt/conda/lib/python3.10/site-packages (from tensorflow) (1.26.4)
Requirement already satisfied: wheel<1.0,>=0.23.0 in
/opt/conda/lib/python3.10/site-packages (from astunparse>=1.6.0-
>tensorflow) (0.42.0)
Requirement already satisfied: rich in /opt/conda/lib/python3.10/site-
packages (from keras>=3.0.0->tensorflow) (13.7.0)
Requirement already satisfied: namex in
/opt/conda/lib/python3.10/site-packages (from keras>=3.0.0-
>tensorflow) (0.0.8)
Requirement already satisfied: optree in
/opt/conda/lib/python3.10/site-packages (from keras>=3.0.0-
>tensorflow) (0.11.0)
Requirement already satisfied: charset-normalizer<4,>=2 in
/opt/conda/lib/python3.10/site-packages (from requests<3,>=2.21.0-
>tensorflow) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in
/opt/conda/lib/python3.10/site-packages (from requests<3,>=2.21.0-
>tensorflow) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in
/opt/conda/lib/python3.10/site-packages (from requests<3,>=2.21.0-
>tensorflow) (1.26.18)
Requirement already satisfied: certifi>=2017.4.17 in
/opt/conda/lib/python3.10/site-packages (from requests<3,>=2.21.0-
>tensorflow) (2024.2.2)
Requirement already satisfied: markdown>=2.6.8 in
/opt/conda/lib/python3.10/site-packages (from tensorboard<2.17,>=2.16-
>tensorflow) (3.5.2)
Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0
in /opt/conda/lib/python3.10/site-packages (from
tensorboard<2.17,>=2.16->tensorflow) (0.7.2)
Requirement already satisfied: werkzeug>=1.0.1 in
/opt/conda/lib/python3.10/site-packages (from tensorboard<2.17,>=2.16-
>tensorflow) (3.0.3)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in
/opt/conda/lib/python3.10/site-packages (from packaging->tensorflow)
(3.1.1)
Requirement already satisfied: MarkupSafe>=2.1.1 in
/opt/conda/lib/python3.10/site-packages (from werkzeug>=1.0.1-
>tensorboard<2.17,>=2.16->tensorflow) (2.1.3)
Requirement already satisfied: markdown-it-py>=2.2.0 in
/opt/conda/lib/python3.10/site-packages (from rich->keras>=3.0.0-
>tensorflow) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in
/opt/conda/lib/python3.10/site-packages (from rich->keras>=3.0.0-
>tensorflow) (2.17.2)
Requirement already satisfied: mdurl~=0.1 in
/opt/conda/lib/python3.10/site-packages (from markdown-it-py>=2.2.0-
>rich->keras>=3.0.0->tensorflow) (0.1.2)
Downloading tensorflow-2.16.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (589.8 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 589.8/589.8 MB 2.1 MB/s eta 0:00:00
Downloading ml_dtypes-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.2/2.2 MB 3.6 MB/s eta 0:00:00
Downloading tensorboard-2.16.2-py3-none-any.whl (5.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.5/5.5 MB 11.8 MB/s eta 0:00:00
Installing collected packages: ml-dtypes, tensorboard, tensorflow
Attempting uninstall: ml-dtypes
Found existing installation: ml-dtypes 0.2.0
Uninstalling ml-dtypes-0.2.0:
Successfully uninstalled ml-dtypes-0.2.0
Attempting uninstall: tensorboard
Found existing installation: tensorboard 2.15.1
Uninstalling tensorboard-2.15.1:
Successfully uninstalled tensorboard-2.15.1
Attempting uninstall: tensorflow
Found existing installation: tensorflow 2.15.0
Uninstalling tensorflow-2.15.0:
Successfully uninstalled tensorflow-2.15.0
ERROR: pip's dependency resolver does not currently take into account
all the packages that are installed. This behaviour is the source of
the following dependency conflicts.
tensorflow-decision-forests 1.8.1 requires wurlitzer, which is not
installed.
tensorflow-decision-forests 1.8.1 requires tensorflow~=2.15.0, but you
have tensorflow 2.16.1 which is incompatible.
tensorflow-text 2.15.0 requires tensorflow<2.16,>=2.15.0;
platform_machine != "arm64" or platform_system != "Darwin", but you
have tensorflow 2.16.1 which is incompatible.
tf-keras 2.15.1 requires tensorflow<2.16,>=2.15, but you have
tensorflow 2.16.1 which is incompatible.
Successfully installed ml-dtypes-0.3.2 tensorboard-2.16.2 tensorflow-
2.16.1
Requirement already satisfied: keras in
/opt/conda/lib/python3.10/site-packages (3.3.3)
Requirement already satisfied: absl-py in
/opt/conda/lib/python3.10/site-packages (from keras) (1.4.0)
Requirement already satisfied: numpy in
/opt/conda/lib/python3.10/site-packages (from keras) (1.26.4)
Requirement already satisfied: rich in /opt/conda/lib/python3.10/site-
packages (from keras) (13.7.0)
Requirement already satisfied: namex in
/opt/conda/lib/python3.10/site-packages (from keras) (0.0.8)
Requirement already satisfied: h5py in /opt/conda/lib/python3.10/site-
packages (from keras) (3.10.0)
Requirement already satisfied: optree in
/opt/conda/lib/python3.10/site-packages (from keras) (0.11.0)
Requirement already satisfied: ml-dtypes in
/opt/conda/lib/python3.10/site-packages (from keras) (0.3.2)
Requirement already satisfied: typing-extensions>=4.0.0 in
/opt/conda/lib/python3.10/site-packages (from optree->keras) (4.9.0)
Requirement already satisfied: markdown-it-py>=2.2.0 in
/opt/conda/lib/python3.10/site-packages (from rich->keras) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in
/opt/conda/lib/python3.10/site-packages (from rich->keras) (2.17.2)
Requirement already satisfied: mdurl~=0.1 in
/opt/conda/lib/python3.10/site-packages (from markdown-it-py>=2.2.0-
>rich->keras) (0.1.2)

Download the dataset

from datasets import load_dataset
from keras.layers import TextVectorization
import tensorflow as tf
import numpy as np

np.random.seed(42)
ds = load_dataset('generics_kb', trust_remote_code=True)['train']

{"model_id":"32f594dad7bb462483aa9caa0096d7c1","version_major":2,"vers
ion_minor":0}

{"model_id":"2bbb85d1a022486d816a2f45dfa10600","version_major":2,"vers
ion_minor":0}

{"model_id":"2342d6b9aba94dbc8303bc3d63c12d2f","version_major":2,"vers
ion_minor":0}

{"model_id":"fcff0647b7384af0bd83efde31bc248f","version_major":2,"vers
ion_minor":0}

Keep only the rows whose sentence contains more than 8 words.

ds = ds.filter(lambda row: len(row["generic_sentence"].split(" ")) > 8)
corpus = ['<start> ' + row['generic_sentence'].replace(",", " <comma>") + ' <end>' for row in ds]
corpus = np.array(corpus)

{"model_id":"36c5d4de6a8f4f97836e832102855de6","version_major":2,"vers
ion_minor":0}

Create a tokenizer and a detokenizer

# max_tokens keeps only the most frequent words; the adapted vocabulary is ordered
# from the most frequent token to the least frequent one.
tokenizer = TextVectorization(
    max_tokens=10000,
    standardize="lower_and_strip_punctuation",
    encoding="utf-8",
)
tokenizer.adapt(corpus)

class TextDetokenizer:
    def __init__(self, vectorize_layer):
        self.vectorize_layer = vectorize_layer
        vocab = self.vectorize_layer.get_vocabulary()
        self.index_to_word = {index: word for index, word in enumerate(vocab)}

    def __detokenize_tokens(self, tokens):
        def check_token(t):
            if t == 3:
                s = "<start>"
            elif t == 2:
                s = "<end>"
            elif t == 7:
                s = "<comma>"
            else:
                s = self.index_to_word.get(t, '[UNK]')
            return s

        return ' '.join([check_token(token) for token in tokens if token != 0])

    def __call__(self, batch_tokens):
        return [self.__detokenize_tokens(tokens) for tokens in batch_tokens]

detokenizer = TextDetokenizer(tokenizer)

sentences = tokenizer(corpus).numpy()

Remove from the corpus the sentences where any unknown word appears

mask = np.sum((sentences == 1), axis=1) >= 1
original_data = np.delete(sentences, mask, axis=0)

original_data.shape

(241236, 28)
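As a quick sanity check, one could run a round trip through the tokenizer and the detokenizer defined above on a single sentence (a minimal sketch; the printed ids depend on the adapted vocabulary):

sample = corpus[0]
ids = tokenizer([sample]).numpy()

print("sentence   :", sample)
print("token ids  :", ids[0])
print("detokenized:", detokenizer(ids)[0])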

Shuffle the words within each sentence

from tensorflow.keras.utils import PyDataset

class DataGenerator(PyDataset):
    def __init__(self, data, batch_size=32, shuffle=True, seed=42):
        self.data = data
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.seed = seed
        self.on_epoch_end()

    def __len__(self):
        return int(np.floor(len(self.data) / self.batch_size))

    def __getitem__(self, index):
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        data_batch = np.array([self.data[k] for k in indexes])

        # copy of the ordered sequences
        result = np.copy(data_batch)
        # shuffle only the relevant (non-padding) positions of each sentence
        for i in range(data_batch.shape[0]):
            np.random.shuffle(data_batch[i, 1:data_batch[i].argmin() - 1])

        # decoder input: the ordered sentence without the <start> token
        decoder_input = np.array([[result[i][j] for j in range(1, len(result[i]))] for i in range(len(result))])
        # target: the ordered sentence without its last position
        target = np.array([[result[i][j] for j in range(len(result[i]) - 1)] for i in range(len(result))])

        return (data_batch, decoder_input), target

    def on_epoch_end(self):
        self.indexes = np.arange(len(self.data))
        if self.shuffle:
            if self.seed is not None:
                np.random.seed(self.seed)
            np.random.shuffle(self.indexes)

# Make a random permutation of the data before splitting into training and test set
np.random.seed(42)

# Shuffle all the data
shuffled_indices = np.random.permutation(len(original_data))
shuffled_data = original_data[shuffled_indices]

# Split the dataset
train_generator = DataGenerator(shuffled_data[:220000])
test_generator = DataGenerator(shuffled_data[220000:])

(x, y), z = test_generator.__getitem__(1)

x = detokenizer(x)
y = detokenizer(y)
z = detokenizer(z)

for i in range(7):
    print("shuffled: ", x[i])
    print("original shifted: ", y[i])
    print("original: ", z[i])
    print("\n")

shuffled: <start> large their areas for cattle ranchers rainforest clear pastures become to of <end>
original shifted: ranchers clear large areas of rainforest to become pastures for their cattle <end>
original: <start> ranchers clear large areas of rainforest to become pastures for their cattle <end>

shuffled: <start> stripes thorax some and the earwigs on abdomen have <end>
original shifted: some earwigs have stripes on the thorax and abdomen <end>
original: <start> some earwigs have stripes on the thorax and abdomen <end>

shuffled: <start> into in magnetic such a liquid molecules can manipulation computing turn devices <end>
original shifted: magnetic manipulation can turn molecules in a liquid into computing such devices <end>
original: <start> magnetic manipulation can turn molecules in a liquid into computing such devices <end>

shuffled: <start> reduced wetlands and recreation for water places healthy cleaner flooding <comma> means more <end>
original shifted: healthy wetlands means cleaner water <comma> reduced flooding and more places for recreation <end>
original: <start> healthy wetlands means cleaner water <comma> reduced flooding and more places for recreation <end>

shuffled: <start> company percent share one controls a sales in market is share the particular in market <end>
original shifted: market share is the percent share in sales one company controls in a particular market <end>
original: <start> market share is the percent share in sales one company controls in a particular market <end>

shuffled: <start> of on animal only a the small flies time amount spend face <end>
original shifted: face flies spend only a small amount of time on the animal <end>
original: <start> face flies spend only a small amount of time on the animal <end>

shuffled: <start> extremely management in of foods are prevention and cancer important organic <end>
original shifted: organic foods are extremely important in prevention and management of cancer <end>
original: <start> organic foods are extremely important in prevention and management of cancer <end>

Metrics
Let s be the source string and p your prediction. The quality of the results will be measured
according to the following metric:

1. look for the longest common substring w between s and p
2. compute |w| / max(|s|, |p|)

If the match is exact, the score is 1.

When computing the score, you should NOT consider the start and end tokens.

The longest common substring can be computed with the SequenceMatcher class from difflib,
which allows a simple definition of our metric.

from difflib import SequenceMatcher

def score(s, p):
    match = SequenceMatcher(None, s, p).find_longest_match()
    return match.size / max(len(p), len(s))

Let's do an example.

original = "at first henry wanted to be friends with the king of


france"
generated = "henry wanted to be friends with king of france at the
first"

print("your score is ",score(original,generated))

your score is 0.5423728813559322

The score must be computed as an average over at least 3K random examples taken from the test
set.
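For instance, the evaluation loop could be organized as in the following sketch, which strips the <start> and <end> markers before scoring and averages the metric over a list of reference/prediction pairs (strip_special and evaluate_many are hypothetical helper names, not part of the assignment):

def strip_special(sentence):
    # Drop the <start> and <end> markers so they do not influence the score.
    return ' '.join(w for w in sentence.split() if w not in ("<start>", "<end>"))

def evaluate_many(references, predictions):
    # Average the metric over reference/prediction string pairs.
    scores = [score(strip_special(r), strip_special(p))
              for r, p in zip(references, predictions)]
    return sum(scores) / len(scores)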

What to deliver
You are supposed to deliver a single notebook, suitably commented. The notebook should
describe a single model, although you may briefly discuss additional attempts you made.

The notebook should contain a full trace of the training. Weights should be made available on
request.

You must also give a clear assessment of the performance of the model, computed with the
metric that has been given to you.

Good work!
For this task we decided to implement a Transformer: we therefore define an Encoder block and a
Decoder block, each built around multiple attention heads.

Model Definition
Useful parameters:

1. LEN_VOC: Length of the vocabulary considered
2. LEN_SENT: Maximum length of the input sentence
3. LEN_TARGET_SENT: Maximum length of the target sentence
4. LEN_EMBED: Dimension of the embedding space
5. HEADS_1: Number of heads for the first attention layer
6. HEADS_2: Number of heads for the second attention layer
7. LEN_FF: Dimension of the feed-forward layer
8. DROPOUT: Dropout rate

LEN_VOC = 10000
LEN_SENT = 28
LEN_TARGET_SENT= 27
LEN_EMBED = 64
HEADS_1 = 6
HEADS_2 = 6
LEN_FF = 256
DROPOUT=0.01

Encoder Layer
The Encoder block consists of an embedding stage (token and positional encoding: the token
itself and the position it occupies in the sentence) and several EncoderLayers (stacking more
layers is expected to give a more accurate solution). Each EncoderLayer is composed of two
MultiHeadAttention layers and a feed-forward layer; after every sub-layer we add a residual
connection and apply layer normalization to keep the gradients stable.

class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, num_heads_1, num_heads_2, ff_size, embed_size):
        super().__init__()

        # First self-attention layer + normalization layer
        self.multihead1 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads_1, key_dim=LEN_EMBED)
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        # Second self-attention layer + normalization layer
        self.multihead2 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads_2, key_dim=LEN_EMBED)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        # Feed-forward dense layers + normalization layer
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(ff_size, activation="leaky_relu"),
            tf.keras.layers.Dense(LEN_EMBED),
        ])
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        # Dropout layer
        self.dropout = tf.keras.layers.Dropout(rate=DROPOUT)

    def call(self, inputs):
        # Self-attention section 1
        attn_output_1 = self.multihead1(inputs, inputs)
        out_1 = self.layernorm1(inputs + attn_output_1)

        # Self-attention section 2
        attn_output_2 = self.multihead2(out_1, out_1)
        out_2 = self.layernorm2(out_1 + attn_output_2)

        # Feed-forward section
        ffn_output = self.ffn(out_2)
        ffn_output = self.dropout(ffn_output)
        out_3 = self.layernorm3(out_2 + ffn_output)

        return out_3

class Encoder(tf.keras.layers.Layer):
    def __init__(self, num_heads_1, num_heads_2, ff_size, embed_size):
        super().__init__()
        self.token_embedding = tf.keras.layers.Embedding(input_dim=LEN_VOC, output_dim=LEN_EMBED, mask_zero=True)
        self.pos_embedding = tf.keras.layers.Embedding(input_dim=LEN_SENT, output_dim=LEN_EMBED)
        self.encoder_1 = EncoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.encoder_2 = EncoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.encoder_3 = EncoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.encoder_4 = EncoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.encoder_5 = EncoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)

    def call(self, inputs):
        x = inputs
        # Token and position encoding section
        maxlen = tf.shape(x)[-1]
        positions = tf.keras.ops.arange(start=0, stop=maxlen, step=1)
        positions = self.pos_embedding(positions)
        x = self.token_embedding(x)  # shape (batch_size, len_sent, len_emb)

        # add the positional encoding, which the token embedding alone does not capture
        out_1 = x + positions

        out_2 = self.encoder_1(out_1)
        out_3 = self.encoder_2(out_2)
        out_4 = self.encoder_3(out_3)
        out_5 = self.encoder_4(out_4)
        out_6 = self.encoder_5(out_5)
        return out_6

Decoder Layer
The Decoder block is structured similarly: an embedding stage followed by a stack of
DecoderLayers, as in the Encoder described above. Each DecoderLayer starts with a causal
self-attention layer, which only attends to the tokens already predicted/observed, followed by
two attention layers that attend over the encoder output. As in the Encoder, every sub-layer is
followed by a residual connection and layer normalization to keep the gradients stable.

class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, num_heads_1, num_heads_2, ff_size, embed_size):
        super().__init__()

        self.multihead_mask = tf.keras.layers.MultiHeadAttention(num_heads=num_heads_1, key_dim=LEN_EMBED)
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.multihead1 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads_1, key_dim=LEN_EMBED)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.multihead2 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads_2, key_dim=LEN_EMBED)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(ff_size, activation="leaky_relu"),
            tf.keras.layers.Dense(LEN_EMBED),
        ])
        self.layernorm4 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout = tf.keras.layers.Dropout(rate=DROPOUT)

    def call(self, encoder_out, decoder_inp_embed):
        # Causal self-attention section
        causal_output = self.multihead_mask(decoder_inp_embed, decoder_inp_embed, use_causal_mask=True)
        out_1 = self.layernorm1(decoder_inp_embed + causal_output)

        # Attention section 1 over the encoder output (queries from the decoder)
        attn_output_1 = self.multihead1(out_1, encoder_out)
        out_2 = self.layernorm2(out_1 + attn_output_1)

        # Attention section 2 over the encoder output
        attn_output_2 = self.multihead2(out_2, encoder_out)
        out_3 = self.layernorm3(out_2 + attn_output_2)

        # Feed-forward section
        ffn_output = self.ffn(out_3)
        ffn_output = self.dropout(ffn_output)
        out_4 = self.layernorm4(out_3 + ffn_output)

        return out_4

class Decoder(tf.keras.layers.Layer):
    def __init__(self, num_heads_1, num_heads_2, ff_size, embed_size):
        super().__init__()
        self.token_embedding = tf.keras.layers.Embedding(input_dim=LEN_VOC, output_dim=LEN_EMBED, mask_zero=True)
        self.pos_embedding = tf.keras.layers.Embedding(input_dim=LEN_TARGET_SENT, output_dim=LEN_EMBED)
        self.decoder_1 = DecoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.decoder_2 = DecoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.decoder_3 = DecoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.decoder_4 = DecoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.decoder_5 = DecoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.outlayer = tf.keras.layers.Dense(LEN_VOC, activation='softmax')

    def call(self, encoder_out, decoder_inp):
        x = decoder_inp
        # Token and position encoding section
        maxlen = tf.shape(x)[-1]
        positions = tf.keras.ops.arange(start=0, stop=maxlen, step=1)
        positions = self.pos_embedding(positions)
        x = self.token_embedding(x)
        out_1 = x + positions

        out_2 = self.decoder_1(encoder_out, out_1)
        out_3 = self.decoder_2(encoder_out, out_2)
        out_4 = self.decoder_3(encoder_out, out_3)
        out_5 = self.decoder_4(encoder_out, out_4)
        out_6 = self.decoder_5(encoder_out, out_5)

        return self.outlayer(out_6)

Transformer
We merge the Encoder and the Decoder blocks together. We also override the predict method of
the Model class to generate the reordered sentences autoregressively.

class Transformer(tf.keras.Model):
    def __init__(self, num_heads_1, num_heads_2, ff_size, embed_size):
        super().__init__()
        self.encoder = Encoder(num_heads_1, num_heads_2, ff_size, embed_size)
        self.decoder = Decoder(num_heads_1, num_heads_2, ff_size, embed_size)

    def generate_initial_decoder_input(self, batch_size):
        start_token = tf.constant([3], dtype=tf.int32)  # 3 is the <start> token id
        return tf.tile(tf.expand_dims(start_token, 0), [batch_size, 1])

    def call(self, encoder_inp, training):
        encoder_input, decoder_inp = encoder_inp
        encoder_out = self.encoder(encoder_input)
        decoder_out = self.decoder(encoder_out, decoder_inp)
        return decoder_out

    def predict(self, x, *args, **kwargs):
        encoder_input, decoder_inputs = x

        max_length = 28
        batch_size = encoder_input.shape[0]
        output_array = tf.TensorArray(dtype=tf.int64, size=0, dynamic_size=True)

        # 3 and 2 are the ids of the <start> and <end> tokens in the adapted
        # vocabulary (see the detokenizer above).
        start_id, end_id = 3, 2
        start = np.array(start_id, ndmin=1)
        output_array = output_array.write(0, tf.tile(start, [batch_size]))

        for i in tf.range(max_length - 1):
            output = tf.transpose(output_array.stack())
            predictions = self([encoder_input, output], training=False)

            # Select the last token from the seq_len dimension.
            predictions = predictions[:, -1:, :]  # Shape (batch_size, 1, vocab_size).
            predicted_id = tf.argmax(predictions, axis=-1)

            # Concatenate the predicted_id to the output which is given to the
            # decoder as its input.
            output_array = output_array.write(i + 1, predicted_id[:, 0])

            # Stop early once every sentence in the batch has produced <end>.
            end_mask = tf.reduce_any(tf.equal(predicted_id, end_id), axis=-1)
            if tf.reduce_all(end_mask):
                break

        output = tf.transpose(output_array.stack())
        self([encoder_input, output[:, :-1]], training=False)

        return output

We instantiate the model (the number of trainable parameters is kept low in order to stay well
below the 20M parameter limit).

training = False
inputs = tf.keras.Input(shape=(LEN_SENT,))
target = tf.keras.Input(shape=(LEN_TARGET_SENT,))
outputs = Transformer(HEADS_1, HEADS_2, LEN_FF, LEN_EMBED)(encoder_inp=[inputs, target], training=training)
model = tf.keras.Model(inputs=[inputs, target], outputs=outputs)
model.summary()

Model: "functional_23"

┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━
━━━━━┓
┃ Layer (type) ┃ Output Shape ┃ Param # ┃ Connected to

┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━
━━━━━┩
│ input_layer_12 │ (None, 28) │ 0 │ -

│ (InputLayer) │ │ │

├─────────────────────┼───────────────────┼────────────┼──────────────
─────┤
│ input_layer_13 │ (None, 27) │ 0 │ -

│ (InputLayer) │ │ │

├─────────────────────┼───────────────────┼────────────┼──────────────
─────┤
│ transformer_1 │ (None, 27, 10000) │ 4,756,880 │
input_layer_12[0… │
│ (Transformer) │ │ │
input_layer_13[0… │
└─────────────────────┴───────────────────┴────────────┴──────────────
─────┘

Total params: 4,756,880 (18.15 MB)

Trainable params: 4,756,880 (18.15 MB)


Non-trainable params: 0 (0.00 B)

Training
Now, after creating the generators for the training, the validation and the testing (all mantaining
the proportions between training and testing) we procede with the training of the model.

train_generator = DataGenerator(original_data[:210000], batch_size=32)
validation_generator = DataGenerator(original_data[210000:220000], batch_size=32)
test_generator = DataGenerator(original_data[220000:], batch_size=32)

opt = tf.keras.optimizers.AdamW(0.00005, gradient_accumulation_steps=4)
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history = model.fit(train_generator, batch_size=32, epochs=10, validation_data=validation_generator)

model.summary()

Epoch 1/10
W0000 00:00:1718178213.958220     125 assert_op.cc:38] Ignoring Assert operator compile_loss/sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/assert_equal_1/Assert/Assert
6562/6562 ━━━━━━━━━━━━━━━━━━━━ 0s 36ms/step - accuracy: 0.4718 - loss: 6.3240
W0000 00:00:1718178494.752006     126 assert_op.cc:38] Ignoring Assert operator compile_loss/sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/assert_equal_1/Assert/Assert
6562/6562 ━━━━━━━━━━━━━━━━━━━━ 352s 37ms/step - accuracy: 0.4718 - loss: 6.3237 - val_accuracy: 0.6671 - val_loss: 2.7680
Epoch 2/10
6562/6562 ━━━━━━━━━━━━━━━━━━━━ 236s 36ms/step - accuracy: 0.6872 -
loss: 2.4084 - val_accuracy: 0.7609 - val_loss: 1.7253
Epoch 3/10
6562/6562 ━━━━━━━━━━━━━━━━━━━━ 235s 36ms/step - accuracy: 0.7876 -
loss: 1.5353 - val_accuracy: 0.8444 - val_loss: 1.2168
Epoch 4/10
6562/6562 ━━━━━━━━━━━━━━━━━━━━ 238s 36ms/step - accuracy: 0.8699 -
loss: 1.0521 - val_accuracy: 0.8996 - val_loss: 0.8641
Epoch 5/10
6562/6562 ━━━━━━━━━━━━━━━━━━━━ 237s 36ms/step - accuracy: 0.9210 -
loss: 0.7203 - val_accuracy: 0.9359 - val_loss: 0.6172
Epoch 6/10
6562/6562 ━━━━━━━━━━━━━━━━━━━━ 235s 36ms/step - accuracy: 0.9524 -
loss: 0.4955 - val_accuracy: 0.9590 - val_loss: 0.4497
Epoch 7/10
6562/6562 ━━━━━━━━━━━━━━━━━━━━ 235s 36ms/step - accuracy: 0.9717 -
loss: 0.3472 - val_accuracy: 0.9733 - val_loss: 0.3293
Epoch 8/10
6562/6562 ━━━━━━━━━━━━━━━━━━━━ 237s 36ms/step - accuracy: 0.9837 -
loss: 0.2410 - val_accuracy: 0.9829 - val_loss: 0.2404
Epoch 9/10
6562/6562 ━━━━━━━━━━━━━━━━━━━━ 235s 36ms/step - accuracy: 0.9913 -
loss: 0.1662 - val_accuracy: 0.9878 - val_loss: 0.1776
Epoch 10/10
6562/6562 ━━━━━━━━━━━━━━━━━━━━ 235s 36ms/step - accuracy: 0.9956 -
loss: 0.1142 - val_accuracy: 0.9920 - val_loss: 0.1313

Model: "functional_23"

┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━
━━━━━┓
┃ Layer (type) ┃ Output Shape ┃ Param # ┃ Connected to

┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━
━━━━━┩
│ input_layer_12 │ (None, 28) │ 0 │ -

│ (InputLayer) │ │ │

├─────────────────────┼───────────────────┼────────────┼──────────────
─────┤
│ input_layer_13 │ (None, 27) │ 0 │ -

│ (InputLayer) │ │ │

├─────────────────────┼───────────────────┼────────────┼──────────────
─────┤
│ transformer_1 │ (None, 27, 10000) │ 4,756,880 │
input_layer_12[0… │
│ (Transformer) │ │ │
input_layer_13[0… │
└─────────────────────┴───────────────────┴────────────┴──────────────
─────┘

Total params: 19,027,522 (72.58 MB)

Trainable params: 4,756,880 (18.15 MB)

Non-trainable params: 0 (0.00 B)

Optimizer params: 14,270,642 (54.44 MB)


As the summary shows, the model itself has 4,756,880 trainable parameters, comfortably below the
20M constraint; the roughly 19M total reported after training also includes the 14,270,642
optimizer (AdamW) parameters.
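If one wants to double-check the constraint programmatically, a minimal sketch (assuming the model object built above) is:

import numpy as np

# Count only the weights of the network itself; the optimizer slots (AdamW moments)
# reported in the summary after training are not part of the model.
trainable_params = int(sum(np.prod(w.shape) for w in model.trainable_weights))
print(f"Trainable parameters: {trainable_params:,}")
assert trainable_params < 20_000_000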

Testing
# Function to determine the average score over a number of test batches
def calc_score(num_batches, generator, myModel, detokenizer, score_function):
    list_scores = []
    for k in range(num_batches):
        x, y = generator.__getitem__(k)
        predictions = myModel.predict(x, batch_size=32, verbose=False)
        # take the most probable token at every position of every sentence
        best = [[np.argmax(predictions[t][r]) for r in range(len(predictions[t]))]
                for t in range(len(predictions))]
        references = detokenizer(y)
        decoded = detokenizer(best)
        for i in range(len(y)):
            list_scores.append(score_function(references[i], decoded[i]))

    return np.average(list_scores), np.std(list_scores), list_scores

# Testing the model
batches = round((len(original_data) - 220000) / 32)
score_value, std, scores = calc_score(batches, test_generator, model, detokenizer, score)
print(f"Std is: {std}")
print(f"Average Score is: {score_value}")

I0000 00:00:1718180725.228800    5877 asm_compiler.cc:369] ptxas warning : Registers are spilled to local memory in function 'triton_gemm_dot_149', 108 bytes spill stores, 108 bytes spill loads
I0000 00:00:1718180725.877044    5874 asm_compiler.cc:369] ptxas warning : Registers are spilled to local memory in function 'triton_gemm_dot_147', 108 bytes spill stores, 108 bytes spill loads

Std is: 0.10079521260642532
Average Score is: 0.9667937337147346

We save the weights at the end of the run so that they are available at any time.

model.save_weights("model_weights.weights.h5")

Discussion about other possible models or configurations

During the project we also considered other models, such as LSTM-based sequence-to-sequence
architectures, which the literature reports as reasonably effective in this kind of task; however,
their training time was much higher, so the Transformer architecture was chosen. We also tested
using a different number of heads in the two attention layers, but that introduced a mismatch
between them which noticeably hurt the performance of the model.
