The purpose of this project is to take as input a sequence of words corresponding to a random permutation of a given English sentence, and to reconstruct the original sentence.
The output can be produced either in a single shot, or through an iterative (autoregressive) loop generating a single token at a time.
CONSTRAINTS:
BONUS PARAMETERS:
A bonus of 0-2 points will be awarded to encourage the adoption of models with a low number of parameters.
Dataset
The dataset is composed of sentences taken from the generics_kb dataset on Hugging Face. We restricted the vocabulary to the 10K most frequent words, and only kept sentences that use this vocabulary.
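A minimal sketch of how the corpus can be built with the datasets library is shown below; the configuration name, the generic_sentence field and the filtering code are assumptions, since the original loading cell is not reproduced here.

from collections import Counter
from datasets import load_dataset

# Load generics_kb from the Hugging Face hub (configuration name assumed).
ds = load_dataset("generics_kb", "generics_kb_best", split="train")
sentences = [example["generic_sentence"].lower() for example in ds]

# Keep the 10K most frequent words and only the sentences fully covered by them.
counts = Counter(word for sentence in sentences for word in sentence.split())
vocabulary = {word for word, _ in counts.most_common(10000)}
corpus = [s for s in sentences if all(word in vocabulary for word in s.split())]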
Collecting datasets
  Downloading datasets-2.19.2-py3-none-any.whl (542 kB)
  ... (requirement checks and dependency downloads omitted) ...
Installing collected packages: xxhash, requests, dill, multiprocess, datasets
  Attempting uninstall: requests
    Found existing installation: requests 2.31.0
    Uninstalling requests-2.31.0:
      Successfully uninstalled requests-2.31.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests==2.31.0, but you have requests 2.32.3 which is incompatible.
Successfully installed datasets-2.19.2 dill-0.3.8 multiprocess-0.70.16 requests-2.32.3 xxhash-3.4.1
{"model_id":"32f594dad7bb462483aa9caa0096d7c1","version_major":2,"vers
ion_minor":0}
{"model_id":"2bbb85d1a022486d816a2f45dfa10600","version_major":2,"vers
ion_minor":0}
{"model_id":"2342d6b9aba94dbc8303bc3d63c12d2f","version_major":2,"vers
ion_minor":0}
{"model_id":"fcff0647b7384af0bd83efde31bc248f","version_major":2,"vers
ion_minor":0}
{"model_id":"36c5d4de6a8f4f97836e832102855de6","version_major":2,"vers
ion_minor":0}
from tensorflow.keras.layers import TextVectorization

tokenizer = TextVectorization(
    max_tokens=10000,
    standardize="lower_and_strip_punctuation",
    encoding="utf-8",
)  # max_tokens keeps the most frequent words; the vocabulary is ordered from most to least frequent
tokenizer.adapt(corpus)
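As a quick check (not part of the original cell), the adapted layer maps raw sentences to integer token ids; index 0 is reserved for padding and index 1 for [UNK].

print(tokenizer.get_vocabulary()[:5])      # ['', '[UNK]', ...] most frequent words first
print(tokenizer(["Most birds can fly"]))   # one row of integer ids per input sentence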
class TextDetokenizer:
    def __init__(self, vectorize_layer):
        self.vectorize_layer = vectorize_layer
        vocab = self.vectorize_layer.get_vocabulary()
        self.index_to_word = {index: word for index, word in enumerate(vocab)}

    def __detokenize_tokens(self, tokens):
        def check_token(t):
            if t == 3:
                s = "<start>"
            elif t == 2:
                s = "<end>"
            elif t == 7:
                s = "<comma>"
            else:
                s = self.index_to_word.get(t, '[UNK]')
            return s
        # map every token id of the sentence back to its word and join them
        return " ".join(check_token(int(t)) for t in tokens)

    def __call__(self, batch):
        # detokenize a whole batch of token sequences
        return [self.__detokenize_tokens(tokens) for tokens in batch]
Remove from the corpus the sentences where any unknown word appears.
original_data.shape
(241236, 28)
import numpy as np
from tensorflow.keras.utils import PyDataset

class DataGenerator(PyDataset):
    def __init__(self, data, batch_size=32, shuffle=True, seed=42):
        super().__init__()  # initialize the PyDataset base class
        self.data = data
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.seed = seed
        self.on_epoch_end()

    def __len__(self):
        return int(np.floor(len(self.data) / self.batch_size))

    def on_epoch_end(self):
        self.indexes = np.arange(len(self.data))
        if self.shuffle:
            if self.seed is not None:
                np.random.seed(self.seed)
            np.random.shuffle(self.indexes)
for i in range(7):
    print("shuffled: ", x[i])
    print("original shifted: ", y[i])
    print("original: ", z[i])
    print("\n")
When computing the score, you should NOT consider the start and end tokens.
The longest common substring can be computed with the SequenceMatcher class of difflib, which allows a simple definition of our metric.
from difflib import SequenceMatcher

def score(s, p):
    # length of the longest common substring, normalized by the longer string
    match = SequenceMatcher(None, s, p).find_longest_match()
    return match.size / max(len(p), len(s))
Let's do an example.
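As an illustration (the strings below are made up, not taken from the dataset), we compare a short sentence with itself and with a permutation of itself:

s = "a b c d"          # original sentence
p = "d a b c"          # permuted prediction
print(score(s, s))     # 1.0, identical strings
print(score(s, p))     # 5/7 ≈ 0.71, the longest common substring is "a b c"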
The score must be computed as an average over at least 3K random examples taken from the test set.
What to deliver
You are supposed to deliver a single notebook, suitably commented. The notebook should describe a single model, although you may briefly discuss any additional attempts you made.
The notebook should contain a full trace of the training. Weights should be made available on
request.
You must also give a clear assessment of the performance of the model, computed with the metric that has been given to you.
Good work!
For this task we decided to implement a Transformer; to this end we define an Encoder block and a Decoder block, each built around multiple attention heads.
Model Definition
Useful parameters:
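The parameter cell is not reproduced in this export; the constants below are a sketch consistent with the names used throughout the notebook. LEN_VOC, LEN_SENT and LEN_TARGET_SENT match the shapes visible in the model summary, while the remaining values are assumptions.

LEN_VOC = 10000          # vocabulary size (10K most frequent words)
LEN_SENT = 28            # length of the padded shuffled input sentence
LEN_TARGET_SENT = 27     # length of the shifted target sentence
LEN_EMBED = 128          # embedding size (assumed)
LEN_FF = 512             # hidden units of the feed-forward sublayer (assumed)
HEADS_1 = 4              # attention heads, first group (assumed)
HEADS_2 = 4              # attention heads, second group (assumed)
DROPOUT = 0.1            # dropout rate (assumed)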
Encoder Layer
The Encoder block consists of an embedding layer (a token embedding plus a positional embedding, encoding the token itself and the position it occupies in the sentence) and several EncoderLayers (stacking more layers is expected to give a more accurate solution). Each EncoderLayer is composed of two MultiHeadAttention layers and a feed-forward layer; after every sublayer we apply normalization to help avoid gradient explosion.
class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, num_heads_1, num_heads_2, ff_size, embed_size):
        super().__init__()
        # Dropout layer
        self.dropout = tf.keras.layers.Dropout(rate=DROPOUT)

    def call(self, inputs):
        return out_3
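The attention and feed-forward sublayers of the EncoderLayer, and most of its forward pass, are truncated in this export. A plausible reconstruction, assuming residual connections and post-sublayer normalization as described above, is:

# Hypothetical reconstruction of the truncated EncoderLayer cell.
class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, num_heads_1, num_heads_2, ff_size, embed_size):
        super().__init__()
        self.multihead1 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads_1, key_dim=LEN_EMBED)
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.multihead2 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads_2, key_dim=LEN_EMBED)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(ff_size, activation="leaky_relu"),
            tf.keras.layers.Dense(LEN_EMBED),
        ])
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        # Dropout layer
        self.dropout = tf.keras.layers.Dropout(rate=DROPOUT)

    def call(self, inputs):
        # first self-attention sublayer, residual connection and normalization
        attn_1 = self.multihead1(query=inputs, value=inputs, key=inputs)
        out_1 = self.layernorm1(inputs + attn_1)
        # second self-attention sublayer, residual connection and normalization
        attn_2 = self.multihead2(query=out_1, value=out_1, key=out_1)
        out_2 = self.layernorm2(out_1 + attn_2)
        # position-wise feed-forward sublayer with dropout, residual and normalization
        ffn_out = self.dropout(self.ffn(out_2))
        out_3 = self.layernorm3(out_2 + ffn_out)
        return out_3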
class Encoder(tf.keras.layers.Layer):
    def __init__(self, num_heads_1, num_heads_2, ff_size, embed_size):
        super().__init__()
        self.token_embedding = tf.keras.layers.Embedding(input_dim=LEN_VOC, output_dim=LEN_EMBED, mask_zero=True)
        self.pos_embedding = tf.keras.layers.Embedding(input_dim=LEN_SENT, output_dim=LEN_EMBED)
        self.encoder_1 = EncoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.encoder_2 = EncoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.encoder_3 = EncoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.encoder_4 = EncoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.encoder_5 = EncoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)

    def call(self, inputs):
        x = inputs
        # Token and Position Encoding Section
        maxlen = tf.shape(x)[-1]
        positions = tf.keras.ops.arange(start=0, stop=maxlen, step=1)
        positions = self.pos_embedding(positions)
        x = self.token_embedding(x)  # shape (batch_size, len_sent, len_emb)
        out_1 = x + positions
        out_2 = self.encoder_1(out_1)
        out_3 = self.encoder_2(out_2)
        out_4 = self.encoder_3(out_3)
        out_5 = self.encoder_4(out_4)
        out_6 = self.encoder_5(out_5)
        return out_6
Decoder Layer
The Decoder block is structured similarly: an embedding layer followed by a series of DecoderLayers, as in the Encoder described above. In addition to the two self-attention layers, here we also have a causal (masked) attention layer, so that each position only considers the tokens already predicted/observed. In this block, as in the previous one, every sublayer is followed by normalization to help avoid gradient explosion.
class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, num_heads_1, num_heads_2, ff_size, embed_size):
        super().__init__()
        self.multihead_mask = tf.keras.layers.MultiHeadAttention(num_heads=num_heads_1, key_dim=LEN_EMBED)
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.multihead1 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads_1, key_dim=LEN_EMBED)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.multihead2 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads_2, key_dim=LEN_EMBED)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(ff_size, activation="leaky_relu"),
            tf.keras.layers.Dense(LEN_EMBED),
        ])
        self.layernorm4 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout = tf.keras.layers.Dropout(rate=DROPOUT)

    def call(self, decoder_inp, encoder_out):  # encoder_out assumed as the second argument
        return out_4
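The forward pass of the DecoderLayer is likewise truncated. A plausible sketch of the missing call method, assuming the masked layer performs causal self-attention and the other two attend to the encoder output, with a residual connection and normalization after every sublayer (the original text calls them self-attention layers, so the exact wiring is an assumption):

    # Hypothetical call method for the DecoderLayer above (wiring assumed).
    def call(self, decoder_inp, encoder_out):
        # causal self-attention: every position only attends to earlier positions
        attn_mask = self.multihead_mask(query=decoder_inp, value=decoder_inp,
                                        key=decoder_inp, use_causal_mask=True)
        out_1 = self.layernorm1(decoder_inp + attn_mask)
        # attention over the encoder output
        attn_1 = self.multihead1(query=out_1, value=encoder_out, key=encoder_out)
        out_2 = self.layernorm2(out_1 + attn_1)
        # second attention over the encoder output
        attn_2 = self.multihead2(query=out_2, value=encoder_out, key=encoder_out)
        out_3 = self.layernorm3(out_2 + attn_2)
        # feed-forward sublayer with dropout, residual and normalization
        ffn_out = self.dropout(self.ffn(out_3))
        out_4 = self.layernorm4(out_3 + ffn_out)
        return out_4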
class Decoder(tf.keras.layers.Layer):
    def __init__(self, num_heads_1, num_heads_2, ff_size, embed_size):
        super().__init__()
        self.token_embedding = tf.keras.layers.Embedding(input_dim=LEN_VOC, output_dim=LEN_EMBED, mask_zero=True)
        self.pos_embedding = tf.keras.layers.Embedding(input_dim=LEN_TARGET_SENT, output_dim=LEN_EMBED)
        self.decoder_1 = DecoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.decoder_2 = DecoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.decoder_3 = DecoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.decoder_4 = DecoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.decoder_5 = DecoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.outlayer = tf.keras.layers.Dense(LEN_VOC, activation='softmax')

    def call(self, decoder_inp, encoder_out):  # encoder_out argument assumed, fed to each DecoderLayer
        x = decoder_inp
        # Token and Position Encoding Section
        maxlen = tf.shape(x)[-1]
        positions = tf.keras.ops.arange(start=0, stop=maxlen, step=1)
        positions = self.pos_embedding(positions)
        x = self.token_embedding(x)
        out_1 = x + positions
        out_2 = self.decoder_1(out_1, encoder_out)
        out_3 = self.decoder_2(out_2, encoder_out)
        out_4 = self.decoder_3(out_3, encoder_out)
        out_5 = self.decoder_4(out_4, encoder_out)
        out_6 = self.decoder_5(out_5, encoder_out)
        return self.outlayer(out_6)
Transformer
We merge the Encoder and the Decoder blocks together. We also override the predict method of the Model class to generate the reordered sentences autoregressively.
class Transformer(tf.keras.Model):
    def __init__(self, num_heads_1, num_heads_2, ff_size, embed_size):
        super().__init__()
        self.encoder = Encoder(num_heads_1, num_heads_2, ff_size, embed_size)
        self.decoder = Decoder(num_heads_1, num_heads_2, ff_size, embed_size)

    def call(self, encoder_inp, training=False):
        return decoder_out

    def predict(self, encoder_input):
        max_length = 28
        batch_size = encoder_input.shape[0]
        output_array = tf.TensorArray(dtype=tf.int64, size=0, dynamic_size=True)
        for i in tf.range(max_length - 1):
            output = tf.transpose(output_array.stack())
            predictions = self([encoder_input, output], training=False)
            end_mask = tf.reduce_any(tf.equal(predicted_id, tokenizer.word_index['']), axis=-1)
            if tf.reduce_all(end_mask):
                break
        output = tf.transpose(output_array.stack())
        self([encoder_input, output[:, :-1]], training=False)
        return output
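Several steps of this loop are not visible in the export (writing the start token into the TensorArray, taking the argmax of the last prediction, and appending it). A self-contained sketch of the intended greedy decoding, using the ids that the detokenizer maps to <start> and <end> (3 and 2); the helper name and the use of tf.concat instead of a TensorArray are illustrative choices:

def greedy_predict(transformer, encoder_input, max_length=28, start_id=3, end_id=2):
    batch_size = tf.shape(encoder_input)[0]
    # every target sequence starts with the <start> token id
    output = tf.fill([batch_size, 1], tf.cast(start_id, tf.int64))
    for _ in range(max_length - 1):
        predictions = transformer([encoder_input, output], training=False)
        # most probable token at the last generated position
        predicted_id = tf.argmax(predictions[:, -1, :], axis=-1, output_type=tf.int64)
        output = tf.concat([output, tf.expand_dims(predicted_id, -1)], axis=-1)
        # stop once every sentence in the batch has produced <end>
        if tf.reduce_all(tf.reduce_any(tf.equal(output, end_id), axis=-1)):
            break
    return output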
We instantiate the model (the number of trainable parameters is kept low in order to stay well below the maximum limit of 20M parameters).
training = False
inputs = tf.keras.Input(shape=(LEN_SENT,))
target = tf.keras.Input(shape=(LEN_TARGET_SENT,))
outputs = Transformer(HEADS_1, HEADS_2, LEN_FF, LEN_EMBED)(encoder_inp=[inputs, target], training=training)
model = tf.keras.Model(inputs=[inputs, target], outputs=outputs)
model.summary()
Model: "functional_23"
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)        ┃ Output Shape      ┃   Param # ┃ Connected to      ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ input_layer_12      │ (None, 28)        │         0 │ -                 │
│ (InputLayer)        │                   │           │                   │
├─────────────────────┼───────────────────┼───────────┼───────────────────┤
│ input_layer_13      │ (None, 27)        │         0 │ -                 │
│ (InputLayer)        │                   │           │                   │
├─────────────────────┼───────────────────┼───────────┼───────────────────┤
│ transformer_1       │ (None, 27, 10000) │ 4,756,880 │ input_layer_12[0… │
│ (Transformer)       │                   │           │ input_layer_13[0… │
└─────────────────────┴───────────────────┴───────────┴───────────────────┘
Training
Now, after creating the generators for the training, the validation and the testing (all mantaining
the proportions between training and testing) we procede with the training of the model.
opt = tf.keras.optimizers.AdamW(0.00005, gradient_accumulation_steps=4)
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history = model.fit(train_generator, batch_size=32, epochs=10, validation_data=validation_generator)
model.summary()
Epoch 1/10
Model: "functional_23"
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)        ┃ Output Shape      ┃   Param # ┃ Connected to      ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ input_layer_12      │ (None, 28)        │         0 │ -                 │
│ (InputLayer)        │                   │           │                   │
├─────────────────────┼───────────────────┼───────────┼───────────────────┤
│ input_layer_13      │ (None, 27)        │         0 │ -                 │
│ (InputLayer)        │                   │           │                   │
├─────────────────────┼───────────────────┼───────────┼───────────────────┤
│ transformer_1       │ (None, 27, 10000) │ 4,756,880 │ input_layer_12[0… │
│ (Transformer)       │                   │           │ input_layer_13[0… │
└─────────────────────┴───────────────────┴───────────┴───────────────────┘
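The per-epoch progress lines are not reproduced above. The recorded history can be inspected as usual; a minimal sketch (not part of the original notebook):

import matplotlib.pyplot as plt

# Compare training and validation loss across the 10 epochs.
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('sparse categorical cross-entropy')
plt.legend()
plt.show()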
Testing
# function to determine the average score
def calc_score(num_batches, generator, myModel, detokenizer, score_function):
    list_scores = []
    for k in range(num_batches):
        x, y = generator.__getitem__(k)
        predictions = myModel.predict(x, batch_size=32, verbose=False)
        # greedy choice: most probable token at every position of every sentence
        best = [[np.argmax(predictions[t][r]) for r in range(len(predictions[t]))]
                for t in range(len(predictions))]
        for i in range(len(x)):
            list_scores.append(score_function(detokenizer(y)[i], detokenizer(best)[i]))
    return np.mean(list_scores)
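With a batch size of 32, about 100 batches cover the required minimum of 3K test examples. The call below is a sketch, assuming a test_generator built like the training and validation ones:

# roughly 100 batches of 32 sentences, i.e. about 3,200 test examples
average_score = calc_score(100, test_generator, model, detokenizer, score)
print(f"Average score on the sampled test sentences: {average_score:.3f}")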
We save the weights at the end of the run so that they are available at any time.
model.save_weights("model_weights.weights.h5")
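They can later be restored into a model built with the same architecture; a minimal sketch:

# Rebuild the model as above, then reload the saved weights.
model.load_weights("model_weights.weights.h5")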