The purpose of this project is to take as input a sequence of words corresponding to a random permutation of a given English sentence, and to reconstruct the original sentence.
The output can be produced either in a single shot, or through an iterative (autoregressive) loop generating a single token at a time.
CONSTRAINTS:
BONUS PARAMETERS:
A bonus of 0-2 points will be awarded to encourage the adoption of models with a low number of parameters.
Dataset
The dataset is composed of sentences taken from the generics_kb dataset on Hugging Face. We restricted the vocabulary to the 10K most frequent words, and only kept sentences that use this vocabulary.
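A minimal sketch of how the corpus can be built with the datasets library is shown below; the configuration name, the generic_sentence field and the filtering code are assumptions, since the original loading cell is not reproduced here.

from collections import Counter
from datasets import load_dataset

# Load generics_kb from the Hugging Face hub (configuration name assumed).
ds = load_dataset("generics_kb", "generics_kb_best", split="train")
sentences = [example["generic_sentence"].lower() for example in ds]

# Keep the 10K most frequent words and only the sentences fully covered by them.
counts = Counter(word for sentence in sentences for word in sentence.split())
vocabulary = {word for word, _ in counts.most_common(10000)}
corpus = [s for s in sentences if all(word in vocabulary for word in s.split())]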
Collecting datasets
  Downloading datasets-2.19.2-py3-none-any.whl (542 kB)
  ... (requirement checks and dependency downloads omitted) ...
Installing collected packages: xxhash, requests, dill, multiprocess, datasets
  Attempting uninstall: requests
    Found existing installation: requests 2.31.0
    Uninstalling requests-2.31.0:
      Successfully uninstalled requests-2.31.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests==2.31.0, but you have requests 2.32.3 which is incompatible.
Successfully installed datasets-2.19.2 dill-0.3.8 multiprocess-0.70.16 requests-2.32.3 xxhash-3.4.1
{"model_id":"32f594dad7bb462483aa9caa0096d7c1","version_major":2,"vers
ion_minor":0}
{"model_id":"2bbb85d1a022486d816a2f45dfa10600","version_major":2,"vers
ion_minor":0}
{"model_id":"2342d6b9aba94dbc8303bc3d63c12d2f","version_major":2,"vers
ion_minor":0}
{"model_id":"fcff0647b7384af0bd83efde31bc248f","version_major":2,"vers
ion_minor":0}
{"model_id":"36c5d4de6a8f4f97836e832102855de6","version_major":2,"vers
ion_minor":0}
from tensorflow.keras.layers import TextVectorization

tokenizer = TextVectorization(
    max_tokens=10000,
    standardize="lower_and_strip_punctuation",
    encoding="utf-8",
)  # max_tokens keeps the most frequent words; the vocabulary is ordered from most to least frequent
tokenizer.adapt(corpus)
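As a quick check (not part of the original cell), the adapted layer maps raw sentences to integer token ids; index 0 is reserved for padding and index 1 for [UNK].

print(tokenizer.get_vocabulary()[:5])      # ['', '[UNK]', ...] most frequent words first
print(tokenizer(["Most birds can fly"]))   # one row of integer ids per input sentence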
class TextDetokenizer:
    def __init__(self, vectorize_layer):
        self.vectorize_layer = vectorize_layer
        vocab = self.vectorize_layer.get_vocabulary()
        self.index_to_word = {index: word for index, word in enumerate(vocab)}

    def __detokenize_tokens(self, tokens):
        def check_token(t):
            if t == 3:
                s = "<start>"
            elif t == 2:
                s = "<end>"
            elif t == 7:
                s = "<comma>"
            else:
                s = self.index_to_word.get(t, '[UNK]')
            return s
        # map every token id of the sentence back to its word and join them
        return " ".join(check_token(int(t)) for t in tokens)

    def __call__(self, batch):
        # detokenize a whole batch of token sequences
        return [self.__detokenize_tokens(tokens) for tokens in batch]
Remove from the corpus the sentences where any unknown word appears.
original_data.shape
(241236, 28)
import numpy as np
from tensorflow.keras.utils import PyDataset

class DataGenerator(PyDataset):
    def __init__(self, data, batch_size=32, shuffle=True, seed=42):
        super().__init__()  # initialize the PyDataset base class
        self.data = data
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.seed = seed
        self.on_epoch_end()

    def __len__(self):
        return int(np.floor(len(self.data) / self.batch_size))

    def on_epoch_end(self):
        self.indexes = np.arange(len(self.data))
        if self.shuffle:
            if self.seed is not None:
                np.random.seed(self.seed)
            np.random.shuffle(self.indexes)
for i in range(7):
    print("shuffled: ", x[i])
    print("original shifted: ", y[i])
    print("original: ", z[i])
    print("\n")
When computing the score, you should NOT consider the start and end tokens.
The longest common substring can be computed with the SequenceMatcher class of difflib, which allows a simple definition of our metric.
from difflib import SequenceMatcher

def score(s, p):
    # length of the longest common substring, normalized by the longer string
    match = SequenceMatcher(None, s, p).find_longest_match()
    return match.size / max(len(p), len(s))
Let's do an example.
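As an illustration (the strings below are made up, not taken from the dataset), we compare a short sentence with itself and with a permutation of itself:

s = "a b c d"          # original sentence
p = "d a b c"          # permuted prediction
print(score(s, s))     # 1.0, identical strings
print(score(s, p))     # 5/7 ≈ 0.71, the longest common substring is "a b c"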
The score must be computed as an average over at least 3K random examples taken from the test set.
What to deliver
You are supposed to deliver a single notebook, suitably commented. The notebook should describe a single model, although you may briefly discuss any additional attempts you made.
The notebook should contain a full trace of the training. Weights should be made available on
request.
You must also give a clear assessment of the performance of the model, computed with the metric that has been given to you.
Good work!
For this task we decided to implement a Transformer; to this end we define an Encoder block and a Decoder block, each built around multiple attention heads.
Model Definition
Useful parameters:
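The parameter cell is not reproduced in this export; the constants below are a sketch consistent with the names used throughout the notebook. LEN_VOC, LEN_SENT and LEN_TARGET_SENT match the shapes visible in the model summary, while the remaining values are assumptions.

LEN_VOC = 10000          # vocabulary size (10K most frequent words)
LEN_SENT = 28            # length of the padded shuffled input sentence
LEN_TARGET_SENT = 27     # length of the shifted target sentence
LEN_EMBED = 128          # embedding size (assumed)
LEN_FF = 512             # hidden units of the feed-forward sublayer (assumed)
HEADS_1 = 4              # attention heads, first group (assumed)
HEADS_2 = 4              # attention heads, second group (assumed)
DROPOUT = 0.1            # dropout rate (assumed)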
Encoder Layer
The Encoder block consists of an embedding layer (a token embedding plus a positional embedding, encoding the token itself and the position it occupies in the sentence) and several EncoderLayers (stacking more layers is expected to give a more accurate solution). Each EncoderLayer is composed of two MultiHeadAttention layers and a feed-forward layer; after every sublayer we apply normalization to help avoid gradient explosion.
class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, num_heads_1, num_heads_2, ff_size, embed_size):
        super().__init__()
        # Dropout layer
        self.dropout = tf.keras.layers.Dropout(rate=DROPOUT)

    def call(self, inputs):
        return out_3
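The attention and feed-forward sublayers of the EncoderLayer, and most of its forward pass, are truncated in this export. A plausible reconstruction, assuming residual connections and post-sublayer normalization as described above, is:

# Hypothetical reconstruction of the truncated EncoderLayer cell.
class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, num_heads_1, num_heads_2, ff_size, embed_size):
        super().__init__()
        self.multihead1 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads_1, key_dim=LEN_EMBED)
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.multihead2 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads_2, key_dim=LEN_EMBED)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(ff_size, activation="leaky_relu"),
            tf.keras.layers.Dense(LEN_EMBED),
        ])
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        # Dropout layer
        self.dropout = tf.keras.layers.Dropout(rate=DROPOUT)

    def call(self, inputs):
        # first self-attention sublayer, residual connection and normalization
        attn_1 = self.multihead1(query=inputs, value=inputs, key=inputs)
        out_1 = self.layernorm1(inputs + attn_1)
        # second self-attention sublayer, residual connection and normalization
        attn_2 = self.multihead2(query=out_1, value=out_1, key=out_1)
        out_2 = self.layernorm2(out_1 + attn_2)
        # position-wise feed-forward sublayer with dropout, residual and normalization
        ffn_out = self.dropout(self.ffn(out_2))
        out_3 = self.layernorm3(out_2 + ffn_out)
        return out_3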
class Encoder(tf.keras.layers.Layer):
    def __init__(self, num_heads_1, num_heads_2, ff_size, embed_size):
        super().__init__()
        self.token_embedding = tf.keras.layers.Embedding(input_dim=LEN_VOC, output_dim=LEN_EMBED, mask_zero=True)
        self.pos_embedding = tf.keras.layers.Embedding(input_dim=LEN_SENT, output_dim=LEN_EMBED)
        self.encoder_1 = EncoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.encoder_2 = EncoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.encoder_3 = EncoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.encoder_4 = EncoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.encoder_5 = EncoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)

    def call(self, inputs):
        x = inputs
        # Token and Position Encoding Section
        maxlen = tf.shape(x)[-1]
        positions = tf.keras.ops.arange(start=0, stop=maxlen, step=1)
        positions = self.pos_embedding(positions)
        x = self.token_embedding(x)  # shape (batch_size, len_sent, len_emb)
        out_1 = x + positions
        out_2 = self.encoder_1(out_1)
        out_3 = self.encoder_2(out_2)
        out_4 = self.encoder_3(out_3)
        out_5 = self.encoder_4(out_4)
        out_6 = self.encoder_5(out_5)
        return out_6
Decoder Layer
The Decoder block is structured similarly: an embedding layer followed by a series of DecoderLayers, as in the Encoder described above. In addition to the two self-attention layers, here we also have a causal (masked) attention layer, so that each position only considers the tokens already predicted/observed. In this block, as in the previous one, every sublayer is followed by normalization to help avoid gradient explosion.
class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, num_heads_1, num_heads_2, ff_size, embed_size):
        super().__init__()
        self.multihead_mask = tf.keras.layers.MultiHeadAttention(num_heads=num_heads_1, key_dim=LEN_EMBED)
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.multihead1 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads_1, key_dim=LEN_EMBED)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.multihead2 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads_2, key_dim=LEN_EMBED)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(ff_size, activation="leaky_relu"),
            tf.keras.layers.Dense(LEN_EMBED),
        ])
        self.layernorm4 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout = tf.keras.layers.Dropout(rate=DROPOUT)

    def call(self, decoder_inp, encoder_out):  # encoder_out assumed as the second argument
        return out_4
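The forward pass of the DecoderLayer is likewise truncated. A plausible sketch of the missing call method, assuming the masked layer performs causal self-attention and the other two attend to the encoder output, with a residual connection and normalization after every sublayer (the original text calls them self-attention layers, so the exact wiring is an assumption):

    # Hypothetical call method for the DecoderLayer above (wiring assumed).
    def call(self, decoder_inp, encoder_out):
        # causal self-attention: every position only attends to earlier positions
        attn_mask = self.multihead_mask(query=decoder_inp, value=decoder_inp,
                                        key=decoder_inp, use_causal_mask=True)
        out_1 = self.layernorm1(decoder_inp + attn_mask)
        # attention over the encoder output
        attn_1 = self.multihead1(query=out_1, value=encoder_out, key=encoder_out)
        out_2 = self.layernorm2(out_1 + attn_1)
        # second attention over the encoder output
        attn_2 = self.multihead2(query=out_2, value=encoder_out, key=encoder_out)
        out_3 = self.layernorm3(out_2 + attn_2)
        # feed-forward sublayer with dropout, residual and normalization
        ffn_out = self.dropout(self.ffn(out_3))
        out_4 = self.layernorm4(out_3 + ffn_out)
        return out_4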
class Decoder(tf.keras.layers.Layer):
    def __init__(self, num_heads_1, num_heads_2, ff_size, embed_size):
        super().__init__()
        self.token_embedding = tf.keras.layers.Embedding(input_dim=LEN_VOC, output_dim=LEN_EMBED, mask_zero=True)
        self.pos_embedding = tf.keras.layers.Embedding(input_dim=LEN_TARGET_SENT, output_dim=LEN_EMBED)
        self.decoder_1 = DecoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.decoder_2 = DecoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.decoder_3 = DecoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.decoder_4 = DecoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.decoder_5 = DecoderLayer(num_heads_1, num_heads_2, ff_size, embed_size)
        self.outlayer = tf.keras.layers.Dense(LEN_VOC, activation='softmax')

    def call(self, decoder_inp, encoder_out):  # encoder_out argument assumed, fed to each DecoderLayer
        x = decoder_inp
        # Token and Position Encoding Section
        maxlen = tf.shape(x)[-1]
        positions = tf.keras.ops.arange(start=0, stop=maxlen, step=1)
        positions = self.pos_embedding(positions)
        x = self.token_embedding(x)
        out_1 = x + positions
        out_2 = self.decoder_1(out_1, encoder_out)
        out_3 = self.decoder_2(out_2, encoder_out)
        out_4 = self.decoder_3(out_3, encoder_out)
        out_5 = self.decoder_4(out_4, encoder_out)
        out_6 = self.decoder_5(out_5, encoder_out)
        return self.outlayer(out_6)
Transformer
We merge the Encoder and the Decoder blocks together. We also override the predict method of the Model class to generate the reordered sentences autoregressively.
class Transformer(tf.keras.Model):
    def __init__(self, num_heads_1, num_heads_2, ff_size, embed_size):
        super().__init__()
        self.encoder = Encoder(num_heads_1, num_heads_2, ff_size, embed_size)
        self.decoder = Decoder(num_heads_1, num_heads_2, ff_size, embed_size)

    def call(self, encoder_inp, training=False):
        return decoder_out

    def predict(self, encoder_input):
        max_length = 28
        batch_size = encoder_input.shape[0]
        output_array = tf.TensorArray(dtype=tf.int64, size=0, dynamic_size=True)
        for i in tf.range(max_length - 1):
            output = tf.transpose(output_array.stack())
            predictions = self([encoder_input, output], training=False)
            end_mask = tf.reduce_any(tf.equal(predicted_id, tokenizer.word_index['']), axis=-1)
            if tf.reduce_all(end_mask):
                break
        output = tf.transpose(output_array.stack())
        self([encoder_input, output[:, :-1]], training=False)
        return output
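Several steps of this loop are not visible in the export (writing the start token into the TensorArray, taking the argmax of the last prediction, and appending it). A self-contained sketch of the intended greedy decoding, using the ids that the detokenizer maps to <start> and <end> (3 and 2); the helper name and the use of tf.concat instead of a TensorArray are illustrative choices:

def greedy_predict(transformer, encoder_input, max_length=28, start_id=3, end_id=2):
    batch_size = tf.shape(encoder_input)[0]
    # every target sequence starts with the <start> token id
    output = tf.fill([batch_size, 1], tf.cast(start_id, tf.int64))
    for _ in range(max_length - 1):
        predictions = transformer([encoder_input, output], training=False)
        # most probable token at the last generated position
        predicted_id = tf.argmax(predictions[:, -1, :], axis=-1, output_type=tf.int64)
        output = tf.concat([output, tf.expand_dims(predicted_id, -1)], axis=-1)
        # stop once every sentence in the batch has produced <end>
        if tf.reduce_all(tf.reduce_any(tf.equal(output, end_id), axis=-1)):
            break
    return output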
We instantiate the model (the number of trainable parameters is kept low in order to stay well below the maximum limit of 20M parameters).
training = False
inputs = tf.keras.Input(shape=(LEN_SENT,))
target = tf.keras.Input(shape=(LEN_TARGET_SENT,))
outputs = Transformer(HEADS_1, HEADS_2, LEN_FF, LEN_EMBED)(encoder_inp=[inputs, target], training=training)
model = tf.keras.Model(inputs=[inputs, target], outputs=outputs)
model.summary()
Model: "functional_23"
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)        ┃ Output Shape      ┃   Param # ┃ Connected to      ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ input_layer_12      │ (None, 28)        │         0 │ -                 │
│ (InputLayer)        │                   │           │                   │
├─────────────────────┼───────────────────┼───────────┼───────────────────┤
│ input_layer_13      │ (None, 27)        │         0 │ -                 │
│ (InputLayer)        │                   │           │                   │
├─────────────────────┼───────────────────┼───────────┼───────────────────┤
│ transformer_1       │ (None, 27, 10000) │ 4,756,880 │ input_layer_12[0… │
│ (Transformer)       │                   │           │ input_layer_13[0… │
└─────────────────────┴───────────────────┴───────────┴───────────────────┘
Training
Now, after creating the generators for the training, the validation and the testing (all mantaining
the proportions between training and testing) we procede with the training of the model.
opt = tf.keras.optimizers.AdamW(0.00005, gradient_accumulation_steps=4)
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history = model.fit(train_generator, batch_size=32, epochs=10, validation_data=validation_generator)
model.summary()
Epoch 1/10
Model: "functional_23"
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)        ┃ Output Shape      ┃   Param # ┃ Connected to      ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ input_layer_12      │ (None, 28)        │         0 │ -                 │
│ (InputLayer)        │                   │           │                   │
├─────────────────────┼───────────────────┼───────────┼───────────────────┤
│ input_layer_13      │ (None, 27)        │         0 │ -                 │
│ (InputLayer)        │                   │           │                   │
├─────────────────────┼───────────────────┼───────────┼───────────────────┤
│ transformer_1       │ (None, 27, 10000) │ 4,756,880 │ input_layer_12[0… │
│ (Transformer)       │                   │           │ input_layer_13[0… │
└─────────────────────┴───────────────────┴───────────┴───────────────────┘
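The per-epoch progress lines are not reproduced above. The recorded history can be inspected as usual; a minimal sketch (not part of the original notebook):

import matplotlib.pyplot as plt

# Compare training and validation loss across the 10 epochs.
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('sparse categorical cross-entropy')
plt.legend()
plt.show()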
Testing
# function to determine the average score
def calc_score(num_batches, generator, myModel, detokenizer, score_function):
    list_scores = []
    for k in range(num_batches):
        x, y = generator.__getitem__(k)
        predictions = myModel.predict(x, batch_size=32, verbose=False)
        # greedy choice: most probable token at every position of every sentence
        best = [[np.argmax(predictions[t][r]) for r in range(len(predictions[t]))]
                for t in range(len(predictions))]
        for i in range(len(x)):
            list_scores.append(score_function(detokenizer(y)[i], detokenizer(best)[i]))
    return np.mean(list_scores)
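With a batch size of 32, about 100 batches cover the required minimum of 3K test examples. The call below is a sketch, assuming a test_generator built like the training and validation ones:

# roughly 100 batches of 32 sentences, i.e. about 3,200 test examples
average_score = calc_score(100, test_generator, model, detokenizer, score)
print(f"Average score on the sampled test sentences: {average_score:.3f}")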
We save the weights at the end of the run so that they are available at any time.
model.save_weights("model_weights.weights.h5")
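They can later be restored into a model built with the same architecture; a minimal sketch:

# Rebuild the model as above, then reload the saved weights.
model.load_weights("model_weights.weights.h5")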