
HW #1 NLP

Niccolo Campolungo 1770766

04/02/2016

1 System Overview
My model has been implemented in Python, using the sklearn-crfsuite package, as suggested
by the assignment. The features have been encoded, again, as suggested in the paper: each
character of each word has a feature set containing the sequences of (up to) k characters to its
left and right; furthermore, a bias feature was added, whose value is set to 1 regardless
of the input. More features were tried to improve the model, but with little success; this part
is explained in more detail in section 2.
Below is an example of a featureset for a single two-character word, by (using k = 1):

[{bias: 1.0, right_<w>: 1}, {bias: 1.0, left_<w>: 1, right_b: 1},


{bias: 1.0, left_b: 1, right_y: 1}, {bias: 1.0, left_y: 1, right_</w>: 1}]
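
For concreteness, below is a minimal sketch of a feature extractor that reproduces the dicts
above, treating each gap between adjacent symbols of <w> word </w> as a tagging position;
the function name and the way longer contexts are encoded for k > 1 are my own assumptions,
not prescribed by the assignment.

def gap_features(word, k=1):
    # Build one feature dict per gap between adjacent symbols of
    # <w> word </w>, using up to k symbols of context on each side.
    symbols = ['<w>'] + list(word) + ['</w>']
    featuresets = []
    for gap in range(len(symbols)):          # gap i sits just before symbols[i]
        feats = {'bias': 1.0}
        for j in range(1, k + 1):
            if gap - j >= 0:                 # j symbols to the left of the gap
                feats['left_' + ''.join(symbols[gap - j:gap])] = 1
            if gap + j <= len(symbols):      # j symbols to the right of the gap
                feats['right_' + ''.join(symbols[gap:gap + j])] = 1
        featuresets.append(feats)
    return featuresets

print(gap_features('by'))  # yields the four dicts shown above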

2 Results and Analysis


2.0 Feature additions
Throughout the phases reported in more detail below, I tried adding new features to enhance
the performance of the model. I tried adding simple features like word length, left and right
segment lengths, and character index, aiming at an overall increase of the score, without much
luck. Some of the features decreased the model's performance considerably (by up to 0.08),
whereas others just reduced it slightly, hence I opted to remove all of them.
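
For illustration, the discarded additions were of roughly this shape (the feature names are
mine, and in this gap-based encoding the character index coincides with the left segment
length):

def extra_features(feats, word, gap):
    # Candidate features that were tried and later removed
    # (illustrative names, not necessarily the exact ones used).
    feats['word_len'] = len(word)         # length of the whole word
    feats['left_len'] = gap               # segment length left of the gap
    feats['right_len'] = len(word) - gap  # segment length right of the gap
    return feats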

2.1 First phase, tuning


The first thing to do was to tune the k parameter. I implemented a simple for loop that
iterated over 20 values of k (1..20) to find the one with the highest F1 score. Without tuning
other parameters, the highest score was achieved with k = 11 (F1 = 0.882, P = 0.936, R = 0.834).
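
That loop amounts to the following sketch, assuming the gap_features extractor above,
pre-built word lists and label sequences (train_words, dev_words, y_train, y_dev), and a
binary label scheme in which '1' marks a split point; all of these names are assumptions.

import sklearn_crfsuite
from sklearn_crfsuite import metrics

best_k, best_f1 = None, 0.0
for k in range(1, 21):  # k = 1..20
    X_train = [gap_features(w, k) for w in train_words]
    X_dev = [gap_features(w, k) for w in dev_words]
    crf = sklearn_crfsuite.CRF(algorithm='lbfgs')
    crf.fit(X_train, y_train)
    y_pred = crf.predict(X_dev)
    f1 = metrics.flat_f1_score(y_dev, y_pred, average='binary', pos_label='1')
    if f1 > best_f1:
        best_k, best_f1 = k, f1
print(best_k, best_f1)  # k = 11 scored best in the runs reported above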

2.2 Second phase, parameters tuning


The real problem was finding the best combination among all the parameters; in general, this
can be done by cross-validating on the training data and checking against the development set.
The parameters of our Grid Search were λ and MI, whereas the Grid Search type chosen was
K-Fold, with K ranging from 4 to 6. As for λ and MI, the former was set to 10^-i, i = 1..5,
while the latter ranged from 1 to 100. All of this was iterated over different values of k,
going from 3 to 15 (since values lower than 3 and higher than 14 always gave much worse scores
with some fixed good parameters).
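
A sketch of that search, under the assumption that λ maps to the crfsuite L2 coefficient c2
and MI to max_iterations (sklearn-crfsuite's CRF is scikit-learn compatible, so GridSearchCV
can drive it); f1_scorer reuses the flat F1 metric from the previous sketch.

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
import sklearn_crfsuite
from sklearn_crfsuite import metrics

f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='binary', pos_label='1')
params = {
    'c2': [10 ** -i for i in range(1, 6)],  # λ = 10^-i, i = 1..5
    'max_iterations': list(range(1, 101)),  # MI = 1..100
}
for k in range(3, 16):                      # outer loop over k
    X = [gap_features(w, k) for w in train_words]
    for folds in (4, 5, 6):                 # K-Fold, K = 4..6
        gs = GridSearchCV(sklearn_crfsuite.CRF(algorithm='lbfgs'),
                          params, cv=folds, scoring=f1_scorer)
        gs.fit(X, y_train)
        print(k, folds, gs.best_params_, gs.best_score_)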

2.3 Third phase, training set subsets


As requested in the assignment, some [4, 5, 6]-Fold Grid Searches were executed using subsets
of the training set, specifically a quarter (Q), a half (H), the full training set (T) and T plus
the crowd-sourced dataset (F). The table below reports the results of the above runs. One weird
thing is that T and F had the exact same scores across the three K-Folds, but I don't really
know how to explain this (I ran all the validations twice to make sure that the data was right).

    K   k   MI   CV     P      R      F1
Q   4   7   36   0.907  0.892  0.821  0.855
H   6   5   6    0.913  0.890  0.860  0.870
T   6   5   16   0.921  0.936  0.899  0.917
F   6   5   16   0.921  0.936  0.899  0.917
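
The subsets were obtained along these lines (a sketch; the variable names, and whether the
data was shuffled beforehand, are assumptions):

quarter = train_data[:len(train_data) // 4]  # Q
half = train_data[:len(train_data) // 2]     # H
full = train_data                            # T
plus = train_data + crowd_data               # F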

2.4 Fourth phase, crowd-sourced training


Using the crowd-sourced dataset along with the training set, I ended up re-executing the 2-
dimensional grid search (with 6-Fold Cross Validation and the same assumptions as before),
after repeating the first three phases performed on the original training set. All those steps
led to a final model that, with k = 6, MI = 21 and λ = 0.0001, could score an F1 of 0.9018 on
the development set, whereas the scores on the test set were P = 0.891, R = 0.864, F1 = 0.878.
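
Putting it together, the final configuration corresponds to this sketch (same assumed names
and parameter mapping as above; crowd_words / y_crowd stand for the crowd-sourced data):

import sklearn_crfsuite
from sklearn_crfsuite import metrics

# Final model: k = 6, MI = 21, λ = 0.0001 (as c2), trained on T + crowd data.
X_train = [gap_features(w, 6) for w in train_words + crowd_words]
crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c2=0.0001, max_iterations=21)
crf.fit(X_train, y_train + y_crowd)

y_pred = crf.predict([gap_features(w, 6) for w in test_words])
print(metrics.flat_classification_report(y_test, y_pred))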

2.5 Overall scores


The following table shows more data obtained throughout the evaluation of the model, in the
steps described above.

    k   λ      MI   Precision  Recall  F1
Q   11  10^-5  100  0.884      0.722   0.795
H   11  10^-5  100  0.906      0.815   0.858
T   11  10^-5  100  0.936      0.834   0.882
T   8   10^-4  67   0.896      0.877   0.886
T   6   10^-4  21   0.920      0.842   0.880
F   6   10^-4  21   0.911      0.892   0.902
