NLP Report Ass2
1. Data Loading: I loaded two CSV files (train and test) containing the text data and a
fake-news flag, using the pd.read_csv() function from the pandas library.
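The loading step above can be sketched as follows. The report does not give the exact file names or column schema, so a small in-memory CSV stands in for the real files here; "text" and "fake" are assumed column names.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real train.csv; in the report this would be
# pd.read_csv("train.csv") with the actual file path.
csv_data = io.StringIO("text,fake\nBreaking news about X,1\nLocal team wins game,0\n")
train_df = pd.read_csv(csv_data)

print(train_df.shape)  # rows x columns of the loaded frame
```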
2. Data Preprocessing: I transformed all text into lowercase, then split the train data into
train and validation sets using the train_test_split() function from the
sklearn.model_selection module. I tokenized the text data in X_train using the
word_tokenize() function from the nltk.tokenize module, and removed stop words from the
tokenized text using a lambda function.
3. Word Embeddings:
With Word2Vec: A Word2Vec model is initialized and trained using the Word2Vec class from
the gensim.models module. I trained it on the sentences list, which contains the tokenized
sentences from the train data.
With TF-IDF: A TF-IDF vectorizer is initialized and trained using the TfidfVectorizer class
from the sklearn.feature_extraction.text module. I transformed the text data in X_train
into TF-IDF vectors using the trained vectorizer.
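The TF-IDF step can be sketched as follows; the toy documents stand in for X_train, and the default TfidfVectorizer settings are an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents standing in for the lowercased X_train texts.
docs = [
    "fake news spreads fast",
    "real news takes time",
    "fake story goes viral",
]

# Fit the vectorizer on the training text and transform it in one step.
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

print(X_tfidf.shape)
```

At prediction time the validation text would be passed through vectorizer.transform(), never fit again, so both sets share the same vocabulary.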
With FastText: A FastText model is initialized and trained using the FastText class from
the gensim.models module. I trained it on the sentences list, which contains the tokenized
sentences from the train data.
4. Model Training and Evaluation: I trained a Logistic Regression model on each set of
features and used it to predict the labels for the validation set via the predict() method.
Embedding    Validation score
Word2Vec     0.6766
TF-IDF       0.6797
FastText     0.6782
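The training and evaluation step can be sketched as below. Real features would be the Word2Vec, TF-IDF, or FastText vectors produced earlier; the toy texts, labels, and max_iter setting here are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy stand-ins for the train and validation splits.
train_docs = ["fake news story", "real news report",
              "fake viral story", "real local report"]
y_train = [1, 0, 1, 0]
val_docs = ["fake story", "real report"]
y_val = [1, 0]

# Build features (TF-IDF shown here; any of the three embeddings would do).
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_val = vectorizer.transform(val_docs)  # reuse the fitted vocabulary

# Train Logistic Regression, then predict labels for the validation set.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
preds = clf.predict(X_val)
print(accuracy_score(y_val, preds))
```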