
Here are the steps I followed to get the result:

1. Data Loading: I loaded two CSV files (train and test) containing the text data and a fake
flag using the pd.read_csv() function from the pandas library.
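The loading step can be sketched as follows. The actual CSV paths and the column names ("text", "fake") are assumptions, since the write-up does not give them; a small in-memory CSV stands in here so the sketch is self-contained.

```python
import io
import pandas as pd

# In the real code this would be e.g. pd.read_csv("train.csv") and
# pd.read_csv("test.csv"); the column names below are assumed.
sample_csv = io.StringIO("text,fake\nBreaking news today,0\nYou won a prize,1\n")
train_df = pd.read_csv(sample_csv)
print(train_df.shape)  # (2, 2): two rows, columns "text" and "fake"
```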

2. Data Preprocessing: I converted all text to lowercase, then split the train data into
train and validation sets using the train_test_split() function from the
sklearn.model_selection module. I tokenized the text data in X_train using the
word_tokenize() function from the nltk.tokenize module and removed stop words from
the tokenized text using a lambda function.

3. Word Embeddings:

With Word2Vec: a Word2Vec model is initialized and trained using the Word2Vec class from
the gensim.models module. I trained it on the sentences list, which contains the tokenized
sentences from the train data.

With TF-IDF: a TF-IDF vectorizer is initialized and fitted using the TfidfVectorizer class from
the sklearn.feature_extraction.text module. I transformed the text data in X_train into
TF-IDF vectors using the fitted vectorizer.
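A minimal sketch of the TF-IDF step, with a toy corpus standing in for X_train:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for X_train.
corpus = ["fake news spreads fast",
          "real journalism takes time",
          "fake stories get clicks"]

vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(corpus)  # sparse matrix: docs x vocabulary
```

Each row of the resulting sparse matrix is one document; each column is a vocabulary term weighted by term frequency times inverse document frequency.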

With FastText: a FastText model is initialized and trained using the FastText class from
the gensim.models module. I trained it on the sentences list, which contains the tokenized
sentences from the train data.

4. Model Training: a Logistic Regression model is initialized and trained using the
LogisticRegression class from the sklearn.linear_model module. I used the word embeddings
obtained from each method (Word2Vec, TF-IDF, FastText) as features for training the
Logistic Regression model.
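A sketch of the training step using TF-IDF features on toy data. For the Word2Vec and FastText variants, a common choice (an assumption about this setup, since the write-up does not say) is to average each document's word vectors into one fixed-length feature vector before fitting the classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy texts and labels; 1 = fake, 0 = real is an assumed encoding.
texts = ["free money now", "quarterly earnings report",
         "you won a prize", "city council meeting notes"]
labels = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(texts)  # TF-IDF features shown here
clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)
```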

5. Model Evaluation: I used the trained Logistic Regression model to predict the labels for
the validation set using the predict() method. The accuracy score is calculated using the
accuracy_score() function from the sklearn.metrics module to evaluate the performance
of the model.
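The evaluation step can be sketched end to end on toy data (texts, labels, and split parameters are all stand-ins for the real setup):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data; the real split comes from the preprocessing step.
texts = ["free money now", "quarterly earnings report",
         "you won a prize", "city council meeting notes",
         "click here to claim cash", "weather forecast for monday",
         "miracle cure doctors hate", "school board budget update"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X_train), y_train)

y_pred = clf.predict(vec.transform(X_val))          # labels for the validation set
accuracy = accuracy_score(y_val, y_pred)            # fraction predicted correctly
```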
The resulting validation accuracy for each representation was:

Word2Vec: 0.6766
TF-IDF:   0.6797
FastText: 0.6782

So TF-IDF was the best way to represent the text for this task, although all three methods performed very similarly, within about 0.003 of each other.
