M1 Sample


Course Project True/Fake Claims Analysis

Milestone 1_Sample submission

Research Goal

As of this writing, the current focus is to clear the given data of tweets that are not relevant to the topic under which they have been categorized.

Abstract

Dictionaries contain the exact definitions of words to help people expand their vocabulary and express themselves in new, meaningful ways. Despite this, many individuals or groups use words in ways that differ from their formal definitions, whether to make conversation easier or for other reasons. The disaster tweet data set, located at [Kaggle link], contains over 11,300 tweets categorized as describing disasters. But thanks to the evolution of human communication, these disaster words may carry other meanings. For example, while one person may tweet about a blizzard ravaging their company, a gamer may be complaining about the poor decisions made by the Blizzard company, and someone else may be advertising the new (and delicious) Oreo Mint Blizzard at Dairy Queen. The goal of my project is to separate the literal uses of disaster keywords from the non-literal ones (although such uses are not necessarily "fake" – the keywords can serve as descriptive words beyond their literal definitions).

Research Questions
The following is the current set of questions that can be answered by this project (the list has room to expand):

1. Can machine learning identify the difference between literal disasters and other uses of disaster keywords?
2. Is it easier to toss out "fake" tweets, or tweets unrelated to real disasters/events?
3. Can a prediction algorithm provide scores that accurately predict the polarity of tweets, avoiding messy code written to achieve the same goal?

Introduction

To make the data from the data set reliable, the primary focus at the beginning is to break the tweets down into their simplest form.

Duplicate Data

Removing duplicate tuples helps cut down on unnecessary data that may cost time and resources. Thankfully, my data set does not contain duplicates, but checking is certainly worth it.
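The duplicate check described above can be sketched with pandas. The toy rows and column names below are assumptions modeled on the Kaggle disaster tweets data set; the real data would be loaded with something like `pd.read_csv("train.csv")`.

```python
import pandas as pd

# Toy stand-in for the disaster tweets data set (column names assumed).
df = pd.DataFrame({
    "id": [1, 2, 3, 3],
    "text": ["Forest fire near town", "Blizzard ruined my night",
             "Flood warning issued", "Flood warning issued"],
})

# Count exact duplicate rows (tuples), then drop them.
n_dupes = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(df))  # 1 duplicate found, 3 rows remain
```

Even when a data set is expected to be clean, running `duplicated().sum()` first is a cheap way to confirm the "no duplicates" claim before dropping anything.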

Missing Data
Tuples with missing values may cause problems further into the project. After reviewing the disaster tweets, I noticed the only empty (null) values were in the location field, which is not a primary focus at this time (maybe the location might reveal information about who tweets what? Perhaps…). For now, the code involving missing data will be commented out.
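A null check along these lines can be written in a few pandas calls. The rows below are invented for illustration, mirroring the observation above that only the location column has missing entries.

```python
import pandas as pd

# Toy rows mirroring the data set, where only "location" is ever missing.
df = pd.DataFrame({
    "text": ["Earthquake shook the city", "The new DQ Blizzard is great"],
    "location": ["Lima, Peru", None],
})

# Per-column null counts; only location should report missing entries here.
null_counts = df.isnull().sum()
print(null_counts["location"])  # 1

# If dropping were ever needed, it would be df.dropna(subset=["location"]);
# for now the milestone leaves missing locations in place.
```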

Cleaning Data

To “clean” data means to remove extra characters that have no real meaning. This mostly covers punctuation, capitals, and any additional whitespace created when removing extraneous characters. At the time of writing, I am on the fence about including the cleaning of accent characters.

After reviewing my tweets, I found that some tweets mention places in other countries whose names contain accent marks. Removing these fancy letters makes all the words uniform to process but makes them harder to read and pronounce (this may be only a human problem, which is not necessarily a focus of the project as of now).
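The cleaning steps above can be sketched with the standard library alone. The function name and the accent-stripping flag are my own; the flag reflects the on-the-fence decision about accents by making that step optional.

```python
import re
import string
import unicodedata

def clean_tweet(text: str, strip_accents: bool = False) -> str:
    """Lowercase, optionally strip accents, drop punctuation, collapse whitespace."""
    text = text.lower()
    if strip_accents:
        # Decompose accented letters and drop the combining marks,
        # e.g. "Bogotá" becomes "bogota".
        text = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove punctuation, then collapse the leftover runs of whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("Blizzard ravaging   our company!!"))
# blizzard ravaging our company
print(clean_tweet("Fire near Bogotá...", strip_accents=True))
# fire near bogota
```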

Lemmatization

Lemmatization is still being tinkered with at the time of writing.

Stopwords
Removing stopwords goes hand in hand with lemmatization – both help break sentences and words down to their roots. Stopwords are recognized as unnecessary words, and certain projects require their removal. However, removing them may damage sentiment analysis – the analysis of the positive or negative connotation of sentences. Does the removal of these words cause sentences to become more positive or negative? On a related note, lemmatization may cause the same effect.
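A minimal stopword-removal sketch is below. The tiny hand-picked stopword set is an illustration only; in practice the list would come from a library such as `nltk.corpus.stopwords.words("english")`. Note that the example drops "not", which is exactly the kind of removal that could flip the polarity a sentiment analyzer sees – the worry raised above.

```python
# Tiny illustrative stopword set (a real project would use a library list).
STOPWORDS = {"a", "an", "the", "is", "in", "of", "to", "was", "not"}

def remove_stopwords(text: str) -> str:
    """Keep only the tokens that are not in the stopword set."""
    return " ".join(w for w in text.split() if w not in STOPWORDS)

print(remove_stopwords("the storm was not a disaster in the end"))
# "storm disaster end" -- the negation is gone, changing the apparent sentiment
```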

By taking apart the sentences from the disaster tweets data set, it will be easier to programmatically distinguish between the different uses of disaster keywords.


This milestone is geared more towards the analysis of the tweets rather than cleaning them for use. It asks a few questions and subsequently resolves a previous worry that will be mentioned later.

Visualizations

Ngrams

A more organized way of visualizing word usage, this was taken from a previous assignment, with the quadgram added. A note for later: use the graphing abilities provided by the module to plot other information.
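The original ngram code from the previous assignment is not shown here, so the sketch below is a plain standard-library version of the idea (libraries such as `nltk.util.ngrams` provide the same sliding-window behavior). It includes the quadgram (n = 4) mentioned above.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-length sliding windows over the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "fire fire truck fire truck siren".split()
bigrams = ngrams(tokens, 2)
quadgrams = ngrams(tokens, 4)  # the quadgram added for this milestone

# The most frequent bigram gives a quick view of common word pairings.
print(Counter(bigrams).most_common(1))  # [(('fire', 'truck'), 2)]
print(quadgrams[0])                     # ('fire', 'fire', 'truck', 'fire')
```

Counting the resulting tuples with `Counter` is what feeds a frequency plot of the most common phrases.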


WordCloud

Although not the focus of this portion of the semester, a word cloud was added to this milestone as I recognized the usefulness of visualizing my outputs. It may act as a method of comparing my other processed texts to see how far I have come in cleaning up the words.
