M1 Sample
Research Goal
As of writing this, the current focus is to clear the given data of tweets that are not relevant to actual disasters.
Abstract
Dictionaries contain the exact definitions of words, helping people grow their vocabulary
and express themselves in new, meaningful ways. Despite this, many individuals
or groups use words in ways that differ from the written definitions
given to them, whether to make conversation easier or for other purposes. The disaster tweet data set, located
at [Kaggle link], contains over 11,300 tweets categorized as describing disasters. But thanks to
the evolution of human communication, these disaster words may carry multiple meanings. For
example, while someone may tweet about a blizzard ravaging their community, a gamer may be
speaking about the poor decisions made by the Blizzard company. A person may also be
advertising the new (and delicious) Oreo Mint Blizzard at Dairy Queen. The goal of my project is
to separate the literal uses of disaster words from the non-literal ones (although non-literal uses of disaster
keywords are not necessarily “fake”; they can serve as descriptive words beyond their literal
definitions).
Research Questions
The following questions can be answered by this project:
1. Can machine learning identify the differences between literal disasters and other uses of the
disaster keywords?
2. Can a prediction algorithm provide appropriate scores that accurately predict the polarity
of a tweet’s sentiment?
Introduction
To make the data from the data set reliable, the primary focus in the beginning is to break the
text down into a clean, consistent form.
Duplicate Data
Removing duplicate tuples cuts down on unnecessary data that may cost time and
resources. Thankfully, my data set does not contain duplicates, but checking is certainly
worth the effort.
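The duplicate check above can be sketched with pandas; the toy frame and column names here are stand-ins for the real data set.

```python
import pandas as pd

# Toy frame standing in for the disaster tweets; the column names are
# illustrative, so adjust them to match the actual data set.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "text": ["flood downtown", "earthquake!", "earthquake!", "blizzard at DQ"],
})

# Count fully duplicated rows before deciding whether removal is needed.
n_dupes = df.duplicated().sum()

# Drop exact duplicates, keeping the first occurrence of each tuple.
deduped = df.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(deduped))
```

Even when no duplicates are expected, running the check first documents that assumption in the notebook itself.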
Missing Data
Tuples with missing values may cause problems further into the project. After reviewing
the disaster tweets, I noticed the only empty (null) values were in the location column, which is not a
primary focus at this time (maybe the location might reveal information about who
tweets what? Perhaps…). For now, the code involving missing data will be commented
out.
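A quick null audit, sketched below with a toy frame, confirms which columns actually need attention; the commented-out drop mirrors the decision to leave location rows alone for now.

```python
import pandas as pd

# Toy frame standing in for the disaster tweets; "location" is the only
# column with nulls, mirroring what was observed in the real data.
df = pd.DataFrame({
    "text": ["fire in the hills", "Blizzard nerfed my main", "mint blizzard at DQ"],
    "location": ["CA", None, None],
})

# Per-column null counts show where the gaps are.
null_counts = df.isna().sum()

# Kept commented out, as in the milestone: location is not a current
# focus, so rows with a missing location are retained.
# df = df.dropna(subset=["location"])
print(null_counts["location"])
```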
Cleaning Data
To “clean” data means to remove extra characters that have no real meaning. This mostly
covers punctuation, capitals, and any additional whitespace created when removing
extraneous characters. At the time of writing, I am on the fence about including the cleaning
of accent characters.
After reviewing my tweets, I found that certain tweets mention locations in other countries
whose names contain accent marks. Removing
these fancy letters makes all the words uniform but makes the names harder to read and
pronounce (this may be only a human problem, which is not necessarily a focus of the
project as of now).
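The cleaning steps above can be sketched as a single function; accent stripping is left behind an optional flag to reflect the open question, and the place name in the example is illustrative.

```python
import re
import string
import unicodedata

def clean_tweet(text: str, strip_accents: bool = False) -> str:
    """Lowercase, drop punctuation, and collapse extra whitespace.

    strip_accents is optional, reflecting the undecided question of
    whether accented place names should be simplified.
    """
    text = text.lower()
    if strip_accents:
        # Decompose accented characters, then drop the combining marks.
        text = unicodedata.normalize("NFKD", text)
        text = "".join(c for c in text if not unicodedata.combining(c))
    # Remove punctuation, then collapse runs of whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("Blizzard   ravaging   Montréal!!", strip_accents=True))
```

Keeping the accent handling behind a flag means both variants of the corpus can be produced and compared later.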
Lemmatization
Stopwords
Removing stopwords often accompanies lemmatization – breaking sentences and words down to
their roots. Stopwords are recognized as unnecessary words, and certain projects require
their removal. However, removing them may damage sentiment analysis – the analysis of
the positive or negative connotation of sentences. Does the removal of these words cause
a loss of meaning?
By taking apart the sentences from the disaster tweets data set, the text will be easier to
use. This milestone asks a few questions and subsequently resolves a previous worry that will be
mentioned later.
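Stopword removal can be sketched as below; the stopword set here is a tiny hand-picked list purely for illustration, where a real run would use a full list such as NLTK's English stopwords.

```python
# Tiny, hand-picked stopword set for illustration only; a full list
# (e.g. nltk.corpus.stopwords.words("english")) would be used in practice.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "were", "in", "on", "of"}

def remove_stopwords(tokens):
    """Drop stopwords while keeping the order of the remaining tokens."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = "the blizzard was ravaging the town".split()
print(remove_stopwords(tokens))
```

Note the sentiment-analysis concern from above: if the list included words like "not", removing them would turn "not good" into "good" and flip the polarity, so the stopword set deserves care.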
Visualizations
Ngrams
A more organized way of visualizing word usage, this was taken from a previous
assignment with the added quadgram. A note to be made is to take better advantage of
graphing abilities in future milestones.
Although visualization is not the focus of this portion of the semester, it was added to this milestone as I
recognized the usefulness of visualizing my outputs. It may act as a potential method of
comparing my other processed texts to see how far I have come in cleaning up the words.
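Counting n-grams needs no special library; a minimal sketch with the standard library, using made-up tokens, might look like this, where the same call with n=4 produces the quadgrams mentioned above.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous n-token sequence as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Illustrative tokens; in the project these would come from cleaned tweets.
tokens = "fire fire truck fire truck siren".split()

# Count bigrams; Counter.most_common feeds directly into a bar chart.
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(1))
```

Feeding `most_common` into a plotting library gives the kind of frequency chart described above, and comparing the counts before and after cleaning shows how much noise the preprocessing removed.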