Wk1b Data For AI
Building and training a model requires a tremendous amount of data, and that data requires a lot of preparation. That preparation is one of the most important parts of the whole artificial intelligence and machine learning process.
For example, the data we use for training must be complete.
If the model we want to build is going to identify different types of birds, we want to make sure we have a complete set of examples.
Suppose there are yellowish-white, green, and gray birds, but the set we trained on didn't include any green birds.
Or you might be training a system on demographics of individuals, and you don't have the street address of everybody: for some people you have complete addresses, and for others you don't.
What you have to decide is this: for the people with incomplete information, am I going to fill in those gaps, or am I just not going to use those records? The same question applies to the birds: do we want to fill in the missing examples, or not use that data at all? That's part of the process we'll see later on, where we evaluate and clean up our data and decide how we're going to use it.
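The drop-or-fill decision above can be sketched in a few lines of plain Python. The records and field names here are hypothetical stand-ins for the demographics example, not real data:

```python
# Hypothetical demographic records; "street_address" is sometimes missing.
records = [
    {"name": "Ana",   "age": 34, "street_address": "12 Elm St"},
    {"name": "Ben",   "age": 29, "street_address": None},
    {"name": "Chloe", "age": 41, "street_address": "7 Oak Ave"},
]

# Option 1: drop any record that's missing an address.
dropped = [r for r in records if r["street_address"] is not None]

# Option 2: keep every record and fill the gap with a placeholder value.
filled = [{**r, "street_address": r["street_address"] or "UNKNOWN"}
          for r in records]

print(len(dropped))                  # 2 records survive the drop
print(filled[1]["street_address"])   # UNKNOWN
```

Either choice has consequences: dropping records shrinks the training set, while filling gaps invents values the model will treat as real.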
Now, when we look at this set, we have two gray birds and three green birds, but five yellowish-white birds.
When we do our training, what could end up happening is that our model becomes really good at identifying those yellowish-white birds.
But whenever we present it with a gray bird, it may have trouble: it may not identify it correctly, or it may get confused.
This was part of the problem early facial recognition systems had: much of the training data was of white males, so the systems were really good at identifying white males.
But for people with darker skin, or for women, those systems did not do a very good job, because they hadn't been trained to recognize the specific characteristics of those faces.
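The imbalance in the bird set can be made concrete by counting each class and computing inverse-frequency weights, one common way to make errors on rare classes count more during training. This is a minimal sketch using the counts from the example above:

```python
from collections import Counter

# Class labels matching the example: 5 yellowish-white, 3 green, 2 gray.
labels = ["yellowish-white"] * 5 + ["green"] * 3 + ["gray"] * 2
counts = Counter(labels)

# Inverse-frequency weights: the rarer the class, the larger its weight,
# so mistakes on gray birds "cost" the model more during training.
total = len(labels)
weights = {cls: total / (len(counts) * n) for cls, n in counts.items()}

print(counts)
print(weights)
```

Here the gray birds end up with the largest weight because they are the rarest class in the set.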
For example, in this scenario the two gray birds are in the middle of the image.
Does that mean they're less important than the birds around the outer edge? There's a green bird toward the bottom of the screen.
Is that one more important, because it's making sure the other birds are flying in the correct direction? There can be subtle characteristics of the data we present that end up influencing our model in ways we don't intend.
For example, take states: we have 50 states right now in the United States.
If we want to convert them into a numeric form for our model to learn on, we may decide to have Alaska be one, Alabama be two, and so on, numbering each of the states.
But that may inadvertently make it appear that Alaska is a much more important state because it has the number one, and Alabama less important because it has the number two.
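One common fix for this false ordering is one-hot encoding: instead of a single number per state, each state gets its own 0/1 slot, so no state looks "bigger" than another. A minimal sketch, using just three states for brevity:

```python
# Label-encoding states as 1, 2, 3... implies an ordering that isn't real.
# One-hot encoding gives each category its own 0/1 position instead.
states = ["Alaska", "Alabama", "Arizona"]

def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 in the slot for `value`."""
    return [1 if c == value else 0 for c in categories]

print(one_hot("Alabama", states))  # [0, 1, 0]
```

With 50 states, this would produce 50 columns, but none of them implies that one state matters more than another.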
So those are some of the characteristics we need to worry about whenever we select and prepare the data to train our model.
Now, we're not going to dig too deeply into that in this course, but in the machine learning course it will become a major part of what we do.
Now, there are three very common types of data we're going to work with: tabular data, in a table like you see here; image data; and text data.
First, let's say the question we want our model to answer is: what's the best price for this book? Each row in our table represents a data sample we're going to use to train our model.
Each column represents either a feature or a label.
Features are measurable pieces of data that help us make predictions about whatever the label happens to be.
So what we'll do with our model, as you'll see, is train it on these features (month sold, year sold, and genre) and see if we can get it to predict the sale price for a particular book in the future.
We're teaching it that if a book was sold in month seven of year 2008 and its genre was romance, its price was $18.10.
We'll use that to train our model on which features are important, which features are less important, and how to weight those features.
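The row/column structure described above can be sketched in plain Python: each row is a sample, the feature columns become the inputs, and the price column becomes the label. The values are illustrative, patterned on the book example:

```python
# Rows = samples; columns = features (month_sold, year_sold, genre)
# plus the label (price). These values are made up for illustration.
rows = [
    {"month_sold": 7,  "year_sold": 2008, "genre": "romance", "price": 18.10},
    {"month_sold": 3,  "year_sold": 2009, "genre": "mystery", "price": 12.50},
    {"month_sold": 11, "year_sold": 2008, "genre": "romance", "price": 17.75},
]

# Split each row into its features (inputs) and its label (target).
features = [{k: v for k, v in r.items() if k != "price"} for r in rows]
labels = [r["price"] for r in rows]

print(features[0])
print(labels)
```

A training routine would then learn a mapping from each feature dictionary to its label, so it can predict a price for a row it has never seen.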
Now take an image question: what animal is in this image? Here, each data sample is an individual file, and the features are the pixels and their relationships to each other.
So when we look at this image, we want to know whether or not it contains a raccoon.
For training, we'll train our model on images of raccoons that we've labeled as raccoons, so it learns what a raccoon looks like.
Later, we'll be able to use our model to look at new images and determine, based on what it's been trained on, whether there's a raccoon in the image.
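The pixels-as-features idea can be shown with a toy example. Instead of a real photo file, this uses a made-up 2x2 grayscale grid, where each number is one pixel's brightness:

```python
# A tiny stand-in for an image file: a 2x2 grid of grayscale pixel values
# (0 = black, 255 = white). Real images are just much larger grids.
image = [
    [0, 255],
    [128, 64],
]

# Flattening the grid turns one image into one feature vector per sample,
# which is the form many simple models expect as input.
flat = [pixel for row in image for pixel in row]
print(flat)  # [0, 255, 128, 64]
```

A real raccoon classifier would work on far larger grids (and usually three color channels), but the principle is the same: the features are the pixel values and their arrangement.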
With textual data, there are going to be a lot of challenges, especially when we get to natural language processing, which we'll do toward the end of the course.
But as an example of textual data, let's say we have this review: "This book is pretty good, I'd recommend it to my friends," and we want our model to decide whether this was a good review or a bad review.
When we do our training, each data sample would be a piece of text stored as a file, and the question is whether that text expresses a positive sentiment or a negative sentiment.
Here we have a review, and our goal is to decide: is this a positive review? We break it into its different pieces and use them to train the model.
During training, we attach a label to it saying: this is a positive sentiment.
That way, whenever we ask our model in the future to evaluate a review as good or bad, it will have learned from all the previous reviews we've already labeled for it.
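The labeled-review idea can be sketched with a toy word-count "model". This is only a stand-in for real natural language processing, and the training reviews here are invented:

```python
from collections import Counter

# Hypothetical training samples: (text, sentiment label) pairs.
train = [
    ("this book is pretty good i would recommend it", "positive"),
    ("great story and great characters", "positive"),
    ("boring plot i would not recommend it", "negative"),
]

# "Training": count which words appear under each label.
pos_words, neg_words = Counter(), Counter()
for text, label in train:
    (pos_words if label == "positive" else neg_words).update(text.split())

def predict(text):
    """Score a new review by how often its words appeared in each class."""
    score = sum(pos_words[w] - neg_words[w] for w in text.split())
    return "positive" if score > 0 else "negative"

print(predict("good book i recommend it"))  # positive
```

Real sentiment models are far more sophisticated, but the workflow is the same one described above: labeled examples in, a learned mapping from text to sentiment out.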