Wk1b Data For AI


In this video, we're gonna look at some of the characteristics of data that make it

challenging for machine learning and artificial intelligence.

If you remember, in our previous video we looked at the machine learning or artificial intelligence flow: data fed into building and training our model, and from that we were able to either make predictions about the future or classify the data.

Well, let's talk about the data.

Building a model and training a model requires a tremendous amount of data, and
that data is gonna require a lot of preparation, but that preparation is one of the
most important parts of the whole artificial intelligence machine learning process.

For example, the data we're gonna use for training must be complete.

Here we look at a set of birds.

If our model that we wanna build is going to identify different types of birds, we
wanna make sure that we have a complete set of all the different examples.

For example, we have yellowish-white, green, and gray birds, but suppose the set we did our training on didn't have any green birds.

Whenever we go to identify or classify different bird types with our model, we're not gonna be very accurate if we're presented with green birds.

Also, our model has to have integrity.

The data has to be good.

If you notice here, there's a few birds that aren't complete.

You could think of this as having data that's not complete.

For example, you might be training a system on demographics of individuals, and you
don't have the street address of everybody.

All you have is their zip code.

There are a few people that you have all the street addresses of and a few people
that you don't.

What you want to do is decide: for the people I don't have complete information on, am I going to make up values and fill in those gaps, or am I just not going to use those records? For example, here with these birds, do we maybe want to complete the birds, or do we just not want to use that data at all? And that's part of the process we're going to see later on, where we need to evaluate and clean up our data and decide how we're going to use it.
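The two choices described above, drop the incomplete records or fill in the gaps, can be sketched in plain Python. This is a minimal illustration with made-up records, not part of the course material:

```python
# A minimal sketch (plain Python, made-up records) of the two choices for
# incomplete data: drop the record, or fill in the gap with a placeholder.
records = [
    {"zip": "35004", "street": "12 Oak St"},
    {"zip": "99501", "street": None},        # street address missing
    {"zip": "35004", "street": "9 Pine Ave"},
]

# Option 1: drop records that are missing the street address.
dropped = [r for r in records if r["street"] is not None]

# Option 2: keep every record and fill the gap with a placeholder value.
filled = [dict(r, street=r["street"] or "UNKNOWN") for r in records]

print(len(dropped))         # 2 records survive dropping
print(filled[1]["street"])  # UNKNOWN
```

Either choice has a cost: dropping loses data, while filling invents values the model will treat as real.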

We also have to make sure that our data is fair.

Now, when we look at this set, we have two gray birds and three green birds, but we have five yellowish-white birds.

Now, when we do our training, what could end up happening is our model is going to be really, really, really good at identifying those yellowish-white birds.

But whenever we present it with a gray bird, it may have some issues: it may not correctly identify it, it might get confused, it might think it's a green bird.

This is part of the problem early facial recognition systems had: a lot of the training data was of white males, so they were really good at identifying white males.

But whenever they were presented with somebody with darker skin, or with women, they did not do a very good job of identifying those people, because they hadn't been trained to understand the specific characteristics of those groups.

So the systems were not fair.
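One simple way to catch this kind of unfairness before training is to count how many examples each class has. A quick sketch in plain Python, using hypothetical labels matching the bird example:

```python
# A quick sketch (plain Python, hypothetical labels) of checking whether a
# training set is balanced before using it.
from collections import Counter

labels = ["yellowish-white"] * 5 + ["green"] * 3 + ["gray"] * 2
counts = Counter(labels)

# Flag any class with less than half as many examples as the largest class.
largest = max(counts.values())
underrepresented = [c for c, n in counts.items() if n < largest / 2]
print(underrepresented)  # the gray birds are underrepresented here
```

The half-of-the-largest-class threshold is an arbitrary choice for illustration; what matters is looking at the distribution at all.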

Another characteristic of the data we have to be aware of is that data could be biased in ways we don't understand.

For example, in this scenario here, the two gray birds are in the middle of our
scenario.

Does that mean that they're less important than the birds around the outer edge? There's a green bird towards the bottom of the screen.

Is that one more important because it's making sure the other birds are flying in the correct direction?

There could be subtle characteristics of the data we present that may end up
influencing our model in ways we don't intend.

For example, take states: there are 50 states right now in the United States.

If we want to convert that into a numeric form for our model to learn on, we may decide, oh, we'll have Alaska be one, Alabama be two, and so on, numbering each of the states.

But that may inadvertently make it appear that Alaska is a much more important
state because it has a number one, and Alabama is less important because it has a
number two.
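A common way around this, as a sketch in plain Python, is one-hot encoding: each state gets its own 0/1 column, so no state's number looks "bigger" than another's.

```python
# A sketch of one-hot encoding (plain Python): each state becomes its own
# 0/1 position, so no state looks more important than another.
states = ["Alaska", "Alabama", "Arizona"]  # a small illustrative vocabulary

def one_hot(state, vocabulary):
    """Return a 0/1 vector with a 1 in the position for this state."""
    return [1 if s == state else 0 for s in vocabulary]

print(one_hot("Alabama", states))  # [0, 1, 0] -- no ordering implied
```

Real libraries provide this as a built-in transformation, but the idea is exactly this small.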

So those are some of the characteristics we need to worry about whenever we are
selecting and preparing our data to train our model.

Now, we're not gonna dig too much into that in this course, but in the machine learning course, this will become a major part of what we're gonna do.

Now, there are three very common types of data we're gonna work with.

First of all is tabular data that's in a table like you see here; then image data and text data.

Let's look at tabular data in a little bit more detail.

First, let's say our question that we want our model to understand is, what's the
best price for this book? Well, each row in our table is going to represent a data
sample we're going to use to train our model on.
Then each column is gonna represent either a feature or a label.

Features are measurable pieces of data that are gonna help us make predictions about whatever the label happens to be.

So what we're gonna do with our model, as you'll see, is train it on these features, the month sold, year sold, and genre, and see if we can get it to predict what the sale price will be for a particular book in the future.

The label is kind of our goal or the output we want to predict.

We're teaching it that if a book was sold in month seven of year 2008 and was of the genre romance, its price was $18.10.

And so we're going to use that to train our model on which features are important, which features are not as important, and how to weight those features.
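The row/column split above can be sketched in plain Python. The rows and numbers here are made up to match the book example; each row is one training sample, the first three columns are features, and the last is the label:

```python
# A sketch (plain Python, made-up rows) of splitting a table into
# features (model inputs) and a label (the value we want to predict).
rows = [
    {"month_sold": 7, "year_sold": 2008, "genre": "romance", "price": 18.10},
    {"month_sold": 1, "year_sold": 2009, "genre": "mystery", "price": 12.50},
]

feature_names = ["month_sold", "year_sold", "genre"]  # inputs to the model
label_name = "price"                                  # output to predict

X = [[row[f] for f in feature_names] for row in rows]  # feature matrix
y = [row[label_name] for row in rows]                  # label vector

print(X[0])  # [7, 2008, 'romance']
print(y[0])  # 18.1
```

A training library would then fit a model mapping each row of `X` to the corresponding entry of `y`.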

With image data, we're gonna try to answer a question like, what animal is in this image?

Here, each data sample is going to be an individual file, and the features are going to be the pixels and their relationships to each other.

So when we look at this image, we wanna know whether or not it has a raccoon.

So what we'll do for training is we will train our model on images of raccoons that
we label as raccoons, so it learns what a raccoon looks like.

Then later we're going to be able to use our model to look at images and determine, based on what it's been trained on, if there's a raccoon in that image.
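The idea that "the features are the pixels" can be made concrete with a toy sketch in plain Python, using a tiny invented 3x3 grayscale grid rather than a real photo:

```python
# A toy sketch (plain Python): a tiny 3x3 grayscale "image" whose pixel
# values get flattened into a single feature vector for a model.
image = [
    [0,   255, 0],
    [255, 255, 255],
    [0,   255, 0],
]

# Flatten the 2-D grid of pixels into one feature vector; a real raccoon
# photo would just be a much longer vector of pixel values.
features = [pixel for row in image for pixel in row]
print(len(features))  # 9 features for a 3x3 image
```

A real image would produce millions of such values, and the model's job is to learn which spatial patterns among them mean "raccoon."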

With textual data, there's going to be a lot of challenges, especially when we get
to looking at natural language processing, which we'll do towards the end of the
course.

But an example in textual data is, let's say we have this review here: "This book is pretty good, I'd recommend it to my friends," and we want our model to decide, was this a good review or a bad review? When we do our training, each data sample would be some piece of text stored as a file.

It could be a sentence, a chapter, or a document overall.

And then the features would be the text characters or words.

For example, does this text express a positive sentiment or a negative sentiment? Now, here we have a review, and our goal is going to be, is this a positive review? Well, we'd break it into its different pieces and use them to train the model.

Now, during training, we would identify a label with it saying This is a positive
sentiment.

That way, whenever we ask our model in the future to evaluate a review, is it good or bad? It would've learned from all the previous reviews we've already labeled: this is a good review, this is a bad review, and so on.

It would be able to compare those to ours, and based on the different strengths, weaknesses, and characteristic relationships of the words, it would tell us, hey, this is a positive review.
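A heavily simplified sketch of this idea, in plain Python with made-up word lists: score a review by counting words whose sentiment we pretend was learned from labeled examples. Real natural language processing models learn these associations from data rather than from hand-written lists:

```python
# A toy sketch (plain Python, made-up word lists) of sentiment scoring:
# count words associated with positive or negative labeled examples.
positive_words = {"good", "great", "recommend"}
negative_words = {"bad", "boring", "awful"}

def sentiment(review):
    """Return 'positive' or 'negative' by counting known sentiment words."""
    words = review.lower().replace(",", "").replace(".", "").split()
    score = (sum(w in positive_words for w in words)
             - sum(w in negative_words for w in words))
    return "positive" if score >= 0 else "negative"

print(sentiment("This book is pretty good, I'd recommend it to my friends"))
# -> positive ("good" and "recommend" both score)
```

Treating a tie as positive is an arbitrary choice here; the point is only that labeled examples give words their weights.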
