Data Cleaning
Most Machine Learning algorithms cannot work with missing features, so let's create
a few functions to take care of them. You noticed earlier that the total_bedrooms
attribute has some missing values, so let's fix this. You have three options:
+ Get rid of the corresponding districts.
+ Get rid of the whole attribute.
+ Set the values to some value (zero, the mean, the median, etc.).
You can accomplish these easily using DataFrame's dropna(), drop(), and fillna()
methods:
housing.dropna(subset=["total_bedrooms"])    # option 1
housing.drop("total_bedrooms", axis=1)       # option 2
median = housing["total_bedrooms"].median()  # option 3
housing["total_bedrooms"].fillna(median, inplace=True)
If you choose option 3, you should compute the median value on the training set and
use it to fill the missing values in the training set, but don't forget to save the
median value that you have computed. You will need it later to replace missing values
in the test set when you want to evaluate your system, and also once the system goes
live, to replace missing values in new data.
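The train-then-reuse workflow above can be sketched on a toy dataset (the values and the tiny train/test frames here are invented for illustration, not taken from the housing data):

```python
import numpy as np
import pandas as pd

# Toy train/test split with missing values in a single column
train = pd.DataFrame({"total_bedrooms": [1.0, 2.0, np.nan, 4.0]})
test = pd.DataFrame({"total_bedrooms": [np.nan, 5.0]})

# Compute the median on the *training set only* and save it
median = train["total_bedrooms"].median()  # median of [1, 2, 4] -> 2.0

# Reuse the saved training median to fill both sets (and, later, live data)
train["total_bedrooms"] = train["total_bedrooms"].fillna(median)
test["total_bedrooms"] = test["total_bedrooms"].fillna(median)
```

The key point is that the test set's own median is never computed: the value learned on the training set is applied everywhere.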
Scikit-Learn provides a handy class to take care of missing values: SimpleImputer.
Here is how to use it. First, you need to create a SimpleImputer instance, specifying
that you want to replace each attribute's missing values with the median of that
attribute:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
Since the median can only be computed on numerical attributes, we need to create a
copy of the data without the text attribute ocean_proximity:
housing_num = housing.drop("ocean_proximity", axis=1)
Now you can fit the imputer instance to the training data using the fit() method:
imputer.fit(housing_num)
The imputer has simply computed the median of each attribute and stored the result
in its statistics_ instance variable. Only the total_bedrooms attribute had missing
values, but we cannot be sure that there won't be any missing values in new data after
the system goes live, so it is safer to apply the imputer to all the numerical attributes:
>>> imputer.statistics_
array([-118.51, 34.25, 29., 2119.5, 433., 1164., 408., 3.5409])
>>> housing_num.median().values
array([-118.51, 34.25, 29., 2119.5, 433., 1164., 408., 3.5409])
Now you can use this "trained" imputer to transform the training set by replacing
missing values with the learned medians:
X = imputer.transform(housing_num)
The result is a plain NumPy array containing the transformed features. If you want to
put it back into a pandas DataFrame, it's simple:

housing_tr = pd.DataFrame(X, columns=housing_num.columns)
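The whole sequence, from fitting the imputer to recovering a DataFrame, can be exercised end to end on a small invented numeric frame (the column names "rooms" and "age" are stand-ins for housing_num's columns):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Small numeric frame standing in for housing_num (values invented)
df = pd.DataFrame({"rooms": [2.0, 4.0, np.nan, 6.0],
                   "age":   [10.0, np.nan, 30.0, 20.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(df)                 # learns per-column medians: 4.0 and 20.0
X = imputer.transform(df)       # plain NumPy array, NaNs replaced

# Back to a DataFrame, reusing the original column names
df_tr = pd.DataFrame(X, columns=df.columns)
```

After this, df_tr has no missing values, and imputer.statistics_ holds the learned medians, ready to be applied to a test set or to live data.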
Scikit-Learn Design
Scikit-Learn's API is remarkably well designed. The main design principles are:
+ Consistency. All objects share a consistent and simple interface:
— Estimators. Any object that can estimate some parameters based on a dataset
is called an estimator (e.g., an imputer is an estimator). The estimation itself is
performed by the fit() method, and it takes only a dataset as a parameter (or
two for supervised learning algorithms; the second dataset contains the
labels). Any other parameter needed to guide the estimation process is con-
sidered a hyperparameter (such as an imputer's strategy), and it must be set
as an instance variable (generally via a constructor parameter).
— Transformers. Some estimators (such as an imputer) can also transform a
dataset; these are called transformers. Once again, the API is quite simple: the
transformation is performed by the transform() method with the dataset to
transform as a parameter. It returns the transformed dataset. This transforma-
tion generally relies on the learned parameters, as is the case for an imputer.
All transformers also have a convenience method called fit_transform()
that is equivalent to calling fit() and then transform() (but sometimes
fit_transform() is optimized and runs much faster).
— Predictors. Finally, some estimators are capable of making predictions given a
dataset; they are called predictors. For example, the LinearRegression model
in the previous chapter was a predictor: it predicted life satisfaction given a
country's GDP per capita. A predictor has a predict() method that takes a
dataset of new instances and returns a dataset of corresponding predictions. It
also has a score() method that measures the quality of the predictions, given
a test set (and the corresponding labels, in the case of supervised learning
algorithms).

For more details on the design principles, see "API design for machine learning software: experiences from
the scikit-learn project," L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Müller, et al. (2013).
+ Inspection. All the estimator's hyperparameters are accessible directly via public
instance variables (e.g., imputer.strategy), and all the estimator's learned
parameters are also accessible via public instance variables with an underscore
suffix (e.g., imputer.statistics_).
+ Nonproliferation of classes. Datasets are represented as NumPy arrays or SciPy
sparse matrices, instead of homemade classes. Hyperparameters are just regular
Python strings or numbers.
+ Composition. Existing building blocks are reused as much as possible. For
example, it is easy to create a Pipeline estimator from an arbitrary sequence of
transformers followed by a final estimator, as we will see.
+ Sensible defaults. Scikit-Learn provides reasonable default values for most
parameters, making it easy to create a baseline working system quickly.
Handling Text and Categorical Attributes
Earlier we left out the categorical attribute ocean_proximity because it is a text
attribute, so we cannot compute its median:
>>> housing_cat = housing[["ocean_proximity"]]
>>> housing_cat.head(10)
      ocean_proximity
17606       <1H OCEAN
18632       <1H OCEAN
...

Most Machine Learning algorithms prefer to work with numbers anyway, so let's
convert these categories from text to numbers. For this, we can use Scikit-Learn's
OrdinalEncoder class:

>>> from sklearn.preprocessing import OrdinalEncoder
>>> ordinal_encoder = OrdinalEncoder()
18 Some predictors also provide methods to measure the confidence of their predictions.
19 This class is available since Scikit-Learn 0.20. If you use an earlier version, please consider upgrading, or use
Pandas' Series.factorize() method.
>>> housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
>>> housing_cat_encoded[:10]
array([[0.],
       [0.],
       [4.],
       [1.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.]])
You can get the list of categories using the categories_ instance variable. It is a list
containing a 1D array of categories for each categorical attribute (in this case, a list
containing a single array since there is just one categorical attribute):
>>> ordinal_encoder.categories_
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]
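The encoder learns its categories in sorted order and maps each value to that category's index. A minimal sketch on an invented column (the name "level" and its values are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical column; categories are learned in sorted order
df = pd.DataFrame({"level": ["low", "high", "medium", "low"]})

enc = OrdinalEncoder()
codes = enc.fit_transform(df)

# Sorted categories: 'high' -> 0, 'low' -> 1, 'medium' -> 2
# so the encoded column is [1., 0., 2., 1.]
cats = list(enc.categories_[0])
```

Note that the sorted order is purely alphabetical: "high" gets code 0 and "medium" gets code 2, which is not a meaningful ranking — exactly the issue discussed next.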
One issue with this representation is that ML algorithms will assume that two nearby
values are more similar than two distant values. This may be fine in some cases (e.g.,
for ordered categories such as "bad", "average", "good", "excellent"), but it is obviously
not the case for the ocean_proximity column (for example, categories 0 and 4 are
clearly more similar than categories 0 and 1). To fix this issue, a common solution is
to create one binary attribute per category: one attribute equal to 1 when the category
is "<1H OCEAN" (and 0 otherwise), another attribute equal to 1 when the category is
"INLAND" (and 0 otherwise), and so on. This is called one-hot encoding, because
only one attribute will be equal to 1 (hot), while the others will be 0 (cold). The new
attributes are sometimes called dummy attributes. Scikit-Learn provides a OneHotEn
coder class to convert categorical values into one-hot vectors:
>>> from sklearn.preprocessing import OneHotEncoder
>>> cat_encoder = OneHotEncoder()
>>> housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
>>> housing_cat_1hot
<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>
20 Before Scikit-Learn 0.20, it could only encode integer categorical values, but since 0.20 it can also handle
other types of inputs, including text categorical inputs.

Notice that the output is a SciPy sparse matrix, instead of a NumPy array. This is very
useful when you have categorical attributes with thousands of categories. After one-
hot encoding, we get a matrix with thousands of columns, and the matrix is full of
zeros except for a single 1 per row. Using up tons of memory mostly to store zeros
would be very wasteful, so instead a sparse matrix only stores the location of the non-
zero elements. You can use it mostly like a normal 2D array, but if you really want to
convert it to a (dense) NumPy array, just call the toarray() method:
>>> housing_cat_1hot.toarray()
array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       ...,
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])
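The sparse-by-default behavior and the toarray() conversion can be checked on a tiny invented column (the name "proximity" and its three values are illustrative stand-ins for ocean_proximity):

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column with two distinct categories
df = pd.DataFrame({"proximity": ["INLAND", "NEAR BAY", "INLAND"]})

encoder = OneHotEncoder()                # sparse output by default
one_hot = encoder.fit_transform(df)

is_sparse = sparse.issparse(one_hot)     # only nonzeros are stored
dense = one_hot.toarray()                # densify only when needed
```

With categories learned in sorted order ('INLAND', 'NEAR BAY'), each row densifies to a vector with a single 1: here [[1, 0], [0, 1], [1, 0]]. For a column with thousands of categories, skipping toarray() is what keeps memory usage manageable.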
Once again, you can get the list of categories using the encoder's categories_
instance variable:
>>> cat_encoder.categories_
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]