Dealing with Highly Dimensional Data using Principal Component Analysis (PCA)

A beginner's guide to PCA and how to implement it using sklearn (with code!)

A common issue for data scientists when creating an algorithm is having too many variables. Naturally, you would think that adding more information would only make your model better, but with every feature you add comes another dimension. As humans, we can only visualize things in two or three dimensions. For data, this rule does not apply! Data can have an effectively unlimited number of dimensions, and this is where the curse of dimensionality comes into play.

The Curse of Dimensionality is a paradox that data scientists face quite frequently. You want to use more information to improve the accuracy of your machine learning model, but the more features you add, the more the number of dimensions (n) grows. As the dimensionality of the feature space increases, the number of possible configurations increases exponentially, and in turn the fraction of those configurations covered by your observations decreases.

Our ultimate goal as data scientists is to create simple models that run quickly and are easy to explain. When we have a large number of features, our model becomes more complex and its explainability decreases. To deal with these complicated datasets, Principal Component Analysis is an ideal method for reducing the dimensions of your data.

What is Principal Component Analysis and what is it used for?

Principal Component Analysis, more commonly known as PCA, is a way to reduce the number of variables while maintaining the majority of the important information. It transforms a number of variables that may be correlated into a smaller number of uncorrelated variables called principal components. It also reduces the chance of overfitting your model by eliminating features with high correlation.

It is important to note that you should only apply PCA to continuous variables, not categorical ones. Although you can technically use PCA on one-hot encoded or otherwise binary data, it does not work very well. This is because PCA is designed to minimize variance (squared deviations), which is not very meaningful for binary variables. If you have mixed data, alternative methods like MCA may work better.

So how can you tell how much information is retained in your PCA?

We use the explained variance ratio as a metric to evaluate the usefulness of your principal components and to choose how many components to use in your model. The explained variance ratio is the percentage of variance attributed to each of the selected components. Ideally, you would choose the number of components to include in your model by adding up the explained variance ratio of each component until you reach a total of around 0.8, or 80%, to avoid overfitting. Luckily for us, sklearn makes it easy to get the explained variance ratio through the explained_variance_ratio_ attribute of a fitted PCA object! We will use this in our coding example.
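Before we get to the iris example, here is a small sketch of how that 80% rule of thumb could look in code. The synthetic matrix X, the random seed, and the 0.80 cutoff below are illustrative placeholders standing in for whatever dataset and threshold you are working with.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Placeholder data: 200 samples, 6 correlated features (a stand-in for your own dataset)
rng = np.random.RandomState(42)
base = rng.normal(size=(200, 3))
X = np.hstack([base, base + 0.1 * rng.normal(size=(200, 3))])  # duplicated, slightly noisy columns
X = StandardScaler().fit_transform(X)

# Keep every component for now, then look at the cumulative explained variance
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches ~0.80
n_components = int(np.argmax(cumulative >= 0.80)) + 1
print(cumulative)
print("Components to keep:", n_components)
```

On this toy data the threshold should be reached at three components, because the six columns really only carry three independent signals.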
Example PCA using sklearn

1. First, let's load the iris dataset for our code-a-long example. The iris dataset is a famous dataset that contains measurements for 150 iris flowers from three different species.

```python
from sklearn import datasets
import pandas as pd

# Load the iris data into a pandas dataframe, with one column per measurement
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.get('target')  # 0, 1, 2 encode the three species
df.head()
```

2. Next, let's separate the four measurement columns (our features) from the target so that we can transform them.

```python
features = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
x = df.loc[:, features].values
```

3. We have to standardize the data before implementing PCA. This is absolutely necessary because PCA calculates a new projection of our data onto new axes using the standard deviation of the data. PCA gives more weight to variables that have higher variances than to variables with low variances, so it is important to normalize the features onto the same scale to get a reasonable covariance matrix.

```python
from sklearn.preprocessing import StandardScaler

# Scale each feature to zero mean and unit variance
x = StandardScaler().fit_transform(x)
pd.DataFrame(data=x, columns=features).head()
```

4. Now, we will import PCA from sklearn and project our original data, which has 4 dimensions, into 2 dimensions. In this step, sklearn builds a covariance matrix to calculate the eigenvectors (principal components) and their corresponding eigenvalues. The eigenvectors determine the directions of the new feature space, and the eigenvalues determine their magnitude, i.e. the variance of the data along the new feature axes.

```python
from sklearn.decomposition import PCA

# Project the 4-dimensional data down to 2 principal components
pca = PCA(n_components=2)
principal_components = pca.fit_transform(x)
```

5. To better visualize the principal components, let's pair them with the target (flower type) associated with each observation in a pandas dataframe.

```python
pca_df = pd.DataFrame(data=principal_components, columns=['pc1', 'pc2'])
pca_df['target'] = df['target']
pca_df.head()
```

6. Let's plot the two principal components against each other, coloring each observation by its species.

[Figure: scatter plot titled "Principal Component Analysis (2 PCs) for Iris Dataset", with the first principal component on the x-axis, the second principal component on the y-axis, and one color per species.]

7. We can see that the three classes are pretty distinct and fairly separable. We can conclude that the compressed data representation is most likely sufficient for a classification model. We can compare the variance in the overall dataset to what was captured by our two principal components using explained_variance_ratio_.

```python
print('Variance of each component:', pca.explained_variance_ratio_)
print('\n Total Variance Explained:', round(sum(list(pca.explained_variance_ratio_)) * 100, 2))
```

Variance of each component: [0.72962445 0.22850762]
Total Variance Explained: 95.81

We can see that our first two principal components explain the majority of the variance in this dataset (95.81%)! This is an indication of how much of the information in the original data is represented in the compressed form.
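As an optional extra step (not part of the walkthrough above), we can sanity-check the claim that two components are enough for classification by training a quick model on the compressed data. The train/test split, the LogisticRegression choice, and the reuse of principal_components and df['target'] from the steps above are illustrative assumptions, not part of the original example.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Train a simple classifier using only the 2 principal components
X_train, X_test, y_train, y_test = train_test_split(
    principal_components, df['target'], test_size=0.3, random_state=42, stratify=df['target'])

clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
print('Accuracy using only 2 principal components:',
      round(accuracy_score(y_test, clf.predict(X_test)), 3))
```

If the classes are as separable as the scatter plot suggests, the accuracy should stay high even though we threw away half of the original features.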
Summary

The key takeaways from this article are:

- Only apply PCA to continuous data.
- Make sure your data is normalized before applying PCA!

Thanks for the read! If you like my blog posts, you can support me by following my Medium here. You can also connect with me via my LinkedIn here.
