Feature Selection
by Rischan Mafrur
Outlier Removal
Data Normalization
t-TEST
The Receiver Operating Characteristic Curve
Fisher's Discriminant Ratio
Outlier Removal
Learning by Example
Problem: Example 4.2.1 [page 107]. We have N = 100 random data points drawn from a one-dimensional Gaussian with mean 1 and variance 0.16, to which we add five outlier points: [6.2, -6.4, 4.2, 15, 6.8].
1. Generate the data set
2. Add the outlier values
3. Scramble the data
4. Find the outliers and their indices
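The four steps of this example can be sketched in Python as follows. This is a minimal illustration, not the code used in the slides: the detection rule (flag every point farther than k sample standard deviations from the sample mean), the value k = 1, the seed, and the helper name `find_outliers` are all assumptions made for the sketch.

```python
import random
import statistics

def find_outliers(data, k=1.0):
    """Return (index, value) pairs lying more than k sample standard
    deviations from the sample mean (k is an illustrative choice)."""
    mean = statistics.fmean(data)
    std = statistics.stdev(data)
    return [(i, x) for i, x in enumerate(data) if abs(x - mean) > k * std]

rng = random.Random(0)                 # fixed seed, for reproducibility only
OUTLIERS = [6.2, -6.4, 4.2, 15, 6.8]

# 1. Generate N = 100 points from a Gaussian with mean 1 and
#    variance 0.16 (standard deviation 0.4).
data = [rng.gauss(1.0, 0.4) for _ in range(100)]
# 2. Add the five outliers.
data += OUTLIERS
# 3. Scramble the data.
rng.shuffle(data)
# 4. Find the outliers and their indices.
found = find_outliers(data, k=1.0)
for index, value in found:
    print(f"index {index:3d}: value {value:.2f}")
```

Because the outliers themselves inflate the sample standard deviation, a larger k can miss the milder ones (4.2 in particular), which is why a small k is used here.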
Result:
Now we can identify the value and the position (index) of each outlier.
Data Normalization
3 Normalization Methods
NormalizeStd Function
NormalizeMnMx Function
NormalizeSoftMax Function
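As a rough illustration of what three such functions typically compute, here is a Python sketch. The function names, the parameter r, the default target interval, and the sample values are assumptions for this sketch, not the signatures of the functions named above. Softmax scaling is taken here to be the logistic function applied to (x - mean) / (r * std).

```python
import math
import statistics

def normalize_std(x):
    """Shift and scale to zero mean and unit (sample) standard deviation."""
    mean, std = statistics.fmean(x), statistics.stdev(x)
    return [(v - mean) / std for v in x]

def normalize_minmax(x, lo=-1.0, hi=1.0):
    """Linearly rescale the values to the interval [lo, hi]."""
    x_min, x_max = min(x), max(x)
    return [lo + (hi - lo) * (v - x_min) / (x_max - x_min) for v in x]

def normalize_softmax(x, r=0.5):
    """Squash (v - mean) / (r * std) into (0, 1) with the logistic function."""
    mean, std = statistics.fmean(x), statistics.stdev(x)
    return [1.0 / (1.0 + math.exp(-(v - mean) / (r * std))) for v in x]

feature = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 30.0]   # made-up values
by_std = normalize_std(feature)
by_minmax = normalize_minmax(feature)
by_softmax = normalize_softmax(feature)
print([round(v, 3) for v in by_std])
print([round(v, 3) for v in by_minmax])
print([round(v, 3) for v in by_softmax])
```

Min-max scaling is sensitive to extreme values such as the 30 in this sample, while softmax scaling compresses them toward the ends of the interval.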
Result (figure panels): Original Data; by Min Max [-1,1]; by Std; by SoftMax [0.5]
t-TEST
Learning by Example
Problem: Example 4.4.1 [page 112]. Assume the data set is normally distributed. We have two Gaussian classes with means m1 = 8.75 and m2 = 9 and variance 4.
Generate the vectors x1 and x2, each containing N = 1000 points. Assume we do not know the means or the variance and only have the vectors x1 and x2; we want to test whether the means of the two data sets are equal. We use the significance levels 5% (confidence level 95%) and 0.1% (confidence level 99.9%).
In the t-test we have two hypotheses:
H0: The mean values of the data in the two classes are equal.
H1: The mean values of the data in the two classes are not equal.
In this case, at the 5% significance level the result is h = 1, which implies that the hypothesis of equal means can be rejected. At the 0.1% significance level the result is h = 0, which implies that there is no evidence to reject the hypothesis of equal means. With m1 = 8.75 and m2 = 9, the test at the 5% level concludes that the means of the two classes are not equal, whereas the test at the 0.1% level fails to reject equality (which is not the same as showing that the means are equal). We can conclude: the smaller the significance level, the more confident we want to be in our decision, and the harder it is to reject the equality hypothesis.
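The test in this example can be sketched in Python with the standard library only. This is an illustration under stated assumptions, not the code behind the slides: it uses the pooled-variance two-sample t statistic and, because N = 1000 per class, replaces the t quantile with the normal quantile as a large-sample approximation. The function names and the seed are hypothetical.

```python
import math
import random
import statistics

def two_sample_t(x1, x2):
    """Pooled-variance two-sample t statistic (equal variances assumed)."""
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * statistics.variance(x1)
                  + (n2 - 1) * statistics.variance(x2)) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (statistics.fmean(x1) - statistics.fmean(x2)) / std_err

def reject_equal_means(x1, x2, alpha):
    """Return 1 if H0 (equal means) is rejected at level alpha, else 0.
    The normal quantile stands in for the t quantile (large samples)."""
    critical = statistics.NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return int(abs(two_sample_t(x1, x2)) > critical)

rng = random.Random(0)                             # arbitrary seed
x1 = [rng.gauss(8.75, 2.0) for _ in range(1000)]   # variance 4 -> std 2
x2 = [rng.gauss(9.0, 2.0) for _ in range(1000)]
for alpha in (0.05, 0.001):
    print(f"alpha = {alpha}: h = {reject_equal_means(x1, x2, alpha)}")
```

For these parameters the expected size of the statistic is about 0.25 / (2 * sqrt(2/1000)) ≈ 2.8, which lies between the two critical values (about 1.96 and 3.29). That is consistent with rejecting at 5% but not at 0.1%, although any single random draw may behave differently.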
ROC
Receiver Operating Characteristic
ROC is a measure of the class-discrimination capability of a specific feature. It measures the overlap between the pdfs describing the data distribution of the feature in two classes [Theo 09, Section 5.5].
Learning by Example
Problem: Example 4.5.1 [page 113]. We have two classes, each a one-dimensional Gaussian, with m1 = 2 and m2 = 0. We plot the class histograms using plotHist, then compute and plot the corresponding AUC values using the function ROC.
We can also try different mean values:
[m1,m2] = [0,0]
[m1,m2] = [2,2]
[m1,m2] = [5,5]
[m1,m2] = [2,0]
[m1,m2] = [5,0]
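To see how the class means affect the AUC, here is a small Python sketch. It is not the ROC function used in the slides. It computes the conventional AUC in [0, 1] as the probability that a class-1 sample exceeds a class-2 sample, so complete overlap gives about 0.5; the ROC function in the slides may follow a different convention. The unit variance, the sample size N, the seed, and the function name `auc` are assumptions.

```python
import random

def auc(class1, class2):
    """Probability that a random class1 value exceeds a random class2
    value, counting ties as one half. O(n1 * n2), fine for small samples."""
    wins = 0.0
    for a in class1:
        for b in class2:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(class1) * len(class2))

rng = random.Random(0)   # arbitrary seed
N = 200                  # hypothetical sample size per class
for m1, m2 in [(0, 0), (2, 0), (5, 0)]:
    x1 = [rng.gauss(m1, 1.0) for _ in range(N)]   # unit variance assumed
    x2 = [rng.gauss(m2, 1.0) for _ in range(N)]
    print(f"[m1, m2] = [{m1}, {m2}]: AUC = {auc(x1, x2):.3f}")
```

The further apart the two means are, the less the class distributions overlap and the closer this AUC gets to 1.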
Figures: ROC curve and AUC value plot; plotHist for [m1,m2] = [0,0] and for [m1,m2] = [2,2]
FDR
Fisher's discriminant ratio (FDR) is commonly used to quantify the discriminatory power of individual features between two classes.
Learning by Example
Problem: Example 4.6.2 [page 115]. The data are those of Table 4.3. There are two classes, cirrhotic liver and fatty liver, each described by four features (mean, std, skew, and kurtosis). Which feature is the most informative? We use the FDR to rank the features and select the most informative one.
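Ranking features by FDR can be sketched in Python as below, assuming the usual two-class definition FDR = (m1 - m2)^2 / (s1^2 + s2^2), with m the class means and s^2 the class (sample) variances of a single feature. The feature names and values are made up for illustration and are not the data of Table 4.3.

```python
import statistics

def fdr(class1, class2):
    """Fisher's discriminant ratio of one feature for two classes."""
    m1, m2 = statistics.fmean(class1), statistics.fmean(class2)
    return (m1 - m2) ** 2 / (statistics.variance(class1)
                             + statistics.variance(class2))

# Hypothetical feature values per class (NOT the data of Table 4.3).
class_a = {"f1": [1.0, 1.2, 0.9, 1.1], "f2": [5.0, 7.0, 6.0, 4.0]}
class_b = {"f1": [2.0, 2.1, 1.9, 2.2], "f2": [5.5, 6.5, 4.5, 7.5]}

# Compute the FDR of each feature and pick the largest one.
scores = {name: fdr(class_a[name], class_b[name]) for name in class_a}
best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: FDR = {score:.4f}")
print("most informative feature:", best)
```

A feature scores high when its class means are far apart relative to the spread within each class, which is what makes it useful for separating the two classes.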
Result
According to the result, the highest FDR value belongs to the mean, with FDR = 13.8893, so the most informative feature is the mean.
Thank you :)