Sparse Matrix
Program #1
library(SparseM)
#Make a dense 5x5 matrix
A <- matrix(c(11,12,13,14,15, 21,22,23,24,25, 31,32,33,34,35,
              41,42,43,44,45, 51,52,53,54,55), nrow=5, ncol=5)
#Convert it to the three SparseM storage formats:
A.csr <- as.matrix.csr(A)   # compressed sparse row
A.csc <- as.matrix.csc(A)   # compressed sparse column
A.coo <- as.matrix.coo(A)   # coordinate (triplet) format
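# Quick sanity check (a sketch, not part of the original listing): converting
# a SparseM object back to dense with as.matrix() should recover A exactly.
all.equal(as.matrix(A.csr), A)   # TRUE if the round trip is lossless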
Program #2
library('Matrix')
#A dense 1000x1000 matrix with only two non-zero entries
m1 <- matrix(0, nrow=1000, ncol=1000)
m1[1,1] <- 1
m1[2,3] <- 5
object.size(m1)   # ~8 MB: every zero is stored explicitly
#The same matrix in sparse storage
m2 <- Matrix(0, nrow=1000, ncol=1000, sparse=TRUE)
m2[1,1] <- 1
m2[2,3] <- 5
object.size(m2)   # only a few KB: just the non-zeros are stored
#Sparse matrices support the usual linear-algebra operations:
m2 %*% rnorm(1000)   # matrix-vector product
m2 + m2
m2 - m2
t(m2)                # transpose
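# Alternative construction (a sketch): build the same sparse matrix directly
# from (row, column, value) triplets with sparseMatrix(), avoiding the
# per-element assignments above.
m3 <- sparseMatrix(i = c(1, 2), j = c(1, 3), x = c(1, 5), dims = c(1000, 1000))
max(abs(m3 - m2))   # 0: m3 holds the same values as m2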
require(xgboost)
require(Matrix)
require(data.table)
if (!require(vcd)) {
  install.packages('vcd') # Available on CRAN. Used for its dataset with categorical values.
  require(vcd)
}
# According to its documentation, XGBoost works only on numeric data.
# Sometimes the dataset we have to work on has categorical data.
# A categorical variable is one which has a fixed set of possible values. For
# example, if for each observation a variable called "Colour" can only take
# "red", "blue" or "green" as its value, it is a categorical variable.
#
# In R, a categorical variable is called a factor.
# Type ?factor in the console for more information.
#
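# A minimal sketch of a factor, mirroring the "Colour" example above:
colour <- factor(c("red", "blue", "green", "red"))
levels(colour)       # "blue"  "green" "red"
as.integer(colour)   # 3 1 2 3: internally a factor is integer codes plus levels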
# In this demo we will see how to transform a dense data.frame with
# categorical variables into a sparse matrix before analyzing it in XGBoost.
# The method we are going to use is usually called "one-hot encoding".
# 2 columns have factor type, one has ordinal type (an ordinal variable is a
# categorical variable whose values can be ordered, here: None > Some >
# Marked).
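# A minimal sketch of one-hot encoding using sparse.model.matrix() from the
# Matrix package (the toy data frame is illustrative, not the demo data):
toy <- data.frame(Colour = factor(c("red", "blue", "green")))
sparse.model.matrix(~ Colour - 1, data = toy)
# Each factor level becomes its own 0/1 indicator column:
# Colourblue, Colourgreen, Colourred.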
cat("Structure of the dataset\n")
str(df)
# Let's add some new categorical features to see if they help. Of course these
# features are highly correlated with the Age feature. Usually that's not a
# good thing in ML, but tree algorithms (including boosted trees) are able to
# select the best features, even when they are highly correlated.
# For the first feature we create groups of age by rounding the real age. Note
# that we transform it to a factor (categorical data) so the algorithm treats
# the groups as independent values.
df[,AgeDiscret:= as.factor(round(Age/10,0))]
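# Quick check (a sketch): how the observations fall into the new age groups.
table(df$AgeDiscret)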
# We remove ID as there is nothing to learn from this feature (it will just
# add some noise, as the dataset is small).
df[,ID:=NULL]
# List the different values for the column Treatment: Placebo, Treated.
cat("Values of the categorical feature Treatment\n")
print(levels(df[,Treatment]))
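# The chi-squared tests below use a binary outcome Y and a coarse AgeCat
# feature that this excerpt does not define. A plausible reconstruction (an
# assumption, inferred from the comments that follow) is:
df[, Y := ifelse(Improved == "Marked", 1, 0)]               # assumed: 1 = illness markedly improved
df[, AgeCat := as.factor(ifelse(Age > 30, "Old", "Young"))] # assumed split at 30, per the comment below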
print(chisq.test(df$Age, df$Y))
# The chi-squared statistic between Age and illness disappearing is 35.
print(chisq.test(df$AgeDiscret, df$Y))
# Our first simplification of Age gives a chi-squared statistic of 8.
print(chisq.test(df$AgeCat, df$Y))
# The arbitrary split I made between young and old at 30 years gives a low
# chi-squared statistic of 2. It's a result we might expect: perhaps in my
# mind being over 30 means being old (I am 32 and starting to feel old, which
# may explain it), but for the illness we are studying, the age of
# vulnerability is not the same. Don't let your "gut" lower the quality of
# your model. In "data science", there is science :-)