Data Mining Exam January 2015 Correcti

E.S.S.A.I
January 2015
Exam, Main Session
Course: Data Mining
Duration: 1h30 D. Malouche
Exercise 1
Complete the following R code.
> data(bodyfat,package="TH.data")
> dim(bodyfat)
[1] 71 10
> colnames(bodyfat)
[1] "age"          "DEXfat"       "waistcirc"    "hipcirc"      "elbowbreadth"
[6] "kneebreadth"  "anthro3a"     "anthro3b"     "anthro3c"     "anthro4"
> ### Split the data into two subsamples

> set.seed(1234)
> ind <- sample(2,nrow(bodyfat),replace=T,prob=c(.7,.3))
> table(ind)
ind
1 2
56 15
> bodyfat.train <- bodyfat[ind==1,]

> dim(bodyfat.train)
[1] 56 10
> bodyfat.test <- bodyfat[ind==2,]

> dim(bodyfat.test)
[1] 15 10
> library(rpart)
> myFormula <- DEXfat ~ age+waistcirc+hipcirc+elbowbreadth+kneebreadth
> bodyfat_rpart <- rpart(myFormula, data=bodyfat.train,control=rpart.control(minsplit=10))
> print(bodyfat_rpart)
n= 56

node), split, n, deviance, yval
      * denotes terminal node

 1) root 56 7265.0290000 30.94589
   2) waistcirc< 88.4 31 960.5381000 22.55645
     4) hipcirc< 96.25 14 222.2648000 18.41143
       8) age< 60.5 9 66.8809600 16.19222 *
       9) age>=60.5 5 31.2769200 22.40600 *
     5) hipcirc>=96.25 17 299.6470000 25.97000
      10) waistcirc< 77.75 6 30.7345500 22.32500 *
      11) waistcirc>=77.75 11 145.7148000 27.95818
        22) hipcirc< 99.5 3 0.2568667 23.74667 *
        23) hipcirc>=99.5 8 72.2933500 29.53750 *
   3) waistcirc>=88.4 25 1417.1140000 41.34880
     6) waistcirc< 104.75 18 330.5792000 38.09111
      12) hipcirc< 109.9 9 68.9996200 34.37556 *
      13) hipcirc>=109.9 9 13.0832000 41.80667 *
     7) waistcirc>=104.75 7 404.3004000 49.72571 *
[Figure: plot of the regression tree bodyfat_rpart. Root split waistcirc< 88.4; leaf values 16.19, 22.41, 22.32, 23.75, 29.54, 34.38, 41.81 and 49.73, as in the printed tree.]
> y=bodyfat$DEXfat[ind==1]
> ybar=mean(y)
> ybar
[1] 30.94589
> sum((y-ybar)^2)
[1] 7265.029
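The root deviance reported by rpart is exactly the sum of squares computed here. As a rough check of the first split, the sketch below uses only the n, deviance and yval figures copied from the printed tree; the object names (root, left, right, gain, share) are illustrative and not part of the exam code.

```r
# Figures copied from the printed tree (nodes 1, 2 and 3).
root  <- list(n = 56, dev = 7265.029, yval = 30.94589)
left  <- list(n = 31, dev = 960.5381, yval = 22.55645)   # waistcirc <  88.4
right <- list(n = 25, dev = 1417.114, yval = 41.34880)   # waistcirc >= 88.4

# The root mean is the size-weighted mean of the two child means.
weighted_mean <- (left$n * left$yval + right$n * right$yval) / root$n

# Drop in deviance achieved by the split, and its share of the root deviance.
gain  <- root$dev - (left$dev + right$dev)
share <- gain / root$dev

print(round(c(weighted_mean = weighted_mean, gain = gain, share = share), 4))
```

On these figures the first split on waistcirc alone removes roughly two thirds of the root deviance.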
> bodyfat.test[1:2,]
   age DEXfat waistcirc hipcirc elbowbreadth kneebreadth anthro3a anthro3b
51  60  36.42      89.5   100.5          7.1        10.0     4.24     4.68
60  60  21.94      65.0    90.0          5.7         8.2     3.72     4.11
   anthro3c anthro4
51     4.15    5.91
60     3.48    5.29
> DEXfat_pred=predict(bodyfat_rpart,newdata=bodyfat.test[1:2,])
> DEXfat_pred
51 60
34.37556 16.19222
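These two predictions can be reproduced by following the printed splits by hand. The function below is a minimal sketch that hard-codes the thresholds and leaf values from the print(bodyfat_rpart) output; predict_by_hand is a hypothetical helper, not an rpart function.

```r
# Decision rules transcribed from the printed rpart tree (terminal node values).
predict_by_hand <- function(age, waistcirc, hipcirc) {
  if (waistcirc < 88.4) {
    if (hipcirc < 96.25) {
      if (age < 60.5) 16.19222 else 22.40600
    } else if (waistcirc < 77.75) {
      22.32500
    } else if (hipcirc < 99.5) {
      23.74667
    } else {
      29.53750
    }
  } else if (waistcirc < 104.75) {
    if (hipcirc < 109.9) 34.37556 else 41.80667
  } else {
    49.72571
  }
}

# The two test rows shown in the transcript (observations 51 and 60).
p51 <- predict_by_hand(age = 60, waistcirc = 89.5, hipcirc = 100.5)
p60 <- predict_by_hand(age = 60, waistcirc = 65.0, hipcirc = 90.0)
print(c(`51` = p51, `60` = p60))
```

Observation 51 ends in node 12 and observation 60 in node 8, which matches the output of predict().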
Exercise 2
> library(randomForest)
> rf<-randomForest(Species~ .,data=iris,mtry = 3,ntree = 50)
> table(predict(rf))

    setosa versicolor  virginica
        50         49         51
> table(predict(rf),iris$Species)

             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         2
  virginica       0          3        48
> print(rf)
Call:
randomForest(formula = Species ~ ., data = iris, mtry = 3, ntree = 50)
Type of random forest: classification
Number of trees: 50
No. of variables tried at each split: 3
OOB estimate of error rate: 3.33%

Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          2        48        0.04
> importance(rf)
             MeanDecreaseGini
Sepal.Length         1.292915
Sepal.Width          1.380635
Petal.Length        49.984232
Petal.Width         46.797418
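1. Explain the meaning of the output of the command importance(rf).

Answer: For each variable xj and for each OOB sample, the values of xj are permuted. Importance of xj = mean increase in the error of a tree after permutation. The larger the increase in error, the more important the variable. Note that randomForest reports this permutation-based measure (MeanDecreaseAccuracy) only when the forest is fitted with importance=TRUE. The column actually printed here, MeanDecreaseGini, is the total decrease in node impurity (Gini index) produced by the splits on xj, averaged over all trees. On the values shown, Petal.Length (49.98) and Petal.Width (46.80) are by far the most important variables.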
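2. Explain how the OOB error was computed.

Answer: OOB ("Out Of Bag") error, where $\hat{y}_i$ is the prediction for observation $i$ obtained from the trees whose bootstrap sample does not contain $i$:
• $\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$ for regression trees.
• $\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_{\{y_i\neq\hat{y}_i\}}$ for classification trees.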
3. Complete the R code.
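The 3.33% OOB error rate and the class.error column can be recomputed from the confusion matrix printed by print(rf). The sketch below copies that matrix by hand and uses base R only; the object names are illustrative.

```r
# Confusion matrix copied from print(rf): rows = true class, columns = OOB prediction.
classes <- c("setosa", "versicolor", "virginica")
conf <- matrix(c(50,  0,  0,
                  0, 47,  3,
                  0,  2, 48),
               nrow = 3, byrow = TRUE,
               dimnames = list(truth = classes, predicted = classes))

# Overall OOB error: share of observations lying off the diagonal.
oob_error <- 1 - sum(diag(conf)) / sum(conf)

# Per-class error: misclassified share within each true class.
class_error <- 1 - diag(conf) / rowSums(conf)

print(round(100 * oob_error, 2))
print(class_error)
```

Five of the 150 flowers are misclassified out of bag (3 versicolor and 2 virginica), which gives 5/150.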
Exercise 3
> setwd("~/Documents/Examens14_15/Exam_Principal_dm_1415/CodeR_Exam_dm/")
> txt <- readLines("constitution_tunisienne_Janv2014.txt",
+ encoding = 'utf-8')
> length(txt)
[1] 582
> i=grep('Article',txt)
> txt[i][1:6 ]
[1] "Article premier." "Article 2."       "Article 3."       "Article 4."
[5] "Article 5."       "Article 6."
> length(i)
[1] 149
> txt=txt[-i]
> library(tm)
> for(i in 1:length(txt)){
+ txt[i]=tolower(txt[i])
+ txt[i]=removeWords(txt[i],c("l'","d'","s'","jusqu'","qu'","c'","n'","m'"))
+ txt[i]=removeWords(txt[i],words = stopwords('french'))
+ }
> txt=removePunctuation(txt)
> txt=removeNumbers(txt)
> corpus <- Corpus(VectorSource(txt))
> tdm <- TermDocumentMatrix(corpus,control = list(removePunctuation = TRUE,
+ stopwords = TRUE,
+ minWordLength=3))
> tdm
<<TermDocumentMatrix (terms: 1637, documents: 433)>>
Non-/sparse entries: 5861/702960
Sparsity           : 99%
Maximal term length: 22
Weighting          : term frequency (tf)
> dim(tdm)
[1] 1637 433
> m=as.matrix(tdm)
> dim(m)
[1] 1637 433
> sum((m!=0))
[1] 5861
> sum((m==0))
[1] 702960
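The figures in the TermDocumentMatrix summary follow from these counts. A minimal sketch in base R, using only numbers copied from the transcript (the variable names are illustrative):

```r
# Counts copied from the transcript.
n_terms <- 1637
n_docs  <- 433
nonzero <- 5861     # sum(m != 0)
zero    <- 702960   # sum(m == 0)

# Every cell of the matrix is either zero or non-zero.
stopifnot(nonzero + zero == n_terms * n_docs)

# Sparsity is the share of zero cells, reported by tm as a rounded percentage.
sparsity <- zero / (n_terms * n_docs)
print(round(100 * sparsity))

# Average number of distinct terms kept per document (line of text).
terms_per_doc <- nonzero / n_docs
print(round(terms_per_doc, 2))
```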
> max(sapply(rownames(tdm),nchar))
[1] 22
> sort(sapply(rownames(tdm),nchar),d=T)[1:10]
infraconstitutionnelle   inconstitutionnalité    instrumentalisation
                    22                     20                     19
    constitutionnalité     constitutionnelles     exceptionnellement
                    18                     18                     18
     constitutionnelle      environnementales      inconstitutionnel
                    17                     17                     17
      décentralisation
                    16
> sort(rowSums(m),d=T)[1:20]
    assemblée     président           loi        peuple    république
          137           121           108           102           101
représentants  gouvernement          état       membres          cour
           89            83            64            56            53
         peut           cas          chef         droit          être
           52            49            44            44            43
 dispositions   conformément       conseil         délai    commission
           38            36            35            35            32
> sort(rowSums((m!=0)),d=T)[1:20]
        assemblée               loi         président            peuple
              102                93                93                92
       république     représentants      gouvernement              état
               87                82                62                57
             peut           membres               cas              chef
               47                43                42                42
             cour             droit              être      conformément
               41                37                37                36
     dispositions           conseil constitutionnelle             délai
               36                29                28                28
> grep("démocratie",rownames(m))
[1] 416
> findAssocs(tdm,'démocratie',.71)
$démocratie
numeric(0)
1. Complete the R code.
2. What are the lengths of the three longest words used in this text? Which words are they?
Answer: infraconstitutionnelle (22), inconstitutionnalité (20), instrumentalisation (19).
3. How many times is the word "démocratie" used in the Tunisian constitution? Which words are most strongly associated with this term?
Answer: once; the words associated with "démocratie" are adoptent, civil, faciliter.
4. What is the most frequent term in the constitution?
Answer: assemblée (137 occurrences, in 102 of the 433 lines of text).
5. How many articles does the Tunisian constitution contain?
Answer: 149.
6. List the text-processing steps followed to arrive at the Corpus object.
Answer: 1/ convert all words to lowercase, 2/ remove the elided forms l', d', etc., 3/ remove the French stopwords (le, la, les, etc.), 4/ remove punctuation and numbers.
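The cleaning steps of this exercise can be sketched in base R without the tm package. The sample lines and the five-word stopword list below are hypothetical stand-ins (the real code reads the constitution file and uses stopwords('french')), and the regular expressions only approximate what removeWords does.

```r
# Hypothetical sample lines (not taken from the constitution file), ASCII only.
txt <- c("Article 7.",
         "La loi garantit l'exercice du droit.",
         "Le peuple exerce le pouvoir, article 3 !")

# Tiny illustrative stopword list; tm's stopwords('french') is much longer.
stop_fr <- c("le", "la", "les", "du", "de")

txt <- txt[!grepl("Article", txt)]                                  # drop article headings
txt <- tolower(txt)                                                 # 1/ lowercase
txt <- gsub("\\b(l|d|s|n|c|m|qu|jusqu)'", "", txt, perl = TRUE)     # 2/ elided forms
stop_pattern <- paste0("\\b(", paste(stop_fr, collapse = "|"), ")\\b")
txt <- gsub(stop_pattern, "", txt, perl = TRUE)                     # 3/ stopwords
txt <- gsub("[[:punct:]]", "", txt)                                 # 4/ punctuation
txt <- gsub("[[:digit:]]", "", txt)                                 #    and digits
txt <- trimws(gsub("[[:space:]]+", " ", txt))                       # tidy leftover spaces
print(txt)
```

As in the transcript, the heading filter is case sensitive and runs before lowercasing, so the lowercase word "article" inside a sentence is kept.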
Exercise 4
Consider the contingency table dataCirc2 (constituencies x candidates) holding the results of the most recent presidential elections in Tunisia. Each column gives the number of votes obtained by one candidate. The last two columns, BCE.2T and Marzouki.2T, give the number of votes in the second round. A correspondence analysis was carried out to understand the correspondence between the candidates' votes, between regions and candidates, and the transfer of votes in the second round. The joint candidates x regions plot is drawn from the following R code:
> load("~/Documents/Examens14_15/Exam_Principal_dm_1415/CodeR_Exam_dm/tElec.RData")
> library(FactoMineR)
> ac1=CA(dataCirc2,ncp = 3,col.sup = 28:29,graph = F)
> rownames(ac1$col$coord)=c(as.character(1:6),"BCE.1T","Riahi",as.character(9:23),"Marzouki.1T","25"
> ac1$col$coord[c(7,8,24,26,27),1]
BCE.1T Riahi Marzouki.1T H.Hamdi Hammami
-0.12440559 -0.19467101 -0.11396980 1.76971198 -0.06684761
> ac1$col$coord[c(7,8,24,26,27),2]
BCE.1T Riahi Marzouki.1T H.Hamdi Hammami
0.30792567 0.04850019 -0.44914042 0.01935463 0.21501139
> ac1$col.sup$coord[,1]
BCE.y Marzouki.y
-0.03368558 -0.03244745
> ac1$col.sup$coord[,2]
BCE.y Marzouki.y
0.2929097 -0.4112017
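[Figure: simultaneous representation of candidates and constituencies on the first two axes (Axe1 horizontal, from -1 to 6; Axe2 vertical, from -2 to 2). Constituencies are shown by number. Labelled candidates: BCE.1T, BCE.2T, Hammami, Riahi, H.Hamdi, Marzouki.1T, Marzouki.2T.]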
1. Which statistical method was used for this analysis? State which R package and which R commands were used.
Answer: A correspondence analysis was applied to the constituencies x candidates contingency table, with two supplementary columns, BCE.2T and Marzouki.2T. The R package used is FactoMineR, with the CA command.
2. Place the names of the missing candidates (first and second round) on the plot.
3. How do you interpret the position of the two candidates BCE and Marzouki (first and second round) on the plot? How do you interpret the position of the candidate H. Hamdi on the plot?
Answer:
• We observe 1/ that the positions of BCE and Marzouki change only slightly between the first and second round (on axis 2, BCE moves from 0.31 to 0.29 and Marzouki from -0.45 to -0.41): the governorates did not change "colour" between the two rounds of these elections. 2/ That BCE and Marzouki are opposed on the second axis, which clearly means that there are areas with a clear majority in favour of one candidate or the other.
• Most of the votes for the candidate H. Hamdi come from the SBZ region; he stands apart on axis 1 (coordinate 1.77, against values between -0.19 and -0.07 for the other four candidates listed).
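How a supplementary column such as BCE.2T gets its coordinates can be illustrated in base R. The sketch below runs a correspondence analysis through the singular value decomposition of the standardized residuals, then projects an extra column with the usual transition formula (column profile times row standard coordinates). The 4 x 3 table and the supplementary counts are invented for illustration, and the code does not call FactoMineR, so it shows the textbook computation rather than that package's internals.

```r
# Hypothetical 4 x 3 contingency table (rows: districts, columns: candidates).
N <- matrix(c(30, 10,  5,
              20, 25, 10,
               5, 30, 20,
              10,  5, 40), nrow = 4, byrow = TRUE,
            dimnames = list(paste0("d", 1:4), c("A", "B", "C")))
sup <- c(35, 22, 8, 12)   # hypothetical supplementary column (e.g. second-round votes)

P  <- N / sum(N)          # correspondence matrix
r  <- rowSums(P)          # row masses
cm <- colSums(P)          # column masses

# Standardized residuals and their singular value decomposition.
S   <- diag(1 / sqrt(r)) %*% (P - r %o% cm) %*% diag(1 / sqrt(cm))
dec <- svd(S)
k   <- 1:2                # a 4 x 3 table has at most 2 non-trivial axes

Phi <- diag(1 / sqrt(r)) %*% dec$u[, k]                       # row standard coordinates
G   <- diag(1 / sqrt(cm)) %*% dec$v[, k] %*% diag(dec$d[k])   # column principal coordinates
rownames(G) <- colnames(N)

# Supplementary column: its profile projected onto the row standard coordinates.
g_sup <- (sup / sum(sup)) %*% Phi
print(round(G, 3))
print(round(g_sup, 3))
```

Applying the same projection to an active column gives back its own coordinates, which is why a supplementary second-round column lands close to the first-round column of the same candidate when the vote distribution over districts barely changes.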
