Using external data

Importing data

heart.txt

heart = read.table("donnees/heart.txt", header = TRUE)
head(heart)

  age     sexe type_douleur pression cholester sucre electro taux_max
1  70 masculin            D      130       322     A       C      109
2  67  feminin            C      115       564     A       C      160
3  57 masculin            B      124       261     A       A      141
4  64 masculin            D      128       263     A       A      105
5  74  feminin            B      120       269     A       C      121
6  65 masculin            D      120       177     A       A      140
  angine depression pic vaisseau    coeur
1    non        2.4   2        D presence
2    non        1.6   2        A  absence
3    non        0.3   1        A presence
4    oui        0.2   2        B  absence
5    oui        0.2   1        B  absence
6    non        0.4   1        A  absence
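As a quick check of what `header = TRUE` does, the sketch below reads a small inline sample instead of the real file. The three rows and the reduced set of columns are made up for illustration; only `read.table()` and its `text` and `header` arguments come from base R.

```r
# Hypothetical three-row extract standing in for donnees/heart.txt
sample_txt <- "age sexe coeur
70 masculin presence
67 feminin absence
57 masculin presence"

# header = TRUE uses the first line as column names
heart_demo <- read.table(text = sample_txt, header = TRUE)

str(heart_demo)    # one numeric column, two text columns
names(heart_demo)
```

Writing `TRUE` in full rather than `T` is a little safer, since `T` is an ordinary variable that can be reassigned.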

Detroit_homicide.txt

dh = read.table("donnees/Detroit_homicide.txt", 
                skip = 35, header = TRUE)
head(dh)

     FTP UEMP   MAN    LIC     GR CLEAR     WM  NMAN   GOV   HE     WE
1 260.35 11.0 455.5 178.15 215.98  93.4 558724 538.1 133.9 2.98 117.18
2 269.80  7.0 480.2 156.41 180.48  88.5 538584 547.6 137.6 3.09 134.02
3 272.04  5.2 506.1 198.02 209.57  94.4 519171 562.8 143.6 3.23 141.68
4 272.96  4.3 535.8 222.10 231.67  92.0 500457 591.0 150.3 3.33 147.98
5 272.51  3.5 576.0 301.92 297.65  91.0 482418 626.1 164.3 3.46 159.85
6 261.34  3.2 601.7 391.22 367.62  87.4 465029 659.8 179.5 3.60 157.19
    HOM   ACC    ASR
1  8.60 39.17 306.18
2  8.90 40.27 315.16
3  8.52 45.31 277.53
4  8.89 49.51 234.07
5 13.07 55.05 230.84
6 14.57 53.90 217.99

dim(dh)

[1] 13 14
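The value of `skip` depends on how many lines of free text come before the column names. The following sketch, built on a temporary file with an invented three-line preamble, shows one possible way to compute it rather than count by hand; the file contents and the names `tmp`, `header_line` and `demo` are assumptions for the example.

```r
# Build a small stand-in file: free text, a blank line, then the table
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "Free-text description, line 1",
  "Free-text description, line 2",
  "",
  "FTP UEMP HOM",
  "260.35 11.0 8.60",
  "269.80 7.0 8.90"
), tmp)

# Locate the line holding the column names (here it starts with "FTP ")
lines <- readLines(tmp)
header_line <- grep("^ *FTP ", lines)[1]

# Skip everything before the header line
demo <- read.table(tmp, skip = header_line - 1, header = TRUE)
dim(demo)
```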

hepatitis.TXT

hep = read.table("donnees/hepatitis.TXT",
                 header = TRUE, na.strings = "?")
head(hep)

  AGE    SEX STEROID ANTIVIRALS FATIGUE MALAISE ANOREXIA LIVER_BIG
1  30   male      no         no      no      no       no        no
2  50 female      no         no     yes      no       no        no
3  78 female     yes         no     yes      no       no       yes
4  31 female    <NA>        yes      no      no       no       yes
5  34 female     yes         no      no      no       no       yes
6  34 female     yes         no      no      no       no       yes
  LIVER_FIRM SPLEEN_PALPABLE SPIDERS ASCITES VARICES BILIRUBIN
1         no              no      no      no      no       1.0
2         no              no      no      no      no       0.9
3         no              no      no      no      no       0.7
4         no              no      no      no      no       0.7
5         no              no      no      no      no       1.0
6         no              no      no      no      no       0.9
  ALK_PHOSPHATE SGOT ALBUMIN PROTIME HISTOLOGY Class
1         85.00   18     4.0   61.85        no  LIVE
2        135.00   42     3.5   61.85        no  LIVE
3         96.00   32     4.0   61.85        no  LIVE
4         46.00   52     4.0   80.00        no  LIVE
5        105.33  200     4.0   61.85        no  LIVE
6         95.00   28     4.0   75.00        no  LIVE
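The `na.strings = "?"` argument is what turns the question marks of the file into `NA`. A minimal sketch on an invented inline sample (the rows below are not taken from the real file) makes the effect visible:

```r
# Hypothetical extract in which "?" marks a missing value
txt <- "AGE SEX STEROID
30 male no
31 female ?
34 female yes"

hep_demo <- read.table(text = txt, header = TRUE, na.strings = "?")

# Number of missing values per column
colSums(is.na(hep_demo))
```

Without `na.strings`, the `?` would be kept as an ordinary text value and the missing data would go unnoticed.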

adult

adult = read.table("donnees/adult.data", 
                   sep = ",", na.strings = " ?")
head(adult)

  V1                V2     V3         V4 V5                  V6
1 39         State-gov  77516  Bachelors 13       Never-married
2 50  Self-emp-not-inc  83311  Bachelors 13  Married-civ-spouse
3 38           Private 215646    HS-grad  9            Divorced
4 53           Private 234721       11th  7  Married-civ-spouse
5 28           Private 338409  Bachelors 13  Married-civ-spouse
6 37           Private 284582    Masters 14  Married-civ-spouse
                  V7             V8     V9     V10  V11 V12 V13
1       Adm-clerical  Not-in-family  White    Male 2174   0  40
2    Exec-managerial        Husband  White    Male    0   0  13
3  Handlers-cleaners  Not-in-family  White    Male    0   0  40
4  Handlers-cleaners        Husband  Black    Male    0   0  40
5     Prof-specialty           Wife  Black  Female    0   0  40
6    Exec-managerial           Wife  White  Female    0   0  40
             V14    V15
1  United-States  <=50K
2  United-States  <=50K
3  United-States  <=50K
4  United-States  <=50K
5           Cuba  <=50K
6  United-States  <=50K
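In this file each comma is followed by a space, so the text values keep a leading blank (which is why `na.strings` is `" ?"` and not `"?"`). One possible alternative, sketched here on a made-up three-line sample, is to let `read.table()` strip that white space with `strip.white = TRUE`; the sample and the names `raw` and `clean` are assumptions for the example.

```r
# Hypothetical extract with the same ", " separator as adult.data
csv_txt <- "39, State-gov, Bachelors
50, ?, Bachelors
38, Private, HS-grad"

# As in the text above: the leading blank stays in the values
raw <- read.table(text = csv_txt, sep = ",", na.strings = " ?",
                  stringsAsFactors = FALSE)

# Alternative: strip surrounding blanks, so the NA marker is plain "?"
clean <- read.table(text = csv_txt, sep = ",", strip.white = TRUE,
                    na.strings = "?", stringsAsFactors = FALSE)

raw$V2
clean$V2
```

With the second form, later comparisons such as `V2 == "Private"` work without having to remember the leading blank.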

names(adult)

 [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10" "V11"
[12] "V12" "V13" "V14" "V15"

adult.names = read.table("donnees/adult.names",
                         skip = 96, sep = ":",
                         stringsAsFactors = FALSE)
adult.names

               V1
1             age
2       workclass
3          fnlwgt
4       education
5   education-num
6  marital-status
7      occupation
8    relationship
9            race
10            sex
11   capital-gain
12   capital-loss
13 hours-per-week
14 native-country
   V2
1   continuous.
2   Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
3   continuous.
4   Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
5   continuous.
6   Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
7   Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
8   Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
9   White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
10  Female, Male.
11  continuous.
12  continuous.
13  continuous.
14  United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

adult.names$V1

 [1] "age"            "workclass"      "fnlwgt"         "education"     
 [5] "education-num"  "marital-status" "occupation"     "relationship"  
 [9] "race"           "sex"            "capital-gain"   "capital-loss"  
[13] "hours-per-week" "native-country"

names(adult) = c(adult.names$V1, "class")
head(adult)

  age         workclass fnlwgt  education education-num
1  39         State-gov  77516  Bachelors            13
2  50  Self-emp-not-inc  83311  Bachelors            13
3  38           Private 215646    HS-grad             9
4  53           Private 234721       11th             7
5  28           Private 338409  Bachelors            13
6  37           Private 284582    Masters            14
       marital-status         occupation   relationship   race     sex
1       Never-married       Adm-clerical  Not-in-family  White    Male
2  Married-civ-spouse    Exec-managerial        Husband  White    Male
3            Divorced  Handlers-cleaners  Not-in-family  White    Male
4  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male
5  Married-civ-spouse     Prof-specialty           Wife  Black  Female
6  Married-civ-spouse    Exec-managerial           Wife  White  Female
  capital-gain capital-loss hours-per-week native-country  class
1         2174            0             40  United-States  <=50K
2            0            0             13  United-States  <=50K
3            0            0             40  United-States  <=50K
4            0            0             40  United-States  <=50K
5            0            0             40           Cuba  <=50K
6            0            0             40  United-States  <=50K
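Several of these names contain a hyphen (`education-num`, `marital-status`, ...), so they are not syntactic R names and must be quoted with backticks when used with `$`. As an optional extra step, not part of the original solution, the names could be converted; the sketch below uses a short hypothetical subset of the names.

```r
# A few of the names assigned above (subset chosen for the example)
original <- c("age", "education-num", "marital-status", "class")

# make.names() replaces invalid characters with a dot
syntactic <- make.names(original)

# gsub() gives control over the replacement character
underscored <- gsub("-", "_", original, fixed = TRUE)

syntactic
underscored

# With the original names, backticks are required
demo <- data.frame(39, 13, check.names = FALSE)
names(demo) <- c("age", "education-num")
demo$`education-num`
```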

Further exercises

Go back to the import of the "heart.txt" file (see above) and answer the following questions by extending the code already written.

Create a binary FALSE/TRUE indicator for the presence or absence of a heart problem (last variable)

heart$indicatrice = heart$coeur == "presence"

Create a variable counting how many of the variables type_douleur, sucre, electro and vaisseau are equal to A

heart$nbA = (heart$type_douleur == "A") + 
  (heart$sucre == "A") +
  (heart$electro == "A") +
  (heart$vaisseau == "A")

heart$nbAbis = 
  rowSums(heart[c("type_douleur", "sucre", "electro", "vaisseau")] == "A")
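Both versions rely on the fact that `TRUE` counts as 1 and `FALSE` as 0 in arithmetic. The sketch below checks on a small invented data frame (three rows, values chosen for the example) that the two approaches agree.

```r
# Hypothetical three-row data frame with the four variables of interest
demo <- data.frame(
  type_douleur = c("D", "C", "A"),
  sucre        = c("A", "A", "B"),
  electro      = c("C", "A", "A"),
  vaisseau     = c("D", "A", "A"),
  stringsAsFactors = FALSE
)
cols <- c("type_douleur", "sucre", "electro", "vaisseau")

# Version 1: add the logical vectors one by one
nbA <- (demo$type_douleur == "A") + (demo$sucre == "A") +
  (demo$electro == "A") + (demo$vaisseau == "A")

# Version 2: compare the whole block, then sum each row
nbAbis <- rowSums(demo[cols] == "A")

nbA
nbAbis
all(nbA == nbAbis)
```

The `rowSums()` form scales better when the list of columns grows, and it accepts `na.rm = TRUE` if some values are missing.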

Create a factor variable from the binary indicator built in step 1, labelling TRUE as Présence and FALSE as Absence

heart$ind2 = factor(heart$indicatrice, levels = c(FALSE, TRUE),
                    labels = c("Absence", "Présence"))
head(heart)

  age     sexe type_douleur pression cholester sucre electro taux_max
1  70 masculin            D      130       322     A       C      109
2  67  feminin            C      115       564     A       C      160
3  57 masculin            B      124       261     A       A      141
4  64 masculin            D      128       263     A       A      105
5  74  feminin            B      120       269     A       C      121
6  65 masculin            D      120       177     A       A      140
  angine depression pic vaisseau    coeur indicatrice nbA nbAbis     ind2
1    non        2.4   2        D presence        TRUE   1      1 Présence
2    non        1.6   2        A  absence       FALSE   2      2  Absence
3    non        0.3   1        A presence        TRUE   3      3 Présence
4    oui        0.2   2        B  absence       FALSE   2      2  Absence
5    oui        0.2   1        B  absence       FALSE   1      1  Absence
6    non        0.4   1        A  absence       FALSE   3      3  Absence
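When only `labels` is given, `factor()` pairs the labels with the sorted values actually present in the data. The sketch below, on an invented vector, illustrates why stating `levels = c(FALSE, TRUE)` explicitly can be more robust: if one of the two values happens to be absent, the labels-only call fails.

```r
# Hypothetical indicator in which every individual is TRUE
flag <- c(TRUE, TRUE)

# Labels only: two labels for a single observed value, so this is an error
err <- tryCatch(
  factor(flag, labels = c("Absence", "Présence")),
  error = function(e) e
)
inherits(err, "error")

# Explicit levels: the mapping no longer depends on the data
f <- factor(flag, levels = c(FALSE, TRUE),
            labels = c("Absence", "Présence"))
f
table(f)   # the unused level is kept, with a count of 0
```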

Create a new data.frame containing only the individuals strictly under 60 years old

heart1a = heart[heart$age < 60,]
heart1b = heart[which(heart$age < 60),]
heart1 = subset(heart, age < 60)
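The three forms give the same result when `age` has no missing value. They differ as soon as an `NA` appears, as this sketch on a made-up data frame suggests (the data and the names `a`, `b`, `s` are invented for the example):

```r
# Hypothetical data with one missing age
demo <- data.frame(
  age  = c(70, 57, NA, 45),
  sexe = c("masculin", "feminin", "masculin", "feminin"),
  stringsAsFactors = FALSE
)

a <- demo[demo$age < 60, ]         # the NA comparison yields an all-NA row
b <- demo[which(demo$age < 60), ]  # which() keeps only the TRUE positions
s <- subset(demo, age < 60)        # subset() also drops the NA case

nrow(a)
nrow(b)
nrow(s)
```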

Now, from the previous data frame, create two data frames:

one for the men
another for the women

heart1f = subset(heart1, sexe == "feminin")
heart1m = subset(heart1, sexe == "masculin")
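As a possible alternative to two separate `subset()` calls, `split()` produces one data frame per modality in a single step. The sketch uses a small invented data frame; `par_sexe` is a name chosen for the example.

```r
# Hypothetical data frame of individuals under 60
demo <- data.frame(
  age  = c(57, 45, 52),
  sexe = c("masculin", "feminin", "masculin"),
  stringsAsFactors = FALSE
)

# One list element per value of sexe
par_sexe <- split(demo, demo$sexe)

names(par_sexe)
par_sexe$feminin
par_sexe$masculin
```

This form does not require knowing the modalities in advance, which helps when there are more than two groups.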

Go back to the import of the Detroit_homicide.txt file (see above)

Store the introductory text in an attribute of the data.frame

attributes(dh)

$names
 [1] "FTP"   "UEMP"  "MAN"   "LIC"   "GR"    "CLEAR" "WM"    "NMAN" 
 [9] "GOV"   "HE"    "WE"    "HOM"   "ACC"   "ASR"  

$class
[1] "data.frame"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13

attr(dh, "names")

 [1] "FTP"   "UEMP"  "MAN"   "LIC"   "GR"    "CLEAR" "WM"    "NMAN" 
 [9] "GOV"   "HE"    "WE"    "HOM"   "ACC"   "ASR"

attr(dh, "info") = 
  paste(readLines("donnees/Detroit_homicide.txt", n = 19), collapse = "\n")
cat(attr(dh, "info"))

This is the data set called `DETROIT' in the book `Subset selection in
regression' by Alan J. Miller published in the Chapman & Hall series of
monographs on Statistics & Applied Probability, no. 40.   The data are
unusual in that a subset of three predictors can be found which gives a
very much better fit to the data than the subsets found from the Efroymson
stepwise algorithm, or from forward selection or backward elimination.

The original data were given in appendix A of `Regression analysis and its
application: A data-oriented approach' by Gunst & Mason, Statistics
textbooks and monographs no. 24, Marcel Dekker.   It has caused problems
because some copies of the Gunst & Mason book do not contain all of the data,
and because Miller does not say which variables he used as predictors and
which is the dependent variable.   (HOM was the dependent variable, and the
predictors were FTP ... WE)

The data were collected by J.C. Fisher and used in his paper: "Homicide in
Detroit: The Role of Firearms", Criminology, vol.14, 387-400 (1976)

The data are on the homicide rate in Detroit for the years 1961-1973.
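The same mechanism can be tried on a small temporary file. In the sketch below, the file contents and the attribute value are invented; `attr()`, `readLines()` and `paste()` are used as in the code above.

```r
# Stand-in file: two lines of text, a blank line, then a small table
tmp <- tempfile(fileext = ".txt")
writeLines(c("Intro line one", "Intro line two", "",
             "x y", "1 2", "3 4"), tmp)

demo <- read.table(tmp, skip = 3, header = TRUE)

# Read the first two lines and store them as a single string
attr(demo, "info") <- paste(readLines(tmp, n = 2), collapse = "\n")

cat(attr(demo, "info"), "\n")
names(attributes(demo))   # "info" now sits next to names, class, row.names
```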

Store the variable labels in another attribute, as a two-column data.frame

noms = tail(readLines("donnees/Detroit_homicide.txt", n = 34), 15)
noms = noms[noms != ""]
attr(dh, "info.var") = data.frame(
  var = trimws(substr(noms, 1, 6)),
  descriptif = substr(noms, 10, 100),
  stringsAsFactors = FALSE
)
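The code above assumes a fixed-width layout: the variable name in the first 6 characters and its description from character 10 onwards. The sketch below reproduces that logic on two lines written by hand in that layout; the exact wording of the lines is an assumption, not a copy of the file.

```r
# Hypothetical label lines: name padded to 7 characters, then "- ", then text
noms <- c(
  "FTP    - Full-time police per 100,000 population",
  "",
  "UEMP   - % unemployed in the population"
)

# Drop the empty lines
noms <- noms[noms != ""]

info_var <- data.frame(
  var        = trimws(substr(noms, 1, 6)),   # name, without padding
  descriptif = substr(noms, 10, 100),        # text after the "- " separator
  stringsAsFactors = FALSE
)
info_var
```

If the real file does not follow these column positions exactly, the `substr()` bounds have to be adjusted accordingly.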

Using external data - solutions

Statistical programming with `R` - STID, 2nd year
