pendigits
data - first analysisR
Here are the packages we use in this work.
library(reshape2)
library(ggplot2)
library(FactoMineR)
Download the two following files :
Data description can be found here: each line represent a draw of a digit between 0 and 9, with 8 points (\((x,y)\) coordinates for each point) and the number drawn.
The data files are here easy to read:
","
pen.tra = read.table("donnees/pendigits.tra", sep = ",")
pen.tes = read.table("donnees/pendigits.tes", sep = ",")
Now, to collapse them together, we use the following code
pen = rbind(pen.tra, pen.tes)
At the end of the import process, we name the attributes according to what we know: the two first attributes are \((x_1, y_1)\), the third and the fourth \((x_2, y_2)\), and so on. The last attribute is the written digit. Then, we create the vector "X1" "Y1" "X2" "Y2" ... "X8" "Y8" "digit"
.
names(pen) = c(paste0(c("X", "Y"), rep(1:8, each = 2)), "digit")
class(pen$digit)
## [1] "integer"
digit
variable is here a numeric attribute (integer
more precisely). But, it is a nominal one in reality. So, we transform it into factor
.
pen$digit = factor(pen$digit)
class(pen$digit)
## [1] "factor"
We now have the following value (only 6 first are presented here).
X1 | Y1 | X2 | Y2 | X3 | Y3 | X4 | Y4 | X5 | Y5 | X6 | Y6 | X7 | Y7 | X8 | Y8 | digit |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
47 | 100 | 27 | 81 | 57 | 37 | 26 | 0 | 0 | 23 | 56 | 53 | 100 | 90 | 40 | 98 | 8 |
0 | 89 | 27 | 100 | 42 | 75 | 29 | 45 | 15 | 15 | 37 | 0 | 69 | 2 | 100 | 6 | 2 |
0 | 57 | 31 | 68 | 72 | 90 | 100 | 100 | 76 | 75 | 50 | 51 | 28 | 25 | 16 | 0 | 1 |
0 | 100 | 7 | 92 | 5 | 68 | 19 | 45 | 86 | 34 | 100 | 45 | 74 | 23 | 67 | 0 | 4 |
0 | 67 | 49 | 83 | 100 | 100 | 81 | 80 | 60 | 60 | 40 | 40 | 33 | 20 | 47 | 0 | 1 |
100 | 100 | 88 | 99 | 49 | 74 | 17 | 47 | 0 | 16 | 37 | 0 | 73 | 16 | 20 | 20 | 6 |
Our variable of interest is the digit reprensented. So, we start by describe this attribute.
table(pen$digit)
##
## 0 1 2 3 4 5 6 7 8 9
## 1143 1143 1144 1055 1144 1055 1056 1142 1055 1055
Then, we reprensent the bar chart of the distribution of digits. We add a color by digit, because we use these colors later.
ggplot(pen, aes(digit, fill = digit)) +
geom_bar() +
guides(fill = FALSE)
We present here the mean, the median and the standard deviation of each coordinates.
tab = apply(pen[-17], 2, function(x) {
return (c(Mean = mean(x),
Median = median(x),
StdDev = sd(x)))
})
round(t(tab), 2)
## Mean Median StdDev
## X1 38.81 32.0 34.26
## Y1 85.12 89.0 16.22
## X2 40.61 40.0 26.34
## Y2 83.77 91.0 19.16
## X3 49.77 53.0 34.10
## Y3 65.57 71.0 27.00
## X4 51.22 53.5 30.58
## Y4 44.50 43.0 29.91
## X5 56.87 60.0 34.14
## Y5 33.70 33.0 27.25
## X6 60.52 73.0 37.29
## Y6 34.83 30.0 27.12
## X7 55.02 53.0 22.34
## Y7 34.94 27.0 33.16
## X8 47.29 40.0 41.76
## Y8 28.85 9.0 35.78
We also represent the distribution of each coordinate by a boxplot. For that, we can do a plot for each one.
for (j in 1:16)
boxplot(pen[,j])
But, at the end, it produces 16 graphics, hard to compare. So we can therefore reprensent in one graphic all \(x_j\) distribution, and in other one all \(y_j\) distribution. This operation can be done with the melt()
function of the package reshape2
, which transform the form of the data. Here the step by step procedure for \(x\) coordinates
pen.X = pen[seq(1, 15, 2)]
names(pen.X)
## [1] "X1" "X2" "X3" "X4" "X5" "X6" "X7" "X8"
pen.X.melt = melt(pen.X)
## No id variables; using all as measure variables
knitr::kable(head(pen.X.melt))
variable | value |
---|---|
X1 | 47 |
X1 | 0 |
X1 | 0 |
X1 | 0 |
X1 | 0 |
X1 | 100 |
And the boxplots
ggplot(pen.X.melt, aes(variable, value)) +
geom_boxplot()
Now, in one step for \(y\) coordinates
ggplot(melt(pen[seq(2, 16, 2)]), aes(variable, value)) + geom_boxplot()
## No id variables; using all as measure variables
digit
First, we present the mean value of each coordinate for each digit. We can do this step with apply()
(and tapply()
) or with aggregate()
.
apply(pen[-17], 2, tapply, pen$digit, mean)
agg = aggregate(. ~ digit, pen, mean)
knitr::kable(agg, digits = 2)
digit | X1 | Y1 | X2 | Y2 | X3 | Y3 | X4 | Y4 | X5 | Y5 | X6 | Y6 | X7 | Y7 | X8 | Y8 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 35.37 | 86.06 | 11.58 | 58.31 | 14.94 | 19.60 | 51.17 | 7.29 | 85.94 | 31.30 | 89.29 | 68.49 | 59.01 | 89.31 | 22.10 | 75.24 |
1 | 14.70 | 61.39 | 44.35 | 77.94 | 69.86 | 89.51 | 77.50 | 79.80 | 67.64 | 54.06 | 47.80 | 32.66 | 44.60 | 16.16 | 59.91 | 1.38 |
2 | 18.39 | 76.95 | 42.13 | 99.39 | 67.46 | 79.76 | 51.28 | 46.05 | 19.83 | 19.38 | 11.64 | 9.09 | 53.06 | 5.25 | 98.71 | 4.17 |
3 | 24.78 | 84.06 | 56.66 | 99.52 | 86.64 | 84.69 | 64.53 | 60.59 | 82.13 | 43.22 | 90.88 | 17.26 | 50.01 | 2.28 | 3.47 | 6.24 |
4 | 42.96 | 99.54 | 22.13 | 79.38 | 5.75 | 51.16 | 42.83 | 40.47 | 85.10 | 49.56 | 86.30 | 59.72 | 70.99 | 31.45 | 62.60 | 0.00 |
5 | 41.24 | 90.94 | 42.60 | 75.83 | 57.31 | 59.18 | 36.46 | 29.36 | 26.18 | 33.15 | 37.64 | 50.24 | 42.83 | 57.69 | 59.46 | 60.31 |
6 | 87.52 | 98.72 | 51.75 | 86.72 | 20.71 | 58.48 | 6.94 | 26.93 | 32.61 | 3.14 | 81.11 | 11.02 | 61.57 | 30.54 | 11.00 | 23.35 |
7 | 3.50 | 91.01 | 45.37 | 98.25 | 78.85 | 80.76 | 71.27 | 47.47 | 52.73 | 14.93 | 33.60 | 18.47 | 39.51 | 33.80 | 81.14 | 34.31 |
8 | 56.95 | 82.08 | 39.83 | 79.62 | 51.81 | 51.93 | 50.56 | 24.22 | 35.25 | 17.07 | 39.93 | 36.90 | 67.78 | 68.49 | 49.00 | 81.40 |
9 | 69.26 | 81.32 | 52.79 | 83.26 | 45.45 | 81.28 | 56.57 | 82.96 | 79.06 | 71.09 | 89.78 | 43.23 | 61.48 | 14.34 | 18.15 | 4.54 |
For graphical representation, we have to produce a lot of boxplots. The most interesting ones in our case is a set of boxplot of \(x_j\) coordinate for each digit (8 sets of 10 boxplots), and the same for each \(y_j\). This can be done due to the melt()
function.
pen.melt = melt(pen, id.vars = 17)
head(pen.melt)
## digit variable value
## 1 8 X1 47
## 2 2 X1 0
## 3 1 X1 0
## 4 4 X1 0
## 5 1 X1 0
## 6 6 X1 100
And now the production of graphic.
ggplot(pen.melt, aes(digit, value, color = digit)) +
geom_boxplot() +
facet_wrap(~ variable)
To go further, we can try to do a heatmap of \((x_j,y_j)\) coordinates for each digit. But, it produces 8 heatmaps per digit. 80 graphics can be very too many to be usefull in this way of working. We will see later how we can take advantage of interactive possibility in R
in data analysis.
We want to represent the first "0"
drawn. This step is interesting for further analysis. We select the first 0 digit, and extract \(x\) and \(y\) coordinates. After we plot these with a line, and add the number of the point.
first.0 = subset(pen, digit == 0)[1,]
x = unlist(first.0[seq(1,15, by = 2)])
y = unlist(first.0[seq(2,16, by = 2)])
par(mar = c(0, 0, 0, 0) + .1)
plot(x, y, type = "b", pch = " ", axes = FALSE, col = "gray70")
text(x, y, 1:8)
Now, we have the basis of a function that can reproduc each digit. We want the following constraints for this one :
v
: a vector with the 16 values of each point (if it is a data.frame
, transform it as simple vector)n
: a numeric value (with NULL
value by default), denoting the number to drawn (possibly not pass)point
: a boolean vector (with FALSE
value by default), indicating if the point number has to be added to the graphadd
: a boolean vector (with FALSE
value by default), indicating if the plot is to added to the previous onecolor
: a specific color (with "black"
value by default)Here is the final function.
drawn <- function(v, n = NULL, point = FALSE, add = FALSE, color = "black") {
# Transformation of the data.frame if needed
if (is.data.frame(v))
v = unlist(v)
# extract x and y coordinates
x = v[seq(1, 15, by = 2)]
y = v[seq(2, 16, by = 2)]
# optimize space into graphics in reducing margin (sse ?par for more information)
opar = par(mar = c(0, 0, 2, 0) + .1)
if (!add) { # Create a graphic
plot(x, y,
# Specify limits is a way to have always the same frame for plotting lines
xlim = c(-5, 105), ylim = c(-5, 105),
# Do not show axes
axes = FALSE,
# If point is TRUE, we add a space (with pch = " ") at each point
# If not, draw a line
type = ifelse(point, "b", "l"),
pch = " ",
# Specify color (black by default)
col = color,
# Add a title (NULL by default)
main = n)
if (point) text(x, y, 1:8)
} else { # Add line to the plot
# lines() add lines to an existing plot (the last produce)
lines(x, y,
# same comment than before
xlim = c(-5, 105), ylim = c(-5, 105),
type = ifelse(point, "b", "l"),
pch = " ",
col = color)
}
par(opar)
}
Now, here some use of this function
drawn(first.0)
drawn(first.0, n = 0)
drawn(first.0, point = TRUE)
drawn(first.0)
first.1 = subset(pen, digit == 1)[1,]
drawn(first.1, add = TRUE, col = "red")
For our curiosity, we can now drawn each first digit of the data. The mfrow
parameter of the par()
function permits us to separate the graphical window into 10 cells (with 2 rows and 5 columns).
par(mfrow = c(2, 5))
for (i in 0:9) {
s = subset(pen, digit == i)[1,]
drawn(s[-16], n = i)
}
From the previous section, we now want to represent the average digit, i.e. a line produce by the average value for each coordinate. To see if there are some area that are not use for a particular digit, we represent also all the drawn in a very light gray.
par(mfrow = c(2, 5))
for (i in 0:9) {
s = subset(pen, digit == i)
# Computing the average values
mean = apply(s[-17], 2, mean)
# Drawn the first drawn
drawn(s[1,1:16], col = "gray90", n = i)
# Add all other drawn
for (j in 2:nrow(s))
drawn(s[j,1:16], col = "gray90", add = TRUE)
# Last, add the average line
drawn(mean[-17], add = TRUE)
}
A good way to represent many variables in one plot is the Principal Component Analysis. Then, we perform PCA on the data, and look at the quality of the projection on the first components.
res = PCA(pen, quali.sup = 17, graph = FALSE)
eig = data.frame(comp = 1:16,
setNames(res$eig, c("eigenvalue", "percent", "cum.percent")))
ggplot(eig) +
geom_bar(aes(comp, percent), stat = "identity") +
stat_summary(aes(comp, cum.percent, group = 1),
fun.y = sum, geom = "line")
Now, we plot the first plan with a specific color for each digit. For that, we have to add the digit
variable to the coordinates data.frame
produce by PCA()
.
res2 = data.frame(res$ind$coord, digit = pen$digit)
ggplot(res2, aes(Dim.1, Dim.2, color = digit)) +
geom_point()
The above plot is hard to understand. So, we choose to represent each digit separately, to have a better idea on how digit are spread (or not) in the PCA projection.
ggplot(res2, aes(Dim.1, Dim.2, color = digit)) +
geom_point() +
facet_wrap(~ digit) +
guides(color = FALSE)
We conclude here that we surely need to clusters data separatly for each digit, to detect if there really are different ways to write a digit (and how).
ggplot()
We can also represent digit drawn with the ggplot()
function. We have to use the melt()
function, and to be careful on how we do this.
f0 = data.frame(x = unlist(first.0[seq(1, 15, 2)]),
y = unlist(first.0[seq(2, 16, 2)]))
ggplot(f0, aes(x, y)) +
geom_path() +
theme_void()
pen$id = 1:nrow(pen)
pen.melt = melt(pen, id.vars = c("id", "digit"))
pen.melt$coord = substr(pen.melt$variable, 1, 1)
pen.melt$point = substr(pen.melt$variable, 2, 2)
pen.melt.X = subset(pen.melt, coord == "X", -c(variable, coord))
names(pen.melt.X) = c("id", "digit", "x", "point")
pen.melt.Y = subset(pen.melt, coord == "Y", -c(variable, coord))
names(pen.melt.Y) = c("id", "digit", "y", "point")
pen.merge = merge(pen.melt.X, pen.melt.Y)
agg.melt = melt(agg, id.vars = "digit")
agg.melt$coord = substr(agg.melt$variable, 1, 1)
agg.melt$point = substr(agg.melt$variable, 2, 2)
agg.melt.X = subset(agg.melt, coord == "X", -c(variable, coord))
names(agg.melt.X) = c("digit", "x", "point")
agg.melt.Y = subset(agg.melt, coord == "Y", -c(variable, coord))
names(agg.melt.Y) = c("digit", "y", "point")
agg.merge = merge(agg.melt.X, agg.melt.Y)
ggplot(pen.merge, aes(x, y)) +
geom_path(aes(group = id), color = "gray90") +
facet_wrap(~ digit) +
geom_path(data = agg.merge) +
theme_void()