Chapter 4 Data Structures and Types
There is no scalar in R.
There is no scalar in R.
There is no scalar in R.
Every structure is a vector. A (seemingly) scalar value is in fact a vector of length 1. This is a striking concept compared with other general-purpose languages or statistical packages. In practice it won't much affect how one writes R code, but understanding the concept helps one think in the R way.
Compared with other general-purpose languages, R's built-in data structures are not rich, but they are considerably flexible. Most analytics tasks can be accomplished with this minimal set of structures.
All structures are vectors in R. And there are two types of vector: atomic and recursive. This chapter will quickly go through both of them.
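The claim that a scalar is just a vector of length 1 can be checked directly. The following is a minimal sketch; the variable name x is only for illustration.

```r
# A "scalar" is a vector of length 1.
x <- 42

is.vector(x)         # TRUE
length(x)            # 1
identical(x, c(42))  # TRUE: wrapping one value in c() changes nothing
x[1]                 # indexing works as on any other vector
```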
4.1 Atomic Vector
Common types of atomic vector include: numeric, character, logical, and factor.
The atomic vector is the fundamental data structure in R. It is called atomic because it can only contain data of a single type and nesting is not allowed. For example:
1
## [1] 1
The number 1 is a vector (of length 1, so it looks like a scalar, but it is indeed a vector). To check the type of a vector, one can use the function typeof:
typeof(1)
## [1] "double"
The result of typeof reveals that the storage mode of the vector 1 is double; hence the vector is a numeric vector. There are many built-in functions for creating and manipulating numeric vectors. Some examples follow:
1:5
## [1] 1 2 3 4 5
seq(5, 1, -1)
## [1] 5 4 3 2 1
c(1, 1, 2, 3, 5)
## [1] 1 1 2 3 5
rep(777, 3)
## [1] 777 777 777
One can also use integer, numeric, character, and logical as creation functions to create vectors of the corresponding type, with the argument being the desired length of the created vector:
integer(3) # default value is 0L
## [1] 0 0 0
numeric(3) # default value is 0
## [1] 0 0 0
character(3) # default value is "" (empty string)
## [1] "" "" ""
logical(3) # default value is FALSE
## [1] FALSE FALSE FALSE
Normally a number is of type double. To force R to use the integer type (which uses less storage space than a double), just append L to the number:
typeof(5L)
## [1] "integer"
Since an atomic vector is atomic, it has no nested structure:
c(1:5, seq(5, 1, -1), c(1, 1, 2, 3, 5))
## [1] 1 2 3 4 5 5 4 3 2 1 1 1 2 3 5
The result of concatenating multiple atomic vectors is one single atomic vector, not a vector nested with three different vectors. Does this mean that atomic vectors of different types cannot be combined? Not necessarily.
num_vec <- 1:3
str_vec <- c('a', 'b', 'c')
c(num_vec, str_vec)
## [1] "1" "2" "3" "a" "b" "c"
When a numeric vector and a character vector are combined, the resulting atomic vector is of type character. Such behavior is called type coercion and is a very important concept in most programming languages. Coercion always occurs when atomic vectors of different types are combined. The principle of coercion is to minimize information loss: the result takes the most flexible type among the inputs, in the order logical, integer, double, character.
For example, when a logical vector is combined with a numeric one:
num_vec <- 1:3 # here one uses "<-" to assign a value to an object
bol_vec <- c(TRUE, FALSE, FALSE) # or c(T, F, F) but not recommended
typeof(c(num_vec, bol_vec))
## [1] "integer"
The logicals are coerced into integer.
More examples on coercion:
typeof(c(1, 2L))
## [1] "double"
typeof(c(1L, TRUE))
## [1] "integer"
typeof(c(1.0, TRUE))
## [1] "double"
typeof(c(1.0, TRUE, "foo"))
## [1] "character"
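Coercion of logicals to numbers is often put to practical use. The sketch below, with illustrative variable names, counts TRUE values by summing them and shows that coercion to character turns every element into text.

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)

# In arithmetic, TRUE counts as 1 and FALSE as 0, so sum() counts
# the TRUEs and mean() gives their proportion.
n_true <- sum(flags)       # 3
share_true <- mean(flags)  # 0.75

# Once a character value is involved, every element becomes text.
mixed <- c(1.5, TRUE, "foo")
mixed                      # "1.5" "TRUE" "foo"
```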
In addition to implicit coercion, one can also explicitly convert one type to another. This is done with the as family of functions (as.integer, as.logical, and so on). For example, to convert integers into logicals and vice versa:
as.logical(-1:2L)
## [1] TRUE FALSE TRUE TRUE
as.integer(c(TRUE, FALSE, FALSE))
## [1] 1 0 0
Notice that when casting integers to logicals, all non-zero values are converted to TRUE; only 0s are converted to FALSE.
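Explicit conversion can lose information or fail, so it is worth knowing a few edge cases. The following sketch uses only base R functions; the variable names are illustrative.

```r
# as.integer() truncates toward zero rather than rounding.
ints <- as.integer(c(3.9, -3.9))   # 3 -3

# Text that looks like a number converts cleanly ...
num <- as.numeric("3.14")          # 3.14

# ... but text that does not becomes NA (with a warning,
# silenced here via suppressWarnings()).
bad <- suppressWarnings(as.numeric("abc"))
is.na(bad)                         # TRUE

# Only certain strings are recognized as logicals.
lgl <- as.logical(c("TRUE", "false", "T", "yes"))  # TRUE FALSE TRUE NA
```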
4.1.1 General Operations on Vectors
4.1.1.1 Subsetting
Vectors can be subset by using the bracket syntax.
vv <- 10:1
vv[1:2]
## [1] 10 9
vv[c(1, 3, 5)]
## [1] 10 8 6
vv[rep(3, 5)]
## [1] 8 8 8 8 8
The bracket accepts another vector as a selection vector. If the selection vector is of type numeric, elements are selected by numerical index (starting from 1). Any non-integer index is truncated toward zero, so the result of c(1:10)[c(1.1, 1.9)] is 1, 1.
Since the bracket accepts vectors, one can also use a logical vector for subsetting:
vv <- 3:1
vv[c(TRUE, FALSE, FALSE)]
## [1] 3
Because subsetting is done by another vector, filtering becomes very easy:
vv <- 1:10
vv[vv > 5]
## [1] 6 7 8 9 10
Since all structures are vectors, operators such as > naturally work on vectors. The result of vv > 5 is an element-wise comparison and hence also a vector of the same length:
vv <- 1:10
vv > 5
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
The resulting logical vector can then be used to subset the original vector.
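Logical selection extends naturally to compound conditions. The sketch below shows a few common idioms built on the same idea; which is a base R function that returns the positions of the TRUE elements.

```r
vv <- 1:10

# Conditions combine element-wise with & (and), | (or), and ! (not).
evens_above_5 <- vv[vv > 5 & vv %% 2 == 0]   # 6 8 10

# which() turns a logical vector into the positions of its TRUEs.
positions <- which(vv > 5)                    # 6 7 8 9 10

# A shorter logical selection vector is recycled along the original.
odd_positions <- vv[c(TRUE, FALSE)]           # 1 3 5 7 9
```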
One can also use negative selection:
vv <- 1:3
vv[-3]
## [1] 1 2
vv[-length(vv)]
## [1] 1 2
vv[-(1:3)]
## integer(0)
Subsetting a vector may result in missing values:
vv <- 1:3
vv[4]
## [1] NA
Here the selection vector tries to extract the 4th element of the original vector, but there is none. This does not cause an error in R; instead, the resulting vector contains NA, which generally means a missing value.
4.1.1.2 Updating
Vectors can be updated by using the bracket syntax as well.
vv <- 1:5
vv[1:3] <- 0
vv
## [1] 0 0 0 4 5
One interesting question arises: what happens if one tries to update an element that does not exist in the vector?
vv <- 1:5
vv[7] <- 0
vv
## [1] 1 2 3 4 5 NA 0
Two things should be noticed. First, the value is assigned as if the element were newly created. Second, the vector is padded with NAs to reach the new length required by the update.
How about deletion? There is no delete method in R. To effectively delete an element from a vector, use negative selection and assign the result back to the object:
# to delete the 5th element:
vv <- 1:5
(vv <- vv[-5]) # use parentheses to force print
## [1] 1 2 3 4
The bracket syntax is in fact a function call. (Type ?"[" to see the documentation.) To understand more about what's going on behind the scenes, one should refer to sections 7.5 and 7.6.
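That the bracket is a function can be seen by calling it by name, quoted in backticks. The following is a small sketch under that reading; the last line shows one practical consequence, passing the bracket function to sapply.

```r
vv <- 10:1

# Subsetting is a call to the function named `[`.
same <- identical(vv[1:2], `[`(vv, 1:2))   # TRUE

# vv[1:3] <- 0L is, roughly, shorthand for rebinding vv
# to the result of the replacement function `[<-`.
vv <- `[<-`(vv, 1:3, value = 0L)
vv                                          # 0 0 0 7 6 5 4 3 2 1

# Because `[` is an ordinary function, it can be passed around:
# here it picks the 2nd element of each vector in a list.
seconds <- sapply(list(1:3, 4:6), `[`, 2)   # 2 5
```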
4.1.1.3 Naming
Vectors can be named. To check the names of a vector, use the function names. To name or rename a vector, just assign a character vector of names to the names function call.
vv <- 1:3
names(vv)
## NULL
names(vv) <- c("foo", "bar", "baz")
str(vv)
## Named int [1:3] 1 2 3
## - attr(*, "names")= chr [1:3] "foo" "bar" "baz"
If one is confused why the renaming syntax actually works, see section 7.6 for more details.
Vectors can be partially named:
vv <- 1:3
names(vv) <- c("foo", "bar")
names(vv)
## [1] "foo" "bar" NA
Since the third element is unnamed, its name shows up as NA when calling names to print the names.
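Names are useful mainly because they can be used for selection. The sketch below assumes a hypothetical named vector called scores.

```r
# Names can be given at creation time.
scores <- c(foo = 90, bar = 75, baz = 60)

# A character selection vector picks elements by name;
# the result follows the order of the selection vector.
picked <- scores[c("baz", "foo")]
names(picked)            # "baz" "foo"

# unname() drops the names and leaves the values.
unname(picked)           # 60 90

# Assigning NULL removes all names.
names(scores) <- NULL
names(scores)            # NULL
```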
4.1.2 Factor
Factors are special integer vectors. They are generally used to record categorical variables. They combine features of both character and numeric vectors, so they can be confusing for new users. Usually a factor is defined on a set of character strings:
baz <- c("foo", "bar")
bazf <- factor(baz)
typeof(bazf)
## [1] "integer"
str(bazf)
## Factor w/ 2 levels "bar","foo": 2 1
Notice that typeof indicates an integer type. This is because factors are stored internally as integers. The levels of a factor reveal the complete set of strings that could appear in the factor. Now consider a slightly more complicated example:
baz <- c("foo", "bar", "bar", "foo", "bar")
bazf <- factor(baz)
levels(bazf)
## [1] "bar" "foo"
When an object contains repeated character values, it may be a good choice to represent it as a factor for performance and storage efficiency, since levels only contain the distinct values.
levels can be manually specified when creating a factor, and levels are ordered. The order is determined automatically (by sorting the distinct values) if not specified.
baz <- c("foo", "bar", "bar", "foo", "bar")
bazf1 <- factor(baz, levels=c("bar", "foo"))
bazf2 <- factor(baz, levels=c("foo", "bar"))
str(bazf1)
## Factor w/ 2 levels "bar","foo": 2 1 1 2 1
str(bazf2)
## Factor w/ 2 levels "foo","bar": 1 2 2 1 2
levels can even contain unseen values:
baz <- c("foo", "bar", "bar", "foo", "bar")
bazf3 <- factor(baz, levels=c("foo", "bar", "baz"))
str(bazf3)
## Factor w/ 3 levels "foo","bar","baz": 1 2 2 1 2
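The dual nature of a factor (integer codes plus character levels) can be inspected by converting it both ways. The sketch below also shows a common pitfall with number-like strings; numf is an illustrative name.

```r
baz <- c("foo", "bar", "bar", "foo", "bar")
bazf3 <- factor(baz, levels = c("foo", "bar", "baz"))

codes <- as.integer(bazf3)   # the stored integer codes: 1 2 2 1 2
as.character(bazf3)          # back to the original strings
counts <- table(bazf3)       # counts per level; unseen "baz" has count 0

# Pitfall: for a factor built from number-like strings,
# as.integer() gives the codes, not the numbers.
numf <- factor(c("10", "20", "10"))
as.integer(numf)                  # 1 2 1
as.numeric(as.character(numf))    # 10 20 10
```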
4.1.3 Matrix
A matrix in R is nothing more than a vector with the dim attribute. That is, a matrix is still a vector. To create a matrix, one can use the matrix function:
matrix(0, 3, 3)
## [,1] [,2] [,3]
## [1,] 0 0 0
## [2,] 0 0 0
## [3,] 0 0 0
mm <- matrix(1:9, 3, 3)
str(mm) # use str to inspect any object whose structure you are unsure of
## int [1:3, 1:3] 1 2 3 4 5 6 7 8 9
attributes(mm)
## $dim
## [1] 3 3
For now, just forget about what an “attribute” is. One can always use the dim function to check the dimension attribute of a variable:
vv <- 1:9
dim(vv)
## NULL
A matrix can also be created from an existing vector:
vv <- 1:9
dim(vv) <- c(3,3)
vv
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
Here the second line creates (or replaces) the dim attribute with a 3-by-3 dimension setup.
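Because a matrix is a vector filled column by column, it can be indexed either by row and column or by plain vector position. A minimal sketch:

```r
mm <- matrix(1:9, 3, 3)

mm[2, 3]      # row 2, column 3: 8
mm[, 1]       # the whole first column: 1 2 3
mm[4]         # plain vector indexing still works: 4
length(mm)    # 9, the length of the underlying vector

# byrow = TRUE fills the matrix row by row instead.
first_row <- matrix(1:9, 3, 3, byrow = TRUE)[1, ]   # 1 2 3
```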
4.1.3.1 Digression on attributes
This section is for those who are still curious about “attribute”. One can use the attributes function to query all available attributes on an object:
myobject <- 1:6
attributes(myobject)
## NULL
As one can see, a plain numeric vector has no attributes at all. Arbitrary attributes can be set by using the attr function:
attr(myobject, "myattr") <- "so be it"
attr(myobject, "myattr") # attr can also be used to query specific attribute by name
## [1] "so be it"
attributes(myobject) # unlike attr, query all attributes at once
## $myattr
## [1] "so be it"
str(myobject)
## atomic [1:6] 1 2 3 4 5 6
## - attr(*, "myattr")= chr "so be it"
This is the OO (Object-Oriented) side of the R language. R is a hybrid of OO and functional programming. The details of the OO systems in R are beyond the scope of this chapter and will not be elaborated any further. Here one should just remember that in R everything is an object, and every object can have attributes.
Now let’s consider giving the object an attribute named dim:
attr(myobject, "dim") <- c(2, 3)
str(myobject)
## int [1:2, 1:3] 1 2 3 4 5 6
## - attr(*, "myattr")= chr "so be it"
class(myobject)
## [1] "matrix"
myobject
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
## attr(,"myattr")
## [1] "so be it"
Clearly the object becomes a matrix! Here the call to attr on dim is effectively the same as using dim<- in the previous example. One should now understand that every object can have any sort of attributes, but some attributes are more special than others. dim is one of these. An atomic vector with a dim attribute becomes an object of class matrix.
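The relationship works in both directions: adding dim makes a matrix, and removing it gives back a plain vector. The sketch below uses is.matrix for the check and the base function structure to set attributes at creation time.

```r
x <- 1:6
dim(x) <- c(2, 3)
was_matrix <- is.matrix(x)    # TRUE

# Removing the dim attribute turns it back into a plain vector.
dim(x) <- NULL
is.matrix(x)                  # FALSE
identical(x, 1:6)             # TRUE

# structure() returns an object with the given attributes set.
y <- structure(1:6, dim = c(3, 2))
dim(y)                        # 3 2
```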
4.1.3.2 Matrix Operations
Matrices are mathematical matrices, so they can readily perform linear algebra:
(vv <- matrix(1:9, 3, 3))
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
# scalar multiplication
vv * 3
## [,1] [,2] [,3]
## [1,] 3 12 21
## [2,] 6 15 24
## [3,] 9 18 27
# matrix product
vv %*% matrix(1:6, 3, 2)
## [,1] [,2]
## [1,] 30 66
## [2,] 36 81
## [3,] 42 96
# outer products
1:9 %o% 1:9 # a more general version: see ?outer
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,] 1 2 3 4 5 6 7 8 9
## [2,] 2 4 6 8 10 12 14 16 18
## [3,] 3 6 9 12 15 18 21 24 27
## [4,] 4 8 12 16 20 24 28 32 36
## [5,] 5 10 15 20 25 30 35 40 45
## [6,] 6 12 18 24 30 36 42 48 54
## [7,] 7 14 21 28 35 42 49 56 63
## [8,] 8 16 24 32 40 48 56 64 72
## [9,] 9 18 27 36 45 54 63 72 81
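A few more linear algebra operations are available in base R. The following sketch uses a small invertible matrix A chosen for illustration; note the difference between the element-wise * and the matrix product %*%.

```r
A <- matrix(c(2, 1, 1, 3), 2, 2)   # columns are (2, 1) and (1, 3)

A * A          # element-wise product: 4 1 / 1 9
AA <- A %*% A  # matrix product:       5 5 / 5 10
t(A)           # transpose (A is symmetric, so unchanged here)

# solve(A) returns the inverse; A %*% A_inv is the identity
# matrix up to floating-point rounding.
A_inv <- solve(A)

# solve(A, b) solves the linear system A x = b.
x <- solve(A, c(1, 2))             # 0.2 0.6
```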
4.1.4 Array
When there are more than 2 dimensions, a vector may be called an array. One can also say that a matrix is a 2-dimensional array.
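As a small sketch, a 3-dimensional array can be created with the base function array, which fills the values in column-major order just as matrix does. The name arr is illustrative.

```r
# 24 values arranged as 2 rows x 3 columns x 4 layers.
arr <- array(1:24, dim = c(2, 3, 4))

dim(arr)          # 2 3 4
arr[1, 2, 3]      # row 1, column 2, layer 3: 15
layer1 <- arr[, , 1]
dim(layer1)       # 2 3: a single layer is an ordinary matrix
```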
4.2 Recursive Vector
4.2.1 List
The most important recursive vector is the list. A list, unlike an atomic vector, can contain elements of different types and can be nested:
alist <- list(1:3, c("foo", "bar"), paste, list(letters, month.abb))
str(alist)
## List of 4
## $ : int [1:3] 1 2 3
## $ : chr [1:2] "foo" "bar"
## $ :function (..., sep = " ", collapse = NULL)
## $ :List of 2
## ..$ : chr [1:26] "a" "b" "c" "d" ...
## ..$ : chr [1:12] "Jan" "Feb" "Mar" "Apr" ...
All the operations applicable to atomic vectors are also applicable to lists. A list can also be named.
# slicing
alist[1:3]
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] "foo" "bar"
##
## [[3]]
## function (..., sep = " ", collapse = NULL)
## .Internal(paste(list(...), sep, collapse))
## <bytecode: 0x7fc0a4091ca0>
## <environment: namespace:base>
# replacement
alist[1] <- 1
# delete
alist[1] <- NULL
# concatenate
c(alist, list(1:3))
## [[1]]
## [1] "foo" "bar"
##
## [[2]]
## function (..., sep = " ", collapse = NULL)
## .Internal(paste(list(...), sep, collapse))
## <bytecode: 0x7fc0a4091ca0>
## <environment: namespace:base>
##
## [[3]]
## [[3]][[1]]
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
##
## [[3]][[2]]
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"
##
##
## [[4]]
## [1] 1 2 3
There is a small difference in indexing. The single bracket [ results in a list; the double bracket [[ results in the element itself, without wrapping it in a list. To be clear, see the following example:
alist <- list(1:3, c("foo", "bar"), paste, list(letters, month.abb))
alist[1]
## [[1]]
## [1] 1 2 3
alist[[1]]
## [1] 1 2 3
identical(unlist(alist[1]), alist[[1]])
## [1] TRUE
This also means that [ is used to slice a list (into another list), while [[ is used to extract a single element from a list. Both support character vectors for selecting by name:
alist <- list(integers=1:3, foobar=c("foo", "bar"), pf=paste, another_list=list(letters, month.abb))
names(alist)
## [1] "integers" "foobar" "pf" "another_list"
alist[c("foobar")] # a list (of length 1)
## $foobar
## [1] "foo" "bar"
alist[["foobar"]] # a character vector
## [1] "foo" "bar"
There is another special operator for lists: the $ extractor. Like [[, it extracts a single element from a list by name, but the name is written as a bare symbol rather than a character string:
identical(alist$foobar, alist[["foobar"]])
## [1] TRUE
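One practical consequence of $ taking a bare symbol is that it cannot be used when the name is stored in a variable; [[ must be used instead. The sketch below uses a reduced version of the list above, with the hypothetical variable key holding the name.

```r
alist <- list(integers = 1:3, foobar = c("foo", "bar"),
              another_list = list(letters, month.abb))

# The name is stored in a variable: [[ evaluates it, $ does not.
key <- "foobar"
by_value <- alist[[key]]   # "foo" "bar"
by_symbol <- alist$key     # NULL: no element is literally named "key"

# Extractions can be chained for nested lists.
first_month <- alist$another_list[[2]][1]   # "Jan"
```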
4.2.2 Data Frame
One important extension of list is data.frame, a structure to represent tabular data. A data.frame is indeed a list, with additional attributes.
DF <- data.frame(a=1:3, b=letters[1:3])
str(DF)
## 'data.frame': 3 obs. of 2 variables:
## $ a: int 1 2 3
## $ b: Factor w/ 3 levels "a","b","c": 1 2 3
class(DF) # the abstract type (in OOP sense)
## [1] "data.frame"
typeof(DF) # the storage type
## [1] "list"
A data.frame can be thought of as a list whose elements all share exactly the same length. Useful operations on a data.frame include:
- sort: using order
- join: using merge
- append: using rbind or cbind
- aggregate: using by, aggregate, or ave
- pivot: using reshape
- partition: using split
Each operation above deserves considerable space to elaborate, but since there is an overall better solution for modeling tabular data in R, the data.table package, the details are left for readers to explore on their own.
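For orientation, here is a minimal base R sketch of several of the operations listed above. The data frames sales and people and their columns are made up for illustration; stringsAsFactors = FALSE is passed explicitly so that the result does not depend on the R version's default.

```r
sales <- data.frame(id = c(3, 1, 2, 1),
                    amount = c(30, 10, 20, 15))
people <- data.frame(id = c(1, 2, 3),
                     name = c("ann", "bob", "cat"),
                     stringsAsFactors = FALSE)

# sort: order() returns the row permutation that sorts a column
sorted <- sales[order(sales$amount), ]

# join: merge() matches rows on the shared "id" column
joined <- merge(sales, people, by = "id")

# append: rbind() stacks rows with matching column names
more <- rbind(sales, data.frame(id = 2, amount = 5))

# aggregate: total amount per id
totals <- aggregate(amount ~ id, data = sales, FUN = sum)

# partition: split() returns a list of data frames, one per id
parts <- split(sales, sales$id)
```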
4.3 Special Values
4.3.1 NA
NA stands for “not available” and usually means a missing value. NA is more complicated than it first seems. There are actually different types of NA:
typeof(NA)
## [1] "logical"
typeof(NA_character_)
## [1] "character"
typeof(NA_integer_)
## [1] "integer"
typeof(NA_real_)
## [1] "double"
Most of the time one simply uses NA and lets R deal with the casting:
c("foo", "bar", NA) # here NA is coerced to character
## [1] "foo" "bar" NA
c(1:3, NA) # here NA is typed integer
## [1] 1 2 3 NA
Many operations can result in missing values. For example, subsetting with an out-of-range index:
letters[26:27] # there is no 27th letter in the English alphabet!
## [1] "z" NA
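Missing values propagate through most computations, so they usually need explicit handling. The following sketch shows the common base R idioms; x is an illustrative vector.

```r
x <- c(1, NA, 3)

# Comparing with NA yields NA, so == cannot be used to find NAs.
x == NA                          # NA NA NA
missing <- is.na(x)              # FALSE TRUE FALSE

# NA propagates through arithmetic unless explicitly removed.
sum(x)                           # NA
total <- sum(x, na.rm = TRUE)    # 4

# Dropping the missing elements by logical subsetting.
complete <- x[!is.na(x)]         # 1 3
```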
4.3.2 NaN
NaN means “not a number.” It results from a mathematical operation whose value is undefined, say:
0/0
## [1] NaN
Also notice that 1/0 results in Inf and -1/0 results in -Inf.
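These special numeric values can be told apart with the base predicates is.nan, is.na, and is.finite. A short sketch, with vals as an illustrative vector:

```r
vals <- c(0/0, 1/0, -1/0, NA, 1)   # NaN Inf -Inf NA 1

is.nan(vals)      # TRUE  FALSE FALSE FALSE FALSE
is.na(vals)       # TRUE  FALSE FALSE TRUE  FALSE  (NaN also counts as NA)
is.finite(vals)   # FALSE FALSE FALSE FALSE TRUE
```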
4.3.3 NULL
NULL is another special value: it means nothing. Do not confuse it with NA. NULL represents the absence of any value, so its length is 0, whereas NA is a value of length 1:
length(NULL)
## [1] 0
length(NA)
## [1] 1
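The difference shows up when the two are placed inside a vector: NULL vanishes, while NA occupies a slot. A minimal sketch, with lst as an illustrative list:

```r
# NULL contributes nothing when combined; NA takes up a position.
with_null <- c(1, NULL, 2)   # 1 2
with_na <- c(1, NA, 2)       # 1 NA 2
length(with_null)            # 2
length(with_na)              # 3

# NULL is tested with is.null(), not with ==.
is.null(NULL)                # TRUE

# Asking a list for an element that does not exist gives NULL.
lst <- list(a = 1, b = 2)
is.null(lst$c)               # TRUE
```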