Chapter 4 Data Structures and Types

There is no scalar in R.

Every structure is a vector. A (seemingly) scalar value is in fact a vector of length 1. This is a striking idea compared to other general-purpose languages or statistical packages. In practice it does not affect much how one writes R code, but understanding it helps one think more clearly in the R language.
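
A quick check makes the point concrete. The following is a minimal sketch using only base R; the variable name x is arbitrary.

```r
x <- 42          # looks like a scalar
is.vector(x)     # TRUE: it is a vector
length(x)        # 1: a vector of length 1
x[1]             # 42: it can be indexed like any other vector
```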

Compared to other general-purpose languages, the built-in data structures in R are few, but they are considerably flexible. Most analytics tasks can be accomplished with this minimal set of structures.

All structures in R are vectors, and there are two types of vector: atomic and recursive. This chapter quickly goes through both of them.

4.1 Atomic Vector

Common types of atomic vector include: numeric, character, logical, and factor.

The atomic vector is the fundamental data structure in R. It is called atomic because it can only contain data of a single type, and no nesting is allowed. For example:

1

## [1] 1

The number 1 is a vector of length 1: it looks like a scalar, but it is a vector. To check the type of a vector, one can use the function typeof:

typeof(1)

## [1] "double"

The result of typeof reveals that the storage mode of the vector 1 is double; hence it is a numeric vector. There are many built-in functions to create and manipulate numeric vectors. Some examples follow:

1:5

## [1] 1 2 3 4 5

seq(5, 1, -1)

## [1] 5 4 3 2 1

c(1, 1, 2, 3, 5)

## [1] 1 1 2 3 5

rep(777, 3)

## [1] 777 777 777

One can also use integer, numeric, character, and logical as creation functions to create vectors of the corresponding type; the argument is the desired length of the vector:

integer(3)    # default value is 0L

## [1] 0 0 0

numeric(3)    # default value is 0

## [1] 0 0 0

character(3)  # default value is "" (empty string)

## [1] "" "" ""

logical(3)    # default value is FALSE

## [1] FALSE FALSE FALSE

Normally a number is stored as type double. To force R to use the integer type (4 bytes per element instead of the 8 bytes of a double), just append L to the number:

typeof(1L)

## [1] "integer"
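
To see the effect of the L suffix and the storage difference side by side, one might try the sketch below. It uses only base R; the names int_size and dbl_size are made up for illustration, and the exact byte counts depend on the platform, so only the comparison is shown.

```r
typeof(5)        # "double": a plain number literal
typeof(5L)       # "integer": the L suffix requests an integer
typeof(1:5)      # "integer": the colon operator already yields integers here

# Compare the memory footprint of one million integers and doubles
int_size <- as.numeric(object.size(integer(1e6)))
dbl_size <- as.numeric(object.size(numeric(1e6)))
int_size < dbl_size   # TRUE: the integer vector is smaller
```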

Since an atomic vector is atomic, it has no nested structure:

c(1:5, seq(5, 1, -1), c(1, 1, 2, 3, 5))

##  [1] 1 2 3 4 5 5 4 3 2 1 1 1 2 3 5

The result of concatenating multiple atomic vectors is one single atomic vector, not a vector with three vectors nested inside it. Does this mean that atomic vectors of different types cannot be combined? Not necessarily.

num_vec <- 1:3
str_vec <- c('a', 'b', 'c')
c(num_vec, str_vec)

## [1] "1" "2" "3" "a" "b" "c"

When a numeric vector and a character vector are combined, the resulting atomic vector is of type character. Such behavior is called type coercion, an important concept in many programming languages. Coercion always occurs when atomic vectors of different types are combined. The principle of coercion is to minimize information loss: the result takes the most flexible of the types involved, in the order logical, integer, double, character.

For example, when a logical vector is combined with a numeric vector:

num_vec <- 1:3 # here one uses "<-" to assign a value to an object
bol_vec <- c(TRUE, FALSE, FALSE) # or c(T, F, F) but not recommended
typeof(c(num_vec, bol_vec))

## [1] "integer"

The logicals are coerced into integers.

More examples on coercion:

typeof(c(1, 2L))

## [1] "double"

typeof(c(1L, TRUE))

## [1] "integer"

typeof(c(1.0, TRUE))

## [1] "double"

typeof(c(1.0, TRUE, "foo"))

## [1] "character"

In addition to implicit coercion, one can also explicitly convert one type to another. This is done with the as.* family of functions (as.integer, as.logical, and so on). For example, to convert integers into logicals and vice versa:

as.logical(-1:2L)

## [1]  TRUE FALSE  TRUE  TRUE

as.integer(c(TRUE, FALSE, FALSE))

## [1] 1 0 0

Notice that when casting integers to logicals, every non-zero value is converted to TRUE; only 0 is converted to FALSE.
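
Other members of the as.* family behave in a similar spirit. The sketch below shows a few conversions in base R, including one that loses information; the variable name bad is made up for illustration.

```r
as.character(1:3)     # "1" "2" "3"
as.numeric("3.14")    # 3.14
as.integer(3.9)       # 3: the fractional part is dropped, not rounded

# A string that does not look like a number becomes NA (R also emits a warning)
bad <- suppressWarnings(as.integer("foo"))
is.na(bad)            # TRUE
```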

4.1.1 General Operations on Vectors

4.1.1.1 Subsetting

Vectors can be subset by using the bracket syntax.

vv <- 10:1
vv[1:2]

## [1] 10  9

vv[c(1, 3, 5)]

## [1] 10  8  6

vv[rep(3, 5)]

## [1] 8 8 8 8 8

The bracket accepts another vector as a selection vector. If the selection vector is numeric, elements are selected by position. Any non-integer index is truncated toward zero, so the result of c(1:10)[c(1.1, 1.9)] is 1, 1.
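
The handling of fractional indices can be checked directly. This is a small sketch in base R, with x as an arbitrary example vector; note that the fractional part is simply dropped, which matters for negative indices.

```r
x <- 1:10
x[c(1.1, 1.9)]   # both indices become 1
x[2.999]         # becomes 2
x[-1.9]          # becomes -1, so only the first element is dropped
x[0.5]           # becomes 0, which selects nothing
```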

Since the bracket accepts vectors, one can also use a logical vector for subsetting:

vv <- 3:1
vv[c(TRUE, FALSE, FALSE)]

## [1] 3

Because subsetting is done with another vector, filtering is very easy:

vv <- 1:10
vv[vv > 5]

## [1]  6  7  8  9 10

Since all structures are vectors, operators such as > also work on vectors by nature. The result of vv > 5 is an element-wise comparison and hence also a vector of the same length:

vv <- 1:10
vv > 5

##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

The resulting logical vector can then be used to subset the original vector.
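
Logical vectors can be combined before subsetting. The following sketch, using only base R and an arbitrary vector x, shows a few common filtering idioms.

```r
x <- 1:10
x[x > 3 & x %% 2 == 0]   # greater than 3 AND even: 4 6 8 10
x[x < 3 | x > 8]         # less than 3 OR greater than 8: 1 2 9 10
which(x > 5)             # positions of the TRUE values: 6 7 8 9 10
sum(x > 5)               # TRUE counts as 1, so this counts the matches: 5
```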

One can also use negative selection:

vv <- 1:3
vv[-3]

## [1] 1 2

vv[-length(vv)]

## [1] 1 2

vv[-(1:3)]

## integer(0)

Subsetting a vector may result in missing values:

vv <- 1:3
vv[4]

## [1] NA

Here the selection vector tries to extract the 4th element of the original vector, but there is none. This does not cause an error in R; instead, the resulting vector contains NA, which generally means a missing value.

4.1.1.2 Updating

Vectors can be updated by using the bracket syntax as well.

vv <- 1:5
vv[1:3] <- 0
vv

## [1] 0 0 0 4 5

One interesting question arises: what happens if one tries to update an element that does not exist in the vector?

vv <- 1:5
vv[7] <- 0
vv

## [1]  1  2  3  4  5 NA  0

Two things should be noticed. First, the value is assigned as if the element were newly created. Second, the vector is padded with NAs up to the new length required by the update.

How about deletion? There is no delete method for vectors in R. To effectively delete an element of a vector, use negative selection and assign the result back to the object:

# to delete the 5th element:
vv <- 1:5
(vv <- vv[-5])  # use parentheses to force printing

## [1] 1 2 3 4
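
Deletion need not be by a single position. As a sketch in base R (x is an arbitrary example vector), one can drop several positions at once, drop by value, or go the other way and insert an element with append.

```r
x <- c(10, 20, 30, 40, 50)
x[-c(2, 4)]                # drop the 2nd and 4th elements: 10 30 50
x[x != 30]                 # drop elements by value: 10 20 40 50
append(x, 99, after = 2)   # insert 99 after the 2nd element: 10 20 99 30 40 50
```

None of these calls modifies x itself; as above, the result has to be assigned back to keep it.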

The bracket syntax is in fact a function call. (Type ?"[" to see the documentation.) To understand more about what’s going on behind the scenes, one should refer to sections 7.5 and 7.6.

4.1.1.3 Naming

Vectors can be named. To check the names of a vector, use the function names. To name or rename a vector, assign a character vector of names to the names function call.

vv <- 1:3
names(vv)

## NULL

names(vv) <- c("foo", "bar", "baz")
str(vv)

##  Named int [1:3] 1 2 3
##  - attr(*, "names")= chr [1:3] "foo" "bar" "baz"

If one is confused why the renaming syntax actually works, see section 7.6 for more details.

Vectors can be partially named:

vv <- 1:3
names(vv) <- c("foo", "bar")
names(vv)

## [1] "foo" "bar" NA

Since the third element is unnamed, its name is reported as NA when names is called.
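
Names are useful mainly because they can be used for subsetting. The sketch below uses base R only; the vector scores and its names are made up for illustration.

```r
scores <- c(foo = 1, bar = 2, baz = 3)   # names can be given at creation
scores["bar"]                # select by name; the name is kept
scores[c("baz", "foo")]      # several names, in the requested order
unname(scores)               # drop the names again
```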

4.1.2 Factor

Factors are special integer vectors, generally used to record categorical variables. A factor combines features of both character and numeric vectors, so it can be confusing for new users. Usually a factor is created from a character vector:

baz <- c("foo", "bar")
bazf <- factor(baz)
typeof(bazf)

## [1] "integer"

str(bazf)

##  Factor w/ 2 levels "bar","foo": 2 1

Notice that typeof indicates an integer type. This is because factors are stored internally as integers. The levels of a factor reveal the complete set of values that can appear in the factor. Now consider a slightly more complicated example:

baz <- c("foo", "bar", "bar", "foo", "bar")
bazf <- factor(baz)
levels(bazf)

## [1] "bar" "foo"

When an object contains many repeated character values, a factor may be a good choice for representing it compactly, since the levels contain only the distinct values.

The levels can be specified manually when creating a factor, and their order matters. If not specified, the levels are the sorted distinct values of the input.

baz <- c("foo", "bar", "bar", "foo", "bar")
bazf1 <- factor(baz, levels=c("bar", "foo"))
bazf2 <- factor(baz, levels=c("foo", "bar"))
str(bazf1)

##  Factor w/ 2 levels "bar","foo": 2 1 1 2 1

str(bazf2)

##  Factor w/ 2 levels "foo","bar": 1 2 2 1 2

levels can even contain unseen values:

baz <- c("foo", "bar", "bar", "foo", "bar")
bazf3 <- factor(baz, levels=c("foo", "bar", "baz"))
str(bazf3)

##  Factor w/ 3 levels "foo","bar","baz": 1 2 2 1 2
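
To look behind the factor abstraction, one can inspect the integer codes and the levels separately. This is a sketch in base R; the factor f is a made-up example with one unused level.

```r
f <- factor(c("foo", "bar", "bar"), levels = c("foo", "bar", "baz"))
as.integer(f)           # the integer codes stored internally: 1 2 2
as.character(f)         # back to the original strings
nlevels(f)              # 3, including the unused level "baz"
table(f)                # counts per level; "baz" has a count of 0
levels(droplevels(f))   # unused levels removed: "foo" "bar"
```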

4.1.3 Matrix

A matrix in R is nothing more than a vector with the dim attribute. That is, a matrix is still a vector. To create a matrix one can use the matrix function:

matrix(0, 3, 3)

##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    0    0    0
## [3,]    0    0    0

mm <- matrix(1:9, 3, 3)
str(mm) # use str to inspect any object you are not sure about

##  int [1:3, 1:3] 1 2 3 4 5 6 7 8 9

attributes(mm)

## $dim
## [1] 3 3

For now, just set aside what an “attribute” is. One can always use the dim function to check the dimension attribute of an object:

vv <- 1:9
dim(vv)

## NULL

A matrix can also be created from an existing vector:

vv <- 1:9
dim(vv) <- c(3,3)
vv

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

Here the second line creates (or replaces) the dim attribute with a 3-by-3 dimension setup.
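
Because a matrix is a vector with dimensions, it can be indexed either by row and column or by plain position. The sketch below uses base R; m and r are arbitrary example names.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)       # filled column by column
m[2, 3]        # row 2, column 3: 6
m[, 1]         # first column as a plain vector: 1 2
m[5]           # still a vector underneath: the 5th element is 5

r <- matrix(1:6, nrow = 2, byrow = TRUE)   # fill row by row instead
r[1, ]         # first row: 1 2 3
```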

4.1.3.1 Digression on `attributes`

This section is for those who are still curious about “attribute”. One can use the attributes function to query all available attributes on an object:

myobject <- 1:6
attributes(myobject)

## NULL

As one can see, a plain numeric vector has no attributes at all. Arbitrary attributes can be set by using the attr function:

attr(myobject, "myattr") <- "so be it"
attr(myobject, "myattr") # attr can also be used to query specific attribute by name

## [1] "so be it"

attributes(myobject)     # unlike attr, query all attributes at once

## $myattr
## [1] "so be it"

str(myobject)

##  atomic [1:6] 1 2 3 4 5 6
##  - attr(*, "myattr")= chr "so be it"

This is part of the object-oriented (OO) nature of the R language: R is a hybrid of OO and functional programming. The details of the OO system in R are beyond the scope of this chapter and will not be elaborated further. Here one should just remember that in R everything is an object, and every object can have attributes.

Now let’s consider giving the object an attribute named dim:

attr(myobject, "dim") <- c(2, 3)
str(myobject)

##  int [1:2, 1:3] 1 2 3 4 5 6
##  - attr(*, "myattr")= chr "so be it"

class(myobject)

## [1] "matrix"

myobject

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## attr(,"myattr")
## [1] "so be it"

Clearly the object becomes a matrix! Here the call to attr on dim is effectively the same as using dim<- in the previous example. One should now understand that every object can have any sort of attribute, but some attributes are more special than others. dim is one of them: an atomic vector with a dim attribute is treated as a matrix. (Since R 4.0.0, class on a matrix returns both "matrix" and "array".)

4.1.3.2 Matrix Operations

Matrices behave like mathematical matrices, so linear algebra operations are readily available:

(vv <- matrix(1:9, 3, 3))

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

# scalar multiplication (element-wise)
vv * 3

##      [,1] [,2] [,3]
## [1,]    3   12   21
## [2,]    6   15   24
## [3,]    9   18   27

# matrix product
vv %*% matrix(1:6, 3, 2)

##      [,1] [,2]
## [1,]   30   66
## [2,]   36   81
## [3,]   42   96

# outer product
1:9 %o% 1:9 # a more general version: see ?outer

##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
##  [1,]    1    2    3    4    5    6    7    8    9
##  [2,]    2    4    6    8   10   12   14   16   18
##  [3,]    3    6    9   12   15   18   21   24   27
##  [4,]    4    8   12   16   20   24   28   32   36
##  [5,]    5   10   15   20   25   30   35   40   45
##  [6,]    6   12   18   24   30   36   42   48   54
##  [7,]    7   14   21   28   35   42   49   56   63
##  [8,]    8   16   24   32   40   48   56   64   72
##  [9,]    9   18   27   36   45   54   63   72   81
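
A few more base R operations round out the picture. The sketch below uses a small made-up invertible matrix A; note in particular that * is element-wise while %*% is the matrix product.

```r
A <- matrix(c(2, 0, 1, 3), 2, 2)   # columns are (2, 0) and (1, 3)
A * A          # element-wise product, NOT the matrix product
A %*% A        # matrix product
t(A)           # transpose
A_inv <- solve(A)                  # matrix inverse
all.equal(A %*% A_inv, diag(2))    # TRUE, up to floating-point error
```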

4.1.4 Array

When there are more than two dimensions, the structure is called an array. A matrix can also be regarded as a 2-dimensional array.
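
As a minimal sketch in base R, an array can be created with the array function by supplying a dim vector; the name arr is arbitrary.

```r
arr <- array(1:24, dim = c(2, 3, 4))   # 2 rows, 3 columns, 4 layers
dim(arr)          # 2 3 4
arr[1, 2, 3]      # row 1, column 2, layer 3: 15
dim(arr[, , 1])   # one layer is an ordinary 2-by-3 matrix
is.array(matrix(0, 2, 2))   # TRUE: a matrix is a 2-dimensional array
```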

4.2 Recursive Vector

4.2.1 List

The most important recursive vector is the list. A list, unlike an atomic vector, can contain elements of different types, and can be nested:

alist <- list(1:3, c("foo", "bar"), paste, list(letters, month.abb))
str(alist)

## List of 4
##  $ : int [1:3] 1 2 3
##  $ : chr [1:2] "foo" "bar"
##  $ :function (..., sep = " ", collapse = NULL)  
##  $ :List of 2
##   ..$ : chr [1:26] "a" "b" "c" "d" ...
##   ..$ : chr [1:12] "Jan" "Feb" "Mar" "Apr" ...

All the operations applicable to atomic vectors are also applicable to lists. A list can also be named.

# slicing
alist[1:3]

## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1] "foo" "bar"
## 
## [[3]]
## function (..., sep = " ", collapse = NULL) 
## .Internal(paste(list(...), sep, collapse))
## <bytecode: 0x7fc0a4091ca0>
## <environment: namespace:base>

# replacement
alist[1] <- 1

# delete
alist[1] <- NULL

# concatenate
c(alist, list(1:3))

## [[1]]
## [1] "foo" "bar"
## 
## [[2]]
## function (..., sep = " ", collapse = NULL) 
## .Internal(paste(list(...), sep, collapse))
## <bytecode: 0x7fc0a4091ca0>
## <environment: namespace:base>
## 
## [[3]]
## [[3]][[1]]
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
## 
## [[3]][[2]]
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"
## 
## 
## [[4]]
## [1] 1 2 3

There is a subtle difference in indexing. The single bracket [ returns a list; the double bracket [[ returns the element itself, without wrapping it in a list. To be clear, see the following example:

alist <- list(1:3, c("foo", "bar"), paste, list(letters, month.abb))
alist[1]

## [[1]]
## [1] 1 2 3

alist[[1]]

## [1] 1 2 3

identical(unlist(alist[1]), alist[[1]])

## [1] TRUE

This also means that [ is used to slice a list (into another list), while [[ is used to extract a single element from a list. Both support the use of character vectors to select by name:

alist <- list(integers=1:3, foobar=c("foo", "bar"), pf=paste, another_list=list(letters, month.abb))
names(alist)

## [1] "integers"     "foobar"       "pf"           "another_list"

alist[c("foobar")] # a list (of length 1)

## $foobar
## [1] "foo" "bar"

alist[["foobar"]]  # a character vector

## [1] "foo" "bar"

There is another special operator for lists: the $ extractor. It serves the same purpose as [[ in that it extracts a single element from a list by name, but the name is written as a bare symbol rather than a character string:

identical(alist$foobar, alist[["foobar"]])

## [1] TRUE
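
One practical consequence of this difference shows up when the name is held in a variable. The sketch below uses base R; the list cfg and the variable key are made up for illustration.

```r
cfg <- list(name = "demo", opts = list(depth = 3, tags = c("a", "b")))
cfg$opts$tags[2]   # extractors can be chained to reach nested elements: "b"

key <- "name"
cfg[[key]]         # [[ evaluates key, giving "demo"
cfg$key            # $ looks for an element literally named "key": NULL
```

So [[ is the one to use when the element name is computed at run time.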

4.2.2 Data Frame

One important extension of list is data.frame, a table data structure to represent tabular data. A data.frame is indeed a list, with additional attributes.

DF <- data.frame(a=1:3, b=letters[1:3], stringsAsFactors=TRUE)  # TRUE was the default before R 4.0.0
str(DF)

## 'data.frame':    3 obs. of  2 variables:
##  $ a: int  1 2 3
##  $ b: Factor w/ 3 levels "a","b","c": 1 2 3

class(DF)   # the abstract type (in OOP sense)

## [1] "data.frame"

typeof(DF)  # the storage type

## [1] "list"

A data.frame can be thought of as a list whose elements all have exactly the same length. Useful operations on a data.frame include:

sort: using order
join: using merge
append: using rbind or cbind
aggregate: using by, aggregate, or ave
pivot: using reshape
partition: using split
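
To give a flavor of the operations listed above, here is a short sketch in base R. The data frames sales and names_df and their columns are made up for illustration; each line shows only the simplest form of the function involved.

```r
sales    <- data.frame(id = c(3, 1, 2), amount = c(30, 10, 20))
names_df <- data.frame(id = c(1, 2, 3), who = c("ann", "bob", "cat"))

sorted <- sales[order(sales$amount), ]                    # sort rows by amount
joined <- merge(sales, names_df, by = "id")               # join on the id column
more   <- rbind(sales, data.frame(id = 4, amount = 40))   # append a row
totals <- aggregate(amount ~ id, data = more, FUN = sum)  # aggregate per id
parts  <- split(more, more$amount > 20)                   # partition into two data frames
```

Here parts is a list of two data frames named "FALSE" and "TRUE", one for each value of the condition.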

Each operation above deserves considerable space to elaborate, but since there is an overall better solution for modeling tabular data in R, the data.table package, the details are left for readers to explore on their own.

4.3 Special Values

4.3.1 `NA`

NA stands for “not available” and usually means a missing value. NA is more complicated than it first appears. There are actually different types of NA:

typeof(NA)

## [1] "logical"

typeof(NA_character_)

## [1] "character"

typeof(NA_integer_)

## [1] "integer"

typeof(NA_real_)

## [1] "double"

Most of the time one simply uses NA and lets R deal with the casting:

c("foo", "bar", NA) # the NA is indeed a character type

## [1] "foo" "bar" NA

c(1:3, NA)          # here NA is typed integer

## [1]  1  2  3 NA

Many operations can result in missing values. For example, subsetting with an out-of-range index:

letters[26:27] # there is no 27th letter in the English alphabet!

## [1] "z" NA
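
Working with NA takes some care, because missing values propagate through most operations. The sketch below shows the usual base R idioms on a made-up vector x.

```r
x <- c(1, NA, 3)
NA == NA               # NA: comparing with a missing value is itself missing
is.na(x)               # FALSE TRUE FALSE: the reliable way to test for NA
sum(x)                 # NA: missing values propagate through arithmetic
sum(x, na.rm = TRUE)   # 4: many functions can skip NAs on request
x[!is.na(x)]           # 1 3: drop the missing values
```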

4.3.2 `NaN`

NaN means “not a number.” It results from a mathematical operation whose value is undefined, for example:

0/0

## [1] NaN

Also notice that 1/0 results in Inf and -1/0 results in -Inf.
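
NaN, NA, and the infinities can be told apart with a few base R predicates, as the following sketch shows.

```r
is.nan(0/0)    # TRUE
is.na(0/0)     # TRUE: NaN also counts as missing
is.nan(NA)     # FALSE: but NA is not NaN
is.finite(c(1, Inf, -Inf, NaN, NA))   # only the first element is finite
Inf - Inf      # NaN: another undefined result
```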

4.3.3 `NULL`

NULL is another special value: it means nothing, the absence of any object. Do not confuse it with NA. Because NULL is nothing, its length is 0, whereas NA is a vector of length 1:

length(NULL)

## [1] 0

length(NA)

## [1] 1
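
The difference also shows when the two values are combined with other data. This is a small sketch in base R.

```r
is.null(NULL)            # TRUE
c(1, NULL, 2)            # NULL vanishes when combined: 1 2
c(1, NA, 2)              # NA keeps its place: 1 NA 2
length(list(NULL))       # 1: a list can hold NULL as an element
is.null(list(a = 1)$b)   # TRUE: extracting a non-existent element yields NULL
```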