Chapter 5 String Operations

5.1 Basics

5.1.1 paste

In R, strings are stored in vectors of type character. Basic string manipulation is an essential skill, and one of the most useful functions for it is paste:

paste("foo", "bar", "baz", sep=' ')
## [1] "foo bar baz"

The function takes any number of arguments and concatenates them using the separator specified in the sep argument. Because a character value is a vector, paste operates on whole vectors at once; this is called vectorization in R:

paste(c("foo", "bar"), "baz1" , sep=' ')
## [1] "foo baz1" "bar baz1"
paste(c("foo", "bar"), c("baz1", "baz2", "baz3") , sep=' ')
## [1] "foo baz1" "bar baz2" "foo baz3"

What happens here is that the shorter arguments are recycled to the length of the longest one, and then the strings are concatenated element-wise. For more details about vectorization and recycling, see section 8.
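The recycling step can be made explicit. The following is a minimal sketch of what paste does conceptually, not a description of its internals; the variable names are only illustrative, and rep_len is the base R function that repeats a vector up to a given length:

```r
x <- c("foo", "bar")
y <- c("baz1", "baz2", "baz3")

# Recycle the shorter vector by hand to the length of the longer one
n <- max(length(x), length(y))
x_recycled <- rep_len(x, n)
x_recycled
## [1] "foo" "bar" "foo"

# Pasting the hand-recycled vector gives the same result as letting paste recycle
manual <- paste(x_recycled, y, sep = " ")
automatic <- paste(x, y, sep = " ")
identical(manual, automatic)
## [1] TRUE
```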

The paste function can also collapse a character vector into a single string (a character vector of length 1). This is done with the collapse argument:

paste(c("foo", "bar", "baz"), collapse='-')
## [1] "foo-bar-baz"

Collapsing can also be applied after concatenation, by supplying more than one character vector to paste together with the collapse argument:

# concatenate only
paste(c("foo", "bar", "baz"), letters[1:5])
## [1] "foo a" "bar b" "baz c" "foo d" "bar e"
# concatenate and collapse
paste(c("foo", "bar", "baz"), letters[1:5], collapse='-')
## [1] "foo a-bar b-baz c-foo d-bar e"

There is also a variant of paste called paste0. The only difference is that paste0 uses the empty string as its separator (equivalent to sep='' in paste):

paste0(c("foo", "bar"), "baz")
## [1] "foobaz" "barbaz"
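As a quick check of that equivalence, the sketch below compares the two calls directly and shows that collapse works with paste0 as well. The variable names are only illustrative:

```r
v <- c("foo", "bar")

# paste0 behaves like paste with an empty separator
same <- identical(paste0(v, "baz"), paste(v, "baz", sep = ""))
same
## [1] TRUE

# collapse is available in paste0 too
joined <- paste0(v, "baz", collapse = "+")
joined
## [1] "foobaz+barbaz"
```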

5.1.2 sprintf

Another useful function is sprintf, which formats a string by interpolating values into a template:

sprintf("hello %s!", "world")
## [1] "hello world!"
sprintf("hello %.3f!", 7123)
## [1] "hello 7123.000!"
sprintf("hello %i", 7)
## [1] "hello 7"

See ?sprintf for more formatting options. Multiple placeholders are also supported; the values are substituted in the order given:

sprintf("first: %i; second: %s", 1, "two")
## [1] "first: 1; second: two"
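Like paste, sprintf is vectorized over its arguments, which makes it handy for generating a series of labels. The sketch below uses made-up file names and scores purely for illustration; %03d pads an integer with zeros to width 3, and %% produces a literal percent sign:

```r
# Zero-padded integers, one output string per element of ids
ids <- 1:3
files <- sprintf("file_%03d.csv", ids)
files
## [1] "file_001.csv" "file_002.csv" "file_003.csv"

# Several vector arguments are combined element-wise
scores <- sprintf("%s scored %.1f%%", c("ann", "bob"), c(91.5, 78))
scores
## [1] "ann scored 91.5%" "bob scored 78.0%"
```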

5.1.3 cat and print

The cat function concatenates its arguments and prints the result.

cat("foo", c("bar", "baz"))
## foo bar baz
cat("foo", c("bar", "baz"), sep=':')
## foo:bar:baz

Notice the difference between cat and paste: the former does not recycle and prints its result; the latter recycles and returns its result rather than printing it. To see this clearly:

# letters is a built-in variable storing the lowercase English letters
# .Last.value is another built-in variable storing the value of the last
# top-level expression evaluated in an interactive session
cat(letters[1:3], sep='+')
## a+b+c
.Last.value
## NULL
paste(letters[1:3], collapse='+')
## [1] "a+b+c"
.Last.value
## [1] "a+b+c"
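The two functions also differ in how they display a string. As a small illustration (the string x is made up for this example), print shows the quoted, escaped representation of the value, while cat writes the raw characters, so escape sequences such as \n take effect. Note that cat does not add a trailing newline on its own:

```r
x <- "line1\nline2"

# print shows the escaped representation with quotes and an index
print(x)
## [1] "line1\nline2"

# cat writes the characters themselves; the newline is added explicitly
cat(x, "\n", sep = "")
## line1
## line2
```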

5.1.4 Miscellaneous

Use substr to extract part of a string, nchar to count the number of characters in a string, and strsplit to split a string by a specified separator.

ss <- "why so serious?"
substr(ss, 1, 3)
## [1] "why"
nchar(ss)
## [1] 15
strsplit(ss, split=' ') # the returned type is a list
## [[1]]
## [1] "why"      "so"       "serious?"
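strsplit returns a list because it is vectorized: each input string produces its own vector of pieces. The sketch below, with made-up input, shows this along with two details that are easy to miss: unlist flattens the result, and the split argument is treated as a regular expression unless fixed=TRUE is given:

```r
records <- c("a,b,c", "d,e")

# One list element per input string
parts <- strsplit(records, split = ",")
length(parts)
## [1] 2
parts[[2]]
## [1] "d" "e"

# Flatten the list into a single character vector
flat <- unlist(parts)
flat
## [1] "a" "b" "c" "d" "e"

# nchar is vectorized as well
nchar(records)
## [1] 5 3

# "." is a regex metacharacter, so split on it literally with fixed = TRUE
dotted <- strsplit("a.b.c", split = ".", fixed = TRUE)[[1]]
dotted
## [1] "a" "b" "c"
```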

5.2 Regular Expressions

R has several built-in functions for working with regular expressions. Some of them, such as gsub and grep, are quite useful, but others are not very intuitive to work with.

5.2.1 Filtering

Use grep to find the elements of a character vector that match a pattern. It is vectorized and returns either the numeric indices of the matching elements (the default) or the matching strings themselves (with value=TRUE). The grepl variant returns a logical vector instead.
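A short sketch of the three forms, using a made-up vector of fruit names:

```r
fruits <- c("apple", "banana", "cherry", "pineapple")

# Default: indices of the matching elements
idx <- grep("apple", fruits)
idx
## [1] 1 4

# value = TRUE: the matching strings themselves
hits <- grep("apple", fruits, value = TRUE)
hits
## [1] "apple"     "pineapple"

# grepl: one logical per element, convenient for subsetting
starts_with_a <- grepl("^a", fruits)
starts_with_a
## [1]  TRUE FALSE FALSE FALSE
fruits[grepl("an", fruits)]
## [1] "banana"
```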

5.2.2 Substitution

Use gsub to replace every match of a pattern; its sibling sub replaces only the first match.
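For illustration, the sketch below applies both functions to a made-up date string. It also shows backreferences (\\1, \\2, ...) in the replacement, which refer to the parenthesized groups of the pattern:

```r
s <- "2023-01-15"

# sub replaces only the first match
first_only <- sub("-", "/", s)
first_only
## [1] "2023/01-15"

# gsub replaces every match
all_matches <- gsub("-", "/", s)
all_matches
## [1] "2023/01/15"

# Backreferences reorder the captured groups
reordered <- gsub("([0-9]+)-([0-9]+)-([0-9]+)", "\\3/\\2/\\1", s)
reordered
## [1] "15/01/2023"
```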

5.2.3 Extraction

There are multiple ways to extract matched patterns using built-in functions. Sadly, none of them is intuitive to use. The most useful technique is probably the combination of regexec and regmatches, which returns the full match followed by each captured group:

s <- "$VAR/foo/bar"
matched <- regexec("\\$([^/]*)", s)
regmatches(s, matched)
## [[1]]
## [1] "$VAR" "VAR"

However, that is still a lot to type. The code can be shortened by using the pipe operator from the magrittr package:

library(magrittr)
"$VAR/foo/bar" %>% regmatches(., regexec("\\$([^/]*)", .)) %>% .[[1]]
## [1] "$VAR" "VAR"
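regexec only reports the first match in each string. Two related patterns may be useful, sketched here with made-up inputs: gregexpr combined with regmatches returns every match in a string (without capture groups), and vapply can pull a single captured group out of the regexec result for a whole vector:

```r
# All matches in one string
s <- "$HOME/$USER/bin"
all_vars <- regmatches(s, gregexpr("\\$[A-Z]+", s))[[1]]
all_vars
## [1] "$HOME" "$USER"

# First capture group for each element of a vector
paths <- c("$VAR/foo", "$HOME/bar")
m <- regmatches(paths, regexec("\\$([^/]*)", paths))
var_names <- vapply(m, function(x) x[2], character(1))
var_names
## [1] "VAR"  "HOME"
```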

5.2.4 The stringr Library

The stringr package provides more intuitive and easy-to-use functions for regex operations.
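As a rough sketch, assuming the stringr package is installed, the tasks from the previous sections could be written as follows. str_detect, str_replace_all, str_extract, and str_match are stringr functions; the input string is the one used earlier:

```r
library(stringr)

s <- "$VAR/foo/bar"

# Filtering: does the string match the pattern?
detected <- str_detect(s, "foo")
detected
## [1] TRUE

# Substitution: replace every match
replaced <- str_replace_all(s, "/", "-")
replaced
## [1] "$VAR-foo-bar"

# Extraction: the first full match
extracted <- str_extract(s, "\\$[^/]*")
extracted
## [1] "$VAR"

# Extraction with groups: str_match returns a matrix whose first column
# is the full match and whose remaining columns are the captured groups
group1 <- str_match(s, "\\$([^/]*)")[, 2]
group1
## [1] "VAR"
```

In each of these functions the string comes first and the pattern second, which also makes them convenient to use with the pipe operator.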