Chapter 5 String Operations
5.1 Basics
5.1.1 paste
Strings have the type character in the R language. Basic string manipulation is an essential skill. One useful string function is paste:
paste("foo", "bar", "baz", sep=' ')
## [1] "foo bar baz"
The function takes any number of arguments and concatenates them with the separator specified in the sep argument. Note that a character value is a vector, so paste can operate on vectors; this is called vectorization in R:
paste(c("foo", "bar"), "baz1" , sep=' ')
## [1] "foo baz1" "bar baz1"
paste(c("foo", "bar"), c("baz1", "baz2", "baz3") , sep=' ')
## [1] "foo baz1" "bar baz2" "foo baz3"
What happens here is that the shorter arguments are recycled to the length of the longest one, and then the strings are concatenated element-wise. For more details about vectorization and recycling, see section 8.
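Recycling can be made explicit with rep_len, which repeats a vector until it reaches a requested length. The following is a minimal sketch; the variable names short, long and recycled are illustrative only:

```r
# Two inputs of different lengths (values taken from the example above).
short <- c("foo", "bar")
long <- c("baz1", "baz2", "baz3")

# rep_len() repeats a vector until it reaches the requested length,
# which is what recycling does implicitly.
recycled <- rep_len(short, length(long))
recycled
## [1] "foo" "bar" "foo"

# Pasting the explicitly recycled vector gives the same result as
# letting paste() recycle on its own.
identical(paste(recycled, long), paste(short, long))
## [1] TRUE
```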
The paste function can also collapse a character vector into a single string (a character vector of length 1). This is done with the collapse argument:
paste(c("foo", "bar", "baz"), collapse='-')
## [1] "foo-bar-baz"
Collapsing can also be applied after concatenation, by supplying more than one character vector to paste and also specifying the collapse argument:
# concatenate only
paste(c("foo", "bar", "baz"), letters[1:5])
## [1] "foo a" "bar b" "baz c" "foo d" "bar e"
# concatenate and collapse
paste(c("foo", "bar", "baz"), letters[1:5], collapse='-')
## [1] "foo a-bar b-baz c-foo d-bar e"
There is also a variant of paste called paste0. The only difference is that paste0 uses the empty string as its default separator (i.e., sep='' in paste):
paste0(c("foo", "bar"), "baz")
## [1] "foobaz" "barbaz"
5.1.2 sprintf
Another useful function is sprintf. It formats a string with interpolation:
sprintf("hello %s!", "world")
## [1] "hello world!"
sprintf("hello %.3f!", 7123)
## [1] "hello 7123.000!"
sprintf("hello %i", 7)
## [1] "hello 7"
See ?sprintf for more formatting options. Several values can be interpolated at once; they are matched to the format specifiers in order:
sprintf("first: %i; second: %s", 1, "two")
## [1] "first: 1; second: two"
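A few more format specifiers are sketched below: padding, fixed width, and vectorized use. The input values are made up for illustration:

```r
# Zero-pad an integer to width 3.
sprintf("%03d", 7L)
## [1] "007"

# Width 6 with 2 decimal places (right-aligned, padded with spaces).
sprintf("%6.2f", 3.14159)
## [1] "  3.14"

# sprintf is vectorized: one output string per input element.
sprintf("item %d of %d", 1:3, 3L)
## [1] "item 1 of 3" "item 2 of 3" "item 3 of 3"

# A literal percent sign is written as %%.
sprintf("%d%%", 50L)
## [1] "50%"
```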
5.1.3 cat and print
The cat function concatenates its arguments and prints the result.
cat("foo", c("bar", "baz"))
## foo bar baz
cat("foo", c("bar", "baz"), sep=':')
## foo:bar:baz
Notice the difference between cat and paste: the former does not recycle and prints its result; the latter recycles and returns its result rather than printing it. To see this clearly:
# letters is a built-in variable storing the lowercase letters of the English alphabet
# .Last.value is another built-in variable storing the result of the last evaluation
cat(letters[1:3], sep='+')
## a+b+c
.Last.value
## NULL
paste(letters[1:3], collapse='+')
## [1] "a+b+c"
.Last.value
## [1] "a+b+c"
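The same distinction can be checked without .Last.value by assigning each result to a variable. This is a minimal sketch; the names from_cat and from_paste are illustrative only:

```r
# cat() prints its output as a side effect and returns NULL invisibly.
from_cat <- cat(letters[1:3], sep = "+")
## a+b+c
is.null(from_cat)
## [1] TRUE

# paste() prints nothing when its result is assigned;
# the string is the return value.
from_paste <- paste(letters[1:3], collapse = "+")
from_paste
## [1] "a+b+c"
```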
5.1.4 Miscellaneous
Use substr to extract part of a string, nchar to count the number of characters in a string, and strsplit to split a string by a specified separator.
ss <- "why so serious?"
substr(ss, 1, 3)
## [1] "why"
nchar(ss)
## [1] 15
strsplit(ss, split=' ') # the returned type is a list
## [[1]]
## [1] "why" "so" "serious?"
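strsplit returns a list because it is vectorized: each input string gets its own list element. A short sketch of how the three functions behave on slightly different inputs (the values and variable names are illustrative):

```r
# strsplit() is vectorized: one list element per input string.
parts <- strsplit(c("a b c", "d e"), split = " ")
length(parts)
## [1] 2
parts[[2]]
## [1] "d" "e"

# For a single string, [[1]] extracts the character vector of pieces.
words <- strsplit("why so serious?", split = " ")[[1]]
nchar(words)  # nchar() is vectorized as well
## [1] 3 2 8

# substr() extracts by start and stop positions (both inclusive).
substr("why so serious?", 8, 14)
## [1] "serious"
```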
5.2 Regular Expressions
R has several built-in functions for working with regular expressions. Some of them, such as gsub and grep, are quite useful, but others are not that intuitive to work with.
5.2.1 Filtering
Use grep to find the elements of a character vector that match a pattern. It is vectorized and returns either the numeric indices of the matches or, with value=TRUE, the matching strings themselves. The grepl variant returns a logical vector instead.
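A minimal sketch of the three return styles, using a made-up vector named fruits:

```r
fruits <- c("apple", "banana", "cherry", "avocado")

# Default: integer indices of the elements that match the pattern.
grep("^a", fruits)
## [1] 1 4

# value = TRUE: the matching elements themselves.
grep("^a", fruits, value = TRUE)
## [1] "apple"   "avocado"

# grepl(): one logical per element, handy for subsetting.
grepl("an", fruits)
## [1] FALSE  TRUE FALSE FALSE
fruits[!grepl("^a", fruits)]
## [1] "banana" "cherry"
```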
5.2.2 Substitution
Use gsub to replace every match of a pattern.
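The sketch below contrasts gsub with its sibling sub and shows capture-group references in the replacement. The input string is made up for illustration:

```r
s <- "2021-01-15 and 2022-12-31"

# gsub() replaces every match; sub() replaces only the first one.
gsub("-", "/", s)
## [1] "2021/01/15 and 2022/12/31"
sub("-", "/", s)
## [1] "2021/01-15 and 2022-12-31"

# Capture groups in the pattern can be referenced in the
# replacement as \\1, \\2, and so on.
gsub("([0-9]{4})-([0-9]{2})-([0-9]{2})", "\\3.\\2.\\1", s)
## [1] "15.01.2021 and 31.12.2022"

# fixed = TRUE treats the pattern as a literal string, not a regex.
gsub(".", "-", "a.b.c", fixed = TRUE)
## [1] "a-b-c"
```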
5.2.3 Extraction
There are multiple ways to extract a matched pattern using built-in functions. Sadly, none of them is intuitive to use. The most useful technique may be the combination of regexec and regmatches.
s <- "$VAR/foo/bar"
matched <- regexec("\\$([^/]*)", s)
regmatches(s, matched)
## [[1]]
## [1] "$VAR" "VAR"
However, this is still a lot to type. The code can be shortened by using the magrittr library:
library(magrittr)
"$VAR/foo/bar" %>% regmatches(., regexec("\\$([^/]*)", .)) %>% .[[1]]
## [1] "$VAR" "VAR"
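Both functions are vectorized, so the same technique works on several strings at once. The following is a sketch only; the paths vector and the choice to map non-matching strings to NA are assumptions made for illustration:

```r
paths <- c("$VAR/foo/bar", "$HOME/baz", "no/variable/here")

# regexec() returns one match object per input string;
# regmatches() turns them into a list of character vectors.
matches <- regmatches(paths, regexec("\\$([^/]*)", paths))

# Element 1 of each vector is the full match, element 2 the first
# capture group. Strings without a match yield character(0),
# so map those to NA.
var_names <- sapply(matches, function(m) {
  if (length(m) >= 2) m[2] else NA_character_
})
var_names
## [1] "VAR"  "HOME" NA
```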
5.2.4 The stringr Library
The stringr library provides more intuitive and easy-to-use functions for regex operations.
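As a rough sketch of what this looks like in practice, the example below uses four stringr functions on a made-up input vector. It assumes the stringr package is installed:

```r
library(stringr)  # assumption: stringr is installed

s <- c("$VAR/foo/bar", "plain/path")

# str_detect(): does each string match? (compare grepl)
str_detect(s, "\\$")
## [1]  TRUE FALSE

# str_extract(): the first full match, or NA when there is none.
str_extract(s, "\\$[^/]*")
## [1] "$VAR" NA

# str_match(): a matrix with the full match in column 1
# and the capture groups in the following columns.
str_match(s, "\\$([^/]*)")[, 2]
## [1] "VAR" NA

# str_replace_all(): replace every match (compare gsub).
str_replace_all(s, "/", "::")
## [1] "$VAR::foo::bar" "plain::path"
```

All of these take the string as the first argument and the pattern as the second, which also makes them convenient to use with the %>% pipe.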