Chapter 16 Speed-up with Rcpp

This chapter is mainly based on the author’s materials for the R Lightning Talk at Data Science Conference 2015 (Taipei).

16.1 What exactly is R?

  • It’s a program written in R, C, and Fortran.
  • Let’s take a look at its source files.
library(data.table)
library(stringr)

# download R source code if you don't have it yet
rver <- "R-3.2.5"
rsrc <- sprintf("%s.tar.gz", rver)
if ( !file.exists(rver) ) {
  download.file(file.path("http://cran.csie.ntu.edu.tw/src/base/R-3/", rsrc), rsrc)
  untar(rsrc)
}

# get all file extensions in R source code
all_src_files <- list.files(file.path(rver, "src/"), recursive=TRUE, full.names=TRUE)
all_ext <- str_extract(all_src_files, "\\.[a-zA-Z]+$")
file_counts <- as.data.table(sort(table(all_ext), decreasing=TRUE), keep.rownames=TRUE)

16.1.1 Distribution of source file counts in R

library(wordcloud)
par(mar=c(0, 0, 0, 0))
wordcloud(file_counts$V1, file_counts$V2, min.freq=1, scale=c(7,1), rot.per=.35, random.order=F)

library(ggvis)
file_counts[1:10] %>% ggvis(~reorder(V1, -V2), ~V2) %>% layer_bars() %>%
    add_axis('x', title="Source Type") %>% add_axis('y', title="File Count")

  • Top-ten file types:
    • .Rd: R documentation file
    • .R: R source, of course
    • .c: C source
    • .mo: compiled gettext message catalog (binary)
    • .po: gettext translation file, for message translations
    • .h: C header file
    • .afm: Adobe font metrics file
    • .in: usually a template config file for some macro preprocessor
    • .win: same as above, but for Windows
    • .f: Fortran source
# now focus only on the source code file types
src_ext <- c('c', 'h', 'R', 'f')
srcfiles <- list()
for ( ext in src_ext )
    srcfiles[[ext]] <- grep(sprintf("\\.%s$", ext), all_src_files, value=TRUE)
ext_dist <- sapply(srcfiles, length)
round(ext_dist / sum(ext_dist), 2)
##    c    h    R    f 
## 0.37 0.15 0.46 0.02

As you can see, R is heavily written in C. Indeed, most primitive functions are implemented in C, and so is vectorization. The speedy part of R is written in C. (And that’s why it is speedy.)

For example, + (the add function) is a vectorized primitive function written in C.

`+` # no R source code printed cause it is written in C
## function (e1, e2)  .Primitive("+")

In addition to the distribution of file counts in the R source, we can take a further look at the distribution of line counts for each source type.

# unix only
wc_res <- sapply(srcfiles, function(x) 
    sum(as.integer(sapply(sprintf("wc -l < %s", x), system, intern=TRUE))))
round(wc_res / sum(wc_res), 2)
##    c    h    R    f 
## 0.45 0.07 0.20 0.27
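
If you are not on a unix-alike system, a portable (though slower) sketch that approximates the same line counts with readLines() — assuming every source file fits comfortably in memory, and with wc_res_portable being just an illustrative name — would be:

# count lines with readLines() instead of shelling out to wc
wc_res_portable <- sapply(srcfiles, function(fs)
    sum(sapply(fs, function(f) length(readLines(f, warn=FALSE)))))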

16.1.2 Distribution of source line counts in R

line_counts <- as.data.table(wc_res, keep.rownames=TRUE)
line_counts %>% ggvis(~rn, ~wc_res) %>% layer_bars() %>%
    add_axis('x', title="Source Type") %>% add_axis('y', title="Line Count", title_offset=60)

Finally, we find that only 20.14% of the R source lines are written in R, while 45.23% are written in C and 27.27% in Fortran.

16.2 Speed-Up

When R is not satisfying your need for speed…

  • It’s simple. We can write R functions in C/C++, which are usually more than 10X faster than the equivalent native R code.
  • Okay. So why not just use C/C++ for everything?
    • Because your programming time would explode and wipe out all the potential performance gain — provided, that is, you really could finish the analytic job purely in C++ at all.
  • This kind of high-level / low-level trade-off exists in every modern analytic tool.
    • But at least in R you are able to trade for such gains, since it is a very flexible language for analytics.
    • With some other tools you simply CAN’T do anything to speed up your performance.

But writing C/C++ for R is painful: the traditional R API for C/C++ feels like talking to an alien.

Here comes Rcpp

  • What is Rcpp?
    • An R package that provides a much more manageable API for integrating R and C++, so that users can more easily write C++ functions to speed up their R programs (see the minimal sketch below).
    • It provides syntax sugar so that the code reads less like C++ and more like R; indeed it becomes something in between.
  • Useful resources:

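A minimal sketch of the workflow (a hypothetical toy function, just to show the shape of the API): you hand cppFunction() a C++ source string and it compiles the function and exposes it as a callable R function.

library(Rcpp)
cppFunction('
    int addCpp(int x, int y) {
        return x + y;
    }')
addCpp(1, 2)
## [1] 3
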
16.2.1 Example 1: n-gram generation

Imagine you are dealing with hex dumps of files and planning to use their n-gram frequencies as features. Your input data may look like this:

samplefile <- "14-rcpp.Rmd"
system(sprintf("xxd -g1 %s | head", samplefile), intern=TRUE) # unix only
##  [1] "0000000: 23 20 53 70 65 65 64 2d 75 70 20 77 69 74 68 20  # Speed-up with "
##  [2] "0000010: 52 63 70 70 20 7b 23 72 63 70 70 7d 0a 0a 54 68  Rcpp {#rcpp}..Th"
##  [3] "0000020: 69 73 20 63 68 61 70 74 65 72 20 69 73 20 6d 61  is chapter is ma"
##  [4] "0000030: 69 6e 6c 79 20 62 61 73 65 64 20 6f 6e 20 74 68  inly based on th"
##  [5] "0000040: 65 20 61 75 74 68 6f 72 27 73 20 6d 61 74 65 72  e author's mater"
##  [6] "0000050: 69 61 6c 73 20 6f 66 20 5b 52 20 4c 69 67 68 74  ials of [R Light"
##  [7] "0000060: 6e 69 6e 67 20 54 61 6c 6b 5d 28 68 74 74 70 73  ning Talk](https"
##  [8] "0000070: 3a 2f 2f 67 69 74 68 75 62 2e 63 6f 6d 2f 65 76  ://github.com/ev"
##  [9] "0000080: 65 72 64 61 72 6b 2f 72 63 70 70 5f 6c 69 67 68  erdark/rcpp_ligh"
## [10] "0000090: 74 6e 69 6e 67 5f 64 73 63 32 30 31 35 29 20 61  tning_dsc2015) a"

We first tidy the input into one long character sequence…

hexdumplines <- system(sprintf("xxd -g1 %s | cut -d \" \" -f 2-17", samplefile), intern=TRUE) # unix only
hexdump <- unlist(strsplit(paste(hexdumplines, collapse=' '), ' '))
hexdump <- hexdump[hexdump != '']
str(hexdump)
##  chr [1:21009] "23" "20" "53" "70" "65" "65" "64" "2d" ...
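
If xxd is not available, a portable sketch that reads the raw bytes directly (hypothetical code, assuming the whole file fits in memory) yields an equivalent two-digit lowercase hex sequence:

# read raw bytes and format them as two-digit lowercase hex, as xxd -g1 does
rawbytes <- readBin(samplefile, what="raw", n=file.size(samplefile))
hexdump2 <- sprintf("%02x", as.integer(rawbytes))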

Then what? Try looping in native R code, which is rather straightforward.

# first trial: use native R loop (could be terrible for large-scale processing...)
generateNgramFreqR <- function(x, n) {
    len <- length(x) - (n - 1)
    out <- character(len) # pre-allocate size, very important
    for ( i in 1:len ) {
        out[i] <- x[i]
        if ( n > 1 )
            for ( j in 1:(n-1) )
                out[i] <- paste0(out[i], x[i+j])
    }
    unlist(as.list(sort(table(out), decreasing=TRUE)))
}
system.time(r_res <- generateNgramFreqR(hexdump, 2))
##    user  system elapsed 
##   0.166   0.005   0.172
r_res[1:10]
## 2020 696e 2c20 6520 6572 2074 7265 2069 0a20 7320 
## 1251  307  306  288  227  217  186  185  178  177

Or we can use an external library that focuses on text mining, for example tau.

# second trial: use external library
generateNgramFreqTau <- function(x, n) {
    require(tau)
    tau_res <- textcnt(x, n=n, method="string", split="[[:space:]]+")
    names(tau_res) <- gsub(' ', '', names(tau_res))
    sort(tau_res, decreasing=TRUE)
}
system.time(tau_res <- generateNgramFreqTau(hexdump, 2))
##    user  system elapsed 
##   0.099   0.006   0.116
tau_res[1:10]
## 2020 696e 2c20 6520 6572 2074 7265 2069 0a20 7320 
## 1251  307  306  288  227  217  186  185  178  177
# check if the results are equal
identical(tau_res, r_res)
## [1] TRUE

But the performance of the above two solutions is not satisfying when it comes to large-scale problems. Think about dealing with hex-dump strings from tens of thousands of files: we will need a more efficient solution.

Sometimes all you have to do to make things better is find ANOTHER package that handles the problem at hand more powerfully. But sometimes you just don’t have much to choose from. In this very case of generating n-grams, although there is another well-known general text mining package, tm, available, it does not serve as a faster solution to our problem. Indeed, it is prohibitively slow.

# third trial: use another external library, hopefully faster? (Nope.)
generateNgramFreqTm <- function(x, n) {
    require(tm)
    op <- options(mc.cores=NULL) 
    options(mc.cores=1)
    on.exit(options(op))
    
    ngramTokenizer <- function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min=2, max=2))
    corp <- Corpus(VectorSource(paste(x, collapse=' ')))
    tdm <- TermDocumentMatrix(corp, control=list(tokenize=ngramTokenizer))
    res <- rowSums(as.matrix(tdm))
    names(res) <- gsub(' ', '', names(res))
    sort(res, decreasing=TRUE)
}

# full-size input not run, too slow
length(hexdumplines)
## [1] 1314
system.time(tm_res <- generateNgramFreqTm(hexdumplines[1:10000], 2))
##    user  system elapsed 
##   1.293   0.135   1.412
tm_res[1:10]
## nana 2020 696e 2c20 6520 6572 2074 7265 2069 0a20 
## 8685 1251  307  306  288  227  217  186  185  178

Let’s try Rcpp to see what we could further achieve.

# final trial: use Rcpp to write the function in C++
library(Rcpp)
## Warning: package 'Rcpp' was built under R version 3.2.4
cppFunction('
    CharacterVector generateNgramFreq (CharacterVector hexvector, int ngram) {
        int len = hexvector.size() - (ngram - 1);
        CharacterVector out(len);
        for (int i = 0; i < len; i++) {
            out(i) = hexvector[i];
            for (int j = 1; j < ngram; j++) {
                out(i) += hexvector[i+j];
            }
        }
        return out;
    }')
generateNgramFreq
## function (hexvector, ngram) 
## .Primitive(".Call")(<pointer: 0x11ed5c050>, hexvector, ngram)
generateNgramFreqCpp <- function(x, n)
    unlist(as.list(sort(table(generateNgramFreq(x, n)), decreasing=TRUE)))
system.time(cpp_res <- generateNgramFreqCpp(hexdump, 2))
##    user  system elapsed 
##   0.012   0.001   0.012
cpp_res[1:10]
## 2020 696e 2c20 6520 6572 2074 7265 2069 0a20 7320 
## 1251  307  306  288  227  217  186  185  178  177
identical(cpp_res, r_res)
## [1] TRUE

Now it’s time to benchmark these solutions to see the performance difference. You’ll see that the Rcpp solution is more than 10X faster than its alternatives. Depending on the scale of your problem, the gap can grow even larger.

library(microbenchmark)
microbenchmark(generateNgramFreqR(hexdump, 2),
               generateNgramFreqTau(hexdump, 2),
               generateNgramFreqCpp(hexdump, 2), 
               times=10)
## Unit: milliseconds
##                              expr       min        lq      mean    median
##    generateNgramFreqR(hexdump, 2) 112.56563 134.52664 190.84401 179.78234
##  generateNgramFreqTau(hexdump, 2)  75.23374 118.21113 135.82765 130.69962
##  generateNgramFreqCpp(hexdump, 2)  10.99040  12.97046  21.10394  19.52712
##        uq       max neval
##  236.5872 302.41911    10
##  169.4746 180.27579    10
##   26.6165  36.01916    10

16.2.2 Example 2: moving-window computing

Move on to the next case. Suppose you’d like to monitor the system logs on your computer, counting, in a sliding-window manner, how many log records each process triggers.

syslogname <- "system.log" # this is for Mac, for Linux you may use "syslog" instead
syslogs <- grep(syslogname, dir("/var/log", full.names=TRUE), value=TRUE)
loglines <- unlist(lapply(syslogs, 
                          function(x) {
                              con <- gzfile(x)
                              on.exit(close(con))
                              readLines(con)
                          }))
head(loglines)
## [1] "May 13 00:00:01 kyledembpr.local SubmitDiagInfo[718]: Couldn't load config file from on-disk location. Falling back to default location. Reason: Won't serialize in _readDictionaryFromJSONData due to nil object"
## [2] "May 13 00:00:09 kyledembpr.local awacsd[202]: Could not find token for ff32:0000:0000:0000:0000:0000:0000:5222"                                                                                                   
## [3] "May 13 00:00:21 kyledembpr.local Microsoft Outlook[284]: Stream 0x7f4705b0 is sending an event before being opened"                                                                                               
## [4] "May 13 00:00:21 --- last message repeated 1 time ---"                                                                                                                                                             
## [5] "May 13 00:00:21 kyledembpr.local Microsoft Outlook[284]: Stream 0x7d1992b0 is sending an event before being opened"                                                                                               
## [6] "May 13 00:00:39 --- last message repeated 1 time ---"

Here we use a regex to extract the fields we are interested in, convert the datetime characters to POSIX time objects, and take a look at the most frequent process names.

logdat <- as.data.table(str_match(loglines, "(^.*[0-9]+ [0-9]{2}:[0-9]{2}:[0-9]{2})([^:]+:)")[,2:3])
setnames(logdat, c("ts", "cate"))
logdat[, `:=`(cate=str_match(cate, " ([a-zA-Z]+)\\[[0-9]+\\]")[,2],
              ts=as.POSIXct(ts, format="%b %e %H:%M:%S"))]
logdat[, tsint:=as.integer(ts)]
logdat <- logdat[!is.na(cate)]
setorder(logdat, ts)
logdat
##                         ts           cate      tsint
##     1: 2016-05-06 00:00:01 SubmitDiagInfo 1462464001
##     2: 2016-05-06 00:00:07        configd 1462464007
##     3: 2016-05-06 00:00:15        configd 1462464015
##     4: 2016-05-06 00:00:23         awacsd 1462464023
##     5: 2016-05-06 00:00:53         awacsd 1462464053
##    ---                                              
## 48934: 2016-05-13 18:00:39         awacsd 1463133639
## 48935: 2016-05-13 18:01:09         awacsd 1463133669
## 48936: 2016-05-13 18:01:20           sshd 1463133680
## 48937: 2016-05-13 18:01:39         awacsd 1463133699
## 48938: 2016-05-13 18:02:09         awacsd 1463133729
sort(table(logdat$cate), decreasing=TRUE)[1:10]
## 
##                     awacsd                       Dock 
##                      22045                       8688 
##                    configd                    Outlook 
##                       8479                       2550 
##               WindowServer                   sandboxd 
##                       1328                       1036 
##                     kernel              storeaccountd 
##                        924                        364 
## GoogleSoftwareUpdateDaemon              nsurlstoraged 
##                        352                        345

Suppose the task is to count, every minute, the number of records for each process within the last 5 minutes. This is a moving-window operation which could be painful should you use pure R code to solve it. Of course there are dedicated packages aiming at this kind of task, but again, for this very case, none of them, to the best of my knowledge, works efficiently.

# first trial: again, naive approach first
movingWindowCountR <- function(DT, wsize, interval, select=NULL) {
    require(data.table)
    sts <- min(DT$tsint)
    ninter <- ceiling(diff(range(DT$tsint)) / interval)
    if ( !is.null(select) )
        DT <- DT[cate %in% select]
    res <- list()
    for ( ca in unique(DT$cate) ) {
        sts_ <- sts
        cnt <- integer(ninter)
        for ( i in 1:ninter ) {
            cnt[i] <- nrow(DT[cate == ca][tsint <= sts_ & tsint > sts_ - wsize])
            sts_ <- sts_ + interval
        }
        res[[ca]] <- cnt
    }
    res
}
# run for only two processes because it is too slow...
system.time(r_res <- movingWindowCountR(logdat, 300, 60, c("kernel", "discoveryd")))
##    user  system elapsed 
##  39.031   0.529  41.415

Let’s plot the moving-window-count time series for process “kernel”.

kernel_mvcnt <- data.table(count=r_res[["kernel"]])
kernel_mvcnt[, t:=1:.N]
kernel_mvcnt %>% ggvis(~t, ~count) %>% layer_lines()

The bottleneck is, of course, the loop. R is not good at explicit looping. When you just can’t find a way to avoid code like this, it is probably a strong sign that you should use Rcpp.

But before that, let’s struggle a little bit more.

# second trial: play tricky
movingWindowCountR2 <- function(DT, wsize, interval, select=NULL) {
    require(data.table)
    allts <- data.table(tsint=min(DT$tsint):max(DT$tsint))
    if ( !is.null(select) )
        DT <- DT[cate %in% select]
    res <- lapply(split(DT, DT$cate), 
                  function(x) {
                      x <- merge(x, allts, by="tsint", all=TRUE)
                      tsunit_cnt <- x[, list(cnt=sum(!is.na(cate))), by="tsint"]
                      tsunit_cumcnt <- c(rep(0, wsize), cumsum(tsunit_cnt$cnt))
                      as.integer(tail(tsunit_cumcnt, -wsize) - head(tsunit_cumcnt, -wsize))
                  })
    lapply(res, function(x) x[seq(1, length(x), interval)])
}

# run for all processes
system.time(r2_res <- movingWindowCountR2(logdat, 300, 60))
##    user  system elapsed 
##  93.059   5.520 100.371
identical(r_res[["kernel"]], r2_res[["kernel"]])
## [1] TRUE

Our second solution is considerably faster, though a little harder to code and maintain. What if we simply rewrite the loop in the first solution with Rcpp?

# final trial
cppFunction('
    NumericVector movingWindowCount(NumericVector ts, int wsize, int interval, int sts, int nt) {
        NumericVector out(nt);
        for (int i = 0; i < nt; i++) {
            NumericVector cnts = ts[(ts <= sts) & (ts > sts - wsize)];
            out(i) = cnts.size();
            sts += interval;
        }
        return out;
    }')

movingWindowCountCpp <- function(DT, wsize, interval, select=NULL) {
    require(data.table)
    sts <- min(DT$tsint)
    ninter <- ceiling(diff(range(DT$tsint)) / interval)
    if ( !is.null(select) )
        DT <- DT[cate %in% select]
    lapply(split(DT, DT$cate), 
           function(x) as.integer(movingWindowCount(x$ts, wsize, interval, sts, ninter)))
}

# run for all processes
system.time(rcpp_res <- movingWindowCountCpp(logdat, 300, 60))
##    user  system elapsed 
##  13.382   0.946  14.436
identical(rcpp_res[["kernel"]], r_res[["kernel"]])
## [1] TRUE
length(rcpp_res)
## [1] 123
xargs <- list(logdat, 300, 60, c("kernel", "discoveryd"))
microbenchmark(do.call(movingWindowCountR, xargs),
               do.call(movingWindowCountR2, xargs),
               do.call(movingWindowCountCpp, xargs), 
               times=3)
## Unit: milliseconds
##                                  expr        min         lq       mean
##    do.call(movingWindowCountR, xargs) 34155.1397 34721.1223 37138.3304
##   do.call(movingWindowCountR2, xargs)  1631.4695  1687.2937  1759.3501
##  do.call(movingWindowCountCpp, xargs)   334.6661   344.3424   356.5243
##      median         uq        max neval
##  35287.1050 38629.9257 41972.7464     3
##   1743.1179  1823.2905  1903.4631     3
##    354.0188   367.4535   380.8882     3

In a benchmark that processes only a small number of processes, solution 2 may outperform the Rcpp solution. Nevertheless, the latter scales considerably better. Let’s examine this point further. (It also depends on your machine. For example, on my MacBook 12" solution 2 is faster overall at 2 processes, but on my MacBook Pro 13" the Rcpp solution consistently outperforms the others.)

allcate <- unique(logdat$cate)
nround <- 20
elapsed_r2 <- numeric(nround)
elapsed_rcpp <- numeric(nround)
for ( i in 1:nround ) {
    xargs <- list(logdat, 300, 60, allcate[1:i])
    
    stime <- proc.time()
    do.call(movingWindowCountR2, xargs)
    elapsed_r2[i] <- (proc.time() - stime)["elapsed"]
    
    stime <- proc.time()
    do.call(movingWindowCountCpp, xargs)
    elapsed_rcpp[i] <- (proc.time() - stime)["elapsed"]    
}
bres <- rbind(data.table(n=1:nround, t=elapsed_r2, alg="r2"), 
              data.table(n=1:nround, t=elapsed_rcpp, alg="rcpp"))
bres %>% ggvis(~n, ~t, fill=~alg) %>% layer_points() %>% 
    add_axis('x', title="nround") %>% add_axis('y', title="Seconds Elapsed")

As you can see, the Rcpp solution is already remarkably fast on a fair-sized problem. Sometimes (as in the current case) the speed gain grows with the scale of the given problem. The promising point is that we don’t actually add much complexity to our code. (Okay, maybe a little if one is new to Rcpp.) Only a small part of the code is rewritten in Rcpp, and the performance gain is tremendous.

  • See Rcpp sugar introduction to discover more tips about writing Rcpp functions.
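
As a tiny illustration of what sugar buys (a sketch of my own, not taken from the talk materials): vectorized comparison, logical subsetting, and mean() read almost like R inside C++.

cppFunction('
    double meanPositive(NumericVector x) {
        // sugar: element-wise comparison and logical subsetting, as in R
        NumericVector pos = x[x > 0];
        return mean(pos);
    }')
meanPositive(c(-1, 2, -3, 4))
## [1] 3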

16.2.3 Example 3: ad-hoc data.frame operation

There are other cases in which you may find the solutions from existing packages lacking efficiency. Let’s assume an MOOC-like student-course activity log in a simplistic format: we log all activities for all students in all courses, each with a timestamp. (Assume 26 courses in total, marked by English letters. Students are indexed by incremental integers.)

# generate simulated data as our playground
genSimData <- function(nlog, nuser, seed) {
    set.seed(seed)
    dat <- data.table(user=sample(1:nuser, nlog, replace=TRUE, 
                                  prob=rbeta(1:nuser, .1, 2)),
                      course=sample(letters, nlog, replace=TRUE, prob=1:26),
                      ts=runif(nlog, as.integer(as.POSIXct("2015-01-01")),
                               as.integer(as.POSIXct("2015-06-30"))))
    dat[, t:=as.POSIXct(ts, origin="1970-01-01")]
    setorder(dat, user, course, ts)
    dat
}

simlog <- genSimData(nlog=1e6, nuser=1e3, seed=528491)
head(simlog)
##    user course         ts                   t
## 1:    2      a 1420055427 2015-01-01 03:50:26
## 2:    2      a 1420219054 2015-01-03 01:17:33
## 3:    2      a 1420320164 2015-01-04 05:22:44
## 4:    2      a 1421227260 2015-01-14 17:21:00
## 5:    2      a 1421467857 2015-01-17 12:10:57
## 6:    2      a 1422182483 2015-01-25 18:41:22

Now, based on such a dataset, assume we’d like to extract a feature: for each course, how many courses in total had a student registered for by the time of his/her last activity record in that course? For example, if student 1’s last activity in course a is today, in how many courses in total (including course a) does she already have at least one activity record?
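
A tiny hand-made illustration (hypothetical numbers, not drawn from simlog): suppose user 1 has activities in course a at times 1 and 2, and in course b at time 3.

toy <- data.table(user=1L, course=c("a", "a", "b"), ts=c(1, 2, 3))
# per-course activity windows for user 1: a -> [1, 2], b -> [3, 3]
# at the last activity in course a (ts = 2) only course a has started: res = 1
# at the last activity in course b (ts = 3) both courses have started: res = 2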

This problem requires a somewhat twisted by-group operation on a data.frame, and hence could be challenging if the scale is huge.

# naive approach first
adHocOperateR <- function(DT) {
    require(data.table)
    DT <- copy(DT)
    DT[, `:=`(tsmin=min(ts), tsmax=max(ts)), by="user,course"]
    DT <- unique(DT[, list(user, course, tsmin, tsmax)])
    res <- integer()
    for ( i in unique(DT$user) ) {
        tmpdat <- DT[user == i]
        for ( j in tmpdat$course ) {
            res <- c(res, sum(tmpdat$tsmin <= tmpdat[course == j, tsmax]))
        }
    }
    cbind(DT[, list(user, course)], res)
}

system.time(res_r1 <- adHocOperateR(simlog))
##    user  system elapsed 
##  12.326   0.285  12.996
res_r1[sample(nrow(res_r1), 10)]
##     user course res
##  1:  658      z  25
##  2:  386      q  26
##  3:  504      y  24
##  4:  481      p  15
##  5:  284      j  26
##  6:  184      q  26
##  7:  629      k  26
##  8:  884      d  26
##  9:  118      k  25
## 10:  320      b  26

The naive approach scales well in nlog but not in nuser, because the latter determines how many loop iterations actually run in our solution. Relying only on data.table, can we improve? The answer is yes. The following solution uses the so-called “grouping by each i” (by=.EACHI) technique to drastically improve performance.

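Before the actual solution, a toy sketch (hypothetical data) of what by=.EACHI does: in a join x[i, ...], the j expression is evaluated once per row of i, over the rows of x that match it.

dt <- data.table(id=c(1, 1, 2), v=c(10, 20, 30), key="id")
lookup <- data.table(id=c(1, 2), threshold=c(15, 15), key="id")
dt[lookup, list(n_below=sum(v <= threshold)), by=.EACHI]
##    id n_below
## 1:  1       1
## 2:  2       0
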
# let's utilize data.table better
adHocOperateR2 <- function(DT) {
    require(data.table)
    DT <- copy(DT)
    DT[, `:=`(tsmin=min(ts), tsmax=max(ts)), by="user,course"]
    tmpdat1 <- unique(DT[, list(user, course, tsmin)])
    tmpdat2 <- unique(DT[, list(user, course, tsmax)])
    setkey(tmpdat1, user)
    setkey(tmpdat2, user)
    cbind(tmpdat1[, list(user, course)], 
          res=tmpdat1[tmpdat2, list(res=sum(tsmin <= tsmax)), by=.EACHI][, res])
}
system.time(res_r2 <- adHocOperateR2(simlog))
##    user  system elapsed 
##   0.116   0.022   0.139
setkey(res_r1, user)
identical(res_r1, res_r2)
## [1] TRUE

Time for Rcpp. Is it possible to be even faster?

cppFunction('
    NumericVector adHocOperate (NumericVector first, NumericVector last) {
        int len = first.size();
        NumericVector out(len);
        for (int i = 0; i < len; i++) {
            out(i) = 0;
            for (int j = 0; j < len; j++) {
                if (first(j) <= last(i)) {
                    out(i) += 1;
                }
            }
        }
        return out;
    }')

adHocOperateCpp <- function(DT) {
    require(data.table)
    DT <- copy(DT)
    DT[, `:=`(tsmin=min(ts), tsmax=max(ts)), by="user,course"]
    DT <- unique(DT[, list(user, course, tsmin, tsmax)])
    cbind(DT[, list(user, course)],
          res=as.integer(DT[, list(res=adHocOperate(.SD$tsmin, .SD$tsmax)), by="user"][, res]))
}

system.time(res_cpp <- adHocOperateCpp(simlog))
##    user  system elapsed 
##   0.107   0.016   0.124
setkey(res_cpp, user)
identical(res_cpp, res_r1)
## [1] TRUE

Well, the result suggests that the Rcpp solution performs roughly on par with the data.table solution. Time to inspect scalability.

# make the data bigger
simlog <- genSimData(nlog=1e7, nuser=1e4, seed=528491)
microbenchmark(adHocOperateR2(simlog),
               adHocOperateCpp(simlog), 
               times=10)
## Unit: seconds
##                     expr      min       lq     mean   median       uq
##   adHocOperateR2(simlog) 1.495588 1.535027 1.625718 1.581211 1.730887
##  adHocOperateCpp(simlog) 1.374856 1.383693 1.463981 1.409672 1.549285
##       max neval
##  1.851289    10
##  1.731394    10

Finally, the Rcpp solution still outperforms, but not by a wide margin. The data.table solution does a fairly good job. The reason is not surprising: data.table itself is largely written in C.

Here is the takeaway from the third example: when performance is the key factor in your computing job and you find yourself facing a severe bottleneck, you either rewrite the key part of your code in Rcpp or you choose a package that is written in C and can hopefully solve your problem. To be a more independent data analyst, it is really worth learning Rcpp. As you can see in the above three examples, sometimes it only takes a little simple coding and your task will be done lightning fast.