Introduction to R

Kyle Chung
November 28th, 2013 at Trend Micro

Outline

About Me

  • Career
    • Now: SPN Data Correlation, Trend Micro
    • Former: Data Mining Programmer at Newegg.com
  • Academic Background
    • Master of Economics, NTU
    • Bachelor of International Business, NCCU
      • Double Major Accounting
  • Skills
    • Econometrics, Data Mining, or something like that

And then, about R...

What is R?

R: The most powerful statistical package

  • A general programming language that is…
    • interpreted
    • designed for statistics
    • with a powerful and active academic community
    • able to run in both interactive and batch mode
    • available on all modern operating systems
    • FREE! (open source)
  • Some useRs consider it general-purpose because it is flexible enough to build applications that integrate with other infrastructure…
    • ODBC with in-line SQL
    • high-level functions to deal with curl, json, XML…
    • Hadoop integration
    • And many others

Some History

  • R is a GNU project based on the S language, which was developed at AT&T Bell Labs (1976)
  • S was then sold (so there is S-PLUS)
  • Okay. That's it.

And why use R?

How about other languages?

  • Among all other statistical packages such as…
    • SAS Extremely expensive with surprisingly little capability.
    • SPSS Not programmatically friendly. Not free.
    • Stata
      • Powerful for economics and epidemiology
      • But with limited flexibility and scalability
      • Not free
    • Matlab Too fat. Not free.
    • GAUSS
      • General Apparatus for Users to Suicide Systematically (I mean it.)

Installation

Let's Begin!

  • For Windows
    • Download and execute the binary
  • For Mac
    • Download and execute the binary
  • For Linux, depending on the distribution…
    • sudo apt-get install r-base-core
    • sudo yum install R-core (recommend using EPEL)
    • Compile it on your own (don't look at me)
  • R on the cloud
    • StatAce: A startup that allows you to conduct analysis on the cloud

Some Advice

  • Choose 64-bit version whenever you can
    • R heavily depends on RAM
    • Some advanced applications only support 64-bit platform, e.g., RHadoop
  • Prefer Linux/Mac over Windows
    • There is a troublesome Unicode encoding issue in Windows not yet solved
      • if you are to analyze data coming from a variety of sources the world over, you WILL be in trouble
    • Some useful packages have no pre-compiled version for, or simply cannot run on, Windows, e.g., bigmemory, rPython

IDE for R

RStudio

  • http://www.rstudio.com/
  • For all platforms
    • RStudio-Server for Linux server
  • Well integrated with Git
  • Markdown language interface customized for R
    • Slides for this lecture are purely written in RStudio
  • A module for developing interactive, graph-based web apps in R
  • And yes, it's FREE!

IPython Notebook

  • http://ipython.org/
  • Generally a Python IDE
  • But you can easily execute R code
    • try the magic function: %load_ext rmagic

Basic Programming Tips in R

Object-Oriented?

Your role matters!

  • In general, R is an OO language
  • As a software developer:
    • You write your application in OO style
    • You are possibly a contributor to CRAN
  • As a data analyst:
    • You hack into data with procedural or functional programming
    • You only need to know the basic concepts of OO
  • After all, it's up to you!
    • I personally never write OO-style code for data analysis purposes

Fundamentals

Environment Setup #1

sessionInfo() # type ?sessionInfo for document help
  • Where is my working directory?
getwd()
  • Windows users, be careful: backslashes in a path must be escaped
setwd('C:\\Dropbox')
# or simply:
setwd('C:/Dropbox')
  • Save environment and quit
save.image()  # saves to the file '.RData' by default
q()

Environment Setup #2

  • Configure startup script
edit(file='C:/Program Files/R/R-3.0.0/etc/Rprofile.site')
  • Get/set search paths for libraries (packages)
.libPaths()
[1] "C:/Users/everdark/Documents/R/win-library/3.0"
[2] "C:/Program Files/R/R-3.0.2/library"           
(.libPaths(c(.libPaths(), getwd())))
[1] "C:/Users/everdark/Documents/R/win-library/3.0"
[2] "C:/Program Files/R/R-3.0.2/library"           
[3] "C:/Dropbox/R/Course"                          

Misc

  • R is case-sensitive
  • Comment character is # only (no multi-line comment char)
  • In general, all objects are put into RAM
    • Yes, there are exceptions
    • use object.size to check memory usage of object(s)
  • No statement terminator required
    • Use semi-colon (;) to put multiple statements on one line
  • Use gc for garbage collection
    • In general R will handle this automatically
  • Use library or require to load external packages
  • Use install.packages to install CRAN packages
  • Use as (and the as.* family, e.g., as.integer) for type conversion
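
A minimal sketch of a few of the points above. The variable names are made up for illustration only:

```r
# two statements on one line, separated by a semi-colon
a <- 1:10; b <- as.character(a)

# estimated memory usage of an object
print(object.size(b))

# type conversion with the as.* family
n <- as.integer('42')     # character -> integer
s <- as.character(3.14)   # double -> character
l <- as.logical(0)        # numeric -> logical (zero becomes FALSE)
```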

Assignment: '<-' versus '='

x <- 999L; y = 0 # append 'L' to force integer type
x                # the same as print(x)
[1] 999
print(y)
[1] 0
  • Use '<-' for assignment
    • The Google R style guide indeed prohibits '=' as assignment
    • Better readability
  • Use '=' for parameter association (in function call)
    • Avoid unpredictable result

More on Assignment

  • Parameter assignment is local-scoped when using '=', but not '<-'
power2 <- function(z) z^2  
power2(z = 3); z           # Error: object 'z' not found
power2(z <- 4); z          # z is assigned numeric 4
  • What's more, due to lazy evaluation
z <- 'whatever remains'
tellTheTruth <- function(z) return(TRUE)
tellTheTruth(z <- 'I won\'t appear.')
[1] TRUE
z
[1] "whatever remains"

Super Assignment: '<<-'

  • Assign value globally within local function (What's the point?)
changeX <- function(x) {
    x <- x + 100
    gvar <<- 'I am a GLOBAL variable.'
    x
}
x <- 0; changeX(x=x)
[1] 100
x
[1] 0
gvar
[1] "I am a GLOBAL variable."

Variables (Symbols) in R are...

  • Not typed and without declaration
x <- 123
typeof(x)
[1] "double"
x <- 'abc'
typeof(x)
[1] "character"
rm(x)     # remove the object
  • Lazily copied: x and y share memory until one of them changes
y <- x <- 'abc'
identical(tracemem(x), tracemem(y))
[1] TRUE
y <- 'abc'
identical(tracemem(x), tracemem(y))
[1] FALSE
identical(x, y)
[1] TRUE

Type and Class

  • Types (of Vectors)
  • Matrices
  • Lists
  • Data Frames
  • Factors

Everything shall begin from a scalar!

Everything shall begin from a vector...

  • Oops! There is no such thing called scalar in R
  • The fundamental building block is the vector
    • 1L is an integer vector of length 1
    • Matrices are vectors
    • Arrays (multidimensional Matrices) are vectors
  • Index is 1-based
V <- c(10L,20,30)   # c for concatenate
typeof(V)           # notice the implicit casting
[1] "double"
V[2]                # try ?'['
[1] 20

Types

  • Types include…
    • logical TRUE, FALSE, T, F
    • integer, double, complex
    • character
    • list
    • closure, special, builtin (these are functions)
    • NULL
  • Use typeof or mode to check the variable type
    • Difference? See ?mode
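
One way to see the difference, as a rough sketch (the vector names types and modes are only for illustration): typeof reports the internal storage type, while mode groups several of those types together.

```r
# typeof distinguishes integer/double and closure/builtin; mode does not
types <- c(int = typeof(1L), dbl = typeof(1.5), fun = typeof(mean), prim = typeof(sum))
modes <- c(int = mode(1L),   dbl = mode(1.5),   fun = mode(mean),   prim = mode(sum))
rbind(types, modes)   # compare the two side by side
```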

Type vs. Class

  • Every object has a mode (type) and a class
    • Mode suggests how it is stored in memory
    • Class suggests its abstract type (utilized in OOP)
x <- integer()
c(typeof(x), class(x))           # try also storage.mode(x)
[1] "integer" "integer"
y <- data.frame()
c(typeof(y), class(y))
[1] "list"       "data.frame"

Vectors

  • Elements of an (atomic) vector must be all of the same type
(mixed <- c('123', 123)) # implicit casting
[1] "123" "123"
  • No insertion or deletion
(mixed <- c(mixed, 4000)[2:4])   # mixed now points to a new vector
[1] "123"  "4000" NA    
  • Can be named
names(mixed) <- c('1st', '2nd', '3rd'); mixed
   1st    2nd    3rd 
 "123" "4000"     NA 

Recycling

  • Operations that require equal-length vectors generally cause R to recycle the shorter one to the necessary length
c(1,2) * c(100,100,100) # '*' is element-wise, see ?'*'
Warning: longer object length is not a multiple of shorter object length
[1] 100 200 100
  • Apply to Matrices as well (Matrices are vectors)
    • Matrices are expanded to vectors by column
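
A small sketch of recycling against a matrix (Mx and Rx are made-up names). Because the matrix is stored column by column, the short vector is recycled down each column:

```r
Mx <- matrix(1:6, nrow = 2)   # filled by column: (1,2), (3,4), (5,6)
Rx <- Mx * c(10, 100)         # row 1 is multiplied by 10, row 2 by 100
Rx
```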

Matrix Multiplication

  • Implemented by %*%
  • No recycling: non-conformable arguments will generate an ERROR
c(1,2,3) %*% c(100,100,100) # inner product
     [,1]
[1,]  600
c(1,2,3) %o% c(100,100,100) # outer product
     [,1] [,2] [,3]
[1,]  100  100  100
[2,]  200  200  200
[3,]  300  300  300

Indexing #1

  • Vector indexing is done by vector
V <- c(10,20,30,40)
V[c(2,4)]
[1] 20 40
V[2:length(V)]    # see ?seq for a generalized ':' function
[1] 20 30 40
bool_index <- c(TRUE,FALSE,FALSE,FALSE)
V[bool_index]
[1] 10

Indexing #2

V1 <- c(10,20,30,40)
(V2 <- V1[c(rep(1,2),2:4)])  # take the 1st element twice
[1] 10 10 20 30 40
V2[-c(2,4)]                  # negative indexing
[1] 10 20 40
  • Application?
V2[V2 > mean(V2)]            # indeed, '>'(V2, mean(V2))
[1] 30 40

Indexing #3

  • Can be done by names (if any)
names(V1) <- c('one', 'two', 'three', 'four')
str(V1) # a very useful function to check structure of an object
 Named num [1:4] 10 20 30 40
 - attr(*, "names")= chr [1:4] "one" "two" "three" "four"
V1[c('two', 'three')]
  two three 
   20    30 
V1[grep('^t', names(V1))]   # see also ?grepl and ?regex
  two three 
   20    30 

Matrices #1

  • A new class: vectors with additional attribute 'dimension'
(M <- matrix(c(1,2,3,4), nrow=2, byrow=FALSE))
     [,1] [,2]
[1,]    1    3
[2,]    2    4
str(M)           # simply a numeric vector with 2 dimensions
 num [1:2, 1:2] 1 2 3 4
dim(M)           # try length(M)
[1] 2 2

Matrices #2

  • Can be named on either rows or columns
colnames(M) <- c('c1', 'c2')
rownames(M) <- c('r1', 'r2')
  • Indexed
identical(M[,2], M[1:2,'c2']) # test exact equality
[1] TRUE
  • Edited
M[1,1] <- 0; M
   c1 c2
r1  0  3
r2  2  4

Matrices #3

  • Be careful! By default R will drop dimension when possible
dim(M[1,])
NULL
str(M[1,])             # rowname attribute is dropped, too
 Named num [1:2] 0 3
 - attr(*, "names")= chr [1:2] "c1" "c2"
str(M[1,,drop=FALSE])  # you'll need this one day
 num [1, 1:2] 0 3
 - attr(*, "dimnames")=List of 2
  ..$ : chr "r1"
  ..$ : chr [1:2] "c1" "c2"

Matrices #4

  • Append by cbind or rbind
M1 <- matrix(1:4, 2, 2)       # use positional argument
M2 <- matrix(0, 2, 2)         # recycle occurs
cbind(M1, M2)
     [,1] [,2] [,3] [,4]
[1,]    1    3    0    0
[2,]    2    4    0    0
unique.matrix(rbind(M1, M2))  # distinct by row
     [,1] [,2]
[1,]    1    3
[2,]    2    4
[3,]    0    0

Sparse Matrix

There are many implementations of sparse matrices in R. The recommended package Matrix (shipped with standard R distributions) provides a fairly flexible class dgCMatrix (which extends the package's own virtual class Matrix rather than the base class matrix).

  • A document example
i <- c(1,3:8); j <- c(2,9,6:10); x <- 7 * (1:7)
(A <- Matrix::sparseMatrix(i, j, x = x))
8 x 10 sparse Matrix of class "dgCMatrix"

[1,] . 7 . . .  .  .  .  .  .
[2,] . . . . .  .  .  .  .  .
[3,] . . . . .  .  .  . 14  .
[4,] . . . . . 21  .  .  .  .
[5,] . . . . .  . 28  .  .  .
[6,] . . . . .  .  . 35  .  .
[7,] . . . . .  .  .  . 42  .
[8,] . . . . .  .  .  .  . 49

Lists #1

  • Similar to dict in Python
  • Still vectors, but recursive (contrary to atomic)
    • Can contain elements of different types
mylist <- list(name='Kyle', gender='male', 18)
str(mylist)         # notice that tag name is not required
List of 3
 $ name  : chr "Kyle"
 $ gender: chr "male"
 $       : num 18
  • Nesting is possible
    • This is why they are called recursive
nested <- list(old=mylist, new='new')

Recursive vs. Atomic Vectors

  • Atomic vectors can't be broken down, and can't have multiple types among their elements
    • numeric, character, logical, …
(atomic <- c(c(1,2,3), 4, 5))              # no nesting
[1] 1 2 3 4 5
  • Not the case for recursive vectors
str(recursive <- list(list('a',2), TRUE))
List of 2
 $ :List of 2
  ..$ : chr "a"
  ..$ : num 2
 $ : logi TRUE

Lists #2

  • List indexing: take individual element
c(
    mylist$name,      # use '$' operator
    mylist[['name']], # use '[[' with tag name
    mylist[[1]]       # use '[[' with numeric index
    )
[1] "Kyle" "Kyle" "Kyle"
  • List indexing: slicing (always return a list)
mylist[1:2]           # try also mylist[c('name', 'gender')]
$name
[1] "Kyle"

$gender
[1] "male"

Lists #3

  • Insertion and deletion made possible!
mylist$newtag <- 'something new'
mylist[[3]] <- NULL
mylist[['gender']] <- 'female'
mylist
$name
[1] "Kyle"

$gender
[1] "female"

$newtag
[1] "something new"
  • Use unlist to convert recursive vectors into an atomic one
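
A quick sketch of what unlist does to a nested list (nested_list and flat are illustrative names). Note the implicit casting to a common type and the generated names:

```r
nested_list <- list(a = 1, b = list(c = 2, d = 'x'))
flat <- unlist(nested_list)   # flattened; everything is cast to character
flat                          # names become a, b.c, b.d
```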

Data Frames #1

  • Indeed lists; hence (recursive) vectors
extends('data.frame') # return superclass of data.frame
[1] "data.frame" "list"       "oldClass"   "vector"    
  • To model tabular data
    • DataFrame class from the pandas package in Python?
    • Yes, it is designed to mimic data.frame in R
DF <- data.frame(small=letters, big=toupper(letters), rn=round(runif(26),1))
head(DF) # print the first 6 rows
  small big  rn
1     a   A 0.9
2     b   B 0.6
3     c   C 0.2
4     d   D 0.6
5     e   E 0.4
6     f   F 0.5

Data Frames #2

  • Join
    • use merge for inner/outer join
  • Append
    • use rbind, cbind (these functions are polymorphic)
  • Aggregate (group-by operation)
    • use aggregate, sweep
  • Pivot
    • use reshape
  • Partition
    • use split.data.frame
  • Order-by
    • use order
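
A rough sketch of some of these operations on two toy tables. The data frames sales and users are invented for illustration:

```r
sales <- data.frame(id = c(1, 2, 2, 3), amount = c(10, 20, 30, 40))
users <- data.frame(id = c(1, 2, 4), name = c('ann', 'bob', 'dan'),
                    stringsAsFactors = FALSE)

inner <- merge(sales, users, by = 'id')                # inner join
left  <- merge(sales, users, by = 'id', all.x = TRUE)  # left outer join

# group-by: total amount per id
totals <- aggregate(amount ~ id, data = sales, FUN = sum)

# partition into a list of data frames, one per id
parts <- split(sales, sales$id)

# order-by: sort rows by amount, descending
sorted <- sales[order(sales$amount, decreasing = TRUE), ]
```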

Data Tables

Package data.table provides a new class 'data.table', which extends 'data.frame' with more efficient computation on fairly large datasets.

  • Specifically, it provides:
    • Indexed data frame with binary search implemented
    • by-reference data manipulation
    • fast group-by operator
  • See quick introduction for more info

Factors #1

  • To model nominal or categorical data
    • Yes, they are vectors
    • Seemingly character, but numeric
(fchar <- factor(sample(letters[1:5], 5, replace=TRUE)))
[1] d b b c c
Levels: b c d
c(class(fchar), typeof(fchar))
[1] "factor"  "integer"
str(fchar)
 Factor w/ 3 levels "b","c","d": 3 1 1 2 2

Factors #2

x <- factor(c(1, 10, 100, 10))
levels(x)
[1] "1"   "10"  "100"
as.character(x)
[1] "1"   "10"  "100" "10" 
as.numeric(x)
[1] 1 2 3 2
grep('00', x)    # grep uses implicit casting
[1] 3

NULL

  • Represents non-existence; has no ordinary mode (its mode is NULL itself)
  • Not to be confused with NA, which represents missing values
c(length(NA), length(NULL))
[1] 1 0
  • Not to be confused with NaN, which represents not-a-number
0/0
[1] NaN
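
A short sketch of how the three are usually tested for (the vector v is made up for illustration):

```r
v <- c(1, NA, 3)
na_flags <- is.na(v)             # element-wise test for missing values
len <- length(c(v, NULL))        # NULL vanishes when concatenated
null_test <- is.null(NULL)       # the test for NULL
nan_test <- is.nan(0/0)          # the test for NaN
nan_is_na <- is.na(0/0)          # NaN also counts as missing for is.na
avg <- mean(v, na.rm = TRUE)     # drop missing values before computing
```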

Control Structure

  • if-else
  • repeat
  • while
  • for
  • try

Conditioning

  • Syntax
if (expression_A) {
    code_block_A
} else if (expression_B) {
    code_block_B
} else {
    code_block_C
}
  • Vectorized version: ifelse
x <- c(1, 2, 3)
ifelse(x > mean(x), 'Yes', 'NO')
[1] "NO"  "NO"  "Yes"

Boolean Operators

  • Be aware of the difference between:
    • Element-wise AND/OR: & and |
    • Scalar AND/OR: && and ||
x <- c(1, 0, 1, 1)
y <- c(1, 0, 0, 0)
x & y             # return Boolean vector
[1]  TRUE FALSE FALSE FALSE
x && y            # only the FIRST element is tested (an error since R 4.3.0)
[1] TRUE
  • Reminder: non-zero numerics are cast to TRUE
    • if (1) print(TRUE) is valid (not for Java you know…)
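
When a single TRUE/FALSE is needed from whole vectors, any and all are the usual tools. A minimal sketch (any_ok and all_ok are illustrative names):

```r
x <- c(1, 0, 1, 1)
y <- c(1, 0, 0, 0)
both   <- x & y        # element-wise result
any_ok <- any(x & y)   # TRUE if at least one pair is TRUE
all_ok <- all(x & y)   # TRUE only if every pair is TRUE
if (any_ok && !all_ok) print('some, but not all, pairs are TRUE')
```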

Looping: repeat

i <- 1
repeat {
    i <- i + 1
    if (i > 10) break
}
i
[1] 11
  • Simply a while (TRUE) loop
  • Escape keywords:
    • next to skip the current iteration
    • break to break the entire loop
  • Indentation is not required but recommended
  • {} is required for a multi-line block

Looping: while

i <- 1
while (i < 10) {
    i <- i + 1
}
i
[1] 10
  • Check the condition at the beginning of each iteration
  • There is no 'until' loop in R
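
If the body must run at least once, a repeat with a break at the end behaves like an until loop. A small sketch:

```r
i <- 100
repeat {
    i <- i + 1              # the body runs at least once
    if (!(i < 10)) break    # leave when the condition no longer holds
}
i
```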

Looping: for

  • Iterate over numeric/character vector
for (i in 1:10) print(i)                 # result not printed
for (c in letters) print(c)
  • Iterate over a list of objects: use get
A <- c(1, 2, 3)
B <- c(100, 200, 300)
for (obj in c('A', 'B')) print(mean(get(obj)))
[1] 2
[1] 200
  • Iterate over a list of objects: put object into a list
for (obj in list(A,B)) print(mean(obj))  # result not printed

Exception Handling in R

You may want to see the section for Function first!

  • Less programmer-friendly than in other languages
    • Not a statement-like format
  • Use try for simple application
  • Use tryCatch for more flexible and formal usage
result <- 
    tryCatch({
        # main expression block
        # last evaluated expression will be returned in case of success
    }, warning=function(cond) {
        # warning handling block, wrapped in function
        # argument is a condition object auto-generated in case of warning
    }, error=function(cond) {
        # error handling block, wrapped in function
        # argument is a condition object auto-generated in case of error
    }, finally={
        # the finally block, which is always evaluated
    })

Exception Handling: Toy Example #1

expr <- quote(1 + '1')

out <- tryCatch({
    eval(expr)
}, error=function(cond) {
    # simply return the condition object
    cond 
}, finally={
    # beware that the finally block is NOT wrapped in function
    print('Finally!')
})
[1] "Finally!"
out
<simpleError in 1 + "1": non-numeric argument to binary operator>
class(out)
[1] "simpleError" "error"       "condition"  

Exception Handling: Toy Example #2

try_list <- list(expr1=quote(1 + 1), expr2=quote(1 + '1'))
sapply(try_list, eval)
Error: non-numeric argument to binary operator
tryEval <- function(expr) {
    out <- tryCatch({
        eval(expr)
    }, error=function(cond) {
        message('the original error message:', '\n', cond)
        return(NA)
    })
    out
}
sapply(try_list, tryEval)
the original error message:
Error in 1 + "1": non-numeric argument to binary operator
expr1 expr2 
    2    NA 
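
For the simple case, try is often enough. A minimal sketch (the error message and the names res, failed and msg are made up):

```r
res <- try(stop('something went wrong'), silent = TRUE)
failed <- inherits(res, 'try-error')   # TRUE when the expression raised an error
if (failed) {
    # the original condition object is kept in the "condition" attribute
    msg <- conditionMessage(attr(res, 'condition'))
}
```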

Function

Nothing more than an object of a specific type, but it deserves its own section because it is the core of R programming.

Or skip to next section

Basic Syntax of Function #1

doSomething <- function(x) {
    x    # this is the same as return(x)
}
  • function is itself a function that creates function objects
    • Actually, { is a function, too
      • Try ?'{'
  • Formal name: closure
    • consisting of arguments, body, and its environment (scope)
environment(doSomething)
<environment: R_GlobalEnv>
  • Use ls to list all variables in global (default) scope

Basic Syntax of Function #2

  • return is not required
    • the function will by default return the last object in body code, and only the last one
      • use list to return multiple objects
    • use return for a conditional break
    • or use stop to break the function with error message
    • and use warning to generate warning message
justInteger <- function(x) 
    if ( !is.integer(x) ) stop('please give me integer')
justInteger(1L)
justInteger('1')
Error: please give me integer
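
A sketch that puts stop, warning and a list return together. The function basicStats is hypothetical, written only for this illustration:

```r
basicStats <- function(x) {
    if (!is.numeric(x)) stop('please give me numbers')
    if (anyNA(x)) warning('missing values are ignored')
    # bundle several results into one list, returned as the last expression
    list(mean = mean(x, na.rm = TRUE), range = range(x, na.rm = TRUE))
}
out <- basicStats(c(1, 2, 3))
out$mean
out$range
```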

Function Nested

  • It is possible (and sometimes desirable) to write a function within a function
    • This results in a scope hierarchy
    • The inner-most scope has the highest priority
  • Functions generally have no side effects
    • Only local copies will be changed
    • You can NOT change a variable in-place
    • A reminder: there are no pointers in base R
      • variables are copied on assignment
      • but lazily (the copy is made only when modified)
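
A small sketch of the scope hierarchy (outerFun and innerFun are made-up names). Each level sees its own x first, and nothing leaks back out:

```r
x <- 'global'
outerFun <- function() {
    x <- 'outer'               # local to outerFun
    innerFun <- function() {
        x <- 'inner'           # shadows both the outer and the global x
        x
    }
    c(inner = innerFun(), outer = x)
}
res <- outerFun()
res
x                              # the global x is untouched
```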

So...

What do you think about the following programming fact in R?

V1 <- c(10,20,30,40); names(V1)
NULL
names(V1) <- c('one', 'two', 'three', 'four'); V1
  one   two three  four 
   10    20    30    40 

Apparently names is a function, and we assign values to the result of its call


In fact..

  • Try ?'names<-' and mode(get('names<-'))
V1 <- c(10,20,30,40)
V1 <- 'names<-'(V1, value=c('one', 'two', 'three', 'four'))
V1
  one   two three  four 
   10    20    30    40 
  • There is something called replacement functions in R
  • Remember this?
    • V1[1] <- 11 will replace the first element of V1 with numeric 11
    • Indeed: try ?'[<-'
    • Also: '$<-' for list operation
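
You can write your own replacement function as well. A minimal sketch, where 'second<-' is a hypothetical name chosen for illustration:

```r
'second<-' <- function(x, value) {
    x[2] <- value
    x             # a replacement function must return the modified object
}
v <- c(10, 20, 30)
second(v) <- 99   # evaluated as v <- 'second<-'(v, value=99)
v
```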

Polymorphism

  • The same function call leads to different operations for objects of different classes
    • Built-in functions such as plot, print, summary, and many others are polymorphic
    • These functions are called generic functions in the documentation
    • Use methods to query applicable classes for a given generic function
methods(summary)[1:20] # restrict to first 20 methods
 [1] "summary.aov"        "summary.aovlist"    "summary.aspell"    
 [4] "summary.connection" "summary.data.frame" "summary.Date"      
 [7] "summary.default"    "summary.ecdf"       "summary.factor"    
[10] "summary.ggplot"     "summary.glm"        "summary.infl"      
[13] "summary.lm"         "summary.loess"      "summary.loglm"     
[16] "summary.manova"     "summary.matrix"     "summary.mlm"       
[19] "summary.negbin"     "summary.nls"       
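
A rough sketch of how such a generic can be built with S3 dispatch. The generic whatIs and its methods are hypothetical, defined only for this example:

```r
# the generic only dispatches on the class of x
whatIs <- function(x, ...) UseMethod('whatIs')

# one method per class, named generic.class
whatIs.default    <- function(x, ...) 'some object'
whatIs.data.frame <- function(x, ...) paste('a table with', nrow(x), 'rows')
whatIs.factor     <- function(x, ...) paste('a factor with', nlevels(x), 'levels')

res <- c(whatIs(1:3), whatIs(cars), whatIs(factor(c('a', 'b'))))
res
```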

Polymorphism: An Example

# reduce sample dataset
iris1 <- iris[,1:2]

# linear regression without intercept
lm_model <- 
    lm(data=iris1, 
       Sepal.Length~ -1+Sepal.Width)

# summarize a data.frame
summary(iris1) 
  Sepal.Length   Sepal.Width  
 Min.   :4.30   Min.   :2.00  
 1st Qu.:5.10   1st Qu.:2.80  
 Median :5.80   Median :3.00  
 Mean   :5.84   Mean   :3.06  
 3rd Qu.:6.40   3rd Qu.:3.30  
 Max.   :7.90   Max.   :4.40  
summary(lm_model)

Call:
lm(formula = Sepal.Length ~ -1 + Sepal.Width, data = iris1)

Residuals:
   Min     1Q Median     3Q    Max 
-2.524 -1.036  0.482  0.990  2.841 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
Sepal.Width   1.8690     0.0326    57.2   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.23 on 149 degrees of freedom
Multiple R-squared:  0.957, Adjusted R-squared:  0.956 
F-statistic: 3.28e+03 on 1 and 149 DF,  p-value: <2e-16

The 'Apply' Function Family

  • Highly readable
  • Ready for parallelization
  • No side effect (compared to explicit loop)
  • Useful members:
    • apply
      • Apply a function to each row/column of a matrix
    • lapply
      • Apply a function to each element of a list
      • Use sapply for a vector version
    • mapply
      • Apply a function with each parameter given in a vector
    • And many others
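
A quick sketch of lapply, sapply and mapply side by side (the result names are illustrative):

```r
squares_list <- lapply(1:3, function(x) x^2)   # returns a list
squares_vec  <- sapply(1:3, function(x) x^2)   # simplified to a vector

# mapply walks through several vectors in parallel:
# rep(1, 3), rep(2, 2), rep(3, 1)
reps <- mapply(rep, 1:3, 3:1)
```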

Example for 'apply'

(M <- matrix(sample(1:15), 3, 5))
     [,1] [,2] [,3] [,4] [,5]
[1,]   15    2    6   10    4
[2,]    8   13   14    1   12
[3,]   11    5    7    3    9
apply(M, 2, mean)
[1] 11.333  6.667  9.000  4.667  8.333
apply(M, 2, function(x) sum(x)/length(x)) # use anonymous function
[1] 11.333  6.667  9.000  4.667  8.333
# Indeed we have...
colMeans(M)
[1] 11.333  6.667  9.000  4.667  8.333

Example for 'lapply'

str(cars)  # remember that a data.frame is a list
'data.frame':   50 obs. of  2 variables:
 $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
 $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
apply(cars, 2, mean)   # return a vector
speed  dist 
15.40 42.98 
lapply(cars, mean)     # return a list
$speed
[1] 15.4

$dist
[1] 42.98

Final Remarks on Function

  • Actually, all control-flow constructs are formed by functions
c(mode(get('if')), mode(get('for')))
[1] "function" "function"
  • And all operators are functions, too
c(mode(get('+')), mode(get('&&')), mode(get('<-')))
[1] "function" "function" "function"
  • Everything is an object, and many of them are functions
mode(get('{'))
[1] "function"

Time for Error Handling!

Now we shall go back to the section for error handling.

Vectorization

A function applied to a vector is actually applied to each element individually (i.e., a map function).

  • 'Apply' family functions are NOT vectorized
  • Most of the built-in operators are vectorized
  • The principle for writing loops in R:
    • NOT to write loops in R (whenever possible)
    • Use vectorization first
  • But why?
    • R interpreter is slow
    • Vectorization is implemented in C or Fortran (they're faster)
    • Side effects

Little Experiment on Vectorization

  • Looping add in Python
import time
mil = range(1, 1000001)
start_time = time.time()
mil = [m + 100 for m in mil]
print(time.time() - start_time, 'seconds')
0.101000070572 seconds

Little Experiment on Vectorization

  • Looping add in R
mil <- c(1:1000000)
system.time( # explicit loop with side effect
    for ( i in seq_len(length(mil)) ) mil[i] <- mil[i] + 100
    )
   user  system elapsed 
   3.12    0.00    3.13 

Little Experiment on Vectorization

  • Looping add in R (one more try!)
mil <- c(1:1000000)
system.time( # high-level loop wrapper without side effect
    mil <- sapply(mil, function(x) x + 100)
    )
   user  system elapsed 
   3.89    0.03    3.93 

Little Experiment on Vectorization

  • Vectorized add in R
mil <- c(1:10000000) # one more zero
system.time(         # What the hell are you two just doing?
    mil <- mil + 100
    )
   user  system elapsed 
   0.15    0.00    0.16 

I/O

Here comes the data.

Basic Operation

  • Commonly used functions include… (see ?files for more)
    • dir: list files under the current working directory
    • file.info: examine information of a given file
    • file.exists: check the existence of a file
  • Or simply use system to execute your system command
    • stdout will be redirected to R console
    • system('ls') is conceptually the same as dir
    • Limitation: pipes not allowed (on Windows, where no shell is involved)
  • Use shell (Windows only) for more flexible shell commands
dir()[grep('\\.html$', dir())]
# is the same as
shell('ls | grep "\\.html$"', intern=TRUE)

Low-Level Load Function: readLines()

  • Read the source line-by-line into a character vector
    • Source could be an external file or a connection
    • Connection can be established purely in R
    • Write to a file via writeLines
temp_file <- file()
cat(
'this is the first line
and this is the second line
this is the end of file
', file=temp_file)   # beware of the quoting position
readLines(temp_file) # the anonymous connection stays open; close() it when done
[1] "this is the first line"      "and this is the second line"
[3] "this is the end of file"    

What is a "connection"?

  • ?connection
  • In the R documentation, a connection describes the mechanism of an I/O operation.
  • There are lots of different kinds of connections:
    • a file connection established by file
    • a compressed file connection by gzfile
      • so that you don't need to manually uncompress the data before analysis
    • a web page connection by url
      • analyze online content with other packages such as XML, rjson
    • many others
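
A minimal sketch of a compressed file connection, using a temporary file so nothing is left behind (gz_path and lines_back are illustrative names):

```r
gz_path <- tempfile(fileext = '.gz')

con <- gzfile(gz_path, open = 'w')   # write through a compressing connection
writeLines(c('line one', 'line two'), con)
close(con)

con <- gzfile(gz_path, open = 'r')   # read back without unpacking by hand
lines_back <- readLines(con)
close(con)

unlink(gz_path)                      # clean up the temporary file
```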

Low-Level Load Function: scan()

  • More configurable than readLines
    • (Optionally) Use a pre-defined schema to load data and convert data types
    • Can parse structurally delimited raw data
  • The main building block of the high-level load functions in R
temp_file <- file()
cat('1,,three', file=temp_file)
scan(temp_file, what=list(col1=1L,col2=1L,col=''), sep=',')
$col1
[1] 1

$col2
[1] NA

$col
[1] "three"

High-Level Load Function: read.table()

  • See ?read.table
    • Many variants exist: read.csv, read.fwf, …
  • Some parameter suggestions…
    • use nrows=100 to read the initial 100 rows for testing purpose
    • use comment.char='' to disable the comment character if you are dealing with messy string data, so that long strings containing '#' do not break the load
    • use stringsAsFactors=FALSE if you don't like characters to be converted into factors
    • use colClasses to pre-define the data schema for speedup
    • use fill=TRUE if your data is unbalanced (has missing columns for some rows)
  • The returned object is a data.frame
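
A sketch that combines these parameters on a tiny in-memory CSV. The data and the name raw_text are made up, and text= is used here only to avoid needing a real file:

```r
raw_text <- 'id,name,score
1,ann,3.5
2,bob
3,cat,4.0'

df <- read.csv(text = raw_text,
               nrows = 100,               # read at most 100 rows
               comment.char = '',         # no comment character
               stringsAsFactors = FALSE,  # keep strings as character
               colClasses = c('integer', 'character', 'numeric'),
               fill = TRUE)               # pad the short second row with NA
str(df)
```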

Visualization

There are a bunch of ways to do data visualization in R, with either the base environment or by leveraging add-on packages and even other languages, for example, JavaScript.

Static Graphing

High-Level Graphing

  • Doing visualization in R is kinda high-level stuff
    • We have plot, barplot, boxplot, matplot, …
      • plot is a generic function doing X-Y plotting
      • barplot is used to generate barplot (what else?)
      • boxplot for boxplot
      • matplot for multi-line plotting
      • and still others, e.g., rgl::plot3d
    • These are mainly built-in graphing functions
    • Many advanced add-on packages available

An Example: Scatter Plot

  • Let's generate some sample data to plot!
case <- iris[,3:5]
colnames(case) = gsub('\\.', '', colnames(case))
case$Name <- paste('N', 
                   round(runif(nrow(case)),3)*1000, 
                   sep='')
head(case)
  PetalLength PetalWidth Species Name
1         1.4        0.2  setosa N904
2         1.4        0.2  setosa N503
3         1.3        0.2  setosa N133
4         1.5        0.2  setosa N858
5         1.4        0.2  setosa N659
6         1.7        0.4  setosa N741

An Example: Scatter Plot (by Base)

plot(x=case$PetalLength,
     y=case$PetalWidth,
     xlab='PetalLength',
     ylab='PetalWidth',
     col=c(1:3)[case$Species], 
     pch=19)
legend('bottomright', 
       levels(case$Species), 
       col=c(1:3), pch=19)
title('Scatter Plot of Iris Data')
  • Problem in base package
    • Syntax and parameters inconsistent across types of plot
    • Hard to configure details


An Example: Scatter Plot (by ggplot2)

library(ggplot2)
AES <- aes(x=PetalLength, 
           y=PetalWidth, 
           group=Species, 
           color=Species)
ggplot(case, AES) + 
    geom_point(size=3) + 
    theme(legend.position='top')
  • Advantage of ggplot2
    • Syntax highly consistent across types of plot
    • Details are auto-adjusted with manageable modifications
    • See more examples here


Dynamic Graphing

Integration with JavaScript (Libraries)

  • Possible solutions:
    1. Do data analysis in R, render graph by JS
      • rCharts, d3Network, googleVis, …
      • These packages provide high-level interface to graphing libraries outside R
    2. Do data analysis and graphing in R, add interactivity by JS
      • gridSVG, SVGAnnotation, …
      • These packages provide high-level interface to JavaScript functionality
    3. Use web app devtool shiny (an RStudio product)
  • In either case, automation done purely in R environment is possible
    • You may need some JS knowledge

Try Solution 1! Plot via NVD3

library(rCharts)
nvd3 <- nvd3Plot(
    PetalLength ~ PetalWidth, data=case, type='scatterChart', 
    group='Species', 
    xAxis=list(axisLabel='PetalWidth'), 
    yAxis=list(axisLabel='PetalLength'),
    chart=list(showDistX=TRUE, showDistY=TRUE)
); nvd3

NVD3 with d3.js Modification

see my RPubs article for the source code.

Try Solution 2! Render graph in R

  • This time we create the plot by R device
library(ggplot2)
library(gridSVG)
AES <- aes(x=PetalLength, 
           y=PetalWidth, 
           group=Species, 
           color=Species)
gg <- ggplot(case, AES) + 
    geom_point(size=3) + 
    theme(legend.position='top')
ggsvg <- grid.export("ggplot_scatter.html", 
                     xmldecl=NULL, addClasses=TRUE)
file.show('ggplot_scatter.html')

Convert grid-based graph to SVG format

Zoom out to check scalability!

Hack .svg: Convert data to JSON

# read back svg inline
raw_svgcode <- readLines('./svg_demo/ggplot_scatter.html', warn=FALSE)

# add d3.js library path
d3js_library_url <- 'https://dl.dropboxusercontent.com/u/210177/d3/d3.v3.min.js'
modified_svgcode <- c(paste('<script src="', 
                            d3js_library_url, 
                            '"></script>', 
                            sep=''), raw_svgcode)

# convert data to json (in object-of-objects format)
head(gg$data)
tojson <- apply(gg$data, 1, function(x) list(x))
tfile <- file()
cat(
    '<script> data=',
    rjson::toJSON(lapply(tojson, function(x) unlist(x))),
    '</script>'
    ,
    file=tfile
    )
importData <- readLines(tfile, warn=FALSE)
close(tfile)
modified_svgcode <- c(modified_svgcode, importData)

Hack .svg: Bind data reversely

# bind data
tfile <- file()
cat(
    '
    <script>
    scatterpoints = d3.select(".points")
                      .selectAll("use")
                      .data(data);
    </script>
    ',
    file=tfile
    )
bindData <- readLines(tfile, warn=FALSE)
close(tfile)
modified_svgcode <- c(modified_svgcode, bindData)

Hack .svg: Add simple tooltip

# add simple tooltip (use browser default title facility)
tfile <- file()
cat(
    '
    <script>
    d3.selectAll("use")
      .append("title")
      .text(function(d) {return d.Name;});
    </script>
    ',
    file=tfile
)
addTooltip <- readLines(tfile, warn=FALSE)
close(tfile)
modified_svgcode <- c(modified_svgcode, addTooltip)

Hack .svg: Add hyperlink event

# add hyperlink event
tfile <- file()
cat(
    '
    <script>
    d3.selectAll("use")
      .on("click", function(d) {
        var url = "http://google.com/search?q=";
        url += d.Name;
        window.location.href = url;}
        );
    </script>
    ',
    file=tfile
)
addHref <- readLines(tfile, warn=FALSE)
close(tfile)
modified_svgcode <- c(modified_svgcode, addHref)

writeLines(modified_svgcode, './svg_demo/ggplot_scatter_modified.html')
file.show('./svg_demo/ggplot_scatter_modified.html')

Hack .svg: Done!

Final Remarks

Some Useful Packages?

  • On-disk computing
    • ff, ffbase, sqldf
  • Parallel computing
    • parallel for embarrassingly parallel problems
    • many others (including GPU computing)
  • Machine learning
    • randomForest, arules, e1071, …
  • Database interface
    • RODBC
  • Hadoop
    • rmr2, rhdfs, rhbase, …
  • See CRAN task views for more info

References

  • Chang (2013), R Graphics Cookbook
  • Matloff (2011), The Art of R Programming
  • Murray (2013), Interactive Data Visualization for the Web