Chapter 1 Introduction to R
The goal of the first part of this book is to get you up to speed with the basics of R as quickly as possible.
1.1 Installation
Install the latest preview version to get all the features.
Follow the procedures according to your operating system.
- Linux: You need to have blas and gfortran installed on your Linux system in order to install the coin package.
- Rgraphviz requires installation from source("http://bioconductor.org/biocLite.R"), then biocLite("Rgraphviz").
- Run the following lines to install all needed packages (this may take some time):
## listofpackages <- c("arules","arulesViz", "bookdown", "ggplot2", "vioplot", "UsingR", "fpc", "reshape", "party", "C50", "utils", "rpart", "rpart.plot", "class", "klaR", "e1071", "popbio", "boot", "dplyr", "doParallel", "gbm", "DMwR", "pROC", "neuralnet", "igraph", "RMySQL", "caret", "randomForest", "tm", "wordcloud", "xts", "lubridate", "forecast", "urca", "glmnet", "FSelector", "pls", "emoa", "foreign" )
# newpackages <- listofpackages[!(listofpackages %in% installed.packages()[,"Package"])]
# if(length(newpackages)>0) install.packages(newpackages,dependencies = TRUE)
# # install from archive (rgp is no longer maintained)
# if (!is.element("rgp", installed.packages()[,1]))
# { install.packages("https://cran.r-project.org/src/contrib/Archive/rgp/rgp_0.4-1.tar.gz",
# repos = NULL)
# }
## end of installing packages
# in Linux you may need to run several commands (in the terminal) and install additional libraries, e.g.
# sudo R CMD javareconf
# sudo apt-get install build-essential
# sudo apt-get install libxml2-dev
# sudo apt-get install libpq
# sudo apt-get install libpq-dev
# sudo apt-get install -y libmariadb-client-lgpl-dev
# sudo apt-get install texlive-xetex
# sudo apt-get install r-cran-rmysql
1.2 R and RStudio
R is a programming language for statistical computing and data analysis that supports a variety of programming styles. See R in Wikipedia
R has multiple online resources and books.
Getting help in R
R as a calculator. Console: It uses the command-line interface.
This document is an RMarkdown document. See bookdown.org
Examples:
x <- c(1,2,3,4,5,6) # Create ordered collection (vector)
y <- x^2 # Square the elements of x
print(y) # print (vector) y
## [1] 1 4 9 16 25 36
mean(y) # Calculate average (arithmetic mean) of (vector) y; result is scalar
## [1] 15.16667
var(y) # Calculate sample variance
## [1] 178.9667
lm_1 <- lm(y ~ x) # Fit a linear regression model "y = f(x)" or "y = B0 + (B1 * x)"
# and store the results as lm_1
print(lm_1) # Print the model from the (linear model object) lm_1
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## -9.333 7.000
summary(lm_1) # Compute and print statistics for the fit
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## 1 2 3 4 5 6
## 3.3333 -0.6667 -2.6667 -2.6667 -0.6667 3.3333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.3333 2.8441 -3.282 0.030453 *
## x 7.0000 0.7303 9.585 0.000662 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.055 on 4 degrees of freedom
## Multiple R-squared: 0.9583, Adjusted R-squared: 0.9478
## F-statistic: 91.88 on 1 and 4 DF, p-value: 0.000662
# of the (linear model object) lm_1
par(mfrow=c(2, 2)) # Request 2x2 plot layout
plot(lm_1) # Diagnostic plot of regression model
help(lm)
?lm
apropos("lm")
## [1] ".colMeans" ".lm.fit" "colMeans" "confint.lm"
## [5] "contr.helmert" "dummy.coef.lm" "glm" "glm.control"
## [9] "glm.fit" "KalmanForecast" "KalmanLike" "KalmanRun"
## [13] "KalmanSmooth" "kappa.lm" "lm" "lm_1"
## [17] "lm.fit" "lm.influence" "lm.wfit" "model.matrix.lm"
## [21] "nlm" "nlminb" "predict.glm" "predict.lm"
## [25] "residuals.glm" "residuals.lm" "summary.glm" "summary.lm"
example(lm)
##
## lm> require(graphics)
##
## lm> ## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
## lm> ## Page 9: Plant Weight Data.
## lm> ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
##
## lm> trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
##
## lm> group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
##
## lm> weight <- c(ctl, trt)
##
## lm> lm.D9 <- lm(weight ~ group)
##
## lm> lm.D90 <- lm(weight ~ group - 1) # omitting intercept
##
## lm> ## No test:
## lm> ##D anova(lm.D9)
## lm> ##D summary(lm.D90)
## lm> ## End(No test)
## lm> opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))
##
## lm> plot(lm.D9, las = 1) # Residuals, Fitted, ...
##
## lm> par(opar)
##
## lm> ## Don't show:
## lm> ## model frame :
## lm> stopifnot(identical(lm(weight ~ group, method = "model.frame"),
## lm+ model.frame(lm.D9)))
##
## lm> ## End(Don't show)
## lm> ### less simple examples in "See Also" above
## lm>
## lm>
## lm>
R script.
# A file with R commands
# comments
source("filewithcommands.R")
sink("recordmycommands.lis")
savehistory()
From command line:
- Rscript file, or Rscript with -e (e.g. Rscript -e 2+2)
- To exit R: quit()
Variables. R is case sensitive
var1 <- 1:10
vAr1 <- 11:20
var1
## [1] 1 2 3 4 5 6 7 8 9 10
vAr1
## [1] 11 12 13 14 15 16 17 18 19 20
Operators
- assign operator: <-
- sequence operator, for example: mynums <- 0:20
- arithmetic operators: + - * / ^ %/% (integer division) %% (modulus operator)
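As a minimal sketch of the arithmetic operators listed above (the names a and b and their values are illustrative and do not come from the text):

```r
# Illustrative values; a and b are assumptions made for this sketch
a <- 17
b <- 5
a + b     # addition
a / b     # ordinary division
a ^ 2     # exponentiation
a %/% b   # integer division: how many whole times b fits into a
a %% b    # modulus: the remainder left over by that division
# The two results are linked: a equals (a %/% b) * b + (a %% b)
a == (a %/% b) * b + (a %% b)
```

Integer division and modulus are often used together, for instance to split a count of days into whole weeks and remaining days.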
The Workspace. Objects.
ls() and objects() list the objects in the workspace
ls.str() lists and describes the objects
rm(x) deletes a variable. E.g., rm(totalCost)
str() The structure function provides information about a variable
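The workspace functions above can be tried out as follows. This is only a sketch: the objects totalCost and phasesDone are hypothetical and are created just for the illustration.

```r
# Hypothetical objects, created only for this sketch
totalCost <- 120
phasesDone <- c("reqs", "dev")

ls()                  # names of the objects in the workspace
ls.str()              # names plus a short description of each object
str(phasesDone)       # structure of a single object
rm(totalCost)         # delete one object
exists("totalCost")   # FALSE once the object has been removed
```

Note that rm() gives no confirmation, so checking with ls() or exists() afterwards is a simple way to see what is left.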
RStudio, RCommander and RKWard are well-known IDEs for R (more later).
Four # ('####') at the end of a comment line create a code section in RStudio. Separately, R itself has environments: an environment binds a set of names to a set of values. You can think of an environment as a bag of names.
Working directories:
# set your working directory
# setwd("~/workingDir/")
getwd()
## [1] "/home/drg/Projects/DASE"
# record R commands:
# sink("recordmycommands.txt", append = TRUE)
1.3 Basic Data Types
class( )
logical: TRUE, FALSE
numeric, integer: is.numeric( ), is.integer( )
character
Examples:
TRUE
## [1] TRUE
class(TRUE)
## [1] "logical"
FALSE
## [1] FALSE
NA # missing
## [1] NA
class(NA)
## [1] "logical"
T
## [1] TRUE
F
## [1] FALSE
NaN
## [1] NaN
class(NaN)
## [1] "numeric"
# numeric data type
2
## [1] 2
class(2)
## [1] "numeric"
2.5
## [1] 2.5
# integer
2L
## [1] 2
class(2L)
## [1] "integer"
is.numeric(2)
## [1] TRUE
is.numeric(2L)
## [1] TRUE
is.integer(2)
## [1] FALSE
is.integer(2L)
## [1] TRUE
is.numeric(NaN)
## [1] TRUE
data type coercion:
as.numeric( )
as.character( )
as.integer( )
Examples:
truenum <- as.numeric(TRUE)
truenum
## [1] 1
class(truenum)
## [1] "numeric"
falsenum <- as.numeric(FALSE)
falsenum
## [1] 0
num2char <- as.character(55)
num2char
## [1] "55"
char2num <- as.numeric("55.3")
char2int <- as.integer("55.3")
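The last two conversions above print nothing, so it may help to look at what coercion actually returns. The sketch below uses illustrative inputs; negint and notnum are names introduced here, not taken from the text.

```r
# All input values below are illustrative
char2num <- as.numeric("55.3")   # 55.3
char2int <- as.integer("55.3")   # 55: the fractional part is dropped
negint   <- as.integer(-2.9)     # -2: truncation is toward zero, not rounding

# Text that does not look like a number becomes NA. R also emits a warning,
# silenced here with suppressWarnings() so the sketch runs quietly.
notnum <- suppressWarnings(as.numeric("abc"))
is.na(notnum)                    # TRUE
```

Because a failed coercion yields NA rather than an error, it is worth checking converted data with is.na() before using it.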
1.4 Vectors
Examples:
phases <- c("reqs", "dev", "test1", "test2", "maint")
str(phases)
## chr [1:5] "reqs" "dev" "test1" "test2" "maint"
is.vector(phases)
## [1] TRUE
thevalues <- c(15, 60, 30, 35, 22)
names(thevalues) <- phases
str(thevalues)
## Named num [1:5] 15 60 30 35 22
## - attr(*, "names")= chr [1:5] "reqs" "dev" "test1" "test2" ...
thevalues
## reqs dev test1 test2 maint
## 15 60 30 35 22
A single value is a vector! Example:
aphase <- 44
is.vector(aphase)
## [1] TRUE
length(aphase)
## [1] 1
length(thevalues)
## [1] 5
1.4.1 Coercion for vectors
thevalues1 <- c(15, 60, "30", 35, 22)
class(thevalues1)
## [1] "character"
thevalues1
## [1] "15" "60" "30" "35" "22"
# <- is equivalent to assign ( )
assign("costs", c(50, 100, 30))
1.4.2 Vector arithmetic
An arithmetic operation is applied to every element of the vector. For example:
assign("costs", c(50, 100, 30))
costs/3
## [1] 16.66667 33.33333 10.00000
costs - 5
## [1] 45 95 25
costs <- costs - 5
incomes <- c(200, 800, 10)
earnings <- incomes - costs
sum(earnings)
## [1] 845
# R recycles values in vectors!
vector1 <- c(1,2,3)
vector2 <- c(10,11,12,13,14,15,16)
vector1 + vector2
## Warning in vector1 + vector2: longer object length is not a multiple of shorter
## object length
## [1] 11 13 15 14 16 18 17
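The warning above appears because 7 is not a multiple of 3. As a small sketch with illustrative vectors (short and long are names assumed here), recycling is silent when the longer length is an exact multiple of the shorter one:

```r
# Illustrative vectors; lengths 2 and 4, so recycling fits exactly
short <- c(1, 2)
long  <- c(10, 20, 30, 40)

short + long   # short is reused as 1 2 1 2, giving 11 22 31 42
long * 2       # a single number is a vector of length one, recycled 4 times
```

Silent recycling is convenient for scalars but can hide length mismatches, so comparing length() of both operands is a reasonable check when the result looks odd.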
Subsetting vectors
### Subsetting vectors []
phase1 <- phases[1]
phase1
## [1] "reqs"
phase3 <- phases[3]
phase3
## [1] "test1"
thevalues[phase1]
## reqs
## 15
thevalues["reqs"]
## reqs
## 15
testphases <- phases[c(3,4)]
thevalues[testphases]
## test1 test2
## 30 35
### Negative indexes
phases1 <- phases[-5]
phases
## [1] "reqs" "dev" "test1" "test2" "maint"
phases1
## [1] "reqs" "dev" "test1" "test2"
#phases2 <- phases[-testphases] ## error in argument
phases2 <- phases[-c(3,4)]
phases2
## [1] "reqs" "dev" "maint"
### subset using logical vector
phases3 <- phases[c(FALSE, TRUE, TRUE, FALSE)] # the first value is recycled
phases3
## [1] "dev" "test1"
selectionv <- c(FALSE, TRUE, TRUE, FALSE)
phases3 <- phases[selectionv]
phases3
## [1] "dev" "test1"
selectionvec2 <- c(TRUE, FALSE)
thevalues2 <- thevalues[selectionvec2]
thevalues2
## reqs test1 maint
## 15 30 22
### Generating regular sequences with `:` and `seq`
aseqofvalues <- 1:20
aseqofvalues2 <- seq(from=-3, to=3, by=0.5)
aseqofvalues2
## [1] -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0
aseqofvalues3 <- seq(0, 100, by=10)
aseqofvalues4 <- aseqofvalues3[c(2, 4, 6, 8)]
aseqofvalues4
## [1] 10 30 50 70
aseqofvalues4 <- aseqofvalues3[-c(2, 4, 6, 8)]
aseqofvalues4
## [1] 0 20 40 60 80 90 100
aseqofvalues3[c(1,2)] <- c(666,888)
aseqofvalues3
## [1] 666 888 20 30 40 50 60 70 80 90 100
### Logical values in vectors TRUE/FALSE
aseqofvalues3 > 50
## [1] TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
aseqofvalues5 <- aseqofvalues3[aseqofvalues3 > 50]
aseqofvalues5
## [1] 666 888 60 70 80 90 100
aseqofvalues6 <- aseqofvalues3[!(aseqofvalues3 > 50)]
aseqofvalues6
## [1] 20 30 40 50
### Comparison functions
aseqofvalues7 <- aseqofvalues3[aseqofvalues3 == 50]
aseqofvalues7
## [1] 50
aseqofvalues8 <- aseqofvalues3[aseqofvalues3 == 22]
aseqofvalues8
## numeric(0)
aseqofvalues9 <- aseqofvalues3[aseqofvalues3 != 50]
aseqofvalues9
## [1] 666 888 20 30 40 60 70 80 90 100
logicalcond <- aseqofvalues3 >= 50
aseqofvalues10 <- aseqofvalues3[logicalcond]
aseqofvalues10
## [1] 666 888 50 60 70 80 90 100
### Remove Missing Values (NAs)
aseqofvalues3[c(1,2)] <- c(NA,NA)
aseqofvalues3
## [1] NA NA 20 30 40 50 60 70 80 90 100
aseqofvalues3 <- aseqofvalues3[!is.na(aseqofvalues3)]
aseqofvalues3
## [1] 20 30 40 50 60 70 80 90 100
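Dropping the NAs from the vector is one option. Many summary functions can instead skip them on request through the na.rm argument. A short sketch with an illustrative vector (vals is a name assumed here):

```r
# Illustrative vector with two missing values
vals <- c(NA, NA, 20, 30, 40)

mean(vals)                 # NA: a missing value propagates into the result
mean(vals, na.rm = TRUE)   # 30: NAs are ignored for this one computation
sum(is.na(vals))           # 2: number of missing values
vals[!is.na(vals)]         # 20 30 40: the filtering approach shown above
```

Using na.rm leaves the original vector untouched, which can be preferable when the positions of the missing values still matter later.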
1.5 Arrays and Matrices
Matrices are actually long vectors.
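One way to see this, as a small sketch (the name v is illustrative), is to give an ordinary vector a dim attribute and observe that it then behaves as a matrix while still being indexable as a vector:

```r
v <- 1:12
dim(v) <- c(3, 4)   # adding a dim attribute turns the vector into a 3x4 matrix
is.matrix(v)        # TRUE
length(v)           # 12: the underlying vector is unchanged
v[5]                # 5: plain vector indexing, counting down the columns
v[2, 2]             # 5: the same element addressed by row and column
```

The values are stored column by column, which is why element 5 of the vector is row 2 of column 2.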
mymat <- matrix(1:12, nrow =2)
mymat
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 3 5 7 9 11
## [2,] 2 4 6 8 10 12
mymat <- matrix(1:12, ncol =3)
mymat
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
mymat <- matrix(1:12, nrow=2, byrow = TRUE)
mymat
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 2 3 4 5 6
## [2,] 7 8 9 10 11 12
mymat <- matrix(1:12, nrow=3, ncol=4)
mymat
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
mymat <- matrix(1:12, nrow=3, ncol=4, byrow=TRUE)
mymat
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
### recycling
mymat <- matrix(1:5, nrow=3, ncol=4, byrow=TRUE)
## Warning in matrix(1:5, nrow = 3, ncol = 4, byrow = TRUE): data length [5] is not
## a sub-multiple or multiple of the number of rows [3]
mymat
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 1 2 3
## [3,] 4 5 1 2
### rbind cbind
cbind(1:3, 1:3)
## [,1] [,2]
## [1,] 1 1
## [2,] 2 2
## [3,] 3 3
rbind(1:3, 1:3)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 1 2 3
mymat <- matrix(1)
mymat <- matrix(1:8, nrow=2, ncol=4, byrow=TRUE)
mymat
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
rbind(mymat, 9:12)
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
mymat <- cbind(mymat, c(5,9))
mymat
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 5 6 7 8 9
mymat <- matrix(1:8, byrow = TRUE, nrow=2)
mymat
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
rownames(mymat) <- c("row1", "row2")
mymat
## [,1] [,2] [,3] [,4]
## row1 1 2 3 4
## row2 5 6 7 8
colnames(mymat) <- c("col1", "col2", "col3", "col4")
mymat
## col1 col2 col3 col4
## row1 1 2 3 4
## row2 5 6 7 8
mymat2 <- matrix(1:12, byrow=TRUE, nrow=3, dimnames=list(c("row1", "row2", "row3"),
                 c("col1", "col2", "col3", "col4")))
mymat2
## col1 col2 col3 col4
## row1 1 2 3 4
## row2 5 6 7 8
## row3 9 10 11 12
### Coercion in Arrays
matnum <- matrix(1:8, ncol = 2)
matnum
## [,1] [,2]
## [1,] 1 5
## [2,] 2 6
## [3,] 3 7
## [4,] 4 8
matchar <- matrix(LETTERS[1:6], nrow = 4, ncol = 3)
matchar
## [,1] [,2] [,3]
## [1,] "A" "E" "C"
## [2,] "B" "F" "D"
## [3,] "C" "A" "E"
## [4,] "D" "B" "F"
matchars <- cbind(matnum, matchar)
matchars
## [,1] [,2] [,3] [,4] [,5]
## [1,] "1" "5" "A" "E" "C"
## [2,] "2" "6" "B" "F" "D"
## [3,] "3" "7" "C" "A" "E"
## [4,] "4" "8" "D" "B" "F"
### Subsetting
mymat3 <- matrix(sample(-8:15, 12), nrow=3) # sample 12 numbers between -8 and 15
mymat3
## [,1] [,2] [,3] [,4]
## [1,] 0 5 -3 -2
## [2,] 2 7 3 13
## [3,] -1 1 9 11
mymat3[2,3]
## [1] 3
mymat3[1,4]
## [1] -2
mymat3[3,]
## [1] -1 1 9 11
mymat3[,4]
## [1] -2 13 11
mymat3[5] # counts elements by column
## [1] 7
mymat3[9]
## [1] 9
## Subsetting multiple elements
mymat3[2, c(1,3)]
## [1] 2 3
mymat3[c(2,3), c(1,3,4)]
## [,1] [,2] [,3]
## [1,] 2 3 13
## [2,] -1 9 11
rownames(mymat3) <- c("r1", "r2", "r3")
colnames(mymat3) <- c("c1", "c2", "c3", "c4")
mymat3["r2", c("c1", "c3")]
## c1 c3
## 2 3
### Subset by logical vector
mymat3[c(FALSE, TRUE, FALSE),
       c(TRUE, FALSE, TRUE, FALSE)]
## c1 c3
## 2 3
mymat3[c(FALSE, TRUE, TRUE),
       c(TRUE, FALSE, TRUE, TRUE)]
## c1 c3 c4
## r2 2 3 13
## r3 -1 9 11
### matrix arithmetic
row1 <- c(220, 137)
row2 <- c(345, 987)
row3 <- c(111, 777)
mymat4 <- rbind(row1, row2, row3)
rownames(mymat4) <- c("row_1", "row_2", "row_3")
colnames(mymat4) <- c("col_1", "col_2")
mymat4
## col_1 col_2
## row_1 220 137
## row_2 345 987
## row_3 111 777
mymat4/10
## col_1 col_2
## row_1 22.0 13.7
## row_2 34.5 98.7
## row_3 11.1 77.7
mymat4-100
## col_1 col_2
## row_1 120 37
## row_2 245 887
## row_3 11 677
mymat5 <- rbind(c(50,50), c(10,10), c(100,100))
mymat5
## [,1] [,2]
## [1,] 50 50
## [2,] 10 10
## [3,] 100 100
mymat4 - mymat5
## col_1 col_2
## row_1 170 87
## row_2 335 977
## row_3 11 677
mymat4 * (mymat5/100)
## col_1 col_2
## row_1 110.0 68.5
## row_2 34.5 98.7
## row_3 111.0 777.0
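All the operations above, including *, work element by element. The algebraic matrix product is a different operation, written %*% in R. A minimal sketch contrasting the two, using small illustrative matrices a and b that are not part of the text:

```r
# Illustrative 2x2 matrices, filled column by column
a <- matrix(1:4, nrow = 2)                 # rows: (1, 3) and (2, 4)
b <- matrix(c(10, 20, 30, 40), nrow = 2)   # rows: (10, 30) and (20, 40)

a * b     # element-wise product: rows (10, 90) and (40, 160)
a %*% b   # matrix product:       rows (70, 150) and (100, 220)
```

The element-wise form needs matrices of the same shape, whereas %*% needs the number of columns of the first to match the number of rows of the second.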
### index matrices
m1 <- array(1:20, dim=c(4,5))
m1
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 5 9 13 17
## [2,] 2 6 10 14 18
## [3,] 3 7 11 15 19
## [4,] 4 8 12 16 20
index <- array(c(1:3, 3:1), dim=c(3,2))
index
## [,1] [,2]
## [1,] 1 3
## [2,] 2 2
## [3,] 3 1
#use the "index matrix" as the index for the other matrix
m1[index] <- 0
m1
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 5 0 13 17
## [2,] 2 0 10 14 18
## [3,] 0 7 11 15 19
## [4,] 4 8 12 16 20
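An index matrix can be used for reading as well as for assignment: each of its rows is treated as one (row, column) pair. The sketch below repeats the idea on a fresh array; the names m and idx are illustrative.

```r
m <- array(1:20, dim = c(4, 5))
idx <- cbind(c(1, 2, 3), c(3, 2, 1))   # pairs (1,3), (2,2) and (3,1)

m[idx]   # 9 6 3: one value per row of idx, not a sub-matrix
```

This differs from m[c(1, 2, 3), c(3, 2, 1)], which would return the full 3x3 block formed by those rows and columns.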
1.6 Factors
- Factors are variables in R which take on a limited number of different values; such variables are often referred to as ‘categorical variables’ and ‘enumerated type’.
- Factors in R are stored as a vector of integer values with a corresponding set of character values to use when the factor is displayed.
- The function factor is used to encode a vector as a factor
personnel <- c("Analyst1", "ManagerL2", "Analyst1", "Analyst2",
               "Boss", "ManagerL1", "ManagerL2", "Programmer1",
               "Programmer2", "Programmer3", "Designer1","Designer2",
               "OtherStaff") # staff in a company
personnel_factors <- factor(personnel)
personnel_factors #sorted alphabetically
## [1] Analyst1 ManagerL2 Analyst1 Analyst2 Boss ManagerL1
## [7] ManagerL2 Programmer1 Programmer2 Programmer3 Designer1 Designer2
## [13] OtherStaff
## 11 Levels: Analyst1 Analyst2 Boss Designer1 Designer2 ManagerL1 ... Programmer3
str(personnel_factors)
## Factor w/ 11 levels "Analyst1","Analyst2",..: 1 7 1 2 3 6 7 9 10 11 ...
personnel2 <- factor(personnel,
                     levels = c("Boss", "ManagerL1", "ManagerL2",
                                "Analyst1", "Analyst2", "Designer1",
                                "Designer2", "Programmer1", "Programmer2",
                                "Programmer3", "OtherStaff"))
#do not duplicate the same factors
personnel2
## [1] Analyst1 ManagerL2 Analyst1 Analyst2 Boss ManagerL1
## [7] ManagerL2 Programmer1 Programmer2 Programmer3 Designer1 Designer2
## [13] OtherStaff
## 11 Levels: Boss ManagerL1 ManagerL2 Analyst1 Analyst2 Designer1 ... OtherStaff
str(personnel2)
## Factor w/ 11 levels "Boss","ManagerL1",..: 4 3 4 5 1 2 3 8 9 10 ...
# a factor's levels will always be character values.
levels(personnel2) <- c("B", "M1", "M2", "A1", "A2",
"D1", "D2", "P1", "P2", "P3", "OS")
personnel2
## [1] A1 M2 A1 A2 B M1 M2 P1 P2 P3 D1 D2 OS
## Levels: B M1 M2 A1 A2 D1 D2 P1 P2 P3 OS
personnel3 <- factor(personnel,
                     levels = c("Boss", "ManagerL1", "ManagerL2",
                                "Analyst1", "Analyst2", "Designer1",
                                "Designer2", "Programmer1", "Programmer2",
                                "Programmer3", "OtherStaff"),
                     c("B", "M1", "M2", "A1", "A2", "D1", "D2",
                       "P1", "P2", "P3", "OS"))
personnel3
## [1] A1 M2 A1 A2 B M1 M2 P1 P2 P3 D1 D2 OS
## Levels: B M1 M2 A1 A2 D1 D2 P1 P2 P3 OS
### Nominal versus ordinal, ordered factors
personnel3[1] < personnel3[2] # warning and NA: the factor is not ordered
## Warning in Ops.factor(personnel3[1], personnel3[2]): '<' not meaningful for
## factors
## [1] NA
tshirts <- c("M", "L", "S", "S", "L", "M", "L", "M")
tshirt_factor <- factor(tshirts, ordered = TRUE,
                        levels = c("S", "M", "L"))
tshirt_factor
## [1] M L S S L M L M
## Levels: S < M < L
tshirt_factor[1] < tshirt_factor[2]
## [1] TRUE
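The integer storage described at the start of this section can be inspected directly. In this sketch, sizes is an illustrative factor introduced here; as.integer() exposes the stored codes and levels() the character values they point to.

```r
# Illustrative factor with an explicit level order
sizes <- factor(c("M", "L", "S", "S"), levels = c("S", "M", "L"))

levels(sizes)       # "S" "M" "L": the character values
as.integer(sizes)   # 2 3 1 1: position of each element in levels(sizes)
nlevels(sizes)      # 3
table(sizes)        # counts per level, reported in level order
```

Because the codes are positions in the level vector, changing the order of the levels changes the codes but not the displayed values.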
1.7 Lists
Lists are R objects that can contain elements of different types: numbers, strings, vectors and other lists. A list can also contain a matrix or a function as one of its elements.
A list is created using the list() function.
Operators for subsetting lists include:
- '[' returns a list
- '[[' returns the list element
- '$' returns the content of that element in the list
cbo <- jdt$cbo
bugs <- jdt$bugs
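The difference between the three subsetting operators is easiest to see on a small list. The list project below, with its element names, is hypothetical and is used only for this sketch.

```r
# Hypothetical list mixing a string, a numeric vector and a logical value
project <- list(name = "demo", effort = c(15, 60, 30), done = FALSE)

project["effort"]            # a list of length one that still wraps the vector
project[["effort"]]          # the numeric vector itself
project$effort               # same result as [[ ]], using the element name
class(project["effort"])     # "list"
class(project[["effort"]])   # "numeric"
```

In practice, [ is the choice for keeping several elements as a list, while [[ or $ extracts a single element for further computation, such as sum(project$effort).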