R is an interpreted computer language used primarily for statistical analysis.
From the R FAQ:
R is a system for statistical computation and graphics. It consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files.
The design of R has been heavily influenced by two existing languages: Becker, Chambers & Wilks’ S (see What is S?) and Sussman’s Scheme. Whereas the resulting language is very similar in appearance to S, the underlying implementation and semantics are derived from Scheme. See What are the differences between R and S?, for further details.
The core of R is an interpreted computer language which allows branching and looping as well as modular programming using functions. Most of the user-visible functions in R are written in R. It is possible for the user to interface to procedures written in the C, C++, or FORTRAN languages for efficiency. The R distribution contains functionality for a large number of statistical procedures. Among these are: linear and generalized linear models, nonlinear regression models, time series analysis, classical parametric and nonparametric tests, clustering and smoothing. There is also a large set of functions which provide a flexible graphical environment for creating various kinds of data presentations. Additional modules (‘add-on packages’) are available for a variety of specific purposes (see R Add-On Packages).
The name is partly based on the (first) names of the first two R authors (Robert Gentleman and Ross Ihaka), and partly a play on the name of the Bell Labs language ‘S’, which stood for System.’
I highly recommend using a package manager to install R and RStudio. Otherwise, R and RStudio must be downloaded and installed separately. R cannot be updated from within RStudio, and installing a new R version can require reinstallation of packages. It’s difficult to manage.
Install Homebrew. Homebrew includes Homebrew-Cask to manage other macOS applications.
pip
, and will integrate into
the Anaconda environment.$ conda list
and
$ pip list
.$ conda search
.$ conda update --prefix <PATH> anaconda
$ conda update --all
When I was working with R, I preferred to install R and RStudio via Anaconda for easier version and package management. Otherwise, R and RStudio must be downloaded and installed separately. R cannot be updated from within RStudio, and installing a new R version can require reinstallation of packages. It’s just too difficult to manage.
Install R and RStudio via Homebrew
Update Homebrew regularly with brew update
and
brew upgrade
Check health of installation with
brew doctor
See docs for further info.
I highly recommend working with a notebook file format. Jupyter Notebook and R Markdown are the most popular formats.
pip
module
manager. I previously used Anaconda to manage my Python and R
distributions, and now use Homebrew. I switched because Anaconda is a
very large installation, and not as flexible or general as
Homebrew.jupyter notebook
or{r}
after the first
set of backticks. R will run this code when knitting the RMarkdown
document. If you want to include a standard Markdown code chunk without
running the code in R, leave out the curly braces.Example:
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
Change size of figures (at beginning of r code chunk in brackets):
{r, out.width = '1000px', out.height = '2000px'}
Prevent printing of code in HTML output:
echo=FALSE
github_document
output format outputs a GitHub-friendly Markdown file. It is
RStudio-specific, not a Pandoc standard, and provides some
RStudio-specific benefits like local HTML preview.
At the time of this writing, the markdown output formats (either
github_document
or md_document: variant: gfm
)
don’t seem to be outputting a table of contents, even when
toc: true
is included. The solution is to include a
separate header file containing the header and TOC.
blogdown
package is used to build the site within
RStudio.?rmarkdown::html_document
.RStudio
Projects are the best way to manage code in RStudio. RStudio
projects make it straightforward to divide your work into multiple
contexts, each with their own working directory, workspace, history, and
source documents. Projects allow use of version control with Git.
Installing packages into the project locally with renv
is
very helpful.
renv
is a package dependency management system for R projects. It helps avoid
problems caused by different package versions and installations by
giving each project its own isolated package library. renv
is separate from general package managers like Homebrew used to install
R and RStudio.renv
that is now “soft-deprecated.”
renv
is simple. Run the
following command in the console:
renv::migrate("~/path/to/repo")
.renv::restore()
, or in the Packages pane, navigate to
renv
-> Restore Library.Hands-on coding practice is essential.
Cmd+Enter
(or
use the Run toolbar button).Cmd+Enter
(or use the
Run toolbar button)Cmd+option+R
(or use
the Source toolbar button). Using the
Knit command in R Markdown documents will run the
entire document.5+5
,
5-5
, (5+5)/2
, etc)<-
or =
to define a
variable (also called an object in
other languages like Python). For example, x <- 42
or
x = 42
set x equal to 42.~
is similar to =
in Python or SAS or a
normal written equation. It means ‘versus’ or ‘as a function of.’'forty-two'
will treat
forty-two as a word. Both single and double quotes can be used, but
double quotes are preferred.TRUE
or
T
, and FALSE
or F
.class(var_name).
print(variable)
like Python or PROC PRINT
like in SAS.apropos()
to help identify the proper function to
use.?
followed
by the command of interest#
, similar to how you would use a
star to begin a note and a semicolon to conclude it in SAS. A note in R
will begin after #
and continue until the end of the line
(until you insert a line break with enter).R vs. SAS from Quick-R:
Unlike SAS, which has DATA and PROC steps, R has data structures (vectors, matrices, arrays, data frames) that you can operate on through functions that perform statistical analyses and create graphs. In this way, R is similar to PROC IML.
See R in Action 2.2 for further info.
The vector is a way to store data in R. Vectors are created with
the combine function c()
. Enter values or variable names
separated by commas:
You can also name the elements of a vector, and create a variable with the vector names for ease of use.
Select parts of a vector with brackets. For example, to create a new vector with a specific portion of another vector, type
R stores vectors as strings of data. To put the data in table format, use a data frame (see below).
From DataCamp introduction to R course:
In R, a matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns. Since you are only working with rows and columns, a matrix is called two-dimensional. You can construct a matrix in R with the matrix function, for example:
The first argument is the collection of elements that R will arrange into the rows and columns of the matrix. Here, we use
1:9
which constructs the vectorc(1,2,3,4,5,6,7,8,9)
. The argumentbyrow
indicates that the matrix is filled by the rows. If we want the vector to be filled by the columns, we just placebycol=TRUE
orbyrow=FALSE
. The third argumentnrow
indicates that the matrix should have three rows.
You can use vectors to name rows and columns of a matrix, for example:
The rowSums(matrix_name)
and
colSums(matrix_name)
functions will sum the matrix (the
first S is capital).
Use the cbind()
function to merge vectors and
matrices.
Similar to vectors, use square brackets [ ]
to
select elements from a matrix. Whereas vectors have one dimension,
matrices have two dimensions. You should therefore use a comma to
separate that which you want to select from the rows from that which you
want to select from the columns. For example:
my_matrix[1,2]
first row, second element.my_matrix[1:3,2:4]
rows 1,2,3 and columns 2,3,4.my_matrix[,1]
all elements of the first column.my_matrix[1,]
all elements of the first row.2*my_matrix
multiplied every element of
my_matrix by two, my_matrix1 - my_matrix2
creates a matrix
where each element is the product of the corresponding elements in
my_matrix1
and my_matrix2
.Similar to vectors, the standard operators like
+ - / *
etc. work in an element-wise way on matrices in R.
For example: 2*my_matrix
multiplies each element of
my_matrix
by two.
The factor is an R function used to store categorical variables.
To create factors in R, use the function factor()
.
First, create a vector that contains the observations to be converted to
a factor. For example, gender_vector
contains the sex of 5
different individuals and participants_1
contains discrete
numbers representing deidentified study participants:
When you first get a data set, you will often notice that it
contains factors with specific factor levels. However, sometimes you
will want to change the names of these levels for clarity or other
reasons. R allows you to do this with the function
levels()
:
To create an ordered factor, you have to add two additional arguments:
By setting the argument ordered=TRUE
in the function
factor()
, you indicate that the factor is ordered. With the
argument levels
you give the values of the factor in the
correct order:
Here’s an example within a functional R Markdown code block:
# Create a vector of temperature observations
temperature_vector <- c('High', 'Low', 'High', 'Low', 'Medium')
# Specify ordinal variables with given levels
factor_temperature_vector <- factor(temperature_vector,
order = TRUE,
levels = c('Low', 'Medium', 'High'))
# Display the vector
factor_temperature_vector
## [1] High Low High Low Medium
## Levels: Low < Medium < High
A data frame is a combination of vectors that functions like a spreadsheet within R.
Variables are columns, observations are rows.
Different types of vectors can be combined into a data frame, as long as the vectors have the same length.
This is also how datasets are organized in SAS.
Example data frame mtcars
:
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
## [1] 32 11
str()
shows you the structure of your
dataset. For a data frame it provides:
Example from DataCamp Introduction to R course:
Remember that a data frame is essentially a combination
of vectors. The DataCamp example above created the data frame
in two steps for instructional purposes, but you can usually build this
in to the data.frame
command.
To create a data frame on planets sorted by decreasing diameter (example from DataCamp):
[ ]
in the form
my_data_frame[rows,columns]
.
my_data_frame[1,2]
selects the second element from the
first row in my_data_frame
.my_data_frame[1:3,2:4]
selects rows 1,2,3 and columns
2,3,4 in my_data_frame
.my_data_frame[1,]
selects all elements of the first
row.type
from the planets
data frame.planets_df[1:3,1]
. You have to
know (or look up) the position of the variable type
, which
gets hard if you have a lot of variables.planets_df[1:3,'type']
.rings
, both planets_df[,5]
and
planets_df[,'rings']
work.$
sign to tell R that it only has to look up
all the elements of the variable behind the sign:
df_name$var_name
. You can combine this with the element
section above: df_name$var_name[rows]
. Note that you do not
need to specify a column because it is already specified by the variable
name.The subset()
function selects a subset of a data
frame according to conditions you specify.
subset(df_name, some.condition== 'condition')
. The first
argument of subset()
specifies the dataset for which you
want a subset. By adding the second argument, you give R the necessary
information and conditions to select the correct subset. Examples:
subset(planets_df, subset=(planets_df$rings == TRUE))
subset(planets_df, subset=(planets_df$diameter<1))
very_good = subset(cdc,cdc$genhlth=='very good')
age_gt50 = subset(cdc,cdc$age>50)
red_wine <- subset(wine_data, wine_data[, 1]=='Red')
red_usa <- subset(red_wine_data, red_wine_data$condition == 'USA')
The second subset=
inside the parentheses may not be
necessary.
Conditions can be used with logical operators &
and |
.
&
is read ‘and’ so that
subset(cdc, cdc$gender == 'f' & cdc$age > 30)
will
give you the data for women over the age of 30.
under23_and_smoke = subset(cdc,cdc$age<23 & cdc$smoke100==T)
will select respondents who are under the age of 23 and have smoked
>100 cigarettes.|
character is read ‘or’ so that
subset(cdc, cdc$gender == 'f' | cdc$age > 30)
will
select people who are women or over the age of 30.Sort data frames with order()
.
order()
is a function that gives you the position of
each element when it is applied on a variable. Let us look at the vector
a: a <- c(100,9,101)
. Now order(a)
returns
2,1,3. Since 100 is the second largest element of the vector, 9 is the
smallest element and 101 is the largest element.a[order(a)]
will give you the ordered
vector 9,100,101, since it first picks the second element of a, then the
first and then the last.Example from my postdoc Nrf1 proteomics experiments:
# Groups in alphabetical order
groups <- c('lacZ',
'Nrf1-HA bort',
'Nrf1-HA chol',
'Nrf1-HA chow')
groups[order(groups)]
## [1] "lacZ" "Nrf1-HA bort" "Nrf1-HA chol" "Nrf1-HA chow"
# Put groups in my preferred order
ordered = factor(groups,
levels = c('lacZ',
'Nrf1-HA chow',
'Nrf1-HA chol',
'Nrf1-HA bort'))
# Verify desired order of groups
groups[order(ordered)]
## [1] "lacZ" "Nrf1-HA chow" "Nrf1-HA chol" "Nrf1-HA bort"
Orderedgroups
in this case) in subsequent code.Example: Reordering columns in a data frame from Cookbook for R:
# A sample data frame
data <- read.table(header = TRUE,
text = '
id weight size
1 20 small
2 27 large
3 24 medium
')
# Reorder by column number
data[c(1, 3, 2)]
## id size weight
## 1 1 small 20
## 2 2 large 27
## 3 3 medium 24
## size id weight
## 1 small 1 20
## 2 large 2 27
## 3 medium 3 24
The above examples index into the data frame by treating it as a list (a data frame is essentially a list of vectors). You can also use matrix-style indexing, as in data[row, col], where row is left blank.
## id size weight
## 1 1 small 20
## 2 2 large 27
## 3 3 medium 24
The drawback to matrix indexing is that it gives different results when you specify just one column. In these cases, the returned object is a vector, not a data frame. Because the returned data type isn’t always consistent with matrix indexing, it’s generally safer to use list-style indexing, or the drop=FALSE option:
## weight
## 1 20
## 2 27
## 3 24
## [1] 20 27 24
# Matrix-style indexing with drop=FALSE
# preserves dimension to remain data frame
data[, 2, drop = FALSE]
## weight
## 1 20
## 2 27
## 3 24
Another example from Stack Overflow
Create the data frame
dd <- data.frame(
b = factor(
c('Hi', 'Med', 'Hi', 'Low'),
levels = c('Low', 'Med', 'Hi'),
ordered = TRUE
),
x = c('A', 'D', 'A', 'C'),
y = c(8, 3, 9, 9),
z = c(1, 1, 1, 2)
)
dd
## b x y z
## 1 Hi A 8 1
## 2 Med D 3 1
## 3 Hi A 9 1
## 4 Low C 9 2
Sort the frame by column z (descending) then by column b (ascending).
decreasing=TRUE
like the
DataCamp examples, you can just put a negative sign before the column
name (-z).You can use the
order()
function directly without resorting to add-on tools – see this simpler answer which uses a trick right from the top of the example(order) code:
Edit some 2+ years later: It was just asked how to do this by column index. The answer is to simply pass the desired sorting column(s) to the order() function, rather than using the name of the column (and
with()
for easier/more direct access).
To merge two data frames (datasets) horizontally, use the
merge
function. In most cases, you join two data frames by
one or more common key variables (i.e., an inner join).
To join two data frames (datasets) vertically, use the
rbind
function. The two data frames must have the same
variables, but they do not have to be in the same order.
If data_frame_A
has variables that
data_frame_B
does not, then either:
rbind( )
.Stack and unstack using the PlantGrowth dataset included with R:
Unstack
separates out the groups into a data
frame.To create a boxplot of the three groups:
boxplot(independent_var~dependent_var)
. See the Plots section.A list in R gathers a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, other lists, etc. The objects do not need to be related to each other.
To construct a list you use the function list()
:
The arguments to the list function are the list components. To
name items in the list:
list(name1=your.component1, name2=component2,…)
Lists will often be built out of numerous elements and components. Getting a single element, multiple elements, or a component out of the list is not always straightforward.
List components can be selected with square brackets as usual,
like list[1]
.
length()
returns the total number of components in
the list. For example, to identify the last actor in a list of actors in
the movie The Shining:
As with data frames, components can be referenced by name:
c()
is used to add elements to
a list: c(list1,some.object)
c(list1,new.item.name=some.object)
In RStudio, datasets can be loaded from CSV, from web using Tools -> Import dataset or by clicking Import dataset in the environment pane.
This is one of the best ways to read in a dataset. Note that the file path should be in UNIX format. UNIX file paths have forward slashes, Windows file paths have backslashes (‘whacks’).
There are many other parameters that can be specified when reading in a dataset. See R help.
Export the excel file to a CSV and follow instructions above, or use the xlsx package:
URL must be enclosed in quotes to make it a string.
scan()
function can be used to import a vector or
list.dplyr
package is one of the most popular for
working with data in R. It is similar to Python Pandas. See Tom
Augspurger’s Gist on dplyr and Pandas.na.omit(dataset_name$optional_var_name)
will omit any
missing values (usually labeled ‘NA’)Exporting to a delimited text file:
x
is the object and outfile
is the
target file. For example, the statement
write.table(mydata, 'mydata.txt', sep=',')
would save the
dataset mydata
to a comma-delimited file named mydata.txt
in the current working directory. Include a path (for example,
'c:/myprojects/mydata. txt'
) to save the output file
elsewhere. Replacingsep=','
with sep='\t'
would save the data in a tab-delimited file. By default, strings are
enclosed in quotes (’’) and missing values are written as
NA
.Exporting to an Excel workbook using the xlsx package:
Mean: mean()
Groups of means can be compared with the by()
function:
Median: median()
Variance: variance()
Absolute value: abs(x)
Square root: sqrt(x)
Log: log(x)
Computing summary statistics of data subsets:
aggregate(model_formula, df_name, mean)
. This function is
useful if you want to calculate means of each group in a
dataset.
Frequency table (to count the number of times each category
occurs in a data frame): table(df_name)
For a specific variable:
table(df_name$var_name)
The table
command can also be used to combine
multiple vectors of different types into a table:
Relative frequency table:
The summary()
function gives you an overview
description of a variable (minimum, first quartile, median, mean, third
quartile, and maximum). The describe()
function is similar,
but includes more information on the distribution (sd, range, skew,
kurtosis). Remember that skew is classified by the direction of the data
tail.
Using comparisons is a good way to facilitate data analysis.
To select all elements of a vector that are greater than zero,
type new_vector_name <- vector_name >0
To compare vectors/variables (can also use <
or
==
):
To compare multiple elements of a vector:
To create a ratio:
Proportion:
You can calculate a proportion based on a subset of the data. Example:
Sum vectors with the sum()
function, and compare
sums with var_name>other_var_name
.
The scale()
function converts values to z
scores.
95% confidence for a sample size n means that 95% of random samples of size n will yield confidence intervals that contain the true average value.
The margin or error is equal to the radius (or half the width) of the confidence interval.
Sample code:
Calculating the critical value for a confidence interval:
qnorm(0.95)
Margin of error:
Imagine you’ve set out to survey 1000 people on two questions: are you female? and are you left-handed? Since both of these sample proportions were calculated from the same sample size, they should have the same margin of error, right? Wrong! While the margin of error does change with sample size, it is also affected by the proportion.
The first step is to make a vector
p
that is a sequence from 0 to 1 with each number separated by 0.01:
We then create a vector of the margin of error (
me
) associated with each of these values ofp
using the familiar approximate formula (ME = 2 X SE):
## Warning in sqrt(p - (1 - p)/n): NaNs produced
Finally, plot the two vectors against each other to reveal their relationship:
Simulating a coin flip with sample()
:
The code above simulates 100 flips of a fair coin.
Add different probability with
The sample can also be saved as a vector for further analysis (such as calculating the mean, or other statistics)
The for loop is a cornerstone of computer programming. The idea behind the for loop is iteration: it allows you to execute code as many times as you want without having to type out every iteration.
Example for loop (from DataCamp data analysis and statistical inference course, also see Verzani simpleR p.48):
In this example, we want to iterate the two lines of code inside the curly braces that take a random sample of size 50 from area then save the mean of that sample into the sample means50 vector. Without the for loop, this would be painful:
sample_means50 <- rep(NA, 5000)
samp <- sample(area, 50)
sample_means50[1] <- mean(samp)
samp <- sample(area, 50)
sample_means50[2] <- mean(samp)
samp <- sample(area, 50)
sample_means50[3] <- mean(samp)
samp <- sample(area, 50)
sample_means50[4] <- mean(samp)
and so on… With the for loop, these thousands of lines of code are compressed into a handful of lines:
# The 'ames' data frame and 'area' and 'price' objects are already loaded
# Set up an empty vector of 5000 NAs to store sample means:
sample_means50 = rep(NA, 5000)
# Take 5000 samples of size 50 of 'area' and store in 'sample_means50'.
# Also print 'i' (the index counter):
for (i in 1:5000)
{
samp = sample(area, 50)
sample_means50[i] = mean(samp)
print(i)
}
In the first line we initialize a vector. In this case, we created a vector of 5000 NAs called
sample means50
. This vector will store values generated within the for loop. NA means not available, and in this case they’re used as placeholders until we fill in the values with actual sample means. NA is also often used for missing data in R.The second line calls the for loop itself. The syntax can be loosely read as, ‘for every element
i
from 1 to 5000, run the following lines of code’. You can think ofi
as the counter that keeps track of which loop you’re on. Therefore, more precisely, the loop will run once wheni
=1, then once wheni
=2, and so on up toi
=5000.The body of the for loop is the part inside the curly braces, and this set of code is run for each value of
i
. Here, on every loop, we take a random sample of size 50 fromarea
(note: the sample size can be previously defined by creating a vector), take its mean, and store it as the ith element ofsample_means50
. In order to display that this is really happening, we asked R to printi
at each iteration. This line of code is optional and is only used for displaying what’s going on while the for loop is running.The for loop allows us to not just run the code 5000 times, but to neatly package the results, element by element, into the empty vector that we initialized at the outset. You can also run multiple distributions through the same for loop:
# The 'ames' data frame and 'area' and 'price' objects are already loaded
# Initialize the sample distributions:
sample_means10 = rep(NA, 5000)
sample_means100 = rep(NA, 5000)
# Run the for loop:
for (i in 1:5000) {
samp = sample(area, 10)
sample_means10[i] = mean(samp)
samp = sample(area, 100)
sample_means100[i] = mean(samp)
}
Don’t worry about the fact that
samp
is used for the name of two different objects. In the second command of the for loop, the mean ofsamp
is saved to the relevant place in the vectorsample_means10
. With the mean saved, we’re now free to overwrite the object samp with a new sample, this time of size 100. In general, anytime you create an object using a name that is already in use, the old object will get replaced with the new one, i.e. R will write over the existing object with the new one, which is something you want to be careful about if you don’t intend to do so.’Another example: Obtain a random sample, calculate the sample’s mean and standard deviation, use these statistics to calculate a confidence interval, repeat steps (1)-(3) 50 times.
For loops can also be used for bootstrapping.
Bootstrapping is a way of amplifying your data for estimation.
Definition (from Google):
a technique of loading a program into a computer by means of a few initial instructions that enable the introduction of the rest of the program from an input device.
There is also a boot
package available for
R
Sampling vs. bootstrapping: The sampling distribution is calculated by resampling from the population, the bootstrap distribution is calculated by resampling from the sample.
Example: bootstrap confidence interval for the average weight gained by all mothers during pregnancy. (From Datacamp data analysis and statistical inference course lab 4)
Writing a for loop every time you want to calculate a bootstrap
interval or run a randomization test is cumbersome. This is a custom
function that automates the process. By default the
inference
function takes 10,000 bootstrap samples (instead
of the 100 you’ve taken above), creates a bootstrap distribution, and
calculates the confidence interval, using the percentile
method.
Example
# The 'nc' data frame is already loaded into the workspace
# Load the 'inference' function:
load(url('http://s3.amazonaws.com/assets.datacamp.com/course/dasi/inference.Rdata'))
# Run the inference function:
inference(nc$gained, type='ci', method='simulation',
conflevel=0.9, est='mean', boot_method='perc'
)
method='se'
will perform the method based on the
standard errorest='median'
will create an interval for the median
instead of the meanExample (Datacamp data analysis and statistical inference course lab 4):
use the inference function for evaluating whether there is a difference between the average birth weights of babies born to smoker and non-smoker mothers:
# The 'nc' data frame is already loaded into the workspace
inference(y = nc$weight, x = nc$habit,
est = 'mean', type = 'ci', null = 0,
alternative = 'twosided', method = 'theoretical',
order=c('smoker','nonsmoker')
)
The first argument is y, which is the response variable that we are interested in:
nc$weight
.The second argument is the grouping variable
x
, which is the explanatory variable multiplied by the grouping variable across the levels of which we’re comparing the average value for the response variable, smokers and non-smokers:nc$habit
.The third argument
est
is the parameter we’re interested in: ‘mean’ (other options are ‘median’, or ‘proportion’.)Next we decide on the type of inference we want: a hypothesis test (‘ht’) or a confidence interval(‘ci’).
When performing a hypothesis test, we also need to supply the null value, which in this case is 0, since the null hypothesis sets the two population means equal to each other.
The alternative hypothesis can be ‘less’, ‘greater’, or ‘twosided’.
The method of inference can be ‘theoretical’ or ‘simulation’ based.
The order function changes the order of the variable levels.
You can also add
success=
, which defines a success for survey data-type questions.
Another example:
Take a subset
spain
ofatheism
with only the respondents from Spain. Calculate the proportion of atheists within this subset. Use theinference
function on this subset, grouping them by year. (From Datacamp data analysis and statistical inference course lab 4)
# Take the 'spain' subset:
spain = subset(atheism, atheism$nationality == 'Spain')
# Calculate the proportion of atheists from the 'spain' subset:
proportion = nrow(subset(spain,response == 'atheist'))/nrow(spain)
# Use the inference function:
inference(spain$response, spain$year, est = 'proportion',
type = 'ci', method = 'theoretical', success = 'atheist'
)
t.test(y~x, var.equal = TRUE)
# when both variables are numeric
t.test(y1,y2, var.equal = TRUE)
# one-sample t-test, with null hypothesis mean mu=3.
t.test(y,mu=2, var.equal = TRUE)
# paired t-test
t.test(y~x, paired=TRUE, var.equal = TRUE)
# one-tailed t-test
t.test(y~x, alternative = 'less', var.equal = TRUE)
# one-tailed t-test
t.test(y~x, alternative = 'greater', var.equal = TRUE)
var.equal = TRUE
to indicate homogeneity of variance.Shapiro-Wilk W test (W>0.9 desired, see flowchart from PATH 591 lecture 03 p.11 in linear regression section):
Normal quantile-quantile plot
See R in action ch.8.
See R in Action tables 8.2 and 8.3 for full syntax info.
Linear regression formula symbols. Adapted from R in Action.
~
: dependent variable on left, independent variables on
right, like y~a+b
. -+
: Separates independent
variables.:
Interaction, such as y~a:b
, meaning
interaction of independent variables a and b.*
: All possible combinations of the variables.^
: Restricts interactions to a certain degree. The
formula y ~ (a + b + c)^2
would be read by R as y ~ a + b +
c + a:b + a:c + b:c..
: References all independent variables in the data
frame.-
: Subtracts a variable from the model.-1
: Suppresses the intercept.I()
: Allows use of arithmetic within the
parentheses.Linear regression functions that can be used after specifying the model. Adapted from R in Action.
aic()
or bic()
: Prints Aikake’s or Bayes’
Information Criterion.anova()
: ANOVA table.coefficients()
: Shows intercept and slopes for the
regression model.confint()
: Confidence intervals (95 percent by
default).fitted()
: Lists predicted values.plot()
: Plots for evaluating model fit. To select
specific plots, add ‘which=’.residuals()
: Residuals are basically the deviation from
the model fit.
rstandard()
for standardized residuals (divided by
standard deviation, fitted including the current data point).
rstudent()
for studentized residuals (divided by
standard deviation, but fitted ignoring the current data
point).
Stdres()
and studres()
can also be used
but require the MASS package (Modern Applied Statistics with
S).
See PATH 591 lecture 03 for a clear visual explanation of residuals.
summary()
: Summary of the regression model. To view
a specific part, such as just the R2, add
$r.squared
or $adj.r.squared
vcov()
: Lists the covariance matrix for model
parameters.Shapiro-Wilk W statistic: based on mathematics of normal probability plots. Compares variance of distribution to variance expected under normal distribution. Shapiro-Wilk W statistic trumps p value. Normality testing workflow from PATH 591 lecture 03 p.11:
Graphical evaluation by normal quantile-quantile plot
Skewness: measure of asymmetry in distribution. A perfect normal distribution has skewness=0. Positively skewed distributions=tail to the right, negatively skewed=tail to the left.
Kurtosis: another measure of asymmetry in distribution. Leptokurtic distributions (kurtosis<0) will have more observations at the center, platykurtic distributions (kurtosis>0) are spread evenly.
car
and lmtest
packages.acf(lm_name$residuals)
,
acf(rstudent(lm_name))
, or
acf(studres(lm_name))
.studres
(in the MASS package) will produce an
autocorrelation plot. The dashed lines indicate the 95% confidence
interval, and correlations ideally lie within the lines.Cook’s D is a distance statistic that measures change in regression parameter estimates when an observation is deleted.
A Bonferroni outlier test can also be performed with the
outlierTest()
function in the car
package.
Plots of Cook’s D values and residuals vs leverage allow graphical assessment of influential observations. Add a critical line to the Cook’s Distance plot with:
gvlma
packagegvlma
returns five summary values:
F0.05(k, n-1 df)=, r2, p<
This information is usually adequate for publication purposes. See PATH 591 lecture 04 A p.8.
ANOVA is a type of linear regression analysis with categorical independent variables. R in Action ch.9 contains an excellent section on ANOVA.
Options for basic ANOVA:
aov(model, data=data.frame)
lm()
anova()
glm()
See R in Action table 9.5 for more syntax info and model examples.
lm()
function is
more flexible than a basic ANOVA analysis with the aov()
function. It allows inclusion of both categorical and
quantitative independent variables and enables prediction.aov()
function uses Type I sums of squares, which
is not appropriate for imbalanced designs (groups of different sizes).
The Anova()
function (with a capital A) in the
car
package allows use of Type II or III sums of
squares.glm()
uses iteratively reweighted least squares (IWLS).
Further information in R in Action chapter 13.Shapiro-Wilk W test (W>0.9 desired, see flowchart from PATH 591 lecture 03 p.11 in linear regression section):
Normal quantile-quantile plot
Bartlett test: bartlett.test()
. Bartlett test is
sensitive to deviations from normality.
Levene test
Brown-Forsythe test
These HOV tests are only appropriate for one-way designs. A different approach is required for more complex designs.
Factorial designs can be converted to one-way by splitting each treatment combination into a separate group and identifying it as a single level in one independent variable, calling the one new independent variable ‘treatment,’ and then performing the Brown-Forsythe test. This does not work for repeated measures data because values for each group at one timepoint are correlated to values at other timepoints.
Variance assessment for multivariate, factorial, and repeated measures designs
I am more familiar with this in SAS. Explanation below.
In general, you want to use the simplest model possible that minimizes AIC, AICC, and BIC while still meeting the assumptions of the model. Mind the assumptions of each covariance structure. See CPSC 542 Lecture 10 pp. 4-7.
Donald Bullock (Illinois CPSC statistics professor):
You select an error structure with heterogeneous variances [like unstructured] and compare that to a model with everything the same with the exception of homogeneous variances [like compound symmetry]. The differences in the log likelihoods will be a chi-square with degrees of freedom equal to the differences in the number of estimates in the two models.
This is also described in the Littell ‘SAS for mixed models’ text p. 172.
In other words, run the code using the unstructured model, and copy/paste the ‘Fit Statistics’ table and the number of covariance parameter estimates into Excel. The number of estimates (also called covariance parameters) is given under ‘Dimensions’ in the output. Also look at the ‘-2 Res Log Likelihood’ number under ‘Fit Statistics’ in the output.
Then run the code several more times, using different covariance structures each time (cs, ar(1), arh(1), toep, and toeph), and paste the fit statistics and estimates into excel also. Then, for each test other than unstructured, type =chidist(x, df), using the difference between the ‘-2 Res Log Likelihood’ for unstructured and the other test as the ‘x’ or chi square, and the difference in #parameters between unstructured and the other test as the df.
The answer will be a p value. A significant p indicates that the unstructured model provides superior fit.
Independence of residuals is required. Any pairing in the data, such as in repeated measures designs, must be appropriately specified in the model.
or
Download and install the multcomp
package.
First, print the covariance matrix with
model.matrix(model_name)
to identify coefficients for
contrasts.
Next, use the numbers from error matrix to indicate group comparisons of interest.
Options for creation of interaction plots:
Using aov()
or lm()
is the simplest method,
but also the least flexible. It assumes sphericity, and thus is not
suitable for most real-world datasets. A mixed models approach is
required to appropriately model covariance. R in Action does not cover
mixed models in depth.
A few options:
Anova()
in the car
packagelmer()
in the lme4
package. See CRAN
7.35 for discussion of p values in this function.lme()
or gls()
in the nlme
packageAnalysis of covariance includes both fixed effects (independent variables) and random effects (covariates included to model random variation). If the effect can be reproduced, it’s fixed. With fixed effects you are interested in the means, but with random effects you are interested in variances. Blocks could be either fixed or random; for example, if you are doing a test on trees and you want to block for forest, setting blocks as random would indicate that you want to block for just all forests everywhere in general and these forests are representative of all forests, whereas setting blocks as fixed would indicate that you want to block for these specific forests you’re evaluating. See CPSC 440 Lecture Notes Chapter 08 ANOVA.
The covariance structure must be appropriately
specified. Fit statistics and information criteria can be used
to identify an appropriate covariance structure:
AIC(model)
. See ‘Selection of regression models’ in the
Linear regression section, and ‘Variance assessment for multivariate,
factorial, and repeated measures designs’ in the ANOVA section
above.
Useful when both dependent and independent variables are categorical, or when dependent variable is a count.
It evaluates the association, or contingency, between two kinds of classifications.
Data must first be put into a contingency table:
Next, run the test and reference the table
See R in action ch. 7.2, Crawley The R Book ch. 14-15, UCSC R tips.
Example from my grad school lab: Prostate cancer incidence data from TRAMP Tomato Soy paper
Non-parametric one-way ANOVA, where y
is a dependent
variable and a
is a categorical independent variable with
≥2 levels.
A non-parametric Kruskal-Wallis test is used when data fail the assumption of normality, and are not amenable to transformation. It is still worthwhile to perform the usual normality testing to demonstrate this.
HOV is still required. Use a test that is not affected by non-normality, such as Brown-Forsythe.
This function allows control of the experiment-wide error rate and is the best non-parametric multiple comparisons method available in R.
Adjustment options:
holm
hochberg
hommel
bonferroni
BH
BY
fdr
none
From R Documentation:
The first four methods are designed to give strong control of the family-wise error rate. There seems no reason to use the unmodified Bonferroni correction because it is dominated by Holm’s method, which is also valid under arbitrary assumptions.
PMCMRplus
package (Pairwise Multiple
Comparisons of Mean Rank Sums). Note that the original
PMCMR
package has been deprecated.npmc
packageThere used to be an npmc package that is referred to in some documentation. It has been deprecated.
y
is the numeric outcome variable, A
is a grouping variable, and B
is a blocking variable that
identifies matched observations. In both cases, data
is an
option argument specifying a matrix or data frame containing the
variables.Try the nlme
, MultNonParam
,
npmlreg
, or nparLD
packages.
See R in Action ch.10
See R in Action ch.14
Bioconductor is a repository of R packages for bioinformatic analysis.
Install core packages (run the commands sequentially). These commands can also be used to update packages.
Install specific packages:
Troubleshoot installations to flag packages that are either
out-of-date or too new for your version of Bioconductor. The output
suggests ways to solve identified problems, and the help page
?biocValid
lists arguments influencing the behavior of the
function.
The ddCt package for Bioconductor performs automated ΔΔCt calculations on qPCR data. Install with
It seems to require a very specific format for the files output from SDS, and I was not able to get it to work with my data. It would be far easier to create a VBA macro in Excel.
Scatterplots: plot(x=,y=,options)
If your data frame only has two variables, you only need to
specify the data frame: plot(data.frame)
To select specific variables from a data frame to display on the
graph, use
plot(x=data.frame$variable, y= data.frame$variable)
Line chart: plot(x,y,type='l')
Plotting >2 variables:
plot(data.frame[rows,columns])
Boxplot: boxplot()
See the multiple comparisons ANOVA section above
Plotting one variable as a function of another is very similar to
creating a linear model: boxplot(y ~ a)
attach()
the dataset you do not need
to specify the data frame name here.New variables can also be calculated and plotted even if they are not in the original data frame:
Setting line widths for individual parts of the boxplot:
lwd=
medlwd=
outlwd=
Adding symbols for group means:
After creating the boxplot, create a new vector and use the
aggregate()
function: `means <- aggregate(model,
data.frame, mean)
Next, create a separate points()
function:
Note that, as with other functions in R, variables will
be sorted alphabetically or numerically. To change the order, you must
create a separate vector and use order()
or
factor()
to specify the order. See the section on data
frames above, and previous code.
Plotting factorial datasets: Sample code from the R Graphics Cookbook ch.2 ‘Creating a box plot’.
With the ggplot2 package, you can get a similar result using qplot() (Figure 2-11), with geom=‘boxplot’:
If the two vectors are already in the same data frame, you can use the following syntax:
This is equivalent to:
It’s also possible to make box plots for multiple variables, by combining the variables with interaction(), as in Figure 2-11. In this case, the dose variable is numeric, so we must convert it to a factor to use it as a grouping variable:
# Using three separate vectors
qplot(interaction(ToothGrowth$supp, ToothGrowth$dose),
ToothGrowth$len,
geom = 'boxplot')
Alternatively, get the columns from the data frame:
This is equivalent to:
Dot plot
Histogram: hist()
Histograms are useful for inspecting the distribution of the dataset.
To control histogram divisions and limits of x axis:
Combined boxplot and histogram:
simple.hist.and.boxplot()
Bar chart: barplot()
Pie chart: pie(data.frame)
Mosaic plot (useful for showing how different categories are related):
Volcano plots
See Quick-R, and of course R in Action, for thorough documentation.
Many graphical options are specified with the par()
function.
cex=
pch=
lty=
lwd=
Colors
Colors can be specified by index number, name, hexadecimal, or RGB.
See Earl Glynn’s color chart for a visual listing of colors in R by index number.
View all available color names: color()
The basic palette consists of 8 colors. Other color vectors with
n contiguous colors can be created with gray(n)
,
rainbow(n)
, heat.colors(n)
,
terrain.colors(n)
, topo.colors(n)
, and
cm.colors(n)
.
The RColorBrewer
package can also be used for color
palettes. See website
for more info.
To make palettes available in R:
palette.name <- brewer.pal(n, 'name')
. Next, when
selecting a color palette for a chart, use the vector name
(palette.name
in this example).
There are 3 types of palettes, sequential, diverging, and qualitative.
To view all palettes as a plot:
Use bracketed numbers to select specific colors or a range from
within a palette: col=set1[5:9]
. Note that colors will be
repeated if necessary. To start from a specific color in the palette,
but still retain all colors in the palette, type the number for the
starting color and then add that to the total number of colors in the
palette.
For Set1: col=set1[5:14]
To color a plot by category (from Stackoverflow)
DF <- data.frame(x=1:10, y=rnorm(10)+5, z=sample(letters[1:3], 10, replace=TRUE))
attach(DF); plot(x, y, col=c('red','blue','green')[z]); detach(DF)
Fonts: See Quick-R.
Legend: legend()
. See Quick-R.
Adding
jitter when there are too many points sitting on top of each other:
plot(evals$score ~ jitter(evals$bty_avg))
or
plot(jitter(evals$score) ~ evals$bty_avg)
Margins: mar
for margin size in lines,
mai
for margin size in inches. For complete information on
margins, see Earl
Glynn’s margins tutorial.
Axes
box(which = 'plot', bty = 'l')
plot([options], ylab=expression(bold(paste('Thickness, ',mu, 'm'))))
xlab=, ylab=
xaxt='n'
suppresses the x axis,
yaxt='n'
suppresses the y axis, and axes=FALSE
suppresses both x and y axes.text()
, rather than mtext()
, as the latter
does not support par('srt')
.srt = 45
for
text rotation angle, adj = 1
to place the right end of text
at the tick marks, and xpd = TRUE
to allow for text outside
the plot region. You can adjust the value of the 0.25offset
as required to move the axis labels up or down relative to the x axis.
See ?par
for more information.# Increase bottom margin to make room for rotated labels
par(mar = c(7, 4, 4, 2) + 0.1)
# Create plot with no x axis and no x axis label
plot(1:8, xaxt = 'n', xlab = '')
# Set up x axis with tick marks alone
axis(1, labels = FALSE)
# Create some text labels
labels <- paste('Label', 1:8, sep = ' ')
# Plot x axis labels at default tick marks
text(
1:8,
par('usr')[3] - 0.25,
srt = 45,
adj = 1,
labels = labels,
xpd = TRUE
)
# Plot x axis label at line 6 (of 7)
mtext(1, text = 'X Axis Label', line = 6)
Create figure panels with
help(par)
, An Introduction to R 12.5.4 p.73,
Muenchen_2009_R for SAS and SPSS Users Ch. 21.3 for more info.The ggplot2 package can generate advanced graphs (see ggplot2 documentation here and here).
qplot()
ggplot()
.ggobi is a data visualization program that interfaces with R.
plotly is an online tool that can interface with R and Shiny. Shiny is a nice tool for creating interactive plots and presentations.