R is an interpreted computer language used primarily for statistical analysis.
From the R FAQ:
R is a system for statistical computation and graphics. It consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files.
The design of R has been heavily influenced by two existing languages: Becker, Chambers & Wilks’ S (see What is S?) and Sussman’s Scheme. Whereas the resulting language is very similar in appearance to S, the underlying implementation and semantics are derived from Scheme. See What are the differences between R and S?, for further details.
The core of R is an interpreted computer language which allows branching and looping as well as modular programming using functions. Most of the user-visible functions in R are written in R. It is possible for the user to interface to procedures written in the C, C++, or FORTRAN languages for efficiency. The R distribution contains functionality for a large number of statistical procedures. Among these are: linear and generalized linear models, nonlinear regression models, time series analysis, classical parametric and nonparametric tests, clustering and smoothing. There is also a large set of functions which provide a flexible graphical environment for creating various kinds of data presentations. Additional modules (‘add-on packages’) are available for a variety of specific purposes (see R Add-On Packages).
The name is partly based on the (first) names of the first two R authors (Robert Gentleman and Ross Ihaka), and partly a play on the name of the Bell Labs language ‘S’, which stood for System.’
I highly recommend using a package manager to install R and RStudio. Otherwise, R and RStudio must be downloaded and installed separately. R cannot be updated from within RStudio, and installing a new R version can require reinstallation of packages. It’s difficult to manage.
Install Homebrew. Homebrew includes Homebrew-Cask to manage other macOS applications.
I previously used Anaconda to manage my Python and R distributions, and now use Homebrew. I switched because Anaconda is a very large installation, and not as flexible or general as Homebrew. Conda and brew are sort of like competing package managers.
Click here for Anaconda info
pip
, and will integrate into the Anaconda environment.$ conda list
and $ pip list
.$ conda search
.$ conda update --prefix <PATH> anaconda
$ conda update --all
When I was working with R, I preferred to install R and RStudio via Anaconda for easier version and package management. Otherwise, R and RStudio must be downloaded and installed separately. R cannot be updated from within RStudio, and installing a new R version can require reinstallation of packages. It’s just too difficult to manage.
conda install -c r r-essentials
$ conda install rstudio $
Install R and RStudio via Homebrew
brew install r
brew cask install r-app
brew cask install rstudio
Update Homebrew regularly with brew update
and brew upgrade
Check health of installation with brew doctor
See docs for further info.
I highly recommend working with a notebook file format. Jupyter Notebook and R Markdown are the most popular formats.
pip
module manager. I previously used Anaconda to manage my Python and R distributions, and now use Homebrew. I switched because Anaconda is a very large installation, and not as flexible or general as Homebrew.jupyter notebook
or{r}
after the first set of backticks. R will run this code when knitting the RMarkdown document. If you want to include a standard Markdown code chunk without running the code in R, leave out the curly braces.Example:
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
Change size of figures (at beginning of r code chunk in brackets):
{r, out.width = '1000px', out.height = '2000px'}
Prevent printing of code in HTML output: echo=FALSE
github_document
output format outputs a GitHub-friendly Markdown file. It is RStudio-specific, not a Pandoc standard, and provides some RStudio-specific benefits like local HTML preview.
At the time of this writing, the markdown output formats (either github_document
or md_document: variant: gfm
) don’t seem to be outputting a table of contents, even when toc: true
is included. The solution is to include a separate header file containing the header and TOC.
output:
github_document:
includes:
in_header: R-guide-header.md
blogdown
package is used to build the site within RStudio.?rmarkdown::html_document
.RStudio Projects are the best way to manage code in RStudio. RStudio projects make it straightforward to divide your work into multiple contexts, each with their own working directory, workspace, history, and source documents. Projects allow use of version control with Git. Installing packages into the project locally with renv
is very helpful.
renv
is a package dependency management system for R projects. It helps avoid problems caused by different package versions and installations by giving each project its own isolated package library. renv
is separate from general package managers like Homebrew used to install R and RStudio.renv
that is now “soft-deprecated.”
renv
is simple. Run the following command in the console: renv::migrate("~/path/to/repo")
.renv::restore()
, or in the Packages pane, navigate to renv
-> Restore Library.Hands-on coding practice is essential.
Cmd+Enter
(or use the Run toolbar button).Cmd+Enter
(or use the Run toolbar button)Cmd+option+R
(or use the Source toolbar button). Using the Knit command in R Markdown documents will run the entire document.5+5
, 5-5
, (5+5)/2
, etc)<-
or =
to define a variable (also called an object in other languages like Python). For example, x <- 42
or x = 42
set x equal to 42.~
is similar to =
in Python or SAS or a normal written equation. It means ‘versus’ or ‘as a function of.’'forty-two'
will treat forty-two as a word. Both single and double quotes can be used, but double quotes are preferred.TRUE
or T
, and FALSE
or F
.class(var_name).
print(variable)
like Python or PROC PRINT like in SAS.apropos()
to help identify the proper function to use.?
followed by the command of interest#
, similar to how you would use a star to begin a note and a semicolon to conclude it in SAS. A note in R will begin after #
and continue until the end of the line (until you insert a line break with enter).R vs. SAS from Quick-R:
Unlike SAS, which has DATA and PROC steps, R has data structures (vectors, matrices, arrays, data frames) that you can operate on through functions that perform statistical analyses and create graphs. In this way, R is similar to PROC IML.
See R in Action 2.2 for further info.
The vector is a way to store data in R. Vectors are created with the combine function c()
. Enter values or variable names separated by commas:
c(45, 6545, 453, 3453, etc) Vector_name <-
You can also name the elements of a vector, and create a variable with the vector names for ease of use.
names(vector_name) <- c('name1', 'name2')
Select parts of a vector with brackets. For example, to create a new vector with a specific portion of another vector, type
old_vector_name[vector_element] new_vector_name <-
R stores vectors as strings of data. To put the data in table format, use a data frame (see below).
From DataCamp introduction to R course:
In R, a matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns. Since you are only working with rows and columns, a matrix is called two-dimensional. You can construct a matrix in R with the matrix function, for example:
matrix(1:9, byrow=TRUE, nrow=3)
The first argument is the collection of elements that R will arrange into the rows and columns of the matrix. Here, we use
1:9
which constructs the vectorc(1,2,3,4,5,6,7,8,9)
. The argumentbyrow
indicates that the matrix is filled by the rows. If we want the vector to be filled by the columns, we just placebycol=TRUE
orbyrow=FALSE
. The third argumentnrow
indicates that the matrix should have three rows.
You can use vectors to name rows and columns of a matrix, for example:
colnames(star_wars_matrix) <- c('US','non-US');
rownames(star_wars_matrix) <- c('A New Hope','The Empire Strikes Back','Return of the Jedi')
The rowSums(matrix_name)
and colSums(matrix_name)
functions will sum the matrix (the first S is capital).
Use the cbind()
function to merge vectors and matrices.
Similar to vectors, use square brackets [ ]
to select elements from a matrix. Whereas vectors have one dimension, matrices have two dimensions. You should therefore use a comma to separate that which you want to select from the rows from that which you want to select from the columns. For example:
my_matrix[1,2]
first row, second element.my_matrix[1:3,2:4]
rows 1,2,3 and columns 2,3,4.my_matrix[,1]
all elements of the first column.my_matrix[1,]
all elements of the first row.2*my_matrix
multiplied every element of my_matrix by two, my_matrix1 - my_matrix2
creates a matrix where each element is the product of the corresponding elements in my_matrix1
and my_matrix2
.Similar to vectors, the standard operators like `+ - / etc. work in an element-wise way on matrices in R. For example:
2my_matrixmultiplies each element of
my_matrix` by two.
The factor is an R function used to store categorical variables.
To create factors in R, use the function factor()
. First, create a vector that contains the observations to be converted to a factor. For example, gender_vector
contains the sex of 5 different individuals and participants_1
contains discrete numbers representing deidentified study participants:
c('Male','Female','Female','Male','Male')
gender_vector =1 <- c(2,3,5,7,11,13,17) participants_
When you first get a data set, you will often notice that it contains factors with specific factor levels. However, sometimes you will want to change the names of these levels for clarity or other reasons. R allows you to do this with the function levels()
:
levels(factor_vector) <- c('name1','name2',...)
To create an ordered factor, you have to add two additional arguments:
factor(some_vector,ordered=TRUE,levels=c('Level 1','Level 2'...))
By setting the argument ordered=TRUE
in the function factor()
, you indicate that the factor is ordered. With the argument levels
you give the values of the factor in the correct order:
c('Low','Medium','High') levels =
Here’s an example within a functional R Markdown code block:
# Create a vector of temperature observations
c('High', 'Low', 'High', 'Low', 'Medium')
temperature_vector <-
# Specify ordinal variables with given levels
factor(temperature_vector,
factor_temperature_vector <-order = TRUE,
levels = c('Low', 'Medium', 'High'))
# Display the vector
factor_temperature_vector
## [1] High Low High Low Medium
## Levels: Low < Medium < High
A data frame is a combination of vectors that functions like a spreadsheet within R.
data.frame()
Variables are columns, observations are rows.
Different types of vectors can be combined into a data frame, as long as the vectors have the same length.
This is also how datasets are organized in SAS.
Example data frame mtcars
:
names(mtcars) # names of columns
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
head(mtcars) # first observations
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
tail(mtcars) # last observations
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
str(mtcars) # structure of data frame
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
dim(mtcars) # dimensions
## [1] 32 11
str()
shows you the structure of your dataset. For a data frame it provides:
Example from DataCamp Introduction to R course:
Remember that a data frame is essentially a combination of vectors. The DataCamp example above created the data frame in two steps for instructional purposes, but you can usually build this in to the data.frame
command.
To create a data frame on planets sorted by decreasing diameter (example from DataCamp):
order(planets_df$diameter, decreasing=TRUE)
positions <-# this does not order the data yet, for some reason
planets_df[positions,] largest_first_df <-
[ ]
in the form my_data_frame[rows,columns]
.
my_data_frame[1,2]
selects the second element from the first row in my_data_frame
.my_data_frame[1:3,2:4]
selects rows 1,2,3 and columns 2,3,4 in my_data_frame
.my_data_frame[1,]
selects all elements of the first row.type
from the planets
data frame.planets_df[1:3,1]
. You have to know (or look up) the position of the variable type
, which gets hard if you have a lot of variables.planets_df[1:3,'type']
.rings
, both planets_df[,5]
and planets_df[,'rings']
work.$
sign to tell R that it only has to look up all the elements of the variable behind the sign: df_name$var_name
. You can combine this with the element section above: df_name$var_name[rows]
. Note that you do not need to specify a column because it is already specified by the variable name.The subset()
function selects a subset of a data frame according to conditions you specify. subset(df_name, some.condition== 'condition')
. The first argument of subset()
specifies the dataset for which you want a subset. By adding the second argument, you give R the necessary information and conditions to select the correct subset. Examples:
subset(planets_df, subset=(planets_df$rings == TRUE))
subset(planets_df, subset=(planets_df$diameter<1))
subset(cdc,cdc$genhlth=='very good')
very_good = subset(cdc,cdc$age>50)
age_gt50 = subset(wine_data, wine_data[, 1]=='Red')
red_wine <- subset(red_wine_data, red_wine_data$condition == 'USA') red_usa <-
The second subset=
inside the parentheses may not be necessary.
Conditions can be used with logical operators &
and |
.
&
is read ‘and’ so that subset(cdc, cdc$gender == 'f' & cdc$age > 30)
will give you the data for women over the age of 30. under23_and_smoke = subset(cdc,cdc$age<23 & cdc$smoke100==T)
will select respondents who are under the age of 23 and have smoked >100 cigarettes.|
character is read ‘or’ so that subset(cdc, cdc$gender == 'f' | cdc$age > 30)
will select people who are women or over the age of 30.Sort data frames with order()
.
order()
is a function that gives you the position of each element when it is applied on a variable. Let us look at the vector a: a <- c(100,9,101)
. Now order(a)
returns 2,1,3. Since 100 is the second largest element of the vector, 9 is the smallest element and 101 is the largest element.a[order(a)]
will give you the ordered vector 9,100,101, since it first picks the second element of a, then the first and then the last.Example from my postdoc Nrf1 proteomics experiments:
# Groups in alphabetical order
c('lacZ',
groups <-'Nrf1-HA bort',
'Nrf1-HA chol',
'Nrf1-HA chow')
order(groups)] groups[
## [1] "lacZ" "Nrf1-HA bort" "Nrf1-HA chol" "Nrf1-HA chow"
# Put groups in my preferred order
factor(groups,
ordered =levels = c('lacZ',
'Nrf1-HA chow',
'Nrf1-HA chol',
'Nrf1-HA bort'))
# Verify desired order of groups
order(ordered)] groups[
## [1] "lacZ" "Nrf1-HA chow" "Nrf1-HA chol" "Nrf1-HA bort"
Orderedgroups
in this case) in subsequent code.Example: Reordering columns in a data frame from Cookbook for R:
# A sample data frame
read.table(header = TRUE,
data <-text = '
id weight size
1 20 small
2 27 large
3 24 medium
')
# Reorder by column number
c(1, 3, 2)] data[
## id size weight
## 1 1 small 20
## 2 2 large 27
## 3 3 medium 24
# Reorder by column name
c('size', 'id', 'weight')] data[
## size id weight
## 1 small 1 20
## 2 large 2 27
## 3 medium 3 24
# Overwrite data variable with new info:
# data <- data[c(1,3,2)]
The above examples index into the data frame by treating it as a list (a data frame is essentially a list of vectors). You can also use matrix-style indexing, as in data[row, col], where row is left blank.
c(1, 3, 2)] data[,
## id size weight
## 1 1 small 20
## 2 2 large 27
## 3 3 medium 24
The drawback to matrix indexing is that it gives different results when you specify just one column. In these cases, the returned object is a vector, not a data frame. Because the returned data type isn’t always consistent with matrix indexing, it’s generally safer to use list-style indexing, or the drop=FALSE option:
# List-style indexing of one column
2] data[
## weight
## 1 20
## 2 27
## 3 24
# Matrix-style indexing of one column
# drops dimension to become a vector
2] data[,
## [1] 20 27 24
# Matrix-style indexing with drop=FALSE
# preserves dimension to remain data frame
2, drop = FALSE] data[,
## weight
## 1 20
## 2 27
## 3 24
Another example from Stack Overflow
Create the data frame
data.frame(
dd <-b = factor(
c('Hi', 'Med', 'Hi', 'Low'),
levels = c('Low', 'Med', 'Hi'),
ordered = TRUE
),x = c('A', 'D', 'A', 'C'),
y = c(8, 3, 9, 9),
z = c(1, 1, 1, 2)
) dd
## b x y z
## 1 Hi A 8 1
## 2 Med D 3 1
## 3 Hi A 9 1
## 4 Low C 9 2
Sort the frame by column z (descending) then by column b (ascending).
decreasing=TRUE
like the DataCamp examples, you can just put a negative sign before the column name (-z).You can use the
order()
function directly without resorting to add-on tools – see this simpler answer which uses a trick right from the top of the example(order) code:
dd[with(dd, order(-z, b)), ] R <-
Edit some 2+ years later: It was just asked how to do this by column index. The answer is to simply pass the desired sorting column(s) to the order() function, rather than using the name of the column (and
with()
for easier/more direct access).
dd[ order(-dd[,4], dd[,1]), ] R <-
To merge two data frames (datasets) horizontally, use the merge
function. In most cases, you join two data frames by one or more common key variables (i.e., an inner join).
# merge two data frames by ID
merge(data frameA,data frameB,by='ID')
total <-# merge two data frames by ID and Country
merge(data frameA,data frameB,by=c('ID','Country')) total <-
To join two data frames (datasets) vertically, use the rbind
function. The two data frames must have the same variables, but they do not have to be in the same order.
rbind(data_frame_A, data_frame_B) total <-
If data_frame_A
has variables that data_frame_B
does not, then either:
rbind( )
.Stack and unstack using the PlantGrowth dataset included with R:
unstack(PlantGrowth)
Unstack
separates out the groups into a data frame.To create a boxplot of the three groups:
boxplot(unstack(PlantGrowth))
boxplot(independent_var~dependent_var)
. See the Plots section.A list in R gathers a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, other lists, etc. The objects do not need to be related to each other.
To construct a list you use the function list()
:
list(component1, component2...) my_list <-
The arguments to the list function are the list components. To name items in the list: list(name1=your.component1, name2=component2,…)
Lists will often be built out of numerous elements and components. Getting a single element, multiple elements, or a component out of the list is not always straightforward.
List components can be selected with square brackets as usual, like list[1]
.
length()
returns the total number of components in the list. For example, to identify the last actor in a list of actors in the movie The Shining:
$actors[length(shining_list$actors)] shining_list
As with data frames, components can be referenced by name:
$reviews
shining_list'reviews']] shining_list[[
c()
is used to add elements to a list: c(list1,some.object)
c(list1,new.item.name=some.object)
In RStudio, datasets can be loaded from CSV, from web using Tools -> Import dataset or by clicking Import dataset in the environment pane.
This is one of the best ways to read in a dataset. Note that the file path should be in UNIX format. UNIX file paths have forward slashes, Windows file paths have backslashes (‘whacks’).
read.table('c:/mydata.csv', header=TRUE, sep=',') mydata <-
There are many other parameters that can be specified when reading in a dataset. See R help.
Export the excel file to a CSV and follow instructions above, or use the xlsx package:
# read in the first worksheet from the workbook myexcel.xlsx
# first row contains variable names
library(xlsx)
read.xlsx('c:/myexcel.xlsx', 1)
mydata <-
# read in the worksheet named mysheet
read.xlsx('c:/myexcel.xlsx', sheetName = 'mysheet') mydata <-
load(url('dataset_url'))
# or
read.table('dataset_url') Dataset_name=
URL must be enclosed in quotes to make it a string.
scan()
function can be used to import a vector or list.dplyr
package is one of the most popular for working with data in R. It is similar to Python Pandas. See Tom Augspurger’s Gist on dplyr and Pandas.na.omit(dataset_name$optional_var_name)
will omit any missing values (usually labeled ‘NA’)attach(dataset_name, options)
Exporting to a delimited text file:
write.table(x, outfile, sep=delimiter, quote=TRUE, na='NA')
x
is the object and outfile
is the target file. For example, the statement write.table(mydata, 'mydata.txt', sep=',')
would save the dataset mydata
to a comma-delimited file named mydata.txt in the current working directory. Include a path (for example, 'c:/myprojects/mydata. txt'
) to save the output file elsewhere. Replacingsep=','
with sep='\t'
would save the data in a tab-delimited file. By default, strings are enclosed in quotes (’’) and missing values are written as NA
.Exporting to an Excel workbook using the xlsx package:
library(xlsx) write.xlsx(x, outfile, col.Names=TRUE, row.names=TRUE, sheetName='Sheet 1', append=FALSE)
Mean: mean()
Groups of means can be compared with the by()
function:
by(df_name$var_name1,df_name$var_name2,mean)
Median: median()
Variance: variance()
Absolute value: abs(x)
Square root: sqrt(x)
Log: log(x)
Computing summary statistics of data subsets: aggregate(model_formula, df_name, mean)
. This function is useful if you want to calculate means of each group in a dataset.
Frequency table (to count the number of times each category occurs in a data frame): table(df_name)
For a specific variable: table(df_name$var_name)
The table
command can also be used to combine multiple vectors of different types into a table:
c('Y','N','N','Y','N','Y','Y','Y','N','Y')
smokes = c(1,2,2,3,3,1,2,1,3,2)
amount =table(smokes,amount)
Relative frequency table:
table(df_name$var_name1)/number_observations
The summary()
function gives you an overview description of a variable (minimum, first quartile, median, mean, third quartile, and maximum). The describe()
function is similar, but includes more information on the distribution (sd, range, skew, kurtosis). Remember that skew is classified by the direction of the data tail.
Using comparisons is a good way to facilitate data analysis.
To select all elements of a vector that are greater than zero, type new_vector_name <- vector_name >0
To compare vectors/variables (can also use <
or ==
):
$var_name1>df_name$var_name2 df_name
To compare multiple elements of a vector:
>vector_name[element2] vector_name[element1]
To create a ratio:
df_name$var_name1/df_name$var_name2 ratio_name <-
Proportion:
df_name$var_name1/(df_name$var_name1+df_name$var_name2) proportion_name <-
You can calculate a proportion based on a subset of the data. Example:
# The 'atheism' data frame has already been loaded
# Your code to create the 'us12' subset:
subset(atheism, atheism$nationality == 'United States' & atheism$year =='2012')
us12 =# Calculate the proportion of atheist responses:
nrow(subset(us12,response == 'atheist'))/nrow(us12) proportion =
Sum vectors with the sum()
function, and compare sums with var_name>other_var_name
.
The scale()
function converts values to z scores.
95% confidence for a sample size n means that 95% of random samples of size n will yield confidence intervals that contain the true average value.
The margin or error is equal to the radius (or half the width) of the confidence interval.
Sample code:
# Calculate standard error
sd(vector_name)/sqrt(# observations)
se =
# Calculate lower and upper CI bounds
lower = sample_mean - 1.96 - se
upper = sample_mean + 1.96 - se
c(lower, upper)
Calculating the critical value for a confidence interval: qnorm(0.95)
Margin of error:
Imagine you’ve set out to survey 1000 people on two questions: are you female? and are you left-handed? Since both of these sample proportions were calculated from the same sample size, they should have the same margin of error, right? Wrong! While the margin of error does change with sample size, it is also affected by the proportion.
The first step is to make a vector
p
that is a sequence from 0 to 1 with each number separated by 0.01:
1000
n = seq(0, 1, 0.01) p =
We then create a vector of the margin of error (
me
) associated with each of these values ofp
using the familiar approximate formula (ME = 2 X SE):
1.96 - sqrt(p - (1 - p) / n) me =
## Warning in sqrt(p - (1 - p)/n): NaNs produced
Finally, plot the two vectors against each other to reveal their relationship:
plot(me ~ p)
Simulating a coin flip with sample()
:
c('heads', 'tails')
outcomes =sample(vector_name, size=100, replace=TRUE)
The code above simulates 100 flips of a fair coin.
Add different probability with
sample(vector_name, size=100, replace=TRUE,prob=c(0.2,0.8))
The sample can also be saved as a vector for further analysis (such as calculating the mean, or other statistics)
The for loop is a cornerstone of computer programming. The idea behind the for loop is iteration: it allows you to execute code as many times as you want without having to type out every iteration.
Example for loop (from DataCamp data analysis and statistical inference course, also see Verzani simpleR p.48):
In this example, we want to iterate the two lines of code inside the curly braces that take a random sample of size 50 from area then save the mean of that sample into the sample means50 vector. Without the for loop, this would be painful:
rep(NA, 5000)
sample_means50 <- sample(area, 50)
samp <-1] <- mean(samp)
sample_means50[ sample(area, 50)
samp <-2] <- mean(samp)
sample_means50[ sample(area, 50)
samp <-3] <- mean(samp)
sample_means50[ sample(area, 50)
samp <-4] <- mean(samp) sample_means50[
and so on… With the for loop, these thousands of lines of code are compressed into a handful of lines:
# The 'ames' data frame and 'area' and 'price' objects are already loaded
# Set up an empty vector of 5000 NAs to store sample means:
rep(NA, 5000)
sample_means50 =# Take 5000 samples of size 50 of 'area' and store in 'sample_means50'.
# Also print 'i' (the index counter):
for (i in 1:5000)
{ sample(area, 50)
samp = mean(samp)
sample_means50[i] =print(i)
}
In the first line we initialize a vector. In this case, we created a vector of 5000 NAs called
sample means50
. This vector will store values generated within the for loop. NA means not available, and in this case they’re used as placeholders until we fill in the values with actual sample means. NA is also often used for missing data in R.The second line calls the for loop itself. The syntax can be loosely read as, ‘for every element
i
from 1 to 5000, run the following lines of code’. You can think ofi
as the counter that keeps track of which loop you’re on. Therefore, more precisely, the loop will run once wheni
=1, then once wheni
=2, and so on up toi
=5000.The body of the for loop is the part inside the curly braces, and this set of code is run for each value of
i
. Here, on every loop, we take a random sample of size 50 fromarea
(note: the sample size can be previously defined by creating a vector), take its mean, and store it as the ith element ofsample_means50
. In order to display that this is really happening, we asked R to printi
at each iteration. This line of code is optional and is only used for displaying what’s going on while the for loop is running.The for loop allows us to not just run the code 5000 times, but to neatly package the results, element by element, into the empty vector that we initialized at the outset. You can also run multiple distributions through the same for loop:
# The 'ames' data frame and 'area' and 'price' objects are already loaded
# Initialize the sample distributions:
rep(NA, 5000)
sample_means10 = rep(NA, 5000)
sample_means100 =
# Run the for loop:
for (i in 1:5000) {
sample(area, 10)
samp = mean(samp)
sample_means10[i] = sample(area, 100)
samp = mean(samp)
sample_means100[i] = }
Don’t worry about the fact that
samp
is used for the name of two different objects. In the second command of the for loop, the mean ofsamp
is saved to the relevant place in the vectorsample_means10
. With the mean saved, we’re now free to overwrite the object samp with a new sample, this time of size 100. In general, anytime you create an object using a name that is already in use, the old object will get replaced with the new one, i.e. R will write over the existing object with the new one, which is something you want to be careful about if you don’t intend to do so.’Another example: Obtain a random sample, calculate the sample’s mean and standard deviation, use these statistics to calculate a confidence interval, repeat steps (1)-(3) 50 times.
# The 'ames' data frame is already loaded
# Initialize 'samp_mean', 'samp_sd' and 'n':
rep(NA, 50)
samp_mean = rep(NA, 50)
samp_sd = 60
n =
# For loop goes here:
for (i in 1:50)
{ sample(population, n)
samp = mean(samp)
samp_mean[i] = sd(samp)
samp_sd[i] = }
For loops can also be used for bootstrapping.
Bootstrapping is a way of amplifying your data for estimation.
Definition (from Google):
a technique of loading a program into a computer by means of a few initial instructions that enable the introduction of the rest of the program from an input device.
There is also a boot
package available for R
Sampling vs. bootstrapping: The sampling distribution is calculated by resampling from the population, the bootstrap distribution is calculated by resampling from the sample.
Example: bootstrap confidence interval for the average weight gained by all mothers during pregnancy. (From Datacamp data analysis and statistical inference course lab 4)
# The 'nc' data frame is already loaded
# Initialize the 'boot_means' object:
rep(NA,100)
boot_means=
# Insert your for loop:
for (i in 1:100) {
sample(gained_clean, n, replace=T)
boot_sample = mean(boot_sample)
boot_means[i] =
}
# Make a histogram of 'boot_means':
hist(boot_means)
Writing a for loop every time you want to calculate a bootstrap interval or run a randomization test is cumbersome. This is a custom function that automates the process. By default the inference
function takes 10,000 bootstrap samples (instead of the 100 you’ve taken above), creates a bootstrap distribution, and calculates the confidence interval, using the percentile method.
Example
# The 'nc' data frame is already loaded into the workspace
# Load the 'inference' function:
load(url('http://s3.amazonaws.com/assets.datacamp.com/course/dasi/inference.Rdata'))
# Run the inference function:
inference(nc$gained, type='ci', method='simulation',
conflevel=0.9, est='mean', boot_method='perc'
)
method='se'
will perform the method based on the standard errorest='median'
will create an interval for the median instead of the meanExample (Datacamp data analysis and statistical inference course lab 4):
use the inference function for evaluating whether there is a difference between the average birth weights of babies born to smoker and non-smoker mothers:
# The 'nc' data frame is already loaded into the workspace
inference(y = nc$weight, x = nc$habit,
est = 'mean', type = 'ci', null = 0,
alternative = 'twosided', method = 'theoretical',
order=c('smoker','nonsmoker')
)
The first argument is y, which is the response variable that we are interested in:
nc$weight
.The second argument is the grouping variable
x
, which is the explanatory variable multiplied by the grouping variable across the levels of which we’re comparing the average value for the response variable, smokers and non-smokers:nc$habit
.The third argument
est
is the parameter we’re interested in: ‘mean’ (other options are ‘median’, or ‘proportion’.)Next we decide on the type of inference we want: a hypothesis test (‘ht’) or a confidence interval(‘ci’).
When performing a hypothesis test, we also need to supply the null value, which in this case is 0, since the null hypothesis sets the two population means equal to each other.
The alternative hypothesis can be ‘less’, ‘greater’, or ‘twosided’.
The method of inference can be ‘theoretical’ or ‘simulation’ based.
The order function changes the order of the variable levels.
You can also add
success=
, which defines a success for survey data-type questions.
Another example:
Take a subset
spain
ofatheism
with only the respondents from Spain. Calculate the proportion of atheists within this subset. Use theinference
function on this subset, grouping them by year. (From Datacamp data analysis and statistical inference course lab 4)
# Take the 'spain' subset:
subset(atheism, atheism$nationality == 'Spain')
spain =# Calculate the proportion of atheists from the 'spain' subset:
nrow(subset(spain,response == 'atheist'))/nrow(spain)
proportion =# Use the inference function:
inference(spain$response, spain$year, est = 'proportion',
type = 'ci', method = 'theoretical', success = 'atheist'
)
t.test(y~x, var.equal = TRUE)
# when both variables are numeric
t.test(y1,y2, var.equal = TRUE)
# one-sample t-test, with null hypothesis mean mu=3.
t.test(y,mu=2, var.equal = TRUE)
# paired t-test
t.test(y~x, paired=TRUE, var.equal = TRUE)
# one-tailed t-test
t.test(y~x, alternative = 'less', var.equal = TRUE)
# one-tailed t-test
t.test(y~x, alternative = 'greater', var.equal = TRUE)
var.equal = TRUE
to indicate homogeneity of variance.Shapiro-Wilk W test (W>0.9 desired, see flowchart from PATH 591 lecture 03 p.11 in linear regression section):
shapiro.test(dependent_var, options)
Normal quantile-quantile plot
qqnorm(dependent_var, options)
qqline(dependent_var, options)
var.test(dependent_var~independent_var)
See R in action ch.8.
lm(model, data, ...) model.name <-
See R in Action tables 8.2 and 8.3 for full syntax info.
Linear regression formula symbols. Adapted from R in Action.
~
: dependent variable on left, independent variables on right, like y~a+b
. -+
: Separates independent variables.:
Interaction, such as y~a:b
, meaning interaction of independent variables a and b.*
: All possible combinations of the variables.^
: Restricts interactions to a certain degree. The formula y ~ (a + b + c)^2
would be read by R as y ~ a + b + c + a:b + a:c + b:c..
: References all independent variables in the data frame.-
: Subtracts a variable from the model.-1
: Suppresses the intercept.I()
: Allows use of arithmetic within the parentheses.Linear regression functions that can be used after specifying the model. Adapted from R in Action.
aic()
or bic()
: Prints Aikake’s or Bayes’ Information Criterion.anova()
: ANOVA table.coefficients()
: Shows intercept and slopes for the regression model.confint()
: Confidence intervals (95 percent by default).fitted()
: Lists predicted values.plot()
: Plots for evaluating model fit. To select specific plots, add ‘which=’.residuals()
: Residuals are basically the deviation from the model fit.
rstandard()
for standardized residuals (divided by standard deviation, fitted including the current data point).
rstudent()
for studentized residuals (divided by standard deviation, but fitted ignoring the current data point).
Stdres()
and studres()
can also be used but require the MASS package (Modern Applied Statistics with S).
See PATH 591 lecture 03 for a clear visual explanation of residuals.
summary()
: Summary of the regression model. To view a specific part, such as just the R2, add $r.squared
or $adj.r.squared
vcov()
: Lists the covariance matrix for model parameters.Shapiro-Wilk W statistic: based on mathematics of normal probability plots. Compares variance of distribution to variance expected under normal distribution. Shapiro-Wilk W statistic trumps p value. Normality testing workflow from PATH 591 lecture 03 p.11:
Graphical evaluation by normal quantile-quantile plot
Skewness: measure of asymmetry in distribution. A perfect normal distribution has skewness=0. Positively skewed distributions=tail to the right, negatively skewed=tail to the left.
Kurtosis: another measure of asymmetry in distribution. Leptokurtic distributions (kurtosis<0) will have more observations at the center, platykurtic distributions (kurtosis>0) are spread evenly.
car
and lmtest
packages.acf(lm_name$residuals)
, acf(rstudent(lm_name))
, or acf(studres(lm_name))
.studres
(in the MASS package) will produce an autocorrelation plot. The dashed lines indicate the 95% confidence interval, and correlations ideally lie within the lines.Cook’s D is a distance statistic that measures change in regression parameter estimates when an observation is deleted.
A Bonferroni outlier test can also be performed with the outlierTest()
function in the car
package.
Plots of Cook’s D values and residuals vs leverage allow graphical assessment of influential observations. Add a critical line to the Cook’s Distance plot with:
4/(nrow(dataframe)-length(lm1$coefficients)-2)
cookscritical <-plot(lmlog, which=4, cook.levels=cookscritical)
abline(h=cookscritical, lty=2, col='red')
gvlma
packagegvlma
returns five summary values:
F0.05(k, n-1 df)=, r2, p<
This information is usually adequate for publication purposes. See PATH 591 lecture 04 A p.8.
ANOVA is a type of linear regression analysis with categorical independent variables. R in Action ch.9 contains an excellent section on ANOVA.
Options for basic ANOVA:
aov(model, data=data.frame)
lm()
anova()
glm()
See R in Action table 9.5 for more syntax info and model examples.
lm()
function is more flexible than a basic ANOVA analysis with the aov()
function. It allows inclusion of both categorical and quantitative independent variables and enables prediction.aov()
function uses Type I sums of squares, which is not appropriate for imbalanced designs (groups of different sizes). The Anova()
function (with a capital A) in the car
package allows use of Type II or III sums of squares.glm()
uses iteratively reweighted least squares (IWLS). Further information in R in Action chapter 13.Shapiro-Wilk W test (W>0.9 desired, see flowchart from PATH 591 lecture 03 p.11 in linear regression section):
shapiro.test(dependent_var, options)
Normal quantile-quantile plot
qqnorm(dependent_var, options)
qqline(dependent_var, options)
Bartlett test: bartlett.test()
. Bartlett test is sensitive to deviations from normality.
Levene test
# Download and install the car package first
library(car)
leveneTest(y~a, location= 'mean')
Brown-Forsythe test
# Download and install the car package first
library(car)
leveneTest(y~a)
These HOV tests are only appropriate for one-way designs. A different approach is required for more complex designs.
Factorial designs can be converted to one-way by splitting each treatment combination into a separate group and identifying it as a single level in one independent variable, calling the one new independent variable ‘treatment,’ and then performing the Brown-Forsythe test. This does not work for repeated measures data because values for each group at one timepoint are correlated to values at other timepoints.
Variance assessment for multivariate, factorial, and repeated measures designs
I am more familiar with this in SAS. Explanation below.
In general, you want to use the simplest model possible that minimizes AIC, AICC, and BIC while still meeting the assumptions of the model. Mind the assumptions of each covariance structure. See CPSC 542 Lecture 10 pp. 4-7.
Donald Bullock (Illinois CPSC statistics professor):
You select an error structure with heterogeneous variances [like unstructured] and compare that to a model with everything the same with the exception of homogeneous variances [like compound symmetry]. The differences in the log likelihoods will be a chi-square with degrees of freedom equal to the differences in the number of estimates in the two models.
This is also described in the Littell ‘SAS for mixed models’ text p. 172.
In other words, run the code using the unstructured model, and copy/paste the ‘Fit Statistics’ table and the number of covariance parameter estimates into Excel. The number of estimates (also called covariance parameters) is given under ‘Dimensions’ in the output. Also look at the ‘-2 Res Log Likelihood’ number under ‘Fit Statistics’ in the output.
Then run the code several more times, using different covariance structures each time (cs, ar(1), arh(1), toep, and toeph), and paste the fit statistics and estimates into excel also. Then, for each test other than unstructured, type =chidist(x, df), using the difference between the ‘-2 Res Log Likelihood’ for unstructured and the other test as the ‘x’ or chi square, and the difference in #parameters between unstructured and the other test as the df.
The answer will be a p value. A significant p indicates that the unstructured model provides superior fit.
Independence of residuals is required. Any pairing in the data, such as in repeated measures designs, must be appropriately specified in the model.
# note that this is adjusted for multiple comparisons
TukeyHSD(model)
pairwise.t.test(y, a)
or
library(multcomp)
# load RColorBrewer for additional color palettes.
library(RColorBrewer)
glht(aov1, linfct=mcp(var_name='Tukey'))
tuk <-cld(tuk)
plot(cld(tuk, level=.05), col=brewer.pal(8,'Pastel1'))
# 'I'm a very vivacious man... Earth tones are so passe.
# I'm a pastel man.' Larry David
Download and install the multcomp
package.
First, print the covariance matrix with model.matrix(model_name)
to identify coefficients for contrasts.
Next, use the numbers from error matrix to indicate group comparisons of interest.
library(multcomp)
rbind('contrast name' = c(3, -1, -1, -1))
contrast <-summary(glht(model_name, linfct=mcp(x=contrast)))
Options for creation of interaction plots:
interaction.plot(variable1, variable2, options)
library(effects)
plot(effect(factorial_term, model_name, xlevels),
multiline=TRUE
)
library(gplots)
plotmeans(y ~ interaction(variable1, variable2, sep=' '),
connect=list(c(1,3,5),c(2,4,6)),
col=c('red', 'darkgreen'),
main = 'Interaction Plot with 95% CIs',
xlab='Treatment and Dose Combination'
)
library(HH)
interaction2wt(model)
Using aov()
or lm()
is the simplest method, but also the least flexible. It assumes sphericity, and thus is not suitable for most real-world datasets. A mixed models approach is required to appropriately model covariance. R in Action does not cover mixed models in depth.
A few options:
Anova()
in the car
packagelmer()
in the lme4
package. See CRAN 7.35 for discussion of p values in this function.lme()
or gls()
in the nlme
packageAnalysis of covariance includes both fixed effects (independent variables) and random effects (covariates included to model random variation). If the effect can be reproduced, it’s fixed. With fixed effects you are interested in the means, but with random effects you are interested in variances. Blocks could be either fixed or random; for example, if you are doing a test on trees and you want to block for forest, setting blocks as random would indicate that you want to block for just all forests everywhere in general and these forests are representative of all forests, whereas setting blocks as fixed would indicate that you want to block for these specific forests you’re evaluating. See CPSC 440 Lecture Notes Chapter 08 ANOVA.
The covariance structure must be appropriately specified. Fit statistics and information criteria can be used to identify an appropriate covariance structure: AIC(model)
. See ‘Selection of regression models’ in the Linear regression section, and ‘Variance assessment for multivariate, factorial, and repeated measures designs’ in the ANOVA section above.
Useful when both dependent and independent variables are categorical, or when dependent variable is a count.
It evaluates the association, or contingency, between two kinds of classifications.
Data must first be put into a contingency table:
xtabs(formula, data=df_name) table_name <-
Next, run the test and reference the table
fisher.test(mytable)
See R in action ch. 7.2, Crawley The R Book ch. 14-15, UCSC R tips.
Example from my grad school lab: Prostate cancer incidence data from TRAMP Tomato Soy paper
wilcox.test(y~x, data)
wilcox.test(y~x, data, paired=TRUE)
kruskal_test(y~a, data)
Non-parametric one-way ANOVA, where y
is a dependent variable and a
is a categorical independent variable with ≥2 levels.
A non-parametric Kruskal-Wallis test is used when data fail the assumption of normality, and are not amenable to transformation. It is still worthwhile to perform the usual normality testing to demonstrate this.
HOV is still required. Use a test that is not affected by non-normality, such as Brown-Forsythe.
pairwise.wilcox.test(y, x, p.adjust.method= 'holm')
This function allows control of the experiment-wide error rate and is the best non-parametric multiple comparisons method available in R.
Adjustment options:
holm
hochberg
hommel
bonferroni
BH
BY
fdr
none
From R Documentation:
The first four methods are designed to give strong control of the family-wise error rate. There seems no reason to use the unmodified Bonferroni correction because it is dominated by Holm’s method, which is also valid under arbitrary assumptions.
library(PMCMRplus)
kwAllPairsNemenyiTest(y, x, method= c('Tukey', 'Chisq'))
PMCMRplus
package (Pairwise Multiple Comparisons of Mean Rank Sums). Note that the original PMCMR
package has been deprecated.nparcomp
packagelibrary(nparcomp)
nparcomp(y~a, data, type=)
npmc
packageThere used to be an npmc package that is referred to in some documentation. It has been deprecated.
friedman.test(y~a | b, data)
y
is the numeric outcome variable, A
is a grouping variable, and B
is a blocking variable that identifies matched observations. In both cases, data
is an option argument specifying a matrix or data frame containing the variables.Try the nlme
, MultNonParam
, npmlreg
, or nparLD
packages.
See R in Action ch.10
See R in Action ch.14
Bioconductor is a repository of R packages for bioinformatic analysis.
Install core packages (run the commands sequentially). These commands can also be used to update packages.
source('http://bioconductor.org/biocLite.R')
biocLite()
Install specific packages:
biocLite(c('GenomicFeatures', 'AnnotationDbi'))
Troubleshoot installations to flag packages that are either out-of-date or too new for your version of Bioconductor. The output suggests ways to solve identified problems, and the help page ?biocValid
lists arguments influencing the behavior of the function.
library(BiocInstaller)
biocValid() # R version 3.0 or later
The ddCt package for Bioconductor performs automated ΔΔCt calculations on qPCR data. Install with
source('http://bioconductor.org/biocLite.R')
biocLite('ddCt')
It seems to require a very specific format for the files output from SDS, and I was not able to get it to work with my data. It would be far easier to create a VBA macro in Excel.
Scatterplots: plot(x=,y=,options)
If your data frame only has two variables, you only need to specify the data frame: plot(data.frame)
To select specific variables from a data frame to display on the graph, use plot(x=data.frame$variable, y= data.frame$variable)
Line chart: plot(x,y,type='l')
Plotting >2 variables: plot(data.frame[rows,columns])
Boxplot: boxplot()
See the multiple comparisons ANOVA section above
Plotting one variable as a function of another is very similar to creating a linear model: boxplot(y ~ a)
attach()
the dataset you do not need to specify the data frame name here.New variables can also be calculated and plotted even if they are not in the original data frame:
+vector
new_calculation=vectorboxplot(new_calculation~df_name$var_name)
Setting line widths for individual parts of the boxplot:
lwd=
medlwd=
outlwd=
Adding symbols for group means:
After creating the boxplot, create a new vector and use the aggregate()
function: `means <- aggregate(model, data.frame, mean)
Next, create a separate points()
function:
points(means$dependent_var, cex=2, pch=23, bg=0, lwd=4)
Note that, as with other functions in R, variables will be sorted alphabetically or numerically. To change the order, you must create a separate vector and use order()
or factor()
to specify the order. See the section on data frames above, and previous code.
Plotting factorial datasets: Sample code from the R Graphics Cookbook ch.2 ‘Creating a box plot’.
# Formula syntax
boxplot(len ~ supp, data = ToothGrowth)
# Put interaction of two variables on x-axis
boxplot(len ~ supp + dose, data = ToothGrowth)
With the ggplot2 package, you can get a similar result using qplot() (Figure 2-11), with geom=‘boxplot’:
library(ggplot2)
qplot(ToothGrowth$supp, ToothGrowth$len, geom = 'boxplot')
If the two vectors are already in the same data frame, you can use the following syntax:
qplot(supp, len, data = ToothGrowth, geom = 'boxplot')
This is equivalent to:
ggplot(ToothGrowth, aes(x = supp, y = len)) + geom_boxplot()
It’s also possible to make box plots for multiple variables, by combining the variables with interaction(), as in Figure 2-11. In this case, the dose variable is numeric, so we must convert it to a factor to use it as a grouping variable:
# Using three separate vectors
qplot(interaction(ToothGrowth$supp, ToothGrowth$dose),
$len,
ToothGrowthgeom = 'boxplot')
Alternatively, get the columns from the data frame:
qplot(interaction(supp, dose), len, data = ToothGrowth, geom = 'boxplot')
This is equivalent to:
ggplot(ToothGrowth, aes(x = interaction(supp, dose), y = len)) + geom_boxplot()
Dot plot
Histogram: hist()
Histograms are useful for inspecting the distribution of the dataset.
To control histogram divisions and limits of x axis:
hist(df_name$var_name,breaks=number,xlim=number or variable you created)
Combined boxplot and histogram: simple.hist.and.boxplot()
Bar chart: barplot()
Pie chart: pie(data.frame)
Mosaic plot (useful for showing how different categories are related):
mosaicplot(data.frame$variable1~data.frame$variable2)
Volcano plots
See Quick-R, and of course R in Action, for thorough documentation.
Many graphical options are specified with the par()
function.
cex=
pch=
lty=
lwd=
Colors
Colors can be specified by index number, name, hexadecimal, or RGB.
See Earl Glynn’s color chart for a visual listing of colors in R by index number.
View all available color names: color()
The basic palette consists of 8 colors. Other color vectors with n contiguous colors can be created with gray(n)
, rainbow(n)
, heat.colors(n)
, terrain.colors(n)
, topo.colors(n)
, and cm.colors(n)
.
The RColorBrewer
package can also be used for color palettes. See website for more info.
To make palettes available in R: palette.name <- brewer.pal(n, 'name')
. Next, when selecting a color palette for a chart, use the vector name (palette.name
in this example).
There are 3 types of palettes, sequential, diverging, and qualitative.
To view all palettes as a plot:
display.brewer.all(n=NULL, type='all', select=NULL, exact.n=TRUE, colorblindFriendly=FALSE)
Use bracketed numbers to select specific colors or a range from within a palette: col=set1[5:9]
. Note that colors will be repeated if necessary. To start from a specific color in the palette, but still retain all colors in the palette, type the number for the starting color and then add that to the total number of colors in the palette.
For Set1: col=set1[5:14]
To color a plot by category (from Stackoverflow)
data.frame(x=1:10, y=rnorm(10)+5, z=sample(letters[1:3], 10, replace=TRUE))
DF <-attach(DF); plot(x, y, col=c('red','blue','green')[z]); detach(DF)
Fonts: See Quick-R.
Legend: legend()
. See Quick-R.
Adding jitter when there are too many points sitting on top of each other: plot(evals$score ~ jitter(evals$bty_avg))
or plot(jitter(evals$score) ~ evals$bty_avg)
Margins: mar
for margin size in lines, mai
for margin size in inches. For complete information on margins, see Earl Glynn’s margins tutorial.
Axes
box(which = 'plot', bty = 'l')
plot([options], ylab=expression(bold(paste('Thickness, ',mu, 'm'))))
xlab=, ylab=
xaxt='n'
suppresses the x axis, yaxt='n'
suppresses the y axis, and axes=FALSE
suppresses both x and y axes.text()
, rather than mtext()
, as the latter does not support par('srt')
.srt = 45
for text rotation angle, adj = 1
to place the right end of text at the tick marks, and xpd = TRUE
to allow for text outside the plot region. You can adjust the value of the 0.25offset
as required to move the axis labels up or down relative to the x axis. See ?par
for more information.# Increase bottom margin to make room for rotated labels
par(mar = c(7, 4, 4, 2) + 0.1)
# Create plot with no x axis and no x axis label
plot(1:8, xaxt = 'n', xlab = '')
# Set up x axis with tick marks alone
axis(1, labels = FALSE)
# Create some text labels
paste('Label', 1:8, sep = ' ')
labels <-# Plot x axis labels at default tick marks
text(
1:8,
par('usr')[3] - 0.25,
srt = 45,
adj = 1,
labels = labels,
xpd = TRUE
)# Plot x axis label at line 6 (of 7)
mtext(1, text = 'X Axis Label', line = 6)
Create figure panels with
layout()
par(mfrow=)
par(mfcol)
split.screen
help(par)
, An Introduction to R 12.5.4 p.73, Muenchen_2009_R for SAS and SPSS Users Ch. 21.3 for more info.The ggplot2 package can generate advanced graphs (see ggplot2 documentation here and here).
qplot()
ggplot()
.ggobi is a data visualization program that interfaces with R.
plotly is an online tool that can interface with R and Shiny. Shiny is a nice tool for creating interactive plots and presentations.