The latest version of this document will always be found here.

Note that this is an extremely brief and limited introduction to R, with the examples writ with the UC Irvine BDUC system in mind. There are many more complete introductory docs, some of the best listed in the section below. This is only to get your toes wet.

Some assumptions

The following tutorial assumes that you’re using a Linux system from the bash shell, you’re familiar navigating directories on a Linux system using cd, using the basic shell utilities such as head, tail, less, and grep, and the minimum R system has been installed on your system. If not, you should peruse a basic introduction to the bash shell. The bash prompt is shown as bash %. The R shell prompt is > and R commands will be prefixed by the > to denote it. Inline comments are prefixed by and can be copied into your shell along with the R or bash commands they comment - the shields them from being executed, but do NOT copy in the R or bash shell prompt. For a quick introduction to basic data manipulation with Linux, please refer to the imaginatively named Manipulating Data on Linux.

1. R’s operational modes

R is a programming language designed explicitly for statistical computation. As such, it has many of the characteristics of a general-purpose language including iterators, control loops, network and database operations, many of which are useful, but in general not as easy to use as the more general Python or Perl

R can operate in 2 modes, as an interactive interpreter (the R shell) and as a scripting language, much like Perl or Python. Typically the R shell is used to try things out and the serial commands written in the R language and saved in a file are used to automate operations once the sequence is well-defined and debugged.

While it is not a great language for procedural programming, it does excel at mathematical and statistical manipulations. However, it does so in an odd way, especially for those who have done procedural programming before. R is quite object-oriented in that you tend to deal with data objects rather than with individual integers, floats, arrays, etc. The best way to think of data if you have programmed in C is to think of R data (typically termed tables or frames) as C structs, arbitrary constructs that can be dealt with by name. If you haven’t programmed in a procedural language, it may actually be easier for you. R manipulates chunks of data similar to how you might think of them. For example, "multiply that column by 3.54" or "pivot that spreadsheet".

For a still-brief but more complete overview of R, see Wikipedia’s entry for R.

2. Graphics and Graphical User Interfaces for R

R was developed as a commandline language. However, it has gained progressively more graphics capabilities and graphical user interfaces (GUIs). Some notable examples are the de facto standard R GUI R Commander, the elegant and powerful interactive graphics utility ggobi, and the fully graphical statistics package gretl which was developed external to R for time-series econometrics but which now supports R as an external module.

In addition, many routines in R are packaged with their own GUIs so that when called from the commandline interface, the GUI will pop up and allow the user to interact with a mouse rather than the keyboard.

As opposed to fully integrated commercial applications which have a cohesive interface, these packages differ slightly in the way that they approach getting things done, but in general the mechanisms follow generally accepted user interface conventions.

3. Getting help on R

After starting R, you can usually start R’s internal help with:

bash %  R  # more comments on R's startup state below
# R startup messages deleted
> help.start().

This will start a browser page with a lot of links to R documentation, both introductory and advanced.

If that help is not installed, most R help can be found in PDF and HTML at the R documentation page.

There is a very useful 100 page PDF book called: An Introduction to R (also in HTML). Especially note Chapter 1 and Appendix A on page 78, which is a more sophisticated, (but less gentle) tutorial than this.

Note also that if you have experience in SAS or SPSS, there is a good introduction written especially with this experience in mind. It has been expanded into a large book (to be released imminently), but much of the core is still available for free as R for SAS and SPSS Users

There is another nice Web page that provides a quick, broad intro to R appropriately called Using R for psychological research: A simple guide to an elegant package which manages to compress a great deal of what you need to know about R into about 30 screens of well-cross-linked HTML.

There is also a website that expands on simple tutorials and examples, so after buzzing thru this very simple example, please continue to the QuickR website

Thomas Girke at UC Riverside has a very nice HTML Introduction to R and Bioconductor as part of his Bioinformatics Core, which also runs frequent R-related courses which may be of interest for the Southern California area.

4. Start the R interpreter

On the BDUC cluster, we host several versions of R. You if you want a particular version, you’ll have to specify it via the module command:

bash %  module load R/2.12.2

which causes the module info to be spat out and then you can start R with:

bash %  R  # simple, no?

# if you get the line shown below:

[Previously saved workspace restored]

you had saved the previous data environment and if you type ls() you’ll see all the previously created variables:

> ls()
 [1] "aa"          "data.matrix" "fil_ss"      "i"           "ma5"
 [6] "mn"          "mn_tss"      ""   "sg5"         "ss"
[11] "ss_mn"       "sss"         "tss"         "t_ss"        "tss_mn"
[16] "x"

If the previous session was not saved, there will be no saved variables in the environment

> ls()
character(0) # denotes an empty character (= nothing there)

NB: In many cases the data objects can be manipulated as you would files in the Unix shell, with the takes-some-getting-used-to of adding the almost-always-required parens afterwards:

> ls() # lists all data objects in view
> ls(someframe) # may give additional inforamtion on the frame
> rm(someframe) # deletes the 'someframe' data object

To determine what an object is, you can use a variety of methods, but one of the most useful is str (get the structure of an abject).

> str(aa)  # aa is a table or data.frame resulting from reading in a 25MB file.
'data.frame':   385238 obs. of  6 variables:
 $ ID    : Factor w/ 1649 levels "ENSXETG00000000007_SCAFFOLD_1356_47264_67263",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ BEGIN : int  1 70 139 232 301 370 439 508 577 658 ...
 $ END   : int  50 119 188 281 350 419 488 557 626 707 ...
 $ Blue  : num  -0.67 -0.47 0.36 0.42 0.09 0.22 -0.42 0.61 0.71 1.61 ...
 $ Red   : num  2.8 2.56 2.96 -0.76 -2.6 -1.52 -0.36 -3.24 -1.8 -3.36 ...
 $ RedNeg: num  -0.7 -0.64 -0.74 0.19 0.65 0.38 0.09 0.81 0.45 0.84 ...

> str(y) # y is a vector of size 50
 num [1:50] -0.202 -0.418  0.325 -0.377 -0.405 ...

You can also use dim() to get the bare dimensions of the data structure. Always a good idea before you try to view it by referencing the data directly without limits.

> dim (aa)  # aa is a table or data.frame resulting from reading in a 25MB file.
[1] 385238      6

And to address the individual vectors composing the structure, just use the form Structure$Vector.

ie: aa$Blue refers to the structure aa, vector Blue (see above).

Getting the size of objects

You can also obtain the actual byte-size of the objects in memory via:

# if ls() gives you the following:
> ls ()
 [1] "d"          "e"          "eset"       "eSet"       "gset"
 [6] "gset_wgcna" "gset60862"  "idx"        "o"          "pd60862"
[11] "pheno"

# then you can list the RAM-resident sizes via:
> for (thing in ls()) { message(thing); print(object.size(get(thing)), units='auto') }
63.4 Kb
153.4 Mb
96.6 Mb
6.2 Mb
96.6 Mb
153.4 Mb
7.6 Mb
48 bytes
63.4 Kb
4 Mb
129.5 Kb

5. Some basic R data types

Before we get too far, note that R has a variety of data structures that may be required for different functions.

Some of the main ones are:

  • numeric - (aka double) the default double-precision (64bit) float representation for a number.

  • integer - a single-precision (32bit) integer.

  • single - a single precision (32bit) float

  • string - any character string ("harry", "r824_09_hust", "888764"). Note the last is a number defined as a string so you couldn’t add "888764" + 765.

  • vector - a series of data types of the same type. (3, 5, 6, 2, 15) is a vector of numerics. ("have", "third", "start", "grep) is a vector of strings.

  • matrix - an array of identical data types - all numerics, all strings, all booleans, etc.

  • data.frame - a table (another name for it) or array of mixed data types. A column of strings, a column of integers, a column of booleans, 3 columns of numerics.

  • list - a concatenated series of data objects.

6. Loading data into R

This process is described in much more detail in the document R Data Import/Export, but this will give you a short example of one of the most popular ways of loading data into R.

The following reads the example data file above (red+blue_all.txt) into an R table called aa; note the ,header=TRUE option which uses the file header line to name the columns. In order for this to work, the column headers have to be separated by the same delimiter as the data (usually a space or a tab).

> aa <- read.table(file="red+blue_all.txt",header=TRUE) # try it

I have run into situations where read.table() would simply not read in a file which I had validated with an external Perl script:

> em <- read.table(file="long_complicated_data_file",header=TRUE,sep="\t",comment.char="#") # try it
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  line 134 did not have 29 elements

In these cases, using read.delim() may be more successful:

em <- read.delim(file="long_complicated_data_file",na.strings = "", fill=TRUE, header=T, sep="\t")

Note that in R, you can work left to right or right to left, altho the left pointing arrow is the usual syntax. The above command could also have been given as :

> read.table(file="red+blue_all.txt",header=TRUE)  -> aa   # try it

Did we get the col names labeled correctly?

> names(aa)  # try it
[1] "ID"    "BEGIN" "END"   "Blue"  "Red"

Many functions will require the data as a matrix (a data structure whose components are identical data types, usually strings. The following will load the file into a matrix, doing the conversions along the way. This command also shows the use of the sep="\t" option which explicitly sets the data delimiter to the TAB character.

Data <- as.matrix(read.table("red+blue_all.txt", sep="\t", header=TRUE))

7. Viewing data

To view some of the table, we only need to reference it. R assumes that a naked variable is a request to view it. An R table is referenced in [rows,column] order. That is, the index is [row indices, col indices], so that a reference like aa[1:6,1:5] will show a square window of the table; rows 1-6, columns 1-5

Also: R parameters are 1-indexed, not 0-indexed, tho it seems pretty forgiving (if you ask for [0-6,0-5], R will assume you meant [1-6,1-5]. also note that R requires acknowledging that the data structures are n-dimensional: to see rows 1-11, you can’t just type:

> aa[1:11] # try it!

but must instead type:

> aa[1:11,]   # another dimension implied by the ','

To slice the table even further, you can view only those elements that interest you

  # R will provide row and column headers if they were provided.
  # if they weren't provided, it will provide column indices.
> aa[3:12,4:5]  # try it.

8. Using HDF5 files with R

HDF5 (see also the Wikipedia entry) is a very sophisticated format for data which is being used increasingly in domains that were not previously using it, chiefly in Financial Record Archiving, and Bioinformatics.

As the name suggests, internally, it is a hierarchically organized hyperslab, a structure of multi-dimensional arrays of the same data type (ints, floats, bools, etc), sort of like the file equivalent of a C struct. There can be multiple different such arrays but each contains data of the same type.

R supports reading and writing HDF5 files with the rhdf5 library, among others.

8.1. Reading HDF5 files into R

If you are only going to be reading HDF5 files provided by others, the approach is very easy:

# assuming the rhdf5 library is installed
> library(rhdf5)
> h5 <- h5read("SVM_noaa_ops.h5")  # a 12MB HDF5 file
> h5ls("SVM_noaa_ops.h5")         # describe the structure of the HDF5 file
                         group                  name       otype    dclass
0                            /              All_Data   H5I_GROUP
1                    /All_Data      VIIRS-M1-SDR_All   H5I_GROUP
2   /All_Data/VIIRS-M1-SDR_All              ModeGran H5I_DATASET   INTEGER
3   /All_Data/VIIRS-M1-SDR_All              ModeScan H5I_DATASET   INTEGER
4   /All_Data/VIIRS-M1-SDR_All  NumberOfBadChecksums H5I_DATASET   INTEGER
5   /All_Data/VIIRS-M1-SDR_All NumberOfDiscardedPkts H5I_DATASET   INTEGER
6   /All_Data/VIIRS-M1-SDR_All   NumberOfMissingPkts H5I_DATASET   INTEGER
7   /All_Data/VIIRS-M1-SDR_All         NumberOfScans H5I_DATASET   INTEGER
8   /All_Data/VIIRS-M1-SDR_All              PadByte1 H5I_DATASET   INTEGER
10  /All_Data/VIIRS-M1-SDR_All          QF2_SCAN_SDR H5I_DATASET   INTEGER
11  /All_Data/VIIRS-M1-SDR_All          QF3_SCAN_RDR H5I_DATASET   INTEGER
12  /All_Data/VIIRS-M1-SDR_All          QF4_SCAN_SDR H5I_DATASET   INTEGER
14  /All_Data/VIIRS-M1-SDR_All              Radiance H5I_DATASET   INTEGER
15  /All_Data/VIIRS-M1-SDR_All       RadianceFactors H5I_DATASET     FLOAT
16  /All_Data/VIIRS-M1-SDR_All           Reflectance H5I_DATASET   INTEGER
17  /All_Data/VIIRS-M1-SDR_All    ReflectanceFactors H5I_DATASET     FLOAT
18                           /         Data_Products   H5I_GROUP
19              /Data_Products          VIIRS-M1-SDR   H5I_GROUP
2           1
3          48
4          48
5          48
6          48
7           1
8           3
9  3200 x 768
10         48
11         48
12        768
13         16
14 3200 x 768
15          2
16 3200 x 768
17          2
20         16
21         16

# in a slightly different format, 'h5dump' provides similar information:
> h5dump(file="SVM_noaa_ops.h5", load=FALSE)
  group     name       otype  dclass dim
1     / ModeGran H5I_DATASET INTEGER   1

  group     name       otype  dclass dim
1     / ModeScan H5I_DATASET INTEGER  48

 ... <etc>

Other R data manipulations are fairly easy as well.

# you can read parts of the HDF5 hyperslab into an R variable as well
# the following reads a subgroup into an R object
> h5_aa <- h5read("SVM_noaa_ops.h5", "/All_Data/VIIRS-M1-SDR_All")

# and this is what h5_aa contains.
> names(h5_aa)
 [1] "ModeGran"              "ModeScan"              "NumberOfBadChecksums"
 [4] "NumberOfDiscardedPkts" "NumberOfMissingPkts"   "NumberOfScans"
 [7] "PadByte1"              "QF1_VIIRSMBANDSDR"     "QF2_SCAN_SDR"
[10] "QF3_SCAN_RDR"          "QF4_SCAN_SDR"          "QF5_GRAN_BADDETECTOR"
[13] "Radiance"              "RadianceFactors"       "Reflectance"
[16] "ReflectanceFactors"

# see each sub-object by referencing it in the standard R way:
> h5_aa$NumberOfScans
[1] 48

8.2. Writing HDF5 files from R

Writing HDF5 files is slightly more difficult, but only because you have to be careful that you’re writing to the right group and that there is congruence between R and HDF5 data types and attributes. The rhdf5 docs have good examples for this.

9. Visualizing Data with ggobi

While there are many ways in R to plot your data, one unique benefit that R provides is its interface to ggobi, an advanced visualization tool for multivariate data, which for the sake of argument is more than 3 variables.

10. Operations on Data

To operate on a column, ie: to multiply it by some value, you only need to reference the column ID, not each element. This is a key idea behind R’s object-oriented approach. The following multiplies the column Red of the table aa by -1 and stores the result in the column 'RedNeg. There’s no need to pre-define the RedNeg column; R will allocate space for it as needed.

> aa$RedNeg = aa$Red * -1   # try it

# the table now has a new column called 'RedNeg'
> aa[1:4,]      # try it

If you want to have the col overwritten by the new value, simply use the original col name as the left-hand value. You can also combine statements by separating them with a ;

> aa$Red = aa$Red * -2; aa[1:4,] # mul aa$Red by -2 and print the 1st 4 lines

To transpose (aka pivot) a table:

> aacopy <- aa # copy the table

> aac_t <- t(aacopy) # transpose the table

# show the all the rows & 1st 3 columns of the transposed table
> aac_t[,1:3]

To bin a vector and visualize the binning:

> bp <- read.table(file="just.bayesp",header=TRUE) # 36K fp values
> hist(bp[1:36000,],breaks=50,plot="True")
# see result below
Result of binning 36K points with hist()

binned data

To draw a histogram of a data vector and spline the distribution curve, using the file MS21_Denat_C_Native_E.txt

# read the file into a dataframe
$ cat MS21_Denat_C_Native_E.txt
Denat   Native
1.83    4.80
1.78    3.95
1.75    3.72
1.61    3.60
1.60    3.32
# start R
$ R

> bth <- read.table(file="MS21_Denat_C_Native_E.txt",header=TRUE,sep="\t",comment.char="#")
Check it
> str(bth)
'data.frame':   233 obs. of  2 variables:
 $ Denat : num  1.83 1.78 1.75 1.61 1.6 1.6 1.58 1.55 1.51 1.5 ...
 $ Native: num  4.8 3.95 3.72 3.6 3.32 3.23 3.1 2.9 2.7 2.66 ...

# Create a histogram for the 'Denat' column.  The following cmd not only plots the histogram
# but also populates the 'hhde' histogram object.
# breaks =  how many bins there should be (controls the size of the bins
# plot = to plot or not to plot
# labels = if T, each bar is labeled with the numeric value
# main, xlab, ylab = plot title, axis labels
# col = color of hist bars
# ylim, xlim = axis limits; this was set to allow another histogram to fit inside
#  the defined limits; if not set, the limits will be those of the data.

> hhde <-hist(bth$Denat,breaks=16,plot="True",labels=F, main="Ratio Distribution (Denatured)",xlab="Ratio",ylab="Frequency", col="lightblue",ylim=c(0,70), xlim=c(0,5) )

# now plot a smoothed distribution using the hhde$mids (x) and hhde$counts (y)
# df = degrees of freedom, how much smoothing is applied;
#    (how closely the line will follow the values)
> lines(sm.spline(hhde$mids,hhde$counts, df=10), lty=1, col = "green")
The image generated by the above sequence is shown below.

Ratio Histogram with Distribution curve

11. Exporting Data

Ok, you’ve imported your data, analyzed it, generated some nice plots and dataframes. How do you export it to a file so it can be read by other applications? As you might expect, R provides some excellent documentation for this as well. The previously noted document R Data Import/Export describes it well. It’s definitely worthwhile reading it in full, but the very short version is that the most frequently used export function is to write dataframes into tabular files, essentially the reverse of reading Excel spreadsheets, b ut in text form. For this, we most often use the write.table() function. Using the aa dataframe already created, let’s output some of it:

> aa[1:3,] # lets see what it looks like 'raw'
                                            ID BEGIN END  Blue  Red
1 ENSXETG00000000007_SCAFFOLD_1356_47264_67263     1  50 -0.67 0.70
2 ENSXETG00000000007_SCAFFOLD_1356_47264_67263    70 119 -0.47 0.64
3 ENSXETG00000000007_SCAFFOLD_1356_47264_67263   139 188  0.36 0.74

> write.table(aa[1:3,])
"ID" "BEGIN" "END" "Blue" "Red"
"1" "ENSXETG00000000007_SCAFFOLD_1356_47264_67263" 1 50 -0.67 0.7
"2" "ENSXETG00000000007_SCAFFOLD_1356_47264_67263" 70 119 -0.47 0.64
"3" "ENSXETG00000000007_SCAFFOLD_1356_47264_67263" 139 188 0.36 0.74

# in the write.table() format above, it's subtly different.  Note the quoting and spacing.
# You can address all these in the various arguments to write.table(), for example:

> write.table(aa[1:3,], quote = FALSE, sep = "\t",row.names = FALSE,col.names = TRUE)
ID      BEGIN   END     Blue    Red
ENSXETG00000000007_SCAFFOLD_1356_47264_67263    1       50      -0.67   0.7
ENSXETG00000000007_SCAFFOLD_1356_47264_67263    70      119     -0.47   0.64
ENSXETG00000000007_SCAFFOLD_1356_47264_67263    139     188     0.36    0.74

# note that I omitted a file specification to dump the output to the screen.  To send
# the same output to a file, you just have to specify it:

> write.table(aa[1:3,], file="/path/to/the/file.txt", quote = FALSE, sep = "\t",row.names = FALSE,col.names = TRUE)

# play around with the arguments, reversing or modifying the argument values.
# the entire documentation for write.table is available from the commandline with
# '?write.table'

12. Basic Statistics

Now, let’s get some basic descriptive stats out of the original table. 1st, load the lib that has the functions we’ll need

> library(pastecs)  # load the lib that has the correct functions
> attach(aa) # make the Red & Blue columns usable as variables.

# following line combines an R data structure by column.
# type 'help(cbind)' or '?cbind' for more info on cbind, rbind

> redblue <- cbind(Red,Blue)

# the following calculates a table of descriptive stats for both the Red & Blue variables
> stat.desc(redblue)
                       Red          Blue
nbr.val       3.852380e+05  3.852380e+05
nbr.null      2.486000e+03  2.188000e+03        0.000000e+00  0.000000e+00
min          -1.504000e+01 -5.130000e+00
max           2.256000e+01  5.340000e+00
range         3.760000e+01  1.047000e+01
sum          -6.929548e+04  3.060450e+03
median        4.000000e-02  0.000000e+00
mean         -1.798771e-01  7.944310e-03
SE.mean       3.826047e-03  1.115451e-03
CI.mean.0.95  7.498938e-03  2.186250e-03
var           5.639358e+00  4.793247e-01       2.374733e+00  6.923328e-01
coef.var     -1.320198e+01  8.714826e+01

You can also feed column numbers to stat.desc to generate similar results:

             stress_ko_n1 stress_ko_n2 stress_ko_n3 stress_wt_n1 stress_wt_n2
nbr.val      3.555700e+04 3.555700e+04 3.555700e+04 3.555700e+04 3.555700e+04
nbr.null     0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00       0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
min          1.013390e-04 1.041508e-04 1.003806e-04 1.013399e-04 1.014944e-04
max          2.729166e+04 2.704107e+04 2.697153e+04 2.664553e+04 2.650478e+04
range        2.729166e+04 2.704107e+04 2.697153e+04 2.664553e+04 2.650478e+04
sum          3.036592e+07 2.995104e+07 2.977157e+07 3.054361e+07 3.055979e+07
median       2.408813e+02 2.540591e+02 2.377578e+02 2.396799e+02 2.382967e+02
mean         8.540068e+02 8.423388e+02 8.372915e+02 8.590042e+02 8.594593e+02
SE.mean      8.750202e+00 8.608451e+00 8.599943e+00 8.810000e+00 8.711472e+00
CI.mean.0.95 1.715066e+01 1.687283e+01 1.685615e+01 1.726787e+01 1.707475e+01
var          2.722458e+06 2.634967e+06 2.629761e+06 2.759796e+06 2.698411e+06      1.649987e+03 1.623258e+03 1.621654e+03 1.661263e+03 1.642684e+03
coef.var     1.932054e+00 1.927085e+00 1.936785e+00 1.933941e+00 1.911300e+00


There is also a basic function called summary, which will give summary stats on all components of a dataframe.

> summary(aa)
                                               ID             BEGIN
 ENSXETG00000004313_SCAFFOLD_317_890727_910726  :   289   Min.   :    1
 ENSXETG00000010135_SCAFFOLD_646_44207_64206    :   289   1st Qu.: 4875
 ENSXETG00000017293_SCAFFOLD_132_1100669_1120668:   289   Median : 9722
 ENSXETG00000020596_SCAFFOLD_474_677230_697229  :   289   Mean   : 9781
 ENSXETG00000021861_SCAFFOLD_101_1701655_1721654:   289   3rd Qu.:14639
 ENSXETG00000023499_SCAFFOLD_486_295358_315357  :   289   Max.   :19950
 (Other)                                        :383504
      END             Blue                Red               RedNeg
 Min.   :   50   Min.   :-5.130000   Min.   :-15.0400   Min.   :-5.64000
 1st Qu.: 4924   1st Qu.:-0.470000   1st Qu.: -1.7600   1st Qu.:-0.39000
 Median : 9771   Median : 0.000000   Median :  0.0400   Median :-0.01000
 Mean   : 9830   Mean   : 0.007944   Mean   : -0.1799   Mean   : 0.04497
 3rd Qu.:14688   3rd Qu.: 0.480000   3rd Qu.:  1.5600   3rd Qu.: 0.44000
 Max.   :19999   Max.   : 5.340000   Max.   : 22.5600   Max.   : 3.76000

13. The commercial Revolution R is free to Academics

Recently a company called Revolution Analytics has started to commercialize R, smoothing off the rough edges, providing a GUI, improving the memory management, speeding up some core routines. If you’re interested in it, you can download it from here.