Analysis of DNA microarray data using the KTH-package and the R environment for statistical computing

 

 

 

Valtteri Wirta

 

 

 

Department of Biotechnology,
Royal Institute of Technology,
Stockholm, Sweden

valtteri@biotech.kth.se

 

 

 

Version 0.7

Updated 040405

 

 

 

 

 

 

 

 

 

 

Questions?

Comments, suggestions and other issues regarding this document: valtteri@biotech.kth.se

Questions on the use of the kth-package or general data analysis questions: kth-package@freelists.org

Access to the arrays: use the contact form on our homepage: www.biotech.kth.se/molbio/microarray

 


Table of Contents

Table of Contents
  Text
  Figures and tables in this Guide
Introduction
  Golden Path
  The data set used in the examples
  Syntax usage in this document
Installation and additional sources of information
Requirements
  The data
  Hardware
  Software
    R
    Bioconductor
    aroma
    kth-package
    How to show the version of a certain package
Data import and transformation
  Data Import
  Extraction of foreground and background signals
  Transformation into MA-format
Data pre-processing
  Filtering of the data
    Available tools to investigate the effects of filtering
      slideSummary function
      Between duplicates correlation
      Between slides correlation
      Numeric presentation of the filtering results
    Available tools to carry out the filtering
    Examples of usage of filter tools
      filterFlags and filterFlags2
      filterSize
      filterB2SD
      filterCircularity
      filterMtoM
      filterRatioComp
      filterSaturated
Within-slide normalisation
Between-slides normalisation
Identification of differentially expressed genes
  The B-test
  Estimation of the within-slide duplicate spot correlation
  Estimation of the M-value for each probe
  Multiple testing and toptables
  Additional things we can do with the toptables
  Strange tailing phenomenon
  How to create a big matrix of M-values starting from M-value matrices in different workspaces
  How to create a design object for reference design experiments with multiple comparisons
Case study 1: Reference study where four samples are compared to a common reference
How to visualise the results
  The basics
  MA-plot
  Volcano-plot
  Venn Diagram
Clustering
  GeneSpring
  Multiexperiment viewer
  Clustering functions available in a collection of R-packages
How to export results from R
  exportTopTable
  export1, export2 and export3
  exportMEV
Suggestion for naming objects in R
Where to find more information?
How to contribute
Acknowledgements

 

 

Figures and tables in this Guide

Table – Most recent versions of the different packages
Figure – Data processing path for Golden Path
Table – An example of a targets file
Figure – Example of information displayed for each package
Figure – Properties window
Figure – Imageplot of the background intensities
Figure – MA-plot
Figure – Four MA-plots
Figure – Example of the output from slideSummary
Figure – Correlation of M-values between the duplicated fields on the slide
Figure – Correlation of M-values between the two replicated hybridisations
Figure – Results from the filtering step
Figure – Box plot for slide 1
Figure – Nine box plots
Figure – Nine spatial plots
Figure – Box plots before and after the between-slides normalisation
Figure – Volcano-plot
Figure – Strange tailing phenomenon
Figure – Standard plot
Figure – Modified plot
Figure – Zooming in/out
Figure – Averaged MA-plot
Figure – Venn Diagrams
Figure – Venn Diagram, example

 

 

 


Introduction

DNA microarrays have rapidly become the platform of choice for analysis of gene expression levels at a genome-wide level. Gene expression in several organisms can today be analysed fairly easily using either in-house produced or commercial arrays.

 

The User’s Guide

When this document was updated, these were the latest versions of each package. Before starting an analysis or before asking questions, make sure you have the latest version of each package.

 

Package                                Version
R (strictly speaking not a package)    1.8.1
kth-package                            0.3.13
aroma                                  0.69
limma                                  1.5.9
statmod                                1.0.6
Bioconductor (bundle of packages)      1.4

Table – Most recent versions of the different packages

Make sure these are the ones you have installed on your computer

 

Golden Path

The purpose of this document is to present a workflow, called the “Golden Path”, for DNA microarray analysis. The aim of this approach is to help newcomers get started with DNA microarray data analysis and to provide tools for it.

 

The steps in the Golden Path are divided into four subgroups, shown schematically in the figure below. The first group, coloured purple, contains steps related to data import and transformation. The second group, red, deals with filtering issues. The third group, green, covers data normalisation, and the last group, grey, contains all subsequent data mining steps.

 

 

Figure – Data processing path for Golden Path.

All steps of the Golden Path are not included in this figure. The *.gpr denotes the results files generated by the GenePix software.

 

The KTH-package is a collection of tools created to help new users get started with the analysis without needing in-depth knowledge of the R platform for statistical computing. Excellent help pages are available for the R software, but for the majority of users the command-driven interface is still a major hurdle. The KTH-package is not a stand-alone package for microarray data analysis, but relies heavily on other packages, such as the excellent aroma by Henrik Bengtsson (Lund University, Sweden) and LIMMA by Gordon Smyth (WEHI, Australia). LIMMA is also part of the Bioconductor package bundle, and we will also use additional features and functionality of this bundle.

 

The majority of the steps are carried out in the R environment for statistical computing. The aim is to use only open-source software, and we have therefore chosen the very nice MEV software (Multiexperiment viewer) from TIGR for clustering. More information on clustering is found in later chapters, but the reader is encouraged to look for more details in the material provided with the MEV software (http://www.tigr.org/software/tm4/menu/mev.pdf and http://www.tigr.org/software/tm4/menu/MeV.ppt).

 

The purpose of this document is to make the data analysis as easy as possible. Please read through this document before asking any questions. The primary forum for questions is the mailing list for the kth-package at kth-package@freelists.org.

Use the following link to subscribe to this list: http://www.freelists.org/cgi-bin/list?list_id=kth-package

An additional excellent forum for questions is the Bioconductor mailing list, but this is strictly limited to questions regarding the Bioconductor tools.

The data set used in the examples

In this Guide, text, figures and R commands are used to explain what is done to the data. Some of the data (in gpr-format) used in the examples can be obtained by sending e-mail to the corresponding author of this document. Data from several different projects is used to exemplify the different functions and steps.

 

A good start for each new project is to create a targets file containing as much information as possible about the experiment and the way it was carried out. An example of such a targets file is shown below; it is also available as a tab-separated text file:

 

http://biobase.biotech.kth.se/~valtteri/R/data/targets.txt

 

Barcode     FileName                  Cy3       Cy5    Scanner   Date            Blocker   Primer         Slide
12664015    Hexa1.gpr                 Control   IFN    Agilent   2003 11 0840    60mer     hexamer        cDNA
12664012    Nona1.gpr                 Control   IFN    Agilent   2003 11 0840    60mer     nonamer        cDNA
12664013    HexaNona1.gpr             Control   IFN    Agilent   2003 11 0840    60mer     hexanonamer    cDNA
12664010    HexaNonaNoBlock1.gpr      Control   IFN    Agilent   2003 11 08      None      hexanonamer    cDNA
12554758    Operon802.gpr             Control   IFN    Agilent   2003 11 02      80mer     20TVN          cDNA
12559204    Operon801.gpr             Control   IFN    Agilent   2003 11 02      80mer     20TVN          cDNA
12559205    Amersham1.gpr             Control   IFN    Agilent   2003 11 0240    60mer     20TVN          cDNA
12559266    Amersham2.gpr             Control   IFN    Agilent   2003 11 0240    60mer     20TVN          cDNA

Table – An example of a targets file

The Cy3 and Cy5 columns are the only compulsory ones, and they must be labelled exactly as shown. Additional columns may be added. The information in this file can be used in several useful ways during the subsequent data analysis.
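A targets file of this form can be read into R with base functions (the limma package also provides readTargets for the same purpose). The sketch below writes a minimal, invented two-row file to a temporary location and reads it back:

```r
# Write a minimal, invented targets file (tab-separated) to a temp location
tfile <- file.path(tempdir(), "targets.txt")
writeLines(c(
  "Barcode\tFileName\tCy3\tCy5",
  "12664015\tHexa1.gpr\tControl\tIFN",
  "12664012\tNona1.gpr\tControl\tIFN"
), tfile)

# Read it back; Cy3 and Cy5 are the compulsory columns
targets <- read.delim(tfile, as.is=TRUE)
targets$FileName   # the gpr-files to import
targets$Cy5        # "IFN" "IFN"
```

The columns can then be used directly in the analysis, for example to label slides by their dye assignments.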

 

Syntax usage in this document

In this document the commands entered into the R console are formatted in this way

 

print("This is an example of colour, font and font size used for the commands which should be entered into the R console")

 

For clarity and line-break reasons, all R commands which are a continuation of the previous line have a "+" sign in front of them, as shown in the example below. Remove this sign when typing the command into the R console.

 
print("This is an example of colour, font and font size used 
+ for the commands which should be entered into the R console")

 

 

 


Installation and additional sources of information

 

The main portal to the Microarray Resource Center at KTH and the KTH-package is

http://biobase.biotech.kth.se/~valtteri/R/

 

Instructions on how to install the R software and the additional packages can be found at

 http://biobase.biotech.kth.se/~valtteri/pages/kthpackageinstall.html

 

Additional information on the R software is available at

http://www.r-project.org

 

Information on the aroma package is available at

http://www.maths.lth.se/help/R/aroma/

 

Information on the Bioconductor package bundle is available at

http://www.bioconductor.org

 

Information on the Multiexperiment viewer software and other software by TIGR is available at

http://www.tigr.org/software/tm4

 

To subscribe to the kth-package mailing list, sign up at

http://www.freelists.org/cgi-bin/list?list_id=kth-package

or

send the word subscribe in the Subject: line of an e-mail to

kth-package-request@freelists.org


Requirements

The data

Gridding is the process in which the exact location of a feature on the array is identified, and intensity values corresponding to this feature are extracted. Different software is available for this task, but we have chosen to work with the GenePix software from Axon Instruments. When using the kth-package and this User’s Guide your data should, therefore, be in GenePix format. Currently no other data formats are supported.

For each slide there should be one gpr-file. As the gpr-files differ between versions of GenePix, it is recommended that all slides are analysed using the same version of GenePix (preferably GenePix 5.0 or later). The gpr-files must contain the genes in the same order, which means the slides/arrays used must be from the same print batch, or at least have the same design/layout. If duplicate spots are used, they should be in two or more identical fields, and the distance (in rows) between duplicate spots must be constant.

Each probe on the chip must be named with a unique identifier, and this identifier must be placed in either the ID or Name column. Using the Gene Description column is advised against due to problems with formatting in the R console.

 

Hardware

The kth-package is intended to be used on the Windows 32-bit platform. It has, however, been successfully installed on Linux OS after some minor modifications to the installation procedure, but the complete functionality has not yet been verified.

 

The package runs smoothly on a Celeron 2.0 GHz processor with 512 MB of RAM, but extra RAM is always good (and cheap!).

 

The amount of virtual memory on the computer must be increased to approximately 2 GB. So far this has been sufficient for all the analyses I have carried out. In some cases the memory simply is not enough, but there are always alternative ways to do the analysis. If you encounter this problem, you may try to close all other applications on the computer, save your R workspace, quit R, re-start R and finally reload the workspace and the kth-package.

 

Current versions of R are able to handle up to 2 GB of memory, so there is no point in increasing the amount of RAM beyond 2 GB unless you are running several parallel processes. Newer versions of R are rumoured to be better at memory allocation.

 

Software

R

Always use the latest version of R. Currently (early April 2004) this corresponds to R 1.8.1.

You should definitely upgrade to R 1.8.1 if you do not already use it!

New major versions of R are usually released in April (R 1.9.0 in 2004) and October every year. Remember to update the packages every second week or so.

 

By default, each R session is allowed to use a maximum of 1024 MB of memory. For larger projects this is not enough, but the limit can be increased by typing

 

memory.limit(size=2000)   # this is the maximum amount of memory accessible for R

 

A successful memory increase gives NULL as output. During an R-session the memory limit cannot be reduced.

 

If a message indicating a memory allocation error is obtained during the R session, make sure you have increased the memory.limit as shown above. You can also try to free memory reserved by the R software by calling the "garbage collector":

 

gc()

Bioconductor

New major versions of the Bioconductor bundle are released approximately at the same time as new versions of R, or soon thereafter. For Bioconductor it is even more important that you update the packages frequently. In particular, the excellent LIMMA package (by Gordon Smyth) should be updated every week, as improvements are made and new functions are added frequently.

 

aroma

The aroma package is updated every now and then. Check Henrik Bengtsson’s web page for more information.

http://www.maths.lth.se/help/R/aroma/

 

kth-package

The kth-package is updated when new functions are available and I have the time to prepare the new version for distribution. To get information when new versions are available you can subscribe to the mailing list. Send the word subscribe in the Subject: line of an e-mail to

kth-package-request@freelists.org

 

The list is a closed list (only subscribers can send messages) but not moderated.

Please include the version information on your packages (kth-package, limma, aroma, R base) when asking questions.

 

It is important to remember that information on urgent and annoying bugs will not be sent out to individual users, but only to the mailing list! Do subscribe!

 

How to show the version of a certain package

To display information on a certain package in R, type

help(package="kth")

 

 

 

 

Figure – Example of information displayed for each package

 

 

 

 


Data import and transformation

Data Import

The first step is to read the data into R. It is assumed that you have installed R and the add-on packages (see Installation).

 

Create a folder for your project (try to keep one separate folder for each of your projects) and place your gpr-files there. The TIFF-files are not needed and can be stored elsewhere.

 

Create a copy of the R shortcut onto your desktop (there might already be one there, but create a new project-specific shortcut). Select the newly created shortcut icon and press Alt+Enter to open the Properties window of the shortcut (see figure). Rename the shortcut to something like “R 1.8.1 Your Project” and into the Start In field enter the path to your newly created folder containing your gpr-files.

 

 

Figure – Properties window

 

Start R and load the kth-library

 

library(kth)

 

This will automatically also load the aroma package and some important packages of the Bioconductor bundle. Some information is displayed when the kth-package is loaded. It is good to read through this part, as some important messages might be displayed there.

 

To import the data into R we use functions in the aroma package. First we verify that the working directory contains your gpr-files:

 

getwd()
dir()

 

You should now see all your gpr-files.

Then continue by creating a gpr-object

 

gpr <- MicroarrayData$read(pattern=".gpr")

 

This will import all gpr-files in your current folder (which should now be the project-specific folder). Be aware that this might take several minutes, depending on your computer and the number of slides you are importing. It is not uncommon for R to appear non-responsive during this time.

 

You have now created the gpr-object, which is central to all subsequent analysis steps. This object contains the raw data and should never be modified by the user. To see information on the gpr-object one can type

 

gpr

 

in which case you will see something like this:

 

[1] "GenePixData: Number of slides: 8. Number of fields: 56. Layout: Grids: 12x4 (=48), spots in grids: 25x25 (=625), total number of spots: 30000. Spot name's are specified. Spot id's are specified."

 

which indicates that the object contains data from 8 slides and each slide contains 30 000 rows of data.

Of course, the object contains a lot more information, which will be used later on. There are several ways of extracting this information, but doing so requires some knowledge of the indexing rules in R. Below are some examples, with only minor explanations. More information on indexing in R can be found in the help documents listed in the "Where to find more information" section of this document.

 

gpr$fields                                         # shows all fields of the gpr-object

gpr$Name[,1]                                    # shows the Name of each probe on slide 1 (the Names are of course the same for the probes on the other slides)

gpr$Name[1:10,]                             # as above, but only for the first 10 probes

gpr$Dia.[1:20,]                             # shows spot diameter for the first twenty probes on all slides

gpr$Dia.[1:20, c(1,3,5,7)] # as above, but for slides 1, 3, 5 and 7

 

All other fields can be accessed in a similar way. If the name of a field contains more than one word, the name has to be enclosed in quotation marks. For example:

 

gpr$"F635 Median"[1:10,]      # the Cy5 foreground signal for the first ten probes on all slides
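These indexing rules are plain R matrix indexing; the toy matrix below (invented values, five probes by four slides) mirrors the [rows, columns] patterns used on the gpr fields above:

```r
# A toy 5 x 4 matrix standing in for one gpr field (5 probes, 4 slides)
m <- matrix(1:20, nrow=5, ncol=4)

m[, 1]           # all probes on slide 1
m[1:3, ]         # first three probes on all slides
m[1:3, c(1, 3)]  # first three probes on slides 1 and 3
```

An empty index before or after the comma means "all rows" or "all columns", which is exactly how gpr$Name[,1] and gpr$Dia.[1:20,] work above.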

 

Extraction of foreground and background signals

In the next step we extract the foreground and background intensities from the gpr-object and enter them into a new object called raw. The user may choose between median and mean values; the defaults are the median values (the arguments fg and bg may be omitted if median values are desired). Median values are considered less susceptible to misalignment errors and interfering dust particles, and should be used in the majority of cases.

 

raw <- as.RawData(gpr, fg="median", bg="median")

 

The next step is to decide whether to subtract the local background or not. It is important to point out that the actual subtraction is not carried out until the data transformation step, explained in the next part of this chapter.

 

A discussion on the effects of the subtraction can be found on Henrik Bengtsson’s web page:

 

http://www.maths.lth.se/help/R/aroma/#6.%20Subtracting%20background

 

 

To investigate how even the background is over the slide surface, the imageplot function (from the limma package) can be used. The limma package is loaded together with the kth-package.

 

Draw the imageplot for green channel and red channel, respectively.

 

imageplot(log(raw$Gb[,1],2),layout,low="white",high="green")
imageplot(log(raw$Rb[,1],2),layout,low="white",high="red")
 

 

Figure – Imageplot of the background intensities

The imageplot function is not restricted to background data; it can be used to visualise any other data column from the gpr-files.

 

Version 0.3.14 of the kth-package contains the bgAnalysis() function, which can be used to analyse the background of each slide more easily. This function can create plots of both Cy3 and Cy5 background intensities over the slide surface and of their CV distribution, as well as box plots showing the distribution of background intensity values for each block. For more details see

 

?bgAnalysis

 

If subtraction is carried out, some features (spots) may turn out to have negative or zero values. These are problematic in subsequent analysis steps, because division by zero is not possible and because logarithmic transformation of a negative or zero value is not possible. To get rid of these, we identify features with a negative or zero value in one of the two channels, but not in the other, and set that value to 1. Of course, one has to be aware that this might give very large ratios.
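The adjustment can be sketched in base R as below (filterZero performs this on the raw object; the intensity vectors here are invented):

```r
# Invented red/green intensities after background subtraction
R <- c(1250, -3, 840, 0, 410)
G <- c(980, 55, 120, 230, 395)

# Set negative and zero values to 1 so that ratios and logs stay defined
R[R <= 0] <- 1
G[G <= 0] <- 1

log2(R / G)   # now finite for every feature
```

Note how the second feature, with its intensity forced to 1, produces an extreme ratio; this is exactly the caveat mentioned above.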

 

Before we adjust the negative or zero values, we will have a look at the filterLog option. Later in the analysis it might be of interest to know why a certain feature was removed. To enable this, there is an option to log the results of the different filtering steps in a filterLog object. We first create the filterLog object, and then enter data into it in the subsequent filtering steps.

 

filterLog <- createFilterLog(gpr)  # creates the object

 

The filterLog is an array type of object and can be browsed using standard indexing rules for arrays. An array may be considered a book, where each cell is defined by a combination of a row and column number and finally by a page number. The first number refers to the rows, the second to the columns and the third to the “page”. In the filterLog-object row, column and page correspond to probe, filter and slide, respectively.
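The book analogy corresponds directly to three-index subsetting in R; a toy array (invented dimensions: ten probes, three filters, two slides):

```r
# Toy filter log: 10 probes x 3 filters x 2 slides, TRUE = passed
a <- array(TRUE, dim=c(10, 3, 2))
a[2, 1, 1] <- FALSE   # probe 2 failed filter 1 on slide 1

a[1:5, , 1]   # first five probes, all filters, slide 1
a[2, 1, ]     # probe 2, filter 1, on both slides
```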

 

filterLog[1:10, , 1:2]     # shows all filters for the first ten features on slides 1 and 2 before any filtering is carried out

 

filterLog <- filterZero(raw, logFilter=TRUE, filterLog.obj=filterLog)

 

Note that when using the filterLog option, the results must be assigned to the filterLog object!

If filterLog is not used, then simply

 

filterZero(raw)

 

filterLog[1:10, , 1:2]     # shows all filters for the first ten features on slides 1 and 2 after the filterZero filter is applied

 

Transformation into MA-format

The next step is to transform the (possibly background-subtracted) R and G (red and green, respectively) raw intensities into MA-format. A description of this step is available on Henrik Bengtsson's web page at:

 

http://www.maths.lth.se/help/R/aroma/#5.2.1%20More%20on%20the%20(R,G)%20to%20(M,A)%20transformation

 

At this stage the actual background subtraction is carried out, if the user has chosen to do so. In this case we subtract the background and create a ma-object containing the transformed data (at this point it might be useful to review the naming suggestions for objects explained in the section Suggestion for naming objects in R).

 

ma.bs <- getSignal(raw, bgSubtract=TRUE)
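For reference, the transformation itself is the standard log-ratio/log-intensity pair, M = log2(R/G) and A = 0.5*log2(R*G); a minimal base-R sketch with invented intensities:

```r
# Invented background-subtracted intensities
R <- c(1250, 840, 410)   # red channel (Cy5)
G <- c(980, 420, 820)    # green channel (Cy3)

M <- log2(R / G)         # log2 fold change between the channels
A <- 0.5 * log2(R * G)   # average log2 intensity

M[2]   # log2(840/420) = 1
```

getSignal performs this on the whole raw object at once (after any background subtraction).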

 

We can now have a look at the actual data

 

plot(ma.bs, slide=1)

 

 

Figure – MA-plot

Shows the data after background subtraction, but before any filtering steps. Low intensity artefacts are clearly visible, as well as the bend in the data cloud.

 

or for the first four slides

 

subplots(4)
for(k in 1:4) {
  plot(ma.bs, slide=k)
}

 

Figure – Four MA-plots.

 

 

The data is now ready for filtering and subsequent analysis.


Data pre-processing

In the following part we remove features with unreliable data and then proceed with the normalisation steps.

Filtering of the data

A set of filtering tools is included in the kth-package. Currently there are more than ten tools for filtering on various spot characteristics extracted from the gpr-object. New tools are fairly easy to implement if end-users need them. Send e-mail to the kth-package mailing list if you think a useful filtering tool is missing.

 

The purpose is not to use all of these filter tools, as that would probably lead to over-filtering of the data. There is currently no final recommendation on which filters to apply; instead, the user is encouraged to test them all and identify the best combination of filters for each particular project.

Available tools to investigate the effects of filtering

As guidance for making this decision the user may use the following functions:

 

slideSummary function

This function calculates the number of good and bad features (i.e. those that passed the filters and those that did not). A graphical output is also produced. Note that this function assumes that ALL positions of the printed surface are occupied by probes. If empty positions are present, the number obtained over-estimates the amount of filtering.

 

slideSummary(ma.bs)             # for more arguments type ?slideSummary

 

 

Figure – Example of the output from slideSummary

In this study a very strict filtering is carried out, shown by the fact that >50 % of the features are removed for some slides. This amount of filtering is probably too much in the majority of the cases!

 

Between duplicates correlation

The duplCor function calculates the Pearson, Spearman or Kendall tau correlation between duplicate features on the slide and produces a graphical output. The duplicate features are required to be in two identical fields. The Pearson correlation is calculated by default, but this can be changed using the method argument.

 

duplCor(ma.bs, slideVec=c(1,2,3,4))         # displays correlation for first four slides in the ma.bs-object. For more arguments type ?duplCor
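Under the hood this is an ordinary correlation between the M-values of the two duplicate fields; duplCor adds the field bookkeeping and the plot. A base-R sketch with invented M-values:

```r
# Invented M-values for the two duplicate fields of one slide
field1 <- c(0.10, -1.20, 0.85, 2.10, -0.40)
field2 <- c(0.15, -1.05, 0.90, 1.95, -0.55)

cor(field1, field2)                       # Pearson, the default
cor(field1, field2, method="spearman")    # rank-based alternative
```

A correlation close to 1 indicates that the duplicate fields agree well.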

 

 

Figure – Correlation of M-values between the duplicated fields on the slide.

 

Between slides correlation

The detCor function calculates the Pearson, Spearman or Kendall tau correlation between replicated hybridisations. The Pearson correlation is calculated by default, but this can be changed using the method argument.

 

detCor(ma.bs, slideVec=c(1,2,3,4))    # displays the correlation between the first and second slide, and also between the third and fourth slide of the ma.bs-object. For more arguments type ?detCor

 

 

Figure – Correlation of M-values between the two replicated hybridisations

 

 

Numeric presentation of the filtering results

With the argument graphics=TRUE, the number of remaining features for each slide before and after the filtering is shown. For convenience, it also calculates the number of features in each plot and displays it in the upper left corner, as shown in the example figure below.

 

 

Figure – Results from the filtering step.

n= indicates the number of features in each plot.

 

Available tools to carry out the filtering

The following filtering tools are currently available

  • filterZero
  • filterFlags
  • filterFlags2
  • filterSize
  • filterB2SD
  • filterCircularity
  • filterMtoM
  • filterRatioComp
  • filterControls
  • filterSNR
  • filterSaturated

 

 

To see detailed explanations of their use, see the provided help files by typing ?NameOfFilter, where you replace "NameOfFilter" with the name of the filter function.

 

For the majority of these filters, the following general arguments are accepted in addition to the more filter-specific ones:

 

Argument                    Description
ma.obj                      Name of the object on which to carry out the filtering.
                            Must always be specified.
gpr.obj                     Name of the gpr-object from which to get the required data.
                            Defaults to gpr.obj=gpr.
graphics=TRUE/FALSE         Controls whether MA-plots are drawn.
                            Defaults to graphics=FALSE.
logFilter=TRUE/FALSE        Determines whether the results of the filtering are logged.
filterLog.obj=filterLog     Of interest only if logFilter=TRUE; the name of the
                            filterLog-object. Defaults to filterLog.
filt.A=TRUE/FALSE           Controls whether the A-values (intensity values) are also
                            set to NA during filtering. Defaults to filt.A=FALSE.

 

Examples of usage of filter tools

On the following pages the use of each of these filters is described using examples. Please note that this does not imply that all of these filters should be used for a particular data set. Again, it might be good to review the rules for naming objects.

filterFlags and filterFlags2

In the ma.bs object we now have the background-subtracted fold-change values on the log2 scale (M-values) and the total intensity values, also on the log2 scale (A-values). We have 30 000 measurements for each slide; however, many of these were flagged by GenePix as unreliable. Using the first filter, filterFlags, we remove these features by setting their M-value to NA (not available). Note that by default their A-values are kept unchanged. Before filtering, however, we create a copy of the ma.bs-object, which makes it possible to go back if necessary.

 

ma.bs.ff <- clone(ma.bs)
filterLog <- filterFlags(ma.obj=ma.bs.ff, gpr.obj=gpr, logFilter=TRUE, 
+ filterLog.obj=filterLog, graphics=TRUE)

 

As this is the first example, all the arguments are defined in the formal way of ARGUMENT=VALUE. If no ARGUMENT= is specified, R assumes that the VALUEs are entered in the same order as in the formal usage definition of the function. This definition can be found under the USAGE heading in each help file. The command above could have been given in a shorter form as

 

filterLog <- filterFlags(ma.bs.ff, 0, gpr, TRUE, TRUE, filterLog, TRUE)

 

Please note that there are other ARGUMENTS that were not used above.

 

When the logFilter-argument is set to TRUE, the results of the filtering must be assigned to a new object, and that object should always be the same as the filterLog.obj, i.e. filterLog in this case.

 

The filterFlags2 function closely resembles filterFlags, but uses the Flag values differently. In filterFlags, every feature with a flag value less than the threshold (flag.level=0 by default) is removed, while filterFlags2 removes features with a specific Flag value (the flag.rem argument).

 

filterSize

Features can also be accepted or rejected on the basis of their physical spot size. The filterSize filter uses the feature diameter stored in gpr$Dia. to remove features outside the accepted boundaries. By default, accepted feature sizes are everything between and including 61 and 159 µm, but these boundaries should be verified for each batch of arrays. The sizes reported by GenePix come in 10-µm intervals, i.e. 60, 70, 80, …, n (assuming the 10-µm scanner resolution is used).

 

In the example below, where we remove features outside the acceptance range, we again choose to log the results in the filterLog and to keep the A-values.

 

ma.bs.ff.fsiz <- clone(ma.bs.ff)
filterLog <- filterSize(ma.bs.ff.fsiz, filt.level=c(71,139), 
+ logFilter=TRUE, graphics=TRUE)

 

filterB2SD

Features with too low a signal intensity compared with the local background may also be filtered out. filterB2SD compares the foreground and background signal intensities and rejects features for which less than a certain percentage of foreground pixels in both channels (the default) is above background + 2 SD. A modification is available that uses the coefficient of variation of the background to determine whether the mean(background) + 2 SD(background) term is driven by an extremely large SD component caused by a dust particle in the background region. [The user may choose background + 1 SD as the cut-off instead of the default background + 2 SD.]

 

For the majority of projects a 70% threshold is commonly used, but again the user should verify this using their own data set.

 

ma.bs.ff.fsiz.ab70 <- clone(ma.bs.ff.fsiz)
filterLog <- filterB2SD(ma.bs.ff.fsiz.ab70, filt.level=70, 
+ logFilter=TRUE, graphics=TRUE)

 

filterCircularity

In GenePix 5.0 there are two alternative methods for identifying features. The first, classical approach uses an adaptive circle method that adjusts the diameter of the circle to match the spot. The second uses a growing-seed approach, in which a central pixel is identified and the region is then allowed to grow to match the size of the spot; growth is achieved by comparing the intensity of each pixel to its neighbours. For the adaptive circle method (the default) the circularity is by definition always 100% and this filter cannot be applied. For the growing-seed approach the circularity is calculated and may be used as a filtering tool: features which are less circular than a certain user-defined threshold value are removed.

 

As the data in this example is obtained through the adaptive circle gridding, no filtering can be carried out using this tool. For an example of the usage, please see the help documentation (?filterCircularity).

 

filterMtoM

For a good-quality feature one would expect the median and the mean of the foreground signal in a given channel to be close to each other. A deviation between these two measurements of more than a certain percentage may therefore be used as a filtering criterion. A threshold value of 15% average deviation has been found to be appropriate for the majority of projects.
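The criterion can be illustrated with a few lines of plain R (a simplified sketch of the idea using made-up pixel values, not the actual filterMtoM implementation; the exact percentage definition is an assumption):

```r
# Hypothetical foreground pixel intensities for one feature, one channel.
pixels <- c(900, 950, 1000, 1020, 1050, 4000)  # one bright outlier pixel

# Percentage deviation between mean and median of the foreground signal.
dev.pct <- abs(mean(pixels) - median(pixels)) / mean(pixels) * 100
dev.pct > 15   # TRUE here: the feature would be rejected at the 15% level
```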

 

ma.bs.ff.fsiz.ab70.fmm <- clone(ma.bs.ff.fsiz.ab70)
filterLog <- filterMtoM(ma.bs.ff.fsiz.ab70.fmm, filt.level=15, 
+ logFilter=TRUE, graphics=TRUE)

 

Note! In my opinion this filter may be too strict at the 15% level and caution should therefore be used when applying this filter!

 

filterRatioComp

GenePix estimates the ratio between the red and green channels in five different ways:

  • ratio of medians (for each channel calculate the median of the individual pixels, then calculate the ratio of these)
  • ratio of means (for each channel calculate the mean of the individual pixels, then calculate the ratio)
  • median of ratios (calculate ratio on a pixel-by-pixel basis, then calculate the median of the ratios)
  • mean of ratios (calculate ratio on a pixel-by-pixel basis, then calculate the mean of the ratios)
  • regression ratio (carry out a regression of red channel intensity vs. green channel intensity and take the slope of the regression line)
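The five definitions can be sketched for a single feature using made-up pixel values (an illustration of the list above, not GenePix code):

```r
# Hypothetical pixel intensities for one feature in the two channels.
red   <- c(1200, 1350, 1100, 1500, 1250)
green <- c( 600,  700,  580,  760,  640)

median(red) / median(green)    # 1. ratio of medians
mean(red)   / mean(green)      # 2. ratio of means
median(red / green)            # 3. median of ratios
mean(red / green)              # 4. mean of ratios
coef(lm(red ~ green))[2]       # 5. regression ratio (slope of red vs. green)
```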

 

For a good-quality feature the ratio estimate should be similar irrespective of the method used to calculate it. If two of the estimates deviate too much from each other, the feature may be removed. Currently, comparisons are implemented between estimates 1 and 5 and between estimates 1 and 3 in the list above.

The filt.comp argument defines which ratio estimates are compared: the value 1 compares the "Ratio of Medians" to the "Regression Ratio", while the value 2 compares the "Median of Ratios" to the "Ratio of Medians".

 

ma.bs.ff.fsiz.ab70.fmm.frc20 <- clone(ma.bs.ff.fsiz.ab70.fmm)
filterLog <- filterRatioComp(ma.bs.ff.fsiz.ab70.fmm.frc20, 
+ filt.level=20, filt.comp=1, logFilter=TRUE, graphics=TRUE)

 

Be aware that this filter may kick out good-quality features when filt.comp=1, but may not be strict enough when filt.comp=2.

 

filterSaturated

This filtering tool is based on the intensity level of the feature’s foreground. The TIFF images generated by the scanners are generally 16-bit, which gives 2^16 = 65536 intensity levels and thus a maximum value of 65535. The usable maximum is scanner specific: our Agilent scanner saturates at approximately 65180, depending on the initial slide background adjustment carried out automatically by the scanner software. Signals at this level do not accurately reflect the amount of target bound to the immobilised probe and should therefore be removed. In cases where only one of the channels is saturated, the obtained ratio might still indicate differential expression, although the absolute value of the ratio is incorrect. The user has the option to decide whether to keep or remove such features.

The "saturation level" can also be changed by the user, which makes it possible to use this tool to remove data above any arbitrary threshold value.

 

ma.bs.ff.fsiz.ab70.fmm.frc20.fsat <- clone(ma.bs.ff.fsiz.ab70.fmm.frc20)
filterLog <- filterSaturated(ma.bs.ff.fsiz.ab70.fmm.frc20.fsat, 
+ logFilter=TRUE, graphics=TRUE)

 

When the filtration process is done, it might be good to rename the ma-object to something more appropriate, such as ma.filt

 

ma.filt <- clone(ma.bs.ff.fsiz.ab70.fmm.frc20.fsat)

 

which is then used in the subsequent normalisation step


Within-slide normalisation

Before the identification of differentially expressed genes can be started, the intensities of the two channels have to be balanced against each other; otherwise the calculated ratios cannot be used. The classical approach has been to scale the sums of the two channels to be equal, but it has been noted that this does not remove some of the dye-dependent artefacts present in the data, which are easily visualised using, for example, an MA-plot.

Alternative methods for normalisation are the lowess and the print-tip lowess methods. More on these can be read in

 

Yang YH et al. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 2002 Feb 15;30(4):e15

 

You should read the article!

 

These normalisation methods are implemented in the aroma package

 

Normalisation method       Argument
None                       "n"
Median (classical)         "m"
Lowess                     "l"
Print-tip lowess           "p"
Scaled print-tip lowess    "s"

 

To determine which method is the most appropriate there are two main plots that can be investigated

Box plot of individual print-tip groups for each slide

 

boxplot(ma.filt, groupBy="printtip", slide=1)

 

for more help

?MicroarrayData.boxplot

 

Figure – Box plot for slide 1

 

At this point it might be useful to read some more about the box-and-whisker plot (the information can be found in any general statistics text book or http://mathworld.wolfram.com/Box-and-WhiskerPlot.html).

 

Each print-tip group (block, grid) is represented by one box plot. If all box plots scatter around the same y-value, a standard lowess normalisation is appropriate. If they instead have clearly different medians and spreads, then a print-tip lowess normalisation is the most appropriate way to go.

 

To visualise all nine slides in the ma.filt-object

 

windows()
subplots(9)
for(k in 1:9)
boxplot(ma.filt, groupBy="printtip", slide=k)

 

 

 

Figure – Nine box plots

 

The second plot that might be useful is plotSpatial

 

windows()
subplots(9)
for(k in 1:9)
plotSpatial(ma.filt, slide=k)

 

 

Figure – Nine spatial plots

 

These plots show the M-values at the different locations of the slides and indicate their values using colours. An even colour distribution suggests that the lowess normalisation is appropriate, while an uneven distribution suggests print-tip normalisation.

 

Normalisation is carried out using

 

ma.filt.nm <- clone(ma.filt)
ma.filt.nl <- clone(ma.filt)
ma.filt.np <- clone(ma.filt)

normalizeWithinSlide(ma.filt.nm, method="m")    # median normalisation

normalizeWithinSlide(ma.filt.nl, method="l")    # lowess normalisation

normalizeWithinSlide(ma.filt.np, method="p")    # print-tip lowess normalisation

 

If unsure, one can test both lowess and print-tip lowess normalisation and check the effects on the within-slide duplicate and between-slides correlations using the duplCor and detCor functions (see above).

 

Between-slides normalisation

For each comparison in an experiment one can expect the range of the M-value distribution to be approximately the same for all slides. When the ranges are clearly different, this may be considered an effect of labelling and/or hybridisation, i.e. a technical artefact. Such effects may be removed by rescaling the M-value distributions using the robust deviation measure median absolute deviation (MAD). This algorithm is available in the aroma function normalizeAcrossSlides.
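The principle of the MAD rescaling can be sketched in plain R (a simplified illustration, not the aroma implementation; the choice of the geometric mean of the MADs as the common target is an assumption):

```r
# Two hypothetical slides with clearly different M-value spreads.
set.seed(1)
M <- cbind(rnorm(1000, sd=0.5), rnorm(1000, sd=1.5))

# Rescale each slide so that all MADs equal a common target value.
mads   <- apply(M, 2, mad, na.rm=TRUE)
target <- exp(mean(log(mads)))                # geometric mean of the MADs
M.norm <- sweep(M, 2, target / mads, FUN="*")
apply(M.norm, 2, mad)                         # now approximately equal
```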

 

Additional help is available through the help page for normalizeAcrossSlides:

?MAData.normalizeAcrossSlides

 

and a detailed description of the MAD scaling is available through the help pages

?mad

 

Boxplots of M-values for each slide can be investigated to find out whether there is a need for between-slides normalisation or not:

 

boxplot(ma.filt.np)

 

Again, clone the object before normalisation

 

ma.filt.np.nb <- clone(ma.filt.np)
normalizeAcrossSlides(ma.filt.np.nb)
boxplot(ma.filt.np.nb)

 

Figure – Box plots before and after the between-slides normalisation.

In the first plot the distributions of the M-values are clearly different for the nine slides; after the between-slides normalisation the distributions are much more similar. In this particular case, however, it is of no interest to carry out a between-slides normalisation, as we are comparing the effects of different blockers and priming methods.


Identification of differentially expressed genes

We are finally entering the more interesting stuff, which is to find out which genes are (statistically) significantly differentially expressed (DE) between two or more samples.

 

In the beginning of the microarray era, arbitrary fold-change cut-offs were used to identify DE genes; for example, a two-fold up- or down-regulation might have been considered DE. Today, however, for a gene to be considered DE it has to be identified using a statistical test, such as the t-test, SAM or the B-test, with the results corrected for multiple testing. This takes into account the higher variability of the expression measurements of lowly abundant genes and, furthermore, enables the identification of genes, such as transcription factors, for which a smaller fold-change may have profound biological effects.

 

In this User’s guide we will be considering the B-test.

 

Reference to SAM paper:

Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 2001 Apr 24;98(9):5116-21

 

The B-test

The B-test is an empirical Bayes approach for identification of DE genes. The details of this have been published in several works, and the user is encouraged to read these to learn about the statistical theory underlying the use of this method:

 

  • Lönnstedt, I. and Speed, T. (2002). Replicated microarray data. Statistica Sinica 12, 31-46

http://www.stat.sinica.edu.tw/statistica/j12n1/j12n12/j12n12.htm

  • Smyth, G. K., Michaud, J., and Scott, H. (2003). The use of within-array replicate spots for assessing differential expression in microarray experiments. In preparation

http://www.statsci.org/smyth/pubs/dupcor.pdf

  • Smyth, G. K. (2003). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. In preparation

http://www.statsci.org/smyth/pubs/ebayes.pdf

 

The B-test ranks the probes in order of evidence for differential expression: the higher the B-value, the more probable it is that the gene is differentially expressed. The actual cut-off for differential expression is difficult to determine and depends, among other things, on the number of slides and on the a priori assumption of the proportion of differentially expressed genes (the default assumption is 0.01). The main use of the B-test is therefore to rank the genes and then continue with as many genes as is feasible to work with, starting from the top of the ranking list.

 

At this stage we limit the discussion to simple two-sample comparisons. Later on in the case-study section we introduce more advanced comparisons. The basic idea is the same for comparisons consisting of more than two samples, but the design object and the interpretation differs somewhat.

 

All the steps in the B-test use functions from the LIMMA package. This package is automatically loaded together with the kth-package. There is an excellent User’s Guide to LIMMA, which is available on your computer if you have installed LIMMA. In R, type

 

help.start()

 

Choose packages -> choose LIMMA -> browse directories and there you have it.

You should definitely read chapters 7-9.

 

The aim of the examples is to show how to identify DE genes, not to explain the underlying statistics. For more information, see the references cited above.

 

We start with a general discussion and a few examples on how to carry out the identification of differentially expressed genes using the B-test. Later on we look at some case-studies in more detail (currently only one case-study is included).

 

The first step is to create a matrix of M-values, in which the rows represent features and the columns represent slides. If some slides in the ma-object are not part of the comparison of interest, they can be excluded using standard indexing rules (see the example below).

 

Ms <- ma.filt.np$M[,]             # includes all slides

 

Example with only five slides included:

 

Ms <- ma.filt.np$M[,c(1,2,3,6,7)]  # we will continue to work with this one

 

NB!

The matrix may now contain also features for which there is no or only one valid measurement. Please see the discussion below regarding the strange tailing phenomenon for more information and how to deal with these.

 

Estimation of the within-slide duplicate spot correlation

Next, estimate the within-slide correlation. This step can be skipped if there are no duplicate spots on the arrays. The design argument is a vector of the same length as the number of slides in the study and indicates which slides should be dye-swapped. In this case every second slide should be dye-swapped, and these slides are therefore given the value -1. (In cases where several comparisons are carried out, for example samples A, B and C against a common reference, the design object is a matrix containing the three values -1, 0 and 1. An example is given below in the How to create a design object for reference design experiments with multiple comparisons section.) The ndups argument determines how many duplicates there are of each feature, and the spacing argument determines their distance in the gpr-file; for two consecutive spots the spacing value is 1.

 

This computation may take a long time depending on the number of slides and your computer hardware.

 

cor <- duplicateCorrelation(Ms, design=c(1,-1,1,-1,1), ndups=2, 
+ spacing=15000) 

 

Estimation of the M-value for each probe

Next we estimate the M-value for each probe using generalised least squares, taking into account the higher correlation between the within-slide duplicates. (This should only be used if there are within-slide duplicates; otherwise use the simpler lm.series function below.)

 

fit <- gls.series(Ms, design=c(1,-1,1,-1,1), ndups=2, 
+ spacing=15000, correlation=cor$cor)

 

If no within-slide duplicates are present, then use

 

fit <- lm.series(Ms, design=c(1,-1,1,-1,1))

 

We continue by calculating the log odds of differential expression using the ebayes function

 

eb <- ebayes(fit)

 

The estimated M-values are now in the fit$coef vector, while the B-values are in the eb$lods vector.

The M-values for the first 50 probes are

 

fit$coef[1:50]

 

and the B-values for the corresponding genes are

 

eb$lods[1:50]

 

We can also have a look at the volcano plot, i.e. B vs. M

 

plot(x=fit$coef, y=eb$lods)

 

 

Figure – Volcano-plot

In this figure the top 500 genes are coloured in red.

Multiple testing and toptables

With many probes, for example 30000, multiple testing becomes an issue we have to deal with; otherwise the probability of false positives (genes that by chance appear statistically significantly differentially expressed) becomes too high. We therefore use the false discovery rate (FDR) to control for multiple testing. Alternative ways of controlling this can be found by typing ?p.adjust.
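The idea can be tried out directly with R's built-in p.adjust function (made-up p-values for five hypothetical probes, shown purely for illustration):

```r
# Raw p-values for five hypothetical probes.
p.raw <- c(0.0001, 0.001, 0.01, 0.04, 0.2)

p.adjust(p.raw, method="fdr")    # Benjamini-Hochberg false discovery rate
p.adjust(p.raw, method="holm")   # a more conservative alternative
```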

 

Using the B-values in the eb$lods vector we rank the probes, calculate p-values corrected for multiple testing and display the results in a table format. This is done using the toptable function of the LIMMA package.

The genelist argument determines the probe names, which in this case are extracted from the gpr-object. One could instead choose gpr$ID[,1], or both by creating a matrix with cbind(gpr$ID[,1], gpr$Name[,1]).

 

We start by identifying the top 100 probes and enter the results into a tt.100 table

 

tt.100 <- toptable(number=100, genelist=gpr$Name[,1], fit=fit, eb=eb, 
+ adjust="fdr")

 

To visualise the results type

 

tt.100

 

It is also good to make a complete toptable.

 

tt.all <- toptable(number=30000, genelist=gpr$Name[,1], fit=fit, eb=eb, 
+ adjust="fdr")

 

It might also be useful to export the big toptable into a tab-separated text-file, which is easily read into Excel.

 

exportTopTable(tt.all, file="tt.all.txt")
 

Additional things we can do with the toptables

  • Extract the index values of a certain set of probes (the index value of a gene is equivalent to the row number the gene occupies in the input data)

The index values of the probes in the toptable are stored as row names. These index values are useful to have, as they may be used, for example, to colour the corresponding spots in an MA-plot.

To extract the index values for the top 50 genes in the tt.all toptable object the following command can be used

 

idx <- as.numeric(rownames(tt.all)[1:50])
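These index values can then be used to highlight the corresponding probes, for example in a plain M vs. A scatter plot drawn with base graphics (a sketch only: the As object holding the A-values in the same row order as Ms is an assumption, and the aroma plot functions offer their own highlighting mechanisms):

```r
# Sketch: mark the top-ranked probes in red on a basic M vs. A plot.
# Assumes As holds A-values with the same row order as Ms (hypothetical).
plot(As[, 1], Ms[, 1], pch=".", xlab="A", ylab="M")
points(As[idx, 1], Ms[idx, 1], col="red", pch=19)
```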

 

  • Extract the data from a toptable using a vector of index values

 

The index values of some interesting genes (perhaps your 140 favourite genes) are assumed to be stored in the idx object. To find these probes in the toptable and create a new toptable-like object of these we can use the extractFromTopTable function:

 

new.tt <- extractFromTopTable(toptable=tt.all, idxVec=idx)

 

This new “toptable” can be treated as a regular toptable, for example exported using the exportTopTable function.

 

  • Create a HTML-document of the toptable and add links to either UniGene or LocusLink.

To be able to use this function your array must be within the list of arrays for which an annotImage-object has been created. Currently the only arrays with this annotation information are those of the Mouse Stem Cell project, for which the annotImage file can be found in /Arkivet/Molbio/Microarray technologies/R package for data analysis/dataImages of annotations/.

The function requires an object with IDs in one of its columns (column 1 by default), corresponding to IDs in the annotImage-object. The user also has the option to choose which columns to export. The produced file may then be opened in any web browser.

 

load("annotImage") # in some cases a saveLoadReference object is created instead; if this happens, assign the saveLoadReference object to annotImage
makeHtmlLL(object=tt.100, colVecTT=1:5, colVecAnnot=3:10, fileName="Interferon.html", dataImage.obj=annotImage)

 

For more information on how to create an annotImage file for your project, read the following description

 

http://biobase.biotech.kth.se/~valtteri/R/AnnotImageReadme.txt

 

Strange tailing phenomenon

In some cases a strange tailing phenomenon is observed in the volcano plot, as shown in red in the figure below. This is caused by probes with only one valid (non-NA) measurement.

 

Figure – Strange tailing phenomenon

 

For probes with only one valid measurement there are no degrees of freedom left to estimate the variance, and the B-value becomes proportional to the M-value.

The results obtained for probes with only a few valid measurements are probably not trustworthy, and we therefore suggest that these are removed from the data set. This can be done on the matrix of M-values generated above.
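To see how many probes are affected before filtering, the valid (non-NA) measurements per probe can be counted with plain R:

```r
# Number of valid (non-NA) measurements per probe across the slides.
n.valid <- rowSums(!is.na(Ms))
table(n.valid)       # distribution of valid-measurement counts
sum(n.valid <= 2)    # probes that would be removed at filt.level=2
```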

 

Ms <- filterOnNAs(Ms, filt.level=2, out.obj="filtered")

 

Here we choose to remove all features with two or fewer valid measurements. The results are returned as a filtered matrix of M-values and assigned to the same object as the input object thereby replacing it. For alternative uses of this function please see the help documentation.

 

This Ms matrix is used in the same way as explained above.

 

The filtering can also be carried out directly on the ma-objects, i.e. before the Ms-matrix is created. In this case the filterNbrOK function should be used. The argument filt.level determines how many non-valid measurements are accepted.

 

ma.filt.np.Nbrok2 <- clone(ma.filt.np)
ma.filt.np.Nbrok2 <- filterNbrOK(ma.filt.np.Nbrok2, filt.level=2)

 

This new ma-object may then be used as above in the identification of DE genes.

 

How to create a big matrix of M-values starting from M-value matrices in different workspaces

In cases where there is a large number of slides, not all the filtering and analysis steps can be done in the same workspace due to memory limits. The different comparisons are then carried out separately, but need to be imported into one final workspace where the identification of differentially expressed genes is carried out.

 

In this example we consider three separate comparisons, each carried out in a separate workspace. The aim is to import M-value matrices from each of these workspaces into a common fourth workspace, where the identification of differentially expressed genes is carried out.

 

In each workspace, give the M-value matrix a unique name and save it

 

Ms.compA <- Ms                        # for the A-comparison

save(Ms.compA, file="Ms.compA.RData") # saves the object into the current working directory

 

Transfer the .RData file to the working directory of the common workspace, then

 

load("Ms.compA.RData")   # imports the Ms.compA object and assigns it to an object with the same name as in the original workspace

 

Repeat the three steps above for each of the comparisons. Remember to replace compA with, for example, compB and compC.

 

Use the cbind function to make a new big matrix of M-values

 

Ms.all <- cbind(Ms.compA, Ms.compB, Ms.compC)

 

This will concatenate all the original M-value matrices into one large matrix. The order of the columns is determined by the order in which the matrices are given to cbind, and the column names are retained.
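Since cbind joins the matrices purely by position, it may be worth verifying that all matrices have their probes in the same row order before concatenating (a simple sanity check in plain R, not a kth-package function):

```r
# Stop with an error if the probe (row) order differs between matrices.
stopifnot(identical(rownames(Ms.compA), rownames(Ms.compB)),
          identical(rownames(Ms.compA), rownames(Ms.compC)))
```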

 

 

How to create a design object for reference design experiments with multiple comparisons

For more advanced comparisons (e.g. three samples compared to a common reference), the design argument used in the B-test functions above is a matrix of -1, 0 and 1 values. This matrix can be created either manually or using the modelMatrix function in the LIMMA package. For manual creation, consult your local statistician.

 

If using the modelMatrix function, first import the tab-separated targets file described earlier (this file must contain columns labelled Cy3 and Cy5, but may include additional columns as the user wishes)

 

targets <- readTargets("targets.txt")

 

Then create the modelMatrix, but remember to define the reference sample

 

design <- modelMatrix(targets, ref="ref")

 

 

Case study 1: Reference study where four samples are compared to a common reference

 

Experimental design: In this example we consider a case where we have two groups of animals
(genotype 1 and genotype 2), and each group is divided into treatment A and treatment B animals as shown in the figure below. Each experimental group (i.e. combination of genotype and treatment) contains 5 individual animals. The gene expression levels for each animal are compared to the common reference using a dye-swap design. This design requires 10 hybridisations per sample to reference comparison, and with four samples adds up to 40 slides.

A design like this contains replicates at different levels, starting from the five individual animals constituting the biological replicates, followed by technical replicates (dye-swap hybridisations; two per animal in this case). Some arrays, depending on the design, contain probes in duplicate, which renders an additional level of replicates. Further levels of replicates would be repeated RNA extractions or RNA amplifications but these are not included in this experiment.

It is important to recognise that the different levels of replicates cannot be treated as equal in the statistical tests used. This becomes clearly evident when the correlation coefficients between different replicate levels are analysed: the within-slide duplicates, for example, always correlate much better than, say, one animal with another, and their contribution should therefore be downweighted in the final analysis.

 

 

In this example we treat the dye-swap replicates as equal to the biological replicates, while the duplicate spots are handled using the duplicateCorrelation function. We start by reading in the targets file

 

targets <- readTargets("targets.txt")
targets

 

   Barcode      Name  Cy3  Cy5
1  1111111   Slide 1   A1  ref
2  1111112   Slide 2  ref   A1
3  1111113   Slide 3   A1  ref
4  1111114   Slide 4  ref   A1
5  1111115   Slide 5   A1  ref
6  1111116   Slide 6  ref   A1
7  1111117   Slide 7   A1  ref
8  1111118   Slide 8  ref   A1
9  1111119   Slide 9   A1  ref
10 1111120  Slide 10  ref   A1
11 1111121  Slide 11   A2  ref
12 1111122  Slide 12  ref   A2
13 1111123  Slide 13   A2  ref
14 1111124  Slide 14  ref   A2
15 1111125  Slide 15   A2  ref
16 1111126  Slide 16  ref   A2
17 1111127  Slide 17   A2  ref
18 1111128  Slide 18  ref   A2
19 1111129  Slide 19   A2  ref
20 1111130  Slide 20  ref   A2
21 1111131  Slide 21   B1  ref
22 1111132  Slide 22  ref   B1
23 1111133  Slide 23   B1  ref
24 1111134  Slide 24  ref   B1
25 1111135  Slide 25   B1  ref
26 1111136  Slide 26  ref   B1
27 1111137  Slide 27   B1  ref
28 1111138  Slide 28  ref   B1
29 1111139  Slide 29   B1  ref
30 1111140  Slide 30  ref   B1
31 1111141  Slide 31   B2  ref
32 1111142  Slide 32  ref   B2
33 1111143  Slide 33   B2  ref
34 1111144  Slide 34  ref   B2
35 1111145  Slide 35   B2  ref
36 1111146  Slide 36  ref   B2
37 1111147  Slide 37   B2  ref
38 1111148  Slide 38  ref   B2
39 1111149  Slide 39   B2  ref
40 1111150  Slide 40  ref   B2

 

We continue by creating the design object, a matrix in which the columns indicate which comparison each slide belongs to. A value of -1 indicates that the slide will be treated as dye-swapped in the analysis, i.e. its M-values are sign-flipped so that effectively the sample is in the red Cy5 channel and the reference in the green Cy3 channel. The name of the reference sample must be given in the function call.

 

 

design <- modelMatrix(targets,ref="ref")

design

 

   A1 A2 B1 B2
1  -1  0  0  0
2   1  0  0  0
3  -1  0  0  0
4   1  0  0  0
5  -1  0  0  0
6   1  0  0  0
7  -1  0  0  0
8   1  0  0  0
9  -1  0  0  0
10  1  0  0  0
11  0 -1  0  0
12  0  1  0  0
13  0 -1  0  0
14  0  1  0  0
15  0 -1  0  0
16  0  1  0  0
17  0 -1  0  0
18  0  1  0  0
19  0 -1  0  0
20  0  1  0  0
21  0  0 -1  0
22  0  0  1  0
23  0  0 -1  0
24  0  0  1  0
25  0  0 -1  0
26  0  0  1  0
27  0  0 -1  0
28  0  0  1  0
29  0  0 -1  0
30  0  0  1  0
31  0  0  0 -1
32  0  0  0  1
33  0  0  0 -1
34  0  0  0  1
35  0  0  0 -1
36  0  0  0  1
37  0  0  0 -1
38  0  0  0  1
39  0  0  0 -1
40  0  0  0  1

 

The modelMatrix function has identified four samples in the targets file: A1, A2, B1 and B2. Each sample is represented by a column in the design object and each slide by a row; in this case the first slide, for example, belongs to the reference vs. A1 comparison and will be dye-swapped (indicated by the minus sign).

 

Next we calculate the within-slide duplicate spot correlation using the duplicateCorrelation function.

 

cor <- duplicateCorrelation(Ms, design=design, ndups=2, spacing=15000)

 

where Ms is the matrix of M-values, design is the object created above, ndups indicates the number of duplicate spots on each array and spacing indicates the distance between duplicate spots in the gpr-file.

 

The cor object contains a list of all individual gene-wise correlations (cor$cor.genes) and a consensus value, a trimmed average computed on the atanh-transformed scale. The consensus value can be extracted using

 

cor$cor 

 

and the gene-wise correlations visualised using for example a boxplot

 

boxplot(cor$cor.genes)
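The consensus value can also be sketched by hand: transform the gene-wise correlations with atanh, take a trimmed mean, and transform back with tanh (the trim fraction of 0.15 used here is an assumption for illustration):

```r
# Sketch of the consensus correlation calculation (trim fraction assumed).
r <- cor$cor.genes[is.finite(cor$cor.genes)]
tanh(mean(atanh(r), trim=0.15))   # should be close to cor$cor
```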

 

The cor$cor value is used in the next step, where we calculate the estimates of the M-values for each gene using generalised least squares (GLS) and the gls.series function. The arguments are the same as for duplicateCorrelation, with cor$cor supplied as the correlation argument

 

fit <- gls.series(Ms, design=design, ndups=2, spacing=15000, correlation=cor$cor)

 

The fit object contains five fields of data (directly extracted from the LIMMA gls.series help file):

 

coefficients: numeric matrix containing the estimated coefficients for each linear model. Same number of rows as 'M', same number of columns as 'design'.

 

stdev.unscaled: numeric matrix conformal with 'coef' containing the unscaled standard deviations for the coefficient estimators. The standard errors are given by 'stdev.unscaled * sigma'.

 

sigma: numeric vector containing the residual standard deviation for each gene.

 

df.residual: numeric vector giving the degrees of freedom corresponding to 'sigma'.

 

correlation: inter-duplicate correlation.

 

The M-value estimates can be extracted in the following way:

 

fit$coefficients[1:10, ]

 

which shows the M-values for the first ten genes in all comparisons.

 

Often it is of little or no interest to know the fold-change difference between the reference sample and a sample of interest. Instead, the main interest is the difference between the samples themselves, in this case for example between A1 and A2, or between A1 and B1. To do this we create a contrast matrix in which we define which comparisons are of interest

 

design.cont <- makeContrasts("A1-A2", "B1-B2", "A1-B1", "A2-B2", levels=design)
design.cont

 

       A1-A2  B1-B2  A1-B1  A2-B2
A1         1      0      1      0
A2        -1      0      0      1
B1         0      1     -1      0
B2         0     -1      0     -1

Note that in this case we are not interested in all possible combinations and exclude "A1-B2" and "A2-B1".

This is of course dependent on your biological question of interest! Next we use this contrast matrix to calculate the new M-values, as well as degrees of freedom, sigma etc, corresponding to our parametrisation

 

fit2 <- contrasts.fit(fit, design.cont)
eb <- ebayes(fit2)

 

The ebayes function calculates the log-odds of differential expression, a moderated t-statistic and a p-value for each gene (adjustment for multiple testing is applied later in toptable; see the section describing multiple testing). The following values are calculated and stored in the eb-object (extracted from the LIMMA help file for ebayes):

 

t: numeric vector or matrix of penalized t-statistics

 

p.value: numeric vector of p-values corresponding to the t-statistics

 

s2.prior: estimated prior value for 'sigma^2'

 

df.prior: degrees of freedom associated with 's2.prior'

 

s2.post: vector giving the posterior values for 'sigma^2'

 

lods: numeric vector or matrix giving the log-odds of differential expression

 

var.prior: estimated prior value for the variance of the log2-fold-change for differentially expressed gene

 

The calculated B-values can be extracted using

 

eb$lods[1:5,]

 

which shows the B-values for the first five probes in all comparisons.
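The B-value is a log-odds, so it can be converted into the posterior probability of differential expression as exp(B)/(1+exp(B)). A small self-contained sketch with made-up B-values:

```r
B <- c(-4.6, -1.2, 0, 2.3)      # example B-values (log-odds of DE)
p <- exp(B) / (1 + exp(B))      # posterior probability of DE
round(p, 2)                     # B = 0 corresponds to a probability of 0.5
```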

 

Finally, use the toptable function to summarise the results and pick out the top candidate genes.

 

toptable(fit=fit2, coef=1, number=10, genelist=gpr$Name[,1],  eb=eb, adjust.method="holm",sort.by="B")

 

The number argument can be increased to include more genes. The coef argument indicates which comparison is of interest (numbered in the order given to the makeContrasts function).
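It is often convenient to store the table in an object for later use. A sketch, assuming the fit2 and eb objects created above; the object name tt.a1b1 is just an example, following the naming suggestions given later in this Guide:

```r
# store the top 100 genes for the third contrast ("A1-B1") for later use
tt.a1b1 <- toptable(fit=fit2, coef=3, number=100, genelist=gpr$Name[,1],
                    eb=eb, adjust.method="holm", sort.by="B")
```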

 

A positive M-value indicates that a particular gene is more highly expressed in the first sample of your parametrisation, which would be A1 for the "A1-A2" comparison.

 

The results may be visualised, for example, by using a volcano plot. For the "A1-B1" comparison this would be done using

 

plot(x=fit2$coef[,3], y=eb$lods[,3])

 

The number in the brackets indicates which parametrisation is of interest (see makeContrasts function above).
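To make the volcano plot more informative, the genes above some cut-off can be highlighted. A sketch assuming the fit2 and eb objects created above; the cut-off B > 0 (odds of DE above 1:1) is only an example:

```r
plot(x=fit2$coef[,3], y=eb$lods[,3], pch=20,
     xlab="M (log2 fold change)", ylab="B (log-odds of DE)")
sel <- eb$lods[,3] > 0                        # genes above the cut-off
points(fit2$coef[sel,3], eb$lods[sel,3], col=2, pch=20)   # marked in red
```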

 


How to visualise the results

There are several ways to visualise the results. This section describes the common figure/plot types and shows how they are created using R, but we start by looking at how to change the small, but important, details of a figure/plot in R.

 

The basics

We start by creating two vectors, x and y, both containing 100 random numbers. We then plot these against each other to get the basic plot shown in Figure Standard plot

 

x <- rnorm(100,4)
y <- rnorm(100)
plot(x, y)

 

Figure – Standard plot

The standard plot of y vs. x using default values. No fancy things added to this one.

 

We continue by looking at the different components of a figure and show how they can be changed to obtain a much nicer plot. Using the four steps below, the same data is visualised in a completely different way. The first step plots the data and adds colours, the second adds the titles, the third adds the line and the fourth is included to show that text can be added inside the plot if required.

 

plot(x,y, ann=FALSE, col=3, pch=20, col.axis=4, col.lab=4, cex=1.5, 
+ cex.axis=1.5, bty="n")

 

title(main="Title for the plot", sub="Additional information may be 
+ added here", xlab="Label for x-axis", ylab="Label for y-axis")

 

abline(h=0, lwd=2, lty=6, col=2)

 

text(x=5, y=2, "It is possible to add text here", col=2)

Figure – Modified plot

A nicer version of the same data, plotted using the four commands in the text.


It is also possible to zoom into a certain part of the plot by defining the xlim and ylim arguments in the plot command, as for example in

 

plot(x,y, ann=FALSE, col=3, pch=20, col.axis=4, col.lab=4, cex=1.5, ylim=c(-1,1), xlim=c(3,6))

 

where the x-axis is limited to 3 to 6 and the y-axis to -1 to 1.

 

Figure – Zooming in/out

Only a part of the data is shown.

 

Below is a table of the more useful arguments and a short explanation of what they control. The explanations are modified from the help file for the par function of R. The complete table is available through

 

?par

 

adj         Determines how text strings are justified: 0 gives left-justified text, 0.5 centered text and 1 right-justified text.
ann         If FALSE, do not annotate the plot with axis and overall titles. The default is to annotate.
bty         A character string determining the type of box drawn around the plot. If bty is one of "o", "l", "7", "c", "u" or "]", the resulting box resembles the corresponding upper-case letter. A value of "n" suppresses the box.
cex         A numerical value giving the amount by which plotting text and symbols should be scaled relative to the default.
cex.axis    The magnification for axis annotation relative to the current cex.
cex.lab     The magnification for x and y labels relative to the current cex.
cex.main    The magnification for main titles relative to the current cex.
cex.sub     The magnification for sub-titles relative to the current cex.
col         The default plotting colour.
col.axis    The colour for axis annotation.
col.lab     The colour for x and y labels.
col.main    The colour for plot main titles.
col.sub     The colour for plot sub-titles.
font        An integer specifying the text font. If possible, device drivers arrange so that 1 corresponds to plain text, 2 to bold face, 3 to italic and 4 to bold italic.
font.axis   The font for axis annotation.
font.lab    The font for x and y labels.
font.main   The font for plot main titles.
font.sub    The font for plot sub-titles.
las         The style of axis labels, numeric in {0,1,2,3}: 0 always parallel to the axis (the default), 1 always horizontal, 2 always perpendicular to the axis, 3 always vertical.
lty         The line type, specified as an integer: 0=blank, 1=solid, 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash.
lwd         The line width, a positive number, defaulting to 1.
pch         Either an integer specifying a symbol or a single character to be used as the default in plotting points.
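Most of these options can also be set globally with the par function, so that they apply to all subsequent plots instead of being repeated in every plot call. A small self-contained sketch (the chosen values are arbitrary examples):

```r
par(las=1, cex.axis=1.2, font.lab=2)   # horizontal axis labels, larger axis
                                       # text and bold axis labels for all
                                       # plots that follow
plot(rnorm(50), rnorm(50), xlab="x", ylab="y")
```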

 

More information on the plotting capabilities of R can be found in the excellent documentation by Ross Ihaka, available at

 

http://www.stat.auckland.ac.nz/~ihaka/120/Notes/ch03.pdf

 

MA-plot

We have already discussed the MA-plots used to visualise the data for one single array. This is probably the most widely used plot for microarray data.

There are two ways of generating the MA-plot, and both plots can be modified in the same way using the arguments discussed above.

 

The easiest way is to simply plot an ma-object, using the slide= argument

 

plot(ma.filt.np, slide=8)                      # plots the MA-plot for slide 8 in the ma-object

 

The alternative way is to define x and y separately. This is required, for example, when an averaged MA-plot is wanted. A plot of averaged normalised M-values can be created using the following steps. Note that duplicates are not present.

 

M <- ma.filt.np$M[,]    # extracts all M-values and places them in M
A <- ma.filt.np$A[,]    # extracts all A-values and places them in A

M.mean<-apply(X=M, MARGIN=1, FUN=mean, na.rm=TRUE)   # for details, see ?apply

A.mean<-apply(X=A, MARGIN=1, FUN=mean, na.rm=TRUE)

plot(x=A.mean, y=M.mean, col=1, pch=20, main="Averaged MA-plot", xlab="Mean A", ylab="Mean M")
abline(h=0) # not drawn in the figure below, but can be used to add a y=0 line indicating non-DE genes

 

Figure – Averaged MA-plot

 

Volcano-plot

 

Already discussed above.

 

Venn Diagram

Venn Diagrams are used to visualise the overlap of elements between two (A, B), three (A, B, C) or four (A, B, C, D) groups. Comparison of four groups is rather complicated, as depicted in the figure below, and rarely (if ever) used. From the Venn Diagram it is possible to identify the number of genes that are unique to A, unique to B, unique to C, common to A and B but not C, common to A and C but not B, common to B and C but not A and, finally, common to A, B and C. More information on Venn Diagrams can be found by following the source links in the legend of the Venn Diagrams figure below.

 

 

 

Figure – Venn Diagrams

a) A Venn Diagram for comparison of four groups (source: http://en.wikipedia.org/wiki/Venn_diagram),
b) A Venn Diagram for comparison of three groups (source:
http://www.math.csusb.edu/notes/sets/setsex5/setsex5.htm).

 

In R the Venn Diagram analysis can be carried out using the vennDia function of the KTH-package (the actual drawing of the diagrams is done by calling functions in the LIMMA package). The function requires at least two vectors (A and B) as input, and can be used with up to three vectors. The vectors may be either numeric (containing only numbers) or character. This means that one could, for example, use the index vectors generated from the toptable rownames to compare the genes differentially expressed in three different samples. In the example below, three "randomly" generated numeric vectors are compared.

 

a <- 1:10
b <- 5:15
d <- c(230,32,235,4312,54,245,1:5)
vennDia(A=a, B=b, C=d, names=c("Sample A", "Sample B", "Sample C"), 
+ main="Example of how to create a Venn Diagram")
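As mentioned above, the input may also be character vectors, for example gene names taken from toptable results. A sketch with hypothetical toptable objects tt1, tt2 and tt3 (created as in the toptable section):

```r
# compare the genes picked out for three different contrasts
vennDia(A=rownames(tt1), B=rownames(tt2), C=rownames(tt3),
        names=c("A1-A2", "B1-B2", "A1-B1"))
```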

 

 

Figure – Venn Diagram, example

 A Venn Diagram of example data


Clustering

There are three options for clustering, two of which use open-source software and are therefore recommended

 

GeneSpring

GeneSpring is a commercial software package with several clustering algorithms. We currently have only one licence at the department (installed on the ArrayEval6 computer). Data from R can be exported to a tab-separated text file and subsequently imported into GeneSpring. See the "How to export results from R" chapter for more information.

 

Multiexperiment viewer

MEV is a nice piece of software from TIGR, but it has some limitations. The software provides approximately ten different clustering algorithms as well as SAM and ANOVA analyses. To use MEV, the data has to be exported from R using the exportMEV function described in the data export part of this User's Guide. Additional information on the use of MEV is available on the TIGR web site at

 

http://www.tigr.org/software/tm4/menu/mev.pdf

http://www.tigr.org/software/tm4/menu/MeV.ppt

 

Clustering functions available in a collection of R-packages

More difficult to use, but directly compatible with the R data format. Detailed instructions for hierarchical clustering are available from Johan Lindberg.

 

** instructions will be added later on **

 

 


How to export results from R

There are several ways of exporting results from R depending on the type of object one would like to export.

For tables, the primary method is to export the table using the write.table function. The problem is, however, that its default arguments need to be modified a bit. There are alternatives to write.table with different default arguments, and more information on these can be found using

 

?write.table
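A minimal sketch of settings that produce a tab-separated file which opens cleanly in Excel (the data frame and file name are made-up examples):

```r
tt <- data.frame(Name=c("geneA", "geneB"), M=c(1.2, -0.8))
# tab-separated, no quoting, no row names: Excel-friendly output
write.table(tt, file="tt.txt", sep="\t", quote=FALSE, row.names=FALSE)
```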

 

For the most common export steps we have created a set of functions. More will be added later if there is demand from the end-users.

 

exportTopTable

Exports a toptable into a format compatible with Microsoft Excel. No fancy things; just remember to add the extension to the file name.

 

exportTopTable(tt.all, file="tt.all.txt")

 

export1, export2 and export3

Export the information in an ma-object by creating one file for each slide in the ma-object. These files may be used as input to, for example, GeneSpring.
Depending on which function was called, the following parameters are exported:

  • export1:            Index, Name, ID, F635, F532, spot diameter
  • export2:            Index, Name, ID, M, A, spot diameter
  • export3:            Index, Name, ID, F635, F532, M, A

The Name, ID and diameter are extracted from the gpr-object. As input, always use an ma-object (the conversions to R (F635) and G (F532) are done within the function). Users who need to export anything other than these are encouraged to contact the kth-package maintainer or to send e-mail to the mailing list.

 

export3(ma.filt.np, slideVec=1:8, NameOfFile="Interferon")

 

exportMEV

Exports data for user-defined slides from an ma-object into a tab-separated text file, which can be used as input to TIGR's MultiExperiment Viewer clustering software. Files will be named using either the slide names stored in the ma-object or names defined by the user. The following information will be extracted from the ma-object or gpr-object, or calculated, and written to a tab-separated text file. One file is created for every slide present in your ma-object.

  • UID     unique identifier, concatenation of 'slide type'-argument and index value in 'ma.obj'
  • IA        intensity for channel A (Cy3)
  • IB        intensity for channel B (Cy5)
  • R          slide row (rows numbered from 1 to n, block boundaries not considered)
  • C         slide column (columns numbered from 1 to n, block boundaries not considered)
  • MR      meta row (rows of blocks)
  • MC      meta column (columns of blocks)
  • Block   Block on slide
  • SR       row in block
  • SC       column in block

 

The function accepts several arguments which can be used to fine-tune its behaviour. The table below lists them and provides a short description.

 

Argument (default)      Description

ma.obj (none)           The ma-object containing the data.

gpr.obj (gpr)           The gpr-object containing the feature IDs and names. As this function requires large amounts of available memory, something called gpr.mini may be used instead of the gpr-object. Currently a gpr.mini object exists only for the Mouse Stem Cell project.

slideVec (none)         A numeric vector defining the slides to export.

idxVec (NULL)           A numeric vector defining the genes to export; NULL means all genes.

duplicates (none)       Logical value indicating whether duplicates should be averaged.

slidetype ("Stem32")    Name of the slide type. Will be added to the unique identifier. Make sure this prefix matches the unique identifier in your MEV annotation file. If in doubt, contact the person responsible for the annotation of your array.

slidenames (NULL)       Character vector with alternative slide names. If NULL, the slide names stored in the ma-object are used.

 

The slidetype argument is very important, as it is used to couple the exported probe identity with the annotation file read into the MEV software. Unless these match each other exactly, no clustering is possible.

 

Some examples:

 

a)      to export all data from all the slides in the ma-object

 

exportMEV(ma.obj=ma.filt.np, slideVec=1:8, duplicates=FALSE, slidetype="HumB9")

 

 

b)      to export the top 500 candidate genes for DE from all slides and to average the duplicates. The slides are from the Stem3.2 array

 

exportMEV(ma.obj=ma.filt.np, slideVec=1:24, idxVec=idx.top500, duplicates=TRUE, slidetype="Stem32")

 

For instructions on how to create the idxVec, have a look at the Additional things we can do with the toptables section of this document.

 

 

To run this function takes some time (go get some coffee!!!).

 

 

 

 

 


 

Suggestion for naming objects in R

Objects in R can be named in almost any way.

A typical analysis process of an experiment takes, however, several weeks and at some point the user has to return to old analysis steps. When several projects are analysed in parallel, it becomes difficult to remember what the different objects were named and one should therefore use informative object names.

To avoid problems there is a suggestion for how to name the different objects of the intermediate steps. It is in no way compulsory to follow these recommendations, but it is much easier for other users to help you if you are using this naming system.

 

 

First some things to remember

 

  • Avoid capital letters (they look “aggressive”)
  • Do not use the underscore (_) or dash (-) signs: the underscore is used by R for assignment and the dash is interpreted as subtraction
  • Use dot (.) to separate words
  • Use short, but informative names (however, it is better to use a long name than a short non-informative name)
  • Avoid having duplicate objects (memory issues)
  • Keep the intermediate ma-objects (use clone-function)
  • never modify the gpr-object

 

Below is a list of the most common objects needed:

 

gpr       imported GenePix results files

raw      raw data extracted from gpr

ma       transformation of R and G to M and A values

ff          flags filtered

fs         saturated features filtered

ab        B+2SD filter applied

fmm     mean to median filter applied

fsiz      feature size filter applied

fc         intensities from control clones applied

frc        ratio comparison applied

np        print-tip normalised

nl         lowess normalised

nb        between-slides normalised

tt          TopTable object (results of the LIMMA toptable-function)

eb        results of the ebayes function

cor       results of the duplicateCorrelation function

idx        a vector with index values

Ms       matrix of M-values

As        matrix of A-values

Bs        matrix of B-values
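The abbreviations are meant to be composed from left to right, in the order the analysis steps were applied. A few hypothetical examples of names built with this scheme (the objects themselves are illustrative):

```r
# ma          an ma-object created from the raw data
# ma.ff       the ma-object after flags filtering
# ma.ff.np    after flags filtering and print-tip normalisation
# tt.a1a2     TopTable with the results for the "A1-A2" contrast
```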


Where to find more information?

Below is a collection of web sites where more information may be found

 

Ross Ihaka’s Introduction to R and Statistical Computing I and II chapters

http://www.stat.auckland.ac.nz/~ihaka/120/Notes/ch02.pdf

http://www.stat.auckland.ac.nz/~ihaka/120/Lectures/lecture04.pdf

http://www.stat.auckland.ac.nz/~ihaka/120/Lectures/lecture05.pdf

 

Manuals and other contributed documentation discussing R-related issues

http://cran.r-project.org/manuals.html

http://cran.r-project.org/other-docs.html

 

Lecture notes and slides relating to Bioconductor

http://www.bioconductor.org/workshop.html

 

Vignettes (short descriptions) of a number of essential Bioconductor functions

http://www.bioconductor.org/viglistingindex.html

 

In addition to the R-packages mentioned in this Guide, there are a huge number of other packages which may freely be used. A nice list of these is provided by Korbinian Strimmer:

http://www.stat.uni-muenchen.de/~strimmer/rexpress.html


How to contribute

 

Do not hesitate to contact me if you think you can contribute to this document.

All help is acknowledged and greatly appreciated.

 

Co-authorship is an option for those willing to contribute on a larger scale.

 

 

Send me e-mail!

valtteri@biotech.kth.se

 

 

 


Acknowledgements

Henrik Bengtsson and Gordon Smyth are acknowledged for their excellent microarray data analysis packages aroma and LIMMA, respectively. All data import and normalisation functions, as well as the core structure of the ma-objects, are based on work done by HB. The functions to calculate the B-values are by GS.

The help of the co-authors of the KTH-package, Marcus Gry Björklund and Johan Lindberg, is greatly appreciated, as well as the whole microarray group at Department of Biotechnology, KTH.