Analysis of DNA microarray data using the KTH-package and the R environment for statistical computing

 

 

 

Valtteri Wirta

 

 

 

Department of Biotechnology,
Royal Institute of Technology,
Stockholm, Sweden

valtteri@biotech.kth.se

 

 

 

Version 0.7

Updated 040405

 

 

 

 

 

 

 

 

 

 

Questions?

Comments, suggestions and other issues regarding this document: valtteri@biotech.kth.se

Questions on the use of the kth-package or general data analysis questions: kth-package@freelists.org

Access to the arrays: use the contact form on our homepage: www.biotech.kth.se/molbio/microarray

 


Table of Contents

Table of Contents
  Text
  Figures and tables in this Guide
Introduction
  Golden Path
  The data set used in the examples
  Syntax usage in this document
Installation and additional sources of information
Requirements
  The data
  Hardware
  Software
    R
    Bioconductor
    aroma
    kth-package
    How to show the version of a certain package
Data import and transformation
  Data Import
  Extraction of foreground and background signals
  Transformation into MA-format
Data pre-processing
  Filtering of the data
    Available tools to investigate the effects of filtering
      slideSummary function
      Between duplicates correlation
      Between slides correlation
      Numeric presentation of the filtering results
    Available tools to carry out the filtering
    Examples of usage of filter tools
      filterFlags and filterFlags2
      filterSize
      filterB2SD
      filterCircularity
      filterMtoM
      filterRatioComp
      filterSaturated
Within-slide normalisation
Between-slides normalisation
Identification of differentially expressed genes
  The B-test
  Estimation of the within-slide duplicate spot correlation
  Estimation of the M-value for each probe
  Multiple testing and toptables
  Additional things we can do with the toptables
  Strange tailing phenomenon
  How to create a big matrix of M-values starting from M-value matrices in different workspaces
  How to create a design object for reference design experiments with multiple comparisons
Case study 1: Reference study where four samples are compared to a common reference
How to visualise the results
  The basics
  MA-plot
  Volcano-plot
  Venn Diagram
Clustering
  GeneSpring
  Multiexperiment viewer
  Clustering functions available in a collection of R-packages
How to export results from R
  exportTopTable
  export1, export2 and export3
  exportMEV
Suggestion for naming objects in R
Where to find more information?
How to contribute
Acknowledgements

 

 

Figures and tables in this Guide

Table – Most recent versions of the different packages
Figure – Data processing path for Golden Path
Table – An example of a targets file
Figure – Example of information displayed for each package
Figure – Properties window
Figure – Imageplot of the background intensities
Figure – MA-plot
Figure – Four MA-plots
Figure – Example of the output from slideSummary
Figure – Correlation of M-values between the duplicated fields on the slide
Figure – Correlation of M-values between the two replicated hybridisations
Figure – Results from the filtering step
Figure – Box plot for slide 1
Figure – Nine box plots
Figure – Nine spatial plots
Figure – Box plots before and after the between-slides normalisation
Figure – Volcano-plot
Figure – Strange tailing phenomenon
Figure – Standard plot
Figure – Modified plot
Figure – Zooming in/out
Figure – Averaged MA-plot
Figure – Venn Diagrams
Figure – Venn Diagram, example

 

 

 


Introduction

DNA microarrays have rapidly become the platform of choice for analysis of gene expression levels at a genome-wide level. Gene expression in several organisms can today be analysed fairly easily using either in-house produced or commercial arrays.

 

The User’s Guide

When this document was updated, these were the latest versions of each package. Before starting an analysis or before asking questions, make sure you have the latest version of each package.

 

Package                                Version
R (strictly speaking not a package)    1.8.1
kth-package                            0.3.13
aroma                                  0.69
limma                                  1.5.9
statmod                                1.0.6
Bioconductor (bundle of packages)      1.4

Table – Most recent versions of the different packages

Make sure these are the ones you have installed on your computer

 

Golden Path

The purpose of this document is to present a workflow, called the “Golden Path”, for DNA microarray analysis. The aim of this approach is to help newcomers get started with DNA microarray data analysis and to provide tools for it.

 

The steps in the Golden Path are divided into four subgroups, shown schematically in the figure below. The first group, coloured purple, contains steps related to data import and transformation. The second group, red, deals with filtering issues. The third group, green, covers data normalisation, and the last group, grey, contains all subsequent data mining steps.

 

 

Figure – Data processing path for Golden Path.

All steps of the Golden Path are not included in this figure. The *.gpr denotes the results files generated by the GenePix software.

 

The KTH-package is a collection of tools created to help new users get started with the analysis without needing in-depth knowledge of the R platform for statistical computing. Excellent help pages are available for the R software, but for the majority of users the command-driven interface is still a major hurdle. The KTH-package is not a stand-alone package for microarray data analysis, but relies heavily on other packages, such as the excellent aroma by Henrik Bengtsson (Lund University, Sweden) and LIMMA by Gordon Smyth (WEHI, Australia). LIMMA is also part of the Bioconductor package bundle, and we will also use additional features and functionality of this bundle.

 

The majority of the steps are carried out in the R environment for statistical computing. The aim is to use only open-source software, and we have therefore chosen the very nice MEV software (Multiexperiment viewer) from TIGR for clustering. More information on clustering is found in later chapters, but the reader is encouraged to look for more details in the material provided with the MEV software (http://www.tigr.org/software/tm4/menu/mev.pdf and http://www.tigr.org/software/tm4/menu/MeV.ppt).

 

The purpose of this document is to make the data analysis as easy as possible. Please read through this document before asking any questions. The primary forum for questions is the mailing list for the kth-package at kth-package@freelists.org.

Use the following link to subscribe to this list: http://www.freelists.org/cgi-bin/list?list_id=kth-package

An additional excellent forum for questions is the Bioconductor mailing list, but this is strictly limited to questions regarding the Bioconductor tools.

The data set used in the examples

In this Guide, text, figures and R commands are used to explain what is done to the data. Some of the data (in gpr-format) used in the examples can be obtained by sending e-mail to the corresponding author of this document. Data from several different projects is used to exemplify the different functions and steps.

 

A good start for each new project is to create a targets file containing as much information as possible about the experiment and the way it was carried out. An example of such a targets file is shown below; it is also available as a tab-separated text file:

 

http://biobase.biotech.kth.se/~valtteri/R/data/targets.txt

 

Barcode     FileName                  Cy3       Cy5    Scanner   Date            Blocker   Primer         Slide
12664015    Hexa1.gpr                 Control   IFN    Agilent   2003 11 0840    60mer     hexamer        cDNA
12664012    Nona1.gpr                 Control   IFN    Agilent   2003 11 0840    60mer     nonamer        cDNA
12664013    HexaNona1.gpr             Control   IFN    Agilent   2003 11 0840    60mer     hexanonamer    cDNA
12664010    HexaNonaNoBlock1.gpr      Control   IFN    Agilent   2003 11 08      None      hexanonamer    cDNA
12554758    Operon802.gpr             Control   IFN    Agilent   2003 11 02      80mer     20TVN          cDNA
12559204    Operon801.gpr             Control   IFN    Agilent   2003 11 02      80mer     20TVN          cDNA
12559205    Amersham1.gpr             Control   IFN    Agilent   2003 11 0240    60mer     20TVN          cDNA
12559266    Amersham2.gpr             Control   IFN    Agilent   2003 11 0240    60mer     20TVN          cDNA

Table – An example of a targets file

The Cy3 and Cy5 columns are the only compulsory ones, and they must be labelled exactly as shown. Additional columns may be added. The information in this file can be used in several useful ways during the subsequent data analysis.
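A targets file of this form can be read into R with base functions (the limma package also provides readTargets for the same purpose). The sketch below writes a minimal, invented two-row file to a temporary location and reads it back:

```r
# Write a minimal, invented targets file (tab-separated) to a temp location
tfile <- file.path(tempdir(), "targets.txt")
writeLines(c(
  "Barcode\tFileName\tCy3\tCy5",
  "12664015\tHexa1.gpr\tControl\tIFN",
  "12664012\tNona1.gpr\tControl\tIFN"
), tfile)

# Read it back; Cy3 and Cy5 are the compulsory columns
targets <- read.delim(tfile, as.is=TRUE)
targets$FileName   # the gpr-files to import
targets$Cy5        # "IFN" "IFN"
```

The columns can then be used directly in the analysis, for example to label slides by their dye assignments.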

 

Syntax usage in this document

In this document the commands entered into the R console are formatted in this way

 

print("This is an example of colour, font and font size used for the commands which should be entered into the R console")

 

For clarity and line-break reasons, all R commands which are a continuation of the previous line have a "+" sign in front of them, as shown in the example below. Remove this sign when typing the command into the R console.

 
print("This is an example of colour, font and font size used 
+ for the commands which should be entered into the R console")

 

 

 


Installation and additional sources of information

 

The main portal to the Microarray Resource Center at KTH and the KTH-package is

http://biobase.biotech.kth.se/~valtteri/R/

 

Instructions on how to install the R software and the additional packages can be found at

 http://biobase.biotech.kth.se/~valtteri/pages/kthpackageinstall.html

 

Additional information on the R software is available at

http://www.r-project.org

 

Information on the aroma package is available at

http://www.maths.lth.se/help/R/aroma/

 

Information on the Bioconductor package bundle is available at

http://www.bioconductor.org

 

Information on the Multiexperiment viewer software and other software by TIGR is available at

http://www.tigr.org/software/tm4

 

To subscribe to the kth-package mailing list, sign up at

http://www.freelists.org/cgi-bin/list?list_id=kth-package

or

send the word subscribe in the Subject: line of an e-mail to

kth-package-request@freelists.org


Requirements

The data

Gridding is the process in which the exact location of a feature on the array is identified, and intensity values corresponding to this feature are extracted. Different software is available for this task, but we have chosen to work with the GenePix software from Axon Instruments. When using the kth-package and this User’s Guide your data should, therefore, be in GenePix format. Currently no other data formats are supported.

For each slide there should be one gpr-file. As the gpr-files differ between versions of GenePix, it is recommended that all slides are analysed using the same version of GenePix (preferably GenePix 5.0 or later). The gpr-files must contain the genes in the same order, which means the slides/arrays used must be from the same print batch, or at least have the same design/layout. If duplicate spots are used, they should be in two or more identical fields, and the distance (in rows) between duplicate spots must be constant.

Each probe on the chip must be named with a unique identifier, and this identifier must be placed in either the ID or Name column. Using the Gene Description column is advised against due to problems with formatting in the R console.

 

Hardware

The kth-package is intended to be used on the Windows 32-bit platform. It has, however, been successfully installed on Linux OS after some minor modifications to the installation procedure, but the complete functionality has not yet been verified.

 

The package runs smoothly on a Celeron 2.0 GHz processor with 512 MB of RAM, but extra RAM is always good (and cheap!).

 

The amount of virtual memory on the computer must be increased to approximately 2 GB. So far this has been sufficient for all the analyses I have carried out. In some cases the memory simply is not enough, but there are always alternative ways to do the analysis. If you encounter this problem, you may try to close all other applications on the computer, save your R workspace, quit R, re-start R and finally reload the workspace and the kth-package.

 

Current versions of R are able to handle up to 2 GB of memory, so there is no point in increasing the amount of RAM beyond 2 GB unless you are running several parallel processes. Newer versions of R are rumoured to be better at memory allocation.

 

Software

R

Always use the latest version of R. Currently (early April 2004) this corresponds to R 1.8.1.

You should definitely upgrade to R 1.8.1 if you do not already use it!

New major versions of R are usually released in April (R 1.9.0 in 2004) and October every year. Remember to update the packages every second week or so.

 

By default, each R session is allowed to use a maximum of 1024 MB of memory. For larger projects this is not enough, but the limit can be increased by typing

 

memory.limit(size=2000)   # this is the maximum amount of memory accessible for R

 

A successful memory increase gives NULL as output. During an R-session the memory limit cannot be reduced.

 

If a message indicating a memory allocation error is obtained during the R session, make sure you have increased the memory.limit as shown above. You can also try to free memory reserved by the R software by calling the "garbage collector":

 

gc()

Bioconductor

New major versions of the Bioconductor bundle are released approximately at the same time as new versions of R, or soon thereafter. For Bioconductor it is even more important that you update the packages frequently. In particular, the excellent LIMMA package (by Gordon Smyth) should be updated every week, as improvements are made and new functions are added frequently.

 

aroma

The aroma package is updated every now and then. Check Henrik Bengtsson’s web page for more information.

http://www.maths.lth.se/help/R/aroma/

 

kth-package

The kth-package is updated when new functions are available and I have the time to prepare the new version for distribution. To get information when new versions are available you can subscribe to the mailing list. Send the word subscribe in the Subject: line of an e-mail to

kth-package-request@freelists.org

 

The list is a closed list (only subscribers can send messages) but not moderated.

Please include the version information on your packages (kth-package, limma, aroma, R base) when asking questions.

 

It is important to remember that information on urgent and annoying bugs will not be sent out to individual users, but only to the mailing list! Do subscribe!

 

How to show the version of a certain package

To display information on a certain package in R, type

help(package="kth")

 

 

 

 

Figure – Example of information displayed for each package

 

 

 

 


Data import and transformation

Data Import

The first step is to read the data into R. It is assumed that you have installed R and the add-on packages (see Installation).

 

Create a folder for your project (try to keep one separate folder for each of your projects) and place your gpr-files there. The TIFF-files are not needed and can be stored elsewhere.

 

Create a copy of the R shortcut onto your desktop (there might already be one there, but create a new project-specific shortcut). Select the newly created shortcut icon and press Alt+Enter to open the Properties window of the shortcut (see figure). Rename the shortcut to something like “R 1.8.1 Your Project” and into the Start In field enter the path to your newly created folder containing your gpr-files.

 

 

Figure – Properties window

 

Start R and load the kth-library

 

library(kth)

 

This will automatically also load the aroma package and some important packages of the Bioconductor bundle. Some information is displayed when the kth-package is loaded. It is good to read through this part, as some important messages might be displayed there.

 

To import the data into R we use functions in the aroma package. First we verify that the working directory contains your gpr-files:

 

getwd()
dir()

 

You should now see all your gpr-files.

Then continue by creating a gpr-object

 

gpr <- MicroarrayData$read(pattern=".gpr")

 

This will import all gpr-files in your current folder (which should now be the project-specific folder). Be aware that this might take several minutes, depending on your computer and the number of slides you are importing. It is not uncommon for R to appear non-responsive during this time.

 

You have now created the gpr-object, which is central to all subsequent analysis steps. This object contains the raw data and should never be modified by the user. To see information on the gpr-object one can type

 

gpr

 

in which case you will see something like this:

 

[1] "GenePixData: Number of slides: 8. Number of fields: 56. Layout: Grids: 12x4 (=48), spots in grids: 25x25 (=625), total number of spots: 30000. Spot name's are specified. Spot id's are specified."

 

which indicates that the object contains data from 8 slides and each slide contains 30 000 rows of data.

Of course, the object contains a lot more information, which will be used later on. There are several ways of extracting this information, but doing so requires some knowledge of the indexing rules in R. Below are some examples, with only minor explanations. More information on indexing in R can be found in the help documents listed in the "Where to find more information" section of this document.

 

gpr$fields                                         # shows all fields of the gpr-object

gpr$Name[,1]                                    # shows the Name of each probe on slide 1 (the Names are of course the same for the probes on the other slides)

gpr$Name[1:10,]                             # as above, but only for the first 10 probes

gpr$Dia.[1:20,]                             # shows spot diameter for the first twenty probes on all slides

gpr$Dia.[1:20, c(1,3,5,7)] # as above, but for slides 1, 3, 5 and 7

 

All other fields can be accessed in a similar way. If the name of a field contains more than one word, the name has to be enclosed in quotation marks. For example:

 

gpr$"F635 Median"[1:10,]      # the Cy5 foreground signal for the first ten probes on all slides
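These indexing rules are plain R matrix indexing; the toy matrix below (invented values, five probes by four slides) mirrors the [rows, columns] patterns used on the gpr fields above:

```r
# A toy 5 x 4 matrix standing in for one gpr field (5 probes, 4 slides)
m <- matrix(1:20, nrow=5, ncol=4)

m[, 1]           # all probes on slide 1
m[1:3, ]         # first three probes on all slides
m[1:3, c(1, 3)]  # first three probes on slides 1 and 3
```

An empty index before or after the comma means "all rows" or "all columns", which is exactly how gpr$Name[,1] and gpr$Dia.[1:20,] work above.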

 

Extraction of foreground and background signals

In the next step we extract the foreground and background intensities from the gpr-object and enter them into a new object called raw. The user may choose between median and mean values; the defaults are the median values (the arguments fg and bg may be omitted if median values are desired). Median values are considered less susceptible to misalignment errors and interfering dust particles, and should be used in the majority of cases.

 

raw <- as.RawData(gpr, fg="median", bg="median")

 

The next step is to decide whether to subtract the local background or not. It is important to point out that the actual subtraction is not carried out until the data transformation step, explained in the next part of this chapter.

 

A discussion on the effects of the subtraction can be found on Henrik Bengtsson’s web page:

 

http://www.maths.lth.se/help/R/aroma/#6.%20Subtracting%20background

 

 

To investigate how even the background is over the slide surface, the imageplot function (from the limma package) can be used. The limma package is loaded together with the kth-package.

 

Draw the imageplot for green channel and red channel, respectively.

 

imageplot(log(raw$Gb[,1],2),layout,low="white",high="green")
imageplot(log(raw$Rb[,1],2),layout,low="white",high="red")
 

 

Figure – Imageplot of the background intensities

The imageplot function is not restricted to background data; it can be used to visualise any other data column from the gpr-files.

 

Version 0.3.14 of the kth-package contains the bgAnalysis() function, which can be used to analyse the background of each slide more easily. This function can create plots of both Cy3 and Cy5 background intensities over the slide surface and of their CV distribution, as well as box plots showing the distribution of background intensity values for each block. For more details see

 

?bgAnalysis

 

If subtraction is carried out, some features (spots) may turn out to have negative or zero values. These are problematic in subsequent analysis steps, because division by zero is not possible and because logarithmic transformation of a negative or zero value is not possible. To get rid of these, we identify features with a negative or zero value in one of the two channels, but not in the other, and set that value to 1. Of course, one has to be aware that this might give very large ratios.
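The adjustment can be sketched in base R as below (filterZero performs this on the raw object; the intensity vectors here are invented):

```r
# Invented red/green intensities after background subtraction
R <- c(1250, -3, 840, 0, 410)
G <- c(980, 55, 120, 230, 395)

# Set negative and zero values to 1 so that ratios and logs stay defined
R[R <= 0] <- 1
G[G <= 0] <- 1

log2(R / G)   # now finite for every feature
```

Note how the second feature, with its intensity forced to 1, produces an extreme ratio; this is exactly the caveat mentioned above.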

 

Before we adjust the negative or zero values, we will have a look at the filterLog option. Later in the analysis it might be of interest to know why a certain feature was removed. To enable this, there is an option to log the results of the different filtering steps in a filterLog object. We first create the filterLog object, and then enter data into it in the subsequent filtering steps.

 

filterLog <- createFilterLog(gpr)  # creates the object

 

The filterLog is an array type of object and can be browsed using standard indexing rules for arrays. An array may be considered a book, where each cell is defined by a combination of a row and column number and finally by a page number. The first number refers to the rows, the second to the columns and the third to the “page”. In the filterLog-object row, column and page correspond to probe, filter and slide, respectively.
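The book analogy corresponds directly to three-index subsetting in R; a toy array (invented dimensions: ten probes, three filters, two slides):

```r
# Toy filter log: 10 probes x 3 filters x 2 slides, TRUE = passed
a <- array(TRUE, dim=c(10, 3, 2))
a[2, 1, 1] <- FALSE   # probe 2 failed filter 1 on slide 1

a[1:5, , 1]   # first five probes, all filters, slide 1
a[2, 1, ]     # probe 2, filter 1, on both slides
```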

 

filterLog[1:10, , 1:2]     # shows all filters for the first ten features on slides 1 and 2 before any filtering is carried out

 

filterLog <- filterZero(raw, logFilter=TRUE, filterLog.obj=filterLog)

 

Note that when using the filterLog option, the results must be assigned to the filterLog object!

If filterLog is not used, then simply

 

filterZero(raw)

 

filterLog[1:10, , 1:2]     # shows all filters for the first ten features on slides 1 and 2 after the filterZero filter is applied

 

Transformation into MA-format

The next step is to transform the (possibly background-subtracted) R and G (red and green, respectively) raw intensities into MA-format. A description of this step is available on Henrik Bengtsson's web page at:

 

http://www.maths.lth.se/help/R/aroma/#5.2.1%20More%20on%20the%20(R,G)%20to%20(M,A)%20transformation

 

At this stage the actual background subtraction is carried out, if the user has chosen to do so. In this case we subtract the background and create a ma-object containing the transformed data (at this point it might be useful to review the naming suggestions for objects explained in the section Suggestion for naming objects in R).

 

ma.bs <- getSignal(raw, bgSubtract=TRUE)
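For reference, the transformation itself is the standard log-ratio/log-intensity pair, M = log2(R/G) and A = 0.5*log2(R*G); a minimal base-R sketch with invented intensities:

```r
# Invented background-subtracted intensities
R <- c(1250, 840, 410)   # red channel (Cy5)
G <- c(980, 420, 820)    # green channel (Cy3)

M <- log2(R / G)         # log2 fold change between the channels
A <- 0.5 * log2(R * G)   # average log2 intensity

M[2]   # log2(840/420) = 1
```

getSignal performs this on the whole raw object at once (after any background subtraction).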

 

We can now have a look at the actual data

 

plot(ma.bs, slide=1)

 

 

Figure – MA-plot

Shows the data after background subtraction, but before any filtering steps. Low intensity artefacts are clearly visible, as well as the bend in the data cloud.

 

or for the first four slides

 

subplots(4)
for(k in 1:4) {
  plot(ma.bs, slide=k)
}

 

Figure – Four MA-plots.

 

 

The data is now ready for filtering and subsequent analysis.


Data pre-processing

In the following part we remove features with unreliable data and then proceed with the normalisation steps.

Filtering of the data

A set of filtering tools is included in the kth-package. Currently there are more than ten tools for filtering on various spot characteristics extracted from the gpr-object. New tools are fairly easy to implement if end-users need them. Send e-mail to the kth-package mailing list if you think a useful filtering tool is missing.

 

The purpose is not to use all of these filter tools, as that would probably lead to over-filtering of the data. There is currently no final recommendation on which filters to apply; instead, the user is encouraged to test them all and identify the best combination of filters for each particular project.

Available tools to investigate the effects of filtering

As guidance for making this decision the user may use the following functions:

 

slideSummary function

This function calculates the number of good and bad features (i.e. those that passed the filters and those that did not). A graphical output is also produced. Note that this function assumes that ALL positions of the printed surface are occupied by probes. If empty positions are present, the number obtained over-estimates the amount of filtering.

 

slideSummary(ma.bs)             # for more arguments type ?slideSummary

 

 

Figure – Example of the output from slideSummary

In this study a very strict filtering is carried out, shown by the fact that >50 % of the features are removed for some slides. This amount of filtering is probably too much in the majority of the cases!

 

Between duplicates correlation

The duplCor function calculates the Pearson, Spearman or Kendall tau correlation between duplicate features on the slide and produces a graphical output. The duplicate features are required to be in two identical fields. The Pearson correlation is calculated by default, but this can be changed using the method argument.

 

duplCor(ma.bs, slideVec=c(1,2,3,4))         # displays correlation for first four slides in the ma.bs-object. For more arguments type ?duplCor
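Under the hood this is an ordinary correlation between the M-values of the two duplicate fields; duplCor adds the field bookkeeping and the plot. A base-R sketch with invented M-values:

```r
# Invented M-values for the two duplicate fields of one slide
field1 <- c(0.10, -1.20, 0.85, 2.10, -0.40)
field2 <- c(0.15, -1.05, 0.90, 1.95, -0.55)

cor(field1, field2)                       # Pearson, the default
cor(field1, field2, method="spearman")    # rank-based alternative
```

A correlation close to 1 indicates that the duplicate fields agree well.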

 

 

Figure – Correlation of M-values between the duplicated fields on the slide.

 

Between slides correlation

The detCor function calculates the Pearson, Spearman or Kendall tau correlation between replicated hybridisations. The Pearson correlation is calculated by default, but this can be changed using the method argument.

 

detCor(ma.bs, slideVec=c(1,2,3,4))    # displays the correlation between the first and second slide, and also between the third and fourth slide of the ma.bs-object. For more arguments type ?detCor

 

 

Figure – Correlation of M-values between the two replicated hybridisations

 

 

Numeric presentation of the filtering results

With the argument graphics=TRUE, the number of remaining features for each slide before and after the filtering is shown. For convenience, it also calculates the number of features in each plot and displays it in the upper left corner, as shown in the example figure below.

 

 

Figure – Results from the filtering step.

n= indicates the number of features in each plot.

 

Available tools to carry out the filtering

The following filtering tools are currently available

  • filterZero
  • filterFlags
  • filterFlags2
  • filterSize
  • filterB2SD
  • filterCircularity
  • filterMtoM
  • filterRatioComp
  • filterControls
  • filterSNR
  • filterSaturated

 

 

To see detailed explanations of their use, see the provided help files by typing ?NameOfFilter, where you replace "NameOfFilter" with the name of the filter function.

 

For the majority of these filters, the following general arguments are accepted in addition to the more filter-specific ones:

 

Argument                    Description
ma.obj                      Name of the object on which to carry out the filtering.
                            Must always be specified.
gpr.obj                     Name of the gpr-object from which to get the required data.
                            Defaults to gpr.obj=gpr.
graphics=TRUE/FALSE         Controls whether MA-plots are drawn.
                            Defaults to graphics=FALSE.
logFilter=TRUE/FALSE        Determines whether the results of the filtering are logged.
filterLog.obj=filterLog     Of interest only if logFilter=TRUE; the name of the
                            filterLog-object. Defaults to filterLog.
filt.A=TRUE/FALSE           Controls whether the A-values (intensity values) are also
                            set to NA during filtering. Defaults to filt.A=FALSE.

 

Examples of usage of filter tools

On the following pages the use of each of these filters is described using examples. Please note that this does not imply that all of these filters should be used for a particular data set. Again, it might be good to review the rules for naming objects.

filterFlags and filterFlags2

In the ma.bs object we now have the background-subtracted fold-change values on the log2 scale (M-values) and the total intensity values, also on the log2 scale (A-values). We have 30 000 measurements for each slide; however, many of these were flagged by GenePix as unreliable. Using the first filter, filterFlags, we remove these features by setting their M-value to NA (not available). Note that by default their A-values are kept unchanged. Before filtering, however, we create a copy of the ma.bs-object, which makes it possible to go back if necessary.

 

ma.bs.ff <- clone(ma.bs)
filterLog <- filterFlags(ma.obj=ma.bs.ff, gpr.obj=gpr, logFilter=TRUE, 
+ filterLog.obj=filterLog, graphics=TRUE)

 

As this is the first example, all the arguments are defined in the formal way of ARGUMENT=VALUE. If no ARGUMENT= is specified, R assumes that the VALUEs are entered in the same order as in the formal usage definition of the function. This definition can be found under the USAGE heading in each help file. The command above could have been given in a shorter form as

 

filterLog <- filterFlags(ma.bs.ff, 0, gpr, TRUE, TRUE, filterLog, TRUE)

 

Please note that there are other ARGUMENTS that were not used above.

 

When the logFilter-argument is set to TRUE, the results of the filtering must be assigned to a new object, and that object should always be the same as the filterLog.obj, i.e. filterLog in this case.

 

The filterFlags2 function closely resembles filterFlags, but uses the Flag values differently. In filterFlags, every feature with a flag value less than the threshold (flag.level=0 by default) is removed, while filterFlags2 removes features with a specific Flag value (the flag.rem argument).

 

filterSize

Features can also be accepted or rejected on the basis of their physical spot size. The filterSize filter uses the feature diameter stored in gpr$Dia. to remove features outside the accepted boundaries. By default, accepted feature sizes are everything between and including 61 and 159 µm, but these boundaries should be verified for each batch of arrays. The sizes reported by GenePix come in 10-µm intervals, i.e. 60, 70, 80, …, n (assuming the 10-µm scanner resolution is used).

 

In the example below, where we remove features outside the acceptance range, we again choose to log the results in the filterLog and to keep the A-values.

 

ma.bs.ff.fsiz <- clone(ma.bs.ff)
filterLog <- filterSize(ma.bs.ff.fsiz, filt.level=c(71,139), 
+ logFilter=TRUE, graphics=TRUE)

 

filterB2SD

Features with too low a signal intensity compared with the local background may also be filtered out. filterB2SD compares the foreground and background signal intensities and rejects features for which less than a certain percentage of foreground pixels in both channels (the default) is above background + 2 SD. A modification is available that uses the coefficient of variation of the background to determine whether the mean(background) + 2 SD(background) term is driven by an extremely large SD component caused by a dust particle in the background region. [The user may choose background + 1 SD as the cut-off instead of the default background + 2 SD.]

 

For the majority of projects a 70% threshold is commonly used, but again the user should verify this using their own data set.

 

ma.bs.ff.fsiz.ab70 <- clone(ma.bs.ff.fsiz)
filterLog <- filterB2SD(ma.bs.ff.fsiz.ab70, filt.level=70, 
+ logFilter=TRUE, graphics=TRUE)

 

filterCircularity

In GenePix 5.0 there are two alternative methods for identifying features. The first, classical approach uses an adaptive circle method that adjusts the diameter of the circle to match the spot. The second uses a growing-seed approach, in which a central pixel is identified and the region is then allowed to grow to match the size of the spot; growth is achieved by comparing the intensity of each pixel to its neighbours. For the adaptive circle method (the default) the circularity is by definition always 100% and this filter cannot be applied. For the growing-seed approach the circularity is calculated and may be used as a filtering tool: features which are less circular than a certain user-defined threshold value are removed.

 

As the data in this example is obtained through the adaptive circle gridding, no filtering can be carried out using this tool. For an example of the usage, please see the help documentation (?filterCircularity).

 

filterMtoM

For a good-quality feature one would expect the median and the mean of the foreground signal in a given channel to be close to each other. A deviation between these two measurements of more than a certain percentage may therefore be used as a filtering criterion. A threshold value of 15% average deviation has been found to be appropriate for the majority of projects.
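The criterion can be illustrated with a few lines of plain R (a simplified sketch of the idea using made-up pixel values, not the actual filterMtoM implementation; the exact percentage definition is an assumption):

```r
# Hypothetical foreground pixel intensities for one feature, one channel.
pixels <- c(900, 950, 1000, 1020, 1050, 4000)  # one bright outlier pixel

# Percentage deviation between mean and median of the foreground signal.
dev.pct <- abs(mean(pixels) - median(pixels)) / mean(pixels) * 100
dev.pct > 15   # TRUE here: the feature would be rejected at the 15% level
```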

 

ma.bs.ff.fsiz.ab70.fmm <- clone(ma.bs.ff.fsiz.ab70)
filterLog <- filterMtoM(ma.bs.ff.fsiz.ab70.fmm, filt.level=15, 
+ logFilter=TRUE, graphics=TRUE)

 

Note! In my opinion this filter may be too strict at the 15% level and caution should therefore be used when applying this filter!

 

filterRatioComp

GenePix estimates the ratio between the red and green channels in five different ways:

  • ratio of medians (for each channel calculate the median of the individual pixels, then calculate the ratio of these)
  • ratio of means (for each channel calculate the mean of the individual pixels, then calculate the ratio)
  • median of ratios (calculate ratio on a pixel-by-pixel basis, then calculate the median of the ratios)
  • mean of ratios (calculate ratio on a pixel-by-pixel basis, then calculate the mean of the ratios)
  • regression ratio (carry out a regression of red channel intensity vs. green channel intensity and take the slope of the regression line)
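The five definitions can be sketched for a single feature using made-up pixel values (an illustration of the list above, not GenePix code):

```r
# Hypothetical pixel intensities for one feature in the two channels.
red   <- c(1200, 1350, 1100, 1500, 1250)
green <- c( 600,  700,  580,  760,  640)

median(red) / median(green)    # 1. ratio of medians
mean(red)   / mean(green)      # 2. ratio of means
median(red / green)            # 3. median of ratios
mean(red / green)              # 4. mean of ratios
coef(lm(red ~ green))[2]       # 5. regression ratio (slope of red vs. green)
```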

 

For a good-quality feature the ratio estimate should be similar irrespective of the method used to calculate it. If two of the estimates deviate too much from each other, the feature may be removed. Currently, comparisons are implemented between estimates 1 and 5 and between estimates 1 and 3 in the list above.

The filt.comp argument defines which ratio estimates are compared: the value 1 compares the "Ratio of Medians" to the "Regression Ratio", while the value 2 compares the "Median of Ratios" to the "Ratio of Medians".

 

ma.bs.ff.fsiz.ab70.fmm.frc20 <- clone(ma.bs.ff.fsiz.ab70.fmm)
filterLog <- filterRatioComp(ma.bs.ff.fsiz.ab70.fmm.frc20, 
+ filt.level=20, filt.comp=1, logFilter=TRUE, graphics=TRUE)

 

Be aware that this filter may kick out good-quality features when filt.comp=1, but may not be strict enough when filt.comp=2.

 

filterSaturated

This filtering tool is based on the intensity level of the feature’s foreground. The TIFF images generated by the scanners are generally 16-bit, which gives 2^16 = 65536 intensity levels and thus a maximum value of 65535. The usable maximum is scanner specific: our Agilent scanner saturates at approximately 65180, depending on the initial slide background adjustment carried out automatically by the scanner software. Signals at this level do not accurately reflect the amount of target bound to the immobilised probe and should therefore be removed. In cases where only one of the channels is saturated, the obtained ratio might still indicate differential expression, although the absolute value of the ratio is incorrect. The user has the option to decide whether to keep or remove such features.

The "saturation level" can also be changed by the user, which makes it possible to use this tool to remove data above any arbitrary threshold value.

 

ma.bs.ff.fsiz.ab70.fmm.frc20.fsat <- clone(ma.bs.ff.fsiz.ab70.fmm.frc20)
filterLog <- filterSaturated(ma.bs.ff.fsiz.ab70.fmm.frc20.fsat, 
+ logFilter=TRUE, graphics=TRUE)

 

When the filtration process is done, it might be good to rename the ma-object to something more appropriate, such as ma.filt

 

ma.filt <- clone(ma.bs.ff.fsiz.ab70.fmm.frc20.fsat)

 

which is then used in the subsequent normalisation step


Within-slide normalisation

Before the identification of differentially expressed genes can be started, the intensities of the two channels have to be balanced against each other; otherwise the calculated ratios cannot be used. The classical approach has been to scale the sums of the two channels to be equal, but it has been noted that this does not remove some of the dye-dependent artefacts present in the data, which are easily visualised using, for example, an MA-plot.

Alternative methods for normalisation are the lowess and the print-tip lowess methods. More on these can be read in

 

Yang YH et al. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 2002 Feb 15;30(4):e15

 

You should read the article!

 

These normalisation methods are implemented in the aroma package

 

Normalisation method       Argument
None                       "n"
Median (classical)         "m"
Lowess                     "l"
Print-tip lowess           "p"
Scaled print-tip lowess    "s"

 

To determine which method is the most appropriate there are two main plots that can be investigated

Box plot of individual print-tip groups for each slide

 

boxplot(ma.filt, groupBy="printtip", slide=1)

 

for more help

?MicroarrayData.boxplot

 

Figure – Box plot for slide 1

 

At this point it might be useful to read some more about the box-and-whisker plot (the information can be found in any general statistics text book or http://mathworld.wolfram.com/Box-and-WhiskerPlot.html).

 

Each print-tip group (block, grid) is represented by one box plot. If all box plots scatter around the same y-value, a standard lowess normalisation is appropriate. If they instead have clearly different medians and spreads, then a print-tip lowess normalisation is the most appropriate way to go.

 

To visualise all nine slides in the ma.filt-object

 

windows()
subplots(9)
for(k in 1:9)
boxplot(ma.filt, groupBy="printtip", slide=k)

 

 

 

Figure – Nine box plots

 

The second plot that might be useful is plotSpatial

 

windows()
subplots(9)
for(k in 1:9)
plotSpatial(ma.filt, slide=k)

 

 

Figure – Nine spatial plots

 

These plots show the M-values at the different locations of the slides and indicate their values using colours. An even colour distribution suggests that the lowess normalisation is appropriate, while an uneven distribution suggests print-tip normalisation.

 

Normalisation is carried out using

 

ma.filt.nm <- clone(ma.filt)
ma.filt.nl <- clone(ma.filt)
ma.filt.np <- clone(ma.filt)

normalizeWithinSlide(ma.filt.nm, method="m")    # median normalisation

normalizeWithinSlide(ma.filt.nl, method="l")    # lowess normalisation

normalizeWithinSlide(ma.filt.np, method="p")    # print-tip lowess normalisation

 

If unsure, one can test both lowess and print-tip lowess normalisation and check the effects on the within-slide duplicate and between-slides correlations using the duplCor and detCor functions (see above).

 

Between-slides normalisation

For each comparison in an experiment one can expect the range of the M-value distribution to be approximately the same for all slides. When the ranges are clearly different, this may be considered an effect of labelling and/or hybridisation, i.e. a technical artefact. Such effects may be removed by rescaling the M-value distributions using the robust deviation measure median absolute deviation (MAD). This algorithm is available in the aroma function normalizeAcrossSlides.
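The principle of the MAD rescaling can be sketched in plain R (a simplified illustration, not the aroma implementation; the choice of the geometric mean of the MADs as the common target is an assumption):

```r
# Two hypothetical slides with clearly different M-value spreads.
set.seed(1)
M <- cbind(rnorm(1000, sd=0.5), rnorm(1000, sd=1.5))

# Rescale each slide so that all MADs equal a common target value.
mads   <- apply(M, 2, mad, na.rm=TRUE)
target <- exp(mean(log(mads)))                # geometric mean of the MADs
M.norm <- sweep(M, 2, target / mads, FUN="*")
apply(M.norm, 2, mad)                         # now approximately equal
```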

 

Additional help is available through the help page for normalizeAcrossSlides:

?MAData.normalizeAcrossSlides

 

and a detailed description of the MAD scaling is available through the help pages

?mad

 

Boxplots of M-values for each slide can be investigated to find out whether there is a need for between-slides normalisation or not:

 

boxplot(ma.filt.np)

 

Again, clone the object before normalisation

 

ma.filt.np.nb <- clone(ma.filt.np)
normalizeAcrossSlides(ma.filt.np.nb)
boxplot(ma.filt.np.nb)

 

Figure – Box plots before and after the between-slides normalisation.

In the first plot the distributions of the M-values are clearly different for the nine slides; after the between-slides normalisation the distributions are much more similar. In this particular case, however, it is of no interest to carry out a between-slides normalisation, as we are comparing the effects of different blockers and priming methods.


Identification of differentially expressed genes

We are finally entering the more interesting stuff, which is to find out which genes are (statistically) significantly differentially expressed (DE) between two or more samples.

 

In the beginning of the microarray era, arbitrary fold-change cut-offs were used to identify DE genes; for example, a two-fold up- or down-regulation might have been considered DE. Today, however, for a gene to be considered DE it has to be identified using a statistical test, such as the t-test, SAM or the B-test, with the results corrected for multiple testing. This takes into account the higher variability of the expression measurements of lowly abundant genes and, furthermore, enables the identification of genes, such as transcription factors, for which a smaller fold-change may have profound biological effects.

 

In this User’s guide we will be considering the B-test.

 

Reference to SAM paper:

Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 2001 Apr 24;98(9):5116-21

 

The B-test

The B-test is an empirical Bayes approach for identification of DE genes. The details of this have been published in several works, and the user is encouraged to read these to learn about the statistical theory underlying the use of this method:

 

  • Lönnstedt, I. and Speed, T. (2002). Replicated microarray data. Statistica Sinica 12, 31-46

http://www.stat.sinica.edu.tw/statistica/j12n1/j12n12/j12n12.htm

  • Smyth, G. K., Michaud, J., and Scott, H. (2003). The use of within-array replicate spots for assessing differential expression in microarray experiments. In preparation

http://www.statsci.org/smyth/pubs/dupcor.pdf

  • Smyth, G. K. (2003). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. In preparation

http://www.statsci.org/smyth/pubs/ebayes.pdf

 

The B-test ranks the probes in order of evidence for differential expression: the higher the B-value, the more probable it is that the gene is differentially expressed. The actual cut-off for differential expression is difficult to determine and depends, among other things, on the number of slides and on the a priori assumption of the proportion of differentially expressed genes (the default assumption is 0.01). The main use of the B-test is therefore to rank the genes and then continue with as many genes as is feasible to work with, starting from the top of the ranking list.

 

At this stage we limit the discussion to simple two-sample comparisons. Later on in the case-study section we introduce more advanced comparisons. The basic idea is the same for comparisons consisting of more than two samples, but the design object and the interpretation differs somewhat.

 

All the steps in the B-test use functions from the LIMMA package. This package is automatically loaded together with the kth-package. There is an excellent User’s Guide to LIMMA, which is available on your computer if you have installed LIMMA. In R, type

 

help.start()

 

Choose packages -> choose LIMMA -> browse directories and there you have it.

You should definitely read chapters 7-9.

 

The aim of the examples is to show how to identify DE genes, not to explain the underlying statistics. For more information, see the references cited above.

 

We start with a general discussion and a few examples on how to carry out the identification of differentially expressed genes using the B-test. Later on we look at some case-studies in more detail (currently only one case-study is included).

 

The first step is to create a matrix of M-values, in which the rows represent features and the columns represent slides. If some slides in the ma-object are not part of the comparison of interest, they can be excluded using standard indexing rules (see the example below).

 

Ms <- ma.filt.np$M[,]             # includes all slides

 

Example with only five slides included:

 

Ms <- ma.filt.np$M[,c(1,2,3,6,7)]  # we will continue to work with this one

 

NB!

The matrix may now contain also features for which there is no or only one valid measurement. Please see the discussion below regarding the strange tailing phenomenon for more information and how to deal with these.

 

Estimation of the within-slide duplicate spot correlation

Next, estimate the within-slide correlation. This step can be skipped if there are no duplicate spots on the arrays. The design argument is a vector of the same length as the number of slides in the study and indicates which slides should be dye-swapped. In this case every second slide should be dye-swapped, and these slides are therefore given the value -1. (In cases where several comparisons are carried out, for example samples A, B and C against a common reference, the design object is a matrix containing the three values -1, 0 and 1. An example is given below in the How to create a design object for reference design experiments with multiple comparisons section.) The ndups argument determines how many duplicates there are of each feature, and the spacing argument determines their distance in the gpr-file; for two consecutive spots the spacing value is 1.

 

This computation may take a long time depending on the number of slides and your computer hardware.

 

cor <- duplicateCorrelation(Ms, design=c(1,-1,1,-1,1), ndups=2, 
+ spacing=15000) 

 

Estimation of the M-value for each probe

Next we estimate the M-value for each probe using generalised least squares, taking into account the higher correlation between the within-slide duplicates. (This should only be used if there are within-slide duplicates; otherwise use the simpler lm.series function below.)

 

fit <- gls.series(Ms, design=c(1,-1,1,-1,1), ndups=2, 
+ spacing=15000, correlation=cor$cor)

 

If no within-slide duplicates are present, then use

 

fit <- lm.series(Ms, design=c(1,-1,1,-1,1))

 

We continue by calculating the log odds of differential expression using the ebayes function

 

eb <- ebayes(fit)

 

The estimated M-values are now in the fit$coef vector, while the B-values are in the eb$lods vector.

The M-values for the first 50 probes are

 

fit$coef[1:50]

 

and the B-values for the corresponding genes are

 

eb$lods[1:50]

 

We can also have a look at the volcano plot, i.e. B vs. M

 

plot(x=fit$coef, y=eb$lods)

 

 

Figure – Volcano-plot

In this figure the top 500 genes are coloured in red.

Multiple testing and toptables

With many probes, for example 30000, multiple testing becomes an issue we have to deal with; otherwise the probability of false positives (genes that by chance appear statistically significantly differentially expressed) becomes too high. We therefore use the false discovery rate (FDR) to control for multiple testing. Alternative ways of controlling this can be found by typing ?p.adjust.
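The idea can be tried out directly with R's built-in p.adjust function (made-up p-values for five hypothetical probes, shown purely for illustration):

```r
# Raw p-values for five hypothetical probes.
p.raw <- c(0.0001, 0.001, 0.01, 0.04, 0.2)

p.adjust(p.raw, method="fdr")    # Benjamini-Hochberg false discovery rate
p.adjust(p.raw, method="holm")   # a more conservative alternative
```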

 

Using the B-values in the eb$lods vector we rank the probes, calculate p-values corrected for multiple testing and display the results in a table format. This is done using the toptable function of the LIMMA package.

The genelist argument determines the probe names, which in this case are extracted from the gpr-object. One could instead choose gpr$ID[,1], or both by creating a matrix with cbind(gpr$ID[,1], gpr$Name[,1]).

 

We start by identifying the top 100 probes and enter the results into a tt.100 table

 

tt.100 <- toptable(number=100, genelist=gpr$Name[,1], fit=fit, eb=eb, 
+ adjust="fdr")

 

To visualise the results type

 

tt.100

 

It is also good to make a complete toptable.

 

tt.all <- toptable(number=30000, genelist=gpr$Name[,1], fit=fit, eb=eb, 
+ adjust="fdr")

 

It might also be useful to export the big toptable into a tab-separated text-file, which is easily read into Excel.

 

exportTopTable(tt.all, file="tt.all.txt")
 

Additional things we can do with the toptables

  • Extract the index values of a certain set of probes (the index value of a gene is equivalent to the row number the gene occupies in the input data)

The index values of the probes in the toptable are stored as row names. These index values are useful to have, as they may be used, for example, to colour the corresponding spots in an MA-plot.

To extract the index values for the top 50 genes in the tt.all toptable object the following command can be used

 

idx <- as.numeric(rownames(tt.all)[1:50])
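These index values can then be used to highlight the corresponding probes, for example in a plain M vs. A scatter plot drawn with base graphics (a sketch only: the As object holding the A-values in the same row order as Ms is an assumption, and the aroma plot functions offer their own highlighting mechanisms):

```r
# Sketch: mark the top-ranked probes in red on a basic M vs. A plot.
# Assumes As holds A-values with the same row order as Ms (hypothetical).
plot(As[, 1], Ms[, 1], pch=".", xlab="A", ylab="M")
points(As[idx, 1], Ms[idx, 1], col="red", pch=19)
```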

 

  • Extract the data from a toptable using a vector of index values

 

The index values of some interesting genes (perhaps your 140 favourite genes) are assumed to be stored in the idx object. To find these probes in the toptable and create a new toptable-like object of these we can use the extractFromTopTable function:

 

new.tt <- extractFromTopTable(toptable=tt.all, idxVec=idx)

 

This new “toptable” can be treated as a regular toptable, for example exported using the exportTopTable function.

 

  • Create a HTML-document of the toptable and add links to either UniGene or LocusLink.

To be able to use this function your array must be within the list of arrays for which an annotImage-object has been created. Currently the only arrays with this annotation information are those of the Mouse Stem Cell project, for which the annotImage file can be found in /Arkivet/Molbio/Microarray technologies/R package for data analysis/dataImages of annotations/.

The function requires an object with IDs in one of its columns (column 1 by default), corresponding to IDs in the annotImage-object. The user also has the option to choose which columns to export. The produced file may then be opened in any web browser.

 

load("annotImage") # in some cases a saveLoadReference object is created instead; if this happens, assign the saveLoadReference object to annotImage
makeHtmlLL(object=tt.100, colVecTT=1:5, colVecAnnot=3:10, fileName="Interferon.html", dataImage.obj=annotImage)

 

For more information on how to create an annotImage file for your project, read the following description

 

http://biobase.biotech.kth.se/~valtteri/R/AnnotImageReadme.txt

 

Strange tailing phenomenon

In some cases a strange tailing phenomenon is observed in the volcano plot, as shown in red in the figure below. This is caused by probes with only one valid (non-NA) measurement.

 

Figure – Strange tailing phenomenon

 

For probes with only one valid measurement there are no degrees of freedom left to estimate the variance, and the B-value becomes proportional to the M-value.

The results obtained for probes with only a few valid measurements are probably not trustworthy, and we therefore suggest that these are removed from the data set. This can be done on the matrix of M-values generated above.
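To see how many probes are affected before filtering, the valid (non-NA) measurements per probe can be counted with plain R:

```r
# Number of valid (non-NA) measurements per probe across the slides.
n.valid <- rowSums(!is.na(Ms))
table(n.valid)       # distribution of valid-measurement counts
sum(n.valid <= 2)    # probes that would be removed at filt.level=2
```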

 

Ms <- filterOnNAs(Ms, filt.level=2, out.obj="filtered")

 

Here we choose to remove all features with two or fewer valid measurements. The results are returned as a filtered matrix of M-values and assigned to the same object as the input object thereby replacing it. For alternative uses of this function please see the help documentation.

 

This Ms matrix is used in the same way as explained above.

 

The filtering can also be carried out directly on the ma-objects, i.e. before the Ms-matrix is created. In this case the filterNbrOK function should be used. The argument filt.level determines how many non-valid measurements are accepted.

 

ma.filt.np.Nbrok2 <- clone(ma.filt.np)
ma.filt.np.Nbrok2 <- filterNbrOK(ma.filt.np.Nbrok2, filt.level=2)

 

This new ma-object may then be used as above in the identification of DE genes.

 

How to create a big matrix of M-values starting from M-value matrices in different workspaces

In cases where there is a large number of slides, not all the filtering and analysis steps can be done in the same workspace due to memory limits. The different comparisons are then carried out separately, but need to be imported into one final workspace where the identification of differentially expressed genes is carried out.

 

In this example we consider three separate comparisons, each carried out in a separate workspace. The aim is to import M-value matrices from each of these workspaces into a common fourth workspace, where the identification of differentially expressed genes is carried out.

 

In each workspace, give the M-value matrix a unique name and save it

 

Ms.compA <- Ms                        # for the A-comparison

save(Ms.compA, file="Ms.compA.RData") # saves the object into the current working directory

 

Transfer the .RData file to the working directory of the common workspace, then

 

load("Ms.compA.RData")   # imports the Ms.compA object and assigns it to an object with the same name as in the original workspace

 

Repeat the three steps above for each of the comparisons. Remember to replace compA with, for example, compB and compC.

 

Use the cbind function to make a new big matrix of M-values

 

Ms.all <- cbind(Ms.compA, Ms.compB, Ms.compC)

 

This will concatenate all the original M-value matrices into one large matrix. The order of the columns is determined by the order in which the matrices are given to cbind, and the column names are retained.
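Since cbind joins the matrices purely by position, it may be worth verifying that all matrices have their probes in the same row order before concatenating (a simple sanity check in plain R, not a kth-package function):

```r
# Stop with an error if the probe (row) order differs between matrices.
stopifnot(identical(rownames(Ms.compA), rownames(Ms.compB)),
          identical(rownames(Ms.compA), rownames(Ms.compC)))
```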

 

 

How to create a design object for reference design experiments with multiple comparisons

For more advanced comparisons (e.g. three samples compared to a common reference), the design argument used in the B-test functions above is a matrix of -1, 0 and 1 values. This matrix can be created either manually or using the modelMatrix function in the LIMMA package. For manual creation, consult your local statistician.

 

If using the modelMatrix function, first import the tab-separated targets file described earlier (this file must contain columns labelled Cy3 and Cy5, but may include additional columns as the user wishes)

 

targets <- readTargets("targets.txt")

 

Then create the modelMatrix, but remember to define the reference sample

 

design <- modelMatrix(targets, ref="ref")

 

 

Case study 1: Reference study where four samples are compared to a common reference

 

Experimental design: In this example we consider a case where we have two groups of animals
(genotype 1 and genotype 2), and each group is divided into treatment A and treatment B animals as shown in the figure below. Each experimental group (i.e. combination of genotype and treatment) contains 5 individual animals. The gene expression levels for each animal are compared to the common reference using a dye-swap design. This design requires 10 hybridisations per sample to reference comparison, and with four samples adds up to 40 slides.

A design like this contains replicates at different levels, starting from the five individual animals constituting the biological replicates, followed by technical replicates (dye-swap hybridisations; two per animal in this case). Some arrays, depending on the design, contain probes in duplicate, which renders an additional level of replicates. Further levels of replicates would be repeated RNA extractions or RNA amplifications but these are not included in this experiment.

It is important to recognise that the different levels of replicates cannot be treated as equal in the statistical tests used. This becomes clearly evident when the correlation coefficients between different replicate levels are analysed: the within-slide duplicates, for example, always correlate much better than, say, one animal with another, and their contribution should therefore be downweighted in the final analysis.

 

 

In this example we treat the dye-swap replicates as equal to the biological replicates, while the duplicate spots are handled using the duplicateCorrelation function. We start by reading in the targets file

 

targets <- readTargets("targets.txt")
targets

 

   Barcode      Name  Cy3  Cy5
1  1111111   Slide 1   A1  ref
2  1111112   Slide 2  ref   A1
3  1111113   Slide 3   A1  ref
4  1111114   Slide 4  ref   A1
5  1111115   Slide 5   A1  ref
6  1111116   Slide 6  ref   A1
7  1111117   Slide 7   A1  ref
8  1111118   Slide 8  ref   A1
9  1111119   Slide 9   A1  ref
10 1111120  Slide 10  ref   A1
11 1111121  Slide 11   A2  ref
12 1111122  Slide 12  ref   A2
13 1111123  Slide 13   A2  ref
14 1111124  Slide 14  ref   A2
15 1111125  Slide 15   A2  ref
16 1111126  Slide 16  ref   A2
17 1111127  Slide 17   A2  ref
18 1111128  Slide 18  ref   A2
19 1111129  Slide 19   A2  ref
20 1111130  Slide 20  ref   A2
21 1111131  Slide 21   B1  ref
22 1111132  Slide 22  ref   B1
23 1111133  Slide 23   B1  ref
24 1111134  Slide 24  ref   B1
25 1111135  Slide 25   B1  ref
26 1111136  Slide 26  ref   B1
27 1111137  Slide 27   B1  ref
28 1111138  Slide 28  ref   B1
29 1111139  Slide 29   B1  ref
30 1111140  Slide 30  ref   B1
31 1111141  Slide 31   B2  ref
32 1111142  Slide 32  ref   B2
33 1111143  Slide 33   B2  ref
34 1111144  Slide 34  ref   B2
35 1111145  Slide 35   B2  ref
36 1111146  Slide 36  ref   B2
37 1111147  Slide 37   B2  ref
38 1111148  Slide 38  ref   B2
39 1111149  Slide 39   B2  ref
40 1111150  Slide 40  ref   B2

 

We continue by creating the design object, a matrix in which the columns indicate which comparison each slide belongs to. A value of -1 indicates that the slide will be treated as dye-swapped in the analysis, i.e. its M-values are sign-flipped so that effectively the sample is in the red Cy5 channel and the reference in the green Cy3 channel. The name of the reference sample must be given in the function call.

 

 

design <- modelMatrix(targets,ref="ref")

design

 

   A1 A2 B1 B2
1  -1  0  0  0
2   1  0  0  0
3  -1  0  0  0
4   1  0  0  0
5  -1  0  0  0
6   1  0  0  0
7  -1  0  0  0
8   1  0  0  0
9  -1  0  0  0
10  1  0  0  0
11  0 -1  0  0
12  0  1  0  0
13  0 -1  0  0
14  0  1  0  0
15  0 -1  0  0
16  0  1  0  0
17  0 -1  0  0
18  0  1  0  0
19  0 -1  0  0
20  0  1  0  0
21  0  0 -1  0
22  0  0  1  0
23  0  0 -1  0
24  0  0  1  0
25  0  0 -1  0
26  0  0  1  0
27  0  0 -1  0
28  0  0  1  0
29  0  0 -1  0
30  0  0  1  0
31  0  0  0 -1
32  0  0  0  1
33  0  0  0 -1
34  0  0  0  1
35  0  0  0 -1
36  0  0  0  1
37  0  0  0 -1
38  0  0  0  1
39  0  0  0 -1
40  0  0  0  1

 

The modelMatrix function has identified four samples in the targets file: A1, A2, B1 and B2. Each sample is represented by a column in the design object and each slide by a row; in this case the first slide, for example, belongs to the reference vs. A1 comparison and will be dye-swapped (indicated by the minus sign).

 

Next we calculate the within-slide duplicate spot correlation using the duplicateCorrelation function.

 

cor <- duplicateCorrelation(Ms, design=design, ndups=2, spacing=15000)

 

where Ms is the matrix of M-values, design is the object created above, ndups indicates the number of duplicate spots on each array and spacing indicates the distance between duplicate spots in the gpr-file.

 

The cor object contains a list of all individual gene-wise correlations (cor$cor.genes) and a consensus value, a trimmed average computed on the atanh-transformed scale. The consensus value can be extracted using

 

cor$cor 

 

and the gene-wise correlations visualised using for example a boxplot

 

boxplot(cor$cor.genes)
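The consensus value can also be sketched by hand: transform the gene-wise correlations with atanh, take a trimmed mean, and transform back with tanh (the trim fraction of 0.15 used here is an assumption for illustration):

```r
# Sketch of the consensus correlation calculation (trim fraction assumed).
r <- cor$cor.genes[is.finite(cor$cor.genes)]
tanh(mean(atanh(r), trim=0.15))   # should be close to cor$cor
```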

 

The cor$cor value is used in the next step, where we calculate the estimates of the M-values for each gene using generalised least squares (GLS) and the gls.series function. The arguments are the same as for duplicateCorrelation, with cor$cor supplied as the correlation argument

 

fit <- gls.series(Ms, design=design, ndups=2, spacing=15000, correlation=cor$cor)

 

The fit object contains five fields of data (directly extracted from the LIMMA gls.series help file):

 

coefficients: numeric matrix containing the estimated coefficients for each linear model. Same number of rows as 'M', same number of columns as 'design'.

 

stdev.unscaled: numeric matrix conformal with 'coef' containing the unscaled standard deviations for the coefficient estimators. The standard errors are given by 'stdev.unscaled * sigma'.

 

sigma: numeric vector containing the residual standard deviation for each gene.

 

df.residual: numeric vector giving the degrees of freedom corresponding to 'sigma'.

 

correlation: inter-duplicate correlation.

 

The M-value estimates can be extracted in the following way:

 

fit$coefficients[1:10, ]

 

which shows the M-values for the first ten genes in all comparisons.

 

Often it is of little or no interest to know the fold-change difference between the reference sample and a sample of interest. Instead, the main interest is the difference between the samples themselves, in this case for example between A1 and A2, or between A1 and B1. To do this we create a contrast matrix in which we define which comparisons are of interest

 

design.cont <- makeContrasts("A1-A2", "B1-B2", "A1-B1", "A2-B2", levels=design)
design.cont

 

       A1-A2  B1-B2  A1-B1  A2-B2
A1         1      0      1      0
A2        -1      0      0      1
B1         0      1     -1      0
B2         0     -1      0     -1

Note that in this case we are not interested in all possible combinations and exclude "A1-B2" and "A2-B1".

This is of course dependent on your biological question of interest! Next we use this contrast matrix to calculate the new M-values, as well as degrees of freedom, sigma etc, corresponding to our parametrisation

 

fit2 <- contrasts.fit(fit, design.cont)
eb <- ebayes(fit2)

 

The ebayes function calculates the log-odds of differential expression, a moderated t-statistic and a p-value for each gene (adjustment for multiple testing is applied later in toptable; see the section describing multiple testing). The following values are calculated and stored in the eb-object (extracted from the LIMMA help file for ebayes):

 

t: numeric vector or matrix of penalized t-statistics

 

p.value: numeric vector of p-values corresponding to the t-statistics

 

s2.prior: estimated prior value for 'sigma^2'

 

df.prior: degrees of freedom associated with 's2.prior'

 

s2.post: vector giving the posterior values for 'sigma^2'

 

lods: numeric vector or matrix giving the log-odds of differential expression

 

var.prior: estimated prior value for the variance of the log2-fold-change for differentially expressed gene

 

The calculated B-values can be extracted using

 

eb$lods[1:5,]

 

which shows the B-values for the first five probes in all comparisons.
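The B-value is a log-odds, so it can be converted into the posterior probability of differential expression as exp(B)/(1+exp(B)). A small self-contained sketch with made-up B-values:

```r
B <- c(-4.6, -1.2, 0, 2.3)      # example B-values (log-odds of DE)
p <- exp(B) / (1 + exp(B))      # posterior probability of DE
round(p, 2)                     # B = 0 corresponds to a probability of 0.5
```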

 

Finally, use the toptable function to summarise the results and pick out the top candidate genes.

 

toptable(fit=fit2, coef=1, number=10, genelist=gpr$Name[,1],  eb=eb, adjust.method="holm",sort.by="B")

 

The number argument can be increased to include more genes. The coef argument indicates which comparison is of interest (numbered in the order given to the makeContrasts function).
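It is often convenient to store the table in an object for later use. A sketch, assuming the fit2 and eb objects created above; the object name tt.a1b1 is just an example, following the naming suggestions given later in this Guide:

```r
# store the top 100 genes for the third contrast ("A1-B1") for later use
tt.a1b1 <- toptable(fit=fit2, coef=3, number=100, genelist=gpr$Name[,1],
                    eb=eb, adjust.method="holm", sort.by="B")
```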

 

A positive M-value indicates that a particular gene is more highly expressed in the first sample of your parametrisation, which would be A1 for the "A1-A2" comparison.

 

The results may be visualised, for example, by using a volcano plot. For the "A1-B1" comparison this would be done using

 

plot(x=fit2$coef[,3], y=eb$lods[,3])

 

The number in the brackets indicates which parametrisation is of interest (see makeContrasts function above).
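To make the volcano plot more informative, the genes above some cut-off can be highlighted. A sketch assuming the fit2 and eb objects created above; the cut-off B > 0 (odds of DE above 1:1) is only an example:

```r
plot(x=fit2$coef[,3], y=eb$lods[,3], pch=20,
     xlab="M (log2 fold change)", ylab="B (log-odds of DE)")
sel <- eb$lods[,3] > 0                        # genes above the cut-off
points(fit2$coef[sel,3], eb$lods[sel,3], col=2, pch=20)   # marked in red
```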

 


How to visualise the results

There are several ways to visualise the results. This section describes the common figure/plot types and shows how they are created using R, but we start by looking at how to change the small, but important, details of a figure/plot in R.

 

The basics

We start by creating two vectors, x and y, both containing 100 random numbers. We then plot these against each other to get the basic plot shown in Figure Standard plot

 

x <- rnorm(100,4)
y <- rnorm(100)
plot(x, y)

 

Figure – Standard plot

The standard plot of y vs. x using default values. No fancy things added to this one.

 

We continue by looking at the different components of a figure and show how they can be changed to obtain a much nicer plot. Using the four steps below, the same data is visualised in a completely different way. The first step plots the data and adds colours, the second adds the titles, the third adds the line and the fourth is included to show that text can be added inside the plot if required.

 

plot(x,y, ann=FALSE, col=3, pch=20, col.axis=4, col.lab=4, cex=1.5, 
+ cex.axis=1.5, bty="n")

 

title(main="Title for the plot", sub="Additional information may be 
+ added here", xlab="Label for x-axis", ylab="Label for y-axis")

 

abline(h=0, lwd=2, lty=6, col=2)

 

text(x=5, y=2, "It is possible to add text here", col=2)

Figure – Modified plot

A nicer version of the same data, plotted using the four commands in the text.


It is also possible to zoom into a certain part of the plot by defining the xlim and ylim arguments in the plot command, as for example in

 

plot(x,y, ann=FALSE, col=3, pch=20, col.axis=4, col.lab=4, cex=1.5, ylim=c(-1,1), xlim=c(3,6))

 

where the x-axis is limited to 3 to 6 and the y-axis to -1 to 1.

 

Figure – Zooming in/out

Only a part of the data is shown.

 

Below is a table of the more useful arguments and a short explanation of what they control. The explanations are modified from the help file for the par function of R. The complete table is available through

 

?par

 

adj         Determines how text strings are justified: 0 gives left-justified text, 0.5 centered text and 1 right-justified text.
ann         If FALSE, do not annotate the plot with axis and overall titles. The default is to annotate.
bty         A character string determining the type of box drawn around the plot. If bty is one of "o", "l", "7", "c", "u" or "]", the resulting box resembles the corresponding upper-case letter. A value of "n" suppresses the box.
cex         A numerical value giving the amount by which plotting text and symbols should be scaled relative to the default.
cex.axis    The magnification for axis annotation relative to the current cex.
cex.lab     The magnification for x and y labels relative to the current cex.
cex.main    The magnification for main titles relative to the current cex.
cex.sub     The magnification for sub-titles relative to the current cex.
col         The default plotting colour.
col.axis    The colour for axis annotation.
col.lab     The colour for x and y labels.
col.main    The colour for plot main titles.
col.sub     The colour for plot sub-titles.
font        An integer specifying the text font. If possible, device drivers arrange so that 1 corresponds to plain text, 2 to bold face, 3 to italic and 4 to bold italic.
font.axis   The font for axis annotation.
font.lab    The font for x and y labels.
font.main   The font for plot main titles.
font.sub    The font for plot sub-titles.
las         The style of axis labels, numeric in {0,1,2,3}: 0 always parallel to the axis (the default), 1 always horizontal, 2 always perpendicular to the axis, 3 always vertical.
lty         The line type, specified as an integer: 0=blank, 1=solid, 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash.
lwd         The line width, a positive number, defaulting to 1.
pch         Either an integer specifying a symbol or a single character to be used as the default in plotting points.
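Most of these options can also be set globally with the par function, so that they apply to all subsequent plots instead of being repeated in every plot call. A small self-contained sketch (the chosen values are arbitrary examples):

```r
par(las=1, cex.axis=1.2, font.lab=2)   # horizontal axis labels, larger axis
                                       # text and bold axis labels for all
                                       # plots that follow
plot(rnorm(50), rnorm(50), xlab="x", ylab="y")
```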

 

More information on the plotting capabilities of R can be found in the excellent documentation by Ross Ihaka, available at

 

http://www.stat.auckland.ac.nz/~ihaka/120/Notes/ch03.pdf

 

MA-plot

We have already discussed the MA-plots used to visualise the data for one single array. This is probably the most widely used plot for microarray data.

There are two ways of generating the MA-plot, and both plots can be modified in the same way using the arguments discussed above.

 

The easiest way is to simply plot an ma-object, using the slide= argument

 

plot(ma.filt.np, slide=8)                      # plots the MA-plot for slide 8 in the ma-object

 

The alternative way is to define x and y separately. This is required, for example, when an averaged MA-plot is wanted. A plot of averaged normalised M-values can be created using the following steps. Note that duplicates are not present.

 

M <- ma.filt.np$M[,]    # extracts all M-values and places them in M
A <- ma.filt.np$A[,]    # extracts all A-values and places them in A

M.mean<-apply(X=M, MARGIN=1, FUN=mean, na.rm=TRUE)   # for details, see ?apply

A.mean<-apply(X=A, MARGIN=1, FUN=mean, na.rm=TRUE)

plot(x=A.mean, y=M.mean, col=1, pch=20, main="Averaged MA-plot", xlab="Mean A", ylab="Mean M")
abline(h=0) # not drawn in the figure below, but can be used to add a y=0 line indicating non-DE genes

 

Figure – Averaged MA-plot

 

Volcano-plot

 

Already discussed above.

 

Venn Diagram

Venn Diagrams are used to visualise the overlap of elements between two (A, B), three (A, B, C) or four (A, B, C, D) groups. Comparison of four groups is rather complicated, as depicted in the figure below, and rarely (if ever) used. From the Venn Diagram it is possible to identify the number of genes that are unique to A, unique to B, unique to C, common to A and B but not C, common to A and C but not B, common to B and C but not A and, finally, common to A, B and C. More information on Venn Diagrams can be found by following the source links in the legend of the Venn Diagrams figure below.

 

 

 

Figure – Venn Diagrams

a) A Venn Diagram for comparison of four groups (source: http://en.wikipedia.org/wiki/Venn_diagram),
b) A Venn Diagram for comparison of three groups (source:
http://www.math.csusb.edu/notes/sets/setsex5/setsex5.htm).

 

In R the Venn Diagram analysis can be carried out using the vennDia function of the KTH-package (the actual drawing of the diagrams is done by calling functions in the LIMMA package). The function requires at least two vectors (A and B) as input, and can be used with up to three vectors. The vectors may be either numeric (containing only numbers) or character. This means that one could, for example, use the index vectors generated from the toptable rownames to compare the genes differentially expressed in three different samples. In the example below, three "randomly" generated numeric vectors are compared.

 

a <- 1:10
b <- 5:15
d <- c(230,32,235,4312,54,245,1:5)
vennDia(A=a, B=b, C=d, names=c("Sample A", "Sample B", "Sample C"), 
+ main="Example of how to create a Venn Diagram")
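As mentioned above, the input may also be character vectors, for example gene names taken from toptable results. A sketch with hypothetical toptable objects tt1, tt2 and tt3 (created as in the toptable section):

```r
# compare the genes picked out for three different contrasts
vennDia(A=rownames(tt1), B=rownames(tt2), C=rownames(tt3),
        names=c("A1-A2", "B1-B2", "A1-B1"))
```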

 

 

Figure – Venn Diagram, example

 A Venn Diagram of example data


Clustering

There are three options for clustering, two of which use open-source software and are therefore recommended

 

GeneSpring

GeneSpring is a commercial software package with several clustering algorithms. We currently have only one licence at the department (installed on the ArrayEval6 computer). Data from R can be exported to a tab-separated text file and subsequently imported into GeneSpring. See the "How to export results from R" chapter for more information.

 

Multiexperiment viewer

MEV is a nice piece of software from TIGR, but it has some limitations. The software provides approximately ten different clustering algorithms as well as SAM and ANOVA analyses. To use MEV, the data has to be exported from R using the exportMEV function described in the data export part of this User's Guide. Additional information on the use of MEV is available on the TIGR web site at

 

http://www.tigr.org/software/tm4/menu/mev.pdf

http://www.tigr.org/software/tm4/menu/MeV.ppt

 

Clustering functions available in a collection of R-packages

More difficult to use, but directly compatible with the R data format. Detailed instructions for hierarchical clustering are available from Johan Lindberg.

 

** instructions will be added later on **

 

 


How to export results from R

There are several ways of exporting results from R depending on the type of object one would like to export.

For tables, the primary method is to export the table using the write.table function. The problem is, however, that its default arguments need to be modified a bit. There are alternatives to write.table with different default arguments, and more information on these can be found using

 

?write.table
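A minimal sketch of settings that produce a tab-separated file which opens cleanly in Excel (the data frame and file name are made-up examples):

```r
tt <- data.frame(Name=c("geneA", "geneB"), M=c(1.2, -0.8))
# tab-separated, no quoting, no row names: Excel-friendly output
write.table(tt, file="tt.txt", sep="\t", quote=FALSE, row.names=FALSE)
```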

 

For the most common export steps we have created a set of functions. More will be added later if there is demand from the end-users.

 

exportTopTable

Exports a toptable into a format compatible with Microsoft Excel. No fancy things; just remember to add the extension to the file name.

 

exportTopTable(tt.all, file="tt.all.txt")

 

export1, export2 and export3

Export the information in an ma-object by creating one file for each slide in the ma-object. These files may be used as input to, for example, GeneSpring.
Depending on which function was called, the following parameters are exported:

  • export1:            Index, Name, ID, F635, F532, spot diameter
  • export2:            Index, Name, ID, M, A, spot diameter
  • export3:            Index, Name, ID, F635, F532, M, A

The Name, ID and diameter are extracted from the gpr-object. As input, always use an ma-object (the conversions to R (F635) and G (F532) are done within the function). Users who need to export anything other than these are encouraged to contact the kth-package maintainer or to send e-mail to the mailing list.

 

export3(ma.filt.np, slideVec=1:8, NameOfFile="Interferon")

 

exportMEV

Exports data for user-defined slides from an ma-object into a tab-separated text file, which can be used as input to TIGR's MultiExperiment Viewer clustering software. Files will be named using either the slide names stored in the ma-object or names defined by the user. The following information will be extracted from the ma-object or gpr-object, or calculated, and written to a tab-separated text file. One file is created for every slide present in your ma-object.

  • UID     unique identifier, concatenation of 'slide type'-argument and index value in 'ma.obj'
  • IA        intensity for channel A (Cy3)
  • IB        intensity for channel B (Cy5)
  • R          slide row (rows numbered from 1 to n, block boundaries not considered)
  • C         slide column (columns numbered from 1 to n, block boundaries not considered)
  • MR      meta row (rows of blocks)
  • MC      meta column (columns of blocks)
  • Block   Block on slide
  • SR       row in block
  • SC       column in block

 

The function accepts several arguments which can be used to fine-tune its behaviour. The table below lists them and provides a short description.

 

Argument (default)      Description

ma.obj (none)           The ma-object containing the data.

gpr.obj (gpr)           The gpr-object containing the feature IDs and names. As this function requires large amounts of available memory, something called gpr.mini may be used instead of the gpr-object. Currently a gpr.mini object exists only for the Mouse Stem Cell project.

slideVec (none)         A numeric vector defining the slides to export.

idxVec (NULL)           A numeric vector defining the genes to export; NULL means all genes.

duplicates (none)       Logical value indicating whether duplicates should be averaged.

slidetype ("Stem32")    Name of the slide type. Will be added to the unique identifier. Make sure this prefix matches the unique identifier in your MEV annotation file. If in doubt, contact the person responsible for the annotation of your array.

slidenames (NULL)       Character vector with alternative slide names. If NULL, the slide names stored in the ma-object are used.

 

The slidetype argument is very important, as it is used to couple the exported probe identity with the annotation file read into the MEV software. Unless these match each other exactly, no clustering is possible.

 

Some examples:

 

a)      to export all data from all the slides in the ma-object

 

exportMEV(ma.obj=ma.filt.np, slideVec=1:8, duplicates=FALSE, slidetype="HumB9")

 

 

b)      to export the top 500 candidate genes for DE from all slides and to average the duplicates. The slides are from the Stem3.2 array

 

exportMEV(ma.obj=ma.filt.np, slideVec=1:24, idxVec=idx.top500, duplicates=TRUE, slidetype="Stem32")

 

For instructions on how to create the idxVec, have a look at the Additional things we can do with the toptables section of this document.

 

 

To run this function takes some time (go get some coffee!!!).

 

 

 

 

 


 

Suggestion for naming objects in R

Objects in R can be named in almost any way.

A typical analysis process of an experiment takes, however, several weeks and at some point the user has to return to old analysis steps. When several projects are analysed in parallel, it becomes difficult to remember what the different objects were named and one should therefore use informative object names.

To avoid problems there is a suggestion for how to name the different objects of the intermediate steps. It is in no way compulsory to follow these recommendations, but it is much easier for other users to help you if you are using this naming system.

 

 

First some things to remember

 

  • Avoid capital letters (they look “aggressive”)
  • Do not use the underscore (_) or dash (-) signs: the underscore is used by R for assignment and the dash is interpreted as subtraction
  • Use dot (.) to separate words
  • Use short, but informative names (however, it is better to use a long name than a short non-informative name)
  • Avoid having duplicate objects (memory issues)
  • Keep the intermediate ma-objects (use clone-function)
  • never modify the gpr-object

 

Below is a list of the most common objects needed:

 

gpr       imported GenePix results files

raw      raw data extracted from gpr

ma       transformation of R and G to M and A values

ff          flags filtered

fs         saturated features filtered

ab        B+2SD filter applied

fmm     mean to median filter applied

fsiz      feature size filter applied

fc         intensities from control clones applied

frc        ratio comparison applied

np        print-tip normalised

nl         lowess normalised

nb        between-slides normalised

tt          TopTable object (results of the LIMMA toptable-function)

eb        results of the ebayes function

cor       results of the duplicateCorrelation function

idx        a vector with index values

Ms       matrix of M-values

As        matrix of A-values

Bs        matrix of B-values
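The abbreviations are meant to be composed from left to right, in the order the analysis steps were applied. A few hypothetical examples of names built with this scheme (the objects themselves are illustrative):

```r
# ma          an ma-object created from the raw data
# ma.ff       the ma-object after flags filtering
# ma.ff.np    after flags filtering and print-tip normalisation
# tt.a1a2     TopTable with the results for the "A1-A2" contrast
```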


Where to find more information?

Below is a collection of web sites where more information may be found

 

Ross Ihaka’s Introduction to R and Statistical Computing I and II chapters

http://www.stat.auckland.ac.nz/~ihaka/120/Notes/ch02.pdf

http://www.stat.auckland.ac.nz/~ihaka/120/Lectures/lecture04.pdf

http://www.stat.auckland.ac.nz/~ihaka/120/Lectures/lecture05.pdf

 

Manuals and other contributed documentation discussing R-related issues

http://cran.r-project.org/manuals.html

http://cran.r-project.org/other-docs.html

 

Lecture notes and slides relating to Bioconductor

http://www.bioconductor.org/workshop.html

 

Vignettes (short descriptions) of a number of essential Bioconductor functions

http://www.bioconductor.org/viglistingindex.html

 

In addition to the R-packages mentioned in this Guide, there are a huge number of other packages which may freely be used. A nice list of these is provided by Korbinian Strimmer:

http://www.stat.uni-muenchen.de/~strimmer/rexpress.html


How to contribute

 

Do not hesitate to contact me if you think you can contribute to this document.

All help is acknowledged and greatly appreciated.

 

Co-authorship is an option for those willing to contribute on a larger scale.

 

 

Send me e-mail!

valtteri@biotech.kth.se

 

 

 


Acknowledgements

Henrik Bengtsson and Gordon Smyth are acknowledged for their excellent microarray data analysis packages aroma and LIMMA, respectively. All data import and normalisation functions, as well as the core structure of the ma-objects, are based on work done by HB. The functions to calculate the B-values are by GS.

The help of the co-authors of the KTH-package, Marcus Gry Björklund and Johan Lindberg, is greatly appreciated, as well as the whole microarray group at Department of Biotechnology, KTH.