|
Tutorial
This document is a tutorial for the DetectiV software.
The tutorial contains code that can be directly executed in R, the software
on which DetectiV is based.
1. Requirements
You will need:
source("http://www.bioconductor.org/biocLite.R")
biocLite("limma")
biocLite("GEOquery")
Connecting R to the internet
If you're having problems with R accessing the internet, then there are a couple of things to try.
On windows,
right click the shortcut to R, and click properties. In the target box, enter the flag '--internet2':
On linux or unix systems, if you are using a proxy server, make sure your http_proxy environment variable is set
and you may also need to 'unset no_proxy'.
2. Getting some data
Using the data already in the package
Using GEOquery
Code for retrieving data from GEO can be found in the /doc
folder of the DetectiV package, however if you want to do it yourself, here
is an example:
library(GEOquery)
gse2228 <- getGEO("GSE2228")
gsmplatforms <- lapply(GSMList(gse2228), function(x) {
Meta(x)$platform
})
probesets <- Table(GPLList(gse2228)[[1]])$ID
gse2228.matrix <- do.call("cbind", lapply(GSMList(gse2228), function(x) {
tab <- Table(x)
mymatch <- match(probesets, tab$ID_REF)
return(tab$CH1_MEDIAN[mymatch])
}))
gse2228.matrix[is.na(gse2228.matrix)] <- 0
gse2228.matrix[gse2228.matrix<=0] <- 0.5
gse2228.genes <- Table(gse2228@gpls[[1]])
Alternatively, at time of writing the data can be downloaded as files:
GSE2228
GSE546
and then loaded using GEOquery:
library(GEOquery)
gse2228 <- getGEO(filename="GSE2228_family.soft.gz")
gse546 <- getGEO(filename="GSE546_family.soft.gz")
Using limma
Limma is a package from bioconductor for the analysis of microarray data. It contains functions to read in
microarray data from many common formats. Here is an example of using limma to read in some genepix files (not provided):
library(limma)
setwd("C://path//to//a//directory//containing//gprfiles")
myfiles <- dir(pattern="gpr")
RG <- read.maimages(myfiles, source="genepix")
Preparing the data
Before we visualise the data, we need to prepare the data. This includes
averaging over replicate probes and joining the data to probe annotation data. If you used the data(DetectiV) command,
then the probe annotation data for the Urisman et al dataset will already be loaded:
data(DetectiV)
?grouping
grouping[1:10,]
However, if you did not use the data(DetectiV) command, you must load the probe annotation information. There is
a file, "oligo_annotation.txt" in the /doc folder of the DetectiV package. This can be loaded:
grouping <- read.table(file="oligo_annotation.txt",
header=TRUE, sep="\t", quote="")
grouping[1:10,]
The next stage is to prepare the data by joining the intensity information to the probe annotation and averaging
over replicate probes (where appropriate):
gse2228 <- prepare.data(gse2228.matrix, gse2228.genes$OLIGO_ID, grouping, "UID")
gdata <- prepare.data(RG$G-RG$Gb, RG$genes$ID, RG$genes, "Name")
gdata <- prepare.data(RG$G, RG$genes$ID, RG$genes, "Name")
3. Visualisation 1
Visualisation is by means of a barplot, with one bar per unique oligo ID, and oligos grouped together according
to user defined annotation:
show.barplot(gse2228, "FAMILY", 2)
show.barplot(gse2228, "SPECIES", 2)
show.barplot(gse2228, "FAMILY", 10)
?show.barplot
pap <- get.subset(gse2228,"SPECIES","Human papillomavirus")
show.barplot(pap, "SPECIES", 2, limit.label=30)
4. Normalisation
On the vast majority of arrays, most oligos will report an intensity above zero, due to either non-specific
hybridisation or simply the effects of scanning and image analysis. The purpose of normalisation here is to
compare all oligos against a negative control and express as a log ratio. This should ensure that oligos that
are not binding to a specific target will be normally distributed and have mean zero. We can then use a t-test
to compare that distribution to zero
There are three options for the negative control in DetectiV. These are:
- Specific negative controls: a subset of oligos are used for the negative control, and all intensities are expressed as a ratio to their mean
- The global median: the global median intensity is calculated and all intensities are expressed as a ratio to that median
- A specific array: an entire array is specified as the negative control, and all intensities are expressed as a ratio to their respective intensities on the control array
mdata <- normalise(gse2228, 2:57, method="median")
adata <- normalise(gse2228, 2:57, carray=gse2228[,3])
controls <- grep("HumanPool", gse2228$UID)
odata <- normalise(gse2228, 2:57, controls)
5. Significance Testing
Finally, we can test the data to find if any groups of viruses appear significantly different from zero. The groups
are again defined by the unique values of a user-defined column, so the oligos may be grouped by virus family, virus species etc.
In most instances, for significance testing, we will wish to group oligos by virus species. What we are looking for here
are small p-values in combination with large mean normalised log ratios. The small p-value indicates a group of oligos
that are significantly different from zero, but a large mean normalised log ratio indicates how far from zero they are,
and for a positive result we would expect them to be very far from zero.
tt <- do.t.test(mdata, mdata$SPECIES, 2)
ignore <- c("ACTIN", "Actin", "GAPDH", "Line_Sine")
tt <- tt[!tt$fv %in% ignore,]
tt[tt$av>=1,][1:10,]
This tutorial represents only part of the functionality in the DetectiV package. For the rest, I recommend you
read the full documentation after installing.
Comments and queries should be made to the author, Michael Watson
6. References
- http://derisilab.ucsf.edu/software/epredict/index.html
- Urisman A, Fischer KF, Chiu CY, Kistler AL, Beck S, Wang D, DeRisi
JL. E-Predict: a computational strategy for species identification
based on observed DNA microarray hybridization patterns. Genome
Biol. 2005;6(9):R78.
|