![]() |
DetectiV: Analysis of pathogen detection microarray data |
|
TutorialThis document is a tutorial for the DetectiV software. The tutorial contains code that can be directly executed in R, the software on which DetectiV is based.
1. RequirementsYou will need:
# To install the bioconductor packages: source("http://www.bioconductor.org/biocLite.R")
Connecting R to the internetIf you're having problems with R accessing the internet, then there are a couple of things to try. On windows, right click the shortcut to R, and click properties. In the target box, enter the flag '--internet2': ![]() On linux or unix systems, if you are using a proxy server, make sure your http_proxy environment variable is set and you may also need to 'unset no_proxy'. 2. Getting some dataUsing the data already in the package# Load the Urisman et al data in the package: data(DetectiV) Using GEOqueryCode for retrieving data from GEO can be found in the /doc
folder of the DetectiV package, however if you want to do it yourself, here
is an example: # you could use the GEOquery package directly: library(GEOquery) # get gse2228 gse2228 <- getGEO("GSE2228") gsmplatforms <- lapply(GSMList(gse2228), function(x) { Meta(x)$platform }) probesets <- Table(GPLList(gse2228)[[1]])$ID gse2228.matrix <- do.call("cbind", lapply(GSMList(gse2228), function(x) { tab <- Table(x) mymatch <- match(probesets, tab$ID_REF) # Note: retrieving the NON-background corrected median intensity return(tab$CH1_MEDIAN[mymatch]) })) # remove NA's and zeros gse2228.matrix[is.na(gse2228.matrix)] <- 0 gse2228.matrix[gse2228.matrix<=0] <- 0.5 # get the genes gse2228.genes <- Table(gse2228@gpls[[1]])
Alternatively, at time of writing the data can be downloaded as files:
and then loaded using GEOquery:
library(GEOquery) gse2228 <- getGEO(filename="GSE2228_family.soft.gz") gse546 <- getGEO(filename="GSE546_family.soft.gz") Using limmaLimma is a package from bioconductor for the analysis of microarray data. It contains functions to read in
microarray data from many common formats. Here is an example of using limma to read in some genepix files (not provided):
# load the library library(limma) # find your gpr files # NB: substitute your own value for the directory setwd("C://path//to//a//directory//containing//gprfiles") myfiles <- dir(pattern="gpr") RG <- read.maimages(myfiles, source="genepix") Preparing the data
Before we visualise the data, we need to prepare the data. This includes
averaging over replicate probes and joining the data to probe annotation data. If you used the data(DetectiV) command,
then the probe annotation data for the Urisman et al dataset will already be loaded:
data(DetectiV) ?grouping grouping[1:10,] However, if you did not use the data(DetectiV) command, you must load the probe annotation information. There is
a file, "oligo_annotation.txt" in the /doc folder of the DetectiV package. This can be loaded:
grouping <- read.table(file="oligo_annotation.txt", header=TRUE, sep="\t", quote="") grouping[1:10,]
The next stage is to prepare the data by joining the intensity information to the probe annotation and averaging
over replicate probes (where appropriate):
# we give the prepare.data() function: # - the matrix of intensities, with rows representing probes and columns arrays # - a vector of probe identifiers, one per row of the above matrix # - a data frame of data representing probe annotation # - the name of the column within the probe annotation that contains the probe identifiers # for GEOquery data gse2228 <- prepare.data(gse2228.matrix, gse2228.genes$OLIGO_ID, grouping, "UID") # For limma data (presuming intensities are in the Cy3 channel) # Background corrected: gdata <- prepare.data(RG$G-RG$Gb, RG$genes$ID, RG$genes, "Name") # Not background corrected: gdata <- prepare.data(RG$G, RG$genes$ID, RG$genes, "Name") 3. Visualisation 1Visualisation is by means of a barplot, with one bar per unique oligo ID, and oligos grouped together according
to user defined annotation:
# we give the show.barplot() function: # - the output from prepare.data() # - the name of the column that contains the annotation by which oligos will be grouped # - the index of the column of the array we wish to show # group by virus family and show array GSM40806 show.barplot(gse2228, "FAMILY", 2) # group by virus species and show array GSM40806 show.barplot(gse2228, "SPECIES", 2) # group by virus family and show array GSM40814 show.barplot(gse2228, "FAMILY", 10) # there are many other options to show.barplot(): ?show.barplot # we can also use the get.subset() function to visualise # a subset of viruses: pap <- get.subset(gse2228,"SPECIES","Human papillomavirus") show.barplot(pap, "SPECIES", 2, limit.label=30) 4. NormalisationOn the vast majority of arrays, most oligos will report an intensity above zero, due to either non-specific hybridisation or simply the effects of scanning and image analysis. The purpose of normalisation here is to compare all oligos against a negative control and express as a log ratio. This should ensure that oligos that are not binding to a specific target will be normally distributed and have mean zero. We can then use a t-test to compare that distribution to zero There are three options for the negative control in DetectiV. These are:
# normalise the data to the global median mdata <- normalise(gse2228, 2:57, method="median") # normalise the data to a chosen array adata <- normalise(gse2228, 2:57, carray=gse2228[,3]) # normalise the data according to specific control oligos controls <- grep("HumanPool", gse2228$UID) odata <- normalise(gse2228, 2:57, controls) 5. Significance TestingFinally, we can test the data to find if any groups of viruses appear significantly different from zero. The groups
are again defined by the unique values of a user-defined column, so the oligos may be grouped by virus family, virus species etc.
In most instances, for significance testing, we will wish to group oligos by virus species. What we are looking for here
are small p-values in combination with large mean normalised log ratios. The small p-value indicates a group of oligos
that are significantly different from zero, but a large mean normalised log ratio indicates how far from zero they are,
and for a positive result we would expect them to be very far from zero.
# group according to species and test array GSM40806 tt <- do.t.test(mdata, mdata$SPECIES, 2) # filter some of the controls from the results ignore <- c("ACTIN", "Actin", "GAPDH", "Line_Sine") tt <- tt[!tt$fv %in% ignore,] # filter such that each group's mean log ratio is > 1 and look at the top ten tt[tt$av>=1,][1:10,] This tutorial represents only part of the functionality in the DetectiV package. For the rest, I recommend you read the full documentation after installing. Comments and queries should be made to the author, Michael Watson 6. References
|