Genetics. 2012 Dec;192(4):1249-69.
doi: 10.1534/genetics.112.144204. Epub 2012 Oct 10.

CloudMap: a cloud-based pipeline for analysis of mutant genome sequences

Gregory Minevich et al. Genetics. 2012 Dec.

Abstract

Whole genome sequencing (WGS) allows researchers to pinpoint genetic differences between individuals and significantly shortcuts the costly and time-consuming part of forward genetic analysis in model organism systems. Currently, the most effort-intensive part of WGS is the bioinformatic analysis of the relatively short reads generated by second generation sequencing platforms. We describe here a novel, easily accessible and cloud-based pipeline, called CloudMap, which greatly simplifies the analysis of mutant genome sequences. Available on the Galaxy web platform, CloudMap requires no software installation when run on the cloud, but it can also be run locally or via Amazon's Elastic Compute Cloud (EC2) service. CloudMap uses a series of predefined workflows to pinpoint sequence variations in animal genomes, such as those of premutagenized and mutagenized Caenorhabditis elegans strains. In combination with a variant-based mapping procedure, CloudMap allows users to sharply define genetic map intervals graphically and to retrieve very short lists of candidate variants with a few simple clicks. Automated workflows and extensive video user guides are available to detail the individual analysis steps performed (http://usegalaxy.org/cloudmap). We demonstrate the utility of CloudMap for WGS analysis of C. elegans and Arabidopsis genomes and describe how other organisms (e.g., Zebrafish and Drosophila) can easily be accommodated by this software platform. To accommodate rapid analysis of many mutants from large-scale genetic screens, CloudMap contains an in silico complementation testing tool that allows users to rapidly identify instances where multiple alleles of the same gene are present in the mutant collection. Lastly, we describe the application of a novel mapping/WGS method ("Variant Discovery Mapping") that does not rely on a defined polymorphic mapping strain, and we integrate the application of this method into CloudMap. 
CloudMap tools and documentation are continually updated at http://usegalaxy.org/cloudmap.
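
The in silico complementation test described in the abstract can be pictured as a grouping step: collect the candidate variants of every sequenced mutant and report genes that are hit in more than one strain. The following is a minimal sketch of that idea, not the CloudMap tool itself; the function name `shared_gene_hits`, the `(gene, position)` input layout, and the example strain and gene names are all assumptions made for illustration.

```python
from collections import defaultdict


def shared_gene_hits(variants_by_strain, min_strains=2):
    """Return genes carrying candidate variants in at least `min_strains` strains.

    variants_by_strain maps a strain name to a list of (gene, position)
    tuples (a hypothetical layout; real input would come from annotated
    variant tables).
    """
    strains_by_gene = defaultdict(set)
    for strain, variants in variants_by_strain.items():
        for gene, _position in variants:
            strains_by_gene[gene].add(strain)
    return {
        gene: sorted(strains)
        for gene, strains in strains_by_gene.items()
        if len(strains) >= min_strains
    }


# Hypothetical example data: three mutants from one screen.
example = {
    "mutantA": [("gene-1", 100), ("gene-2", 2500)],
    "mutantB": [("gene-1", 340), ("gene-3", 900)],
    "mutantC": [("gene-4", 50)],
}
hits = shared_gene_hits(example)
print(hits)
```

In practice such a comparison is only meaningful after background variants shared by all strains have been subtracted; otherwise every background lesion would be reported as a shared hit.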

Figures

Figure 1
CloudMap overall conceptual strategy for mutant genome analysis. This high-level summary depicts the main CloudMap processes and outputs. Detailed overview of all the CloudMap functions is provided in Figure 3, in the user guides, and published workflows available at http://usegalaxy.org/cloudmap.
Figure 2
Screenshot of Galaxy workflow using the ot266 example discussed in Proof-of-principle application of CloudMap. Users may run this workflow as well as others at http://usegalaxy.org/cloudmap. The output of the ot266 workflow is also available as a shared history at the URL mentioned above. Here we see a Galaxy history with the FASTQ raw data file for ot266 along with various reference files used as input into the CloudMap Hawaiian Variant Mapping With WGS Data and Variant Calling workflow. The reader is referred to user guides and videos for step-by-step instructions.
Figure 3
Summary flowchart illustrating all functions used in the CloudMap pipeline. More experienced users may choose among different software tools to perform desired operations at marked decision points in the flowchart. Detailed step-by-step instructions are available in user guides and videos.
Figure 4
Sample screenshot of snpEff output following markup of affected transcription factors by CloudMap Check snpEff Candidates tool. Tabular output of mutated genes and transcripts from snpEff together with lists of candidate loci can be used as input into the CloudMap Check snpEff Candidates tool. In the example shown here, the output of the analysis of ot266 is displayed with the causal lesion in the vab-3 gene labeled as a homeodomain transcription factor. The “Quality” column reflects the GATK-assigned, PHRED-based QUAL score from the VCF file input into snpEff (Danecek et al. 2011).
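
For readers unfamiliar with PHRED-scaled scores: a QUAL value Q corresponds to an error probability of 10^(-Q/10), so higher values mean more confident calls. The short sketch below shows the conversion in both directions; the function names are illustrative and are not part of GATK or snpEff.

```python
import math


def phred_to_error_prob(qual):
    """Convert a PHRED-scaled score to the probability that the call is wrong."""
    return 10 ** (-qual / 10)


def error_prob_to_phred(prob):
    """Convert an error probability (0 < prob <= 1) to a PHRED-scaled score."""
    return -10 * math.log10(prob)


print(phred_to_error_prob(30))
print(error_prob_to_phred(0.01))
```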
Figure 5
Variant subtraction and filtration. Only a subset of variants in a sample are legitimate candidates that might be responsible for the mutant phenotype of interest. In addition to the ability to map potential mutant lesions to a small region (∼1 Mb), the CloudMap pipeline allows users to subtract nonphenotype-inducing variants from consideration. (A) Subtracting variants present in the background strain. If the premutagenesis, starting strain has been sequenced, users may use the GATK Select Variants tool to subtract starting strain variants (“background variants”) from consideration. (B) Subtracting variants present in other mutant strains from the same screen. If the premutagenesis strain has not been sequenced, then fewer variants can be subtracted from the mutant under consideration. If other mutant strains from the same screen have been sequenced, common variants present in the premutagenesis strain can be deduced from sequence analysis of such mutants. Employing a fairly conservative approach, we can choose to subtract variants only if they are present in two or more mutants that have been derived from forward genetic screens on the same starting strain. (C) Subtracting variants present in at least one mutant strain of the same background. A less conservative variant subtraction strategy than the one described in B involves subtracting all variants that are present in the mutant strain of interest and at least one additional strain from the same screen. (D) Subtracting variants present in at least one strain of any background. A more liberal variant subtraction strategy can be performed by subtracting variants present in at least one strain of any background. The same caveats for this strategy apply as for the strategy described above in C. As variant information from more whole genome sequenced strains becomes available, more variants will be available for this subtraction strategy.
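
The subtraction strategies in this caption amount to set operations on variant lists. The sketch below illustrates them under stated assumptions: variants are represented as `(chromosome, position, ref, alt)` tuples, the function names are hypothetical, and the `min_other` threshold is one possible reading of the conservative (B) versus less conservative (C, D) strategies. In CloudMap itself the subtraction is performed with the GATK Select Variants tool, not with code like this.

```python
from collections import Counter


def subtract_background(mutant, background):
    """Strategy A: drop variants already present in the sequenced starting strain."""
    return mutant - background


def subtract_shared(mutant, other_strains, min_other=1):
    """Strategies B-D: drop variants seen in at least `min_other` other strains.

    `other_strains` is a list of variant sets; each strain counts at most
    once per variant because every strain is a set.
    """
    counts = Counter(v for strain in other_strains for v in strain)
    return {v for v in mutant if counts[v] < min_other}


# Hypothetical variants: (chromosome, position, ref, alt).
mutant = {("I", 100, "C", "T"), ("II", 200, "G", "A"),
          ("X", 300, "A", "G"), ("V", 400, "G", "A")}
background = {("I", 100, "C", "T")}
others = [{("II", 200, "G", "A"), ("X", 300, "A", "G")},
          {("II", 200, "G", "A")}]

after_background = subtract_background(mutant, background)
conservative = subtract_shared(after_background, others, min_other=2)
liberal = subtract_shared(after_background, others, min_other=1)
print(sorted(conservative))
print(sorted(liberal))
```

Raising `min_other` keeps more candidates at the cost of a longer list; lowering it shortens the list but risks discarding a causal variant that happens to recur in another strain.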
Figure 6
CloudMap Hawaiian Variant Mapping with WGS Data strategy. (A) Schematic presentation of a previously described one-step strategy for whole genome sequencing and mapping (Doitsidou et al. 2010) modeled on a similar strategy in plants (Schneeberger et al. 2009). (B) The CloudMap Hawaiian Variant Mapping With WGS tool plots the ratio of mapping strain alleles/total reads at each of the mapping strain SNP positions in the genome, as exemplified with the ot266 dataset. To better visualize trends in the scatter plots of the SNP ratios, we plot a LOESS regression line (red) through all the points on each chromosome. Each scatter plot also has a corresponding frequency plot that displays regions of linked chromosomes where pure parental allele SNP positions are concentrated. The same genomic region that shows linkage in the LOESS scatter plots also shows a matching peak in the frequency plots of pure parental alleles. These frequency plots are binned by default into 1-Mb (gray) and 0.5-Mb (red) bins although these bin sizes are adjustable. The figure also shows 2-Mb (gray) and 1-Mb (red) bin sizes (top frequency plot) and 0.5-Mb and 0.25-Mb bin sizes (bottom frequency plot). Data in these plots can also be normalized to improve the mapping signal (details in text, Figure 7, and Table S1). (C) CloudMap Hawaiian Variant Mapping with WGS Data variant subtraction. As described in the text and in Figure 5, subtracting variants present in other samples can reduce the number of variants that are considered candidates for causing the phenotype of interest.
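
The two plot types described in panel B rest on simple per-SNP arithmetic: a ratio of mapping strain reads to total reads, and a per-bin count of positions where only the parental allele is seen. The following sketch shows that arithmetic only. It assumes a "pure parental" position is one with a ratio of exactly 0, uses a hypothetical `(position, mapping_strain_reads, total_reads)` input layout, and omits the LOESS fit and plotting that the actual tool performs.

```python
from collections import Counter


def allele_ratios(snps):
    """Return (position, mapping-strain reads / total reads) for covered SNPs.

    `snps` is an iterable of (position_bp, mapping_strain_reads, total_reads);
    positions with zero coverage are skipped.
    """
    return [(pos, alt / total) for pos, alt, total in snps if total > 0]


def pure_parental_bin_counts(ratios, bin_size_bp=1_000_000):
    """Count SNP positions with ratio 0 (assumed 'pure parental') per bin."""
    counts = Counter()
    for pos, ratio in ratios:
        if ratio == 0:
            counts[pos // bin_size_bp] += 1
    return dict(counts)


# Hypothetical read counts along one chromosome.
snps = [(150_000, 0, 20), (900_000, 0, 15), (1_200_000, 5, 10),
        (2_400_000, 0, 30), (2_600_000, 12, 12), (3_000_000, 3, 0)]

ratios = allele_ratios(snps)
bins_1mb = pure_parental_bin_counts(ratios)
bins_half_mb = pure_parental_bin_counts(ratios, bin_size_bp=500_000)
print(bins_1mb)
print(bins_half_mb)
```

A linked region would show up as a run of bins with high counts, matching the dip toward 0 in the scatter plot of ratios.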
Figure 7
Hawaiian SNP normalization. (A) All mapping plots examined thus far contain similar regions of pure parental alleles. To normalize for these and improve the mapping signal, we removed those Hawaiian SNPs from consideration where the ratio of Hawaiian alleles/total read depth was either <0.05 or >0.95 in at least two mutant strains (Table S1 and details in text). (B) Equation of mapping strain SNP normalization procedure. Users have the option of applying this normalization when using the CloudMap Hawaiian Variant Mapping With WGS Data tool.
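
The filter described in panel A can be sketched as follows: flag every Hawaiian SNP whose allele ratio is below 0.05 or above 0.95 in at least two mutant strains, then drop the flagged SNPs from a new sample before plotting. This is an illustration based only on the caption text, not the equation of panel B; the function names, the dictionary layout, and the example values are assumptions.

```python
def uninformative_snps(ratios_by_strain, low=0.05, high=0.95, min_strains=2):
    """Return SNPs whose ratio is < low or > high in at least `min_strains` strains.

    `ratios_by_strain` maps strain name -> {snp_id: Hawaiian allele ratio}.
    """
    extreme_counts = {}
    for ratios in ratios_by_strain.values():
        for snp, ratio in ratios.items():
            if ratio < low or ratio > high:
                extreme_counts[snp] = extreme_counts.get(snp, 0) + 1
    return {snp for snp, n in extreme_counts.items() if n >= min_strains}


def normalize(sample_ratios, excluded):
    """Drop the flagged SNPs from one sample's ratio table."""
    return {snp: r for snp, r in sample_ratios.items() if snp not in excluded}


# Hypothetical ratios for three previously mapped strains.
ratios_by_strain = {
    "strain1": {("I", 1000): 0.02, ("I", 2000): 0.50, ("II", 500): 0.97},
    "strain2": {("I", 1000): 0.01, ("I", 2000): 0.45, ("II", 500): 0.60},
    "strain3": {("I", 1000): 0.40, ("I", 2000): 0.55, ("II", 500): 0.99},
}
excluded = uninformative_snps(ratios_by_strain)
cleaned = normalize({("I", 1000): 0.0, ("I", 2000): 0.3}, excluded)
print(sorted(excluded))
print(cleaned)
```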
Figure 8
Proof-of-Principle Variant Subtraction strategy. A step-by-step proof-of-principle analysis using the vab-3(ot266) allele. Strains ot260 and ot263 are mutants retrieved from the same screen for loss of dopamine neuron specification as ot266. ot266 was crossed to the highly polymorphic Hawaiian strain so it contains many more variants than the ot260 and ot263 strains, which were sequenced without outcrossing. Automated workflows for this analysis, raw datasets, and a shared history of the analysis are all available at http://usegalaxy.org/cloudmap.
Figure 9
Hawaiian Variant Mapping With WGS Data tool support for other organisms. To demonstrate support for organisms other than C. elegans, we show that CloudMap can be used to map mutant WGS data from Arabidopsis (Schneeberger et al. 2009). Users must provide a simple configuration file for organisms other than C. elegans and Arabidopsis. Configuration files for most organisms and instructions for other organism support are provided at http://usegalaxy.org/cloudmap.
Figure 10
The vab-3(ot266) allele as displayed in the UCSC Genome Browser. Users can view their WGS alignments and any other track-based data in their choice of genome browser (UCSC, WormBase, IGB, or Galaxy Trackster). Here we show the vab-3 locus and a zoom-in view of the C→T SNP that leads to a premature stop mutation.
Figure 11
Variant Discovery Mapping. (A) Schematic representation of two extreme examples of the segregation of crossing strain variants (lime green diamonds), mutagen-induced variants (red diamonds), and background strain variants (pale blue diamonds) following an outcross of a mutant strain (white chromosome) to a nonparental (gray chromosome) strain. (B) Schematic representation of variant subtraction strategy for allele frequency plots. Allele frequency scatter plots display the ratio of variant reads/total reads at heterozygous and homozygous variant positions in the sequenced sample of pooled F2 mutant progeny. Scatter plots are shown both prior to and after the successive subtraction of crossing-strain and background-strain variants. Color scheme is the same as in A. CloudMap Variant Discovery Mapping plots of normalized pure parental allele frequency for ot266. Note: y-axis scales are not consistent from panel to panel due to normalization. (C) Schematic representation of combining variant lists from other mutants to generate crossing-strain– or background-strain–specific variant lists for subtraction during Variant Discovery Mapping. Color scheme is the same as in A and B.

References

    1. Abe A., Kosugi S., Yoshida K., Natsume S., Takagi H., et al., 2012. Genome sequencing reveals agronomically important loci in rice using MutMap. Nat. Biotechnol. 30: 174–178.
    2. Afgan E., Baker D., Coraor N., Goto H., Paul I. M., et al., 2011. Harnessing cloud computing with Galaxy Cloud. Nat. Biotechnol. 29: 972–974.
    3. Bigelow H., Doitsidou M., Sarin S., Hobert O., 2009. MAQGene: software to facilitate C. elegans mutant genome sequence analysis. Nat. Methods 6: 549.
    4. Blankenberg D., Gordon A., Von Kuster G., Coraor N., Taylor J., et al., 2010. Manipulation of FASTQ data with Galaxy. Bioinformatics 26: 1783–1785.
    5. Cingolani P., Platts A., Wang le L., Coon M., Nguyen T., et al., 2012. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 6: 80–92.