Mutation Mapping Pipeline for *C. elegans* EMS mutagenesis and Backcross Experiments • MutantSets

Quick Start

Inputs:
- Paired-end fastq files uploaded to a Galaxy (Afgan et al. 2016; Jalili et al. 2020) history as a list of dataset pairs
- A suitable genome fasta file (C. elegans, ce11.fa.gz - Compatible with WBcel235.86 used by SnpEff)
Run the Pipeline: https://usegalaxy.eu/u/richardjacton/w/c-elegans-ems-mutagenesis-mutation-caller
Outputs:
- MultiQC HTML report with QC info on the input fastqs, trimming, mapping, and deduplication steps.
- .vcf file with variants from all samples (FreeBayes mutation caller)
- .vcf file with variants from all samples (MiModD mutation caller)
- .gff file with deletions from all samples (MiModD deletion calling tool)
Perform Quality filtering and appropriate set subtractions with MutantSets or alternatively the MiModD VCF Filter or SnpSift Filter tools to identify candidate variants.
(Optionally) use MiModD NacreousMap to visualise mutation locations and MiModD Report Variants to produce an HTML mutation list.

NB: sample files are expected to be named in the form ‘A123_0001_S1_R1_L001.fq.gz’. Sample identifiers are extracted from the name with a regular expression: \w+_(\d+)_S\d+_L\d+.*. This would yield the sample identifier 0001. If your files do not conform to this pattern you may need to update this regex by editing the rules in the ‘apply rule to collection’ step of the workflow.
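
To check whether your names fit the pattern before running the workflow, the regular expression can be tried outside Galaxy. The following is a minimal Python sketch, not part of the workflow; the helper name `sample_id` is hypothetical. Note that the pattern as written expects the `_L…` field to follow the `_S…` field directly. It therefore matches an identifier without the `_R1`/`_R2` token between them (assumed here to be the form of the identifiers in the paired list) and finds no match in the raw file name quoted above.

```python
import re

# Pattern quoted in the note above; group 1 captures the sample identifier.
SAMPLE_PATTERN = re.compile(r"\w+_(\d+)_S\d+_L\d+.*")

def sample_id(name):
    """Return the captured sample identifier, or None if the name does not fit."""
    match = SAMPLE_PATTERN.fullmatch(name)
    return match.group(1) if match else None

# Assumed pair identifier: the read token (_R1/_R2) is no longer in the name.
print(sample_id("A123_0001_S1_L001.fq.gz"))     # 0001
# Raw file name with _R1 between the S and L fields: the pattern does not match.
print(sample_id("A123_0001_S1_R1_L001.fq.gz"))  # None
```

If the second case describes your identifiers, the regex in the ‘apply rule to collection’ step is the place to adjust.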

Background

Doitsidou et al. reviewed Sequencing-Based Approaches for Mutation Mapping and Identification in C. elegans (Doitsidou, Jarriault, and Poole 2016). They describe three main approaches to mapping by sequencing:

- Hawaiian variant mapping
- EMS-density mapping
- Variant discovery mapping

This pipeline is currently compatible with only two of them: EMS-density mapping and Variant discovery mapping (VDM).

Research Need

The Schumacher lab identified a need for an analysis pipeline to map and identify mutations in ethyl methanesulfonate (EMS) mutagenesis forward genetic screens.

Previously a tool called CloudMap (Minevich et al. 2012) had been used for this purpose on a Galaxy server. CloudMap is no longer under active development; it has been deprecated from Galaxy Europe and replaced by MiModD (see the MiModD Docs).

Choice of Tools

In a comparison of C. elegans mutation calling pipelines, Smith and Yun (2017) reported good results with FreeBayes (Garrison and Marth 2012), so I have initially included this tool here in addition to the MiModD mutation caller to evaluate their relative performance. They also found that the BBMap aligner yielded better results; however, this is not available in Galaxy, so I have opted for Bowtie2 for expediency.

Pipeline Summary

Pipeline File (Local)

https://usegalaxy.eu/u/richardjacton/w/c-elegans-ems-mutagenesis-mutation-caller

Adapter and Quality Trimming with fastp (Chen et al. 2018)
Alignment with bowtie2 --sensitive-local (Langmead and Salzberg 2012)
samtools view requiring that reads are mapped in a proper pair (Li et al. 2009)
Removal of PCR duplicates with Picard MarkDuplicates (“Picard Toolkit” 2019)
Left alignment of indels in the BAM files using FreeBayes (Garrison and Marth 2012)
MultiQC aggregating quality metrics from trimming, deduplication and alignment (Ewels et al. 2016)
Variant calling with FreeBayes (Garrison and Marth 2012), MiModD (“MiModD” 2013) variant caller and deletion caller
SNP effect annotation with SnpEff eff (Cingolani et al. 2012)
SNP type annotation with SnpSift Variant Type (Cingolani et al. 2012)
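
As an illustration of the proper-pair requirement in the `samtools view` step listed above: the SAM format stores a bitwise FLAG for each read, in which bit 0x2 marks reads mapped in a proper pair. The following is a minimal Python sketch of that test using made-up flag values rather than real alignments. The function name is hypothetical, and this is not how the workflow itself is implemented.

```python
PROPER_PAIR = 0x2  # SAM FLAG bit: read mapped in a proper pair

def is_proper_pair(flag):
    """True if the proper-pair bit is set in a SAM FLAG value."""
    return bool(flag & PROPER_PAIR)

# Made-up example flags: 99 and 147 are a typical properly paired mate pair,
# 65 is paired but lacks the proper-pair bit, 4 is an unmapped read.
flags = [99, 147, 65, 4]
kept = [f for f in flags if is_proper_pair(f)]
print(kept)  # [99, 147]
```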

Instructions (Step-by-Step)

1. Upload Data to Galaxy

2. Select all fastq files and create a paired list

3. Pair the fastq files

4. Import the workflow

5. Run the workflow

Select the paired list object and a genome sequence file as inputs

6. Check Quality Control Information

Inspect the MultiQC output for signs of technical problems with your data. Consult with your friendly local bioinformatician if there are QC issues you can’t diagnose.

7. Preliminary quality filtering with SnpSift Filter

Locate the SnpSift Filter tool in the Galaxy tools panel and apply some initial quality filters; a simple threshold such as ( QUAL > 15 ), or 20, is probably sufficient. It is advisable to start with a low-stringency filter and apply more stringent criteria later, when inspecting your candidate mutations, to avoid throwing out possible mutations. Some initial filtering is still worthwhile, as the full-sized VCF files may be too large to be easily read by the candidate mutant inspection tool in the next steps. You can check how many lines are in your VCF files by selecting them in the Galaxy history.
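
To make the effect of a QUAL threshold concrete, here is a minimal Python sketch that applies the same kind of cut-off to a few made-up VCF records and counts what remains. It is only an illustration of the filter's logic, not a substitute for SnpSift; the function name `filter_by_qual` and the example records are assumptions.

```python
def filter_by_qual(vcf_lines, min_qual=15.0):
    """Keep header lines and records whose QUAL (6th column) exceeds min_qual."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # header lines pass through unchanged
            continue
        qual = line.split("\t")[5]
        # A missing QUAL (".") cannot pass a numeric threshold.
        if qual != "." and float(qual) > min_qual:
            kept.append(line)
    return kept

# Made-up records for illustration only.
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chrI\t1000\t.\tG\tA\t48.2\t.\t.",
    "chrI\t2000\t.\tC\tT\t3.1\t.\t.",
    "chrII\t500\t.\tG\tA\t15\t.\t.",
]
filtered = filter_by_qual(vcf)
records = [line for line in filtered if not line.startswith("#")]
print(len(records))  # 1
```

The record with QUAL exactly 15 is dropped because the comparison is strict, as in the ( QUAL > 15 ) expression.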

8. Download Data

Download the main FreeBayes VCF file and the MiModD deletion calls (.gff file).

9. Load the results in the MutantSets Shiny App to identify candidate mutations

If running the App locally, install the R package from: https://github.com/RichardJActon/MutantSets

R package installation and running the app locally:

# install.packages("remotes") # If you don't already have remotes/devtools
# remotes::install_github("knausb/vcfR") # If vcfR fails to install from CRAN
remotes::install_github("RichardJActon/MutantSets")
MutantSets::launchApp() # opens the app in a web browser

Load the VCF and (optionally) the gff deletion mutant files into MutantSets
(Optionally) Name your samples something easier to understand
Use the genotype filters to subtract the appropriate sets
Tweak quality and allele frequency thresholds to get a small set of high quality candidates
Assess the candidate mutations by clicking on them and looking at their predicted effects and genomic locations
Download your top results as a .tsv file (openable in Excel)
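
The set subtraction behind the genotype filters can be pictured with plain sets. The sketch below is a hypothetical Python illustration, not the MutantSets implementation. The sample names, the variants, and the choice of which samples to intersect and subtract are all assumptions, for a simple design in which a candidate must be present in every mutant sample and absent from the parental strain.

```python
# Each variant is keyed by (chromosome, position, ref, alt). All values are made up.
parent = {("chrI", 1000, "G", "A"), ("chrII", 500, "C", "T")}
mutant_a = {("chrI", 1000, "G", "A"), ("chrIII", 750, "G", "A"), ("chrX", 42, "C", "T")}
mutant_b = {("chrII", 500, "C", "T"), ("chrIII", 750, "G", "A")}

# Variants shared by all mutant samples ...
shared = mutant_a & mutant_b
# ... minus anything already present in the parental background.
candidates = shared - parent
print(sorted(candidates))  # [('chrIII', 750, 'G', 'A')]
```

Which sets to intersect or subtract depends on your crossing scheme, so treat the combination above as one example rather than a recommendation.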

You should now have some candidate mutants to screen - Good Luck!

Feedback

Please direct bug reports, feature requests, and questions to the maintainer of the MutantSets package via [GitHub issues](https://github.com/RichardJActon/MutantSets/issues).

References

Afgan, Enis, Dannon Baker, Marius van den Beek, Daniel Blankenberg, Dave Bouvier, Martin Čech, John Chilton, et al. 2016. “The Galaxy Platform for Accessible, Reproducible and Collaborative Biomedical Analyses: 2016 Update.” Nucleic Acids Research 44 (W1): W3–10. https://doi.org/10.1093/nar/gkw343.

Chen, Shifu, Yanqing Zhou, Yaru Chen, and Jia Gu. 2018. “Fastp: An Ultra-Fast All-in-One FASTQ Preprocessor.” Bioinformatics 34 (17): i884–90. https://doi.org/10.1093/bioinformatics/bty560.

Cingolani, Pablo, Adrian Platts, Le Lily Wang, Melissa Coon, Tung Nguyen, Luan Wang, Susan J. Land, Xiangyi Lu, and Douglas M. Ruden. 2012. “A Program for Annotating and Predicting the Effects of Single Nucleotide Polymorphisms, SnpEff.” Fly 6 (2): 80–92. https://doi.org/10.4161/fly.19695.

Doitsidou, Maria, Sophie Jarriault, and Richard J. Poole. 2016. “Next-Generation Sequencing-Based Approaches for Mutation Mapping and Identification in Caenorhabditis elegans.” Genetics 204 (2): 451–74. https://doi.org/10.1534/genetics.115.186197.

Ewels, Philip, Måns Magnusson, Sverker Lundin, and Max Käller. 2016. “MultiQC: Summarize Analysis Results for Multiple Tools and Samples in a Single Report.” Bioinformatics (Oxford, England) 32 (19): 3047–48. https://doi.org/10.1093/bioinformatics/btw354.

Garrison, Erik, and Gabor Marth. 2012. “Haplotype-Based Variant Detection from Short-Read Sequencing,” July, 1–9. http://arxiv.org/abs/1207.3907.

Jalili, Vahid, Enis Afgan, Qiang Gu, Dave Clements, Daniel Blankenberg, Jeremy Goecks, James Taylor, and Anton Nekrutenko. 2020. “The Galaxy Platform for Accessible, Reproducible and Collaborative Biomedical Analyses: 2020 Update.” Nucleic Acids Research 48 (W1): W395–402. https://doi.org/10.1093/nar/gkaa434.

Langmead, Ben, and Steven L Salzberg. 2012. “Fast Gapped-Read Alignment with Bowtie 2.” Nature Methods 9 (4): 357–59. https://doi.org/10.1038/nmeth.1923.

Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, and R. Durbin. 2009. “The Sequence Alignment/Map Format and SAMtools.” Bioinformatics 25 (16): 2078–79. https://doi.org/10.1093/bioinformatics/btp352.

“MiModD.” 2013. Baumeister Lab. https://sourceforge.net/projects/mimodd/.

Minevich, Gregory, Danny S. Park, Daniel Blankenberg, Richard J. Poole, and Oliver Hobert. 2012. “CloudMap: A Cloud-Based Pipeline for Analysis of Mutant Genome Sequences.” Genetics 192 (4): 1249–69. https://doi.org/10.1534/genetics.112.144204.

“Picard Toolkit.” 2019. Broad Institute. http://broadinstitute.github.io/picard/.

Smith, Harold E., and Sijung Yun. 2017. “Evaluating Alignment and Variant-Calling Software for Mutation Identification in C. elegans by Whole-Genome Sequencing.” Edited by David J. Reiner. PLOS ONE 12 (3): e0174446. https://doi.org/10.1371/journal.pone.0174446.

Richard J. Acton

2025-02-25