  • Article

Accurate somatic small variant discovery for multiple sequencing technologies with DeepSomatic

Abstract

Somatic variant detection is an integral part of cancer genomics analysis. While most methods have focused on short-read sequencing, long-read technologies offer potential advantages in repeat mapping and variant phasing. We present DeepSomatic, a deep-learning method for detecting somatic small nucleotide variations and insertions and deletions from both short-read and long-read data. The method has modes for whole-genome and whole-exome sequencing and can run on tumor–normal, tumor-only and formalin-fixed paraffin-embedded samples. To train DeepSomatic and help address the dearth of publicly available training and benchmarking data for somatic variant detection, we generated and make openly available the Cancer Standards Long-read Evaluation (CASTLE) dataset of six matched tumor–normal cell line pairs whole-genome sequenced with Illumina, PacBio HiFi and Oxford Nanopore Technologies, along with benchmark variant sets. Across samples, both cell line and patient-derived, and across short-read and long-read sequencing technologies, DeepSomatic consistently outperforms existing callers.

Fig. 1: DeepSomatic overview and performance on the SEQC2 HCC1395 benchmark.
Fig. 2: Six tumor–normal cell lines sequenced with multiple technologies.
Fig. 3: Somatic variant-calling performance of five tumor–normal cancer cell lines.
Fig. 4: Somatic variant calling on glioblastoma and pediatric blood cancer tumor samples.
Fig. 5: Extending DeepSomatic to tumor-only models and other data types.

Data availability

The CASTLE cell line sequencing data produced in this study are openly available at NCBI SRA BioProject PRJNA1086849 (ref. 53) and on GitHub at https://github.com/CASTLE-Panel/castle (ref. 25). PacBio HiFi sequencing data are available at https://www.pacb.com/connect/datasets. Accession codes for the samples are organized in Supplementary Table 1. Sequencing data for the clinical samples are under controlled access and are available through dbGaP studies phs002529 (ref. 54) and phs004188 (ref. 55). CASTLE benchmarking sets derived from DeepSomatic and aligned reads used for training and evaluation are available on Google Cloud at https://console.cloud.google.com/storage/browser/brain-genomics-public/publications/park2024_deepsomatic. All benchmarking sets and variant-calling outputs are available via Zenodo at https://doi.org/10.5281/zenodo.16595168 (ref. 56).
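The Google Cloud link above is a web console URL, which is awkward to use from scripts. As a minimal sketch (not part of the published pipeline), the Python helpers below convert such a console URL into a `gs://` URI and into a Cloud Storage JSON API listing URL. The function names are hypothetical, and the sketch assumes the bucket permits anonymous listing, which the source does not state.

```python
from urllib.parse import quote, urlencode, urlparse

# Path prefix used by Cloud Console "storage browser" URLs.
CONSOLE_PREFIX = "/storage/browser/"


def console_url_to_gs_uri(console_url: str) -> str:
    """Turn a Cloud Console browser URL into a gs://bucket/prefix URI."""
    path = urlparse(console_url).path
    if not path.startswith(CONSOLE_PREFIX):
        raise ValueError(f"not a Cloud Storage browser URL: {console_url}")
    return "gs://" + path[len(CONSOLE_PREFIX):].strip("/")


def gs_uri_to_listing_url(gs_uri: str, max_results: int = 100) -> str:
    """Build a Cloud Storage JSON API URL listing objects under the prefix."""
    if not gs_uri.startswith("gs://"):
        raise ValueError(f"not a gs:// URI: {gs_uri}")
    bucket, _, prefix = gs_uri[len("gs://"):].partition("/")
    # A trailing slash restricts the listing to objects inside the "folder".
    query = urlencode({
        "prefix": prefix + "/" if prefix else "",
        "maxResults": max_results,
    })
    return f"https://storage.googleapis.com/storage/v1/b/{quote(bucket)}/o?{query}"


CONSOLE_URL = (
    "https://console.cloud.google.com/storage/browser/"
    "brain-genomics-public/publications/park2024_deepsomatic"
)

if __name__ == "__main__":
    uri = console_url_to_gs_uri(CONSOLE_URL)
    print(uri)
    print(gs_uri_to_listing_url(uri))
```

If `gsutil` is installed, the resulting `gs://` URI can be passed to `gsutil ls` to browse the same location; the listing URL can be fetched with any HTTP client, again assuming anonymous access is enabled.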

Code availability

DeepSomatic and DeepVariant code is publicly available on GitHub at https://github.com/google/deepsomatic (ref. 57) and https://github.com/google/deepvariant (ref. 48). pbmm2, used for aligning PacBio HiFi reads, is available on GitHub at https://github.com/PacificBiosciences/pbmm2. OakVar, used for filtering COSMIC coding variants, is available on GitHub at https://github.com/rkimoakbioinformatics/oakvar (ref. 32). Our training loop is available on GitHub at https://github.com/google/deepvariant/blob/r1.8/deepvariant/train.py (ref. 48). Somatic variant caller evaluation was performed using files from SEQC2 at https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/seqc/Somatic_Mutation_WG/data/WGS/. Variant calls for MuTect2 and SomaticSniper are from SEQC2 at https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/seqc/Somatic_Mutation_WG/analysis/SNVs/vcfs/WGS/. Scripts for generating training and benchmarking sets are available on GitHub at https://github.com/jimin001/DeepSomatic_manuscript.

References

  1. Stratton, M. R., Campbell, P. J. & Futreal, P. A. The cancer genome. Nature 458, 719–724 (2009).

  2. Alexandrov, L. B. et al. The repertoire of mutational signatures in human cancer. Nature 578, 94–101 (2020).

  3. Alexandrov, L. B. & Stratton, M. R. Mutational signatures: the patterns of somatic mutations hidden in cancer genomes. Curr. Opin. Genet. Dev. 24, 52–60 (2014).

  4. Perera-Bel, J. et al. From somatic variants towards precision oncology: evidence-driven reporting of treatment options in molecular tumor boards. Genome Med. 10, 18 (2018).

  5. Garcia-Prieto, C. A., Martínez-Jiménez, F., Valencia, A. & Porta-Pardo, E. Detection of oncogenic and clinically actionable mutations in cancer genomes critically depends on variant calling tools. Bioinformatics 38, 3181–3191 (2022).

  6. Farswan, A. et al. Branching clonal evolution patterns predominate mutational landscape in multiple myeloma. Am. J. Cancer Res. 11, 5659–5679 (2021).

  7. Li, W. & Freudenberg, J. Mappability and read length. Front. Genet. 5, 381 (2014).

  8. Larson, D. E. et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 28, 311–317 (2012).

  9. Koboldt, D. C. et al. VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 22, 568–576 (2012).

  10. Wilm, A. et al. LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 40, 11189–11201 (2012).

  11. Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013).

  12. Kim, S. et al. Strelka2: fast and accurate calling of germline and somatic variants. Nat. Methods 15, 591–594 (2018).

  13. Sahraeian, S. M. E. et al. Deep convolutional neural networks for accurate somatic mutation detection. Nat. Commun. 10, 1041 (2019).

  14. Krishnamachari, K. et al. Accurate somatic variant detection using weakly supervised deep learning. Nat. Commun. 13, 4248 (2022).

  15. Musunuri, R. L. et al. Lancet2: improved and accelerated somatic variant calling with joint multi-sample local assembly graphs. Preprint at bioRxiv https://doi.org/10.1101/2025.02.18.638852 (2025).

  16. Fang, L. T. et al. Establishing community reference samples, data and call sets for benchmarking cancer mutation detection using whole-genome sequencing. Nat. Biotechnol. 39, 1151–1160 (2021).

  17. Logsdon, G. A., Vollger, M. R. & Eichler, E. E. Long-read human genome sequencing and its applications. Nat. Rev. Genet. 21, 597–614 (2020).

  18. Damaraju, N., Miller, A. L. & Miller, D. E. Long-read DNA and RNA sequencing to streamline clinical genetic testing and reduce barriers to comprehensive genetic testing. J. Appl. Lab. Med. 9, 138–150 (2024).

  19. Kolesnikov, A. et al. Local read haplotagging enables accurate long-read small variant calling. Nat. Commun. 15, 5907 (2024).

  20. Zheng, Z. et al. Symphonizing pileup and full-alignment for deep learning-based long-read variant calling. Nat. Comput. Sci. 2, 797–803 (2022).

  21. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).

  22. Shafin, K. et al. Haplotype-aware variant calling with PEPPER-Margin-DeepVariant enables high accuracy in nanopore long-reads. Nat. Methods 18, 1322–1332 (2021).

  23. Kolmogorov, M. et al. Scalable Nanopore sequencing of human genomes provides a comprehensive view of haplotype-resolved variation and methylation. Nat. Methods 20, 1483–1492 (2023).

  24. Zheng, Z. et al. ClairS: a deep-learning method for long-read somatic small variant calling. Preprint at bioRxiv https://doi.org/10.1101/2023.08.17.553778 (2023).

  25. Kolmogorov, M. & Gokce, A. CASTLE-Panel/castle. Datasets. GitHub https://github.com/CASTLE-Panel/castle (2025).

  26. Keskus, A. G. et al. Severus detects somatic structural variation and complex rearrangements in cancer genomes using long-read sequencing. Nat. Biotechnol. https://doi.org/10.1038/s41587-025-02618-8 (2025).

  27. Díaz-Gay, M. et al. Assigning mutational signatures to individual samples and individual somatic mutations with SigProfilerAssignment. Bioinformatics 39, btad756 (2023).

  28. Vasimuddin, M., Misra, S., Li, H. & Aluru, S. Efficient architecture-aware acceleration of BWA-MEM for multicore systems. In Proc. 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 314–324 (IEEE, 2019).

  29. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

  30. Bergstrom, E. N. et al. SigProfilerMatrixGenerator: a tool for visualizing and exploring patterns of small mutational events. BMC Genomics 20, 685 (2019).

  31. Lansdon, L. A. et al. Successful classification of clinical pediatric leukemia genetic subtypes via structural variant detection using HiFi long-read sequencing. Preprint at medRxiv https://doi.org/10.1101/2024.11.05.24316078 (2024).

  32. Kim, R. rkimoakbioinformatics/oakvar. Source code. GitHub https://github.com/rkimoakbioinformatics/oakvar/ (2025).

  33. Steiert, T. A. et al. A critical spotlight on the paradigms of FFPE-DNA sequencing. Nucleic Acids Res. 51, 7143–7162 (2023).

  34. Xiao, W. et al. Toward best practice in cancer mutation detection with whole-genome and whole-exome sequencing. Nat. Biotechnol. 39, 1141–1150 (2021).

  35. Koboldt, D. C. Best practices for variant calling in clinical sequencing. Genome Med. 12, 91 (2020).

  36. Keskus, A. G. et al. Severus detects somatic structural variation and complex rearrangements in cancer genomes using long-read sequencing. Nat. Biotechnol. https://doi.org/10.1038/s41587-025-02618-8 (2025).

  37. Cohen, A. S. A. et al. Genomic answers for children: Dynamic analyses of >1000 pediatric rare disease genomes. Genet. Med. 24, 1336–1348 (2022).

  38. Monlong, J., Lorig-Roach, R., Meredith, M. & Negi, S. nanoporegenomics/wambam. Source code. GitHub https://github.com/nanoporegenomics/wambam (2025).

  39. Bushnell, B. BioInfoTools/BBMap. Source code. GitHub https://github.com/BioInfoTools/BBMap/blob/master/sh/reformat.sh (2025).

  40. Baid, G. et al. An extensive sequence dataset of gold-standard samples for benchmarking and development. Preprint at bioRxiv https://doi.org/10.1101/2020.12.11.422022 (2020).

  41. The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).

  42. Lake, J. A. & Consortium of Long Read Sequencing (CoLoRS). Consortium of Long Read Sequencing Database (CoLoRSdb). Zenodo https://doi.org/10.5281/zenodo.11511513 (2024).

  43. Chen, N.-C. et al. Improving variant calling using population data and deep learning. BMC Bioinf. 24, 197 (2023).

  44. Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).

  45. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).

  46. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).

  47. Szegedy, C. et al. Rethinking the inception architecture for computer vision. Proc. IEEE Conference on Computer Vision and Pattern Recognition 2818–2826 (2016); https://doi.org/10.1109/CVPR.2016.308

  48. Poplin, R. et al. google/deepvariant. Source code. GitHub https://github.com/google/deepvariant (2025).

  49. Kingma, D. P. & Ba, J. ADAM: a method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2017).

  50. Ahmad, T. KolmogorovLab/Wakhan. Source code. GitHub https://github.com/KolmogorovLab/Wakhan (2025).

  51. Bergstrom, E. N. et al. AlexandrovLab/SigProfilerAssignment. Source code. GitHub https://github.com/AlexandrovLab/SigProfilerAssignment (2025).

  52. Díaz-Gay, M. et al. AlexandrovLab/SigProfilerMatrixGenerator. Source code. GitHub https://github.com/AlexandrovLab/SigProfilerMatrixGenerator (2025).

  53. CASTLE panel: Cancer Standards Long-read Evaluation. Datasets. Sequence Read Archive https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA1086849 (2025).

  54. Childhood Cancer Data Initiative (CCDI): Comprehensive Genomic Sequencing of Pediatric Cancer Cases (CMRI/KUCC) Datasets. dbGAP https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs002529.v2.p1 (2025).

  55. DeepSomatic: Accurate Somatic Small Variant Discovery for Multiple Sequencing Technologies. Datasets. dbGAP https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs004188.v1.p1 (2025).

  56. Park, J. Supporting data for: Accurate somatic small variant discovery for multiple sequencing technologies with DeepSomatic. Zenodo https://doi.org/10.5281/zenodo.16595168 (2025).

  57. Park, J. et al. google/deepsomatic. Source code. GitHub https://github.com/google/deepsomatic (2025).

Acknowledgements

HCC1395-HCC1395BL ONT sequencing was supported by the National Cancer Institute of the NIH under grant award number U01CA253405. M.S.F. reports grants from Braden’s Hope for Childhood Cancer, Big Slick, Black & Veatch Foundation, Masonic Cancer Alliance, Noah’s Bandage Project, Elizabeth and Monte McDowell, Cancer Center Auxiliary, and the Department of Defense (grant no. W81XWH-20-1-0358). B.P. and J.P. were supported by NIH grant nos. R01HG010485, U41HG010972, U24HG011853 and OT2OD033761. M.K., A.G.K., A.B. and T.A. were supported by the Intramural Research Program of the NIH. The contributions of the NIH authors were made as part of their official duties as NIH federal employees, are in compliance with agency policy requirements and are considered Works of the United States Government. However, the findings and conclusions presented in this paper are those of the authors and do not necessarily reflect the views of the NIH or the US Department of Health and Human Services. This work was supported by grant award number HT9425-23-1-0844 from the Congressionally Directed Medical Research Programs (CDMRP). This research includes work performed in TGen’s Collaborative Sequencing Center, a City of Hope Comprehensive Cancer Center supported shared resource (grant no. NCI-P30CA033572).

Author information

Contributions

B.P., K.S. and M.K. helped conceive and direct the study. J.P. performed data analysis. D.E.C., P.-C.C., A. Kolesnikov, L.B., J.C.M., A.C. and K.S. contributed to DeepSomatic development. J.P., A.B. and A. Keskus performed cell line data processing. J.G., B.M. and K.H.M. contributed to ONT data sequencing. S.S. performed Hi-C data sequencing. M.K., A. Keskus, A.B. and T.A. contributed to Severus development. J.S., Y.Z. and B.T. performed Illumina data sequencing. G.N., A.H. and N.R. performed ONT sequencing of the HCC1395 cell line. B.Y., I.P., L.A.L., C. Bi, A.W., M.G., T.P. and M.S.F. performed PacBio data sequencing. F.P.B., R.R., S.M., T.R.R.-B. and C. Brown performed glioblastoma sample coordination and sequencing. J.P., B.P., D.E.C., K.S. and M.K. drafted the paper.

Corresponding authors

Correspondence to Andrew Carroll, Mikhail Kolmogorov, Benedict Paten or Kishwar Shafin.

Ethics declarations

Competing interests

K.S., D.E.C., P.-C.C., A. Kolesnikov, L.B., J.C.M. and A.C. are employees of Google LLC and own Alphabet stock as part of the standard compensation package. M.S.F. is a part of the speakers bureau for Bayer and PacBio. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks Qian Liu, Kai Ye and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–22 and Commands (examples of how command-line utilities were run).

Reporting Summary

Supplementary Table 1

Twenty-one Supplementary Tables, consisting of DeepSomatic analysis results. Includes sequencing data details, benchmarking results and benchmark dataset details.

Supplementary Table 2

Six Supplementary Tables, consisting of analysis results for clinical pediatric blood cancer samples using DeepSomatic and ClairS.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Park, J., Cook, D.E., Chang, PC. et al. Accurate somatic small variant discovery for multiple sequencing technologies with DeepSomatic. Nat Biotechnol (2025). https://doi.org/10.1038/s41587-025-02839-x

  • DOI: https://doi.org/10.1038/s41587-025-02839-x
