Abstract
Somatic variant detection is an integral part of cancer genomics analysis. While most methods have focused on short-read sequencing, long-read technologies offer potential advantages in repeat mapping and variant phasing. We present DeepSomatic, a deep-learning method for detecting somatic small nucleotide variations and insertions and deletions from both short-read and long-read data. The method has modes for whole-genome and whole-exome sequencing and can run on tumor–normal, tumor-only and formalin-fixed paraffin-embedded samples. To train DeepSomatic and help address the dearth of publicly available training and benchmarking data for somatic variant detection, we generated and make openly available the Cancer Standards Long-read Evaluation (CASTLE) dataset of six matched tumor–normal cell line pairs whole-genome sequenced with Illumina, PacBio HiFi and Oxford Nanopore Technologies, along with benchmark variant sets. Across samples, both cell line and patient-derived, and across short-read and long-read sequencing technologies, DeepSomatic consistently outperforms existing callers.
Data availability
The CASTLE cell line sequencing data produced in this study are openly available at NCBI SRA BioProject PRJNA1086849 (ref. 53) and on GitHub at https://github.com/CASTLE-Panel/castle (ref. 25). PacBio HiFi sequencing data are available at https://www.pacb.com/connect/datasets. Accession codes for the samples are organized in Supplementary Table 1. Sequencing data for the clinical samples are under controlled access and are available through dbGaP studies phs002529 (ref. 54) and phs004188 (ref. 55). CASTLE benchmarking sets derived from DeepSomatic and aligned reads used for training and evaluation are available in the Google Cloud at https://console.cloud.google.com/storage/browser/brain-genomics-public/publications/park2024_deepsomatic. All benchmarking sets and variant-calling outputs are available via Zenodo at https://doi.org/10.5281/zenodo.16595168 (ref. 56).
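The Google Cloud location above is given as a console (web browser) URL, whereas command-line and programmatic clients generally expect a gs:// URI. The following is a minimal Python sketch of that conversion, assuming the usual console layout in which everything after /storage/browser/ is the bucket name followed by the object prefix. The function name console_url_to_gs_uri and the constant names are illustrative and are not part of DeepSomatic or of any Google library.

```python
from urllib.parse import urlparse

# Path prefix used by the Google Cloud console's storage browser pages
# (assumed layout: /storage/browser/<bucket>/<object prefix>).
CONSOLE_PREFIX = "/storage/browser/"


def console_url_to_gs_uri(console_url: str) -> str:
    """Convert a Cloud console storage-browser URL to a gs:// URI.

    Hypothetical helper written for illustration only.
    """
    parsed = urlparse(console_url)
    if (parsed.netloc != "console.cloud.google.com"
            or not parsed.path.startswith(CONSOLE_PREFIX)):
        raise ValueError(f"not a storage browser URL: {console_url}")
    # Everything after the prefix is "<bucket>/<object prefix>".
    return "gs://" + parsed.path[len(CONSOLE_PREFIX):].strip("/")


# URL quoted in the Data availability statement.
DEEPSOMATIC_URL = (
    "https://console.cloud.google.com/storage/browser/"
    "brain-genomics-public/publications/park2024_deepsomatic"
)

print(console_url_to_gs_uri(DEEPSOMATIC_URL))
```

The resulting URI can then be passed to a Cloud Storage client of choice (for example, as the argument to gsutil ls) to list the benchmarking sets and aligned reads.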
Code availability
DeepSomatic and DeepVariant code is publicly available on GitHub at https://github.com/google/deepsomatic (ref. 57) and https://github.com/google/deepvariant (ref. 48). pbmm2, used for aligning PacBio HiFi reads, is available on GitHub at https://github.com/PacificBiosciences/pbmm2. OakVar, used for filtering COSMIC coding variants, is available on GitHub at https://github.com/rkimoakbioinformatics/oakvar (ref. 32). Our training loop is available on GitHub at https://github.com/google/deepvariant/blob/r1.8/deepvariant/train.py (ref. 48). Somatic variant caller evaluation was performed using files from SEQC2 at https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/seqc/Somatic_Mutation_WG/data/WGS/. Variant calls for MuTect2 and SomaticSniper are from SEQC2 at https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/seqc/Somatic_Mutation_WG/analysis/SNVs/vcfs/WGS/. Scripts for generating training and benchmarking sets are available on GitHub at https://github.com/jimin001/DeepSomatic_manuscript.
References
Stratton, M. R., Campbell, P. J. & Futreal, P. A. The cancer genome. Nature 458, 719–724 (2009).
Alexandrov, L. B. et al. The repertoire of mutational signatures in human cancer. Nature 578, 94–101 (2020).
Alexandrov, L. B. & Stratton, M. R. Mutational signatures: the patterns of somatic mutations hidden in cancer genomes. Curr. Opin. Genet. Dev. 24, 52–60 (2014).
Perera-Bel, J. et al. From somatic variants towards precision oncology: evidence-driven reporting of treatment options in molecular tumor boards. Genome Med. 10, 18 (2018).
Garcia-Prieto, C. A., Martínez-Jiménez, F., Valencia, A. & Porta-Pardo, E. Detection of oncogenic and clinically actionable mutations in cancer genomes critically depends on variant calling tools. Bioinformatics 38, 3181–3191 (2022).
Farswan, A. et al. Branching clonal evolution patterns predominate mutational landscape in multiple myeloma. Am. J. Cancer Res. 11, 5659–5679 (2021).
Li, W. & Freudenberg, J. Mappability and read length. Front. Genet. 5, 381 (2014).
Larson, D. E. et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 28, 311–317 (2012).
Koboldt, D. C. et al. VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 22, 568–576 (2012).
Wilm, A. et al. LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 40, 11189–11201 (2012).
Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013).
Kim, S. et al. Strelka2: fast and accurate calling of germline and somatic variants. Nat. Methods 15, 591–594 (2018).
Sahraeian, S. M. E. et al. Deep convolutional neural networks for accurate somatic mutation detection. Nat. Commun. 10, 1041 (2019).
Krishnamachari, K. et al. Accurate somatic variant detection using weakly supervised deep learning. Nat. Commun. 13, 4248 (2022).
Musunuri, R. L. et al. Lancet2: improved and accelerated somatic variant calling with joint multi-sample local assembly graphs. Preprint at bioRxiv https://doi.org/10.1101/2025.02.18.638852 (2025).
Fang, L. T. et al. Establishing community reference samples, data and call sets for benchmarking cancer mutation detection using whole-genome sequencing. Nat. Biotechnol. 39, 1151–1160 (2021).
Logsdon, G. A., Vollger, M. R. & Eichler, E. E. Long-read human genome sequencing and its applications. Nat. Rev. Genet. 21, 597–614 (2020).
Damaraju, N., Miller, A. L. & Miller, D. E. Long-read DNA and RNA sequencing to streamline clinical genetic testing and reduce barriers to comprehensive genetic testing. J. Appl. Lab. Med. 9, 138–150 (2024).
Kolesnikov, A. et al. Local read haplotagging enables accurate long-read small variant calling. Nat. Commun. 15, 5907 (2024).
Zheng, Z. et al. Symphonizing pileup and full-alignment for deep learning-based long-read variant calling. Nat. Comput. Sci. 2, 797–803 (2022).
Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
Shafin, K. et al. Haplotype-aware variant calling with PEPPER-Margin-DeepVariant enables high accuracy in nanopore long-reads. Nat. Methods 18, 1322–1332 (2021).
Kolmogorov, M. et al. Scalable Nanopore sequencing of human genomes provides a comprehensive view of haplotype-resolved variation and methylation. Nat. Methods 20, 1483–1492 (2023).
Zheng, Z. et al. ClairS: a deep-learning method for long-read somatic small variant calling. Preprint at bioRxiv https://doi.org/10.1101/2023.08.17.553778 (2023).
Kolmogorov, M. & Gokce, A. CASTLE-Panel/castle. Datasets. GitHub https://github.com/CASTLE-Panel/castle (2025).
Keskus, A. G. et al. Severus detects somatic structural variation and complex rearrangements in cancer genomes using long-read sequencing. Nat. Biotechnol. https://doi.org/10.1038/s41587-025-02618-8 (2025).
Díaz-Gay, M. et al. Assigning mutational signatures to individual samples and individual somatic mutations with SigProfilerAssignment. Bioinformatics 39, btad756 (2023).
Vasimuddin, M., Misra, S., Li, H. & Aluru, S. Efficient architecture-aware acceleration of BWA-MEM for multicore systems. In Proc. 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 314–324 (IEEE, 2019).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Bergstrom, E. N. et al. SigProfilerMatrixGenerator: a tool for visualizing and exploring patterns of small mutational events. BMC Genomics 20, 685 (2019).
Lansdon, L. A. et al. Successful classification of clinical pediatric leukemia genetic subtypes via structural variant detection using HiFi long-read sequencing. Preprint at medRxiv https://doi.org/10.1101/2024.11.05.24316078 (2024).
Kim, R. rkimoakbioinformatics/oakvar. Source code. GitHub https://github.com/rkimoakbioinformatics/oakvar/ (2025).
Steiert, T. A. et al. A critical spotlight on the paradigms of FFPE-DNA sequencing. Nucleic Acids Res. 51, 7143–7162 (2023).
Xiao, W. et al. Toward best practice in cancer mutation detection with whole-genome and whole-exome sequencing. Nat. Biotechnol. 39, 1141–1150 (2021).
Koboldt, D. C. Best practices for variant calling in clinical sequencing. Genome Med. 12, 91 (2020).
Keskus, A. G. et al. Severus detects somatic structural variation and complex rearrangements in cancer genomes using long-read sequencing. Nat. Biotechnol. https://doi.org/10.1038/s41587-025-02618-8 (2025).
Cohen, A. S. A. et al. Genomic answers for children: Dynamic analyses of >1000 pediatric rare disease genomes. Genet. Med. 24, 1336–1348 (2022).
Monlong, J., Lorig-Roach, R., Meredith, M. & Negi, S. nanoporegenomics/wambam. Source code. GitHub https://github.com/nanoporegenomics/wambam (2025).
Bushnell, B. BioInfoTools/BBMap. Source code. GitHub https://github.com/BioInfoTools/BBMap/blob/master/sh/reformat.sh (2025).
Baid, G. et al. An extensive sequence dataset of gold-standard samples for benchmarking and development. Preprint at bioRxiv https://doi.org/10.1101/2020.12.11.422022 (2020).
1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).
Lake, J. A. & Consortium of Long Read Sequencing (CoLoRS). Consortium of Long Read Sequencing Database (CoLoRSdb). Zenodo https://doi.org/10.5281/zenodo.11511513 (2024).
Chen, N.-C. et al. Improving variant calling using population data and deep learning. BMC Bioinf. 24, 197 (2023).
Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Szegedy, C. et al. Rethinking the inception architecture for computer vision. Proc. IEEE Conference on Computer Vision and Pattern Recognition 2818–2826 (2016); https://doi.org/10.1109/CVPR.2016.308
Poplin, R. et al. google/deepvariant. Source code. GitHub https://github.com/google/deepvariant (2025).
Kingma, D. P. & Ba, J. ADAM: a method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2017).
Ahmad, T. KolmogorovLab/Wakhan. Source code. GitHub https://github.com/KolmogorovLab/Wakhan (2025).
Bergstrom, E. N. et al. AlexandrovLab/SigProfilerAssignment. Source code. GitHub https://github.com/AlexandrovLab/SigProfilerAssignment (2025).
Díaz-Gay, M. et al. AlexandrovLab/SigProfilerMatrixGenerator. Source code. GitHub https://github.com/AlexandrovLab/SigProfilerMatrixGenerator (2025).
CASTLE panel: Cancer Standards Long-read Evaluation. Datasets. Sequence Read Archive https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA1086849 (2025).
Childhood Cancer Data Initiative (CCDI): Comprehensive Genomic Sequencing of Pediatric Cancer Cases (CMRI/KUCC). Datasets. dbGaP https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs002529.v2.p1 (2025).
DeepSomatic: Accurate Somatic Small Variant Discovery for Multiple Sequencing Technologies. Datasets. dbGaP https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs004188.v1.p1 (2025).
Park, J. Supporting data for: Accurate somatic small variant discovery for multiple sequencing technologies with DeepSomatic. Zenodo https://doi.org/10.5281/zenodo.16595168 (2025).
Park, J. et al. google/deepsomatic. Source code. GitHub https://github.com/google/deepsomatic (2025).
Acknowledgements
HCC1395-HCC1395BL ONT sequencing was supported by the National Cancer Institute of the NIH under grant award number U01CA253405. M.S.F. reports grants from Braden’s Hope for Childhood Cancer, Big Slick, Black & Veatch Foundation, Masonic Cancer Alliance, Noah’s Bandage Project, Elizabeth and Monte McDowell, Cancer Center Auxiliary, and the Department of Defense (grant no. W81XWH-20-1-0358). B.P. and J.P. were supported by NIH grant nos. R01HG010485, U41HG010972, U24HG011853 and OT2OD033761. M.K., A.G.K., A.B. and T.A. were supported by the Intramural Research Program of the NIH. The contributions of the NIH authors were made as part of their official duties as NIH federal employees, are in compliance with agency policy requirements and are considered Works of the United States Government. However, the findings and conclusions presented in this paper are those of the authors and do not necessarily reflect the views of the NIH or the US Department of Health and Human Services. This work was supported by grant award number HT9425-23-1-0844 from the Congressionally Directed Medical Research Programs (CDMRP). This research includes work performed in TGen’s Collaborative Sequencing Center, a City of Hope Comprehensive Cancer Center supported shared resource (grant no. NCI-P30CA033572).
Author information
Contributions
B.P., K.S. and M.K. helped conceive and direct the study. J.P. performed data analysis. D.E.C., P.-C.C., A. Kolesnikov, L.B., J.C.M., A.C. and K.S. contributed to DeepSomatic development. J.P., A.B. and A. Keskus performed cell line data processing. J.G., B.M. and K.H.M. contributed to ONT data sequencing. S.S. performed Hi-C data sequencing. M.K., A. Keskus, A.B. and T.A. contributed to Severus development. J.S., Y.Z. and B.T. performed Illumina data sequencing. G.N., A.H. and N.R. performed ONT sequencing of the HCC1395 cell line. B.Y., I.P., L.A.L., C. Bi, A.W., M.G., T.P. and M.S.F. performed PacBio data sequencing. F.P.B., R.R., S.M., T.R.R.-B. and C. Brown performed glioblastoma sample coordination and sequencing. J.P., B.P., D.E.C., K.S. and M.K. drafted the paper.
Ethics declarations
Competing interests
K.S., D.E.C., P.-C.C., A. Kolesnikov, L.B., J.C.M. and A.C. are employees of Google LLC and own Alphabet stock as part of the standard compensation package. M.S.F. is a part of the speakers bureau for Bayer and PacBio. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Biotechnology thanks Qian Liu, Kai Ye and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–22, Commands: examples of how command line utilities were run.
Supplementary Table 1
Twenty-one Supplementary Tables, consisting of DeepSomatic analysis results. Includes sequencing data details, benchmarking results and benchmark dataset details.
Supplementary Table 2
Six Supplementary Tables, consisting of analysis results for clinical pediatric blood cancer samples using DeepSomatic and ClairS.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Park, J., Cook, D.E., Chang, PC. et al. Accurate somatic small variant discovery for multiple sequencing technologies with DeepSomatic. Nat Biotechnol (2025). https://doi.org/10.1038/s41587-025-02839-x
DOI: https://doi.org/10.1038/s41587-025-02839-x