Carl Munoz presents his work at GLBIO 2024

May 14, 2024 3 min read

Carl Munoz gave a presentation on using deep learning diffusion models for denoising of low-depth RNA-seq data at this year's Great Lakes Bioinformatics Conference (GLBIO). GLBIO is organized by the Great Lakes Bioinformatics Consortium to provide an interdisciplinary forum for the discussion of research findings and methods.

How low can you go? Using deep learning diffusion models for denoising of low-depth RNA-seq data

Bulk RNA-seq is typically performed at a depth of around 20-50 million reads per sample. At this level of coverage, however, RNA-seq becomes prohibitively expensive for larger-scale projects. One way to increase the number of samples without increasing the budget is to reduce the sequencing depth per sample, but this lowers the quality of the data. There is therefore a trade-off between the number of samples and the quality of each transcriptomic profile. Imputation tools developed for single-cell RNA-seq and spatial transcriptomics data circumvent this dilemma by denoising low-depth data. However, none of them use the number of reads per sample as information for imputation, which could be crucial in determining exactly how much denoising is needed. Additionally, these methods were not designed for general use outside their respective sequencing technologies. One method with the potential to address both issues, though not yet used in this context, is the denoising diffusion probabilistic model (DDPM) (Ho et al., 2020). DDPMs were originally designed to generate random images by starting from pure noise and removing it step by step; we intend to apply the same idea to denoising RNA-seq data. Specifically, the Poisson-JUMP variant (Chen & Zhou, 2023) instead starts from all-zero values and adds counts at each step, so its intermediate timesteps look strikingly similar to low-coverage RNA-seq. This suggests that a low-coverage sample could be used as the input at an intermediate timestep to generate a denoised sample. We are therefore developing a DDPM capable of denoising RNA-seq data while conserving important biological information, such as sample type and differentially expressed genes.
Using The Cancer Genome Atlas (TCGA) (Weinstein et al., 2014), we have established benchmarks on prediction accuracy and differentially expressed genes, which suggest that sequencing depth can be reduced substantially before performance is significantly affected. We are currently developing the model, which we expect will allow us to denoise RNA-seq data from nearly any coverage level. Once completed, this model could reduce the cost of RNA-seq in both medical and experimental contexts.
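
To make the link between sequencing depth and count data concrete, here is a minimal Python sketch of how a low-depth profile might be simulated from a deeper one. It assumes that reduced depth can be approximated by binomial thinning, where each read is kept independently with a fixed probability. This is a common simplification and not necessarily the procedure used in the work described above. The function name `thin_counts`, the parameter `keep_fraction`, and the toy count vector are all hypothetical and chosen only for illustration.

```python
import numpy as np


def thin_counts(counts, keep_fraction, rng):
    """Simulate shallower sequencing by binomial thinning.

    Each read in `counts` is kept independently with probability
    `keep_fraction`, so no gene can end up with more reads than it had.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    counts = np.asarray(counts, dtype=np.int64)
    return rng.binomial(counts, keep_fraction)


rng = np.random.default_rng(0)

# Hypothetical full-depth profile for 2,000 genes (toy data, not TCGA).
full_depth = rng.negative_binomial(n=2, p=0.001, size=2000)

# Compare the total depth and the number of dropped-out genes
# at progressively lower fractions of the original depth.
for keep_fraction in (1.0, 0.1, 0.01):
    thinned = thin_counts(full_depth, keep_fraction, rng)
    total_reads = int(thinned.sum())
    zero_genes = int((thinned == 0).sum())
    print(f"keep={keep_fraction:<5} total reads={total_reads:>9,} "
          f"genes with zero reads={zero_genes}")
```

Under this assumption, lower values of `keep_fraction` yield sparser profiles with more genes at zero, which is the kind of input a depth-aware denoiser would need to handle.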