Carl Munoz presents his research at GLBIO 2024

Carl Munoz gave a presentation on using deep learning diffusion models to denoise low-depth RNA sequencing data at the Great Lakes Bioinformatics Conference (GLBIO). GLBIO is organized by the Great Lakes Bioinformatics Consortium to provide an interdisciplinary forum for discussing research results and methods.

How low can you go? : Using deep learning diffusion models for denoising of low-depth RNA-seq data

Original submitted abstract

Bulk RNA-seq is often done at a depth of around 20-50 million reads per sample. However, RNA-seq becomes prohibitively expensive for larger-scale projects at this level of coverage. One solution to increase the number of samples without increasing the budget is to reduce the sequencing depth per sample. However, this comes at the cost of reducing the quality of the data. As such, there is a trade-off that must be made between the number of samples and the quality of each transcriptomic profile. There exist imputation tools developed for single-cell RNA-seq and spatial transcriptomics data that circumvent this dilemma by denoising the low-depth RNA-seq data. However, none use the number of reads per sample as information for imputation, which could be crucial in determining exactly the amount of denoising that must be done. Additionally, these methods have not been designed for general usage outside of their respective sequencing technologies. One method that has the potential to address both of these issues, though not yet used in this context, is the denoising diffusion probabilistic model (DDPM) (Ho et al., 2020). DDPMs were originally designed to randomly generate images, and we intend to use the idea behind them to denoise RNA-seq data. Specifically, in the Poisson-JUMP version (Chen & Zhou, 2023), which instead starts with all zero values and adds counts at each step, the intermediate timesteps have a strikingly similar appearance to low-coverage RNA-seq. This suggests a level of compatibility between this model and RNA-seq that would allow us to use a low-coverage sample as an input at an intermediate timestep to generate a denoised sample. As such, we are currently developing a DDPM model capable of denoising RNA-seq data in a manner such that important biological information is conserved (such as sample type and differentially expressed genes).
Using The Cancer Genome Atlas (TCGA) (Weinstein et al., 2014), we have established benchmarks on prediction accuracy and differentially expressed genes, which suggest that sequencing depth can be significantly reduced before seeing a significant impact on performance. We are currently in the process of developing the model, which we expect will allow us to denoise RNA-seq from nearly any coverage level. Once completed, this model would allow for cost reductions of RNA-seq in both medical and experimental contexts.
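
As an illustration of what reduced sequencing depth means for a count matrix, here is a minimal sketch that simulates a low-depth sample by keeping each read independently with a fixed probability (binomial thinning). This is a common way to downsample counts and is an assumption made here for illustration; the abstract does not say how the benchmark data were downsampled, and this is not the denoising model itself. The function name `thin_counts`, the parameter `keep_fraction`, and the toy counts are all hypothetical.

```python
import random


def thin_counts(counts, keep_fraction, seed=0):
    """Simulate lower sequencing depth by keeping each read with
    probability keep_fraction (binomial thinning).

    counts: list of samples, each a list of non-negative integer gene counts.
    Hypothetical helper for illustration only, not the authors' method.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")
    rng = random.Random(seed)
    thinned = []
    for sample in counts:
        # For a gene with n reads, draw n independent keep/drop decisions.
        thinned.append([
            sum(1 for _ in range(n) if rng.random() < keep_fraction)
            for n in sample
        ])
    return thinned


# Toy "full-depth" counts: 2 samples x 4 genes (made-up numbers).
full_depth = [[120, 0, 35, 980], [87, 3, 41, 1102]]

# Keep roughly 10% of the reads in each sample.
low_depth = thin_counts(full_depth, keep_fraction=0.1)

for full, low in zip(full_depth, low_depth):
    print("depth", sum(full), "->", sum(low))
```

Lowly expressed genes tend to drop to zero first under this kind of thinning, which is the sort of information loss a denoising model would need to recover.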