The Lemieux Lab Will Take Part in the 32nd ISMB Conference
Several students from the Lemieux Lab will be presenting their research at the 32nd Intelligent Systems for Molecular Biology (ISMB) conference. The conference will take place from July 12 to 16 in Montreal: come see our students' posters! Read on to learn more about the work that will be presented there.
K-mer Walking: An Efficient Reference-Free Algorithm for Flanking Sequence Reconstruction
Poster presented by Nicolas Jacquin
With the rapid expansion of transcriptomics as a field, it is easy to forget that the references we use to analyze transcriptomic data are incomplete, and that they introduce bias by filtering out reads that do not align to them. Yet reads that fall outside of what our references consider protein-coding can still be of great biological significance, and are sometimes even found to code for non-canonical proteins, as outlined by Laumont et al. (2016). This latent part of the transcriptome remains underexplored, and more tools to facilitate its analysis could greatly improve our understanding of proteomics in general.
When working with non-canonical protein-coding sequences, the only well-supported way to retrieve flanking sequences is to align those sequences to a reference, keep the hits that fall outside the annotated regions, and use the reference sequences flanking those hits. However, this approach is not only resource-intensive and hard to apply systematically across multiple samples; it also means that if the real flanking sequences differ enough from those in the reference, the results can be completely wrong. In this work, we showcase a new tool we are currently developing that quickly retrieves the flanking sequences of protein-coding RNA directly from the raw data of multiple RNA-seq experiments, without ever aligning those reads to a reference.
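To give a rough sense of the general idea behind k-mer walking (this is an illustrative sketch, not the actual tool under development): starting from a known coding sequence, one can repeatedly look up which k-mers observed in the raw reads overlap its current end and greedily extend it one base at a time. The k value, thresholds, and helper names below are assumptions.

```python
from collections import Counter

def build_kmer_index(reads, k=31):
    """Count every k-mer observed in the raw reads (no reference needed)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def walk_right(seed, kmer_counts, k=31, max_steps=500, min_count=2):
    """Greedily extend `seed` one base at a time, choosing the most frequent
    k-mer that overlaps the current (k-1)-base suffix; stop when no extension
    is supported by enough reads."""
    seq = seed
    for _ in range(max_steps):
        suffix = seq[-(k - 1):]
        count, base = max((kmer_counts[suffix + b], b) for b in "ACGT")
        if count < min_count:
            break
        seq += base
    return seq
```

A symmetric walk to the left would recover the upstream flank; a real implementation also has to handle branching k-mer paths, sequencing errors and multiple samples, which this sketch ignores.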
Use of Gene Expression Profiles to Predict the Biological Activity of Chemical Compounds in Cancer Cells
Poster presented by Léa Kaufmann
Compound analysis is a long and costly step in developing new drug therapies. Chemoinformatics aims to speed it up by making predictions based on a compound's chemical structure, but it does not take biological interactions into account. We hypothesize that the transcriptomic profile of a cell treated with a chemical compound is a better representation of the compound than its structure for predicting its activity. First, we use L1000 quantifications from the LINCS database (Moshkov et al., 2023) to predict the expression of a gene in one cell line from the transcriptomic profile of another cell line treated under the same conditions. To do so, we implement a feed-forward regression neural network with two hidden layers. Our results indicate that a cell's transcriptomic profile serves as a reliable vector representation of a compound's effect and offers the potential to predict treatment outcomes across cell lines.
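As a rough illustration of the kind of model described above (a feed-forward regression network with two hidden layers mapping one cell line's L1000 profile to another's), here is a minimal PyTorch sketch; the 978-gene input dimension, hidden size, optimizer and learning rate are assumptions, not the actual configuration.

```python
import torch
import torch.nn as nn

class ExpressionRegressor(nn.Module):
    """Two-hidden-layer feed-forward regressor: L1000 profile of cell line A
    in, predicted expression in cell line B out (same compound, same conditions)."""
    def __init__(self, n_genes=978, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_genes, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_genes),  # regression output, one value per gene
        )

    def forward(self, x):
        return self.net(x)

model = ExpressionRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(profiles_line_a, profiles_line_b):
    """One mean-squared-error update on a batch of paired profiles."""
    optimizer.zero_grad()
    loss = loss_fn(model(profiles_line_a), profiles_line_b)
    loss.backward()
    optimizer.step()
    return loss.item()
```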
Subsequently, we attempt to predict the outcomes of high-throughput screening assays from the ChemBank dataset (Seiler et al., 2008) using the LINCS transcriptomic profiles. Our main challenge is training the model with a limited amount of data, since we intersect the LINCS and ChemBank datasets to keep only their common compounds. Our results are heterogeneous and will need to be improved by exploring new model architectures or datasets; nevertheless, they are very encouraging. Our work will enable the in silico identification of compounds that would be good candidates for the pre-clinical phase of drug development against certain cancers.
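The data-pairing step mentioned above, keeping only the compounds shared by LINCS and ChemBank, boils down to a set intersection followed by a join; the file and column names in this sketch are purely illustrative, not the real schemas.

```python
import pandas as pd

# Hypothetical flat exports of the two resources (illustrative only).
lincs = pd.read_csv("lincs_l1000_profiles.csv")        # one row per (compound, profile)
chembank = pd.read_csv("chembank_assay_outcomes.csv")  # one row per (compound, assay result)

common = set(lincs["compound_id"]) & set(chembank["compound_id"])
paired = (lincs[lincs["compound_id"].isin(common)]
          .merge(chembank, on="compound_id"))
# `paired` is the small training set whose limited size is the main challenge.
```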
Using Denoising Diffusion Probabilistic Models for the Denoising of Low-Depth RNA-seq
Poster presented by Carl Munoz
Though RNA-seq has allowed for a deeper understanding of cellular activity, sequencing costs limit the number of samples that can be sequenced per experiment. These costs can be reduced by decreasing the number of reads sequenced per sample, but doing so also reduces data quality. Existing techniques for artificially increasing data quality, such as imputation in single-cell RNA-seq, are not adapted to bulk RNA-seq data. One model with the potential to solve this issue is the denoising diffusion probabilistic model (DDPM). This model, originally designed for image generation, iteratively denoises random data over several timesteps to generate a realistic sample similar to the data it was trained on. An alternative method instead starts from all-zero values and increases them iteratively. It so happens that the partially denoised samples at intermediate timesteps strongly resemble low-coverage RNA-seq. As such, it appears possible to develop and train a similar DDPM that receives low-coverage transcriptomic profiles at the corresponding intermediate timestep and iteratively denoises them to reach the same quality as standard-coverage RNA-seq. Here, we present the preliminary results of the development of such a model, in which we use the TCGA dataset to benchmark the effect of low-coverage RNA-seq and to train a DDPM we have implemented. These results suggest that the DDPM denoises low-coverage RNA-seq more accurately than naïve methods. This model has the potential to set new sequencing standards for transcriptome-wide RNA-seq analysis, whether for per-patient sequencing or large-scale experimental analyses.
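The key trick described above, entering the reverse diffusion at an intermediate timestep with a low-coverage profile instead of starting from pure noise at the final timestep, can be sketched as follows with a standard DDPM noise schedule; the model's call signature and the choice of starting timestep are assumptions, not details of the actual implementation.

```python
import torch

@torch.no_grad()
def denoise_from_intermediate(model, x_low_coverage, t_start, betas):
    """Standard DDPM reverse process (Ho et al., 2020), but initialized with a
    low-coverage expression profile treated as the partially noised sample at
    timestep `t_start` rather than with Gaussian noise at the final timestep.
    `betas` is the 1-D tensor of noise-schedule variances."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = x_low_coverage
    for t in range(t_start, -1, -1):
        eps = model(x, torch.tensor([t]))              # predicted noise at step t
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x
```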
Concordance Index Stabilization for Survival Analysis from Cancer Gene Expression with Regular and Deep Cox Proportional Hazards
Poster presented by Léonard Sauvé
Computational survival analysis from cancer gene expression data allows the identification of cancer molecular subgroups with differential prognoses, enabling the discovery of new therapeutic targets and, ultimately, better patient treatment response and survival. The regularized Cox proportional hazards model and the Cox deep neural network were developed to tackle this question, but there is little evidence that the deep models are statistically superior to their linear counterpart. We believe this might be because the models have not reached convergence. We propose to compare the concordance index, the standard measure of predictive performance in survival analysis, on fully converged models, that is, when their loss is at its minimum. To reach convergence, we first applied a sigmoid function to the output of the models, then selected the necessary hyper-parameters and evaluated their performance by cross-validation. Our analysis reveals no significant performance difference when using a deep model. Second, we show that reducing the dimensionality of the input gene expression profiles with Principal Component Analysis does not improve the performance of the models. Finally, we compare our Julia implementation with a standard survival analysis library in Python. With this study, we aim to stabilize the concordance index evaluation of deep and regular survival analysis models.
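For readers unfamiliar with the metric, the concordance index compared throughout this work is, informally, the fraction of comparable patient pairs in which the patient who experienced the event earlier was assigned the higher predicted risk. Below is a simplified Python sketch of Harrell's estimator; the lab's actual implementation is in Julia, and this version ignores some tie-handling subtleties.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data: among comparable pairs
    (the patient with the shorter follow-up had an observed event), count
    how often that patient received the higher predicted risk; ties in the
    predicted risk count for 0.5."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue                      # censored patients cannot anchor a pair
        for j in range(n):
            if times[j] <= times[i]:
                continue                  # j must be followed strictly longer than i
            comparable += 1
            if risk_scores[i] > risk_scores[j]:
                concordant += 1.0
            elif risk_scores[i] == risk_scores[j]:
                concordant += 0.5
    return concordant / comparable
```

A value of 0.5 corresponds to random predictions and 1.0 to a perfect ranking of patients by risk.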