Nicolas Jacquin begins his doctorate in the Lemieux Laboratory

Following the submission of his master’s thesis, Nicolas Jacquin is continuing his studies in the Lemieux Laboratory and beginning a doctorate! His master’s thesis, entitled “Transcriptomics by k-mers through the adaptation of factored vector representations and the identification of genomic contexts”, is now available in UdeM’s Papyrus system.

The continuous rise of transcriptomics and sequencing technologies has led to the development of numerous pipelines for transcriptomic data analysis. However, these methods all rely on aligning sequences to a reference genome to generate a transcriptomic profile. This alignment introduces biases and often fails to capture rare but potentially phenotypically significant genomic events, such as gene fusions. To overcome this limitation, it is necessary to produce reference-free transcriptomic profiles. This would allow an RNA sample to be represented directly from sequencing reads, without relying on gene annotations, while retaining the predictive capability of a “classical” transcriptomic profile for phenotypes that depend on transcriptomic information. However, this approach raises a problem of high dimensionality, since it involves working directly with raw sequencing reads and abandoning the concept of genes as a guide. In this master’s thesis, I first present the development of a structure capable of representing hundreds of RNA-seq samples in memory. I then propose a method that uses neural networks to reduce the dimensionality of the data while preserving the transcriptomic information. This network is trained exclusively on k-mers derived from sequencing reads, and its task is to predict the abundance of k-mers in each sample. Training produces a low-dimensional space (called an embedding) that is representative of the transcriptomic profiles used during training, without the need for reference alignment. These low-dimensional embeddings should be usable for all sorts of transcriptomics-related predictions, such as cancer type or tissue type classification.
I also present a jointly developed tool that builds on our optimization work on k-mer tables to rapidly identify, without a reference, the sequences flanking a peptide of interest from k-mers derived from sequencing experiments, thereby enabling the discovery of flanking sequences for non-canonical peptides (such as peptides from immunopeptidomics).
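
To make the idea of a k-mer table concrete, here is a minimal sketch in Python of how reads from several samples could be turned into a k-mer by sample abundance matrix, the kind of input a network predicting k-mer abundance would need. It is an illustration only and not the structure developed in the thesis: the function names (`count_kmers`, `build_kmer_table`), the toy reads and the value of `k` are all assumptions, and a real implementation would need a far more compact representation to hold hundreds of RNA-seq samples in memory.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every substring of length k across a list of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_kmer_table(samples, k):
    """Build a k-mer by sample count matrix.

    `samples` maps a (hypothetical) sample name to its list of reads.
    Returns the sorted k-mers, the sample names, and one row of counts
    per k-mer, with columns in the order of the sample names.
    """
    per_sample = {name: count_kmers(reads, k) for name, reads in samples.items()}
    kmers = sorted(set().union(*per_sample.values()))
    names = list(per_sample)
    # A Counter returns 0 for a k-mer that is absent from a sample.
    table = [[per_sample[name][kmer] for name in names] for kmer in kmers]
    return kmers, names, table


# Toy data, invented for illustration only.
samples = {
    "sample_a": ["ACGTAC", "CGTACG"],
    "sample_b": ["ACGTTT"],
}
kmers, names, table = build_kmer_table(samples, k=4)
for kmer, row in zip(kmers, table):
    print(kmer, row)
```

Each row of the resulting table is the abundance of one k-mer across samples, with no reference genome or gene annotation involved. The sketch leaves out considerations that matter on real data, such as treating a k-mer and its reverse complement as the same entry, filtering out low-count k-mers that are likely sequencing errors, and normalizing for sequencing depth.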