Nicolas Jacquin presents his research work at the CSHL 2024 Meeting on Biological Data Science

Nicolas Jacquin presented a poster on the reconstruction of transcriptomic profiles from k-mers and using machine learning during the *Biological Data Science meeting *, organized by Cold Spring Harbor Laboratory (CSHL)

Reconstructing transcriptomic profiles from short K-mers using machine learning

As the field of transcriptomics grows, it has become common practice to align reads to reference genomes as part of an analysis pipeline. While these references make our data understandable to us, they also introduce strong biases in the context of samples displaying high mutational load or genomic rearrangements. In cancer transcriptomics, reads that come from genes that have undergone massive mutations may be filtered out, masking an important feature of the sample. Worse yet, reads that confirm gene fusions tend to be either clipped or filtered out. In this work, we lay the groundwork for reference-free transcriptomics. A table of k-mers (subsequences of raw reads of constant size k) and their abundance should hold at least just as much biologically significant information as a classical transcriptomic profile (the expression level of each gene). All the information required to build a classical transcriptomic profile is still contained in a k-mer profile, and so is all the information that would be biased by the reference. However, biologically significant and interpretable data is hard to access from such a profile, given that a single RNA-Seq can produce up to 700 million different k-mers, depending on the depth and selected k. Not only that type of profile is so voluminous that it requires implementing specialized structures, it also contains a lot of noise arising from sequencing errors. As such, the space that represents k-mer abundances (and thus the transcriptomic profile) must be reduced to a more reasonable size. To do so, we rely on the “factorized embeddings” architecture of deep neural networks (Trofimov et al., 2020). We train such model to predict the abundance of each k-mer across multiple sequencing experiments (samples) from the k-mer sequences and an embedding for each sample. To make these predictions, the embedding corresponding to each sample must change to become a representation of the overall transcriptomic profile. Once the network has finished training, those weights (embeddings) should be a low-dimensionality representation of the transcriptomic profiles. These weights can then be used as input to an independent neural network, such as a classifier, to make predictions that should be possible given the transcriptomic profiles of each sample used to train the embeddings. We have currently managed to produce embeddings from the Leucegene dataset (https://leucegene.ca/) and the TCGA-BRCA cohort (cancer.gov/tcga), with up to 0.95 Pearson correlation between the predicted k-mer abundances and the real values. The extraction of information from the trained embedding is a work in progress, but shows promise with an accuracy of up to 65% for predicting the tissue type of each sample of the Leucegene dataset.