Baseline Acute Myeloid Leukemia Prognosis Models using Transcriptomic and Clinical Profiles by Studying the Impacts of Dimensionality Reductions and Gene Signatures on Cox-Proportional Hazard

Abstract

Extracting gene markers to evaluate risk in cancer can refine diagnosis and lead to adapted therapies and better survival. Such survival analyses can be performed with machine learning (ML) algorithms, such as the Cox proportional hazards (CPH) model, applied to gene expression (GE) RNA-Seq data. However, optimally tuning the CPH on genome-wide GE data is challenging and has so far been poorly assessed. In this work, we interrogate an acute myeloid leukemia (AML) dataset (Leucegene) to identify the key factors that drive down CPH performance and to characterize its sensitivity to them, with the aim of improving the system. We compare projection-based and selection-based data reduction techniques, mainly PCA and the LSC17 gene signature, in combination with the CPH in an ML framework. Results reveal that the CPH performs better with a combination of clinical and gene expression features. We determine that, without clinical information, projections perform better than selections. We ascertain that the CPH is affected by overfitting and that this overfitting is linked to the number and content of the input covariates. We show that PCA links clinical features through its ability to learn directly from the input data, and that it generalizes better than LSC17 on Leucegene. We postulate that projections are preferable to selections on harder tasks such as assessing risk in the intermediate subset of Leucegene. We extrapolate that these findings apply in the more general context of risk detection via machine learning in cancer. We see that higher-capacity models such as CPH-DNN systems can be improved via survival-derived projections and can combat overfitting through heavy regularization.
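For readers unfamiliar with the quantities discussed above, the following is a minimal sketch of the objective a CPH model minimizes and of the concordance index commonly used to score risk predictions. It is not the authors' implementation: the function names, the ridge (`l2`) penalty form, and the toy data are assumptions made for illustration, and tied event times are not given any special treatment.

```python
import math

def cox_neg_log_partial_likelihood(beta, X, times, events, l2=0.0):
    """Negative Cox partial log-likelihood plus an optional ridge penalty.

    Hypothetical helper for illustration; ties in event times are not corrected.
    """
    # Linear risk score for each sample: beta . x
    scores = [sum(b * x for b, x in zip(beta, row)) for row in X]
    nll = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            continue  # censored samples contribute only through risk sets
        # Risk set: every sample still under observation at time t_i
        risk_sum = sum(math.exp(scores[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        nll -= scores[i] - math.log(risk_sum)
    return nll + l2 * sum(b * b for b in beta)

def concordance_index(scores, times, events):
    """Fraction of comparable pairs in which the higher score fails earlier."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an observed event
        for j in range(len(times)):
            if times[j] > times[i]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / comparable

# Toy data (assumed): one covariate, four patients, the third one censored.
X = [[2.0], [1.0], [0.0], [-1.0]]
times = [1.0, 2.0, 3.0, 4.0]
events = [1, 1, 0, 1]

nll_null = cox_neg_log_partial_likelihood([0.0], X, times, events)
nll_fit = cox_neg_log_partial_likelihood([1.0], X, times, events)
nll_ridge = cox_neg_log_partial_likelihood([1.0], X, times, events, l2=0.1)
c_index = concordance_index([row[0] for row in X], times, events)
```

In a reduction-then-CPH setup, the rows of `X` would hold either projected components (for example PCA scores) or the selected signature genes, optionally concatenated with clinical covariates. The penalty term illustrates one simple way the number of input covariates can be traded off against overfitting.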

Léonard Sauvé
PhD student in bioinformatics

PhD student in bioinformatics | Development of automated risk assessment systems for acute myeloid leukemia from gene expression data

Sébastien Lemieux
Principal investigator

Principal investigator, Functional and Structural Bioinformatics Research Unit, IRIC | Scientific director of the Bioinformatics platform | Associate professor, Department of Biochemistry and Molecular Medicine, Université de Montréal