Photo by Braňo on Unsplash

Topic models for cancer research

How tools developed in linguistics can help medicine

Filippo Valle
3 min readMay 26, 2022

--

Text analysis is indeed one of the most promising fields in which machine learning is expanding. Read and extract information from written documents has become a powerful tool to gain and mine information from large text dataset.

The basics of topic modeling: Latent Dirichlet Allocation

One of the most famous tool to extract information from a bunch of texts is the framework called: Topic Modeling.

The aim of topic modeling is to describe a set of texts (documents) as a mixture of topics (you can think of them as a set of low-dimension coordinates). Each of this topic is represented by a probability distribution over your vocabulary.

The output of your model is composed by two sets of probabilities:

  • P(topic | document) describes the mixture of topic that contribute to a document;
  • P(word | topic) describes which words are important in which topic.

One of the first-introduced and most used model is the so-called Latent Dirichlet Allocation.

A network approach to topic models

In 2018 it was introduced [4] a new way of performing topic modeling. The idea is quite simple: imagine each of your document as a node connected to the word it is composed of. You can also weight the link with the number of times a word appear in the document. You will have a bipartite network with texts on one side and words on the other. Performing community detection on this kind of network is equivalent [4] to do topic modeling.

Bipartite network
A simple bipartite network. Image from https://www.mdpi.com/2072-6694/12/12/3799

A versatile framework

The network description of topic models has a good advantage: it can be easily extended to every kind of data that can be represented as a bipartite network. In Natural Language Processing this is often referred to as Bag of Words.

Gene expression data

A completely different, and apparently uncorrelated, framework is the biological framework of gene expression.

It is possible to obtain data concerning the number of messenger RNA in given cells or tissues’ samples. It has been shown that this kind of data share a lot of statistical properties [3] with linguistics and this encourages us to use algorithms and models developed in linguistics also in this context.

Breast cancer’s subtypes

One setting topic modeling can be used to classify Breast Cancer subtypes.

Breast Cancer can be divided in subclasses due to their molecular characteristics, each of these subtypes has different prognosis and treatments. Being able to identify them is of crucial importance.

Whit this kind of data it is possible to build a network in which your documents are breast cancer samples and your words are the genes. Topic modeling will help to identify Breast Cancer subtypes and will help the clinicians to exploit which are the genes (words) that play a role in the sub-typing of the cancer [1][2].

Breast cancer subtypes
Clusters of Breast Cancer. Image from https://www.mdpi.com/2072-6694/12/12/3799

References

[1] Valle F. et al.; A topic modeling analysis of TCGA transcriptomic data. Cancers 2020, 12 (12) 1150.

[2] Valle F. et al.; Multiomics topic modeling for breast cancer classification. Cancers 2022, 14(5) 3799.

[3] Lazzardi S. et al.; Emergent statistical laws in single-cell transcriptomic data. BiorXiv, 2021.

[4] Gerlach M. et al; A network approach to topic models. Science Advences, 2018, 4(7).

--

--

Filippo Valle

Interested in physics, ML application, community detection and coding. I have a Ph.D. in Complex Systems for Life Sciences