Predicting T cell specificity using ML and deep phenotypic data

“Can we predict T cell specificity with digital biology and machine learning?” was the title of a recent publication in Nature Reviews Immunology by Hudson et al. The authors’ conclusion: it’s tough. Why bother in the first place? Because T cells target all sorts of diseases, their T cell receptor (TCR) sequences are relatively straightforward to determine and, most importantly, these TCRs can be turned into therapeutics.

Most methods try to predict T cell specificity from the TCR sequence alone, even though the amount of information contained in these few amino acids is naturally limited. Schattgen et al. gave this a twist by co-clustering TCR sequences with transcriptional profiles. With our just-published Cell Reports paper, “In-depth analysis of human virus-specific CD8+ T cells delineates unique phenotypic signatures for T cell specificity prediction”, we weigh in with plenty of multimodal data (which we have made publicly available) and cool analysis. In brief, we (i.e. ImmunoScape) describe and share in-depth profiling data of virus-specific CD8+ T cells and demonstrate that ML can be used to predict T cell antigen specificity from phenotypic data alone. We identify unique phenotypes of T cells specific for different virus antigens and train machine learning models that infer phenotypic signatures and correctly predict T cell specificity from these deep phenotypic profiles. Most importantly, we functionally validate our ML predictions by confirming antigen specificity through in vitro testing, including an example of a so-called unseen epitope!

Artist’s impression of our ML-based T cell specificity prediction

Data

The multimodal data we generated for this paper were produced with advanced techniques honed during ImmunoScape’s service years, when we generated immunological insights for drug, vaccine and disease monitoring studies for big pharma companies. The data come in two main forms: 1) T cell specificity screening data using CyTOF (mass cytometry) and 2) deep profiling VDJ CITE-Seq data. We use a proprietary multiplexing technology (developed by our co-founder Evan Newell) which allows us to screen millions of T cells against hundreds of epitope candidates at once, while also generating a readout for dozens of phenotypic markers. This tells us which T cell specificities exist in a sample and also provides phenotypic information. This is followed by a VDJ CITE-Seq approach, which measures the phenotype of the just-identified antigen-specific T cells in two modalities (gene expression and surface markers), as well as the TCR sequence and antigen specificity of every single cell. VDJ CITE-Seq is more or less a commodity approach, but what makes it so powerful in our hands is that it is informed by the upstream CyTOF approach: we only use peptides that were found in the screening step, rather than blindly adding random peptides.

Machine learning

For this paper we trained two ML models: one for the CyTOF data and one for the VDJ CITE-Seq data. In the latter case we actually integrate both phenotypic modalities. In both cases we only predict the viral source (the class) and not the viral epitope. The actual epitope for the corresponding TCR still has to be determined separately, a process which is called deorphanization. Predicting the class instead of the epitope seems like a shortcoming, but has the following advantages: 1) we have plenty more data to train with (by pooling the data for all epitopes of the same virus), which is obviously good for the ML models, and 2) we can find TCRs specific for antigens/epitopes that were not even part of our experimental panels, i.e. the input data (also known as “unseen” epitopes). And we describe and validate exactly such an example in the paper! Experimental deorphanization itself is a hard problem, but manageable for viruses, whose peptidomes are limited in size. Even without deorphanization these simple ML models can already be used to filter or prioritise TCR candidates.
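
To make the class-versus-epitope idea a bit more concrete, here is a minimal sketch on synthetic data. It is not the model from the paper: the epitope and virus names, the two marker values and the nearest-centroid classifier are all placeholders chosen for illustration. The sketch pools every epitope of a virus into one class, keeps one epitope out of training entirely, and then checks whether cells specific for that held-out epitope are still assigned to the right virus. Note that the simulation builds in the assumption that epitopes of the same virus share a phenotype, which is exactly what real data would have to show.

```python
import random
from collections import defaultdict

# Hypothetical epitope -> virus mapping (placeholder names, not the paper's panel).
EPITOPE_TO_VIRUS = {
    "virusA_ep1": "virusA",
    "virusA_ep2": "virusA",
    "virusA_ep3": "virusA",
    "virusB_ep1": "virusB",
    "virusB_ep2": "virusB",
}

# Synthetic mean phenotype per virus: two made-up marker intensities.
VIRUS_MEAN = {"virusA": (2.0, 0.5), "virusB": (0.5, 2.0)}


def simulate_cells(n_per_epitope, seed=0):
    """Return a list of (epitope, (marker_1, marker_2)) for synthetic cells."""
    rng = random.Random(seed)
    cells = []
    for epitope, virus in EPITOPE_TO_VIRUS.items():
        m1, m2 = VIRUS_MEAN[virus]
        for _ in range(n_per_epitope):
            cells.append((epitope, (rng.gauss(m1, 0.3), rng.gauss(m2, 0.3))))
    return cells


def fit_centroids(cells):
    """Pool all epitopes of a virus and average their marker vectors."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for epitope, (m1, m2) in cells:
        acc = sums[EPITOPE_TO_VIRUS[epitope]]  # collapse epitope -> virus class
        acc[0] += m1
        acc[1] += m2
        acc[2] += 1
    return {virus: (a[0] / a[2], a[1] / a[2]) for virus, a in sums.items()}


def predict_virus(centroids, features):
    """Assign a cell to the virus whose centroid is closest."""
    m1, m2 = features
    return min(
        centroids,
        key=lambda v: (m1 - centroids[v][0]) ** 2 + (m2 - centroids[v][1]) ** 2,
    )


def holdout_accuracy(held_out_epitope, n_per_epitope=50, seed=0):
    """Train without one epitope, then score cells specific for it."""
    cells = simulate_cells(n_per_epitope, seed)
    train = [c for c in cells if c[0] != held_out_epitope]
    test = [c for c in cells if c[0] == held_out_epitope]
    centroids = fit_centroids(train)
    hits = sum(
        predict_virus(centroids, feats) == EPITOPE_TO_VIRUS[ep] for ep, feats in test
    )
    return hits / len(test)


if __name__ == "__main__":
    print(holdout_accuracy("virusA_ep3"))
```

Holding out a whole epitope, rather than a random subset of cells, is the design choice that matters here: it is the only split that says anything about “unseen” epitopes.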

As an aside: our study is a proof of concept for the possibility of leveraging rather simple ML concepts together with solid data to predict antigen specificity from phenotypic data. Some people will wonder about the application of deep learning (DL) methods, which these days have become almost synonymous with AI/ML. DL methods can be extremely powerful and while we have a few in our arsenal, we always try non-DL methods first, as a matter of principle. And that’s for two simple reasons: they are easier to compute and, more importantly, they are immediately explainable (refer to the regression coefficients in the paper for example).
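
As a rough illustration of what “immediately explainable” means, the sketch below fits a plain logistic regression by gradient descent and ranks the markers by the size of their coefficients. Everything in it is an assumption made for the example: the marker names are hypothetical, the data are synthetic (only marker_1 differs between the two classes), and the fitting routine is a bare-bones stand-in rather than the regression used in the paper.

```python
import math
import random

MARKERS = ["marker_1", "marker_2", "marker_3"]  # hypothetical marker names


def make_data(n=200, seed=1):
    """Synthetic cells: marker_1 separates the classes, the rest is noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else 0
        centre = 1.5 if label else -1.5
        X.append([rng.gauss(centre, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)])
        y.append(label)
    return X, y


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=300, l2=0.01):
    """Batch gradient descent on the L2-regularised logistic loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, target in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) - target
            for j in range(d):
                grad_w[j] += err * row[j]
            grad_b += err
        w = [wi - lr * (g / n + l2 * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def rank_markers(w):
    """Sort markers by absolute coefficient, largest first."""
    return sorted(zip(MARKERS, w), key=lambda pair: abs(pair[1]), reverse=True)


if __name__ == "__main__":
    X, y = make_data()
    w, b = fit_logistic(X, y)
    for name, coef in rank_markers(w):
        print(f"{name}: {coef:+.2f}")
```

The ranked coefficients are the explanation: a large positive weight means that a higher value of that marker pushes a cell towards the positive class, with no extra attribution method needed. Comparing coefficients this way assumes the markers are on a comparable scale.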

Why publish as a company?

One might think that writing papers and sharing data plus analysis is unusual, even counterproductive, for a startup. At ImmunoScape we strongly believe the opposite is true and, in fact, we have a history of publishing papers. Every published paper not only showcases our work, but also validates our approaches through peer review, which is important. Anyone can promise the moon on their website or in pitch decks, but a peer-reviewed article demonstrates credibility, especially when paired with in vitro validation and supporting open data.

Obviously, we cannot give away our secret sauce, insights into our advanced methods or tumour-specific data which we use routinely in-house for the discovery of TCRs associated with solid tumours.

Going forward

This paper contains just proof-of-concept data and analysis. We have produced orders of magnitude more data, including data from complementary technologies, focusing on solid cancer. We have furthermore built an array of sophisticated machine learning models for a number of applications. Naturally, we cannot release any of this, because we are using it for our mission to discover and develop next-generation TCR therapeutics. Let’s just say there is plenty in the pipeline.

Having said this, while big multimodal data and advanced computational methods are required for success, they are not sufficient: it’s the marriage of experimental and computational approaches (and I consider AI/ML just a subset of the latter) which makes all the difference. I am grateful for working in a company where the wet and dry labs work very closely together, open communication is the norm and silos are absent. I am also grateful for working alongside exceptional talent, such as Florian Schmidt, the first author of this study, and for being surrounded by colleagues who are, across the board, simply nice and a pleasure to work with.