NOETIK

TARIO-2: A Whole-Transcriptome Foundation Model from H&E Alone

Abhishaike Mahajan — Fri, 17 Apr 2026 19:09:49 GMT

tl;dr: At Noetik, we build foundation models of patient biology. Accordingly, we generate the data best suited to molecularly characterizing patient tissue to train these models. However, the modality we believe is the richest, 18,963-plex spatial transcriptomics (SpT), is a research assay not used in clinical practice. To help ensure our models can be deployed more universally, we’ve created TARIO-2, an evolution of our earlier TARIO model. TARIO-2 has been trained on rich multimodal data from thousands of patients, and acts as a translation layer between routinely collected tumor data (H&E) and SpT. In this piece, we report early accuracy metrics, how we interpret the results, and what models like it may mean for the future of increasingly personalized cancer care. Finally: our work on TARIO-2’s performance in distinguishing responders from non-responders in a novel treatment regimen setting has been accepted to a major upcoming conference, and we look forward to sharing more details of this work closer to that presentation.

An interactive demo of TARIO-2 predictions on real cancer whole-slide images (WSIs) can be found here.

To discuss partnerships around this model, reach out to partnerships@noetik.ai.

Introduction

At Noetik, we build foundation models of human biology. In other words, our models attempt to understand the full complexity of a patient’s tissue well enough to predict which drugs will work and why. Our earlier models, trained on 18,963-plex spatial transcriptomics (SpT) data collected at subcellular resolution, have demonstrated that this kind of understanding is achievable, given sufficiently rich measurements. But richness and availability of human biological data are always in tension. SpT is extremely expensive to generate, rarely collected outside academic settings, and never used in clinical practice. If our models can only interpret patients whose tissue has been profiled with SpT, then they can only interpret a vanishingly small fraction of cancer patients, and they definitely can’t do it in the clinic.

H&E pathology images sit at the other end of this tradeoff. They are inexpensive, routinely collected, and, most critically, already exist for nearly every cancer patient. At first glance, they reveal dramatically less than a spatial gene map. But H&E contains far more than meets the human eye. Spatial and morphological patterns in tumor and adjacent cells and tissues can be strongly correlated with gene expression: some cases are obvious, like genes expressed in easily identified cell types, while others are not salient to the human eye but predictive nonetheless.

TARIO-2 is a new foundation model we have built to exploit exactly this fact. It extends the TARIO architecture, which we previously developed for SpT data, to multimodal sequences of H&E and SpT tokens. Critically, at inference time TARIO-2 requires only an H&E image as input [1]. From this image alone, TARIO-2 generates a predicted expression map for every gene in its training panel, effectively embedding that patient into the same rich biological representation our SpT-trained models operate in, at the cost of GPU-minutes.

As a result, TARIO-2 has allowed us to dramatically expand our partnership base for clinical trials, as the barrier to entry is no longer “enough physical tumor tissue to extract SpT data from”, but simply “do you have H&E images?” Nearly every pharma company, every clinical trial, and every hospital biobank can answer yes to that question.

In a forthcoming piece, we will show how we are applying TARIO-2 to H&E tumor images in the context of clinical trials. In that use case, the pretrained TARIO-2 can learn from existing therapeutic outcomes data and, consequently, drive the design of upcoming clinical trials.

But even before response prediction enters the picture, the ability to embed any clinical H&E into a shared, high-dimensional representation of patient biology is immediately and independently valuable. This essay discusses this particular use-case of TARIO-2, how accurate it is, and what its predictions reveal about the relationship between tissue morphology and the transcriptome. Moreover, while this essay discusses one way of using this model, we view TARIO-2 as a more general simulator of cancer biology, which supports a wide variety of use cases.

If you are interested in partnering on any of these models, please reach out to partnerships@noetik.ai.

How accurate is it?

Generally speaking: accurate for thousands of genes across various cancer types, with the accuracy slowly diminishing for genes with more spatially diffuse expression patterns (which are intrinsically harder to learn).

First, some context. TARIO-2 treats SpT data like images aligned to corresponding H&E images. But instead of having 3 RGB channels like H&E, the SpT data contain roughly 20,000 “colors”, one for each gene in the whole-transcriptome panel TARIO-2 is trained on. These visuo-transcriptomic data are generated at submicron (and therefore subcellular) resolution, but at inference time we typically make H&E-to-SpT predictions on a much coarser grid, trading spatial fidelity for inference speed. This is merely a compute limitation, though: the more GPUs we throw at the problem, the higher the resolution of the predictions (hinting at an additional “scaling axis” described below).

To visualize what model predictions look like: on the left, we see the raw RNA abundance of a gene in a single tumor sample (in yellow); in the middle, the binned values of those ground-truth transcripts; and on the right, the model’s predictions (yellow is high, purple is low) when given only the H&E of that sample. You can see that while the grid bins are larger than any individual cell, in aggregate, the general spatial pattern they make up closely matches the ground truth.

Fig 1. Comparison between raw transcripts (yellow), binned transcript counts, and TARIO-2 predictions across multiple genes for the same sample

We can quantify the accuracy of TARIO-2 by binning the ground-truth transcripts into the same grid-like format as TARIO-2’s outputs and computing the spatial correlation coefficient for each gene across every tumor sample. The higher the correlation, the better the model’s prediction. But while this is a fair estimate of raw accuracy, it understates performance where the model is most useful. Not all genes display interesting spatial patterns worth predicting. Some are housekeeping genes expressed uniformly; others are expressed or detected at levels too low to form any predictable spatial structure.
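As a rough illustration of the metric described above, the sketch below bins one gene’s transcript coordinates onto a square grid and correlates the binned counts with a predicted map. It assumes a Pearson correlation over grid bins and a 35 µm bin size; the function names and the toy data are hypothetical, and this is not TARIO-2’s actual evaluation code.

```python
import numpy as np

def bin_transcripts(xy, extent_um, bin_um):
    """Count transcript detections of one gene in square bins of side bin_um."""
    n_bins = int(np.ceil(extent_um / bin_um))
    edges = np.linspace(0.0, n_bins * bin_um, n_bins + 1)
    counts, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=[edges, edges])
    return counts

def spatial_correlation(truth_grid, pred_grid):
    """Pearson correlation between two maps, computed over all grid bins."""
    t = np.asarray(truth_grid, dtype=float).ravel()
    p = np.asarray(pred_grid, dtype=float).ravel()
    if t.std() == 0 or p.std() == 0:
        return float("nan")  # undefined for a spatially constant map
    return float(np.corrcoef(t, p)[0, 1])

# Toy data (hypothetical): 500 transcripts concentrated in one corner of a 140 um tile.
rng = np.random.default_rng(0)
xy = rng.uniform(0, 70, size=(500, 2))
truth = bin_transcripts(xy, extent_um=140, bin_um=35)

# Stand-in for a model output: the binned truth plus some noise.
noisy_pred = truth + rng.normal(0, 5, size=truth.shape)
r = spatial_correlation(truth, noisy_pred)
print(f"spatial correlation: {r:.3f}")
```

A spatially constant map returns NaN here, which is one reason uniformly expressed genes say little under this metric.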

To separate the genes that do have recoverable spatial signals from those that don’t, we need a measure of how concentrated a gene’s expression is across space, averaged over many tissue samples. We use a metric called Moran’s I for this, which measures spatial autocorrelation; genes with a higher Moran’s I have more spatial signal, and thus the model’s performance on those genes matters more.
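For readers unfamiliar with Moran’s I, here is a minimal sketch of it on a binned expression map. It assumes rook adjacency (bins that share an edge) with unit weights, which is one common choice; the text above does not specify the weighting we use, so treat the function as illustrative only.

```python
import numpy as np

def morans_i(grid):
    """Moran's I of a 2D map using rook (edge-sharing) neighbors with unit weights."""
    z = np.asarray(grid, dtype=float)
    z = z - z.mean()
    denom = (z ** 2).sum()
    if denom == 0:
        return float("nan")  # a constant map has no spatial structure to measure
    # Sum of z_i * z_j over horizontally and vertically adjacent bin pairs.
    pair_sum = (z[:, :-1] * z[:, 1:]).sum() + (z[:-1, :] * z[1:, :]).sum()
    n_pairs = z[:, :-1].size + z[:-1, :].size
    # I = (N / W) * sum_ij w_ij z_i z_j / sum_i z_i^2. Each pair appears twice in
    # the symmetric weight matrix, so the factors of 2 cancel.
    return float(z.size * pair_sum / (n_pairs * denom))

block = np.zeros((4, 4))
block[:, :2] = 1.0                             # one contiguous expressing region
checker = np.indices((4, 4)).sum(axis=0) % 2   # alternating bins

print(morans_i(block))    # positive: neighboring bins tend to match
print(morans_i(checker))  # negative: neighboring bins tend to differ
```

On these toy maps the contiguous block scores 2/3 and the checkerboard scores -1. Ranking genes by their average score across samples gives the kind of ordering used on the x-axis of Fig 2a.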

With that technical context, we can finally present TARIO-2 accuracy, on patients held out of training and categorized by cancer subtype. TARIO-2 is trained on 2,545 patients and validated on a held-out set of 213, with no patient overlap between the two. Note that because we generate data from many large, spatially distinct regions of each patient tissue sample, the actual number of multimodal tissue “images” we train and validate on is orders of magnitude larger.

Fig 2. Quantification of TARIO-2 accuracy on held-out patient samples, stratified by cancer type. a) Spatial correlation as a function of the number of genes included, where genes are ranked by spatial structure (Moran’s I). Shaded regions: 95% CI across samples. b) Correlation between spatially-averaged TARIO-2 predictions and total transcript detections across samples. Error bars: 95% CI across genes.

The left plot shows spatial correlation as a function of how many genes are included, ranked from most to least spatially structured. For the highest-signal genes, TARIO-2 achieves spatial correlations around 0.3–0.5, degrading gracefully as noisier genes are included. Performance varies by cancer type, likely reflecting greater morphological heterogeneity in some cancers.

The right plot shows bulk correlation, which collapses each sample into a single expression value per gene and asks how well TARIO-2 recovers overall abundance regardless of spatial arrangement. These correlations are notably higher, around 0.6 across most cohorts, meaning that even when the precise spatial pattern is difficult to recover, the model still captures which samples have higher or lower levels of a given gene.
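To make the distinction between spatial and bulk correlation concrete, the sketch below builds a toy “prediction” that has the right per-sample abundance but a scrambled spatial layout. The array shapes, the Poisson toy data, and the function name are assumptions for illustration, not our evaluation pipeline.

```python
import numpy as np

def bulk_correlation(pred_maps, truth_maps):
    """Per-gene Pearson r across samples; inputs have shape (samples, genes, H, W)."""
    pred_bulk = pred_maps.mean(axis=(2, 3))    # spatially averaged prediction
    truth_bulk = truth_maps.sum(axis=(2, 3))   # total detections per sample and gene
    p = pred_bulk - pred_bulk.mean(axis=0)
    t = truth_bulk - truth_bulk.mean(axis=0)
    return (p * t).sum(axis=0) / np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0))

rng = np.random.default_rng(0)
# Six toy samples with rising overall abundance, three genes, 8 x 8 bins each.
levels = np.arange(1, 7, dtype=float)[:, None, None, None]
truth = rng.poisson(lam=levels, size=(6, 3, 8, 8)).astype(float)

# Shuffle bins within each (sample, gene) map: abundance is kept, layout is lost.
pred = rng.permuted(truth.reshape(6, 3, 64), axis=2).reshape(truth.shape)

bulk_r = bulk_correlation(pred, truth)
spatial_r = np.corrcoef(pred[0, 0].ravel(), truth[0, 0].ravel())[0, 1]
print("bulk r per gene:", bulk_r)
print("spatial r for one (gene, sample) pair:", spatial_r)
```

Because shuffling preserves each map’s total, the bulk correlation is perfect here even though the spatial layout carries no information, which is the extreme version of the gap between the two panels.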

Because it’s hard to grok what a spatial correlation of 0.5 means versus a correlation of 0.2, some examples of (sample x gene) pairs at different levels of spatial correlation are shown below.

Fig 3. Qualitative examples of TARIO-2 predictions for (gene, sample) pairs across a range of correlation strengths. Top row: ground truth binned transcript counts. Bottom row: TARIO-2 predictions. Samples have been selected to demonstrate a diversity of prediction accuracies; as correlation lowers, the ground truth transcripts grow more diffuse and sparse.

In the cases where a firmly delineated spatial pattern exists in the data, TARIO-2 can recover it, often achieving correlations far above the averages presented in Figure 2. As the transcriptomic pattern grows sparser or more diffuse—the pattern that most genes have—TARIO-2 predictions decline in accuracy, but in those cases the absolute spatial pattern matters less than the bulk values.

That said, it’s worth pointing out that even predictions that correlate only weakly with ground truth can be biologically meaningful. For example, the rightmost prediction in Fig 3 has a correlation of 0.17 but is far from random: some spatial patterns in the H&E are at least suggestive of higher or lower gene expression levels, even though the measured expression is sparse and noisy (possibly due to false negatives in the SpT assay itself). So although some genes are much harder to predict than others, they nevertheless provide a meaningful training signal that TARIO-2 can use to infer the overall biology of the tissue.

We’ve added some comments on comparisons to other models in the footnotes [2].

Three ways to improve TARIO-2

The first and most obvious way to improve TARIO-2 is to increase the amount of pretraining data. Below, we compare a TARIO-2 trained on 93 patients to a TARIO-2 trained on 2,545 patients (the full dataset). Training on the full dataset quantitatively improves the prediction accuracy on 18,947 of the 18,964 genes in the SpT panel. Qualitative results are even starker, as predicted spatial patterns for many genes and tumors only resemble the ground truth when TARIO-2 is trained on the full dataset. And while 93 patients may seem like an extremely small amount of data, this is actually much larger than all public, cellular-resolution spatial transcriptomics datasets.

Fig 4. When trained on dataset sizes that mirror the upper end of public SpT datasets (~3% of our internal SpT data), TARIO-2 predictions are often dramatically worse.

Our ability to generate aligned, diverse, multimodal data at scale is thus a critical driver of model performance. Concretely, we are generating hundreds of patients’ worth of training data per month and are set to accelerate.

Second, we’ve observed that larger versions of TARIO-2 generalize better to held-out patients than the base model shown here. Since TARIO-2 is in the same architectural family as TARIO-1, we anticipate that further scaling model parameters and context length (which is proportional to the amount of tissue the models see at once) will continue to improve performance. A model that can see both a tumor and a distantly brewing immune response in the same field of view should, in principle, be able to make more accurate predictions than one that only sees small glimpses of tissue.

It is worth emphasizing that the TARIO-2 presented here is a 200M-parameter model, which is at the lower end of the tested TARIO size range. Our earlier work established that performance continues to scale out to 4B parameters, and that larger models better exploit increasingly complex inputs. The results shown here should therefore be read as a floor on performance.

The third and most interesting avenue of improvement is modulating the resolution and complexity of the tissue predictions themselves. In fact, we consistently find that inference at a fine-grained resolution (18 µm) yields predictions that are quantitatively better than those at a coarse resolution (35 µm) for nearly all genes. Higher-resolution predictions can occasionally produce dramatic jumps in performance as well. Observe below, where the fine-grained gene prediction (18 µm) has a fundamentally different (and far more accurate!) spatial pattern than the coarse one (35 µm).

Fig 5. When TARIO-2 predicts at higher resolutions (using more compute), the results are often far more accurate than lower-resolution predictions.

What is the limit? Given that the spatial transcriptomics data we train on has sub-micrometer resolution, we could conceivably increase the prediction resolutions to predict single-cell and even non-uniform subcellular distributions of gene expression. This comes with a steep computational cost, though: higher-resolution modeling and inference means more tokens and larger models to take advantage of this longer context. Once again: the results here are a lower bound on TARIO-2’s actual capability, limited mainly by the number of GPUs used for training and inference.
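The compute cost of finer resolution can be made concrete with some back-of-the-envelope arithmetic. The sketch below assumes one prediction per square grid bin, so the number of bins covering a fixed area grows with the inverse square of the bin size; how bins map onto model tokens is not described here, so this is only a lower-level intuition, not a statement about TARIO-2’s tokenization.

```python
def bins_per_mm2(bin_um):
    """Number of square prediction bins of side bin_um in one square millimeter."""
    return (1000.0 / bin_um) ** 2

coarse = bins_per_mm2(35.0)   # coarse inference grid from the text
fine = bins_per_mm2(18.0)     # fine inference grid from the text
ratio = fine / coarse         # equals (35 / 18) ** 2

print(f"35 um grid: {coarse:.0f} bins per mm^2")
print(f"18 um grid: {fine:.0f} bins per mm^2")
print(f"ratio: {ratio:.2f}x")
```

Under that assumption, roughly halving the bin size from 35 µm to 18 µm means nearly four times as many predictions for the same tissue area, and the same scaling applies again at each further step toward single-cell resolution.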

TARIO-2 encodes holistic patient biology

We find that TARIO-2’s H&E-based predictions capture not only local spatial gene expression patterns, but also their relationship to the overall biology of the patient. In other words, the totality of spatial gene expression patterns—which is well-inferred by TARIO-2 given a patient’s H&E—correlates with other important, patient-level cancer biology.

For instance, we find that a tSNE constructed from TARIO-2 gene predictions is well-organized by both cancer type (e.g. lung, breast, ovarian) and histologic diagnosis (e.g. adenocarcinoma, squamous cell carcinoma, serous carcinoma). This is easily visible via unsupervised clustering of these embeddings and quantifiable via standard methods like linear decoding.

Fig 6. tSNE of TARIO-2 gene predictions from H&E. a) Embedding space labeled by primary site of input tumor samples. b) Embedding space labeled by histologic diagnosis.

This is not too surprising, given that trained pathologists can easily tell one cancer type from another.

More notable is that TARIO-2 embeddings are predictive of tumor genotype, which strongly correlates with disease biology and treatment effectiveness but is not easily detected by even the highly trained human eye.

As an example: we trained linear decoders on TARIO-2 embeddings to predict whether a given NSCLC patient has a tumor mutation in the EGFR, KRAS, STK11, or PTEN genes. These predictions are accurate on held-out patients (AUROC of 0.65–0.75). At the population level (shown below), the differences between predictions also accurately reflect known biology: EGFR, KRAS, and PTEN mutations rarely co-occur, and patients with mutations in both the KRAS and STK11 genes occupy a distinct region of the embedding space compared to those with only a KRAS mutation.
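For readers who have not used linear decoding before, here is a minimal sketch of the procedure on synthetic data. Everything in it is a stand-in: the “embeddings” are random vectors, the “mutation” label is generated from one of their dimensions, and the decoder is a plain logistic regression fit by gradient descent, which may differ from the decoder we actually trained.

```python
import numpy as np

def fit_linear_decoder(X, y, lr=0.5, steps=500, l2=1e-2):
    """Logistic regression fit by gradient descent; returns (weights, bias)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probability of mutation
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * float(np.mean(p - y))
    return w, b

def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 16))                            # synthetic patient embeddings
y = (X[:, 0] + rng.normal(size=400) > 0).astype(float)    # synthetic mutation labels

# Fit on 300 "patients", evaluate on the 100 held out.
w, b = fit_linear_decoder(X[:300], y[:300])
held_out_auroc = auroc(y[300:], X[300:] @ w + b)
print(f"held-out AUROC: {held_out_auroc:.2f}")
```

An AUROC of 0.5 corresponds to chance and 1.0 to perfect separation, so the 0.65–0.75 range reported above sits between the two.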

Fig 7. tSNE of TARIO-2 predictions on NSCLC samples, showing that predicted gene mutants display expected organization phenomena: EGFR, KRAS, and PTEN mutants (panels a, b, d) rarely co-occur, and KRAS/STK11 mutants (panel c) occupy a distinct region of TARIO-2 embedding space.

Finally, because TARIO-2 embeddings capture patient-level biology from any H&E—including patient biopsies from clinical trials—they can be directly used to predict patient response to treatment. This is an immensely valuable use case, because predicting which patients respond to a drug means designing more efficient clinical trials with higher probability of success. We’re excited to soon share work that includes a real-world example of using TARIO-2 in exactly this way, now accepted to an upcoming oncology conference.

Can TARIO-2 be better than ‘ground-truth’?

One thing we suspect about many of the models we train at Noetik, including both TARIO-1 and TARIO-2, is that their predictions may actually expose latent biology better than the raw measurements they were trained on. SpT assays in particular, despite their miraculous improvements over the past few years, are well known to produce many false negatives (i.e., failures to detect gene transcripts). Thus what we really want is not a model that perfectly reconstructs noisy raw data, but one that “denoises” the measurements to encode the underlying biology of the data-generating process: which parts of the tissue are expressing how much of each gene.

A good illustration of this “denoising” behavior is a 2018 computer vision paper titled ‘Noise2Noise: Learning Image Restoration without Clean Data’. The setup is simple: take a corrupted image as input, take a differently corrupted version of the same image as the target, and train a model to map one to the other. What the authors found is that a model trained across many such noisy pairs will converge not on any single noisy observation but on their average, which, if the noise is unbiased, is the clean image.
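The core of that argument fits in a few lines. The sketch below is a stripped-down version with no network and no input image: it simply minimizes squared error against many noisy copies of a signal and checks where the minimizer lands. The sine-wave “image”, noise level, and step size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0.0, 2.0 * np.pi, 64))           # the signal we never observe
targets = clean + rng.normal(0.0, 0.5, size=(2000, 64))     # zero-mean noisy observations

# Gradient descent on the mean squared error against the noisy targets only.
pred = np.zeros(64)
for _ in range(200):
    grad = 2.0 * (pred - targets).mean(axis=0)
    pred -= 0.1 * grad

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

err_pred = rmse(pred, clean)           # error of the learned prediction
err_single = rmse(targets[0], clean)   # error of one raw noisy observation
print(f"prediction vs clean: {err_pred:.3f}")
print(f"single observation vs clean: {err_single:.3f}")
```

The prediction converges to the average of the noisy targets, which sits far closer to the clean signal than any single observation does, even though the clean signal was never used in training.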

Here is a useful graphic from the paper to understand what’s going on:

Image taken from Figure 2, Lehtinen, Jaakko, et al.

We bring this up because we think TARIO-2 is doing something similar. Although we can’t generate multiple (noisy) observations of SpT data from the same tissue sample, TARIO-2 is well-poised to learn from morphological redundancy across samples. The model does not see the same tissue twice, but it sees similar tissue thousands of times. One reason we think TARIO-2’s predictions may be closer to the true underlying biology than the raw SpT data is that, especially for sparsely detected genes, the model outputs are noticeably smoother than the punctate transcript detections and often track human-visible patterns in the H&E inputs.

Right now these observations are mainly food for thought. However, they could be pointing toward a broader theme for foundation models in the life sciences, most of which are trained on noisy snapshots of an underlying biological system, with data quality limited by current technology. Just the other week, Meta released TRIBE v2, a foundation model that can simulate fMRI recordings in response to sight, sound and language. In their blog post, they wrote:

Surprisingly, TRIBE v2’s predictions are often more representative of the typical response than an actual fMRI scan. While raw recordings are inherently noisy – distorted by heartbeats, movement and device artifacts – TRIBE v2 predicts a canonical brain response, which is actually more correlated with the group’s average neural activity than almost any single fMRI recording.

All of this implies we should not think of foundation models like TARIO-2 as mere replacements for expensive assays. Rather, they are systems that learn, by seeing far more examples than a human ever will and by discovering patterns invisible to us, the underlying biology that evades any single assay run on a single sample.

Conclusion

Earlier this year, Elliot Hershberg profiled Sid Sijbrandij—the founder of GitLab—and his effort to treat his own osteosarcoma after exhausting the standard of care. Sid’s approach involved running every available assay as frequently as possible and maintaining a massive, structured repository of the results: single-cell RNA sequencing, bulk RNA-seq, high-resolution microscopy, immune profiling, and more. Based on this information, his team developed over ten personalized therapies and, almost certainly thanks to these efforts, his cancer is currently in remission.

It is an extraordinary story. It is also, as several commentators noted, a story about a singular billionaire.

That said, it is worth spelling out why the “singular” part is what made the “billionaire” part necessary. Sid’s wealth could immediately buy the answer to many expensive questions about his tumor: which immune populations are at the margins, what proteins are most abundant, and so on. But what could not be bought outright was the context to interpret those answers. Is this expression pattern unusual? Does this immune infiltration profile resemble the patients who responded to that combination therapy, or the ones who didn’t? Likely far more expensive than the tumor profiling itself was the work of actually grasping what the data meant: manually assembling that reference frame and synthesizing incomplete literature and clinical intuition, all for a sample size of one.

TARIO-2 will help build that reference frame at scale, allowing us to represent all patients in the same interpretable biology space. When an H&E is run through the model, the output is nominally a spatial transcriptome, yes, but it is also a coordinate in this space, allowing clinicians and researchers to ask the sort of comparative questions that Sid’s team answered by hand, for one person, over many months. TARIO-2 generates the substrate for answering them in minutes.

Of course, models like these do not replace the assay. SpT remains the richest single measurement of a tumor we have access to, and for patients where it can be collected, we generate it alongside H&E. But for the overwhelming majority of cancer patients for whom SpT can never be generated, because of cost, logistics, or otherwise, TARIO-2 will allow their biology to be understood and contextualized in a way that was previously impossible.

And though contextual understanding is valuable, what is even more valuable is the potential for that understanding to be acted on by machine intelligence directly. As we mentioned earlier, an upcoming article will discuss research accepted to a major conference, in which we show how TARIO-2 distinguishes non-responders from responders based on pre-treatment H&E alone in a clinical trial setting. We look forward to sharing more detail about this case, which is for us the strongest demonstration of how foundation models can directly impact clinical decision-making and benefit patients.

If you are interested in partnering on the models discussed here, please reach out to partnerships@noetik.ai.

This article would not have been possible without significant contributions from Eshed Margalit, Daniel Bear, Lacey Padron, Dexter Antonio, Michela Meister, Ron Alfa, Dan Millman, and Dulce Ovando Morales.

[1] While we won’t dive into the technical details of TARIO-2 here, we can say that TARIO-2 is an autoregressive model trained to “predict the next token” in the style of LLMs. This is important because we have found that these models improve when shown longer sequences of tokens, and the inputs to TARIO-2 (high-resolution H&E images and micron-scale detections of transcripts for 18,000 genes) comprise a lot of tokens.

[2] SpT is still a nascent technology and there are major differences between the platforms on which researchers are now generating data: detection method, spatial resolution, sensitivity and specificity, etc. TARIO and TARIO-2 are, as far as we know, the only models trained on anywhere near this scale of CosMx data (>100M cells across >2500 patients). So although we are encouraged by the current model’s accuracy at predicting high-resolution spatial gene expression patterns in held-out patients, there is still plenty to learn about the relative contributions of H&E training data, paired SpT data, SpT platform, model architecture, training loss function, and other factors, and about how the resulting models compare to others. This is a major focus of ongoing work.

Scaling behavior of TARIO

Abhishaike Mahajan — Tue, 24 Feb 2026 15:04:32 GMT

tl;dr: We have trained a new self-supervised model across thousands of human cancer spatial transcriptomes, nicknaming it TARIO. We have found that the model exhibits scaling behavior across three axes. We have not yet discovered a ceiling to further scaling, and plan to dramatically expand both our compute and data generation capabilities in the coming year. We are always looking for ML, engineering, and wet-lab talent to push this further. Please reach out to me directly at abhishaike.mahajan@noetik.ai to chat!

Introduction

Last year, we trained a foundation model on one of the largest sets of spatial transcriptomics data in existence, called OCTO-VC. A few months back, we put together a few essays (1, 2, 3) on how we’ve used this model to tackle clinical-stage problems that were previously intractable, with the hopes that it would be interesting for oncologists, immunologists, and clinicians to read through.

Separately, we have been exploring fundamentally different approaches to modeling spatial biology: new architectures, new loss functions, and new tokenization schemes that depart from the design choices underlying our earlier work. The result is a new model, built from the ground up, that we refer to as TARIO [1].

Now, TARIO is particularly interesting in that it is a fair bit more amenable to scaling than anything we’d developed before. And after we noticed the relative ease with which more compute could be thrown at the model, a lot of questions were raised internally. Do we see any benefits to doing so? What axes can be scaled? Do we see ‘emergence’? In the end, we decided it would be a good use of our time to run the very expensive experiments necessary to answer these. This article is the end result, compiling together observations from dozens of training runs.

We hope this is interesting! But, more importantly, we are writing this essay because we are always searching for computational research, engineering, and wet-lab talent who want to push this sort of work even further. We are generating one of the largest spatial cancer datasets in the world (all collected from human tumor samples), are working on types of ML research that are conclusively pre-paradigmatic, and plan to ramp up our training scale by an order of magnitude in the coming months. We’re open to remote candidates, but also have a beautiful office in South San Francisco. And we are especially excited about working with people who have never touched biology at all! If you’ve worked on any modeling problems in vision, text, robotics, or otherwise, we’d still love to chat with you to convince you that cancer is worth spending some years working on.

If any of this seems up your alley, please reach out to me directly at abhishaike.mahajan@noetik.ai to discuss further.

Scaling behavior

First, some background information. We unfortunately cannot share too many details on how TARIO actually works, other than that it is a generative transformer, trained entirely on 1000-plex spatial transcriptomics data collected from thousands of human tumor resections.

What we can say is that there are three primary dimensions that TARIO can be scaled upon.

One, the obvious one, simply increasing the parameter count of the model.

Two, also an obvious one, increasing the number and diversity of input transcript tokens.

Three, more interesting, increasing the context length of the model. In the ideal setting, TARIO could ingest the totality of data points extracted from a tumor sample, but, unfortunately, this is roughly 10 million tokens’ worth of data. So in practice, TARIO is trained on smaller, local neighborhoods of cells, but we can expand this spatial context (in either training or inference) to observe how this improves model predictions. For clarity, we’ll refer to context length here in terms of ‘transcripts’, e.g., a context length of 10 means that TARIO will use up to 10 nearby gene transcripts from its immediate spatial neighborhood (roughly 100 x 100 um) to make its prediction for any given cell in a tumor sample.
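To make “a context length of N transcripts” concrete, here is a toy sketch that picks up to N transcripts around a cell. It assumes the nearest transcripts inside a square window are the ones used; how TARIO actually selects and orders its context is not something we describe here, so the function and its parameters are purely hypothetical.

```python
import numpy as np

def select_context(transcript_xy, cell_xy, context_length, window_um=100.0):
    """Indices of up to context_length transcripts nearest a cell, within a square window."""
    offsets = np.asarray(transcript_xy, dtype=float) - np.asarray(cell_xy, dtype=float)
    # Keep only transcripts inside the window centered on the cell.
    in_window = np.flatnonzero(np.all(np.abs(offsets) <= window_um / 2, axis=1))
    dist = np.linalg.norm(offsets[in_window], axis=1)
    nearest_first = in_window[np.argsort(dist, kind="stable")]
    return nearest_first[:context_length]

# Five toy transcripts (coordinates in um) around a cell at the origin.
transcripts = np.array([[0, 0], [1, 0], [5, 0], [60, 0], [2, 0]])
short = select_context(transcripts, cell_xy=(0, 0), context_length=3)
long = select_context(transcripts, cell_xy=(0, 0), context_length=10)
print(short)  # the three nearest transcripts
print(long)   # every transcript inside the window; the one at 60 um is excluded
```

Note the “up to”: when the window holds fewer transcripts than the context length, the context is simply shorter, which is why expanding the window itself is a separate lever from raising the context length.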

One last question: how do we judge performance improvements? For the purposes of this post, we’ll primarily use validation loss, which for TARIO is analogous to the “next token prediction” loss used in evaluating LLMs on text data. This means we are judging the models’ ability to reconstruct true spatial gene expression patterns. We recognize that these metrics are only a proxy for what ultimately matters, which is predicting patient response to drugs, and we’ll discuss that a bit more in the conclusion.

With that background, here are the results:

Parameter + context length scaling

We tested five model sizes (20M, 100M, 500M, 2B, and 4B parameters) and four context lengths (2048, 4096, 8192, and 16384 transcripts). Here is a figure of % improvement in validation loss across the genes that are measured in our spatial transcriptomics assay as scaling occurs.

The sweet spot, interestingly, is when both parameter count and context length—which again, corresponds to the number of gene transcripts the model sees in a local neighborhood of tissue—are scaled in tandem. A smaller model doesn’t benefit much from a very large context window, likely because the model simply doesn’t have the capacity to make use of all that information. And a larger model with a very small context window is mostly a waste of compute, since most of the parameters simply aren’t put to work. Here, the 4B model shows no signs of plateauing, even with the largest context size. This is as opposed to the smallest 20M model, which stops benefitting from increased context when moving from 8192 to 16384 transcripts, and even the 100M/500M models start to level out.

What are these models actually using these longer contexts for? We suspect the answer is what you’d expect: integrating information from hundreds of spatially nearby cells to construct semantically meaningful structures.

For example, consider the following image of TARIO predictions at different context lengths. At the largest window, a semicircle of immune-infiltrated tumor at the bottom (pink) and a tertiary lymphoid structure (green) at the top pop out, both of which are confirmed to exist via pathologist analysis of this patient’s histology. At the shortest context, the structures are only weakly inferred, and less separated.

In other words, at the largest scales, the model is learning something real about how tumors spatially organize themselves, not merely overfitting to local transcript noise. It is extremely likely that other clinically meaningful but non-human-legible structures are surfacing in these representations too.

How can we scale further from here on out? While we can train an even larger model in parameter count, the plot above suggests that we’ll get the most bang for our buck if we scale the context length at the same time. What is the limit there? In the current spatial windows that the model is operating in (100 x 100 um), there are typically ~20,000 transcripts at most. But the spatial windows themselves can also be dramatically expanded; our tumor samples are a hundred times larger than the current spatial context windows, and can contain tens of millions of transcripts in total.

Extending out even further, we are switching to whole-transcriptome spatial transcriptomics (as opposed to the 1000-plex spatial transcriptomics used in this model), which will allow us to extract an extra order of magnitude of information from each slide. In this world, the new upper limit is a context window of 100M transcripts.

And this is all without interleaving in the paired proteomic, pathology, and genomic modalities we extract from a patient’s tumor, which would almost certainly push up the total possible set of input tokens by another order of magnitude.

This is all to say: we do not see a ceiling in sight.

Data scaling

As Jack Morris best put it, There Are No New Ideas in AI… Only New Datasets. The historical arc of machine-learning in literally every field suggests that the single best predictor of progress is not really cleverness, but access to better data. Common Crawl begat GPT, the PDB begat AlphaFold, and so on. We strongly believe spatial transcriptomics, derived from actual human tumors, is going to be one of those datasets for cancer.

As a quick aside: this is part of why we’re skeptical of purely computational efforts in this space. If you don’t control your own data generation pipeline, you’re ultimately bottlenecked by whatever public datasets exist, and public biological datasets here are, almost universally, too small, too noisy, and too narrowly curated to support the type of scaling that models most benefit from! It’s a pain to collect your own data, but this section of the essay could not exist without that pain!

So: how does TARIO scale with data?

Well, there’s an interesting story to sketch out here. Let’s start with something basic, and train a 100M + 4096 transcript context TARIO model (chosen because it’s a decent model and cheap to train) on increasing fractions of our two largest cancer subtypes: LUAD and CRC (lung adenocarcinoma and colorectal cancer), and observe how well it does on gene-level predictions across those two cancer subtypes and PDAC (pancreatic cancer), which is an ‘unseen’ cancer subtype.

For reference, everything is represented as a % improvement over the loss given by the ‘3% of (LUAD + CRC)’ model, so higher is better.
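One plausible way to compute such a number is the relative reduction in loss against the baseline. The exact formula behind the plots is not spelled out here, so treat the function below (and its name, `pct_improvement`) as an assumption; the loss values are invented for illustration.

```python
def pct_improvement(baseline_loss: float, model_loss: float) -> float:
    """Percent reduction in loss relative to a baseline; positive means better."""
    if baseline_loss <= 0:
        raise ValueError("baseline_loss must be positive")
    return 100.0 * (baseline_loss - model_loss) / baseline_loss


# Hypothetical losses, for illustration only (not results from this post).
baseline = 2.00   # stand-in for the '3% of (LUAD + CRC)' model
candidate = 1.90  # stand-in for a model trained on a larger data fraction

print(round(pct_improvement(baseline, candidate), 2))
```

A model that matches the baseline scores 0, and a model with a higher loss than the baseline scores negative.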

Okay, neat! We see that LUAD and CRC clearly benefit from more of their subtype being added to the training set, with the improvement leveling off from 40% to 100%. This benefits PDAC as well (indicating some level of generalization to held-out cancer types), but the improvement is weaker. If we ultimately hope to create a model that is ‘pan-cancer’, aggressively gathering more data points only from singular indications does not seem like the way to go.

To help explore this further, we trained multiple models on combinations of our data: LUAD, CRC, PDAC, and also ‘everything’, which includes a few, less common cancer subtypes, including ovarian, breast, and others. Once again, we will present results in comparison to a baseline, which, in this case, is the ‘100% LUAD + CRC’ model. Again, higher is better.

Lots of numbers here! Let’s walk through the primary results.

The first takeaway is that new indications are worth collecting because they help that specific indication. You can observe this from the ‘100% of (LUAD + CRC + PDAC)’ model; the inclusion of PDAC dramatically improved performance on PDAC compared to a model trained on no PDAC at all (100% of (LUAD + CRC)).

The second, more interesting takeaway is that new indications are worth collecting because they help already-seen indications a little. You can observe this by the fact that there is a bump in LUAD + CRC validation performance as you include PDAC, and another bump when you include ‘everything’. A similar phenomenon is seen for PDAC validation performance, which improves from 3.21 to 3.98 when ‘everything’ is included.

The third takeaway is that specialized models are quite bad at doing anything beyond what they were trained on, and are beaten by a fully general model. The model trained entirely on PDAC data is quite good at predicting PDAC, but awful at everything else, and was still beaten on PDAC by the ‘100% of everything’ model.

Our fourth and final takeaway, which confirms what we’ve already seen, is that we do not observe especially strong generalization performance when evaluating a model on a cancer subtype it has never seen before. For instance, we observe that the ‘100% everything, but not PDAC’ model is only slightly better on PDAC than a model trained only on LUAD + CRC, despite the former model having seen a fairly broad universe of cancer. It may be the case that this disappears at a certain level of data scale, but we suspect that what we are stumbling across here is something that everyone eventually discovers in their application of machine-learning to the life-sciences: biology is very heterogeneous, and going out of distribution is easy.

Importantly, we consider this last point a strong affirmation that building our data collection infrastructure was the right call. If generalization across cancer subtypes were easy, as in, if training on LUAD and CRC magically transferred to PDAC, then there would be much less value in controlling your own sample pipeline. But it doesn’t! The only reliable way to get good performance on a cancer subtype is to have actually trained on that cancer subtype, which means the bottleneck is, and will continue to be, access to high-quality tumor samples across the full diversity of human cancers.

The practical takeaway is clear: we should be collecting as many different cancer types as possible, even if we can only get small numbers of samples from each. A hundred samples across ten rare cancer types are almost certainly more valuable than a thousand additional samples of NSCLC. This is not how most data collection efforts in this space have historically worked, but our data-driven conclusion is that it’s critical to embrace this “bitter lesson” for scaling foundation models of patient biology.

Conclusion

Parameter, data, and context scaling all work, and none of them seem to be plateauing.

Yet, despite how exciting all these results may be, one may be left with a rather important question: how much does this all matter? Ultimately, the goal of these models—and scaling them—is not to predict spatial gene expression, but to create a representation space of biology that is rich enough to do something clinically useful. Nothing else matters, not really.

And the one clinically useful thing that perhaps matters the most is: how well does TARIO predict patient response to cancer drugs? As in, can the model look at a patient's tumor before treatment and tell you whether they're going to respond?

Early results are encouraging. For some clinical datasets, it seems that larger-scale models do produce more accurate patient-level predictions. Unfortunately, the vast majority of our response data is sourced through partnerships with pharmaceutical companies, and we’re not at liberty to share those results just yet. The core challenge here is that TARIO requires spatial transcriptomics data for inference, and generating spatial transcriptomics requires physical tissue. And physical tissue with paired response data is hard to get! Most historical clinical trials did not bank tissue with spatial transcriptomic analysis in mind, and even when they did, the samples have often degraded beyond use. As a result, the substrate necessary to use this model is, to put it bluntly, rare.

But this is not true for H&E slides. Nearly every pathology lab in the world has archived H&E slides going back decades, millions of them, with extremely rich response data to go alongside them.

Because of this, we’re currently training a model to bridge that gap: converting H&E into predicted spatial transcriptomes, which would unlock TARIO-level analysis for essentially any tumor sample that’s ever been sliced and stained. We plan to cover this model in an upcoming post, discussing where we see these types of models being most useful, its own unique scaling behavior, and, most importantly, how well it predicts patient response across hundreds of cases.

Note: much of the heavy lifting behind this work was done by the extremely talented Daniel Bear, Eshed Margalit, Jake Schmidt, and Yubin Xie, amongst others. If you join us, you may get to work with them!

Unlike OCTO-VC, which is trained based on the self-supervised task of "masked autoencoding"—filling in randomly hidden pieces of the training data examples—TARIO is an autoregressive transformer, trained on a spatial transcriptomics-appropriate version of the infamous "next token prediction" task behind GPT, Claude, Gemini, and so on. Also, whereas OCTO-VC is meant to simulate the effects of multimodal patient context on individual virtual cells, the point of TARIO is to model much larger regions of spatial transcriptomics data.
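To illustrate the difference between the two training objectives, here is a minimal sketch of how each one turns the same toy sequence into (input, target) pairs. The gene tokens, the function names, and the 25% mask fraction are placeholders we made up; how TARIO actually orders transcripts in space into a sequence is not covered in this post and is not modeled here.

```python
import random


def next_token_pairs(tokens):
    """Autoregressive setup: predict token i from every token before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_example(tokens, mask_fraction=0.25, mask_token="<mask>", seed=0):
    """Masked-autoencoding setup: hide a random subset, then predict the hidden items."""
    rng = random.Random(seed)
    n_masked = max(1, int(len(tokens) * mask_fraction))
    hidden = set(rng.sample(range(len(tokens)), n_masked))
    corrupted = [mask_token if i in hidden else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(hidden)}
    return corrupted, targets


# Toy "transcript sequence"; the gene names are placeholders only.
tokens = ["EPCAM", "CD8A", "GZMA", "GZMK", "CD3E", "KRT8", "PTPRC", "CD4"]

pairs = next_token_pairs(tokens)
corrupted, targets = masked_example(tokens)

print(len(pairs), "next-token training pairs")
print(corrupted, targets)
```

The autoregressive version yields one prediction per position, each conditioned only on what came before; the masked version sees both sides of every gap but only predicts the hidden positions.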

GSK Licenses NOETIK’s OCTO-Virtual Cell Foundation Models to Transform Cancer Therapeutics R&D

Ron Alfa — Thu, 08 Jan 2026 19:19:27 GMT

We are announcing a five-year foundation model licensing agreement with GSK today.

This is a huge milestone for NOETIK, but it also represents a shift for the biopharma industry: the move from AI services collaborations to licensing of biological foundation models as AI infrastructure. This is among the first and largest transactions monetizing a biological foundation model as a scalable enterprise asset.

GSK is licensing our OCTO-VC virtual cell foundation models to work on non-small cell lung cancer and colorectal cancer. GSK's AI and therapeutics teams will also get access to one of the most extensive oncology multimodal spatial training sets in existence. They are committing $50 million in upfront payments and near-term milestones, along with an annual license fee to access these powerful capabilities.

The most important part of this is the potential impact these models will have on the development of new medicines for patients through integration into GSK’s drug discovery and development process.

Most drug discovery is still guesswork. Using self-supervised learning to train foundation models on one of the largest spatial biology datasets in oncology, we can interrogate cancer biology—from the level of the tissue to the genome—to simulate outcomes for therapeutics before they enter into the clinic. GSK now has access to these models to query tumor biology at a scale and resolution that wasn't possible before.

This partnership also defines a new asset class: foundation models as scalable enterprise assets, with a growing impact of AI-guided discovery and development on new medicines.

It’s going to be an exciting year!

For another take on this deal and the shift toward licensing biological foundation models, we recommend reading Andy Dunn’s exclusive feature in Endpoints News.

How do you use a virtual cell to do something actually useful? (3/3)

Abhishaike Mahajan — Fri, 19 Sep 2025 14:43:13 GMT

Note: this is part three of a series of three posts discussing how the therapeutics team at Noetik have used our virtual cell model, OCTO-VC, for practically useful, therapeutics-relevant tasks. The Introduction section will stay the same for each one; skip down to the next section if you’ve already read one of these before.

Part 1: Identifying anti-PD-1 responders

Part 2: Refining clinical trial eligibility to the right subgroups

Part 3: Virtual perturbations that shift T cell effector state in humans

Introduction

A lot of people have been very interested in ‘virtual cells’ lately. An exact definition is difficult to find, but one offered by a recent Cell perspective paper is the following:

Our view of [a virtual cell] is a learned simulator of cells and cellular systems under varying conditions and changing contexts, such as differentiation states, perturbations, disease states, stochastic fluctuations, and environmental conditions. In this context, a virtual cell should integrate broad knowledge across cell biology. [Virtual cells] must work across biological scales, over time, and across data modalities and should help reveal the programming language of cellular systems and provide an interface to use it for engineering purposes.

It’s an exciting idea! A computational simulation of a cell should be, theoretically, exceedingly useful for all sorts of clinical and preclinical research, by virtue of being able to eschew expensive wet-lab efforts in favor of cheaper (and potentially more reliable) GPU time. So it is no surprise that a great deal of research is already being actively done in this area. Elliot Hershberg, a venture capitalist at Amplify Partners, recently compiled a small summary of ongoing work here:

But as with every promised revolution in the life sciences, the revolution will hesitantly admit some nuances upon questioning.

Of highest concern is the fact that nearly all virtual cell model efforts being worked on are not virtual cells of human biology, but rather cancer cell lines, which—while convenient, well-characterized, and infinitely malleable—are far from the true physiological complexity of healthy or diseased human tissue. Due to this, figuring out how their insights extend into assisting with the drug development process is usually another hard problem in and of itself. But, to be clear, this doesn’t mean they aren’t useful. Biological research being done on cancer cell lines is a common phenomenon at the preclinical research stage, which is what nearly all virtual cell models are currently geared towards assisting.

This partially answers the question of why, despite how exciting ‘virtual cells’ seem, there are very few clear-cut examples of how such methods will ultimately be used. That vagueness is partly built into the reality of early-stage biology, so it’ll be years before the ultimate impact of this line of research is felt.

But one area of virtual cells that could have a concrete value-add in the immediate short-term is the deployment of them at the clinical stage of drug development. After all, this is where the real bottlenecks lie: trials are slow, expensive, and fraught with uncertainty, and even small improvements here can ripple into huge downstream gains. Of course, while the opportunity here is massive, the downside of touching this area is that it is hard to do. Very, very hard. As a result, there is almost no virtual cell effort meant to operate at the clinical stages of drug development, even though the translation problem there is, theoretically, ‘easy’.

Other than us. Noetik is building virtual cells with the explicit goal of assisting with clinical-stage problems: identifying responders to drugs and refining patient inclusion criteria for trials. At the same time, we believe that the tools we create in this process will also have powerful applications in pivotal, high-risk areas of preclinical research, such as target selection, while remaining grounded in human-level data. All three will be discussed in this essay series.

How do we do this? Our view is simple: the shortest path to usefulness is not maximal simulation on unrealistic biology, but grounded observations into realistic biology. We built that foundation first. Every datapoint that trains our virtual cell models comes from human tumor resections: 77M cells across ~2,500 patients across a dozen+ cancers, with paired spatial transcriptomics, spatial proteomics, exomes, and H&E’s from each one collected in our lab. In total, this is easily one of the largest datasets of its kind. And not a single cell line. We strongly believe that this means the path from in-silico workflows to something clearly translatable is far more direct: human to human, rather than detouring through unrealistic animal or cell models.

That difference matters! In cancer, translation is the bottleneck. Drugs fail, not because they don’t work in preclinical settings, but because they don’t work in real human patients.

Using this human-derived tumor data, one of the virtual cell models we’ve created is ‘OCTO-VC’. This model is entirely trained on 1000-plex spatial transcriptomes, and its core task is deliberately prosaic: given the transcriptome of a few neighboring cells, reconstruct the “center cell” transcriptome—over every cell, in every tumor, for every patient. We released a (very long) post late last year discussing it in depth for those who are curious about the machine-learning details, alongside an online demo.
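As a rough illustration of that "center cell" task, the sketch below builds (neighbor expression, center expression) training pairs from cell coordinates. The nearest-neighbor rule, the value of `k`, and the toy cells are all assumptions made for illustration; the real OCTO-VC input construction is described in the earlier post, not here.

```python
import math


def center_cell_examples(cells, k=2):
    """For each cell, pair its k nearest neighbors' expression (input)
    with the cell's own expression (reconstruction target)."""
    examples = []
    for i, center in enumerate(cells):
        others = [c for j, c in enumerate(cells) if j != i]
        # Sort the remaining cells by Euclidean distance to the center cell.
        others.sort(key=lambda c: math.dist(center["xy"], c["xy"]))
        neighbors = others[:k]
        examples.append({
            "neighbor_expr": [n["expr"] for n in neighbors],
            "target_expr": center["expr"],
        })
    return examples


# Toy cells: 2D position plus transcript counts for two placeholder genes.
cells = [
    {"xy": (0.0, 0.0), "expr": {"CD8A": 3, "GZMA": 1}},
    {"xy": (1.0, 0.0), "expr": {"CD8A": 0, "GZMA": 0}},
    {"xy": (0.0, 2.0), "expr": {"CD8A": 2, "GZMA": 4}},
    {"xy": (10.0, 10.0), "expr": {"CD8A": 0, "GZMA": 0}},
]

examples = center_cell_examples(cells, k=2)
print(examples[0])
```

Every cell takes a turn as the center, so a tumor with N cells yields N reconstruction examples.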

But what wasn’t discussed in that earlier post is how one can use models like this for clinically meaningful, non-trivial problems.

In this essay series, we hope to do exactly that, by showing three case studies of times where OCTO-VC was directly useful for our therapeutics team.

This is part 3, which will discuss how we would use the perturbation capabilities of the model to identify novel targets for T cell effector state modulation.

Virtual perturbations that shift T cell effector state in humans

Therapeutic Context:

Two particularly common lung cancer mutations you’ll often see people discussing are KRAS and STK11. KRAS is one of the most frequent oncogenic drivers (i.e. causes the cancer in the first place), whereas STK11 is a tumor suppressor gene whose inactivation disrupts cellular metabolism and immune signaling. And, while KRAS-mutant tumors are quite common, STK11 shows up alone less frequently, appearing more often alongside KRAS.

Tumors with this genetic combination are often referred to, unsurprisingly, as ‘KRAS STK11’. And, when the two mutations do appear together, the combination produces a particularly aggressive biology: tumors that are metabolically rewired, immunologically “cold,” and broadly resistant to both standard chemotherapies and immune checkpoint blockade. As expected, the clinical data consistently show the impact of this on patients: significantly shorter lifespans.

As of today, there are no approved therapies that directly address the KRAS STK11 genotype. Patients are typically treated with the same immunotherapy regimens offered to the broader non-small cell lung cancer population: immune checkpoint blockades. While this often works fine in KRAS patients, the efficacy of this class of drugs is far worse for the KRAS STK11 patients. And, given that the latter group isn’t particularly rare, millions of patients are likely underserved.

Question:

Which therapeutic targets, if targeted, would help cancer patients with KRAS STK11 mutations?

What we found:

Well, perhaps we should first ask a simpler question: what exactly is the fundamental difference between KRAS and KRAS STK11 patients in the cell types most relevant to immunotherapy? KRAS patients, after all, respond well to immunotherapy, so they could be considered a model population for understanding what “good” looks like in terms of immune biology. Afterwards, we can move on to assessing what targets are most relevant to shifting KRAS STK11 tumors to have that particular phenotype.

For both of these, we leaned heavily on OCTO-VC’s ability to simulate cellular states.

First, to assess differences between the two population genotypes, we set up a ‘virtual CD8⁺ T cell simulation’. Here, we asked OCTO-VC to predict the “expected”, or virtual, CD8⁺ T cell in the genetic and microenvironmental context of each patient's tumor. And what we found is that one of the strongest differences in gene expression between KRAS and KRAS STK11 patients was in a class of genes called granzymes, specifically GZMA and GZMK, which are known to be a practical readout of ‘CD8⁺ T-cell effector function’, the capacity for a T cell to kill cancer via cytotoxic mechanisms.
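A comparison like this can be sketched as ranking genes by the gap in mean predicted expression between the two cohorts. The per-patient values below are invented, `rank_gene_differences` is a hypothetical helper, and a plain difference of means is our simplification; a real analysis would add a statistical test and multiple-testing correction.

```python
from statistics import mean


def rank_gene_differences(pred_a, pred_b):
    """Rank genes by the absolute gap in mean predicted expression between two cohorts."""
    genes = pred_a[0].keys()
    diffs = {
        g: mean(p[g] for p in pred_a) - mean(p[g] for p in pred_b)
        for g in genes
    }
    return sorted(diffs.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Made-up virtual CD8+ T cell predictions, one dict per patient.
kras = [
    {"GZMA": 2.0, "GZMK": 1.8, "ACTB": 5.0},
    {"GZMA": 2.4, "GZMK": 2.0, "ACTB": 5.2},
]
kras_stk11 = [
    {"GZMA": 1.0, "GZMK": 0.9, "ACTB": 5.1},
    {"GZMA": 1.2, "GZMK": 1.1, "ACTB": 4.9},
]

ranked = rank_gene_differences(kras, kras_stk11)
for gene, diff in ranked:
    print(f"{gene}: {diff:+.2f}")
```

In this toy data the two granzymes rise to the top while the housekeeping-style gene barely moves, which is the shape of result described above.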

In the below plot, Gene A is GZMA and Gene B is GZMK. We’ll discuss in the next section why we believe these virtual cell predictions are a much better way to assess patient-level differences compared to the raw transcript values, but for now, we’ll move on.

Step one completed, we’ve identified a therapeutically relevant difference between the two genotypes. Importantly, the marker does meet some sanity checks too. Granzyme expression has shown strong associations with response to PD-1/PD-L1 therapy in human tissue, clearly indicating that it is clinically meaningful for immunotherapy. So, one particular axis of improving the prospects of KRAS STK11 patients could be to simply find some way to increase granzyme levels.

But understanding the best ways to do this has been far from straightforward. Cytokine stimulation and blockade of checkpoint molecules like TIGIT have both been shown in preclinical animal models to boost granzyme expression. Yet the current translational record is mixed: interventions that should theoretically raise granzyme levels often fail to yield durable tumor clearance in human clinical trials.

What’s going on here? Are granzymes the wrong lever to pull?

Perhaps, but there’s some reason to believe that some of the previous attempts to increase granzymes (in humans) did not, in fact, actually increase granzymes. After all, the molecular impact of at least one of those attempts seems to rely on entirely different mechanisms of action, ones that, empirically, ended up having no real patient benefit. The fundamental problem here may not be that granzymes aren’t worth modulating in humans, but rather, the targets that modulate them depend on the species. In other words, if you study mice only, you’re going to arrive at the wrong target.

After all, the structures of granzymes substantially differ between humans and mice. Broader than this is the fact that immunity is a very, very species-specific topic. Consider inflammation, a close relative to our subject, and what a 2013 PNAS paper has to say about the role of mouse studies here (bolding added by me):

Murine models have been extensively used in recent decades to identify and test drug candidates for subsequent human trials. However, few of these human trials have shown success. The success rate is even worse for those trials in the field of inflammation, a condition present in many human diseases. To date, there have been nearly 150 clinical trials testing candidate agents intended to block the inflammatory response in critically ill patients, and every one of these trials failed.

All this to say: if we want to modulate granzymes, and come to useful conclusions about how to do so, we should work directly with human data. One way to do this (perhaps the only way to do it!) is to rely on OCTO-VC’s ability to perform virtual perturbations in real human data.

With the same computational framework as before—asking the model to predict virtual CD8⁺ T cells in a specific tumor microenvironment—we added one more step: knocking out a single gene across the tumor. From there, we ask how the virtual CD8⁺ T cell’s transcriptome would shift in response to that, comparing it to the baseline expected transcriptome of that cell type. We can do this systematically across thousands of genes to run a virtual screen. The knockout serves as a proxy for a drug, and the predicted impact on the virtual CD8⁺ T cell serves as a proxy for patient response. Of course, this impact is not at all guaranteed to be causal, merely strongly correlated and conditional on the spatial environment, but it can lead to useful hints.

We did exactly this perturbation across our KRAS STK11 patient cohort, searching for targets that consistently increased expression of one of the granzymes, GZMA, in CD8⁺ T cells in real tumors. The virtual screen produced a clear signal: the top-scoring hit (Gene 20) was an adhesion protein, which we’ll call Target A.
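The shape of such a screen can be sketched in a few lines. Everything model-related below is a stand-in: `toy_predict_cd8` is a hypothetical hand-written function with invented weights, not OCTO-VC, and "knockout" is simplified to zeroing one gene in a per-patient expression dictionary. Only the loop structure (perturb, re-predict, compare to baseline, average across patients, rank) mirrors the procedure described above.

```python
from statistics import mean


def toy_predict_cd8(tumor_expr):
    """HYPOTHETICAL stand-in for the model: returns a 'virtual CD8+ T cell' readout.
    The weights are invented so that GENE_X suppresses GZMA in this toy."""
    gzma = 2.0 - 0.5 * tumor_expr.get("GENE_X", 0.0) + 0.1 * tumor_expr.get("GENE_Y", 0.0)
    return {"GZMA": gzma}


def knockout(tumor_expr, gene):
    """Return a copy of the tumor context with one gene set to zero."""
    perturbed = dict(tumor_expr)
    perturbed[gene] = 0.0
    return perturbed


def virtual_screen(patients, genes, predict, readout="GZMA"):
    """Mean change in the readout across patients for each single-gene knockout."""
    scores = {}
    for gene in genes:
        deltas = [
            predict(knockout(p, gene))[readout] - predict(p)[readout]
            for p in patients
        ]
        scores[gene] = mean(deltas)
    # Highest mean increase first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up tumor contexts for two patients.
patients = [
    {"GENE_X": 2.0, "GENE_Y": 1.0, "GENE_Z": 3.0},
    {"GENE_X": 1.0, "GENE_Y": 2.0, "GENE_Z": 0.5},
]

hits = virtual_screen(patients, ["GENE_X", "GENE_Y", "GENE_Z"], toy_predict_cd8)
for gene, delta in hits:
    print(f"{gene}: {delta:+.2f}")
```

As with the real screen, a high score here says the knockout is associated with a higher predicted readout, not that the relationship is causal.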

Target A is particularly intriguing because a study published only a few years ago showed that inhibiting this target (in co-cultured human tumors with T cells) leads to increased T cell expression of a granzyme. One nuance is that the granzyme that paper studied was GZMB, not GZMA, but the two can be quite correlated. But most compelling of all, beyond in-vitro results, is that there are two human cancer trials that have tested drugs meant to inhibit Target A!

How have these trials gone? It’s a mixed bag: patients responded decently in one trial, but not in the other. But both of them are using the exact same inclusion criteria: elevated levels of Target A. We strongly believe that this may have hurt both of the trial readouts.

Remember, inhibiting Target A in KRAS tumors is unlikely to yield immense benefits, since we suspect the primary mechanism-of-action of Target A inhibition is increasing granzyme activity, and those tumors already harbor abundant granzyme activity. In contrast, KRAS STK11 tumors, which have depressed granzyme levels, stand to gain the most from Target A inhibition. So, by enrolling patients purely on the basis of ‘high Target A expression’, the trials were almost certainly accidentally enriched for KRAS patients (by virtue of KRAS being found in 30% of all cancers, while KRAS STK11 is found in 10% of all cancers), inadvertently selecting a patient demographic least likely to respond to the drug.
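The enrichment argument is simple proportion arithmetic. The sketch below uses the 30% and 10% prevalence figures from the paragraph above and assumes, purely for illustration, that high Target A expression is equally likely in both genotypes, so enrollment mirrors prevalence; `enrolled_share` is a hypothetical helper.

```python
def enrolled_share(prevalence):
    """Share of each genotype among enrolled patients, ASSUMING enrollment
    simply mirrors prevalence (selection marker independent of genotype)."""
    total = sum(prevalence.values())
    return {genotype: p / total for genotype, p in prevalence.items()}


# Prevalence figures quoted in the text.
prevalence = {"KRAS": 0.30, "KRAS STK11": 0.10}

shares = enrolled_share(prevalence)
for genotype, share in shares.items():
    print(f"{genotype}: {share:.0%} of enrolled patients with either genotype")
```

Under that assumption, roughly three of every four enrolled patients carrying either genotype would come from the group with less to gain from the drug.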

Both of the trials, in other words, potentially stacked the deck against themselves. The correct strategy would have been to include KRAS STK11 status in the inclusion criteria, thereby focusing on patients with the greatest mechanistic rationale for benefit. But the trials did not do this and, as a result, the final efficacy readouts of the drug may be worse than they could’ve been.

Future Directions:

In one fell swoop, our virtual cell model uncovered not only a therapeutically relevant target, but also inclusion criteria for which patients it is most relevant to. Though there are already ongoing trials for this particular target, we strongly believe that the correct inclusion criteria for it are not being used.

Is there a principled way this could’ve been done without OCTO-VC?

For finding the granzyme difference between the two genotypes, theoretically yes, but practically no.

For finding Target A, neither practically nor theoretically.

One, on the granzyme difference: though granzymes are known to be markers of CD8⁺ T cell effector function, their modulation in genetic subcontexts like KRAS STK11 has not, as far as we can tell, been systematically mapped. But even if you had collected the same spatial transcriptomics dataset we had, discovering this relationship without OCTO-VC would’ve been challenging. Why? Because raw transcript values are, generally speaking, untrustworthy. To assess the transcriptional differences between two cohorts based on raw genes, you would need CD8⁺ T cells to actually be present in sufficient numbers within the tumor microenvironment and ensure that those cells were captured with sufficiently-high resolution.

This is rarely the case! In most of our samples, even correctly tagging a cell as being a CD8⁺ T cell is difficult, to say nothing of their transcripts, which are often sparse, heterogeneous, and noisy, making it difficult to detect consistent patterns. Virtual cells, produced by OCTO-VC, solve this bottleneck by being able to reconstruct what a CD8⁺ T cell state would look like in that genetic and microenvironmental context, conditioned on the spatial transcriptomic environments the model has observed across millions of cells.

And two, on finding Target A: even if you could extract a clean signal from the raw data, discovering targets that modulate the granzyme phenotype further would be largely intractable. The typical way people would study this further is via animal studies, and, as we’ve mentioned, there is a massive gulf between what mouse immune systems tell you and what human immune systems tell you. The only way to reliably explore the area is via screening targets in a human, in vivo context, which necessitates the usage of virtual cell models like OCTO-VC to do it at any reasonable scale.

And though Target A was discovered without OCTO-VC, its discovery relied on cell culture data. The results of this coincidentally translated to humans, but, given how often cancer drugs fail, it’s a very expensive coin flip to make and not something we consider particularly principled.

These results are, to put it lightly, exciting. The history of cancer drug development has shown us time and time again that translation is the bottleneck. The problem has always been that what works in a mouse, or in a dish, rarely works in a patient. That’s it. Fixing this is how we make a dent in stopping the millions of lives that are lost to cancer every year. And we don’t fix it by being able to predict the results of a functional assay, or cell-line experiment, or mouse experiment. We fix it by trying to predict what happens when a human being with a real tumor gets treated. That’s the only question that matters. Everything else is a proxy, a bad proxy, one that has led to 90%+ of all cancer drugs failing during clinical trials.

We are not the first ones to claim that predictions like that are possible, but we believe that we are one of the first to show concrete evidence of it actually being done. And remember, the results we have today are the worst ones we’ll ever have. Each day, the practical utility of the model that fueled these results gets better and better, both as its underlying dataset grows richer and our understanding of how to best deploy it is refined.

The trajectory to us feels obvious; in time, models like OCTO-VC will become routine parts of how oncology as a field functions. In such a world, patients don’t waste precious time on ineffective treatments, entirely new targets that once seemed unworkable become viable options, and trial populations are enriched for the responders who stand to benefit the most. We have strong conviction that not only is this world possible, but that it is already beginning to emerge.

If any of this seems interesting, please reach out to info@noetik.ai to chat further.

If you’d like to read our prior case studies, here is Part 1 and Part 2.

Finally, if you’re curious to understand more ML-specific technical details about how the virtual CD8⁺ T cells actually work, we have an older post that discusses exactly that.

How do you use a virtual cell to do something actually useful? (2/3)

Abhishaike Mahajan — Fri, 12 Sep 2025 15:33:06 GMT

Note: this is part two of a series of three posts discussing how the therapeutics team at Noetik have used our virtual cell model, OCTO-VC, for practically useful, therapeutics-relevant tasks. The Introduction section will stay the same for each one; skip down to the next section if you’ve already read one of these before.

Part 1: Identifying anti-PD-1 responders

Part 2: Refining clinical trial eligibility to the right subgroups

Part 3: Virtual perturbations that shift T cell effector state in humans

Introduction

A lot of people have been very interested in ‘virtual cells’ lately. An exact definition is difficult to find, but one offered by a recent Cell perspective paper is the following:

Our view of [a virtual cell] is a learned simulator of cells and cellular systems under varying conditions and changing contexts, such as differentiation states, perturbations, disease states, stochastic fluctuations, and environmental conditions. In this context, a virtual cell should integrate broad knowledge across cell biology. [Virtual cells] must work across biological scales, over time, and across data modalities and should help reveal the programming language of cellular systems and provide an interface to use it for engineering purposes.

But as with every promised revolution in the life sciences, the revolution will hesitantly admit some nuances upon questioning.

Of highest concern is the fact that nearly all virtual cell model efforts being worked on are not virtual cells of human biology, but rather cancer cell lines, which, while convenient, well-characterized, and infinitely malleable, are far from the true physiological complexity of healthy or diseased human tissue. Due to this, figuring out how their insights extend into assisting with the drug development process is usually another hard problem in and of itself. But, to be clear, this doesn’t mean they aren’t useful. Biological research being done on cancer cell lines is a common phenomenon at the preclinical research stage, which is what nearly all virtual cell models are currently geared towards assisting.

That difference matters! In cancer, translation is the bottleneck. Drugs fail, not because they don’t work in preclinical settings, but because they don’t work in real human patients.

But what wasn’t discussed in that earlier post is how one can use models like this for clinically meaningful, non-trivial problems.

In this essay series, we hope to do exactly that, by showing three case studies of times where OCTO-VC was directly useful for our therapeutics team.

This is part 2, which will discuss how we would use the model to expand clinical trial eligibility in a real clinical trial.

Refining clinical trial eligibility to the right subgroups

Therapeutic Context:

In our last essay, we talked about how tumor complexity is beyond any human to genuinely grasp, and how we arrived at an understanding of anti–PD-1 response through model embeddings. But in cases when explicit response labels aren’t available, the challenge then shifts. We cannot ask whether responders and non-responders separate. Instead, we must reason through proxy; using machine-learned patterns that recur across tumors that closely correlate with suspected mechanisms of action (MoA) of the drug being studied.

Some background information: as of today, most patient inclusion criteria in cancer clinical trials rely on disturbingly coarse, overly reductionistic measures, such as the % of PD-L1 expression across a tumor biopsy (like the previous case study), whether the histology says a tumor is “triple-negative,” or whether sequencing shows the presence of a particular mutation. But even when the cancer field explores the value of more complex markers (which oncologists clearly recognize as important!), the published signals are nearly always fragmentary, built around a single local motif, and rarely capture the full neighborhood or architectural context of a tumor microenvironment.

Why is this? Why aren’t markers more complex? Much of it comes down to the fact that the logistics of designing and validating even mildly complex assays are essentially intractable. Every hypothesis requires years of prospective planning, the right tissue samples, and the ability to multiplex the correct set of markers from the start. If an important signal is missed, the entire study has to be restarted. This is to say nothing of ML-enabled biomarkers, operating across dozens or hundreds of axes at once, which are virtually never explored.

As a consequence, trial sponsors are forced into the simplest, most reductive criteria, not because they believe those are the best biology, but because they are the only practical levers available within trial timelines.

Question:

One of the things we’ve been most excited about is using OCTO-VC to take previously impractical hypotheses for drug response prediction, and test them out at scale.

The question here requires some extra context and, because we’re actively exploring it, some obfuscation.

Last year, the FDA halted a late-stage cancer clinical trial run by a large biopharma, not because efficacy wasn’t observed, but because, midway through the trial, efficacy was observed only in ‘Subgroup Z’. This forced the biopharma to submit a protocol amendment restricting follow-up trials to the Subgroup Z cohort. This is quite a blow to them, since that cohort is a fair bit smaller!

But, as is typical in cancer trials, patients wind up in Subgroup Z due to an extremely coarse biomarker. Theoretically, the drug in this trial acts on a well-trodden target, so there should be a much better way to separate out the ideal patient population. But, like we mentioned, doing any sort of large-scale biomarker study would normally require an enormous multi-year program: prospectively designing assays, collecting new tissue samples, and validating them across multiple sites. That’s the standard, slow way.

With OCTO-VC, we can invert the order of operations. Instead of starting with a hypothesis, locking in the markers, and then waiting years for data to trickle back, we start with the existing atlas of human tumors and ideate on new ways to separate out responder/non-responder patients.

So, our question is: can OCTO-VC come up with better stratification criteria for selecting responders?

What we found:

First, a basic sanity check passed: the Subgroup Z cohort is quite distinct in our embedding space. In the graph below, it is the small, bright yellow-green segment on the left.

And we knew that the yellow-green segment was filled with responders to the drug. So an easy question to ask OCTO-VC is: what other, more complex marker overlaps with that segment and has a mechanistic rationale for the overlap? After some iterative searching, our therapeutics team found a strong signal: a particular ‘tumor microenvironment concept’ that seems highly enriched in Subgroup Z, but also extends outside of it. While we won’t expand on what the concept is, we believe that it is unlikely to be noise given how biologically relevant it is to the therapy in question.

Here, that concept is shown on the same embedding plot through color, with high values meaning ‘highly enriched for that concept’:

Circled is the true-response cohort, which is high in that concept. But notice that, cutting through the region outside of Subgroup Z, there is another large pocket of patients high in that concept.

In other words, we believe this biopharma has set overly conservative inclusion criteria. By doing so, they not only leave billions of dollars in potential revenue untapped but, more importantly, leave an immense number of patients without access to a therapy that has a clear mechanistic rationale for meaningfully improving, or even saving, their lives.
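The comparison described above can be sketched in a few lines. This is only an illustrative sketch, not the analysis our therapeutics team ran: the function name `concept_summary`, the per-patient concept scores, the Subgroup Z flags, and the 0.7 threshold are all hypothetical toy values, since the real concept and its scoring are not described here.

```python
import numpy as np

def concept_summary(scores, in_subgroup, threshold):
    """Summarize how a concept score splits across a trial subgroup.

    scores      : per-patient concept enrichment (hypothetical values)
    in_subgroup : per-patient flag, True if the patient is in Subgroup Z
    threshold   : cutoff above which a patient counts as 'high' (assumed)
    """
    scores = np.asarray(scores, dtype=float)
    in_subgroup = np.asarray(in_subgroup, dtype=bool)
    high = scores >= threshold
    return {
        # fraction of Subgroup Z patients who are high in the concept
        "frac_high_inside": float(high[in_subgroup].mean()),
        # fraction of non-Subgroup-Z patients who are high in the concept
        "frac_high_outside": float(high[~in_subgroup].mean()),
        # candidates for expanded eligibility: high concept, outside Subgroup Z
        "n_high_outside": int((high & ~in_subgroup).sum()),
    }

# Toy data: three Subgroup Z patients and seven others.
scores = [0.90, 0.80, 0.85, 0.20, 0.10, 0.75, 0.30, 0.95, 0.15, 0.05]
in_z = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

summary = concept_summary(scores, in_z, threshold=0.7)
print(summary)
```

In this toy setup, the concept is high across all of Subgroup Z while also flagging a minority of patients outside it, which is the shape of result described above. Any real cutoff would need to be justified biologically and validated, not picked by eye.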

Future Directions:

One striking aspect of OCTO-VC’s embedding space, something that continues to surprise even us, is how clearly it aligns with therapeutic problems, despite having had absolutely no access to perturbational or labeled data. After all, OCTO-VC could separate higher-order cancer definitions (e.g., Subgroup Z vs. not-Subgroup-Z) directly from human tissue, and, with some human judgement, was able to surface MoA-relevant subtypes within them; ones that would be far too costly to ever pull out in the real world. And this phenomenon of ‘clinically meaningful organization’ seems to recur across the embedding space!

As an example, basic Leiden clustering of an OCTO-VC embedding space (the same one discussed in this essay, but different from the previous essay’s PD-1 embedding space) demonstrates tissue-level characteristics that align with therapeutic MoAs. Annotations here are provided by humans:

How is this possible? How can the model, without explicit supervision, uncover patterns that map so directly onto biological mechanisms and therapeutic relevance?

One argument is that cancer is a particularly special disease, extremely well-suited to self-supervision tasks. Unlike many other therapeutic areas, oncology has historically been driven by mechanism-based stratification; cancer drugs are often developed and approved not for a broad, undifferentiated population, but for genetically or phenotypically defined subgroups. As a result, the very axes that determine drug response are the same ones that structure our human, tissue-level data. And machine learning is very, very good at dissolving complex, high-dimensional data into those underlying axes.

Of course, turning this work, and others like it, from an analysis into an actual regulatory argument is often another challenge in and of itself for many virtual cell efforts. A model can suggest that patients with a certain microenvironmental signature are likely to respond, but to satisfy regulators, that suggestion has to be translated into a practical assay. But this is the core of what makes “virtual cells” particularly useful if they are derived from human data, and not cancer cell lines: this translation is straightforward.

After all, the signatures that OCTO-VC surfaces always have a direct connection to real human tumors. The signatures are often intricate, something that would require years of effort and millions of dollars to define through traditional approaches, but still can be boiled down to a set of measurable markers, morphologies, or local interactions if needed. As an example, the tumor microenvironment concept we discussed above is something that is very amenable to being turned into an assay.

We strongly believe that this ability to create complex definitions of responder cohorts, given only hours of GPU time, can not only expand patient cohorts (as we’ve discussed here), but also rescue otherwise unpromising drugs and open entirely new therapeutic opportunities that were previously invisible under traditional stratification methods.

If this sounds interesting, and you think you’d like to talk further, please reach out to info@noetik.ai!

In the final part of this series, we’ll discuss how, even though OCTO-VC is most useful for clinical-stage problems like patient selection, the same human-tissue-grounded programs also help prioritize targets, without changing the core principle: grounding every insight in real human tissue.

How do you use a virtual cell to do something actually useful? (1/3)

Abhishaike Mahajan — Fri, 05 Sep 2025 13:12:03 GMT

Note: this is part of a series of three posts discussing how the therapeutics team at Noetik has used our virtual cell model, OCTO-VC, for practically useful, therapeutics-relevant tasks. The Introduction section will stay the same for each one; skip down to the next section if you’ve already read one of these before.

Part 1: Identifying anti-PD-1 responders

Part 2: Refining clinical trial eligibility to the right subgroups

Part 3: Virtual perturbations that shift T cell effector state in humans

Introduction

A lot of people have been very interested in ‘virtual cells’ lately. An exact definition is difficult to find, but one offered by a recent Cell perspective paper is the following:

Our view of [a virtual cell] is a learned simulator of cells and cellular systems under varying conditions and changing contexts, such as differentiation states, perturbations, disease states, stochastic fluctuations, and environmental conditions. In this context, a virtual cell should integrate broad knowledge across cell biology. [Virtual cells] must work across biological scales, over time, and across data modalities and should help reveal the programming language of cellular systems and provide an interface to use it for engineering purposes.

But as with every promised revolution in the life sciences, the revolution will hesitantly admit some nuances upon questioning.

That difference matters! In cancer, translation is the bottleneck. Drugs fail, not because they don’t work in preclinical settings, but because they don’t work in real human patients.

But what wasn’t discussed in that earlier post is how one can use models like this for clinically meaningful, non-trivial problems.

In this essay series, we hope to do exactly that, by showing three case studies of times where OCTO-VC was directly useful for our therapeutics team. This is part 1, which will discuss how we used the model to find true anti-PD-1 responders inside PD-L1–positive cohorts.

Identifying anti-PD-1 responders

Therapeutic Context:

One of the most common (and effective) classes of cancer therapy is anti-PD-1 drugs. The underlying biology is straightforward: many tumor cells express PD-L1 on their surface, which binds PD-1 receptors on T cells to dampen T-cell activity. Anti-PD-1 (or anti-PD-L1) antibodies block this inhibitory interaction, allowing T cells to attack the tumor. But not all cancers rely on this pathway. Some tumors have little to no PD-L1 expression, meaning that drugs operating on that mechanism would, in principle, have limited effect. This has led to a common clinical rule of thumb: patients are considered potential candidates for anti-PD-1/PD-L1 therapy if ≥1% of their tumor cells express PD-L1, in which case they are called PD-L1+. But this still isn’t perfect. Even among PD-L1+ patients, roughly half do not respond to this therapy, and it is unclear why.

Question:

Can OCTO-VC improve how well we can identify responders to this drug?

What we found:

Seeing how well OCTO-VC can help us here is quite straightforward: create a high-dimensional embedding of each of the tumor cores that we have responder data for, and see if the embeddings of responders differ from those of non-responders. And, most importantly: is the resulting separation better than a good baseline, namely the usual patient inclusion criteria?

You may be instinctively surprised by the fact that OCTO-VC’s value here doesn’t come from the usual virtual cell trick of simulating perturbations, but instead from the far simpler act of representation. But this is, in fact, the most reliable way to use models like this; it allows our underlying, extremely rich data to ‘speak for itself’ without needing human intervention.

Using a small cohort of patients (15 responders and 24 non-responders, all meeting the “ideal candidate” criterion above of a PD-L1 tumor proportion score ≥1%), we generated an embedding for each of their tumors using OCTO-VC. The graph below shows the embeddings for all our core samples, reduced to two dimensions via PCA. The ones we have PD-1 responder data for are colored in either green or magenta.

The responders seem to mostly be in the lower right quadrant, so there’s meaningful separation in the entirely unsupervised embedding. Training a basic model on the PCA reduction lets us quantify the signal: its predictions match up well with the response cluster and perform above chance. Here is the associated confusion matrix of the trained model:

Remember, we’re working off a pre-selected patient population here. If “1% of tumor cells expressing PD-L1” were a really good biomarker with no real room left to improve upon, we wouldn’t be able to subdivide the likely-to-respond patient population any further. The fact that we’re able to easily spot the “response cluster” in the embedding space is encouraging to us, and implies that OCTO-VC is capturing response-relevant biology that the 1% rule misses.
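For readers who want to see the shape of this analysis, here is a minimal sketch of the workflow: reduce embeddings to two dimensions with PCA, fit a basic classifier on the reduction, and tabulate a confusion matrix. Everything in it is an assumption made for illustration: the embeddings are synthetic random vectors (not OCTO-VC outputs), the 64-dimensional size is arbitrary, and the nearest-centroid classifier with leave-one-out evaluation is a stand-in, since the post does not say which “basic model” was used. Only the cohort sizes (15 responders, 24 non-responders) come from the text.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

def loo_nearest_centroid(Z, y):
    """Leave-one-out predictions from a nearest-centroid classifier."""
    idx = np.arange(len(y))
    preds = np.empty_like(y)
    for i in idx:
        train = idx != i  # hold out sample i when computing centroids
        centroids = [Z[train & (y == c)].mean(axis=0) for c in (0, 1)]
        dists = [np.linalg.norm(Z[i] - c) for c in centroids]
        preds[i] = int(np.argmin(dists))
    return preds

def confusion_matrix(y_true, y_pred):
    """2x2 matrix: rows are true labels, columns are predicted labels."""
    cm = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Synthetic stand-in embeddings (hypothetical, NOT real model outputs).
rng = np.random.default_rng(0)
n_resp, n_non, dim = 15, 24, 64
non = rng.normal(0.0, 1.0, (n_non, dim))
resp = rng.normal(0.0, 1.0, (n_resp, dim))
resp[:, :4] += 4.0  # assume responders differ along a few dimensions

X = np.vstack([resp, non])
y = np.array([1] * n_resp + [0] * n_non)  # 1 = responder, 0 = non-responder

Z = pca_2d(X)                     # unsupervised 2-D reduction of all samples
pred = loo_nearest_centroid(Z, y)
cm = confusion_matrix(y, pred)
accuracy = np.trace(cm) / cm.sum()
print(cm)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```

Note that the PCA here is fit on all samples, labels unseen, which mirrors the unsupervised reduction described above. With only 39 labeled patients, any held-out estimate like this is noisy, so it is best read as a sanity check on the separation and not as a performance claim.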

Future Directions:

The cancer field has been through a lot of definitions of the ‘most’ important factor to care about in a tumor. At first, it was about histology. Lung cancer could be separated into small-cell, squamous, non-small-cell, and so on. Then arrived the genetic era, when EGFR mutations or ALK fusions could, by themselves, dictate treatment. Now we are in the protein marker era, with PD-L1 expression being the most commonly deployed stratifier for checkpoint blockade.

None of these were wrong, but each was only capturing a small fragment of the whole.

Tumors are not uniform entities, but rather shifting ecosystems of cells, pathways, and immune structures. Understanding this complexity, to a large extent, may be beyond human intuition or comprehension. We at Noetik strongly believe that machine intelligence is the only way to grasp the tumor microenvironment in full.

The building of OCTO-VC, and the fact that anti-PD-1 responders are so clearly separated in its embedding space, implies to us that this conviction is directionally accurate. Also, since the underlying data is all sourced from human tumors, we can easily pin down what other biological features predicted anti-PD-1 responders correlate with, both to reassure ourselves that they make sense and to confirm that they can be converted to usable assays. And indeed, both hold: CD8 infiltration, high interferon gamma levels, and antigen presentation markers (to name a few) align with responder status.

Of course, the real significance here is not in what we can do for anti-PD-1 therapies (many people have worked on this exact subject before) but rather in how easily our methodology can be extended to any arbitrary cancer drug. In other words, if OCTO-VC can isolate a subset of checkpoint responders from within an already enriched PD-L1+ population, then it should also be able to refine other trial cohorts. Our first partnership is with Agenus, a public biotechnology company, to see if our model is capable of accurately distinguishing between responders and non-responders from a recent clinical trial that Agenus ran. We’re looking forward to reporting our results here!

If any of this seems interesting, and you’d like to chat further, please reach out to info@noetik.ai!

Of course, one won’t always have response data. We think there are useful ways that OCTO-VC can be used in those situations as well, which is something we’ll discuss in part 2 of this series, covering how our model can be used for expanding eligibility in clinical trial design even when lacking access to true response/non-response data.