There are questions in biology that seem simple until one tries to answer them precisely. What exactly does a gene do? For decades, the answer has been piecemeal: each gene, a function; each function, an experiment. But inside a cell, reality is much less orderly. Genes do not work alone, nor do they follow a single script. They organize, combine, and change roles depending on the context. More than parts, they are actors in a play that is continually rewritten.
A study published in Patterns and led by scientists at the Icahn School of Medicine at Mount Sinai proposes a radically different way to address this problem. Instead of analyzing genes one by one, they have developed an artificial intelligence model capable of learning how they work together. The result is something that can be described as the first attempt to build a functional “map” of the genes within our cells.
The tool, called GSFM (short for Gene Set Foundation Model) and freely available online, is based on an idea borrowed from the world of language. Models like ChatGPT do not interpret words in isolation, but through the sentences in which they appear. “Bank” does not mean the same thing in a conversation about economics as it does in one about parks. Meaning emerges from context. Something similar happens with genes.
“Genes rarely act alone,” says Avi Ma’ayan, who led the study. “They participate in multiple biological processes and form distinct groupings depending on where and when they are active. The same gene can play different roles depending on the context, just as a word changes its meaning depending on the phrase.” GSFM’s aim is precisely to learn that context.
To achieve this, Ma’ayan’s team compiled millions of “gene sets” from scientific studies and gene expression databases. Each of those sets represents a kind of snapshot: which genes appear together in a given condition, disease, or biological process. Instead of focusing on how intensely a gene is expressed (the classic approach), the model looks at its context: which other genes it appears with.
Training is more like solving a puzzle than memorizing data. The system is shown a portion of a set of genes and asked to guess which ones are missing. Repeated millions of times, this process allows artificial intelligence to discover hidden patterns: which genes usually collaborate, which ones appear in similar situations, which combinations make biological sense.
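The puzzle described above can be sketched in a few lines of Python. This is only an illustration of the general idea of hiding part of a gene set and asking for the missing members. It is not the GSFM training code, and the function name `make_masked_example`, the `mask_fraction` parameter, and the example gene set are assumptions made for the sketch.

```python
import random


def make_masked_example(gene_set, mask_fraction=0.3, seed=None):
    """Split one gene set into a visible part and a hidden part to be guessed."""
    genes = sorted(set(gene_set))  # deduplicate and fix an order
    if len(genes) < 2:
        raise ValueError("need at least two genes to hide one and show one")
    rng = random.Random(seed)
    # Hide at least one gene, but always leave at least one visible.
    n_hidden = min(len(genes) - 1, max(1, int(len(genes) * mask_fraction)))
    hidden = set(rng.sample(genes, n_hidden))
    visible = [g for g in genes if g not in hidden]
    return visible, sorted(hidden)


# Illustrative gene set (hypothetical example, not taken from the study's data).
inflammation_set = ["TNF", "IL6", "IL1B", "NFKB1", "CXCL8"]
visible, hidden = make_masked_example(inflammation_set, mask_fraction=0.4, seed=0)
print("shown to the model:", visible)
print("to be guessed:", hidden)
```

A real model would see millions of such pairs and be scored on how well it ranks the hidden genes; the sketch only shows how a single training example might be built.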
Over time, the model builds an internal representation of those relationships. Not a static list, but a dynamic network of associations. That is where the “map” comes in. It is not a physical map, nor a literal image of the cell’s interior, but a frame of reference: a way to place each gene in relation to others and to understand what role it can play in different contexts. And the implications are profound.
One of the most immediate uses is to shed light on genes we know little about. If a gene consistently appears together with others involved in a specific process, such as inflammation or cell growth, the model can infer its possible function before any experiment has been run. It does not replace the laboratory, but it suggests where to look.
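The reasoning in the paragraph above resembles what is often called guilt by association. The following sketch uses plain co-occurrence counting, a much simpler technique than the learned model the article describes, to show how a poorly characterized gene could inherit candidate functions from its frequent companions. The gene sets, the placeholder gene `GENE_X`, and the function labels are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(gene_sets):
    """Count how many gene sets contain each unordered pair of genes."""
    counts = Counter()
    for gene_set in gene_sets:
        for a, b in combinations(sorted(set(gene_set)), 2):
            counts[(a, b)] += 1
    return counts


def suggest_functions(query, gene_sets, known_functions):
    """Rank candidate functions for `query` by the annotations of its partners."""
    scores = Counter()
    for (a, b), n in cooccurrence_counts(gene_sets).items():
        if query == a:
            partner = b
        elif query == b:
            partner = a
        else:
            continue
        # Each shared gene set counts as one vote for the partner's functions.
        for function in known_functions.get(partner, ()):
            scores[function] += n
    return scores.most_common()


# Hypothetical data: GENE_X is a placeholder for an unstudied gene.
gene_sets = [
    ["GENE_X", "TNF", "IL6"],
    ["GENE_X", "IL6", "IL1B"],
    ["GENE_X", "MYC"],
    ["MYC", "CCND1"],
]
known_functions = {
    "TNF": ["inflammation"],
    "IL6": ["inflammation"],
    "IL1B": ["inflammation"],
    "MYC": ["cell growth"],
    "CCND1": ["cell growth"],
}
print(suggest_functions("GENE_X", gene_sets, known_functions))
```

In this toy data, the placeholder gene shares more sets with inflammation genes than with growth genes, so inflammation ranks first. The output is a hypothesis to test at the bench, not a confirmed annotation.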
It also makes it possible to identify genes involved in diseases, suggest new therapeutic targets, or reinterpret large volumes of biological data that until now have been difficult to decipher. In a field dominated by complexity, having a system that organizes that information can make a decisive difference.
Perhaps one of the most striking aspects of the model is its ability to anticipate discoveries. In the tests carried out, GSFM was trained on data published up to a specific date and then evaluated on its ability to predict relationships that would only be confirmed in subsequent studies. In many cases, it was right.
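This kind of test is sometimes called a time-split or retrospective evaluation. The sketch below shows the bookkeeping involved under simple assumptions: each record is a dated gene-function association, and a prediction counts as confirmed if the same association first appears after the cutoff. The record layout, the names `temporal_split` and `fraction_confirmed`, and all of the data are hypothetical, and the study's actual evaluation protocol may differ.

```python
from datetime import date


def temporal_split(records, cutoff):
    """Separate records published up to the cutoff from those published later."""
    earlier = [r for r in records if r["published"] <= cutoff]
    later = [r for r in records if r["published"] > cutoff]
    return earlier, later


def fraction_confirmed(predictions, later_records):
    """Share of predicted (gene, function) pairs that appear in later records."""
    if not predictions:
        return 0.0
    confirmed = {(r["gene"], r["function"]) for r in later_records}
    return len(set(predictions) & confirmed) / len(set(predictions))


# Hypothetical records; gene names are placeholders.
records = [
    {"gene": "GENE_A", "function": "inflammation", "published": date(2018, 5, 1)},
    {"gene": "GENE_X", "function": "inflammation", "published": date(2022, 3, 1)},
    {"gene": "GENE_Z", "function": "cell growth", "published": date(2023, 7, 1)},
]
training, future = temporal_split(records, cutoff=date(2020, 1, 1))

# Predictions a model trained only on `training` might have made.
predictions = {("GENE_X", "inflammation"), ("GENE_Y", "cell growth")}
print(fraction_confirmed(predictions, future))
```

The key design point is that nothing published after the cutoff is visible during training, so any confirmed prediction reflects patterns learned from earlier data rather than knowledge of the later result.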
Not because it “knew” the answer, but because it had learned the rules of the system well enough to infer it. That nuance is important. This type of artificial intelligence does not discover new laws in the classical sense, but it does reveal patterns that were hidden in the accumulated data. It is a form of knowledge that emerges from scale.
The conceptual change is also relevant. Until now, many models in computational biology have been based on gene expression data, that is, how strongly a gene is activated under certain conditions. GSFM introduces a different perspective by focusing on gene sets, a less exploited but extraordinarily rich source of information, because it captures functional relationships directly.
In the long term, Ma’ayan’s team envisions integrating this system with other artificial intelligence models. For example, it could be combined with language models to generate understandable explanations of gene functions, or with pharmacological models capable of predicting how drugs interact with cells. The underlying idea is to build a kind of “ecosystem” of artificial intelligences that collaborate in understanding and manipulating biological systems.
Important as it is, this is not a definitive or complete map. It is a first approximation, built from the available data, and its usefulness will depend on how it is integrated with experimental work. But even as a starting point, it points in a clear direction.
Biology has long advanced by breaking down complex systems into smaller and smaller parts. Now it is beginning to follow the opposite path: rebuilding the whole. If genes are words, this model is beginning to understand sentences. And in that step, from the isolated to the connected, may lie one of the keys to understanding how life really works, and how to intervene in it with greater precision.