Alex Carlin

Why is progress slow in generative AI for biology?

Progress in generative AI for text, image, video, and sound generation is proceeding rapidly. In contrast, progress in generative ML for biology is slow, at least compared with the speed of progress in processing human-created data like text and images. This is primarily due to the high cost of data validation.

Unlike other areas of generative ML, where data is readily available and inexpensive, generative ML for biology faces significant challenges in both obtaining and validating new data. This disparity is evident in two main ways.

First, the source material for NLP is human-created text, which is abundant. You can obtain this data by crawling the web or using one of the many crawls available. In contrast, collecting biological sequence data requires a huge amount of work: physical DNA has to be collected, processed, and then sequenced. This process also required the development of new technologies for collection and sequencing, each of which introduces its own biases and complexities.

Critically, there is no readily available biological equivalent of a textbook, a high-quality, instructional piece of text. The closest thing available is UniProt, a carefully curated and deduplicated set of protein sequence data. And there's only one of those.

Second, the process of validation differs greatly between the two fields. In NLP, evaluating a computer-generated text is straightforward; humans can quickly determine if the text is coherent and meaningful. However, in biology, validating whether a sequence of DNA or protein letters "makes sense" is much more complex. Even computationally predicting whether a sequence will fold correctly and perform a desired function requires significant computation, and the result is simply another prediction. To actually test the "meaning" of a biological sequence, you must design the DNA that encodes it, insert that DNA into cells, culture the cells in the right environment, and harvest them. The final step involves extracting the molecule and testing it via biochemical assay, a process that is time-consuming, costly, and often requires specialized equipment. (This is likely the largest understatement in the blog so far.)

In essence, NLP allows for almost immediate evaluation of outputs by a human, a machine that is very, very good at recognizing whether text or an image is coherent. Biological machine learning, by contrast, requires weeks of expensive laboratory work to validate every output.

Machine learning advances through datasets and evaluations, not merely algorithms. Therefore, the scarcity and high cost of biological data and its experimental validation slow the progress of machine learning in biology compared to NLP, where data generation and evaluation are relatively inexpensive.

That's why machine learning in biology will advance more slowly than in NLP. The availability and cost of data and the complexity of validation simply put a significant limit on progress that is not present in other fields of generative ML.