Protein language model

Technology

A type of AI model, analogous to an LLM, trained on large datasets of protein sequences to learn protein structure and function and to predict new, functional proteins such as OpenCrispr-1.


First Mentioned

10/22/2025, 4:07:38 AM

Last Updated

10/22/2025, 4:10:18 AM

Research Retrieved

10/22/2025, 4:10:18 AM

Summary

A protein language model is an advanced AI-driven technology, analogous to the Large Language Models (LLMs) used in natural language processing, designed to understand and manipulate protein sequences. Pioneered in 2018, these models are trained on vast datasets of protein sequences to uncover hidden patterns related to protein structure, function, and stability. A notable application is OpenCrispr-1, a revolutionary gene-editing tool developed by Profluent Bio, a Berkeley-based startup. OpenCrispr-1 is more effective than existing CRISPR technology and has been open-sourced, a strategic move to democratize gene editing by allowing researchers and entrepreneurs to bypass the restrictive patent landscape dominated by institutions such as the Broad Institute, MIT, and Harvard. By removing these barriers, the technology holds significant potential to accelerate advances in the life sciences, agricultural technology, industrial biotechnology, drug discovery, vaccine development, and the treatment of genetic diseases.
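
To make the LLM analogy concrete, the sketch below shows the core self-supervised objective in miniature: amino acids in a protein sequence are randomly masked and a small Transformer encoder is trained to recover them from context. This is a toy illustration only; the tiny model, vocabulary, and example sequences are invented for demonstration and are unrelated to any production-scale model or to OpenCrispr-1.

```python
# Toy masked-amino-acid modelling: the self-supervised objective behind
# protein language models, shrunk to a few lines for illustration.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues
MASK_ID = 20                           # extra token id used for masking
VOCAB_SIZE = 21
tok = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode(seq: str) -> torch.Tensor:
    return torch.tensor([tok[aa] for aa in seq])

class TinyProteinLM(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)  # predicts the hidden residue

    def forward(self, x):
        return self.head(self.encoder(self.embed(x)))

# Two short, made-up sequences standing in for a real database such as
# UniProt, which holds hundreds of millions of unlabeled sequences.
seqs = ["MKTAYIAKQR", "GAVLIPFMWS"]
batch = torch.stack([encode(s) for s in seqs])

# Mask ~15% of positions and train the model to reconstruct them.
mask = torch.rand(batch.shape) < 0.15
mask[:, 0] = True                         # guarantee at least one masked position
inputs = batch.masked_fill(mask, MASK_ID)
targets = batch.masked_fill(~mask, -100)  # unmasked positions are ignored by the loss

model = TinyProteinLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(10):                    # a handful of steps, just to show the loop
    logits = model(inputs)
    loss = nn.functional.cross_entropy(logits.view(-1, VOCAB_SIZE), targets.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"toy reconstruction loss: {loss.item():.3f}")
```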

Referenced in 1 Document
Research Data
Extracted Attributes
  • Core Function

    Uncovering hidden patterns related to protein structure, function, and stability; predicting protein sequence function and fitness

  • Key Capability

    Extracting features from massive data and fine-tuning for specific downstream tasks (see the sketch after this list)

  • Technology Type

    Protein Language Model (PLM)

  • Architectural Basis

    Often based on Transformer architecture

  • OpenCrispr-1 Status

    Open-sourced

  • Training Data Scale

    Tens of millions to billions of protein sequences

  • OpenCrispr-1 Feature

    More effective than existing CRISPR technology

  • Underlying Principle

    Self-supervised machine learning on vast amounts of protein sequence data

  • Developer of OpenCrispr-1

    Profluent Bio

  • Example Model (Gene Editing)

    OpenCrispr-1

  • Training Data Characteristics

    Unlabeled protein sequences across multiple species

  • Developer Location (Profluent Bio)

    Berkeley, California, United States
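
The "extract features, then fine-tune" capability listed above can be illustrated with a publicly available pretrained model. The sketch below assumes the Hugging Face transformers library and the small facebook/esm2_t6_8M_UR50D ESM-2 checkpoint (both assumptions for demonstration, not anything used by Profluent Bio); it mean-pools per-residue embeddings into a fixed-length vector and fits a lightweight classifier head on made-up downstream labels.

```python
# Feature extraction with a pretrained protein language model (sketch).
# Assumes: pip install torch transformers; facebook/esm2_t6_8M_UR50D is a
# small public ESM-2 checkpoint, chosen here only for illustration.
import torch
from transformers import AutoModel, AutoTokenizer

CHECKPOINT = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
plm = AutoModel.from_pretrained(CHECKPOINT)
plm.eval()

def embed(sequence: str) -> torch.Tensor:
    """Return a fixed-length embedding by mean-pooling per-residue states."""
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        hidden = plm(**inputs).last_hidden_state        # (1, seq_len, hidden_dim)
    mask = inputs["attention_mask"].unsqueeze(-1)       # mask out padding before averaging
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, hidden_dim)

# Hypothetical downstream task: two made-up sequences with made-up labels
# (e.g. "binds target" = 1 / 0). A linear head is trained on top of the
# frozen embeddings; in practice the whole model is often fine-tuned.
sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "GAVLIPFMWSTQYNCDEHKR"]
labels = torch.tensor([1.0, 0.0])

features = torch.cat([embed(s) for s in sequences])     # (2, hidden_dim)
head = torch.nn.Linear(features.shape[1], 1)
opt = torch.optim.Adam(head.parameters(), lr=1e-2)
for _ in range(50):
    loss = torch.nn.functional.binary_cross_entropy_with_logits(
        head(features).squeeze(-1), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
print("downstream toy loss:", round(loss.item(), 4))
```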

Timeline
  • The Transformer architecture, foundational for many Large Language Models (LLMs) and subsequently Protein Language Models, was introduced. (Source: Wikipedia)

    2017-01-01

  • The first protein language model was introduced by Berger and former MIT graduate student Tristan Bepler PhD '20. (Source: Web Search Results)

    2018-01-01

  • Berger and colleagues utilized a protein language model to predict sections of viral surface proteins less likely to mutate, identifying potential vaccine targets against influenza, HIV, and SARS-CoV-2. (Source: Web Search Results)

    2021-01-01

  • ProtT5, a protein language model based on the T5 architecture, was developed by Elnaggar et al. (Source: Web Search Results)

    2021-01-01

  • ProstT5, a protein language model, was developed by Heinzinger et al. (Source: Web Search Results)

    2023-01-01

  • Ankh, a protein language model, was developed by Elnaggar et al. (Source: Web Search Results)

    2023-01-01

  • Profluent Bio, a Berkeley-based startup, developed and open-sourced OpenCrispr-1, a revolutionary AI-driven gene-editing tool based on a protein language model. (Source: Provided Summary)

    XXXX-XX-XX

Large language model

A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language processing tasks, especially language generation. The largest and most capable LLMs are generative pre-trained transformers (GPTs) and provide the core capabilities of chatbots such as ChatGPT, Gemini, and Claude. LLMs can be fine-tuned for specific tasks or guided by prompt engineering. These models acquire predictive power regarding syntax, semantics, and ontologies inherent in human language corpora, but they also inherit inaccuracies and biases present in the data they are trained on. They consist of billions to trillions of parameters and operate as general-purpose sequence models, generating, summarizing, translating, and reasoning over text. LLMs represent a significant new technology in their ability to generalize across tasks with minimal task-specific supervision, enabling capabilities like conversational agents, code generation, knowledge retrieval, and automated reasoning that previously required bespoke systems.

LLMs evolved from earlier statistical and recurrent neural network approaches to language modeling. The transformer architecture, introduced in 2017, replaced recurrence with self-attention, allowing efficient parallelization, longer context handling, and scalable training on unprecedented data volumes. This innovation enabled models like GPT, BERT, and their successors, which demonstrated emergent behaviors at scale such as few-shot learning and compositional reasoning.

Reinforcement learning, particularly policy gradient algorithms, has been adapted to fine-tune LLMs for desired behaviors beyond raw next-token prediction. Reinforcement learning from human feedback (RLHF) applies these methods to optimize a policy, the LLM's output distribution, against reward signals derived from human or automated preference judgments. This has been critical for aligning model outputs with user expectations, improving factuality, reducing harmful responses, and enhancing task performance.

Benchmark evaluations for LLMs have evolved from narrow linguistic assessments toward comprehensive, multi-task evaluations measuring reasoning, factual accuracy, alignment, and safety. Hill climbing, iteratively optimizing models against benchmarks, has emerged as a dominant strategy, producing rapid incremental performance gains but raising concerns of overfitting to benchmarks rather than achieving genuine generalization or robust capability improvements.
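
The self-attention operation credited above with replacing recurrence can be written in a few lines. The sketch below is a generic, single-head illustration with arbitrary toy dimensions, not the architecture of any particular LLM or protein language model.

```python
# Scaled dot-product self-attention, the core operation of the 2017
# Transformer architecture. Inputs and dimensions are arbitrary toy values.
import math
import torch

def self_attention(x: torch.Tensor, w_q, w_k, w_v) -> torch.Tensor:
    """x: (seq_len, d_model). Every position attends to every other position,
    so the whole sequence is processed in parallel rather than step by step."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # project to queries/keys/values
    scores = (q @ k.T) / math.sqrt(k.shape[-1])   # pairwise similarity, scaled
    weights = torch.softmax(scores, dim=-1)       # attention distribution per token
    return weights @ v                            # weighted mix of value vectors

d_model = 8
x = torch.randn(5, d_model)                       # five tokens, toy embeddings
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)     # -> torch.Size([5, 8])
```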

Web Search Results
  • Recent advances and future trends for protein–small molecule ...

    In recent years, the application of natural language models to protein amino acid sequences, referred to as protein language models (PLMs), has demonstrated a significant potential for uncovering hidden patterns related to protein structure, function, and stability. The critical functions of proteins in biological processes often arise through interactions with small molecules; central examples are enzymes, receptors, and transporters. Understanding these interactions is particularly important

  • Evaluating the advancements in protein language models ... - Frontiers

    These large-scale protein language models, based on tens of millions to billions of protein sequences that are self-supervised and pre-trained, represent the state-of-the-art in predicting protein sequence function and fitness. By pre-training on huge datasets of unlabeled protein sequences, these models are capable of automatically extracting features from massive data and fine-tuning them on specific downstream tasks. Protein language models focus on three [...] The emergence of protein language models solves the notable problems of previous approaches by efficiently utilizing large amounts of unlabeled protein sequence data through self-supervised learning, which can identify amino acids that have remained unchanged during the evolutionary process and are often critical for protein function. Their training data contains protein sequences across multiple species, which enables the models to learn the commonalities and differences in protein sequences [...] ProtT5 (Elnaggar et al., 2021), ProstT5 (Heinzinger et al., 2023), and Ankh (Elnaggar et al., 2023) are protein language models based on the T5 (Text-to-Text Transformer) architecture. The T5 model was originally designed to deal with sequence-to-sequence problems, such as machine translation. The unique feature of T5 is that it unifies a variety of NLP tasks into a single text-to-text transformational process.

  • Researchers glimpse the inner workings of protein language models

    In 2018, Berger and former MIT graduate student Tristan Bepler PhD ’20 introduced the first protein language model. Their model, like subsequent protein models that accelerated the development of AlphaFold, such as ESM2 and OmegaFold, was based on LLMs. These models, which include ChatGPT, can analyze huge amounts of text and figure out which words are most likely to appear together. [...] In the new study, the researchers wanted to dig into how protein language models make their predictions. Just like LLMs, protein language models encode information as representations that consist of a pattern of activation of different “nodes” within a neural network. These nodes are analogous to the networks of neurons that store memories and other information within the brain. [...] In a 2021 study, Berger and colleagues used a protein language model to predict which sections of viral surface proteins are less likely to mutate in a way that enables viral escape. This allowed them to identify possible targets for vaccines against influenza, HIV, and SARS-CoV-2. However, in all of these studies, it has been impossible to know how the models were making their predictions.

  • Predictive and therapeutic applications of protein language models

    Protein language models (pLMs) are rapidly emerging as revolutionary artificial intelligence technologies that bring transformative changes to drug discovery and therapeutic research. pLMs acquire rich representational capabilities from large-scale sequence datasets, enabling the solution of various biological problems that were difficult with conventional methods. In this review, we provide a comprehensive overview of various pLMs and their implementations, exploring their potential utility in [...] ## Keywords Artificial intelligence Bioinformatics Computational biology Drug discovery Protein language model ## Abbreviations BCR B Cell Receptor BERT Bidirectional Encoder Representations from Transformers BFDB Big Fantastic Database CNN Convolutional Neural Network DMS Deep Mutational Scanning ESM Evolutionary Scale Modeling GPT Generative Pretrained Transformer GO Gene Ontology GVP Geometric Vector Perceptron IDP Intrinsically Disordered Protein LM

  • Learning the Protein Language: Evolution, Structure and Function

    Using these and other tools, protein language models are able to synthesize the enormous quantity of known protein sequences by training on 100s of millions of sequences stored in protein databases (e.g. UniProt, Pfam, NCBI 15,49,50). The distribution over sequences learned by language models captures the evolutionary fitness landscape of known proteins. When trained on tens of thousands of evolutionarily related proteins, the learned probability mass function describing the empirical [...] language model were powerful features for solving other prediction problems. Since then, various language model-based protein embedding methods have been applied to these and other protein prediction problems through transfer learning, including protein phenotype prediction 28,32-34, residue-residue contact prediction 32,70, fold recognition 33, protein-protein 71,72 and protein-drug interaction prediction 35,73. Recent works have shown that increasing language model scale leads to continued [...] Deep language models are an exciting breakthrough in protein sequence modeling, allowing us to discover aspects of structure and function from only the evolutionary relationships present in a corpus of sequences. However, the full potential of these models has not been realized as they continue to benefit from more parameters, more compute power, and more data. At the same time, these models can be enriched with strong biological priors through multi-task learning.