UniProt is like a candy store
Being someone who's interested in protein science these days is like being a kid in a candy store. There's been continual improvement in molecular biology tools, making lab work easier and faster. Computational biologists today enjoy an ecosystem that's richer than ever before, with software powerhouses like Facebook (now Meta) releasing state-of-the-art pre-trained transformer models for protein language modeling. Most important, protein scientists today enjoy free, unfettered access to massive amounts of biological information that has been collected by countless groups over many years, organized, hyperlinked, and presented in a highly usable form.
I am talking about UniProt, of course. If you know proteins, you know UniProt. If you don't, take a look at the linked page. UniProt is a free, open-access library of biological information. Its reviewed entries are carefully curated by hand, and every piece of data is usefully structured and interlinked.
I mean, just look at it. The link goes to the E. coli proteome page on UniProt: a live list of the 6,463 proteins that Escherichia coli K12 cells make. Every single protein that E. coli makes is cataloged here, and each one has its own page. On this page, though, they're filtered down to just enzymes, and those enzymes are categorized in a tree by functional classification (EC number). So we can see that, of all the enzymes that E. coli makes, 16% are oxidoreductases. We can see that E. coli makes only three different kinds of superoxide dismutase, but 149 different kinds of aldehyde and carbonyl reductases. It's such a joy to see this kind of information so densely presented. And it's the same for every organism: not every organism on earth, but every organism whose genome has been sequenced, perhaps tens of thousands of organisms.
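To make the category breakdown concrete, here is a minimal sketch of how a list of EC numbers could be tallied by top-level class, which is the first digit of the EC number. The function name and the sample list are my own illustration: the sample is four well-known EC numbers picked by hand, not E. coli data and not anything pulled from UniProt.

```python
from collections import Counter

# Top-level EC classes, keyed by the first digit of an EC number.
EC_CLASSES = {
    "1": "Oxidoreductases",
    "2": "Transferases",
    "3": "Hydrolases",
    "4": "Lyases",
    "5": "Isomerases",
    "6": "Ligases",
    "7": "Translocases",
}


def ec_class_breakdown(ec_numbers):
    """Return {class name: fraction of entries} for a list of EC number strings."""
    counts = Counter(EC_CLASSES[ec.split(".")[0]] for ec in ec_numbers)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Hand-picked illustration only (hypothetical input, not a real proteome):
# superoxide dismutase, alcohol dehydrogenase, hexokinase, beta-galactosidase.
sample = ["1.15.1.1", "1.1.1.1", "2.7.1.1", "3.2.1.23"]

for name, fraction in ec_class_breakdown(sample).items():
    print(f"{name}: {fraction:.0%}")
```

Run over the EC numbers of a real proteome, the same tally would presumably give the kind of percentages the page shows, though I have not checked how UniProt counts enzymes that carry more than one EC number.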
It's so incredible to have this amount of information at your fingertips when you are seeking to design a new enzyme function. You can take inspiration from all the different kinds of enzymes that the cell is already making. And of course if this species doesn't make it, there's a species that does, and UniProt has that sequence too. It's a truly inspiring body of scientific work, collected by countless scientists over the past 50 years, and it's a real treat to see it all nicely organized and presented, for free, to anyone who wants it. A complete catalog of the world's most advanced technology, all the "source code" of biology, for everyone to use.
It would be impossible to overstate the importance of the sequence datasets in UniProt, yet the thing that really truly amazes me when I look at that page isn't the neat table of sequences and metadata. It isn't the cool bar charts showing the enzyme category breakdowns. It isn't the hyperlinks and search tools. It's these two numbers.
At the top, the number 6,463: the number of proteins in the E. coli proteome. And on the sidebar, "3D structure (1,700)": the number of those proteins that have a 3D structure determined by experiment. More than 25% of the E. coli proteome has an experimentally solved structure.
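The arithmetic behind that "more than 25%" is simple enough to check. A quick sketch, using only the two numbers quoted from the page (the function name is mine):

```python
def structure_coverage(with_structure, total_proteins):
    """Fraction of a proteome that has an experimentally determined 3D structure."""
    return with_structure / total_proteins


# The two numbers quoted from the proteome page.
coverage = structure_coverage(1700, 6463)
print(f"{coverage:.1%}")  # prints 26.3%
```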
This implies that we have an atomic-level understanding of a huge chunk of an E. coli cell. At the same time, that's just one species. Certainly one of the better-studied species, at that. But if you're interested in proteins and protein structure, it sure is an exciting number.