Reverse engineering life: a journey to the centre of the cell

September 28th, 2023

As I sit here attempting to write up 7 years of work into a Big Paper (and thesis) during these twilight months of my PhD, I’ve been reminded of why I chose to train as a scientist: biology is damn cool.

One of my favourite things is explaining to non-science friends why I find biology so exciting. This often comes with questions about what my thesis is about. I think I’ve nailed the elevator pitch by now, but for those of you who haven’t had the fortune of being cornered at a party listening to me ramble, I’ve decided to publish this piece to lay it all out there and hopefully share my appreciation for the magic that is life at the cell and molecular level — a necessary introduction to explaining what I’ve spent years thinking about and working on.

This is intended for a lay audience, so I’m keeping it high-level and simplifying things as much as possible. If you’re a biologist and came here to find out what I do:

I use BioID to characterize the protein composition of nuclear bodies in human cells.

If you aren’t a biologist but you want to know what that means, then I invite you to read on. We’ll start with a general overview of living things and how they’re studied, the tools used to do this work, and eventually get to explaining the cryptic one-liner above.

Cells: the fundamental unit of life

All living things are made of cells, from the countless single-celled microbes which surround us, to the trillions of cells that make up a human. A single cell is a machine more complex than anything built or designed by humans by orders of magnitude. You can think of cells as being made up of parts: lipids that make up membranes which separate what’s inside from what’s outside, proteins — the molecular machines that make life happen, DNA — the “source code” that makes up our genetic material and the basis for inheritance, RNA — the intermediary molecule made from DNA templates, which is used as a template by ribosomes (which themselves are ancient molecular machines consisting of many proteins and RNAs) to guide the joining of amino acids in a specific order to create proteins, and finally carbohydrates and myriad other metabolites and small molecules.

It’s tempting to think of cells as simple “bags of molecules” but the reality is much more complicated than that. Spatial organization is a recurring theme in biology, and just like the cell membrane separates inside from outside, eukaryotic cells have compartments which we call organelles. Some of these will be familiar to most people: the mitochondria (the “powerhouse” of the cell which produces energy in the form of ATP) and the nucleus (the “control centre” of the cell, where your genome (aka all your genetic material, aka DNA) is stored, and where gene regulation occurs). Organelle functions are determined by the parts found within them, and it’s no surprise that cells need to organize all these parts in some way. There are billions of proteins in a single cell, and they each need to go to specific places to do specific things.

These proteins are encoded by the genome, which in humans is ≈3 billion base pairs in size. This means that there are 3 billion DNA “letters” (A, T, C, and G) strung together in an order that encodes genes as well as instructions for gene regulation. Remember that DNA is a physical molecule. Each of the trillions of cells in your body has a copy of this genome, and strung end-to-end, the genome is about 1 metre in length. But our cells each have **two** copies of the genome, which means every cell has 2 metres of DNA that has to be compacted into a nucleus only a few microns across. Crazy. On top of that, the genome has to be accessible to all the proteins which interact with it in specific ways, at specific times, to make RNA which will in turn make proteins.
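For anyone who wants to check that arithmetic, here is a rough back-of-envelope sketch in Python. The 0.34 nanometre spacing between neighbouring base pairs is a textbook value assumed here for illustration; it is not a number stated above.

```python
# Back-of-envelope check of the DNA length figures quoted in the text.
BASE_PAIRS_PER_GENOME_COPY = 3_000_000_000  # ~3 billion, from the text
RISE_PER_BASE_PAIR_NM = 0.34                # assumption: textbook spacing per base pair (nm)
COPIES_PER_CELL = 2                         # two copies of the genome per cell, from the text


def dna_length_metres(base_pairs, copies=1, rise_nm=RISE_PER_BASE_PAIR_NM):
    """Return the end-to-end length of the DNA in metres."""
    return base_pairs * copies * rise_nm * 1e-9  # convert nanometres to metres


one_copy_m = dna_length_metres(BASE_PAIRS_PER_GENOME_COPY)
per_cell_m = dna_length_metres(BASE_PAIRS_PER_GENOME_COPY, COPIES_PER_CELL)

print(f"one genome copy: {one_copy_m:.2f} m")
print(f"DNA per cell:    {per_cell_m:.2f} m")
```

Under that assumption, one copy comes out at roughly a metre and a cell's worth at roughly two, which matches the figures above.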

Your genome contains ≈20,000 protein-coding genes, and all the proteins in a cell or organism are collectively referred to as the proteome (remember this, we’ll come back to it later). Each of those proteins can also be modified in countless ways, meaning there is a mind-boggling diversity of proteins that work together to make life happen. Some estimates say there are ≈100,000 unique proteins in humans, and I don’t think that includes the many chemical modifications that happen to proteins. What’s amazing is that protein-coding genes make up only ≈1.5% of the genome, with the rest being instructions for how, when, and where to express these proteins. There are over 200 different types of cells in the human body, and each one needs to do different things to make up different tissues and systems. These different cells all express different specific subsets of proteins, and this relies on those genes being regulated. But it all comes from the same genome.

Cells are incredible self-organizing systems that evolved over billions of years to do some amazing things. They grow, divide, respond to signals, and coordinate with each other. They are tiny corners of the universe where entropy is being decreased; where matter is organized at the molecular level. There’s an intricate and continuous molecular dance going on inside cells. These trillions of chemical reactions happening every second are what allow you to read this sentence I’ve written. We are but immensely complex meat robots.

Understanding how cells work is our key to explaining and controlling life itself. Beyond the visceral satisfaction derived from understanding what makes up our physical existence, it’s what lets us create cures for diseases, what lets us harness the power of living things as technology, and what lets us transcend the cruel injustice of nature. The past century of progress in biology has demystified a lot about life, but there’s still so much for us to figure out.

Reverse engineering life

We’re lucky that evolution is a thing, because we’ve learned a lot about how our own cells work by studying “simpler” organisms like yeast and worms. The basic principles that underlie life — metabolism, gene regulation, moving things around to where they have to go in the cell — are largely consistent. You’ve probably heard that we share something like 98% of our DNA with chimpanzees. I think we share something like 85% of our DNA with mice. Superficially, we’re very different from these animals. But at the cell and molecular level we’re surprisingly similar. I often think back to something my supervisor told me in the early days of my PhD when I questioned using the mouse version of a gene instead of the human version in an experiment: “Boris, for all intents and purposes, you are a mouse”.

There’s a famous quote from a Nobel laureate that I don’t care to dig up right now that describes physics as “natural philosophy”, and by extension, biology as “natural engineering”. We’re talking about living machines and our quest to explain how they work, which is why I like to think of biology as the reverse-engineering of life.

Studying biology is hard. Cells are microscopic. The parts that make them up are too small to see with a normal microscope, so measuring them requires all sorts of seemingly arcane biochemical techniques. We get better and better at identifying and quantifying these parts with every passing year, and my greatest hope for AGI is that we achieve a holistic model of the cell in my lifetime. “Genetic engineering” is more akin to tinkering, but with enough data and improved models of how life works I believe we’ll eventually be able to engineer life from first principles. A cure for every disease. Unlimited lifespan. Eco-friendly production of chemicals and products from living cells. We can already do some of these things, and the human race stands at the precipice of a future where we have mastery over the living world. I pursued science to help take us over that edge.

Proteomics

Genes are to the genome what proteins are to the proteome. Back in the day, we used to primarily do genetics — studying individual genes and understanding what they do and how they’re regulated. With the advent of DNA sequencing technology (tools to read the order of DNA bases that makes up our “code”) came genomics, and our ability to study genes at the systems level.

Similarly, technology has advanced over the past decades to let us study proteomes instead of individual proteins. Instead of isolating individual proteins to study their biochemical properties, we can now use mass spectrometry to identify all (or at least many) of the proteins in a sample, e.g. the set of proteins we can purify from a sample of cells. Without getting too nitty-gritty, a mass spectrometer is like a really sensitive scale that we use to weigh molecules. Specifically, it measures the mass-to-charge ratio of the molecules being analyzed. In our case, these molecules are peptides (fragments of proteins). Because proteins are chains of amino acids in a specific order, and we can calculate the molecular weight of amino acids (and combinations of them), we can use mass spectrometry to identify and quantify the proteins in our sample. We know (or at least can predict) the amino acid sequence of each protein because we know the DNA sequences that encode them, and that’s because we have sequenced the human genome.
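To give a flavour of the "weighing" idea, here is a minimal Python sketch that adds up amino acid masses to predict what a peptide should weigh, then checks which candidate peptide fits a measurement. The residue masses are standard monoisotopic values; the candidate peptides, the function names, and the tolerance are made up for illustration. Real identification software is far more involved (it also breaks peptides apart and matches the fragments), so treat this as a toy, not as how our lab's pipeline works.

```python
# Standard monoisotopic masses (in daltons) of amino acid residues within a chain.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # a free peptide carries one extra water (H at one end, OH at the other)
PROTON = 1.007276  # each positive charge adds one proton


def peptide_mass(sequence):
    """Neutral monoisotopic mass of a peptide given as one-letter amino acid codes."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def mass_to_charge(sequence, charge):
    """Mass-to-charge ratio (m/z) of the peptide carrying `charge` extra protons."""
    return (peptide_mass(sequence) + charge * PROTON) / charge


def match_peptides(observed_mz, charge, candidates, tolerance=0.01):
    """Return the candidates whose predicted m/z lies within `tolerance` of the measurement."""
    return [p for p in candidates
            if abs(mass_to_charge(p, charge) - observed_mz) <= tolerance]


# Hypothetical candidates, standing in for peptides predicted from the genome.
candidates = ["PEPTIDE", "SAMPLER", "GASP"]
print(round(peptide_mass("PEPTIDE"), 2))
print(match_peptides(400.687, 2, candidates))
```

Note that leucine (L) and isoleucine (I) have identical masses, which is one small reason why mass alone is not always enough to pin down a sequence.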

This is powerful. Proteins are the functional output of genes, and proteomes are the functional output of genomes. We now have tools that show us what parts are in a given cell, and at which quantities.

A very simple example I like to use for illustrating what proteomics enables is that we can take a healthy cell and a cancer cell, purify out all their proteins, identify them using mass spec, and compare them. Maybe we see that the cancer cell has a lot more of proteins A, B, and C but a lot less of proteins X, Y, and Z. This lets us build hypotheses to explain the mechanisms underlying this cancer. Maybe it’s the presence of proteins A/B/C that drives the cancer. Or maybe it’s the absence of X/Y/Z. We could try using a drug to destroy proteins A/B/C or another to increase X/Y/Z. Maybe this will let us kill or disable the cancer cells.
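The comparison in this example can be sketched in a few lines of Python. All the numbers below are invented for illustration (as is the extra "housekeeping" protein that does not change), and the two-fold cut-off is an arbitrary assumption; a real analysis would use replicate measurements and proper statistics.

```python
import math

# Invented abundance values (arbitrary units) for the proteins in the example.
healthy = {"A": 100, "B": 80, "C": 50, "X": 400, "Y": 300, "Z": 200, "housekeeping": 1000}
cancer = {"A": 800, "B": 400, "C": 300, "X": 50, "Y": 40, "Z": 20, "housekeeping": 1050}


def log2_fold_changes(reference, sample, pseudocount=1.0):
    """log2(sample / reference) per protein; the pseudocount avoids dividing by zero."""
    return {
        protein: math.log2((sample.get(protein, 0) + pseudocount)
                           / (reference[protein] + pseudocount))
        for protein in reference
    }


def split_by_change(changes, threshold=1.0):
    """Split proteins into 'up' and 'down' lists; a threshold of 1.0 means two-fold."""
    up = sorted(p for p, change in changes.items() if change >= threshold)
    down = sorted(p for p, change in changes.items() if change <= -threshold)
    return up, down


changes = log2_fold_changes(healthy, cancer)
up, down = split_by_change(changes)
print("higher in cancer:", up)
print("lower in cancer: ", down)
```

Using a log scale treats "eight times more" and "eight times less" symmetrically, which is why fold changes are usually reported this way.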

Proteomics is an entire field unto itself; another arrow in our quiver of tools that let us understand cells. Our lab develops proteomics methods and uses them to understand how proteins associate and interact with each other to perform their functions. Protein–protein interactions are an important area of research because knowing which proteins do things together is crucial for eventually understanding how all the parts in a cell work together to do everything. I joined the Gingras lab for my PhD in part because we are a world leader in using a proximity labeling technique called BioID.

BioID lets us figure out which proteins associate with each other inside a living cell. How it works is we take a protein of interest and tag it with a different protein — an enzyme, specifically a biotin ligase originally from bacteria — and express this inside a cell we engineered. Biotin ligase takes biotin (vitamin B7) and attaches it to nearby proteins. Think of this like attaching a tiny paintbrush to a protein and then letting it loose in the cell to do its normal thing, to go where it normally goes. All the proteins it comes into contact with get “painted” with biotin (“biotinylated”). Next, we bust open these cells and use a protein called streptavidin to selectively pull out all the biotinylated proteins. Then we take all the biotinylated proteins and identify them using mass spec.

This is how we build a spatial map of proteins in a living cell. Our group has done this to produce a human cell map of over 4000 proteins in the model HEK293 cell, providing new information on where all these parts go. My colleagues took proteins that were known to be residents of the different organelles, tagged them with biotin ligase, and saw which proteins got labeled. This took hundreds of experiments using hundreds of cell lines each expressing a different tagged protein, all done by hand. Now, even if we don’t know the specific function of a protein in this dataset (which is the case for most proteins), we at least have a clue about what it’s involved in — because we now know that it resides in an organelle with a known function, or that it associates with proteins that are functionally characterized.
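The "clue by association" logic can be illustrated with a toy Python sketch. Everything here is hypothetical: the tagged protein names, the labelled proteins, and the simple majority-vote rule are stand-ins chosen for illustration, not the scoring actually used to build the human cell map.

```python
from collections import Counter

# Hypothetical tagged proteins ("baits") and the organelle each is known to live in.
BAIT_ORGANELLE = {
    "mito_bait_1": "mitochondrion",
    "mito_bait_2": "mitochondrion",
    "nuclear_bait_1": "nucleus",
    "nuclear_bait_2": "nucleus",
    "er_bait_1": "endoplasmic reticulum",
}

# Hypothetical results: which proteins each bait "painted" with biotin.
LABELLED = {
    "mito_bait_1": {"mystery_protein_1", "known_mito_protein"},
    "mito_bait_2": {"mystery_protein_1"},
    "nuclear_bait_1": {"mystery_protein_2", "mystery_protein_1"},
    "nuclear_bait_2": {"mystery_protein_2"},
    "er_bait_1": {"known_er_protein"},
}


def predict_compartment(protein, labelled, bait_organelle):
    """Guess where a protein lives from the organelles of the baits that labelled it."""
    votes = Counter(
        bait_organelle[bait]
        for bait, painted in labelled.items()
        if protein in painted
    )
    if not votes:
        return None  # never labelled, so this data offers no clue
    return votes.most_common(1)[0][0]


print(predict_compartment("mystery_protein_1", LABELLED, BAIT_ORGANELLE))
print(predict_compartment("mystery_protein_2", LABELLED, BAIT_ORGANELLE))
```

In this made-up example, mystery_protein_1 is labelled by two mitochondrial baits and one nuclear bait, so the vote points to the mitochondrion while still hinting that the picture may be more nuanced.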

It is this work that laid the foundation for my own PhD project.

Nuclear bodies: enigmatic structures at the centre of the cell

We’ve talked about how spatial organization is an important part of life’s complexity, and how cells have subcompartments called organelles. Membrane-bound organelles are not, however, the only way cells organize their contents, and more recently we’ve come to appreciate membraneless organelles and domains in the cell. These are structures that are believed to form as a result of liquid–liquid phase separation. You know when you mix oil and vinegar and get distinct droplets that don’t mix? Something like that, but inside living cells, and made of proteins and RNAs. These droplets are dynamic (proteins and RNA move in and out of them quickly), but they are still seen as distinct, visible structures.

PML nuclear bodies (red) in HEK293 cells. The blue regions are nuclei, indicated by DAPI staining.

We can refer to this general class of structures as biomolecular condensates. This is a relatively new area of study, but is quickly being recognized as an important way that biology is regulated. Inside the nucleus, these are termed nuclear bodies, and are a diverse group of structures that share some common properties. As a sort of membraneless “sub-organelle”, nuclear bodies are distinct domains in the nucleus that are separate from your DNA (which doesn’t actually sit around as “naked” DNA, but is always wrapped up around proteins called histones, forming what we call chromatin).

The largest and most well-known nuclear body is the nucleolus, which some of you might remember learning about in school as the “nucleus in the nucleus”. The nucleolus is very important; it’s where ribosomes are made, and ribosomes are what make proteins. Nucleoli are big enough to see with a standard light microscope, and you can even see them in my image above (regions in each nucleus that are “less blue”). The other nuclear bodies are smaller and appear either as spherical dots or “foci”, or elongated structures called “speckles” (because they look “speckly”… yes, really). There are a bunch of different nuclear bodies; they vary in size and shape, appear in different quantities, and are composed of different parts. As you can imagine, they each do different things, and this is defined by the parts that make them up.

Nuclear speckles (red) as stained by the SC35 antibody.

Nuclear bodies are important! We can say this because evolution seems to think they’re important, and indeed, they can be found in organisms across disparate branches of the tree of life. All eukaryotes have nucleoli, and even plant cells seem to have a nuclear body similar to the Cajal body found in our cells. Nuclear bodies are essential for stress responses (like defending from viral infections) and play an important role in gene regulation and RNA metabolism. Studies in mice have shown that nuclear bodies like the paraspeckle are essential for aspects of development, and core components of paraspeckles are implicated in neurodegenerative diseases like Alzheimer’s and ALS.

Exactly what each nuclear body does, how they’re regulated, and how that all relates to disease remains a mystery. Yes, we know what some of them do, but many remain uncharacterized. We return to a common theme: knowing which proteins are in a structure is an important part of explaining how it works. We know many nuclear body proteins thanks to several decades of scientists largely using microscopes to see which proteins overlap with known markers of nuclear bodies (known as colocalization imaging). If a protein co-localizes with a nuclear body marker, we define it as being part of the nuclear body. This has limitations, and our information is incomplete. Ideally we would take a proteomics approach to isolate a specific nuclear body and see which proteins are there using mass spectrometry. Of course people have tried, but remember: nuclear bodies don’t have a membrane! Which means they’re really hard to purify. They basically fall apart.

Which brings us back to our favourite tool: BioID. Earlier work from our lab used BioID to describe the composition of cytoplasmic (non-nuclear) membraneless organelles called stress granules and p-bodies. Knowing that BioID seemed to work well for studying those biomolecular condensates is what de-risked the project I would take on for my PhD: the proteomic characterization of nuclear bodies in a human cell. This would be the first and largest-scale study of nuclear bodies using this technique.

How did this play out? I’ve since profiled over 150 proteins in the nucleus using BioID and generated a high-confidence map of ≈2269 proteins. We can analyze and organize these data into clusters which represent different nuclear bodies and functional groups of proteins, which has expanded our lists of the proteins that we think are in each nuclear body. This approach ended up giving us the best results for paraspeckles and another nuclear body called the nuclear speckle (shown above), so we doubled down and focused on validating new components of these structures predicted by my dataset.
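As a rough illustration of what "organizing data into clusters" can mean, the Python sketch below groups tagged proteins whose sets of labelled neighbours overlap heavily. The bait names, the neighbour sets, the Jaccard similarity measure, and the 0.5 cut-off are all assumptions chosen to keep the example small; this is not the clustering method used for the actual dataset.

```python
from itertools import combinations

# Hypothetical tagged proteins and the sets of proteins each one labelled.
prey_sets = {
    "bait_P1": {"p1", "p2", "p3", "p4"},
    "bait_P2": {"p1", "p2", "p3", "p5"},
    "bait_S1": {"s1", "s2", "s3"},
    "bait_S2": {"s1", "s2", "s3", "s4"},
}


def jaccard(a, b):
    """Overlap between two sets: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_baits(prey_sets, threshold=0.5):
    """Group baits that are linked, directly or indirectly, by similar labelling profiles."""
    neighbours = {bait: set() for bait in prey_sets}
    for first, second in combinations(prey_sets, 2):
        if jaccard(prey_sets[first], prey_sets[second]) >= threshold:
            neighbours[first].add(second)
            neighbours[second].add(first)

    groups, seen = [], set()
    for bait in prey_sets:
        if bait in seen:
            continue
        stack, group = [bait], set()
        while stack:  # walk through everything reachable from this bait
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(neighbours[current] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


print(group_baits(prey_sets))
```

In this toy data the two "P" baits share most of their neighbours, as do the two "S" baits, so they fall into two separate groups, loosely analogous to two different nuclear bodies.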

I was lucky to collaborate with the lab of Archa Fox, who discovered paraspeckles in 2002, to explain how one of these new paraspeckle components is important for paraspeckle assembly. I similarly hope that my dataset will help researchers explore the function of other nuclear bodies, and get us a little bit closer to knowing what exactly is going on in our trillions of cells.

What’s next?

If I’m being totally honest, writing this was a bit of “productive” procrastination from working on my manuscript that will formally share this work with the scientific community. My immediate goal is finishing that paper and sharing the resource I’ve built, so that others can explore this new view of nuclear bodies and mine the dataset for new leads on how these enigmatic structures work. Then I’ll expand on that paper and get it into the right format to be submitted as a thesis so I can finally get those letters after my name :-)

Over the past two years I’ve begun to focus my work around how we can improve science at the level of systems and institutions, namely by designing new incentives using blockchain technology and through the emerging DeSci movement. I’m excited to build a future where more people want to be scientists and have the opportunity to do so. I want to figure out how scientists can capture more of the value they create. I want to be part of a future where science is more open, more reproducible, more collaborative, and faster.

Future writing will likely focus on those themes, but I hope my long-overdue foray into science communications here today has brought some greater appreciation to the magic of biology.

