How Do Genes Work? From DNA Sequence to Protein Function
What Exactly Is a Gene?
A gene is a defined segment of DNA that contains the instructions for producing a functional product. Many genes encode proteins, but others produce functional RNA molecules, such as ribosomal RNA or transfer RNA, that are never translated into protein. Humans have roughly 20,000 protein-coding genes, and their protein-coding sequences make up only about 1.5 percent of the total genome; the remaining DNA serves regulatory, structural, or still-unknown functions.
A typical protein-coding gene in humans includes several distinct regions. The promoter is a DNA sequence upstream of the gene that serves as a landing pad for RNA polymerase and transcription factors, determining when and where the gene is expressed. The coding region contains exons (sequences that end up in the final messenger RNA) interspersed with introns (sequences that are removed during RNA processing). The terminator region signals where transcription should stop.
Genes exist in specific locations on chromosomes called loci. Because humans have two copies of each autosomal chromosome (one from each parent), most genes are present in two copies called alleles. These alleles may be identical or may differ slightly in their DNA sequence, potentially producing different versions of the same protein. The combination of alleles an individual carries for a given gene constitutes their genotype for that gene.
From Gene to Protein: The Central Dogma
The flow of genetic information from DNA to RNA to protein is often called the central dogma of molecular biology, first articulated by Francis Crick in 1958. This process occurs in two main stages. First, during transcription, the DNA sequence of a gene is copied into a messenger RNA (mRNA) molecule. Second, during translation, the mRNA sequence is decoded by ribosomes to build a chain of amino acids that folds into a functional protein.
Transcription begins when transcription factors and RNA polymerase bind to the gene promoter. RNA polymerase then moves along the template strand of DNA, reading the base sequence and synthesizing a complementary RNA strand. Because the resulting pre-mRNA is complementary to the template strand, its sequence matches the opposite (coding) strand of the gene, with uracil replacing thymine. In eukaryotes, this initial transcript undergoes extensive processing before leaving the nucleus.
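The base-pairing step of transcription can be sketched in a few lines of Python. This is a toy model only: the template sequence and the function names are made up for illustration, and the promoter, transcription factors, and strand direction are reduced to a simple string lookup.

```python
# Toy model of transcription: pair each template base with its RNA partner.
# Pairing rules: template A -> U, T -> A, G -> C, C -> G.
RNA_PAIRING = {"A": "U", "T": "A", "G": "C", "C": "G"}
DNA_PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5: str) -> str:
    """Return the RNA (read 5'->3') made from a template strand written 3'->5'."""
    return "".join(RNA_PAIRING[base] for base in template_3_to_5.upper())


def coding_strand(template_3_to_5: str) -> str:
    """Return the coding (non-template) DNA strand, read 5'->3'."""
    return "".join(DNA_PAIRING[base] for base in template_3_to_5.upper())


template = "TACGGCATT"  # hypothetical template strand, written 3'->5'
rna = transcribe(template)
print(rna)  # AUGCCGUAA

# The transcript matches the coding strand, with U in place of T.
print(rna == coding_strand(template).replace("T", "U"))  # True
```

The second check shows why the transcript is usually described in terms of the coding strand: the two carry the same sequence apart from the uracil-for-thymine swap.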
RNA processing includes three main modifications. A 5-prime cap (a modified guanine nucleotide) is added to the beginning of the mRNA, protecting it from degradation. A poly-A tail (a string of adenine nucleotides) is added to the end, further stabilizing the molecule. Most importantly, introns are removed and exons are spliced together in a process called RNA splicing, producing the mature mRNA that carries the actual protein-coding sequence.
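The three processing steps can be mimicked with simple string operations. The sketch below assumes the exon boundaries are already known and uses invented sequences; the "m7G" label is only a text stand-in for the 5-prime cap, and real splicing relies on the spliceosome recognizing sequence signals rather than on fixed coordinates.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon segments (half-open index pairs), dropping the introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def add_cap_and_tail(spliced: str, tail_length: int = 10) -> str:
    """Mark the 5-prime cap and poly-A tail around the spliced sequence."""
    return f"m7G-{spliced}-{'A' * tail_length}"


# Hypothetical pre-mRNA: two exons separated by one intron.
# (Canonical introns begin with GU and end with AG.)
exon1, intron, exon2 = "AUGGCC", "GUAAGUUUCAG", "UUUUAA"
pre_mrna = exon1 + intron + exon2
exon_coords = [(0, len(exon1)), (len(exon1) + len(intron), len(pre_mrna))]

mature = splice(pre_mrna, exon_coords)
print(mature)                                   # AUGGCCUUUUAA
print(add_cap_and_tail(mature, tail_length=5))  # m7G-AUGGCCUUUUAA-AAAAA
```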
Translation occurs on ribosomes in the cytoplasm. The ribosome reads the mRNA sequence three bases at a time (each three-base unit is a codon), matching each codon to its corresponding amino acid via transfer RNA adapter molecules. The ribosome joins amino acids together with peptide bonds in the sequence dictated by the mRNA, building the polypeptide chain that will fold into a functional protein.
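Codon-by-codon decoding is easy to illustrate in code. The sketch below builds the standard genetic code as a lookup table and reads an mRNA from the first AUG until a stop codon. It is a simplification: the example sequences are made up, amino acids are shown as one-letter codes, and the tRNAs and the ribosome are reduced to a dictionary lookup.

```python
from itertools import product

# Standard genetic code, listed in UCAG order for each codon position.
# "*" marks the three stop codons.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG to the first stop codon (one-letter codes)."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: release the chain
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGGCCUUUUAA"))  # MAF (Met-Ala-Phe)
```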
Gene Regulation: Controlling When Genes Are Active
Not every gene is active in every cell at all times. A liver cell and a neuron contain identical DNA, yet they produce very different sets of proteins and perform completely different functions. This diversity arises from gene regulation, the complex system that controls which genes are expressed in which cells and at what levels.
Transcriptional regulation is the primary control point. Transcription factors are proteins that bind to specific DNA sequences near a gene and either activate or repress its transcription. Activators recruit RNA polymerase to the promoter, increasing gene expression. Repressors block RNA polymerase access or recruit chromatin-modifying enzymes that make the DNA physically inaccessible. The combination of transcription factors present in a cell determines which genes are turned on.
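The idea that a combination of transcription factors decides which genes are on can be pictured as a simple Boolean rule. The sketch below is a deliberately crude illustration: the gene names, factor names, and cell types are hypothetical, and real regulation is graded and context-dependent rather than strictly on or off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRule:
    """Hypothetical rule: which factors a gene needs and which ones block it."""
    activators: frozenset
    repressors: frozenset = frozenset()


def is_expressed(rule: GeneRule, factors_present: set) -> bool:
    """A gene is on if all its activators and none of its repressors are present."""
    present = set(factors_present)
    return rule.activators <= present and not (rule.repressors & present)


# Hypothetical genes and transcription factors, for illustration only.
rules = {
    "gene_liver": GeneRule(frozenset({"TF_A", "TF_B"})),
    "gene_neuron": GeneRule(frozenset({"TF_C"}), frozenset({"TF_B"})),
}
cells = {
    "liver-like": {"TF_A", "TF_B"},
    "neuron-like": {"TF_A", "TF_C"},
}

for cell, factors in cells.items():
    active = [gene for gene, rule in rules.items() if is_expressed(rule, factors)]
    print(cell, active)
# liver-like ['gene_liver']
# neuron-like ['gene_neuron']
```

Both model cells carry the same rules, just as both real cells carry the same DNA; only the set of factors present differs.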
Enhancers are regulatory DNA sequences that can be located thousands or even millions of base pairs away from the genes they control. They work by looping through three-dimensional space to make physical contact with gene promoters, bringing activating transcription factors into proximity with the transcription machinery. This long-range regulation allows complex patterns of gene expression to be controlled by multiple inputs.
Epigenetic mechanisms provide another layer of gene regulation. DNA methylation (adding methyl groups to cytosine bases), particularly in promoter regions, generally silences genes by blocking transcription factor binding and by recruiting proteins that compact chromatin. Histone modifications (chemical changes to the histone proteins around which DNA is wound) can either open or close chromatin structure, making genes more or less accessible for transcription. These modifications can be stably maintained through cell divisions, giving cells a form of molecular memory.
Alternative Splicing: One Gene, Many Proteins
A single gene can produce multiple different proteins through a process called alternative splicing. By including or excluding different combinations of exons during RNA processing, cells can generate distinct mRNA variants from the same gene, each encoding a slightly different protein. This mechanism dramatically increases the protein diversity available from a limited number of genes.
The human genome contains roughly 20,000 protein-coding genes, yet by some estimates it produces 80,000 to 100,000 distinct proteins. Alternative splicing accounts for much of this discrepancy. Some genes produce only two or three splice variants, while others are far more prolific: the Dscam gene in fruit flies can theoretically produce over 38,000 different protein variants from a single gene through combinatorial exon selection.
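The Dscam figure follows from simple multiplication. The sketch below assumes the cluster sizes commonly reported for the fruit fly Dscam gene (12, 48, 33, and 2 alternative versions of exons 4, 6, 9, and 17); those counts are not given in the text above, and the model assumes exactly one version is chosen from each cluster, independently of the others.

```python
from math import prod

# Commonly reported numbers of mutually exclusive alternatives per exon
# cluster in fruit fly Dscam (an assumption, not stated in the article).
dscam_clusters = {"exon 4": 12, "exon 6": 48, "exon 9": 33, "exon 17": 2}


def isoform_count(cluster_sizes) -> int:
    """One choice per cluster, chosen independently: multiply the options."""
    return prod(cluster_sizes)


print(isoform_count(dscam_clusters.values()))  # 38016
```

The same function shows how quickly combinations grow: a hypothetical gene with five independent two-way choices would already have 2 x 2 x 2 x 2 x 2 = 32 possible variants.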
When Genes Malfunction
Mutations in genes can disrupt the normal production of proteins, potentially causing disease. A single base change in the beta-globin gene causes sickle cell disease by producing an abnormal hemoglobin protein that distorts red blood cells. Mutations in the BRCA1 and BRCA2 genes impair the DNA repair proteins they encode, substantially increasing the risk of breast and ovarian cancer. The cystic fibrosis transmembrane conductance regulator (CFTR) gene, when mutated, produces a defective chloride channel protein that causes the symptoms of cystic fibrosis.
Not all mutations are harmful. Many are neutral, changing the DNA sequence without affecting protein function, either because the genetic code is redundant (several codons specify the same amino acid) or because the change falls in a non-critical region. Some mutations are beneficial, producing protein variants that provide an advantage in particular environments. The sickle cell allele, for example, provides resistance to malaria in heterozygous carriers (a condition known as sickle cell trait), which explains its persistence in malaria-endemic regions despite being harmful when homozygous.
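A short sketch can show how the redundancy of the genetic code makes some single-base changes silent while others alter the protein. It uses the standard codon table and the GAG to GUG change usually cited for sickle cell hemoglobin (glutamic acid to valine); the specific codons are not given in the text above, and the category labels are the conventional ones rather than anything defined in this article.

```python
from itertools import product

# Standard genetic code in UCAG order; "*" marks stop codons.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_mutation(codon_before: str, codon_after: str) -> str:
    """Label a codon change by its effect on the encoded amino acid."""
    before, after = CODON_TABLE[codon_before], CODON_TABLE[codon_after]
    if before == after:
        return "synonymous"  # same amino acid: typically neutral
    if after == "*":
        return "nonsense"    # premature stop codon
    if before == "*":
        return "stop-loss"   # stop codon replaced by an amino acid
    return "missense"        # different amino acid


print(classify_mutation("GAG", "GAA"))  # synonymous (both encode glutamic acid)
print(classify_mutation("GAG", "GUG"))  # missense (glutamic acid -> valine)
print(classify_mutation("GAG", "UAG"))  # nonsense
```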
Genes work as instruction manuals for building proteins. The DNA sequence is transcribed into messenger RNA and then translated into a chain of amino acids that folds into a functional protein. Complex regulatory systems ensure the right genes are active in the right cells at the right times.