

Genomics: Genome-Wide Analysis of Gene Structure and Function
Author:
Harvey Lodish, Arnold Berk, Chris A. Kaiser, Monty Krieger, Anthony Bretscher, Hidde Ploegh, Angelika Amon, and Kelsey C. Martin.
Source:
Molecular Cell Biology
Volume and page:
8th ed., pp. 323-327
2026-02-21
By using automated DNA sequencing techniques and computer algorithms to piece together sequence data, researchers have determined vast amounts of DNA sequence, including nearly the entire genomic sequence of humans and of many key experimental organisms. This enormous volume of data, which is growing at a rapid pace, has been stored and organized by the National Center for Biotechnology Information (NCBI), US National Institutes of Health, the European Bioinformatics Institute at the European Molecular Biology Laboratory in Heidelberg, Germany, and the DNA Data Bank of Japan. These databases continuously exchange newly reported sequences and make them available to scientists throughout the world on the Internet. By now, the genomic sequences have been completely, or nearly completely, determined for hundreds of viruses and bacteria; scores of archaea; yeasts (eukaryotes); plants, including rice and maize; important model multicellular eukaryotes such as the roundworm C. elegans, the fruit fly Drosophila melanogaster, and mice; humans; and representatives of all of the 35 or so metazoan phyla.
The cost of sequencing a megabase of DNA has fallen so low that the entire genomes of cancer cells have been sequenced and compared with the genomes of normal cells from the same patients in order to determine all the mutations that have accumulated in each patient’s tumor cells. This approach is revealing genes that are commonly mutated in all cancers, as well as genes that are commonly mutated in tumors from different patients with the same type of cancer (e.g., breast or colon cancer). It may eventually lead to highly individualized cancer treatments tailored to the specific mutations in the tumor cells of a particular patient.
The latest automated DNA sequencing techniques are so powerful that a project known as the “1000 Genomes Project” is currently under way, with the goal of sequencing most of the genomes of 2500 randomly chosen individuals from 25 populations around the world in order to determine the extent of human genetic variation as a basis for investigating the relationship between genotype and phenotype in humans. Moreover, privately owned companies have been founded that will sequence much of an individual’s genome for about $100 in order to search for sequence variations that may influence that individual’s probability of developing specific diseases.
In this section, we examine some of the ways in which researchers are mining this treasure trove of data to provide insights about gene function and evolutionary relationships, to identify new genes whose encoded proteins have never been isolated, and to determine when and where genes are expressed. This use of computers to analyze sequence data has led to the emergence of a new field of biology: bioinformatics.
Stored Sequences Suggest Functions of Newly Identified Genes and Proteins
As discussed in Chapter 3, proteins with similar functions often contain similar amino acid sequences that correspond to important functional domains in the three-dimensional structure of the proteins. By comparing the amino acid sequence of the protein encoded by a newly cloned gene with the sequences of proteins of known function, an investigator can look for sequence similarities that provide clues to the function of the encoded protein. Because of the degeneracy in the genetic code, related proteins invariably exhibit more sequence similarity than the genes encoding them. For this reason, protein sequences, rather than the corresponding DNA sequences, are usually compared.
The most widely used computer program for this purpose is known as BLAST (basic local alignment search tool). The BLAST algorithm divides the “new” protein sequence (known as the query sequence) into shorter segments and then searches the database for significant matches to any of the stored sequences. The matching program assigns a high score to identically matched amino acids and a lower score to matches between amino acids that are related (e.g., hydrophobic, polar, positively charged, negatively charged) but not identical. When a significant match is found for a segment, the BLAST algorithm searches locally to extend the region of similarity. After searching is completed, the program ranks the matches between the query protein and various known proteins according to their p-values. This parameter is a measure of the probability of finding such a degree of similarity between two protein sequences by chance. The lower the p-value, the greater the sequence similarity between two sequences. A p-value less than about 10⁻³ is usually considered significant evidence that two proteins share a common ancestor. Many alternative computer programs have been developed that can detect relationships between proteins that are more distantly related to each other than can be detected by BLAST. The development of such methods is currently an active area of bioinformatics research.
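The seed-and-extend idea described above can be sketched in a few lines of Python. This is only a toy illustration, not the BLAST implementation: the residue groups, the scores (5 for identical, 1 for related, -2 otherwise), the word length, and the function names are all assumptions chosen for the example, whereas real BLAST uses substitution matrices, allows near-matching seed words, and computes statistical significance.

```python
# Toy seed-and-extend search. Groups and scores are illustrative assumptions,
# not the substitution matrices used by the real BLAST program.
GROUPS = {
    "hydrophobic": set("AVLIMFWC"),
    "polar": set("STNQYGP"),
    "positive": set("KRH"),
    "negative": set("DE"),
}
IDENTICAL, SIMILAR, MISMATCH = 5, 1, -2


def pair_score(a, b):
    """High score for identical residues, lower score for related ones."""
    if a == b:
        return IDENTICAL
    for members in GROUPS.values():
        if a in members and b in members:
            return SIMILAR
    return MISMATCH


def _extend_one_way(query, subject, q, s, step, drop):
    """Walk in one direction; return (best score gain, residues kept)."""
    gain = best = 0
    length = kept = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        gain += pair_score(query[q], subject[s])
        length += 1
        if gain > best:
            best, kept = gain, length
        elif best - gain > drop:
            break  # score fell too far below the best seen; stop extending
        q += step
        s += step
    return best, kept


def seed_and_extend(query, subject, word=3, drop=6):
    """Return (score, query_start, subject_start, length) of the best
    ungapped local match that grows out of an exact word match."""
    best_hit = (0, -1, -1, 0)
    for qi in range(len(query) - word + 1):
        seed = query[qi:qi + word]
        si = subject.find(seed)
        while si != -1:
            seed_score = sum(pair_score(c, c) for c in seed)
            right, r_len = _extend_one_way(
                query, subject, qi + word, si + word, +1, drop)
            left, l_len = _extend_one_way(
                query, subject, qi - 1, si - 1, -1, drop)
            hit = (seed_score + left + right,
                   qi - l_len, si - l_len, word + l_len + r_len)
            best_hit = max(best_hit, hit)
            si = subject.find(seed, si + 1)
    return best_hit


# Made-up sequences, used only to exercise the functions.
print(seed_and_extend("MKVLAAGIT", "GGMKILAAGLSPP"))
```

With these made-up sequences the seed word LAA is extended in both directions until the whole query is aligned against positions 2 to 10 of the subject (counting from 0). Ranking hits from many database sequences by such scores, and converting the scores to probabilities, is the part this sketch leaves out.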
To illustrate the power of this sequence comparison approach, let’s consider the human gene NF1. Mutations in NF1 are associated with the inherited disease neurofibromatosis 1, in which multiple tumors develop in the peripheral nervous system, causing large protuberances in the skin. After a cDNA clone of NF1 was isolated and sequenced, the deduced sequence of the NF1 protein was checked against all other protein sequences in GenBank. A region of NF1 protein was discovered to have considerable homology to a portion of the yeast protein called Ira (Figure 1). Previous studies had shown that Ira is a GTPase-activating protein (GAP) that modulates the GTPase activity of the monomeric G protein called Ras. As we examine in detail in Chapter 16, GAP and Ras proteins normally function to control cell replication and differentiation in response to signals from neighboring cells. Functional studies on the normal NF1 protein, obtained by expression of the cloned wild-type gene, showed that it did, indeed, regulate Ras activity, as suggested by its homology with Ira. These findings suggest that patients with neurofibromatosis express a mutant NF1 protein in cells of the peripheral nervous system, resulting in abnormally high signaling through the Ras protein, which in turn leads to excessive cell division and formation of the tumors characteristic of the disease.
Fig1. Comparison of the regions of human NF1 protein and S. cerevisiae Ira protein that show significant sequence similarity. The NF1 and the Ira sequences are shown on the top and bottom lines of each row, respectively, in the one-letter amino acid code. Amino acids that are identical in the two proteins are highlighted in dark blue. Amino acids with chemically similar but nonidentical side chains are highlighted in light blue. Black dots indicate “gaps” in the upper and lower protein sequences, inserted in order to maximize the alignment of homologous amino acids. The BLAST p-value for these two sequences is 10⁻²⁸, indicating a high degree of similarity. [Data from G. Xu et al., 1990, Cell 62:599.]
Even when the BLAST algorithm finds no significant similarities, a query sequence may nevertheless share a short sequence with known proteins that is functionally important. Such short segments recurring in many different proteins, referred to as structural motifs, generally have similar functions. To search for these and other motifs in a new protein, researchers compare the query protein sequence with a database of known motif sequences.
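A motif search of this kind can be approximated with regular expressions. The sketch below assumes a tiny, hypothetical motif table holding two well-known PROSITE-style patterns (the N-glycosylation site N-{P}-[ST]-{P} and the P-loop [AG]-x(4)-G-K-[ST]) rewritten as Python regular expressions; the function name and the test protein are made up for illustration, and real motif databases hold far more entries and also use profile-based methods.

```python
import re

# Hypothetical mini motif table: motif name -> regular expression.
MOTIFS = {
    "N-glycosylation site": r"N[^P][ST][^P]",
    "P-loop (Walker A)": r"[AG].{4}GK[ST]",
}


def scan_motifs(protein, motifs=MOTIFS):
    """Return (motif name, 1-based start position, matched text) tuples,
    sorted by position. A lookahead is used so overlapping hits are found."""
    hits = []
    for name, pattern in motifs.items():
        for m in re.finditer(f"(?=({pattern}))", protein):
            hits.append((name, m.start() + 1, m.group(1)))
    return sorted(hits, key=lambda hit: hit[1])


# Made-up protein fragment, used only to exercise the function.
for hit in scan_motifs("MAGESGAGKSTLNNTSA"):
    print(hit)
```

Because such patterns are short, they can match by chance, so a hit is best read as a hint to be tested experimentally rather than as proof of function.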
Comparison of Related Sequences from Different Species Can Give Clues to Evolutionary Relationships Among Proteins
BLAST searches for related protein sequences may reveal that proteins belong to a protein family. Earlier, we considered gene families in a single organism, using the β-globin genes in humans as an example. But in a database that includes the genomic sequences of multiple organisms, protein families can also be recognized as being shared among related organisms. Consider, for example, the tubulin proteins, the basic subunits of microtubules, which are important components of the cytoskeleton. According to the simplified scheme in Figure 2a, the earliest eukaryotic cells are thought to have contained a single tubulin gene that was duplicated early in evolution; subsequent divergence of the different copies of the original tubulin gene formed the ancestral versions of the α- and β-tubulin genes. As different species diverged from these early eukaryotic cells, each of these gene sequences further diverged, giving rise to the slightly different forms of α-tubulin and β-tubulin now found in each species.
Fig2. Generation of diverse tubulin sequences during the evolution of eukaryotes. (a) Probable mechanism giving rise to the tubulin genes found in existing species. It is possible to deduce that a gene duplication event occurred before speciation because the α-tubulin sequences from different species (e.g., humans and yeast) are more alike than are the α-tubulin and β-tubulin sequences within a species. (b) A phylogenetic tree representing the relationship between the tubulin sequences. The branch points (nodes), indicated by small numbers, represent common ancestral genes at the time that two sequences diverged. For example, node 1 represents the duplication event that gave rise to the α-tubulin and β-tubulin families, and node 2 represents the divergence of yeast from multicellular species. Braces and arrows indicate, respectively, the orthologous tubulin genes, which differ as a result of speciation, and the paralogous genes, which differ as a result of gene duplication. This diagram is simplified somewhat because flies, worms, and humans actually contain multiple α-tubulin and β-tubulin genes that arose from later gene duplication events.
All the different members of the tubulin family of genes (and proteins) are sufficiently similar in sequence to suggest a common ancestral sequence. Thus all these sequences are considered to be homologous. More specifically, sequences that presumably diverged as a result of gene duplication (e.g., the α- and β-tubulin sequences) are described as paralogous. Sequences that arose because of speciation (e.g., the α-tubulin genes in different species) are described as orthologous. From the degree of sequence relatedness of the tubulins present in different organisms today, evolutionary relationships can be deduced, as illustrated in Figure 2b. Of the three types of sequence relationships, orthologous sequences are the most likely to share the same function.
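The reasoning behind these terms can be illustrated with a small calculation. In the sketch below, the four sequences are made-up, pre-aligned fragments (not real tubulin sequences) and the function names are hypothetical; the classification rule assumes the single duplication followed by speciation shown in the simplified scheme of Figure 2a.

```python
def percent_identity(a, b):
    """Percent of identical positions in two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / len(a)


def relationship(gene_a, gene_b):
    """Classify two genes given as (family, species) pairs, assuming one
    ancient duplication (alpha vs. beta) that preceded all speciation."""
    (family_a, species_a), (family_b, species_b) = gene_a, gene_b
    if family_a != family_b:
        return "paralogous"   # last common ancestor is the duplication
    if species_a != species_b:
        return "orthologous"  # last common ancestor is a speciation event
    return "same gene"


# Made-up 10-residue fragments, NOT real tubulin sequences.
toy = {
    ("alpha", "human"): "ACDEFGHIKL",
    ("alpha", "yeast"): "ACDEFGHVKL",
    ("beta", "human"): "ACNEWGHIRM",
    ("beta", "yeast"): "ACNEWGHLRM",
}

reference = ("alpha", "human")
for gene, seq in toy.items():
    if gene != reference:
        print(gene, relationship(reference, gene),
              percent_identity(toy[reference], seq))
```

In this toy data the orthologous pair is 90 percent identical while the paralogous pairs are less similar, which mirrors the observation that α-tubulins from different species resemble each other more than α- and β-tubulin from the same species do.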
Genes Can Be Identified Within Genomic DNA Sequences
The complete genomic sequence of an organism contains within it the information needed to deduce the sequence of every protein made by the cells of that organism. For organisms such as bacteria and yeast, whose genomes have few introns and short intergenic regions, most protein-coding sequences can be found simply by scanning the genomic sequence for open reading frames (ORFs) of significant length. An ORF is usually defined as a stretch of DNA containing at least 100 codons that begins with a start codon and ends with a stop codon. Because the probability that a random DNA sequence will contain no stop codons for 100 codons in a row is very small, most ORFs encode proteins.
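A minimal ORF scan along these lines might look like the following sketch. It makes several simplifying assumptions that are not in the text: only the three forward reading frames are scanned (a fuller search would also scan the reverse complement), ATG is the only start codon, the codon count excludes the stop codon, and the chance calculation treats all 64 codons as equally likely. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(dna, min_codons=100):
    """Return (start, end, frame) for forward-strand ORFs.

    start is the 0-based index of the ATG, end is the index just past the
    stop codon, and min_codons counts codons from the start codon up to,
    but not including, the stop codon."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


def prob_no_stop(n_codons):
    """Chance that n random codons contain no stop codon, assuming all
    64 codons are equally likely (61 of them are not stops)."""
    return (61 / 64) ** n_codons


# Made-up sequence: a 5-codon ORF, found only when the threshold is lowered.
toy_dna = "CC" + "ATG" + "GCT" * 4 + "TAA" + "GG"
print(find_orfs(toy_dna, min_codons=5))
print(prob_no_stop(100))
```

Under the equal-frequency assumption, the chance of 100 consecutive codons without a stop is (61/64)^100, a little under 1 percent, which is why long ORFs are unlikely to arise at random.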
ORF analysis correctly identifies more than 90 percent of the genes in yeast and bacteria. Some of the very shortest genes, however, are missed by this method, and occasionally long open reading frames that are not actually genes arise by chance. Both types of mis-assignments can be corrected by more sophisticated analysis of the sequence and by genetic tests for gene function. Of the Saccharomyces genes identified in this manner, about half were already known by some functional criterion such as mutant phenotype. The functions of some of the proteins encoded by the remaining putative (suspected) genes identified by ORF analysis have been assigned based on their sequence similarity to known proteins in other organisms.
Identification of genes in organisms with a more complex genome structure requires more sophisticated algorithms than searching for open reading frames. Because most genes in higher eukaryotes are composed of multiple, relatively short exons separated by often quite long noncoding introns, scanning for ORFs is a poor method for finding genes in these organisms. The best gene-finding algorithms combine all the available data that might suggest the presence of a gene at a particular genomic site. Relevant data include alignment of the query sequence to a full-length cDNA sequence; alignment to a partial cDNA sequence, generally 200–400 bp in length, known as an expressed sequence tag (EST); fitting to models for exon, intron, and splice-site sequences; and sequence similarity to genes from other organisms. Using these computer-based bioinformatic methods, computational biologists have identified approximately 21,000 protein-coding genes in the human genome.
A particularly powerful method for identifying human genes is to compare the human genomic sequence with that of the mouse. Humans and mice are sufficiently related to have most genes in common, although largely nonfunctional DNA sequences, such as intergenic regions and introns, tend to be very different because these sequences are not under strong selective pressure. Thus corresponding segments of the human and mouse genome that exhibit high sequence similarity are likely to be functionally important: exons, transcription-control regions, or sequences with other functions that are not yet understood.
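One simple way to picture this comparison is a sliding window over two already-aligned sequences that reports windows of high identity. The sketch below is an illustration under stated assumptions, not a description of any genome browser or published pipeline: the function name, the window size, the 80 percent threshold, and the toy sequences are all hypothetical, and producing the alignment itself is outside its scope.

```python
def conserved_windows(seq_a, seq_b, window=10, min_identity=0.8):
    """Return (start, identity) for each window of two pre-aligned
    sequences whose fraction of identical bases is at least min_identity.
    Alignment gaps ('-') never count as matches."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to equal length")
    matches = [a == b and a != "-" for a, b in zip(seq_a, seq_b)]
    hits = []
    for start in range(len(matches) - window + 1):
        identity = sum(matches[start:start + window]) / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits


# Made-up aligned fragments: a conserved block followed by a diverged one.
toy_human = "ACGTACGTAC" + "TTTTTTTTTT"
toy_mouse = "ACGTACGTAC" + "GGGGGGGGGG"
print(conserved_windows(toy_human, toy_mouse))
```

In the toy input only the windows overlapping the first ten bases pass the threshold, which is the pattern one would expect where an exon or control region sits next to rapidly diverging intergenic or intronic DNA.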
The Number of Protein-Coding Genes in an Organism’s Genome Is Not Directly Related to Its Biological Complexity
The combination of genomic sequencing and gene-finding computer algorithms has yielded the complete inventory of protein-coding genes for a variety of organisms. Figure 3 shows the total number of protein-coding genes in several eukaryotic genomes that have been completely sequenced. The functions of about half the proteins encoded in these genomes are known or have been predicted on the basis of sequence comparisons. One of the surprising features of this comparison is that the number of protein-coding genes within different organisms does not seem proportional to our intuitive sense of their biological complexity. For example, the roundworm C. elegans apparently has more genes than the fruit fly Drosophila, which has a much more complex body plan and more complex behavior. And humans have only about 5 percent more protein-coding genes than C. elegans. When it first became apparent that humans have so few more protein-coding genes than the simple roundworm, it was difficult to understand how such a small increase in the number of proteins could generate such a staggering difference in complexity.
Fig3. Comparison of the number and types of proteins encoded in the genomes of different eukaryotes. For each organism, the area of the entire pie chart represents the total number of protein-coding genes, all shown at roughly the same scale. In most cases, the functions of the proteins encoded by about half the genes are still unknown (light blue). The functions of the remainder are known or have been predicted by sequence similarity to genes of known function. [Data from ENCODE Project Consortium, 2012, Nature 489:57; J. D. Hollister, 2014, Chromosome Res. 22:103; L. W. Hillier et al., 2005, Genome Res. 15:1651; FlyBase: FB2015_02 Release Notes, http://flybase.org/static_pages/docs/release_notes.html; Saccharomyces Genome Database 2015, http://www.yeastgenome.org/genomesnapshot.]
Clearly, simple quantitative differences in the number of protein-coding genes in the genomes of different organisms are inadequate for explaining differences in biological complexity. However, several phenomena can generate more complexity in the expressed proteins of higher eukaryotes than is predicted from their genomes. First, alternative splicing of a pre-mRNA can yield multiple functional mRNAs corresponding to a particular gene. In humans, the mean number of alternatively spliced mRNAs expressed per gene is about 6. Second, variations in the post-translational modification of many proteins may produce functional differences. Finally, increased biological complexity results from increased numbers of cells built of the same kinds of proteins. Larger numbers of cells can interact in more complex combinations, as we can see by comparing the cerebral cortices of mouse and human. Similar cells are present in the mouse and in the human cerebral cortex, but in humans more of them make more complex connections. Evolution of the increasing biological complexity of multicellular organisms probably required increasingly complex regulation of cell replication and temporal and spatial regulation of gene expression in the cells that make up the organisms, leading to increasing complexity of embryological development.
The specific functions of many genes and proteins identified by analysis of genomic sequences still have not been determined. As researchers unravel the functions of individual proteins in different organisms and further detail their interactions with other proteins, the resulting advances will become immediately applicable to all homologous proteins in other organisms. When the function of every protein is known, no doubt, a more sophisticated understanding of the molecular basis of complex biological systems will emerge.