~~~ Bug's Bones Bared! ~~~



Well folks, it's finally been done. For the first time in history the complete DNA sequence of a free-living organism has been determined. There are no more secrets for this bug - its bones are bared! In the 28 July 1995 issue of the journal Science, vol. 269, page 496, J. Craig Venter's research team (there are 40 authors listed!) at the Institute for Genomic Research in Gaithersburg, Maryland, in collaboration with Nobel laureate Hamilton Smith of Johns Hopkins University in Baltimore, reports the determination of the linear order of appearance of every single nucleotide in the entire amount of DNA (the genome) present in the bacterium Haemophilus influenzae Rd. Every single gene in this organism has been identified and sequenced - more than 1,700 of them! (also see What the Heck is a Gene?). This accomplishment is astonishing, and the impact on our understanding of gene function, cellular biochemistry, and molecular biology will predictably be significant.

The bacterium Haemophilus influenzae is a human pathogen (it can cause disease) and may lead to meningitis (see Mankato!, this series). In this case, a benign (non-harmful) laboratory strain was used for the project. According to the authors, H. influenzae Rd was selected for this project because "its genome size, 1.8 Mb (1.8-million bases) is typical for bacteria, its G + C (Guanine/Cytosine) base composition (38%) is close to that of a human, and a physical clone map did not exist." Of course, 1.8-million bases is a lot of bases! However, if this organism's sequence could be determined, then the likelihood of determining other complete bacterial genome sequences would theoretically be high. The G + C base composition is the percentage of these two bases relative to all of the bases present in a given genome, including the two other bases present in all DNA, namely Adenine and Thymine. As it turns out, various species may be identified in part by their relative G + C content. No one yet understands why this type of base composition consistency within the DNA of a given species is maintained.
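
For readers who like to see the arithmetic, here is a minimal sketch of how a G + C percentage could be computed from a sequence of letters. The function name gc_content and the ten-letter example sequence are made up for illustration; they do not come from the Science paper.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of bases in `sequence` that are G or C."""
    bases = sequence.upper()
    if not bases:
        raise ValueError("sequence must not be empty")
    gc_count = sum(1 for base in bases if base in "GC")
    return gc_count / len(bases)


# Hypothetical ten-base sequence: 4 of its 10 bases are G or C.
example = "ATGCATTAGC"
print(f"G + C content: {gc_content(example):.0%}")  # prints "G + C content: 40%"
```

Run over all 1.8-million bases of a genome instead of ten, the same count-and-divide step gives the kind of figure the authors quote.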

One of the particularly intriguing aspects of this major feat is the methodology used to obtain the data - a methodology which is directly applicable (with perhaps some modifications) to the determination of any organism's complete genome sequence. These researchers used a computer program to scan thousands of random-fragment sequences to look for common sequences among them, and then to overlap those fragments which shared a precise order of appearance of individual bases. Eventually, the entire sequence was assembled. Normally, sequence determination is extremely laborious. Imagine a very long string of beads which consists of four differently-colored beads that appear randomly throughout the string. Each of the 4 beads may also be repeated next to one another in runs of random length. This arrangement represents the genetic code which involves the four bases, A, G, C, and T. If one used a pair of scissors that would cut the string only between red/black beads, and a different pair of scissors which would cut the string only between green/blue beads, one could envision the generation of pieces which varied in length as well as in composition of the different colors. Further, a cut site for scissors #1 may also lie within a fragment produced by scissors #2, and a cut site for scissors #2 may also lie within a fragment produced by scissors #1. Consequently, there will be fragments produced by each scissors cut which will contain a linear order of beads common to those of other fragments, and these common beads represent the overlap regions of the different fragments. If one then determined the precise order of the colors for each fragment produced by the first scissors' cut, and compared this fragment's color order to the order of colors produced by the second scissors' cut, one could see that it would be possible for some of the fragments to overlap - to have some precise linear order of colors in common.
Theoretically then, such fragments could be joined to regenerate the original order for that section of the string of beads. Now, if many different scissors were used, each able to cut the string only at certain places, by comparing the order of bead colors in each of the many fragments one could find fragments that overlapped. Thus, the entire string of beads could be rejoined. One could do this without ever knowing the original order of appearance of the beads. This method is essentially the method currently favored.
The DNA is cut into relatively large random-length pieces through the use of enzymes (isolated from various bacterial species) called restriction endonucleases, which "cut" (break) the bonds between certain pairs of bases (A,T,C,G) to generate pieces of DNA. Each of these enzymes can "recognize" a unique molecular shape (see What the Heck is an Enzyme?, this series) formed by a particular order of appearance of these bases. The ends of these large DNA fragments are then sequenced, and the pieces arranged according to their sequence overlap. So, one might see xxxx..... and ------xxxx, to be re-joined to form ------xxxx..... Now, each of these large pieces is essentially "shotgunned" into many much smaller fragments by the use of several enzymes at once; then, each of these smaller fragments is sequenced, overlapped, and compared for overlaps with the larger fragments. Finally, the entire string can be assembled.
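
The "scissors that cut only at certain places" idea can be sketched in a few lines. The article does not name any particular enzyme, so as an illustrative assumption this sketch uses EcoRI, a well-known restriction endonuclease that recognizes GAATTC and cuts between the G and the first A. The function name digest and the short DNA string are hypothetical, and only one strand is modelled.

```python
def digest(sequence: str, site: str, cut_offset: int) -> list[str]:
    """Cut `sequence` at every occurrence of the recognition `site`.

    `cut_offset` is how many bases into the site the cut falls
    (1 for G^AATTC). Only one strand is modelled, a simplification.
    """
    fragments = []
    start = 0
    position = sequence.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        position = sequence.find(site, position + 1)
    fragments.append(sequence[start:])  # whatever is left after the last cut
    return fragments


# Hypothetical stretch of DNA containing two GAATTC sites.
dna = "TTGAATTCAAGGCCGAATTCTT"
pieces = digest(dna, "GAATTC", 1)
print(pieces)  # ['TTG', 'AATTCAAGGCCG', 'AATTCTT']
```

A second enzyme with a different recognition site would cut the same string in different places, which is what produces the overlapping fragments described above.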
Instead of this procedure, Venter's research team used mechanical shearing to shotgun the entire genome of the bacterium first, to generate many, many small fragments of DNA (about 1.6 to 2.0 kb in size). Each of these fragments was sequenced, and the data fed into the computer with no previous attempt to manually determine overlapping segments. The computer program scanned and scanned all of these data to "look" for overlapping regions (parts of each fragment with an identical linear order of letters). It is highly unlikely that a segment of any one fragment would share the identical order of letters with a segment of a different fragment unless the two fragments overlapped. Envision the following two fragments with the ------------ sequence also identified:


                                       ATCGATCG---------------AAAATCAGT
       AACTGCGGGG---------------ATCGATCG

Overlap the sequences in common to form:
                              
                                     ATCGATCG---------------AAAATCAGT
            AACTGCGGGG---------------ATCGATCG

and therefore, by placing the fragment sequences directly underneath one another, and aligning identical stretches of letters of the fragments, one could obtain for the original sequence, the following sequence:
            AACTGCGGGG---------------ATCGATCG---------------AAAATCAGT 
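
The joining step just pictured can be written as a small sketch: find the longest stretch at the end of one fragment that matches the start of the other, then glue the two together so the shared stretch appears only once. The function names overlap_length and merge are hypothetical, and the dashes are kept as stand-in characters exactly as in the picture. This is a toy illustration, not the program Venter's team used.

```python
def overlap_length(left: str, right: str) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge(left: str, right: str) -> str:
    """Join two fragments, writing their shared overlap only once."""
    k = overlap_length(left, right)
    return left + right[k:]


GAP = "-" * 15  # stand-in for the identified-but-not-shown sequence
left = "AACTGCGGGG" + GAP + "ATCGATCG"
right = "ATCGATCG" + GAP + "AAAATCAGT"

print(overlap_length(left, right))  # 8, the shared ATCGATCG
print(merge(left, right))
```
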
After assembly upon assembly, the final fragment was joined to form the entire genome sequence. What I have depicted is a much-simplified picture of the lengths of the overlaps, and the positions of the overlaps (ends). The number of bases actually available for comparison can vary greatly. The more bases being compared for overlap, the more confidence one will have that the sequences actually do overlap. Further, the more fragments generated which have partial identity to other fragments, the more confidence one will have in the arrangement of the order of the bases and consequently the order of fragments.
Try this at home with your kids, a friend, or by yourself.... tape a bunch of narrow strips of paper together to generate two long, narrow strips. Now, using 5 or 6 differently-colored marking pens, color reasonably short (your choice as to how short "short" is) sections of each strip identically in random lengths and in different colors randomly throughout (remember that each of the two long strips must have identical patterns, though). For safety <grin>, starting at one end of each strip, consecutively number each of these sections, measure the length of each color section, and record this information. Then, do the typical "drawer-hunt" for some scissors (No!.... not the "good" ones - put those back!) and just start cutting both strips - each in different places, to generate a bunch of fragments. The more cutting - the smaller the fragments - the more work for you later <grin>. After cutting, place all of the fragments into a bowl and mix them up. Then, begin to assemble the strips by placing "like" colors underneath one another. Make yourself an assembly tree - it is probably easiest to place the longest piece at the top. Eventually, you will be able to re-assemble an overall color-pattern, with each of the bands of color in the same length and order of appearance as the original strip. While it is fun to solve the puzzle, it's kind of a pain to do, isn't it?
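
If you would rather let a computer do the paper-strip puzzle, here is a rough sketch under some loudly stated assumptions. It uses a simple "greedy" strategy (always join the two pieces with the longest overlap), which is a stand-in chosen for illustration and not the program Venter's team used. The 30-letter sequence, the cut positions, and the names overlap_length, drop_contained and greedy_assemble are all made up. The min_overlap setting reflects the confidence point above: if the four bases were independent and equally likely, a particular stretch of k letters would match by chance with probability (1/4)^k, so very short matches are ignored.

```python
import random


def overlap_length(left: str, right: str, min_overlap: int) -> int:
    """Longest suffix of `left` matching a prefix of `right` (0 if too short)."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def drop_contained(pieces: list[str]) -> list[str]:
    """Remove duplicates and pieces that lie wholly inside another piece."""
    unique = list(dict.fromkeys(pieces))
    return [p for p in unique if not any(p != q and p in q for q in unique)]


def greedy_assemble(fragments: list[str], min_overlap: int = 4) -> list[str]:
    """Repeatedly join the pair of pieces with the longest overlap."""
    pieces = drop_contained(fragments)
    while len(pieces) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, left in enumerate(pieces):
            for j, right in enumerate(pieces):
                if i == j:
                    continue
                k = overlap_length(left, right, min_overlap)
                if k > best_k:
                    best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left that overlaps well enough
        merged = pieces[best_i] + pieces[best_j][best_k:]
        rest = [p for n, p in enumerate(pieces) if n not in (best_i, best_j)]
        pieces = drop_contained(rest + [merged])
    return pieces


# Two identical "strips", each cut in different places (hypothetical data).
original = "GATTACAGGCTCGAAGTCCATGCGTTAACC"
strip_one = [original[0:9], original[9:20], original[20:30]]
strip_two = [original[0:5], original[5:14], original[14:25], original[25:30]]

bowl = strip_one + strip_two
random.Random(0).shuffle(bowl)  # mix the fragments up in the bowl

contigs = greedy_assemble(bowl)
print(contigs == [original])  # True: the strip is rebuilt from the pieces
```

The toy sequence was chosen so that no stretch of four letters occurs twice. Real genomes contain repeated stretches, which is where a simple greedy approach like this one can go wrong.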
Imagine doing such a thing by interpreting thousands of different fragments (DNA sequencing). But, a similar sort of thing has been accomplished many, many times recently, to provide information about why a particular gene leads to a dysfunctional protein in various diseases - or why certain disease-causing organisms are able to hurt us - which hopefully will allow us to correct certain genetic defects and to design drugs which will kill the little critters which harm us.
In a very nicely written accompanying article by Rachel Nowak (of Science), the potential impact of this accomplishment is discussed. She mentions several things, among them: the use of this information to compare entire sequences among organisms to help understand evolutionary relationships; the potential to determine the genetic reason why some bacteria are harmful to us (genes which allow infection or are responsible for the production of harmful products); identification of useful enzymes; and identification of new antibiotics. Already there are new questions which have been generated by this accomplishment. For example, over 700 genes within H. influenzae Rd are unrelated to any genes described to date, while over 1,000 genes within this bacterium are related to known genes. Also, these data revealed for the first time that this particular laboratory strain of H. influenzae Rd does not have three enzymes which are important to energy production for the cell. This particular strain of bacterium apparently must use critical by-pass pathways in order to survive. In addition, since all of the genes are now known, by specifically removing certain genes one-at-a-time (this can be done) one may precisely determine the effect of each removal on this bacterium as it relates to the expression of other genes, growth, and metabolism. One may eventually determine, at least for this organism, exactly how each gene relates to all remaining genes within the bacterium - an enormously important potential for knowledge.

Book: Don't Touch That Doorknob!

Copyright John C. Brown, 1995