~~~ Bug's Bones Bared!
~~~
Well folks, it's finally been done. For the first time in history the
complete DNA sequence of a free-living organism has been determined.
There are no more secrets for this bug - its bones are bared! In the 28
July 1995 issue of
the journal Science, vol. 269, page 496, J. Craig Venter's
research team (there are 40 authors listed!) at the Institute for Genomic
Research in Gaithersburg, Maryland, in collaboration with Nobel laureate
Hamilton Smith of Johns Hopkins University in Baltimore, reports the
determination of
the linear order of appearance of every single nucleotide in the entire
amount of DNA (the genome) present in the bacterium Haemophilus
influenzae Rd. Every single gene in this organism has been
identified and sequenced - more than 1,700 of them!
(also see What the Heck is a Gene?).
This accomplishment is astonishing, and its impact on our understanding
of gene function, cellular biochemistry, and molecular biology will
surely be significant.
The bacterium Haemophilus
influenzae is a human pathogen (it can cause disease), and infection
may lead to meningitis (see Mankato!,
this series). For this project, a benign (non-harmful) laboratory
strain was used. According to the authors,
H. influenzae Rd was selected because "its genome
size, 1.8 Mb (1.8-million bases) is typical for bacteria, its G + C
(Guanine/Cytosine) base composition (38%) is close to that of a human,
and a physical clone map did not exist." Of course, 1.8-million bases is
a lot of bases! But because that size is typical, if this organism's
sequence could be determined, then the likelihood of determining other
complete bacterial genome sequences would theoretically be high. The
G + C base composition is the percentage of these two bases relative to
all of the bases present in a given genome, including the two other bases
present in all DNA, namely Adenine and Thymine. As it turns out, various
species may be identified in part by their relative G + C content. No
one yet understands why the base composition stays so consistent within the
DNA of a given species.
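To make the G + C figure concrete, here is a minimal Python sketch of how such a percentage could be computed from a string of bases. The function name gc_content and the 20-base fragment are made up for illustration; neither comes from the Science paper.

```python
def gc_content(sequence):
    """Return the fraction of bases in `sequence` that are G or C."""
    # Keep only the four DNA letters, so gaps or unknown symbols are ignored.
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G or T bases")
    gc = sum(1 for b in bases if b in "GC")
    return gc / len(bases)


# A made-up 20-base fragment, NOT real H. influenzae sequence.
fragment = "ATGCATTAGCTAATTCGATA"
print(f"G + C content: {gc_content(fragment):.0%}")
```

Run over a whole genome instead of a toy fragment, the same count is what produces a figure like the 38% quoted by the authors.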
One of the particularly intriguing aspects of this major
feat is the methodology used to obtain the data - a methodology which is
directly applicable (with perhaps some modifications) to the
determination of any organism's complete genome sequence. These
researchers used a computer program to scan thousands of random-fragment
sequences to look for common sequences among them, and to then
overlap those fragments which shared a precise order of appearance
of individual bases. Eventually, the entire sequence was
assembled. Normally, sequence
determination is extremely laborious. Imagine a very long string of beads
in four different colors that appear randomly
throughout the string. A bead of any one color may also be repeated in
runs of random length. This arrangement represents a DNA sequence, which
is written in the four bases, A, G, C, and T. If one used a pair of scissors
that would cut the string only between red/black beads, and a
different pair of scissors which would cut the string only between
green/blue beads, one could envision the generation of pieces which
varied in length as well as composition of the different colors.
Further, a cut site for scissors #1 may also lie within a fragment
produced by scissors #2. And a cut site for scissors #2 may also lie
within a fragment produced by scissors #1. Consequently, there will be
fragments produced by each scissors cut which will contain a linear order
of beads common to those of other fragments, and these common beads
represent the overlap regions of the different fragments. If one
then determined the precise order of the colors for each fragment produced
by the first scissors, and compared that order to the
order of colors in each fragment produced by
the second scissors, one could see that it would be possible for
some of the fragments to overlap - to have some precise linear order
of colors in common. Theoretically then, such fragments could be joined
to regenerate the original order for that section of the string of beads.
Now, if many different scissors were used, each able to cut the string
only at certain places, by comparing the order of bead colors in each of
the many fragments one could find fragments
that overlapped. Thus,
the entire string of beads could be rejoined. One could do this without
ever knowing the original order of appearance of the beads. This
is essentially the method currently favored.
The DNA is cut into relatively large random-length
pieces through the use of enzymes (isolated from various bacterial
species) called restriction endonucleases, which "cut"
(break) the bonds between certain pairs of bases (A,T,C,G). Each of these enzymes can
"recognize" a unique molecular shape
(see What the Heck is an
Enzyme?, this series) formed by a particular order of
appearance of these bases. The ends of these large DNA fragments are
then sequenced, and the pieces arranged according to their
sequence overlap. So, one might see xxxx..... and ------xxxx, to be
re-joined to form ------xxxx..... Now, each of these large pieces
is essentially "shotgunned" into many much smaller fragments by
the use of several enzymes at once; then, each of these smaller
fragments is sequenced, overlapped, and compared for overlaps with the
larger fragments. Finally, the entire string can be assembled.
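For readers who like to tinker, the "scissors" idea can be sketched in Python. This is only an illustration under some loud assumptions: the function name digest and the 26-base DNA string are made up, and the sketch treats DNA as a single string of letters even though real DNA is double-stranded. EcoRI and BamHI are real restriction endonucleases that recognize GAATTC and GGATCC and cut after the first G; the enzymes actually used in any given sequencing project may well differ.

```python
def digest(dna, site, cut_offset):
    """Cut `dna` at every occurrence of the recognition `site`.

    `cut_offset` is how many bases into the site the cut falls.
    """
    fragments = []
    start = 0
    pos = dna.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(dna[start:cut])
        start = cut
        pos = dna.find(site, pos + 1)
    fragments.append(dna[start:])
    return [f for f in fragments if f]  # drop empty pieces at the very ends


# A made-up piece of DNA containing two EcoRI sites and one BamHI site.
dna = "AAGAATTCTTGGATCCAAGAATTCTT"

scissors_1 = digest(dna, "GAATTC", 1)  # EcoRI: G^AATTC
scissors_2 = digest(dna, "GGATCC", 1)  # BamHI: G^GATCC

print("scissors #1:", scissors_1)
print("scissors #2:", scissors_2)
```

Notice that the cut made by scissors #2 falls inside the middle fragment made by scissors #1, so pieces from the two digests share stretches of letters. Those shared stretches are the overlaps used to put the pieces back in order.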
Instead of this procedure, Venter's research team used mechanical
shearing to shotgun the entire
genome of the bacterium first, to generate many, many small fragments of
DNA (about 1.6 to 2.0 kb in size). Each of these fragments was sequenced,
and the data fed into
the computer with no previous attempt to manually determine overlapping
segments. The computer program scanned and scanned all of these data to
"look" for overlapping regions (parts of each fragment with an
identical linear order of letters). It is highly unlikely that a segment of
any one fragment would share an identical order of letters with
a different fragment unless the two fragments overlapped. Envision the
following two fragments with the ------------ sequence also
identified:
ATCGATCG---------------AAAATCAGT
AACTGCGGGG---------------ATCGATCG
Overlap the sequences in common to form:
                         ATCGATCG---------------AAAATCAGT
AACTGCGGGG---------------ATCGATCG
and therefore, by placing the fragment sequences directly underneath one
another, and aligning identical stretches of letters of the fragments, one
could obtain the following as the original sequence:
AACTGCGGGG---------------ATCGATCG---------------AAAATCAGT
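The two-fragment example above can be sketched in a few lines of Python. This is not the assembly program Venter's team used, just a toy illustration: the function names overlap_length and merge and the minimum overlap of 4 letters are assumptions made here, and the run of dashes is treated as ordinary letters standing in for the middle of each fragment.

```python
def overlap_length(left, right, min_overlap=4):
    """Length of the longest end of `left` that matches the start of `right`."""
    longest = min(len(left), len(right))
    for k in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge(left, right, min_overlap=4):
    """Join two fragments on their overlap, or return None if there is none."""
    k = overlap_length(left, right, min_overlap)
    if k == 0:
        return None
    return left + right[k:]  # keep the shared letters only once


GAP = "-" * 15  # stand-in for the middle of each fragment, as in the text

fragment_1 = "ATCGATCG" + GAP + "AAAATCAGT"
fragment_2 = "AACTGCGGGG" + GAP + "ATCGATCG"

print(merge(fragment_2, fragment_1))
```

The order matters: the end of the second fragment matches the start of the first, so merge(fragment_2, fragment_1) succeeds, while merge(fragment_1, fragment_2) finds nothing to join and returns None.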
After assembly upon assembly, the final fragment was joined to form the
entire genome sequence. What I have depicted is a much-simplified
picture of the lengths of the overlaps, and the positions of the
overlaps (ends). The number of bases actually available
for comparison can vary greatly. The more bases being compared for
overlap, the more confidence one will have that the sequences actually do
overlap. Further, the more fragments generated which have partial
identity to other fragments, the more confidence one has in the arrangement
of the order of the bases and consequently the order of fragments. Try
this at home with your kids, a friend, or
by yourself.... tape a bunch of narrow strips of paper together to
generate two long, narrow strips. Now, using 5
or 6 differently-colored marking pens, color reasonably short (your
choice as to how short "short" is) sections of each strip
in random lengths and in different colors randomly
throughout (remember that the two long strips must end up with
identical patterns, though). For safety <grin>, starting at
one end of each strip, consecutively number each of these sections,
measure the length of each colored section, and record this
information. Then, do the typical "drawer-hunt" for some
scissors (No!.... not the "good" ones - put those back!) and just
start cutting both strips - each in different places, to generate
a bunch of fragments. The more cutting - the smaller the fragments - the
more work for you later <grin>. After cutting, place all of
the fragments into a bowl and mix them up. Then, begin to assemble the
strips by placing "like" colors underneath one another. Make
yourself an assembly tree - probably easiest to place the longest piece
at the top. Eventually, you will be able to re-assemble an overall
color-pattern, with
each of the bands of color in the same length and order of appearance as
the original strip. While it's fun to solve the puzzle, it's kind of a
pain to do, isn't it? Imagine doing such a thing with
thousands of different fragments (DNA sequencing). Still, a similar
sort of thing has been accomplished many, many times recently, to provide
information about why a particular gene leads to a dysfunctional protein
in various diseases - or why certain disease-causing organisms are able
to hurt us - which hopefully will allow us to correct certain genetic
defects and to design drugs which will kill the little
critters which harm us.
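If scissors and marking pens are not handy, the paper-strip game can also be played on a computer. The Python sketch below is a toy version under stated assumptions, not the team's actual software: the "strip" is 200 random letters, the two copies are cut at fixed, staggered places (so every cut in one copy is spanned by a piece of the other), and the names cut and greedy_assemble and the minimum overlap of 12 letters are all made up for this illustration. The joining rule is a simple "greedy" one: always join the two pieces that share the longest overlap.

```python
import random


def overlap_length(left, right, min_overlap):
    """Length of the longest end of `left` that matches the start of `right`."""
    longest = min(len(left), len(right))
    for k in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def cut(strip, cut_points):
    """Cut `strip` at the given positions and return the pieces in order."""
    bounds = [0, *sorted(cut_points), len(strip)]
    return [strip[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]


def greedy_assemble(fragments, min_overlap):
    """Repeatedly join the two pieces that share the longest overlap."""
    # Drop duplicates and pieces that lie wholly inside a longer piece.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap_length(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left overlaps; return the separate stretches
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for n, f in enumerate(frags) if n not in (best_i, best_j)]
        frags.append(merged)
    return frags


rng = random.Random(1995)  # fixed seed so the game is repeatable
strip = "".join(rng.choice("ACGT") for _ in range(200))

copy_one = cut(strip, range(40, 200, 40))  # cuts at 40, 80, 120, 160
copy_two = cut(strip, range(20, 200, 40))  # cuts at 20, 60, 100, 140, 180

bowl = copy_one + copy_two
rng.shuffle(bowl)  # mix the pieces up in the bowl

contigs = greedy_assemble(bowl, min_overlap=12)
print(len(bowl), "pieces reassembled into", len(contigs), "strip(s)")
print("matches the original:", contigs == [strip])
```

Real shotgun data are messier than this: the cuts are random, the reads contain errors, and repeated stretches of DNA can fool a greedy rule, so treat this only as a picture of the basic idea.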
In a very nicely written accompanying article by Rachel Nowak (of
Science), the potential impact of this accomplishment is discussed.
She mentions several things, among them: the use of this information
to compare entire sequences among organisms to help understand
evolutionary relationships; the potential to determine the genetic
reason why some bacteria are harmful to us (genes which allow infection
or are responsible for making harmful products); identification of
useful enzymes; and, identification of new antibiotics. Already there
are new questions which have been generated by this accomplishment. For
example, over 700 genes within H. influenzae Rd are unrelated to
any genes described to date, while over 1,000 genes within this bacterium
are related to known genes. Also, these data revealed for the first time
that this particular laboratory strain of H. influenzae Rd does
not have three enzymes which are important to energy production
for the cell. This particular strain of bacterium apparently must use
critical by-pass pathways in order to survive. Also, since all of
the genes are now known, by specifically removing certain genes
one-at-a-time (can be done) one may precisely determine the
effect of this procedure on this bacterium as it relates to the
expression of other genes, growth, and metabolism. One may eventually
determine, at least for this organism, exactly how each gene relates to
all remaining genes within the bacterium - an enormously important
potential for knowledge.