Phylogeny and Reconstructing Phylogenetic Trees

Reconstruction Algorithms


The problem

Suppose all we know is how far apart the speces are as measured by the number of differences in their characters, that is, the entries in the mutation matrix. How can we reconstruct the phylogenetic tree? Of course, we might not be able to, since the mutations are random and need not reflect the actual distances between species, so a better question is: how can we reconstruct the most likely phylogenetic tree? This question could be made precise using statistical theory, but there is no guarantee that the problem is tractable. We can come up with some promising algorithms that should give trees that aren't too far away.

A solution: the "minimum" reconstruction method

It seems reasonable that the two species that share the greatest number of characters are the most closely related. That is, the smallest entry in the mutation matrix indicates which two species diverged most recently. Also, the next smallest entry should indicate which two species diverged just before that. And so forth. That's the idea of the algorithm, but it needs a little clarification. Suppose species 1 and 2 are closest with 4 differences in their characters, and species 1 and 3 are next closest with 6 differences. Then since we conclude that species 1 and 2 diverged most recently, it won't be species 1 and 3 that diverged just before that, rather it will be the ancestral species of 1 and 2 that diverged from species 3 just before that.

Here you see a distance matrix (which is actually a mutation matrix based on a hidden phylogenic tree) and the reconstructed phylogenic tree build from this minimum reconstruction method. You should be able to see how that tree can be derived from the matrix. If you like, you can ask for a different matrix based on different mutations. Note how radically the reconstructed tree varies with each set of mutations. If there are more characters, then you'll see that the reconstructed matrix depends less on which actual mutations occur.

Two other solutions: the "average" and "maximum" methods

These two algorithms are very similar to the minimum reconstruction method. They start out exactly the same by joining the two species that share the most characters. To explain these methods, suppose that species 1 and 2 are closest. Name their ancestral species as species 6. With the minimum method, we effectively determined that the distance between species 6 and any other species such as species 3 was the minimum of the distance from 1 to 3 and the distance from 2 to 3. With the maximum method, instead take the distance from species 6 to species 3 to be the maximum of these two distances. And, of course, for the average method, take the average of those two distances.

It's surprizing how similar the trees resulting from these three algorithms are. You can see for yourself by selecting a different algorithm.

References

The minimum algoritnm is a version of the single linkage clustering algorithm also known as the nearest neighbor technique. The maximum reconstruction method is also known as the complete linkage clustering algorithm and the nearest neighbor technique. Further references on these algorithms:


to mutations. to the cover page. to the next page putting everything together.


David E. Joyce
Department of Mathematics and Computer Science
Clark University
Worcester, MA 01610

Email: djoyce@clarku.edu
My Homepage.