I am still in the phase of my new job where I am inundated with new information every day. Every so often I just get overwhelmed with the number of different ways to do something, whether it’s the number of tools available for the job or even the number of different methodologies. I have reached the point that I officially over-use the phrase, “we don’t have to re-invent the wheel.” I am banning it from my vocabulary.
One of the issues I ran into this week was at the other end of the spectrum of confusion. The issue was not too many methodologies but too few. Maybe I am missing something, but the answer to too many of my questions is simply “BLAST”. For those reading this without any bioinformatics background, BLAST stands for Basic Local Alignment Search Tool, an algorithm that takes a gene and searches for the best matches in some gene database. It returns a number of statistics that the bioinformatician uses to determine the most likely best match to this gene in a given genome. For example, say I have a gene with an unknown function and wish to identify that function. One preliminary step is to BLAST that gene against databases full of genes with known function. If I find a significant “hit” or match, I may be able to hypothesize that this gene has the same function as the matching gene. How I come to that conclusion is based largely on how good the hit is, whether the hit occurs in the functional domain of the gene, and probably several other factors I have yet to learn.
In one of my projects the basic approach is to determine the cross-species similarity between two plants at the genetic level. Using expression data for one plant, I want to determine which genes play a similar role in another plant. So, I BLAST. Here’s the kicker: for a large percentage of my genes I have multiple good hits in the other species, which suggests similarity among members of a gene family. Which genes do I include as having matches? How good does a hit have to be to be good enough? My current methodology just takes the top hits that have an e-value less than 1e-20 (a fancy way of saying that a hit this good would almost never turn up by chance among random sequences). By this methodology, out of about 50,000 genes, 28,000 of them have hits in the other plant. But I also leave out 19,000 genes from the other plant that would have hits with an e-value of less than 0.00001, which still makes a chance match very unlikely. Is it possible that these are also genes with the same function? Of course.
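For the curious, here is a rough sketch of what that filtering step could look like. It is not my actual pipeline, just an illustration that assumes the BLAST results were saved in the standard 12-column tabular layout (`-outfmt 6`), where the e-value sits in the eleventh column. The gene names and numbers in `SAMPLE` are made up, and `best_hits` is a hypothetical helper name.

```python
import csv
import io

# Made-up rows in BLAST tabular layout (-outfmt 6); gene names are hypothetical.
# Columns: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
SAMPLE = (
    "geneA\ttarget1\t98.5\t300\t4\t0\t1\t300\t1\t300\t1e-150\t550\n"
    "geneA\ttarget2\t85.0\t290\t40\t2\t5\t295\t3\t292\t1e-60\t230\n"
    "geneB\ttarget3\t70.2\t120\t35\t1\t10\t130\t8\t127\t1e-8\t60\n"
    "geneC\ttarget4\t55.0\t80\t36\t0\t1\t80\t1\t80\t0.5\t30\n"
)


def best_hits(handle, max_evalue):
    """Return {query: (subject, evalue)}, keeping the lowest e-value hit per query."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue >= max_evalue:
            continue  # not below the cutoff, so ignore this hit
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


strict = best_hits(io.StringIO(SAMPLE), 1e-20)
lenient = best_hits(io.StringIO(SAMPLE), 1e-5)

print(sorted(strict))                      # genes matched at the strict cutoff
print(sorted(set(lenient) - set(strict)))  # genes gained only by loosening it
```

With this toy data, geneA passes the strict cutoff, geneB only shows up at the looser one, and geneC has no usable hit at either. The second list is the small-scale version of my 19,000 left-out genes.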
So here’s the conclusion I reached this week. Say I have a large database of genes and I want to identify genes with a certain property, characteristic or function. It is highly unlikely that I will find ALL the genes meeting my criteria. Instead, I set a threshold so that when I search on these criteria, I can be relatively certain that all the genes I obtain meet them, even though I may be leaving out a large number of satisfactory genes. I want to make sure that the genes I select are satisfactory, but I draw no conclusions about the genes I leave out. That’s where I draw the line.
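Put another way, the threshold gives me two bins, not three: genes I am confident about and genes I say nothing about. There is deliberately no "not a match" bin. A tiny sketch of that idea, with a hypothetical `label_genes` helper and invented e-values:

```python
def label_genes(best_evalue, threshold=1e-20):
    """Split genes into 'confident' and 'undetermined' by their best e-value.

    A gene with no hit at all (None) is undetermined, not rejected.
    """
    confident, undetermined = [], []
    for gene, evalue in best_evalue.items():
        if evalue is not None and evalue < threshold:
            confident.append(gene)
        else:
            undetermined.append(gene)
    return confident, undetermined


# Invented example values; None means BLAST reported no hit for that gene.
example = {"geneA": 1e-150, "geneB": 1e-8, "geneC": None}
confident, undetermined = label_genes(example)
print(confident)
print(undetermined)
```

Here geneB and geneC both land in the undetermined list, even though geneB has a fairly strong hit. That is the trade-off I am accepting.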