Monthly Archives: June 2007

Where to draw the line

What I learned this week #2

I am still in the phase of my new job where I am inundated with new information every day.  Every so often I just get overwhelmed with the number of different ways to do something, whether it’s the number of tools available for the job or even the number of different methodologies.  I have reached the point that I officially over-use the phrase, “we don’t have to re-invent the wheel.” I am banning it from my vocabulary. 

One of the issues I ran into this week was at the other end of the spectrum of confusion. The issue was not too many methodologies but too few. Maybe I am missing something, but the answer to too many of my questions is simply “BLAST”. For those reading this without any bioinformatics background, BLAST stands for Basic Local Alignment Search Tool, an algorithm that takes a gene sequence and searches for the best matches in a sequence database. It returns a number of statistics that the bioinformatician uses to determine the most likely match to this gene in a given genome. For example, if I have a gene with an unknown function and wish to identify that function, one preliminary step is to BLAST that gene against databases full of genes with known function. If I find a significant “hit”, or match, I may be able to hypothesize that this gene has the same function as the matching gene. How I come to that conclusion is based largely on how good the hit is, whether the hit occurs in the functional domain of the gene, and probably several other factors I have yet to learn.

In one of my projects the basic approach is to determine the cross-species similarity between two plants at the genetic level. Using expression data for one plant, I want to determine which genes play a similar role in another plant. So, I BLAST. Here’s the kicker: for a large percentage of my genes I have multiple good hits in the other species, which indicates similarity such as that found among gene families. Which genes do I include as having matches? How good a hit is good enough? My current methodology just takes the top hits that have an e-value less than 1e-20 (a fancy way of saying that a hit this good would almost never be expected by chance in a database of this size). By this methodology, out of about 50,000 genes, 28,000 of them have hits in the other plant. But I also leave out 19,000 genes from the other plant that would have hits with an e-value of less than 0.00001, which is still strong evidence of a real match. Is it possible that these are also genes with the same function? Of course.
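To make the thresholding concrete, here is a minimal sketch in Python. It assumes the BLAST results were saved in the standard 12-column tabular layout (query ID in the first column, subject ID in the second, e-value in the eleventh); the sample rows, the gene names and the `best_hits` helper are all made up for illustration and are not my actual pipeline.

```python
import io

# Hypothetical BLAST tabular output (standard 12 columns, tab separated):
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
SAMPLE = (
    "geneA\tsubj1\t92.1\t300\t20\t1\t1\t300\t5\t304\t1e-80\t290\n"
    "geneA\tsubj2\t85.0\t280\t40\t2\t1\t280\t9\t288\t3e-45\t180\n"
    "geneB\tsubj7\t61.3\t150\t55\t3\t10\t160\t1\t150\t2e-08\t55.1\n"
    "geneC\tsubj9\t48.0\t90\t45\t2\t5\t95\t20\t110\t0.5\t28.0\n"
)

def best_hits(handle, max_evalue):
    """Map each query to its lowest e-value hit, keeping only hits below max_evalue."""
    best = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue >= max_evalue:
            continue  # hit is not good enough for this threshold
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best

strict = best_hits(io.StringIO(SAMPLE), 1e-20)   # the threshold I currently use
relaxed = best_hits(io.StringIO(SAMPLE), 1e-5)   # the looser alternative
only_relaxed = sorted(set(relaxed) - set(strict))

print("strict:", strict)
print("picked up only by the relaxed threshold:", only_relaxed)
```

In this toy data, geneB is the kind of gene I am currently leaving out: its best hit passes the 1e-5 threshold but not the 1e-20 one.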

So here’s the conclusion I reached this week. Say I have a large database of genes and I want to identify genes with a certain property, characteristic or function. It is highly unlikely that I will find ALL the genes meeting my criteria. Instead, I set a threshold so that when I search on these criteria, I can be relatively certain that all the genes I obtain meet them, even though I may be leaving out a large number of satisfactory genes. I want to make sure that the genes I select are satisfactory, but I draw no conclusions about the genes I leave out. That’s where I draw the line.
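The trade-off I am describing is what statisticians would call precision versus recall. Here is a small sketch, assuming I somehow knew the true answer for every hit; the e-values and the True/False labels below are invented purely for illustration, and `precision_recall` is a hypothetical helper.

```python
# Hypothetical hits as (e-value, really shares the function?) pairs.
# The labels are invented; in real life they are exactly what I do not know.
hits = [
    (1e-90, True), (1e-60, True), (1e-35, True), (1e-22, True),
    (1e-15, True), (1e-12, False), (1e-8, True), (1e-6, False),
]

def precision_recall(hits, max_evalue):
    """Return (precision, recall) when keeping only hits with e-value below max_evalue."""
    selected = [is_true for evalue, is_true in hits if evalue < max_evalue]
    true_total = sum(is_true for _, is_true in hits)
    true_selected = sum(selected)
    # With nothing selected there are no false positives, so report 1.0 here.
    precision = true_selected / len(selected) if selected else 1.0
    recall = true_selected / true_total if true_total else 0.0
    return precision, recall

for threshold in (1e-20, 1e-5):
    precision, recall = precision_recall(hits, threshold)
    print(f"e-value < {threshold:g}: precision={precision:.2f}, recall={recall:.2f}")
```

With these made-up numbers, the strict threshold selects only correct genes but misses a third of them, while the relaxed threshold finds them all at the cost of a few wrong ones. Drawing the line at 1e-20 is a choice in favor of the first outcome.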

Fiber Research, the big picture

I finally managed to get a glimpse of the big picture when it comes to fiber research. It may seem fairly obvious now, but for some reason I was so immersed in the A’s, G’s, T’s and C’s of the genetic code, and in hunting for their patterns, that I forgot to ask the big question: Why?

Answer: Better cotton and more of it. 

In a talk, one of my colleagues laid out a bit of this for me by describing two of the most widely harvested cotton species. The first is Gossypium hirsutum, commonly called Upland cotton; Texas Marker-1 (TM-1) is one of its lines. Its name comes from “hirsute”, meaning “hairy”. This cotton species has a high yield, a definite advantage, but as the name suggests the fibers are “hairy”, not as long and strong as those of other species. Another species is Gossypium barbadense, or Pima, which is commonly known as “Egyptian Cotton”. Its fibers are longer and stronger, resulting in a higher-quality fabric. However, this species yields substantially less.

Now, in comes research on the cotton genome. The goal is to understand the biological mechanisms and the underlying genetic code that produce the differences between the varieties of cotton. If we can identify significantly differentially expressed genes in varieties of cotton at different stages of development, and use this information to discover active biological pathways, we may be on our way to understanding the system of biological development in cotton. Then, knowing that, we will work to produce a cotton plant with the yield of Upland and the quality of Pima.

Materials and Methods

I’m shifting gears in moving from writing Mathematics papers to writing Bioinformatics papers. In my native tongue, namely mathematics, we tend to write in the first person plural. I don’t know the history or reasoning behind this practice, but when you read a mathematics paper in a journal, it is replete with “we”, “us”, and “our”. From the little that I have read, scientific journals tend to use the third person, just stating the facts concisely, much like a newspaper article.

However, there is one place where I was originally a bit uncomfortable with the writing style. The few “Materials and Methods” sections I have read tend to be written almost entirely in the passive voice. For example, in a paper I was reading just today (Modifications of cellulose synthase confer resistance to isoxaben and thiazolidinone herbicides in Arabidopsis Ixr1 mutants), I read the following section:

Mutant ixr1 plants were transformed by A. tumefaciens (GV3101) carrying the various cosmid clones according to Bent and Clough (13), and T1 transformants were selected on MS plates containing kanamycin (50 µg ml-1). To score for isoxaben resistance, surface-sterilized seeds of the transgenic T1 plants were germinated on 0.8% agar-solidified medium containing Murashige and Skoog mineral salts (Sigma) and 600 nM isoxaben. The plates were incubated vertically at 25°C under continuous fluorescent illumination (approx 50 µmols photons m-2·s-1) so that the roots grew on the surface of the agar. Isoxaben resistance was scored after 7 days of incubation.

It continues on for several paragraphs with every last sentence in the passive voice. Not that I necessarily have a better way to write it; I just recall being reprimanded for overusing the passive voice when I was learning to write research papers in high school and early undergraduate English courses.

On that note, I’ve begun putting together my first paper in the field. We’ll just have to see how that goes.

Comparative Genomics

What I learned this week #1:

Another week has transpired at my new job as a researcher at TTU and I am being inundated with all sorts of new information. For one thing, I have folks milling around behind my workstation doing all sorts of laboratory things, freezing things in liquid nitrogen (or something else very cold that billows smoke), pipetting (if that is even a word), etc. I’m just disappointed that they’re not wearing lab coats with a mad scientist look in their eye. Unfortunately, my job is much less exotic-looking. I sit in front of a computer, all day long. I do have a pretty fancy setup with two 19-inch LCD monitors plugged into a pretty hefty computer (two dual-core processors, 4 GB memory).

I have a number of different projects, but they all seem to start in the same place, and due to my lack of experience with the biology I don’t have a good feel for how to follow these initial steps. In essence, the geneticist I work for has collected a great deal of data about the expression of genes in cotton across various varieties and various developmental phases. So, I start with a list of genes that have been identified as having a particular function in another plant, the most common being Arabidopsis since its genome has been entirely sequenced. We then identify whether these genes are present in cotton. Once we have this list of genes, we examine their developmental expression and draw conclusions about their role in cotton.

You know, when I state it like that it seems very simple, but each of the above steps contains several smaller steps that can lead to a great deal of work. So far, I have very little to say about the conclusions we draw. All I have really done is the first phase of identifying these genes in Arabidopsis and begun to compile the list of these genes in cotton. Next week, I’ll begin collecting the expression data for some of the genes of particular interest.

I should mention that the first project is actually slightly modified, in that we looked at genes that had specific roles identified first in cotton and then found whether these same genes played similar roles in other species. If any computational biologists, functional geneticists (is that a term?), or bioinformaticists read this and it seems naive, please be kind and realize that I don’t speak the language very well yet. I’m absorbing as much as I can as fast as I can. Having a background in applied mathematics and numerical analysis helps but I still feel handicapped.

Technologically speaking,

  • I learned how to migrate an MSSQL database website to a new server.
  • I learned how to update data in a MySQL database with a pre-built utility website (my first time using the “DELETE” and “SELECT” commands with a “LIKE” modifier)
  • I installed PHP and MySQL to run on Microsoft IIS, followed by installing ActiveCollab for project management
  • I used query design mode extensively in Microsoft Access and eventually resorted to SQL statements for “UNION” queries
  • I updated the BLAST database used by NCBI wwwblast on a local utilities site
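For anyone as new to SQL as I am, the three statements from the list above look roughly like this. This is only a sketch: it uses Python’s built-in sqlite3 module with an in-memory database as a stand-in for MySQL and Access, and the table names, gene IDs and annotations are all hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the real MySQL/Access databases.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical tables and rows, for illustration only.
cur.execute("CREATE TABLE cotton_genes (gene_id TEXT, annotation TEXT)")
cur.execute("CREATE TABLE arabidopsis_genes (gene_id TEXT, annotation TEXT)")
cur.executemany(
    "INSERT INTO cotton_genes VALUES (?, ?)",
    [("GhDemo1", "cellulose synthase"), ("GhDemo2", "expansin"), ("GhDemo3", "test record")],
)
cur.executemany(
    "INSERT INTO arabidopsis_genes VALUES (?, ?)",
    [("AtDemo1", "cellulose synthase"), ("AtDemo2", "expansin")],
)

# SELECT with LIKE: % matches any run of characters.
cur.execute("SELECT gene_id FROM cotton_genes WHERE annotation LIKE ?", ("%synthase%",))
synthase_ids = [row[0] for row in cur.fetchall()]

# DELETE with LIKE: remove rows whose annotation starts with "test".
cur.execute("DELETE FROM cotton_genes WHERE annotation LIKE ?", ("test%",))
deleted = cur.rowcount

# UNION: combine annotations from both tables, dropping duplicates.
cur.execute(
    "SELECT annotation FROM cotton_genes "
    "UNION SELECT annotation FROM arabidopsis_genes "
    "ORDER BY annotation"
)
shared = [row[0] for row in cur.fetchall()]

conn.commit()
print(synthase_ids, deleted, shared)
conn.close()
```

One thing worth checking before any “DELETE” with “LIKE”: run the same pattern as a “SELECT” first to see which rows it will hit.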

Alternative Mathematics


Smart man + smart woman = romance
Smart man + dumb woman = affair
Dumb man + smart woman = marriage
Dumb man + dumb woman = pregnancy


Smart boss + smart employee = profit
Smart boss + dumb employee = production
Dumb boss + smart employee = promotion
Dumb boss + dumb employee = overtime


A man will pay $20 for a $10 item he needs.
A woman will pay $10 for a $20 item
that she doesn’t need.


A woman worries about the future until she gets a husband.
A man never worries about the future until he gets a wife.
A successful man is one who makes more money than his wife can spend.
A successful woman is one who can find such a man.


To be happy with a man, you must understand him a lot and love him a little.
To be happy with a woman, you must love her a lot and not try to understand
her at all.


Married men live longer than single men do, but married men are a lot more
willing to die.


A woman marries a man expecting he will change, but he doesn’t.
A man marries a woman expecting that she won’t change, and she does.


A woman has the last word in any argument.
Anything a man says after that is the
beginning of a new argument.

What is bioinformatics?

One of the broadest definitions of bioinformatics that I have come across was in Sorin Draghici’s book, Data Analysis Tools for DNA MicroArrays.

Def: Bioinformatics is the science of refining biological information into biological knowledge using computers.

Under the heading of bioinformatics is a wide variety of fields of study, with a lot of problems under its umbrella. Some of the primary issues addressed, historically, have been sequence analysis, protein structure prediction and the dynamic modeling of complex biosystems. Other areas of fairly recent research have been protein-protein interactions, protein-DNA interactions, enzymatic and biochemical pathways, population-scale sequence data, large-scale gene expression data and ecological and environmental data.

That’s what Draghici has to say about it, anyway. As a newbie, I am discovering new avenues of research all the time and trying to assimilate and categorize all the new information I’m coming across. So far, it looks like my research will begin in the areas of sequence analysis, modeling of biosystems, pathway analysis and analysis of gene expression data.

The first project in which I am involved is simply cross-species comparison of genes that have been identified in one species through expression analysis as having a role in cell wall development. We’ll use this information to predict their role in other plants as well.

I’m already having quite a good time at my new job taking care of some techie stuff, more so than I ever had a chance to do in my last profession. I’ve tweaked a website design to meet the boss’s requirements, migrated a database website from one server to another (learning a good deal about MS SQL Server in the process), and identified and solved a particular issue with the Genespring workgroup server.

I’m also using a new software tool on my own server, a project management utility called activeCollab. It is an online database of my current projects and activities. Through a fairly easy-to-use web interface I am able to enter all my projects along with their tasks, messages, milestones, etc. I’ve given my colleagues and my PI access so they can keep tabs on my progress. Plus, it helps me make sure that I am staying on task and meeting all my own goals.