Wednesday, August 25, 2010

The Bleeding Edge...

Prior to my postdoc, I'd never done research using 'bleeding edge' technology, which Wikipedia defines as:

Bleeding edge technology refers to technology that is so new that the user is required to risk unreliability, and possibly greater expense, in order to use it.

While every project requires some degree of methods development, pretty much everything I've done before used well established protocols and procedures. This is no longer the case.

One of the techniques of which I've been making extensive use is RNA-seq, or short-read sequencing of RNA. The gist of this technique is that you extract RNA from a cell, fragment it into small pieces, and then convert it to DNA. You then generate millions of short reads (~75-100 bp nowadays) off of this DNA using flow-cell based sequencers (see the link I provided to RNA-seq for details) and map (align) these reads to the genome of the organism with which you're working. The read 'density' (or number of reads mapping to any given region of the genome) is correlated with the abundance of the RNA in the original pool, and thus you get a digital readout of the abundance of all of the transcripts in the cell. In principle this technology has many benefits over traditional methods of transcriptome profiling, such as allowing isoform detection, and is widely expected to eventually replace microarrays.
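To make 'read density' a bit more concrete, here is a minimal sketch of one common way to turn raw read counts into a density: reads per kilobase of transcript per million mapped reads (RPKM). The gene names and numbers are made up for illustration, and this is not necessarily the measure used in any particular analysis.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    kilobases = gene_length_bp / 1_000
    millions_mapped = total_mapped_reads / 1_000_000
    return read_count / kilobases / millions_mapped


# Hypothetical library size and genes, for illustration only.
total_mapped = 20_000_000
genes = {
    "geneA": (4_000, 2_000),  # (reads mapped, gene length in bp)
    "geneB": (4_000, 500),
}

densities = {
    name: rpkm(reads, length, total_mapped)
    for name, (reads, length) in genes.items()
}

for name, value in densities.items():
    print(name, value)
```

Both hypothetical genes collect the same number of reads, but the shorter one ends up with a fourfold higher density, which is why raw counts alone aren't comparable between genes.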

While this is all fine and dandy, the more I use RNA-seq, the more concerned I become about it. A lot of people have jumped on this bandwagon, and there hasn't been a lot of work done to investigate potential biases and caveats associated with the instruments required to generate these data. Here are a couple of examples. The relationship between the number of reads mapping to a given gene/exon and its expression level only holds if reads are randomly distributed with respect to what's being sequenced - there shouldn't be sequences that are preferentially sequenced or underrepresented. Unfortunately, such sequences do seem to exist: either during fragmentation or sequencing library preparation, biases are introduced that make certain sequences more or less common than would be expected by chance. It appears that this does not have a large effect on expression estimates of highly expressed genes, but genes with low expression or short coding sequences show more variability than they should.

A much more significant problem is that many papers seem to have assumed that RNA-seq is somehow beyond the need to 'normalize' data (that is, to control for systematic biases). It isn't (see Srivastava and Chen 2010, for example). A slew of recent papers have shown that there are biases associated with short-read data, especially when the cells/tissues/organisms being compared have radically different expression profiles. Normalization is required, but rarely applied. Oh, and I haven't even gotten into the serious problem of lack of replicates in many of these studies.
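Here is a toy sketch of why radically different expression profiles cause trouble. The counts are invented: genes g1-g4 are assumed to be truly unchanged between two samples, g5 is six times higher in sample B, and both libraries are sequenced to the same depth. Scaling by total read count makes the unchanged genes look two-fold down in B, whereas a simplified median-of-ratios style scaling (using sample A as the reference) recovers them. This is an illustration under those assumptions, not the method from any of the papers mentioned above.

```python
from statistics import median

# Hypothetical counts: g1-g4 truly unchanged, g5 is 6x higher in B.
# Both libraries have the same depth (1000 reads), so g5 'steals' reads in B.
sample_a = {"g1": 200, "g2": 200, "g3": 200, "g4": 200, "g5": 200}
sample_b = {"g1": 100, "g2": 100, "g3": 100, "g4": 100, "g5": 600}


def total_count_factor(sample, reference):
    """Scale factor from the ratio of library sizes."""
    return sum(sample.values()) / sum(reference.values())


def median_ratio_factor(sample, reference):
    """Scale factor from the median per-gene ratio to the reference sample.

    Genes with a zero count in either sample are skipped.
    """
    ratios = [
        sample[g] / reference[g]
        for g in reference
        if reference[g] > 0 and sample.get(g, 0) > 0
    ]
    return median(ratios)


def normalize(sample, factor):
    """Divide every count by the sample's scale factor."""
    return {gene: count / factor for gene, count in sample.items()}


by_total = normalize(sample_b, total_count_factor(sample_b, sample_a))
by_median = normalize(sample_b, median_ratio_factor(sample_b, sample_a))

print("total-count scaling: ", by_total)
print("median-ratio scaling:", by_median)
```

The median-based factor only works here because most genes are assumed not to change; if the majority of the transcriptome shifts, that assumption breaks down too.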

These are just some of the issues RNA-seq users face. Now, you may be asking yourself why I'm telling you this; I assure you it's not a rant. My bigger point is this: Our lab has been spending an inordinate amount of time investigating the very real biases that these issues may be creating in our data. However, at the same time, other groups are using the tools available, limited as some may be, and publishing work under the assumption that such biases are not 'show stoppers'. Whether they are or not is difficult to say at this point, but we're pretty much at the mercy of labs with a much better grasp of statistics to come up with solutions as to how to properly normalize and handle these data.

To what extent is it reasonable to use the tools available and assume that they're 'good enough'? Science is always progressing/refining its products, so is it okay to use particular methods, even if you suspect that they're producing uncontrolled biases? I think that the vast majority of what's being done in the field is quality (as far as we know what quality is) but the use of such new technology is somewhat worrisome.

Thoughts?

3 Comments:

At 12:45 AM, Blogger The other Jim said...

Agreed. In the rush to be 1st with a new technology, a lot of bad work is getting into high impact journals.

All next-gen platforms and applications are lacking any real verification. We tried to estimate somatic mutagenesis rates on one platform (DNA sequencing... so different biases). Our somatic rates came up 20-100x higher than PCR-Clone-SangerSequence methods (which are criticized for being over-estimates due to Taq errors).

Our collaborators who run the facility did not see the problem with this, and just kept chanting "never before examined in this depth... don't know what is really out there". I had them simulate a PCR-Clone-SangerSequence data set from the SOLiD reads, then handed them the data from my experiments. Suddenly they were convinced that there was an accuracy problem ;-)

 
At 12:29 PM, Blogger Carlo said...

Sounds all too familiar. It's really been a huge problem for us. We've spent a huge amount of time mulling over our data. Now that we actually want to do work on what we've got, we're concerned about where to start - there's such a massive dataset that we can't afford to reanalyze it over and over!

 
