Flow chart for transcriptome assembly and quantification of gene expression. Adapted from Martin and Wang (2011).

Data Analysis

Analysis of raw short reads can be separated into two main steps: pre-processing the raw data and transcriptome assembly (see Martin and Wang, 2011 for details).

Pre-processing raw short-read sequences

  1. Removal of the following artifacts will improve the sequence read quality.
    1. Sequencing adapters from failed or short DNA insertions during library construction.
    2. Low-complexity reads, and near-identical reads that arise from PCR amplification.
    3. rRNA and other contaminating RNA, where their identities are known; removing them also improves assembly speed.
  2. Sequencing errors can be removed or corrected by analyzing the quality scores and/or the k-mer frequencies (the number of times each k-length oligonucleotide appears in the sequence data).
    1. Low quality scores indicate possible sequencing errors.
    2. Low frequency for a k-mer indicates a possible sequencing error or a low abundance transcript.
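
The two error indicators above can be illustrated with a minimal Python sketch. It is not taken from Martin and Wang (2011). The function names, the Phred+33 quality encoding, the k-mer length of 4, the minimum count of 2, and the toy reads are all assumptions chosen for illustration; real pipelines use dedicated trimming and error-correction tools and much larger values of k.

```python
from collections import Counter


def mean_quality(quality_line, offset=33):
    """Mean Phred score of one FASTQ quality string (Phred+33 is assumed)."""
    scores = [ord(ch) - offset for ch in quality_line]
    return sum(scores) / len(scores) if scores else 0.0


def count_kmers(reads, k):
    """Count every k-length substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def rare_kmers(counts, min_count=2):
    """k-mers seen fewer than min_count times (hypothetical threshold).

    A rare k-mer may be a sequencing error or a low-abundance transcript,
    so it is flagged here rather than discarded.
    """
    return {kmer for kmer, n in counts.items() if n < min_count}


# Toy data: the third read differs from the others at its last base,
# and that base also carries a low quality score ('#' is Phred 2).
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGA"]
quals = ["IIIIIIII", "IIIIIIII", "IIIIIII#"]

for read, qual in zip(reads, quals):
    print(read, mean_quality(qual))

kmer_counts = count_kmers(reads, k=4)
print(rare_kmers(kmer_counts))  # {'ACGA'}
```

In this toy example the only rare k-mer, ACGA, overlaps the low-quality base, so both signals point to the same likely error. When a rare k-mer is supported by high-quality bases, a low-abundance transcript is the more plausible explanation.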