Understanding RNA-Seq Analysis, a Guide for Biologists

Many researchers use RNA-Seq in their study. Some basic understanding of the data analysis process will help you better evaluate your data and get the most out of the results.

This guide is written based on our experience working with real bench scientists to answer specific biological questions using RNA-Seq data.

BioInfoRx provides service to analyze RNA-Seq data for all steps. Contact us if you need help and mention this blog to get free analysis for your first sample.

Step 1. Basic Data Analysis Procedure

There are many different pipelines to analyze RNA-Seq data, and your core facility, service providers or collaborators probably will use one of their favorite pipelines to process the raw data for you.

We recommend the TopHat2-HTSeq-EdgeR pipeline, which is well-documented (see Nature Protocol) and has very robust performance. This pipeline is ideal to identify differentially expressed genes.

Note this pipeline works on gene level, but it does not give you transcript isoform information. There are tools like Cufflinks and RSEM which are designed to detect isoforms and their differentially expression. However, there are still a lot of debates on the methodology and accuracy of isoform detection from RNA-Seq data. For most researchers, we recommend using the gene level results

RNA-Seq Pipeline

Step 2. Quality Assessment of Your Data

It is important to avoid garbage in, garbage out when processing genomic data. Here are a few QC you should be aware with your RNA-Seq data to ensure reliable and accurate results. 

QC Raw Sequence Reads

Raw Data Quality

First, you need to make sure that the raw sequencing reads are of high quality. We often use FastQC to perform QC checks on raw fastq files. Here is an example showing two samples, one has good qualify for all 100 base pairs, the other has a significant drop of data qualify towards the 3' end. For the second sample, you may want to consider trimming low quality reads from 3' before analysis. 

Using FastQC can also identify other issues like high duplication levels, untrimmed adapters, etc. 

Sample Relationship

Next, you want to make sure your biological replicates are really reproducible. The idea is simple, samples from the same condition should look similar, and treated and control samples should look different.

In the example shown at right, a breast cancer cell line is treated with control (DMSO), endogenous estrogen (E2), or environmental estrogens (GEN: genistein, environmental estrogen found in soybean; BPA: bisphenol A, environmental estrogen found in plastics; see ref. here), and two replicates from each condition are measured using RNA-Seq.  We present two ways to look at the sample relationship.

First we can create a MDS (multidimensional scaling) plot. You can see that the biological replicates are indeed similar to each other. Also, E2 and GEN are more different from DMSO than BPA.  

A more informative method is to cluster the samples based on gene expression levels. Here we plot the most variable genes with a heatmap. The sample clustering is similar to what we have seen in MDS Plot. From the heatmap, it is clear that E2 and GEN change the expression of many genes, and there are more up regulated gene. Overall E2 has a stronger effect than GE. BPA only has limited effect on gene expression. 

From this step, you can also identify outliers (e.g. a control sample which looks like treated, or vice versa), which you may want to remove before moving to downstream analysis.

MDS Plot MDS Plot


Heatmap after Clustering Clustering Heatmap

Step 3. Visualize Expression Profiles

Gene expression data can be easily uploaded to BxGenomicDB for online visualization.  Let's look at some examples using the ER data set. 

E2 Induced Genes

Gene Profiles of Estrogen Induced Genes

First we want to look at all the genes that are induced by endogenous estrogen E2. With BxGenomicDB, it is easy to filter these genes by setting Fold Change>2 and FDR value

From the plot, it is clear that for most of the E2 induced genes, they are also induced by GEN to a lower extent. BPA may induced some of these genes, but the change is not significant enough to be detected by this experiment (two replicates).


Search and View Specific Groups of Genes

So are there any BPA induced genes? With FDR cut off at 0.05, there isn't any hits. With a relaxed criteria ( P-value <=0.05 and two fold cutoff, see screen shot below), 9 genes show up. Considering poor FDR values, more experiments (e.g. Q-PCR) are probably needed to verify them.

Search BPA Induced Genes

Let's look at the expression profiles of the top three genes with two different ways (use plot options to select graphic format). The bar graph is straightforward, but I like the data points with average bar better because values from individual replicates are shown clearly.


Plot BPA Induced Genes

Step 4. Obtain Biological Insights from RNA-Seq Data

Now you can explore the data and use various tools to answer biological questions and make new discoveries on the mechanism of gene regulation. We are showing some examples below to give you an idea of the commonly used tools.

Functional Enrichment

Differentially Expressed Genes to Functional Category

This is one of the most commonly asked questions and we have a separate blog dedicated to this topic. Briefly, let's see what are the functional categories enriched in E2 induced genes in the breast cancer cell line T-47D ( ref.). With BxGenomicDB, we can direct send the gene list to DAVID and get a list of Gene Ontology categories. We can also export the list, and copy the symbols to the Online Functional Enrichment Tool (based on Homer tool). The screen shots from both methods are shown at left. A direct link to the homer results is here.

From DAVID, it looks like membrane proteins and intracellular signaling cascade and up-regulated by E2 in T-47D cell line. From Homer results, the top matches are the up list in response to estradiol from other experiments, and up regulated genes in a few other breast cancer tissues. Overall, some interesting results and a lot of data mining and follow up can be done here.

Gene Set Enrichment Analysis (GSEA)

The methods (DAVID, Homer) we mentioned above all start with a gene list, and often we need to use a somewhat arbitrary cutoff to get the list. Sometimes different cutoff may lead to different results. GSEA is another very popular method to search for functional enrichment, but it doesn't need a cutoff. GSEA looks at the difference between two biological states of all genes, then come up with what gene sets are statistically significant.

We run GSEQ using ranked gene list based E2 induction. The results shows many interesting functional categories. Again, top matches are the up list in response to estradiol from other publications, but we also identify that the ribosome genes are strongly enriched in E2 induced genes. Next we will look into more details at the ribosome genes.

Ribosome Gene Pofiles Ribosome Gene Profiles


Ribosome Gene Heatmap Ribosome Gene Heatmap

View All Genes in a Pathway or Functional Group

As a follow up of GSEA, we want to see how ribosome genes change their expression levels in response to E2, GEN and BPA. To do this, we can simply copy the gene list form GSEA output, and use bulk search function in BxGenomicDB to find data on all these 82 genes. We can even plot all these genes together. Click the gene profile picture on the left to view top 18 genes.

Sometimes you want to see all genes in one page rather than view many individual plots, then we can create a heatmap. It is clear that most ribosome genes are induced by E2, but not so much by BPA or GEN. The two replicate samples of E2 are not equal. In E2_1 a stronger effect is observed on this group of genes.

Data Overlap and Venn Diagram

From the example of ribosome genes above, these genes are mostly affected only by E2, but not by the two environmental estrogens. To view the overlap of induced genes globally, we can use Venn Diagram. First we export the up-regulated genes from each of the three treatment, then we can enter the gene IDs to the online Data Overlap Tool to compute the overlap and create an Venn Diagram.

Venn Diagram of Induced Genes

Other Integrative Analysis

There are a lot more interesting analyses one can perform on RNA-Seq data. For example, you can compare with expression data from other labs, or integrate ChIP-Seq with RNA-Seq results. We will cover some of these topics in future blogs.



Interested in using the approach described here but need some help? Please contact us to get a free quote for our biologists-friendly analysis service.