Tutorial: From Gene List to Functions

When I worked with researchers on genomic data, the first question I got usually is “What are the functions of the gene that changed in my experiment?” There are actually many tools addressing this question, but I am surprised to find out that most biologists need a lot hand-holding on this.

Here is a step-by-step tutorial for this common workflow. This is written for bench scientists who need to make sense of their genomic data. If you are a bioinformatics guru, you know these already but please let us know if we missed any good tools.


Step 1. Get raw data analyzed

Most time you don't need to worry about this. When you receive array or NGS data from the core facility or service providers, there usually is a large text or Excel file listing all genes with expression values.

Plug: BioInfoRx provides service to analyze array or NGS data for all steps. Contact us if you need help and mention this blog to get free analysis for your first sample.


Filtering Genes

Step 2. Filter for lists of differentially expressed genes

Sometimes this is already done for you. You will see in columns with headers like Fold Change (or log Fold Change), P-value, FDR (sometimes called as adjusted P-value or Q-value). If you don’t see these columns, ask someone for help. If you are familiar with R or other programming languages (unfortunately this excludes 99% of bench scientists), you can use limma for array data, or one of the popular packages (e.g. DeSeq, EdgeR, CuffDiff etc) for NGS data.

Now you can just filter for genes that changed (aka differentially expressed genes). I recommend start with Fold Change larger than 2 (up or down) and FDR less than 0.05. You can play with the cutoffs to get different numbers of hits (see tips below if you have too many or too few hits). BxGenomicDB is an excellent tool to easily test different filtering options. Its advanced search engine allows you to set any filtering criteria, and save the search for future access with one click. Here are some examples using BxGenomicDB to filter for gene lists.

http://bioinforx.com/app/data/files/69_Tips.pngTips for Step 2

If you have standard linear fold change, remember that less than 0.5 is decrease more than 2 fold. If you have log fold change (logFC), logFC more than 1 means increase more than 2 fold, logFC less than -1 means decrease more than 2 fold. I know this is simple math, but you don’t ever want to set the wrong cutoff for fold change.

Based on your data, you may get hundreds of differentially expressed genes, or just a couple. When you have many genes, rank them by Fold Change and you can select the top genes or all genes for next step. If you only have a few genes or no genes, typically it is because you don’t have enough replicates to get significant FDR. But don’t toss the data yet, you can use a less stringent cutoff (like FDR<0.2, or just use p-value instead of FDR) to get the most likely changed genes. Just remember you may need additional evidence (more arrays/NGS, or Q-PCR) to verify your results.

Step 3. Identify enriched functional categories

The basic idea is to look for functional categories that occur more often in your list (from step 2) versus background list (normally all genes in the genome). Functional categories are based on gene annotations provide by other scientific groups or international consortia, such as gene ontology (biological processes, molecular function, cellular component), KEGG Pathway, Interpro protein domains, etc. There are many tools for functional enrichment, one of the most popular tools is DAVID. Gene Set Enrichment Analysis (GSEA) has gain popularity in recent years, but it is only available as an offline tool currently. From enrichment analysis, you typically get a list of terms ranked by P-value and/or FDR.  See an example on the left (click to enlarge).

Functional Annotation

With BxGenomicDB, it is one click from step 2 (gene list) to step 3 (DAVID analysis) because BxGenomicDB passes gene IDs automatically to DAVID. You can see a detailed workflow here.

BioInfoRx also provide our own online tool for Functional Enrichment. It has a simple interface and will give you a fairly complete list of enriched terms. One bonus is that you can get enriched terms from Molecular Signatures Database (MSigDB), the collection of annotated gene sets used by GSEA. This is the closest you can get if you don’t want to bother downloading and learning GSEA.

http://bioinforx.com/app/data/files/69_Tips.pngTips for Step 3

If your list from step 2 is very long (say 1000 or more), you can use the whole list, but also try top genes (ranked by Fold Change). Sometimes the top 100-300 genes give you a more clear picture. If your list from step 2 is very small (~ 10), you may not get any significant enrichment. Try lowing the cutoff in step 2 and get more genes (~30-50 at least).

When you have a good list, you will see many enriched terms. Sometimes there are too many terms to go through, and many of them are similar due to the redundant nature of annotations. Luckily, DAVID has tool called Functional Annotation Clustering, which reports groups and displays similar annotations together which makes the biology clearer and more focused. I highly recommend this tool.