CyVerse_logo

Home_Icon Learning Center Home

RNA-Seq with Kallisto and Sleuth

Goal

Analyze RNA-Seq data for differential expression. Kallisto manual is a quick, highly-efficient software for quantifying transcript abundances in an RNA-Seq experiment. Even on a typical laptop, Kallisto can quantify 30 million reads in less than 3 minutes. Integrated into CyVerse, you can take advantage of CyVerse data management tools to process your reads, do the Kallisto quantification, and analyze your reads with the Kallisto companion software Sleuth in an R-Studio environment.


Manual Maintainer(s)

Who to contact if this manual needs fixing. You can also email Tutorials@CyVerse.org

Maintainer Institution Contact
Jason Williams CyVerse / Cold Spring Harbor Laboratory Williams@cshl.edu

CyVerse_logo

Home_Icon Learning Center Home

Organize Kallisto Input Data

Description:

Kallisto has relatively few input requirements. You will need to have either single or paired end reads, as well as a reference transcriptome. It is suggested that your RNA-Seq reads are analyzed using FastQC, followed by any additional trimming and filtering using and application such as Trimmomatic.


Input Data:

Input Description Example
Reference transcriptome fasta Example transcriptome

Importing Reference Transcriptome

In this example, we will import a reference transcriptome for Arabidopsis from Ensembl. In many cases you can find an appropriate transcriptome from Ensembl for your organism of interest, or provide your own fasta-formmatted transcriptome.

  1. Go to the Ensembl Plants Arabidopsis page: http://plants.ensembl.org/Arabidopsis_thaliana/Info/Index.
  2. In the ‘Gene annotation’ section, click on the ‘Download genes, cDNAs, ncRNA, proteins’ ‘FASTA’ link: ftp://ftp.ensemblgenomes.org/pub/plants/release-36/fasta/arabidopsis_thaliana/.
  3. The transcriptome files will be located in the ‘cdna’ folder: ftp://ftp.ensemblgenomes.org/pub/plants/release-36/fasta/arabidopsis_thaliana/cdna/.
  4. Right-click/command-click on the Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz link and copy the link location/URL (usually right-click) to the clipboard
  5. Login to the CyVerse Discovery Environment.
  6. Open the Data window. In your home directory, create a folder to organize your Kallisto project
  7. In the created folder, go to the Data window’s ‘Upload’ menu, and select, ‘Import from URL…’; paste in the Ensembl link and click ‘Import from URL’ to begin the import

Output/Results

Output Description Example
Reference transcriptome fasta Example transcriptome

Description of results and next steps

This example transcriptome will be indexed in the next step. RNA-Seq reads will be mapped against this set of transcripts. Once you have the transcriptome and your RNA-Seq reads, you can proceed with the next step. We suggest organizing your RNA-Seq reads in the folder created in step 6 above.


Fix or improve this documentation


Home_Icon Learning Center Home

CyVerse_logo

Home_Icon Learning Center Home

Build Kallisto Transcriptome Index

Description:

As described in the Kallisto paper, RNA-Seq reads are efficiently mapped through a pseudoalignment process against a reference transcriptome index. We will build the index in this step.


Input Data:

Input Description Example
Reference transcriptome fasta Example transcriptome
RNA-Seq Reads Cleaned fastq files Example fastq files

Build Kallisto Index and Quantify Reads

We will now index the Arabidopsis transcriptome imported from Ensembl. This transcriptome can be used multiple times for future Kallisto analyses and only needs to be made once. In this tutorial, we have 36 fastq files (18 pairs), so you will need to add these to the Kallisto analyses. Kallisto uses a ‘hash-based’ pseudo alignment to deliver extremely fast matching of RNA-Seq reads against the transcriptome index.

  1. If necessary, login to the CyVerse Discovery Environment.

  2. In the App panel, open the Kallisto v.0.43.1 app or click this link: Kallisto app

  3. Name your analysis, and if desired enter comments and click ‘Next.’ In the App’s ‘Input’ section under ‘The transcript fasta file supplied (fasta or gzipped)’ browse to and select the transcriptome imported in the previous section.

  4. Under Paired of single end choose the format used in your sequencing.

    Sample data

    For the sample data, choose Paired

    Note

    For single-end data you will also need to choose fragment length and fragment standard deviation values in the apps “Options” section. You may also adjust settings for strand-specific reads.

  5. Under ‘FASTQ Files (Read 1)’ navigate to your data and select all the left-read files (usually R1). For paired-end data also enter the right-read files (usually R2) .

    Sample data

    For the sample data, navigate to /iplant/home/shared/cyverse_training/tutorials/kallisto/00_input_fastq_trimmed

    • For FASTQ Files (Read 1) choose all 18 files ending labeled R1 (e.g. SRR1761506_R1_001.fastq.gz_fp.trimmed.fastq.gz)
    • For FASTQ Files (Read 2) choose all 18 files ending labeled R2 (e.g. SRR1761506_R2_001.fastq.gz_fp.trimmed.fastq.gz)
  6. If desired adjust the bootstrap value (See Kallisto paper for recommendations); Click ‘Next’ to continue.

    Sample data

    We will use 25.

  7. If desired adjust the resources required and/or click ‘Next.’

    Sample data

    For the sample data, we will not specify resources.

  8. Finally click ‘Launch Analyses’ to start the job. Click on the Analyses menu to monitor the job and results.

Output/Results

Kallisto jobs will generate and index file and 3 output files per read / read-pair:

Output Description Example
Kallisto Index This is the index file Kallisto will map RNA-Seq reads to. Example Kallisto index
abundances.h5 HDF5 binary file containing run info, abundance estimates, bootstrap estimates, and transcript length information length. This file can be read in by Sleuth example abundance.h5
abundances.tsv plaintext file of the abundance estimates. It does not contains bootstrap estimates. When plaintext mode is selected; output plaintext abundance estimates. Alternatively, kallisto h5dump will output an HDF5 file to plaintext. The first line contains a header for each column, including estimated counts, TPM, effective length. example abundance.tsv
run_info.json a json file containing information about the run example json

Description of results and next steps

First, this application runs the ‘kallisto index’ command to build the the index of the transcriptome. Then the ‘kallisto quant’ command is run to do the pesudoalignment of the RNA-Seq reads. Kallisto quantifies RNA-Seq reads against an indexed transcriptome and generates a folder of results for each set of RNA- Seq reads. Sleuth will be used to examine the Kallisto results in R Studio.


Fix or improve this documentation


Home_Icon Learning Center Home

CyVerse_logo

Home_Icon Learning Center Home

Analyze Kallisto Results with Sleuth

Description:

Sleuth is a program for analysis of RNA-Seq experiments for which transcript abundances have been quantified with Kallisto. In this tutorial, we will use R Studio being served from an VICE instance.


Input Data:

Input Description Example
Kallisto results folder(s) Outputs from a Kallisto quantification Example directory of Kallisto results

Import Kallisto results into R and analyze with Sleuth

We will import the Kallisto results into an RStudio session being run from an Atmosphere image. Then we will follow a R script based on the Sleuth Walkthoughs.

  1. If necessary, login to the CyVerse Discovery Environment.

  2. In the App panel, open the Sleuth RStudio app or click this link:Sleuth app

  3. Name your analysis, and if desired enter comments.

  4. (Optional) In the ‘Notebooks’ section, under ‘Select an RMarkdown notebook to run’ select a notebook.

    Sample data

    For the sample data, navigate to and select /iplant/home/shared/cyverse_training/tutorials/kallisto/04_sleuth_R/sleuth_tutorial.Rmd

  5. In the ‘Datasets’ section, under ‘Data for analysis (outputs of Kallisto quantification)’ choose the folders containing quantification information for all sets of reads.

    Sample data

    For the sample data, navigate to and select /iplant/home/shared/cyverse_training/tutorials/kallisto/03_output_kallisto_results

  6. In the ‘Datasets’ section, under ‘Study design file’ choose a TSV file describing the samples and study design (see Sleuth).

    Sample data

    For the sample data, navigate to and select /iplant/home/shared/cyverse_training/tutorials/kallisto/04_sleuth_R/kallisto_demo.tsv

Tip

See the Example study design (Kallisto_demo_tsv) TSV file. You can create and edit your own in a spreadsheet editing program. The Sleuth explains this file and more is described in this tutorial’s RMarkdown notebook.

  1. If desired adjust the resources required and/or click ‘Next.’

    Sample data

    For the sample data, we will not specify resources.

  2. Click ‘Launch Analyses’ to start the job. Click on the Analyses button to monitor the job and results. In your Analyses menu, you will find a link to your VICE session ; this may take a few minutes to become active.

  3. In your RStudio session, double click on the sleuth_tutorial.Rmd file and follow the tutorial by pressing the green “play” triangles in each section of code. The code for the notebook is replicated below:

      ---
title: "Sleuth RNA-Seq Tutorial - Arabidopsis"
output:
  html_document:
    df_print: paged
---

This tutorial will take you through a sample RNA-Seq analysis using [kallisto](https://pachterlab.github.io/kallisto/about), using an RNA-Seq R package [Sleuth](https://pachterlab.github.io/sleuth/about). This tutorial is based on the one by the [Pachter lab](https://pachterlab.github.io/sleuth_walkthroughs/boj/analysis.html)

## Step 1: Load Sleuth and accessory libraries

Next, we need to load the Sleuth library to begin. We will also check the version:


```{r message=FALSE}
require("sleuth")
packageVersion("sleuth")
```

We will also install a plotting library and some other functions we will need...

```{r echo=FALSE, message=FALSE, warning=FALSE}
library("gridExtra")
library("cowplot")
```

We will also use [biomaRt](https://bioconductor.org/packages/release/bioc/html/biomaRt.html) tools will allow us to pull in recognizable gene names from a database.


```{r echo=FALSE, message=FALSE, warning=FALSE}
library("biomaRt")
```

## Step 2: Load experimental design and label kallisto outputs with metadata

### Locate sample names and describe our experimental design

We need to provide Sleuth with our sample names:

```{r}
sample_id <- dir(file.path("~/kallisto_qaunt_output/"))
sample_id
```

We also need to get the file paths to our results files.
```{r}
kal_dirs <- file.path("~/kallisto_qaunt_output", sample_id)
```

We also need a table that provides more meaningful names for describing our experiment...

```{r}
s2c <- read.table(file.path("~/kallisto_demo.tsv"),
                  header = TRUE,
                  stringsAsFactors = FALSE,
                  sep = "\t")
```

We will add our file paths to the table

```{r}
s2c <- dplyr::mutate(s2c, path = kal_dirs)
```

Let's view the table we have created:
```{r}
s2c
```

## Step 3: Load gene names from Ensembl

Next we need to determine which biomaRt to use. This can be a little complex so be sure to read their [documentation](https://www.bioconductor.org/packages/devel/bioc/vignettes/biomaRt/inst/doc/biomaRt.html). This [blog post](https://nsaunders.wordpress.com/2015/04/28/some-basics-of-biomart/) is also helpful.

```{r}
marts <- listMarts()
marts
```

If you are not working with these Ensembl data bases you may want to check out documentation on [using BiomaRts other than Ensembl](https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/biomaRt.html#using-a-biomart-other-than-ensembl). We are using plants, so...

```{r}
marts <- listMarts(host = "plants.ensembl.org")
marts
```

For now, remember that we will want to use `plants_mart`.

Next, we need to choose a specific data set.

```{r}
plants_mart <- useMart("plants_mart", host = "plants.ensembl.org" )
listDatasets(plants_mart)
```

After a little looking, its the `athaliana_eg_gene` data set that we need. Finally, we need to update our `plants_mart` to be more specific.

```{r}
plants_mart <- useMart("plants_mart", dataset = "athaliana_eg_gene", host="plants.ensembl.org" )
```

Now we want to get specific attributes from the list of genes we can import from biomart

```{r}
listAttributes(plants_mart)
```

We can choose whichever of these we'd like to use. Let's get transcript ids, gene ids, a description, and gene names. Notice, there are many things you may
want to come back for. We must get the transcript id because these are the names of the transcripts that were used in our Kallisto quantification.

```{r echo=FALSE, message=FALSE, warning=FALSE}
t2g <- getBM(attributes = c("ensembl_transcript_id",
                            "ensembl_gene_id",
                            "description",
                            "external_gene_name"),
             mart = plants_mart)
```

We need to make sure the `ensembl_transcript_id` column is named `target_id`

```{r}
ttg <- dplyr::rename(t2g, target_id= ensembl_transcript_id, ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
```


##Step 4: Prepare data for Sleuth

first we need to alter our experimental design so that we consider the full transcriptome sample to be the "control" to compare to...

```{r}
s2c$genotype_variation_s <- as.factor(s2c$genotype_variation_s)
s2c$genotype_variation_s <- relevel(s2c$genotype_variation_s, ref = "wild type")

```


Now we need to tell Sleuth both about the Kallisto results and the gene names (and gene descriptions/metadata) we obtained from biomaRt. The `sleuth_prep` function does this.

```{r warning=FALSE}
so <- sleuth_prep(s2c,
             full_model = ~genotype_variation_s,
             target_mapping = ttg,
             read_bootstrap_tpm=TRUE,
             extra_bootstrap_summary = TRUE)
```

##Step 5: Initial data exploration

### Examine Sleuth PCA

Next, we should check to see if our samples (and replicates) cluster on a PCA (as should expect) or if there are outliers. When we plot by condition, we'd expect that similar colors group together.

```{r}
library(cowplot)
ggplot2::theme_set(theme_cowplot())
plot_pca(so, color_by = 'genotype_variation_s', text_labels = TRUE)
```

Let's try plotting by treatment


```{r}
plot_pca(so, color_by = 'treatment_s', text_labels = TRUE)
```


We can also see genes involved in the the 1st PC by looking at the loadings (primary genes whose linear combinations define the principal components)


```{r}
plot_loadings(so, pc_input = 1)
```

Let's see how this "influential" gene (at least as far as PCA tells us) looks by condition
```{r}
plot_bootstrap(so, 'AT2G34420.1', color_by = 'genotype_variation_s')
```

Let's see how this "influential" gene (at least as far as PCA tells us) looks by treatment

```{r}
plot_bootstrap(so, 'AT2G34420.1', color_by = 'treatment_s')
```


##Step 6: Modeling, testing, and results exploration

### Differential expression testing with Sleuth

Now we need to run a few functions that will test for differential expression (abundance).

First we will create a model

```{r}
so <- sleuth_fit(so, ~genotype_variation_s, 'full')
so <- sleuth_fit(so, ~1, 'reduced')
so <- sleuth_lrt(so, 'reduced', 'full')
```

Now we can get the results of this analysis

```{r}
full_results <- sleuth_results(so, 'reduced:full', 'lrt',
                               show_all = FALSE)
head(full_results)
```

Let's add  Wald test
```{r}
wald_test <- colnames(design_matrix(so))[2]
so <- sleuth_wt(so, wald_test)
```

And start a Shiny Browser

```{r}
sleuth_live(so)
```

Summary Together, Kallisto and Sleuth are quick, powerful ways to analyze RNA-Seq data.


Fix or improve this documentation


Home_Icon Learning Center Home

Prerequisites

Downloads, access, and services

In order to complete this tutorial you will need access to the following services/software

Prerequisite Preparation/Notes Link/Download
CyVerse account You will need a CyVerse account to complete this exercise CyVerse User Portal

Platform(s)

We will use the following CyVerse platform(s):

Platform Interface Link Platform Documentation Quick Start
Data Store GUI/Command line Data Store Data Store Manual Data Store Guide
Discovery Environment Web/Point-and-click Discovery Environment DE Manual Discovery Environment Guide

Application(s) used

Discovery Environment App(s):

App name Version Description App link Notes/other links
Kallisto-v.0.43.1 0.43.1 Kallisto v.0.43.1 Kallisto app Kallisto manual
RStudio Sleuth 0.30.0 RStudio with Sleuth (v.0.30.0) and dependencies Sleuth app Sleuth

Input and example data

In order to complete this tutorial you will need to have the following inputs prepared

Input File(s) Format Preparation/Notes Example Data
RNA-Seq reads FastQ (may also be compressed, e.g. fastq.gz) These reads should have been cleaned by upstream tools such as Trimmomatic Example fastq files
Reference transcriptome fasta Transcriptome for your organism of interest Example transcriptome

Sample Data and Working with Your Own Data

Sample data

About the Sample Dataset In this tutorial, we are using publicly available data from the SRA. This tutorial will start with cleaned and processed reads. The SRA experiment used data from bioproject PRJNA272719. The abstract from that project is reprinted here:

‘To survey transcriptome changes by the mutations of a DNA demethylase ROS1 responding to a phytohormone abscisic acid, we performed the Next-gen sequencing (NGS) associated RNA-seq analysis. Two ROS1 knockout lines (ros1-3, ros1-4; Penterman et al. 2007 [PMID: 17409185]) with the wild-type Col line (wt) were subjected. Overall design: Three samples (ros1-3, ros1-4 and wt), biological triplicates, ABA or mock treatment, using Illumina HiSeq 2500 system’ citation.

Tip

Working with your own data

If you have your own FASTQ files upload them to CyVerse using instructions in the CyVerse Data Store Guide (e.g. iCommands/Cyberduck).


Fix or improve this documentation


Home_Icon Learning Center Home