RNA-Seq with Kallisto and Sleuth¶
Goal¶
Analyze RNA-Seq data for differential expression. Kallisto manual is a quick, highly-efficient software for quantifying transcript abundances in an RNA-Seq experiment. Even on a typical laptop, Kallisto can quantify 30 million reads in less than 3 minutes. Integrated into CyVerse, you can take advantage of CyVerse data management tools to process your reads, do the Kallisto quantification, and analyze your reads with the Kallisto companion software Sleuth in an R-Studio environment.
Manual Maintainer(s)¶
Who to contact if this manual needs fixing. You can also email Tutorials@CyVerse.org
Maintainer | Institution | Contact |
---|---|---|
Jason Williams | CyVerse / Cold Spring Harbor Laboratory | Williams@cshl.edu |
Organize Kallisto Input Data¶
Description:
Kallisto has relatively few input requirements. You will need to have either single or paired end reads, as well as a reference transcriptome. It is suggested that your RNA-Seq reads are analyzed using FastQC, followed by any additional trimming and filtering using and application such as Trimmomatic.
Input Data:
Input | Description | Example |
---|---|---|
Reference transcriptome | fasta | Example transcriptome |
Importing Reference Transcriptome¶
In this example, we will import a reference transcriptome for Arabidopsis from Ensembl. In many cases you can find an appropriate transcriptome from Ensembl for your organism of interest, or provide your own fasta-formmatted transcriptome.
- Go to the Ensembl Plants Arabidopsis page: http://plants.ensembl.org/Arabidopsis_thaliana/Info/Index.
- In the ‘Gene annotation’ section, click on the ‘Download genes, cDNAs, ncRNA, proteins’ ‘FASTA’ link: ftp://ftp.ensemblgenomes.org/pub/plants/release-36/fasta/arabidopsis_thaliana/.
- The transcriptome files will be located in the ‘cdna’ folder: ftp://ftp.ensemblgenomes.org/pub/plants/release-36/fasta/arabidopsis_thaliana/cdna/.
- Right-click/command-click on the Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz link and copy the link location/URL (usually right-click) to the clipboard
- Login to the CyVerse Discovery Environment.
- Open the Data window. In your home directory, create a folder to organize your Kallisto project
- In the created folder, go to the Data window’s ‘Upload’ menu, and select, ‘Import from URL…’; paste in the Ensembl link and click ‘Import from URL’ to begin the import
Output/Results
Output | Description | Example |
---|---|---|
Reference transcriptome | fasta | Example transcriptome |
Description of results and next steps
This example transcriptome will be indexed in the next step. RNA-Seq reads will be mapped against this set of transcripts. Once you have the transcriptome and your RNA-Seq reads, you can proceed with the next step. We suggest organizing your RNA-Seq reads in the folder created in step 6 above.
Fix or improve this documentation
- Search for an answer: CyVerse Learning Center
- Ask us for help:
click
on the lower right-hand side of the page
- Report an issue or submit a change: Github Repo Link
- Send feedback: Tutorials@CyVerse.org
Build Kallisto Transcriptome Index¶
Description:
As described in the Kallisto paper, RNA-Seq reads are efficiently mapped through a pseudoalignment process against a reference transcriptome index. We will build the index in this step.
Input Data:
Input | Description | Example |
---|---|---|
Reference transcriptome | fasta | Example transcriptome |
RNA-Seq Reads | Cleaned fastq files | Example fastq files |
Build Kallisto Index and Quantify Reads¶
We will now index the Arabidopsis transcriptome imported from Ensembl. This transcriptome can be used multiple times for future Kallisto analyses and only needs to be made once. In this tutorial, we have 36 fastq files (18 pairs), so you will need to add these to the Kallisto analyses. Kallisto uses a ‘hash-based’ pseudo alignment to deliver extremely fast matching of RNA-Seq reads against the transcriptome index.
If necessary, login to the CyVerse Discovery Environment.
In the App panel, open the Kallisto v.0.43.1 app or click this link:
Name your analysis, and if desired enter comments and click ‘Next.’ In the App’s ‘Input’ section under ‘The transcript fasta file supplied (fasta or gzipped)’ browse to and select the transcriptome imported in the previous section.
Under Paired of single end choose the format used in your sequencing.
Sample data
For the sample data, choose Paired
Note
For single-end data you will also need to choose fragment length and fragment standard deviation values in the apps “Options” section. You may also adjust settings for strand-specific reads.
Under ‘FASTQ Files (Read 1)’ navigate to your data and select all the left-read files (usually R1). For paired-end data also enter the right-read files (usually R2) .
Sample data
For the sample data, navigate to /iplant/home/shared/cyverse_training/tutorials/kallisto/00_input_fastq_trimmed
- For FASTQ Files (Read 1) choose all 18 files ending labeled R1 (e.g. SRR1761506_R1_001.fastq.gz_fp.trimmed.fastq.gz)
- For FASTQ Files (Read 2) choose all 18 files ending labeled R2 (e.g. SRR1761506_R2_001.fastq.gz_fp.trimmed.fastq.gz)
If desired adjust the bootstrap value (See Kallisto paper for recommendations); Click ‘Next’ to continue.
Sample data
We will use 25.
If desired adjust the resources required and/or click ‘Next.’
Sample data
For the sample data, we will not specify resources.
Finally click ‘Launch Analyses’ to start the job. Click on the Analyses menu to monitor the job and results.
Output/Results
Kallisto jobs will generate and index file and 3 output files per read / read-pair:
Output | Description | Example |
---|---|---|
Kallisto Index | This is the index file Kallisto will map RNA-Seq reads to. | Example Kallisto index |
abundances.h5 | HDF5 binary file containing run info, abundance estimates, bootstrap estimates, and transcript length information length. This file can be read in by Sleuth | example abundance.h5 |
abundances.tsv | plaintext file of the abundance estimates. It does not contains bootstrap estimates. When plaintext mode is selected; output plaintext abundance estimates. Alternatively, kallisto h5dump will output an HDF5 file to plaintext. The first line contains a header for each column, including estimated counts, TPM, effective length. | example abundance.tsv |
run_info.json | a json file containing information about the run | example json |
Description of results and next steps
First, this application runs the ‘kallisto index’ command to build the the index of the transcriptome. Then the ‘kallisto quant’ command is run to do the pesudoalignment of the RNA-Seq reads. Kallisto quantifies RNA-Seq reads against an indexed transcriptome and generates a folder of results for each set of RNA- Seq reads. Sleuth will be used to examine the Kallisto results in R Studio.
Fix or improve this documentation
- Search for an answer: CyVerse Learning Center
- Ask us for help:
click
on the lower right-hand side of the page
- Report an issue or submit a change: Github Repo Link
- Send feedback: Tutorials@CyVerse.org
Analyze Kallisto Results with Sleuth¶
Description:
Sleuth is a program for analysis of RNA-Seq experiments for which transcript abundances have been quantified with Kallisto. In this tutorial, we will use R Studio being served from an VICE instance.
Input Data:
Input | Description | Example |
---|---|---|
Kallisto results folder(s) | Outputs from a Kallisto quantification | Example directory of Kallisto results |
Import Kallisto results into R and analyze with Sleuth¶
We will import the Kallisto results into an RStudio session being run from an Atmosphere image. Then we will follow a R script based on the Sleuth Walkthoughs.
If necessary, login to the CyVerse Discovery Environment.
In the App panel, open the Sleuth RStudio app or click this link:
Name your analysis, and if desired enter comments.
(Optional) In the ‘Notebooks’ section, under ‘Select an RMarkdown notebook to run’ select a notebook.
Sample data
For the sample data, navigate to and select /iplant/home/shared/cyverse_training/tutorials/kallisto/04_sleuth_R/sleuth_tutorial.Rmd
In the ‘Datasets’ section, under ‘Data for analysis (outputs of Kallisto quantification)’ choose the folders containing quantification information for all sets of reads.
Sample data
For the sample data, navigate to and select /iplant/home/shared/cyverse_training/tutorials/kallisto/03_output_kallisto_results
In the ‘Datasets’ section, under ‘Study design file’ choose a TSV file describing the samples and study design (see Sleuth).
Sample data
For the sample data, navigate to and select /iplant/home/shared/cyverse_training/tutorials/kallisto/04_sleuth_R/kallisto_demo.tsv
Tip
See the Example study design (Kallisto_demo_tsv) TSV file. You can create and edit your own in a spreadsheet editing program. The Sleuth explains this file and more is described in this tutorial’s RMarkdown notebook.
If desired adjust the resources required and/or click ‘Next.’
Sample data
For the sample data, we will not specify resources.
Click ‘Launch Analyses’ to start the job. Click on the Analyses button to monitor the job and results. In your Analyses menu, you will find a link to your VICE session ; this may take a few minutes to become active.
In your RStudio session, double click on the sleuth_tutorial.Rmd file and follow the tutorial by pressing the green “play” triangles in each section of code. The code for the notebook is replicated below:
--- title: "Sleuth RNA-Seq Tutorial - Arabidopsis" output: html_document: df_print: paged --- This tutorial will take you through a sample RNA-Seq analysis using [kallisto](https://pachterlab.github.io/kallisto/about), using an RNA-Seq R package [Sleuth](https://pachterlab.github.io/sleuth/about). This tutorial is based on the one by the [Pachter lab](https://pachterlab.github.io/sleuth_walkthroughs/boj/analysis.html) ## Step 1: Load Sleuth and accessory libraries Next, we need to load the Sleuth library to begin. We will also check the version: ```{r message=FALSE} require("sleuth") packageVersion("sleuth") ``` We will also install a plotting library and some other functions we will need... ```{r echo=FALSE, message=FALSE, warning=FALSE} library("gridExtra") library("cowplot") ``` We will also use [biomaRt](https://bioconductor.org/packages/release/bioc/html/biomaRt.html) tools will allow us to pull in recognizable gene names from a database. ```{r echo=FALSE, message=FALSE, warning=FALSE} library("biomaRt") ``` ## Step 2: Load experimental design and label kallisto outputs with metadata ### Locate sample names and describe our experimental design We need to provide Sleuth with our sample names: ```{r} sample_id <- dir(file.path("~/kallisto_qaunt_output/")) sample_id ``` We also need to get the file paths to our results files. ```{r} kal_dirs <- file.path("~/kallisto_qaunt_output", sample_id) ``` We also need a table that provides more meaningful names for describing our experiment... ```{r} s2c <- read.table(file.path("~/kallisto_demo.tsv"), header = TRUE, stringsAsFactors = FALSE, sep = "\t") ``` We will add our file paths to the table ```{r} s2c <- dplyr::mutate(s2c, path = kal_dirs) ``` Let's view the table we have created: ```{r} s2c ``` ## Step 3: Load gene names from Ensembl Next we need to determine which biomaRt to use. This can be a little complex so be sure to read their [documentation](https://www.bioconductor.org/packages/devel/bioc/vignettes/biomaRt/inst/doc/biomaRt.html). This [blog post](https://nsaunders.wordpress.com/2015/04/28/some-basics-of-biomart/) is also helpful. ```{r} marts <- listMarts() marts ``` If you are not working with these Ensembl data bases you may want to check out documentation on [using BiomaRts other than Ensembl](https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/biomaRt.html#using-a-biomart-other-than-ensembl). We are using plants, so... ```{r} marts <- listMarts(host = "plants.ensembl.org") marts ``` For now, remember that we will want to use `plants_mart`. Next, we need to choose a specific data set. ```{r} plants_mart <- useMart("plants_mart", host = "plants.ensembl.org" ) listDatasets(plants_mart) ``` After a little looking, its the `athaliana_eg_gene` data set that we need. Finally, we need to update our `plants_mart` to be more specific. ```{r} plants_mart <- useMart("plants_mart", dataset = "athaliana_eg_gene", host="plants.ensembl.org" ) ``` Now we want to get specific attributes from the list of genes we can import from biomart ```{r} listAttributes(plants_mart) ``` We can choose whichever of these we'd like to use. Let's get transcript ids, gene ids, a description, and gene names. Notice, there are many things you may want to come back for. We must get the transcript id because these are the names of the transcripts that were used in our Kallisto quantification. ```{r echo=FALSE, message=FALSE, warning=FALSE} t2g <- getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id", "description", "external_gene_name"), mart = plants_mart) ``` We need to make sure the `ensembl_transcript_id` column is named `target_id` ```{r} ttg <- dplyr::rename(t2g, target_id= ensembl_transcript_id, ens_gene = ensembl_gene_id, ext_gene = external_gene_name) ``` ##Step 4: Prepare data for Sleuth first we need to alter our experimental design so that we consider the full transcriptome sample to be the "control" to compare to... ```{r} s2c$genotype_variation_s <- as.factor(s2c$genotype_variation_s) s2c$genotype_variation_s <- relevel(s2c$genotype_variation_s, ref = "wild type") ``` Now we need to tell Sleuth both about the Kallisto results and the gene names (and gene descriptions/metadata) we obtained from biomaRt. The `sleuth_prep` function does this. ```{r warning=FALSE} so <- sleuth_prep(s2c, full_model = ~genotype_variation_s, target_mapping = ttg, read_bootstrap_tpm=TRUE, extra_bootstrap_summary = TRUE) ``` ##Step 5: Initial data exploration ### Examine Sleuth PCA Next, we should check to see if our samples (and replicates) cluster on a PCA (as should expect) or if there are outliers. When we plot by condition, we'd expect that similar colors group together. ```{r} library(cowplot) ggplot2::theme_set(theme_cowplot()) plot_pca(so, color_by = 'genotype_variation_s', text_labels = TRUE) ``` Let's try plotting by treatment ```{r} plot_pca(so, color_by = 'treatment_s', text_labels = TRUE) ``` We can also see genes involved in the the 1st PC by looking at the loadings (primary genes whose linear combinations define the principal components) ```{r} plot_loadings(so, pc_input = 1) ``` Let's see how this "influential" gene (at least as far as PCA tells us) looks by condition ```{r} plot_bootstrap(so, 'AT2G34420.1', color_by = 'genotype_variation_s') ``` Let's see how this "influential" gene (at least as far as PCA tells us) looks by treatment ```{r} plot_bootstrap(so, 'AT2G34420.1', color_by = 'treatment_s') ``` ##Step 6: Modeling, testing, and results exploration ### Differential expression testing with Sleuth Now we need to run a few functions that will test for differential expression (abundance). First we will create a model ```{r} so <- sleuth_fit(so, ~genotype_variation_s, 'full') so <- sleuth_fit(so, ~1, 'reduced') so <- sleuth_lrt(so, 'reduced', 'full') ``` Now we can get the results of this analysis ```{r} full_results <- sleuth_results(so, 'reduced:full', 'lrt', show_all = FALSE) head(full_results) ``` Let's add Wald test ```{r} wald_test <- colnames(design_matrix(so))[2] so <- sleuth_wt(so, wald_test) ``` And start a Shiny Browser ```{r} sleuth_live(so) ```
Summary Together, Kallisto and Sleuth are quick, powerful ways to analyze RNA-Seq data.
Fix or improve this documentation
- Search for an answer: CyVerse Learning Center
- Ask us for help:
click
on the lower right-hand side of the page
- Report an issue or submit a change: Github Repo Link
- Send feedback: Tutorials@CyVerse.org
Prerequisites¶
Downloads, access, and services¶
In order to complete this tutorial you will need access to the following services/software
Prerequisite | Preparation/Notes | Link/Download |
---|---|---|
CyVerse account | You will need a CyVerse account to complete this exercise | CyVerse User Portal |
Platform(s)¶
We will use the following CyVerse platform(s):
Platform | Interface | Link | Platform Documentation | Quick Start |
---|---|---|---|---|
Data Store | GUI/Command line | Data Store | Data Store Manual | Data Store Guide |
Discovery Environment | Web/Point-and-click | Discovery Environment | DE Manual | Discovery Environment Guide |
Application(s) used¶
Discovery Environment App(s):
App name | Version | Description | App link | Notes/other links |
---|---|---|---|---|
Kallisto-v.0.43.1 | 0.43.1 | Kallisto v.0.43.1 | ![]() |
Kallisto manual |
RStudio Sleuth | 0.30.0 | RStudio with Sleuth (v.0.30.0) and dependencies | ![]() |
Sleuth |
Input and example data¶
In order to complete this tutorial you will need to have the following inputs prepared
Input File(s) | Format | Preparation/Notes | Example Data |
---|---|---|---|
RNA-Seq reads | FastQ (may also be compressed, e.g. fastq.gz) | These reads should have been cleaned by upstream tools such as Trimmomatic | Example fastq files |
Reference transcriptome | fasta | Transcriptome for your organism of interest | Example transcriptome |
Sample Data and Working with Your Own Data¶
Sample data
About the Sample Dataset In this tutorial, we are using publicly available data from the SRA. This tutorial will start with cleaned and processed reads. The SRA experiment used data from bioproject PRJNA272719. The abstract from that project is reprinted here:
‘To survey transcriptome changes by the mutations of a DNA demethylase ROS1 responding to a phytohormone abscisic acid, we performed the Next-gen sequencing (NGS) associated RNA-seq analysis. Two ROS1 knockout lines (ros1-3, ros1-4; Penterman et al. 2007 [PMID: 17409185]) with the wild-type Col line (wt) were subjected. Overall design: Three samples (ros1-3, ros1-4 and wt), biological triplicates, ABA or mock treatment, using Illumina HiSeq 2500 system’ citation.
Tip
Working with your own data
If you have your own FASTQ files upload them to CyVerse using instructions in the CyVerse Data Store Guide (e.g. iCommands/Cyberduck).
Fix or improve this documentation
- Search for an answer: CyVerse Learning Center
- Ask us for help:
click
on the lower right-hand side of the page
- Report an issue or submit a change: Github Repo Link
- Send feedback: Tutorials@CyVerse.org