ppsPCP

Introduction

ppsPCP is a Pipeline to scan presence/absence variants (PAVs) and make fully annotated Pan-genome when one or multiple assembled plant genomes compared against one selected reference genome. ppsPCP can also be used for prokaryotes and other eukaryotes like animals etc.

To find PAVs and construct a Pan-genome, ppcPCP perform the following steps:

The reference and query genomes are aligned together, and PAVs are scanned. The minimum PAV length set to 100bp
All genes either assosiated with the PAVs, have no similarity with reference or not satisfy at least one of the previous defined criteria are filtered out
Extracted unique PAVs and genes are merged with reference genome to construct a fully annotated pan-genome

An overview of ppsPCP pipeline work flow

Download and Usage

Installation and usage of ppsPCP is very much easy. You can download the ppsPCP package from following ways:

Simple click to download ppsPCP package. After downloading, uncompress the package and put the bin directory into your PATH.
You can also download the ppsPCP package using wget or through git (github link )

# download the ppsPCP
wget http://cbi.hzau.edu.cn/ppsPCP/files/ppsPCP.zip
or
git clone git@github.com:Zhuxitong/ppsPCP.git
# Add the bin to PATH
$ export PATH=/path/to/ppsPCP/bin/:$PATH

ppsPCP available options for users

Usage: 
	make_pan.pl [options] --ref [reference_genome] --ref_anno [refernece_anno] --query query1_genome[query2...] 
	--query_anno query1_anno[query2...] &> [job_name].log

Options:

	Help
	--help|-h 		Print the help message and exit
			
	Required parameters
	--ref 			Reference sequence file, usually a fasta file
	--ref_anno 		The gff3 annotation file for the reference sequence
	--query 		The query sequence files, can be one or more, separated with space
	--query_anno 		The gff3 annotation files corresponding to the query sequence files, must
				have the same order with the query sequence files
	
	Filter parameters
	--coverage 		The coverage used to filter similar PAVs. Can be any number between 0 and 1. Default: 0.9
	--sim_pav 		The similarity used to filter similar PAVs. Can be any number between 0 and 1. Default: 0.95
	--sim_gene 		Then similarity used to filter mapped genes in blat mapping. Can be any number between 0 and 1.
				Default: 0.8
	
	Other parameters
	--tmp 			he temporary directory where you want to save the temporary files. Default: ./tmp
	--no_tmp 		Delete tmp file when job finished
	--thread 		The number of threads used for mummer and blastn. Remember not all the phases of ppsPCP are
				parallelized. Default: 1

Dependencies

MUMmer

You can find MUMmer HERE. Mummer-4.0.0beta2 is uesd. Mummer version 4.x.x requires a recent version of the GCC compiler (g++ version >= 4.7), which is hard to install if you have no administrator authority. You can ask your system administrator for some help in this case.

$ wget https://github.com/mummer4/mummer/releases/download/v4.0.0beta2/mummer-4.0.0beta2.tar.gz
$ tar -xvzf -xvzf mummer-4.0.0beta2.tar.gz
$ ./configure --prefix=/path/to/installation
$ make
$ make install
# Add MUMmer tools to your PATH
$ export PATH=/path/to/installation/:$PATH

Blast+

You can find Blast+ HERE in NCBI. We used the x64-linux version of Blast+.

$ wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.7.1+-x64-linux.tar.gz
$ tar zxvf ncbi-blast-2.7.1+-x64-linux.tar.gz
# Add Blast+ tools to your PATH
$ export PATH=/path/to/blast+/bin:$PATH

Bedtools

Bedtools is a powerful toolset for genome arithmetic. It is also very easy to install. In this pipeline, four sub-tools of Bedtools are used: getfasta, intersect, merge and sort.

$ wget https://github.com/arq5x/bedtools2/releases/download/v2.25.0/bedtools-2.25.0.tar.gz
$ tar -zxvf bedtools-2.25.0.tar.gz
$ cd bedtools2
$ make
# Add Bedtools tools to your PATH
$ export PATH=/path/to/bedtools/bin:$PATH

Blat

Blat is one of utilities from UCSC. You can select one utility to download or use below commad to download all of them from this page.

$ mkdir UCSC_tools
$ rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/ ./
# Add blat to your PATH
export PATH=/path/to/UCSC_tools/blat/:$PATH

gffread

gffread is a build-in tool in Cufflinks.So by installing cufflinks, you can use gffread easily.

$ wget http://cole-trapnell-lab.github.io/cufflinks/assets/downloads/cufflinks-2.2.1.Linux_x86_64.tar.gz
$ tar zxvf cufflinks-2.2.1.Linux_x86_64.tar.gz
# Add gffread to your PATH
$ export PATH=/path/to/cufflinks-2.2.1.Linux_x86_64/:$PATH

Perl and perl modules

We recommand the version of perl should be at-least 5.10.0 (use perl -v to check the version). Although most of the modules ppsPCP used are already exist, however you still may need to install the Bio::Perl module.

Installing the perl module under Linux system sometimes can be troublesome due to the lack of adminstrator permission. This page inrtoduces three ways to install the Bio::Perl module, but in practice the cpanm is the most friendly way to install perl module. You can find a pre-compiled source code for the cpanm HERE.

# if you are using cpanm for the first time, type the following command on your system.(By default, the module installed through cpanm will be in '~/perl5' directory).
$ cpanm --local-lib=~/perl5 local::lib && eval $(perl -I ~/perl5/lib/perl5/ -Mlocal::lib)
# install Bio::Perl
$ cpanm Bio::Perl

ppsPCP currently only supports Linux system due to the software dependencies.

Input and output files

Input files

At least two genome sequence files and two corresponding annotation files are required to run ppsPCP.

The genome sequence file should be a fasta file with following format:

>chr1
ATCGATCG...

File extension doesn't matter, '.fa', '.fasta' or any other suffix can be accepted. But the prefix name of sequence file will be used to indicate the temporary file, so we recommend you to use 'cultivar.fa (like rice.fa)' to run ppsPCP.

Annotation file should be GFF3 format (note that columns should be separated by tab):

ctg123 . gene 1000  9000  .  +  .  ID=gene00001;Name=EDEN
ctg123 . mRNA 1050  9000  .  +  .  ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123 . exon 1300  1500  .  +  .  ID=exon00001;Parent=mRNA00003
ctg123 . CDS  1201  1500  .  +  0  ID=cds00001;Parent=mRNA00001;Name=edenprotein.1

GFF format with 'gene' line information can also be accepted by ppsPCP.

Output files

The main output files of ppsPCP are 'pangenome.fa' and 'pangenome.gff3', if you create pan-genome with only two genome (one reference and one query), and some useful information about the pan-genome like number of PAVs in query, number of genes merged into pan-genome and so on. ppsPCP supports multiple query genome files, which will produce 'pangenome1.fa', 'pangenome2.fa'... so on, with corresponding gff3 file for each of them. The last pan-genome will be the final pan-genome representing total set of PAVs/genes scaned from every query genome and merged into reference genome.

Test ppsPCP with example data

A small dataset in the example directory can be used to test whether ppsPCP can run on your system successfully or not. Move to the 'example' directory and type the following commands:

$ cd example
$ make_pan.pl --ref Zmw_sc00394.1.fa --ref_anno Zmw_sc00394.1.gff3 --query Zjn_sc00188.1.fa --query_anno Zjn_sc00188.1.gff3 &> run.log

If you receive any error, please check the log information or contact us through e-mail. This result has no biological meaning because these two sequences are only a small part of two genomes from HERE.

Reference

Muhammad Tahir ul Qamar, Xitong Zhu, Feng Xing, Ling-Ling Chen. ppsPCP: A Plant Presence/absence Variants Scanner and Pan-genome Construction Pipeline. Bioinformatics. https://doi.org/10.1093/bioinformatics/btz168
All the data used in above paper and the outputs can be downloaded from here Rice and Arabidopsis.

Contact us

Muhammad Tahir ul Qamar: m.tahirulqamar@hotmail.com
Xitong Zhu: z724@qq.com
Feng Xing: xfengr@mail.hzau.edu.cn
Ling-Ling Chen: llchen@mail.hzau.edu.cn