HapSeq2 is a program for haplotype phasing and genotype calling from next-generation sequencing (NGS) data. This page is to demonstrate a pipeline that can generate phased genotypes (VCF format) from read alignments (BAM format).
Warning: This tutorial assumes you are using Linux operating system.
Go to here to check out the HapSeq2 program.
|hapseq2||use 64-bit precompiled binary packaged and attached to this page.|
|bam_parser||use precompiled binary packaged and attached to this page.|
The starting point are three things:
- A file containing a list of individuals.
- A directory containing set of BAM files, sorted and indexed, each corresponding to one individual in the list above.
- A VCF file containing a SNP site list. Currently HapSeq2 ignore non-SNP polymorphic sites. No genotype information is needed.
The output is a VCF file contained the phased genotypes called over the sites.
Go to here to download a toy example, which is a subset of the 1000 Genomes Project Phase 1 data.
To deploy the tutorial:
This will generate a directory named HapSeq2_pipeline. Go in and get started
Run bam_parser to prepare input for HapSeq2
bam_parser is a program that go over a list of BAM files and extract all relevant input files required by HapSeq2.
We have a wrapper run_bam_parser.pl to call bam_parser for each BAM file and put the resulting files into a temporary directory tmp_input. This can take some time.
Now count.txt contains the allele count information of reads covering single sites (R1); jump1.txt contains double site allele count information for reads covering any consecutive sites (R2); and read.txt contains long range read-site coverage information (R3). sites.txt file contains sites information. jump2.txt contains similar information as jump1.txt, but currently is unused. For the detailed description of these files, please refer to HapSeq2 Manual from this web site.
Now we are ready to run HapSeq2.
This is the main run and could take some time. Typically this will be run as a cluster job. Be sure to prepare enough memory and CPU time.
The parameters are the standard HapSeq2 option. See here for a complete manual of HapSeq2 program.
The output files are the same as that of Mach/Thunder, with 'res-hapseq' as prefix.
Converting MACH/Thunder/HapSeq2 results into VCF format
This can be achieved by the following script:
Comparing results to 1000 Genomes Project Phase 1 release
We also included a sub-section from the Omni genotyping results phased by SHAPEIT as part of 1000 Genomes Project phase 1 release.
VCFTools (not included) can be used to compare the results:
You can check the results by the following: