=======================  README file  ==========================

This README lists the suite of programs and steps necessary to generate large, accurate and 
phylum-annotated Multiple Sequence Alignments (MSAs) from curated NCBI Conserved Domain 
Database (CDD) hierarchical multiple sequence alignments (hiMSAs).

-------------  Download programs -------------
Download the following programs from http://www.igs.umaryland.edu/labs/neuwald/software/mapgaps:

CDD2MGS - cdd2mgs1.0.1.tar.gz
MAPGAPS - mapgaps2.1.4.tar.gz
PurgeMSA - purge_msa.tar.gz
AddPhylum - addphylum.tar.gz

Use gunzip to uncompress each archive and then tar to extract the executable and other files.
For example, to extract MAPGAPS, type:

    gunzip mapgaps2.1.4.tar.gz 
    tar xvf mapgaps2.1.4.tar

This extracts a binary executable named mapgaps.  Typing './mapgaps' shows its usage 
and descriptions of its input and output files.
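The extraction step can also be scripted for all of the downloaded archives at once. 
The following is a minimal sketch, not part of the distributed software. It assumes a tar 
with gzip support (GNU tar or bsdtar) and that the archives keep their original .tar.gz 
names; the function name extract_all is hypothetical.

```shell
# Extract every .tar.gz archive found in a directory (default: current one).
# Assumption: the archives were downloaded there under their original names.
extract_all() {
    dir="${1:-.}"
    for archive in "$dir"/*.tar.gz; do
        # Skip the unexpanded pattern when no archive is present.
        [ -e "$archive" ] || continue
        # 'tar xzf' uncompresses and extracts in one step.
        tar xzf "$archive" -C "$dir" || return 1
        echo "extracted: $archive"
    done
}
```

Calling 'extract_all .' in the download directory handles the four .tar.gz archives listed 
above. An archive that is already uncompressed (a plain .tar file) is not matched by this 
sketch and still needs 'tar xvf'.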

In addition to mapgaps, the following auxiliary programs will be extracted:

tweakcma
fasplit

Download the following programs from http://www.igs.umaryland.edu/labs/neuwald/software/bpps/:

BPPS - bpps1.1.4.tar

Uncompress and extract the executable and other files.

-------------  Download set of NCBI CDD hiMSAs -------------

A compressed tarball containing the set of NCBI CDD hiMSAs (version 3.17) can be downloaded either
from the Neuwald Lab page (http://www.igs.umaryland.edu/labs/neuwald/software/mapgaps/) by clicking 
on the link under NCBI CDD hiMSAs or from the NCBI ftp site: ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/hiMSA.  

Once downloaded, uncompress and extract the CDD hiMSAs. Each CDD hiMSA is put into its own subdirectory 
and is named after its identifier (e.g. cd00012).
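Because each hiMSA sits in a subdirectory named after its identifier, it is easy to check 
that a given identifier is present before starting a run. The sketch below is illustrative 
only: the function name himsa_path is hypothetical, and the base directory (for example 
./v3.17) is whatever location the tarball was extracted to.

```shell
# Print the path of the hiMSA subdirectory for an identifier, or fail if
# it is missing.  Assumption: <base>/<identifier> is the extracted layout.
himsa_path() {
    base="${1%/}"      # drop one trailing slash, if any
    ident="$2"
    if [ -d "$base/$ident" ]; then
        echo "$base/$ident"
    else
        echo "no hiMSA directory for $ident under $base" >&2
        return 1
    fi
}
```

For example, 'himsa_path ./v3.17 cd00012' prints the subdirectory path when it exists and 
returns a non-zero status otherwise.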

--------------  Download large sequence databases from NCBI -------------

The large sequence databases from NCBI, such as nr and pdbaa, can be downloaded from the NCBI ftp site:

ftp://ftp.ncbi.nih.gov/blast/db/FASTA/

A link to the NCBI ftp site is provided on the Neuwald Lab page 
(http://www.igs.umaryland.edu/labs/neuwald/software/mapgaps/) 
under NCBI non-redundant (nr) protein sequence ftp site.

Once the large FASTA file for a database has been downloaded from the NCBI, the AddPhylum program 
can be used to add phylum information to its protein sequences. Before performing queries, it is 
recommended to split the large FASTA file into smaller FASTA files using the fasplit program, 
which is downloaded with the MAPGAPS program.
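To choose a subset size before splitting, it can help to know how many sequences the FASTA 
file holds. The following sketch uses only standard tools and is not part of the MAPGAPS 
distribution. The function names are hypothetical, and the subset estimate assumes that 
every subset except the last holds exactly the requested number of sequences.

```shell
# Count sequences in a FASTA file: each record starts with a '>' header line.
count_fasta_seqs() {
    [ -r "$1" ] || { echo "cannot read: $1" >&2; return 1; }
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c '^>' "$1" || true
}

# Estimate how many subsets a split of the given size would produce.
subsets_needed() {
    total=$(count_fasta_seqs "$1") || return 1
    size="$2"
    # Ceiling division: a partial final subset still needs its own file.
    echo $(( (total + size - 1) / size ))
}
```

For example, 'subsets_needed nrtx 250000' estimates the number of smaller files that a split 
of the phylum-annotated nr database into 250,000-sequence subsets would give.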

--------------  Steps used to generate a large MSA using a CDD hiMSA ------------- 

1. Run CDD2MGS to convert mFASTA formatted CDD hiMSA into MAPGAPS's input format:

  Usage: cdd2mgs <cdd_dir> <target_dir> <cd_ident> [options]

  For example: cdd2mgs ./v3.17/ ./target cd00012

2. Download the nr, est_aa, env_nr, and/or pdbaa database files from the NCBI.
    Use the AddPhylum program to add phylum information to the protein sequences.
    Type './addphylum' for usage and for how to obtain the NCBI taxonomy input files.
    To run MAPGAPS in parallel, use fasplit to split the database into subsets of,
    say, 250,000 sequences each.  For example, to split the phylum-annotated nr
    database (denoted as nrtx), type:

		fasplit nrtx 250,000 < nrtx

    This splits the nr database into hundreds of smaller fasta files: nrtx.1, nrtx.2, etc.

3. Use MAPGAPS to perform a search on each database file:

    Usage: mapgaps <prefix> <database> [options]

    For example: mapgaps cd00012 pdbaa

4. If multiple subfiles were generated, concatenate these into a single file:

   cat nrtx.*_A.mma est_aatx.*_A.mma env_nr.*_A.mma > main.mma

5. Use PurgeMSA to merge the concatenated files and to remove short sequence fragments and redundant sequences:

   Usage: PurgeMSA <cmafile> <int1> <int2> [options]

    For example, to remove fragments that match fewer than 75% of the aligned columns while retaining
    sequences sharing less than 98% identity, type:

     PurgeMSA main 75 98

This creates the final output file: main_Match75_U98.cma
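
Steps 3 to 5 can be chained in a small script. The following is a sketch under stated 
assumptions, not a script shipped with the programs: it assumes that mapgaps and PurgeMSA 
are on the PATH, that split database files are named <database>.<number>, and that the 
PurgeMSA output name follows the pattern of the single example above. The function names 
are hypothetical, and the search loop runs the subsets one after another rather than in 
parallel.

```shell
# Commands are overridable so the sketch can be dry-run (e.g. MAPGAPS=echo).
MAPGAPS="${MAPGAPS:-mapgaps}"
PURGEMSA="${PURGEMSA:-PurgeMSA}"

# Step 3: run MAPGAPS on every numbered subset of a split database.
search_subsets() {
    prefix="$1"; db="$2"
    for subset in "$db".[0-9]*; do
        [ -e "$subset" ] || continue
        # Keep only purely numeric suffixes (nrtx.1), not outputs (nrtx.1_A.mma).
        case "${subset##*.}" in
            *[!0-9]*|'') continue ;;
        esac
        "$MAPGAPS" "$prefix" "$subset" || return 1
    done
}

# Step 4: concatenate the per-subset alignment files into one file.
merge_hits() {
    out="$1"; shift
    found=0
    : > "$out"
    for f in "$@"; do
        [ -f "$f" ] || continue
        cat "$f" >> "$out"
        found=$((found + 1))
    done
    # Fail if no input file existed, so an empty result is not mistaken for success.
    [ "$found" -gt 0 ]
}

# Step 5: purge fragments and redundant sequences, then report the
# output name (assumed pattern: <name>_Match<int1>_U<int2>.cma).
purge() {
    name="$1"; min_match="$2"; max_ident="$3"
    "$PURGEMSA" "$name" "$min_match" "$max_ident" || return 1
    echo "expected output: ${name}_Match${min_match}_U${max_ident}.cma"
}
```

A run over the split nr database would then be 'search_subsets cd00012 nrtx', followed by 
'merge_hits main.mma nrtx.*_A.mma' and 'purge main 75 98'.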

References:
    Neuwald, A.F. Rapid detection, classification and accurate alignment of up to a million 
      or more related protein sequences.  Bioinformatics 2009. 25: 1869-1875
    Neuwald, A.F., L. Aravind & S.F. Altschul. Inferring Joint Sequence-Structural Determinants 
      of Protein Functional Specificity. eLife 2018. doi: 10.7554/eLife.29880.001.