======================= README file ==========================

This README lists the suite of programs and the steps necessary to generate
large, accurate and phylum-annotated Multiple Sequence Alignments (MSAs) from
curated NCBI Conserved Domain Database (CDD) hierarchical multiple sequence
alignments (hiMSAs).

------------- Download programs -------------

Download the following programs from
http://www.igs.umaryland.edu/labs/neuwald/software/mapgaps:

    CDD2MGS   - cdd2mgs1.0.1.tar.gz
    MAPGAPS   - mapgaps2.1.4.tar.gz
    PurgeMSA  - purge_msa.tar.gz
    AddPhylum - addphylum.tar.gz

Use gunzip to uncompress each archive and then tar to extract the executable
and other files. For example, to extract MAPGAPS type:

    gunzip mapgaps2.1.4.tar.gz
    tar xvf mapgaps2.1.4.tar

This will extract a binary executable named mapgaps. Typing './mapgaps' shows
its usage and descriptions of the input and output files. In addition to
mapgaps, the following auxiliary programs will be extracted:

    tweakcma
    fasplit

Download the following program from
http://www.igs.umaryland.edu/labs/neuwald/software/bpps/:

    BPPS - bpps1.1.4.tar

Uncompress and extract the executable and other files.

------------- Download set of NCBI CDD hiMSAs -------------

A compressed tarball containing the set of NCBI CDD hiMSAs (version 3.17) can
be downloaded either from the Neuwald Lab page
(http://www.igs.umaryland.edu/labs/neuwald/software/mapgaps/) by clicking on
the link under NCBI CDD hiMSAs, or from the NCBI ftp site:
ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/hiMSA.

Once downloaded, uncompress and extract the CDD hiMSAs. Each CDD hiMSA is put
into its own subdirectory, which is named after its identifier (e.g. cd00012).
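The gunzip and tar steps described above can be repeated for every archive with a short shell loop. The following is only a sketch: it assumes the archives were saved in the current directory under the file names listed above, and the helper name extract_archive is made up for this example. Unlike a plain gunzip, it leaves the original .tar.gz files in place.

```shell
#!/bin/sh
# Sketch: unpack each downloaded archive found in the current directory.
# extract_archive is a hypothetical helper, not part of any of the packages.
extract_archive() {
    archive=$1
    case "$archive" in
        *.tar.gz)
            # Decompress to stdout and extract, keeping the .tar.gz file.
            gunzip -c "$archive" | tar xf -
            ;;
        *.tar)
            tar xf "$archive"
            ;;
        *)
            echo "skipping unrecognized file: $archive" >&2
            return 1
            ;;
    esac
}

# Archive names as listed in this README; adjust if the versions differ.
for archive in cdd2mgs1.0.1.tar.gz mapgaps2.1.4.tar.gz purge_msa.tar.gz \
               addphylum.tar.gz bpps1.1.4.tar; do
    if [ -f "$archive" ]; then
        extract_archive "$archive"
    else
        echo "not found, download it first: $archive" >&2
    fi
done
```

Archives that have not been downloaded yet are reported on stderr and skipped, so the loop can be rerun safely after fetching the missing files.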

------------- Download of large sequence databases from NCBI -------------

Large sequence databases from NCBI, such as nr and pdbaa, can be downloaded
from the NCBI ftp site: ftp://ftp.ncbi.nih.gov/blast/db/FASTA/

A link to the NCBI ftp site is provided on the Neuwald Lab page
(http://www.igs.umaryland.edu/labs/neuwald/software/mapgaps/) under
NCBI non-redundant (nr) protein sequence ftp site.

Once the large FASTA file for a database has been downloaded from the NCBI,
the AddPhylum program can be used to add phylum information to the protein
sequences. Before performing queries, it is recommended to split the large
FASTA file into smaller FASTA files using the fasplit program, which is
extracted along with the MAPGAPS program.

------------- Steps used to generate a large MSA using a CDD hiMSA -------------

1. Run CDD2MGS to convert an mFASTA-formatted CDD hiMSA into MAPGAPS's input
   format:

       Usage: cdd2mgs [options]

   For example:

       cdd2mgs ./v3.17/ ./target cd00012

2. Download the nr, est_aa, env_nr, and/or pdbaa database files from the NCBI.
   Use the AddPhylum program to add phylum information to the protein
   sequences. Type './addphylum' for usage and for how to obtain the NCBI
   taxonomy input files.

   To run MAPGAPS in parallel, use fasplit to split the database into subsets
   of, say, 250,000 sequences each. For example, to split the phylum-annotated
   nr database (denoted as nrtx) type:

       fasplit nrtx 250,000 < nrtx

   This splits the nr database into hundreds of smaller FASTA files: nrtx.1,
   nrtx.2, etc.

3. Use MAPGAPS to perform a search on each database file:

       mapgaps [options]

   For example:

       mapgaps cd00012 pdbaa

4. If multiple subfiles were generated, concatenate these into a single file:

       cat nrtx.*_A.mma est_aatx.*_A.mma env_nr.*_A.mma > main.mma

5. Use PurgeMSA to merge the concatenated files and to remove short sequence
   fragments and redundant sequences:

       Usage: PurgeMSA [options]

   For example, to remove fragments that match fewer than 75% of the aligned
   columns while retaining sequences sharing less than 98% identity, type:

       PurgeMSA main 75 98

   This creates the final output file: main_Match75_U98.cma

References:

Neuwald, A.F. Rapid detection, classification and accurate alignment of up to
a million or more related protein sequences. Bioinformatics 2009. 25: 1869-1875.

Neuwald, A.F., L. Aravind & S.F. Altschul. Inferring Joint Sequence-Structural
Determinants of Protein Functional Specificity. eLife 2018.
doi: 10.7554/eLife.29880.001.
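
As a recap of steps 3 to 5 of the procedure above, the following dry-run sketch prints the commands for one split database instead of running them. It assumes that the split files are named nrtx.1, nrtx.2, etc., as produced by fasplit, and that mapgaps accepts a split file in place of the full database name, following the 'mapgaps cd00012 pdbaa' form shown above. The function name plan_pipeline is made up for this example.

```shell
#!/bin/sh
# Dry run of steps 3 to 5: print the commands, do not execute them.
# plan_pipeline is a hypothetical helper, not part of MAPGAPS.
plan_pipeline() {
    profile=$1   # CDD identifier, e.g. cd00012
    db=$2        # phylum-annotated database prefix, e.g. nrtx
    nparts=$3    # number of split files produced by fasplit

    # Step 3: one mapgaps search per split file (assumed invocation form).
    i=1
    while [ "$i" -le "$nparts" ]; do
        echo "mapgaps $profile $db.$i"
        i=$((i + 1))
    done

    # Step 4: concatenate the per-file alignments into main.mma.
    echo "cat $db.*_A.mma > main.mma"

    # Step 5: drop fragments and redundant sequences (75 and 98 as above).
    echo "PurgeMSA main 75 98"
}

plan_pipeline cd00012 nrtx 3
```

Review the printed commands before running them, for example by redirecting the output to a file; the separate mapgaps lines are independent of one another, which is what makes running them in parallel possible.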