# README for the following files. It outlines how they were cleaned up from the CGA hapmap, hg19_population_stratified_af_hapmap_3.3.vcf.gz, downloaded from . These population-stratified allele frequency (AF) files are for use with the GATK tool ContEst, which estimates cross-sample contamination and potential sample swaps. All operations were performed by @shlee on February 6, 2017. Post questions to .

172M Feb  6 21:44 hapmap_3.3_b37_pop_stratified_af.vcf.gz
1.5M Feb  6 21:44 hapmap_3.3_b37_pop_stratified_af.vcf.gz.tbi
172M Feb  6 21:46 hapmap_3.3_hg19_pop_stratified_af.vcf.gz
1.5M Feb  6 21:46 hapmap_3.3_hg19_pop_stratified_af.vcf.gz.tbi
172M Feb  6 21:47 hapmap_3.3_grch38_pop_stratified_af.vcf.gz
1.5M Feb  6 21:47 hapmap_3.3_grch38_pop_stratified_af.vcf.gz.tbi
546K Feb  6 21:47 rejected_vars_liftoverhg19tohg38_hapmap_3.3.vcf.gz
 20K Feb  6 21:47 rejected_vars_liftoverhg19tohg38_hapmap_3.3.vcf.gz.tbi

1. Correct the mislabeling, i.e. change hg19 to b37, and remove the gratuitous column 9. Then divide the contents into header, columns 1-7 and column 8.

shlee$ cat ../../hg19_population_stratified_af_hapmap_3.3_CGAoriginal.vcf | cut -f 1-8 > b37_pop_stratified_af_hapmap_3.3_rmvcolumn9.vcf
shlee$ head b37_pop_stratified_af_hapmap_3.3_rmvcolumn9.vcf | less
shlee$ cat b37_pop_stratified_af_hapmap_3.3_rmvcolumn9.vcf | grep '#' > header.txt
shlee$ cat b37_pop_stratified_af_hapmap_3.3_rmvcolumn9.vcf | grep -v '#' > body.txt
shlee$ cat body.txt | cut -f 1-7 > body1-7.txt
shlee$ cat body.txt | cut -f 8 > body8.txt

2. FIND AND REPLACE SPACES in body8.txt WITH TEXTEDIT AND SAVE TO SAME FILE NAME.

3. Merge back the columns.

shlee$ paste -d'\t' body1-7.txt body8.txt > body1-8_tight.txt

4. INSERT INFO FIELDS IN header.txt WITH TEXTEDIT AND SAVE TO SAME FILE NAME. The INFO fields are from GATK forum user @escaon at . Without these INFO fields, Picard LiftoverVcf requires the ALLOW_MISSING_FIELDS_IN_HEADER=true option.

5. Attach modified header to modified body.
shlee$ cat header.txt body1-8_tight.txt > hapmap_3.3_b37_pop_stratified_af.vcf

6. Convert b37 nomenclature to hg19 nomenclature, sort, and delete the index because of a wonky bug. The approach is from GATK forum member @escaon and is recapitulated at .

shlee$ awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' hapmap_3.3_b37_pop_stratified_af.vcf > chr.vcf
shlee$ sed 's/chrMT/chrM/g' chr.vcf > chr_M.vcf
shlee$ java -jar $PICARD SortVcf I=chr_M.vcf O=chr_M_hg19_sorted.vcf SEQUENCE_DICTIONARY=~/Documents/ref/hg19/ucsc.hg19.dict
shlee$ rm chr_M_hg19_sorted.vcf.idx

7. Rename, block compress and tabix index the VCFs using RTG-Tools from Real Time Genomics.

shlee$ rtg bgzip hapmap_3.3_b37_pop_stratified_af.vcf
shlee$ rtg index hapmap_3.3_b37_pop_stratified_af.vcf.gz -f vcf
shlee$ mv chr_M_hg19_sorted.vcf hapmap_3.3_hg19_pop_stratified_af.vcf
shlee$ rtg bgzip hapmap_3.3_hg19_pop_stratified_af.vcf
shlee$ rtg index hapmap_3.3_hg19_pop_stratified_af.vcf.gz -f vcf

8. Liftover to GRCh38 using the chain file from UCSC golden path at . The analysis set GRCh38 reference is from the GATK bundle at .

shlee$ java -Xmx16G -jar $PICARD LiftoverVcf \
    I=hapmap_3.3_hg19_pop_stratified_af.vcf.gz \
    O=hapmap_3.3_grch38_pop_stratified_af.vcf.gz \
    CHAIN=hg19ToHg38.over.chain \
    REJECT=rejected_vars_liftoverhg19tohg38_hapmap_3.3.vcf.gz \
    R=hg38/Homo_sapiens_assembly38.fasta ALLOW_MISSING_FIELDS_IN_HEADER=true
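
Note on step 2: the space cleanup in column 8 was done by hand in TextEdit. For anyone repeating the procedure, the following is a minimal sketch of a scripted alternative, not the method actually used. It assumes the spaces are simply deleted (the notes above do not say what they were replaced with). The function name strip_info_spaces, the demo file names, and the one-line demo record are all made up for illustration; they are not taken from the real hapmap file.

```shell
# Hypothetical helper (not part of the original workflow): delete every space
# in column 8 (INFO) of a tab-delimited VCF body that has no header lines.
strip_info_spaces() {
  awk 'BEGIN { FS = OFS = "\t" } { gsub(/ /, "", $8); print }' "$1"
}

# Made-up one-record body file, used only to demonstrate the helper.
printf '1\t100\trs1\tA\tG\t.\tPASS\tAF=0.9, 0.1\n' > demo_body.txt

# Write the cleaned body to a new file and show it.
strip_info_spaces demo_body.txt > demo_body_tight.txt
cat demo_body_tight.txt
```

Because awk edits column 8 in place, this would stand in for the split into body1-7.txt and body8.txt, the manual edit, and the paste in steps 1 to 3: running it on body.txt would give the equivalent of body1-8_tight.txt, under the assumption stated above.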