Running the pipeline creates GVF and VCF dumps for all species in Ensembl that have a variation database. The pipeline is run as part of the release cycle.

The pipeline works in three broad steps:
- read data from the registry file
- call the dump and parse scripts
- validate, join and tidy up the output files

There are two versions, ReleaseDumps_conf.pm and PopulationDumps_conf.pm:

1) Bio::EnsEMBL::Variation::Pipeline::ReleaseDataDumps::ReleaseDumps_conf
  - Dump the following data for each species if available:
    - generic
    - incl consequences
    - somatic
    - somatic incl consequences
    - structural variation
    - sets: phenotype, clinical significance
  - To run the pipeline the following parameters need to be initialised:
    -run_all, division or species, antispecies  used to select which species the pipeline should process
    -pipeline_dir       directory that stores all the data
    -pipeline_name      used to create name for hive database
    -hive_db_host       connection details for hive database
    -hive_db_password   connection details for hive database
    -hive_db_port       connection details for hive database
    -hive_db_user       connection details for hive database
    -ensembl_registry   registry file with all species for which to create GVF and VCF dumps
    -ensembl_release    ensembl release
    -gvf_validator      location of the gvf_validator tool, e.g. GAL/bin/gvf_validator from GitHub
    -vcf-validator      location of the vcf-validator tool
    -vcf-sort           location of the vcf-sort tool
    -so_file            location of the SO file, e.g. SO-Ontologies/so.obo from GitHub
    -tmp_dir            tmp directory for storing validation results, err and out files  


2) Bio::EnsEMBL::Variation::Pipeline::ReleaseDataDumps::PopulationDumps_conf
  - dumps HapMap and ESP population allele frequencies from the database
  - extracts frequencies from 1000 Genomes VCF files and maps them by identifier to the dump files



Dataflow for pipeline 1):

SpeciesFactory
  - Processes the value of the run_all, division, antispecies or species parameters
  - For run_all, the SpeciesFactory loads all the species defined in the registry file and flows back the species that have a variation database
  - For division, the SpeciesFactory looks up the division in the ensembl_production database and flows back all the variation databases associated with the species in this division
  - For species, the SpeciesFactory looks for the species in the registry and flows it back if it has a variation database
  - For antispecies, the SpeciesFactory processes a division or all the species from the registry, removes the species listed in antispecies and flows back the remaining variation databases
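
The selection rules above can be sketched in a few lines. This is only an illustration in Python, not the SpeciesFactory module itself; the function name and its arguments are hypothetical, and the division lookup (done against the ensembl_production database in the pipeline) is assumed to have happened already.

```python
def select_species(registry_species, has_variation_db, run_all=False,
                   species=(), division_species=(), antispecies=()):
    """Return the species to flow on, following the rules listed above.

    registry_species: all species names found in the registry file
    has_variation_db: set of species that have a variation database
    division_species: species of the requested division (assumed to be
                      looked up elsewhere)
    """
    if run_all or (antispecies and not species and not division_species):
        # run_all, or antispecies on its own: start from the whole registry
        candidates = list(registry_species)
    elif division_species:
        candidates = list(division_species)
    else:
        # explicit species must be present in the registry
        candidates = [s for s in species if s in registry_species]
    excluded = set(antispecies)
    # only species with a variation database are flowed on
    return [s for s in candidates
            if s in has_variation_db and s not in excluded]
```

For example, with a registry of three species of which two have a variation database, run_all returns those two, and adding one of them to antispecies returns only the other.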

PreRunChecks
  - file checks: 
    - checks that registry file exists
    - checks that directories exist: pipeline_dir, script_dir, tmp_dir. If not, it will create them.
    - checks that GVF validator and SO file exist
    - creates a species directory for each species from the input registry file
    - checks that species directories for dumping files are empty before starting the pipeline
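
As a rough sketch of these checks (Python for illustration only; the function name, argument names and error types are assumptions, not taken from the PreRunChecks module):

```python
import os


def pre_run_checks(registry_file, work_dirs, required_files, species_dirs):
    """Fail early if inputs are missing or species directories are not empty."""
    if not os.path.isfile(registry_file):
        raise FileNotFoundError(f"registry file not found: {registry_file}")
    # pipeline_dir, script_dir, tmp_dir: create them if they do not exist
    for directory in work_dirs:
        os.makedirs(directory, exist_ok=True)
    # e.g. the GVF validator and the SO file
    for path in required_files:
        if not os.path.isfile(path):
            raise FileNotFoundError(f"required file not found: {path}")
    # one dump directory per species; it must be empty before the run starts
    for directory in species_dirs:
        os.makedirs(directory, exist_ok=True)
        if os.listdir(directory):
            raise RuntimeError(f"species directory is not empty: {directory}")
```

Raising on a non-empty species directory, rather than clearing it, is a choice made for this sketch so that files left over from an earlier run are never deleted silently.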

Config
  - collects all the data types that can be dumped for a species
  - looks up in the database whether the species has a certain data type: sift, ancestral allele, global maf, clinical significance, clinical_significance_svs, structural variants
  - the attribs for human are hard coded, because we dump more data for human: sets, somatic data
  - the data types are stored in the config parameter, which is flowed downstream to all the modules

InitSubmitJob (dump gvf)
  - Prepares all commands for dumping GVF files
  - Depending on the number of variants the species has, the dumps are divided into smaller portions
  - global_vf_count_in_species => 5_000_000, # if the number of vf in a species exceeds this, the dumps are split up. The value can be adjusted in the config file
  - vf_per_slice => 2_000_000, # if the number of vf on a slice exceeds this, the slice is split and each split slice is dumped separately
  - max_split_slice_length => 500_000, # maximum length of a split slice (can be modified)
  - if the global vf count is exceeded we can dump per slice; if the number of vf on a slice exceeds vf_per_slice we can split the slice into smaller pieces and dump each split slice

  - it might be that the global count is exceeded and we dump all slices separately. However, if the species has a large number of contigs with only a small vf count per contig, we can group slices and dump each group of slices. The number of vf grouped together is defined by max_vf_load
  - max_vf_load => 2_000_000, # group slices together until the vf count exceeds max_vf_load

  - when we create split slices, the file name is extended by "$seq_region_id\_$start\_$end"
  - if we group multiple seq_regions, we define the range in the file name, e.g. Rattus_norvegicus_incl_consequences-77391_77397.gvf

  - we also define all the options required by the dump and parse scripts: data dump type, additional data to dump (sift, evidence, ...), input and output file, registry
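
The splitting and grouping described above could look roughly like the following. This is a Python sketch, not the InitSubmitJob code: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and the group suffix is assumed to be built from the first and last seq_region_id of the group.

```python
def split_slice(seq_region_id, length, max_split_slice_length=500_000):
    """Cut one slice into pieces of at most max_split_slice_length bases.

    Returns (start, end, file_name_suffix) tuples; the suffix follows the
    "$seq_region_id\\_$start\\_$end" pattern.
    """
    pieces = []
    start = 1
    while start <= length:
        end = min(start + max_split_slice_length - 1, length)
        pieces.append((start, end, f"{seq_region_id}_{start}_{end}"))
        start = end + 1
    return pieces


def group_slices(vf_counts, max_vf_load=2_000_000):
    """Group (seq_region_id, vf_count) pairs until the vf count of the
    current group exceeds max_vf_load, then start a new group."""
    groups, current, load = [], [], 0
    for seq_region_id, vf_count in vf_counts:
        current.append(seq_region_id)
        load += vf_count
        if load > max_vf_load:
            groups.append(current)
            current, load = [], 0
    if current:
        groups.append(current)  # remaining slices form the last group
    return groups


def group_suffix(group):
    """Assumed naming: first and last seq_region_id of the group."""
    return f"{group[0]}_{group[-1]}"
```

With the default values, a slice of 1,200,000 bases is cut into three pieces, and many small contigs end up in a handful of groups rather than one dump job per contig.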

SubmitJob:
  - creates a system command calling the dump_gvf script with all the options generated in the InitSubmitJob module

JoinDump (join slice split): 
  - join all split slices, creating a file for each slice

Validate (gvf)
  - for each GVF file, create a smaller file from the first 2500 lines and validate it with the GVF validator
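
Creating the smaller validation file amounts to copying the head of the dump file. A minimal Python sketch (the function name is hypothetical, and the call to the validator itself is left out):

```python
from itertools import islice


def write_validation_sample(path, sample_path, max_lines=2500):
    """Copy the first max_lines lines of path to sample_path."""
    with open(path) as src, open(sample_path, "w") as dst:
        # islice stops after max_lines lines, so large dumps are never
        # read in full
        dst.writelines(islice(src, max_lines))
    return sample_path
```

Validating only the head of each file keeps the validation step short, at the price of not checking lines further down.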

SummariseValidation (gvf)
  - summarise results in pipeline_dir/SummaryValidateGVF.txt

FileUtils (post_gvf_dump_cleanup):
  - for each species:
    - gzip all GVF files
    - concat all validation reports into one file GVF_Validate_$species and move file to tmp directory
    - concat all err and out files into one file GVF_$species and move file to tmp directory
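
A sketch of this per-species cleanup, in Python for illustration. The function names are hypothetical, and the file suffixes .validate, .err and .out are assumptions used to tell validation reports and log files apart; only the target names GVF_Validate_$species and GVF_$species come from the description above.

```python
import glob
import gzip
import os
import shutil


def gzip_in_place(path):
    """Compress path to path.gz and remove the uncompressed file."""
    with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)
    return path + ".gz"


def concat_and_remove(paths, target):
    """Concatenate the files into target (sorted by name), then delete them."""
    with open(target, "w") as out:
        for path in sorted(paths):
            with open(path) as src:
                shutil.copyfileobj(src, out)
            os.remove(path)
    return target


def cleanup_species(species_dir, tmp_dir, species):
    for gvf in glob.glob(os.path.join(species_dir, "*.gvf")):
        gzip_in_place(gvf)
    # assumed suffix for validation reports
    reports = glob.glob(os.path.join(species_dir, "*.validate"))
    concat_and_remove(reports, os.path.join(tmp_dir, f"GVF_Validate_{species}"))
    # assumed suffixes for err and out files
    logs = (glob.glob(os.path.join(species_dir, "*.err"))
            + glob.glob(os.path.join(species_dir, "*.out")))
    concat_and_remove(logs, os.path.join(tmp_dir, f"GVF_{species}"))
```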
 

PreRunChecks (pre_run_checks_gvf2vcf):
  - checks that vcf-validator and vcf-sort are defined and exist

InitSubmitJob (parse gvf2vcf):
  - generates all the options for each gvf file that are required by the gvf2vcf script

SubmitJob (gvf2vcf):
  - creates a system command calling the gvf2vcf script with all the options generated in the InitSubmitJob module

Validate (vcf):
  - for each VCF file, create a smaller file from the first 2500 lines and validate it with the VCF validator
  - sort and bgzip the VCF files

InitJoinDump
  - prepare final join dumps

JoinDump
  - final join for gvf and vcf files

CleanUp
  - post_join_dumps

Finish:
  - create tabix index files for the VCF files
  - compute checksums
  - add README files
  - lower-case the directory names




Dataflow for pipeline 2):
To run the pipeline the following parameters need to be initialised:

Dump GVF and VCF files for all species in the registry file (no population allele frequency dumps):
init_pipeline.pl ReleaseDumps_conf.pm
-ensembl_release
-hive_db_password
-pipeline_dir
-pipeline_name
-registry_file
-tmp_dir
-run_all 1

Dump allele frequencies for populations (1000G, HapMap, ESP) for human:
init_pipeline.pl ReleaseDumps_conf.pm
-ensembl_release
-hive_db_password
-pipeline_dir
-pipeline_name
-registry_file
-tmp_dir
-prefetched_frequencies
-human_population_dumps 1
-run_all 1


Finish data dumps (add README files, compute checksums, create tabix index files for VCF files) for all species in the registry file:
init_pipeline.pl ReleaseDumps_conf.pm
-ensembl_release
-hive_db_password
-pipeline_dir
-pipeline_name
-registry_file
-tmp_dir
-finish_dumps 1
-run_all 1
