This readme describes the sequence data on the ftp site, how it is processed
and what information is available about the sequence data.

Directory structure and sequence index file:
	
All sequence data is in fastq format. This gives a sequence and a quality 
string for each read.

The sequences are found under /data/XXXXXX/sequence_read, where XXXXXX represents the
sample name. This should be in the form of Coriell sample names, HGXXXXX or NAXXXXX.

The meta data associated with a particular file, including its md5sum, can be found
in the sequence.index file. This is a tab delimited file where each column contains
a different piece of meta information.
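
As an illustration, a downloaded file can be checked against the md5sum recorded
in the index. This is a minimal sketch using Python's hashlib; the function names
md5_of_file and matches_index are hypothetical, and the demonstration uses a
throwaway temporary file rather than a real fastq file.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_index(path, expected_md5):
    """Compare a file's md5 with the value taken from the MD5 column."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Demonstration on a small made-up file, not on data from the ftp site.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"@read1\nACGT\n+\nIIII\n")
demo_path = tmp.name
demo_md5 = md5_of_file(demo_path)
demo_ok = matches_index(demo_path, demo_md5.upper())
os.remove(demo_path)
print(demo_ok)
```

Reading in chunks keeps memory use flat, which matters for large fastq files.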

We also present an analysis.sequence.index file which contains all the same information as the
sequence.index file but only refers to ILLUMINA data with 70bp reads or longer. For phase 3 of the 
project this is what we will analyse.

The columns of the sequence index file are

	1.  FASTQ_FILE, path to fastq file on ftp site  
	2.  MD5, md5sum of file
	3.  RUN_ID, SRA/ERA run accession       
	4.  STUDY_ID, SRA/ERA study accession   
	5.  STUDY_NAME, Name of study   
	6.  CENTER_NAME, Submission centre name 
	7.  SUBMISSION_ID, SRA/ERA submission accession 
	8.  SUBMISSION_DATE, Date sequence submitted, YYYY-MM-DD       
	9.  SAMPLE_ID, SRA/ERA sample accession 
	10. SAMPLE_NAME, Sample name    
	11. POPULATION, Sample population, this is a 3 letter code and it is defined in README.populations     
	12. EXPERIMENT_ID, Experiment accession 
	13. INSTRUMENT_PLATFORM, Type of sequencing machine     
	14. INSTRUMENT_MODEL, Model of sequencing machine       
	15. LIBRARY_NAME, Library name  
	16. RUN_NAME, Name of machine run       
	17. RUN_BLOCK_NAME, Name of machine run sector (this is no longer recorded, so this column is entirely null;
	    it was left in so as not to disrupt existing sequence index parsers)
	18. INSERT_SIZE, Submitter specified insert size 
	19. LIBRARY_LAYOUT, Library layout, this can be either PAIRED or SINGLE 
	20. PAIRED_FASTQ, Name of mate pair file if exists (Runs with failed mates will have 
	    a library layout of PAIRED but no paired fastq file)
	21. WITHDRAWN, 0/1 to indicate if the file has been withdrawn, only present if a file has been withdrawn
	22. WITHDRAWN_DATE, this is generally the date the file is generated on
	23. COMMENT, comment about reason for withdrawal
	24. READ_COUNT, read count for the file
	25. BASE_COUNT, basepair count for the file
	26. ANALYSIS_GROUP, the analysis group of the sequence, this reflects sequencing
	    strategy. Currently this includes low coverage, high coverage, exon targetted and exome
	    to reflect the 2 non low coverage pilot sequencing strategies and the 2 main project sequencing strategies used by the 
	    1000 genomes project. 
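
The column list above can be turned into a small reader. This is a sketch only,
not the DCC's own parser: the name read_sequence_index is hypothetical, it
assumes that any header or comment line starts with '#' or with FASTQ_FILE, and
the example row is made up.

```python
import csv
import io

# Column names in the order listed above (columns 1 to 26).
COLUMNS = [
    "FASTQ_FILE", "MD5", "RUN_ID", "STUDY_ID", "STUDY_NAME", "CENTER_NAME",
    "SUBMISSION_ID", "SUBMISSION_DATE", "SAMPLE_ID", "SAMPLE_NAME",
    "POPULATION", "EXPERIMENT_ID", "INSTRUMENT_PLATFORM", "INSTRUMENT_MODEL",
    "LIBRARY_NAME", "RUN_NAME", "RUN_BLOCK_NAME", "INSERT_SIZE",
    "LIBRARY_LAYOUT", "PAIRED_FASTQ", "WITHDRAWN", "WITHDRAWN_DATE",
    "COMMENT", "READ_COUNT", "BASE_COUNT", "ANALYSIS_GROUP",
]


def read_sequence_index(handle):
    """Yield one dict per data row of a tab delimited sequence index."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Assumption: header/comment lines start with '#' or 'FASTQ_FILE'.
        if not row or row[0].startswith("#") or row[0] == "FASTQ_FILE":
            continue
        # Pad short rows so that every column name has a value.
        row = row + [""] * (len(COLUMNS) - len(row))
        yield dict(zip(COLUMNS, row))


# Made-up example row, for illustration only.
values = {name: "" for name in COLUMNS}
values.update({
    "FASTQ_FILE": "data/NA00000/sequence_read/XRR000000_1.filt.fastq.gz",
    "RUN_ID": "XRR000000",
    "SAMPLE_NAME": "NA00000",
    "LIBRARY_LAYOUT": "PAIRED",
    "BASE_COUNT": "1000",
})
text = "\t".join(COLUMNS) + "\n" + "\t".join(values[c] for c in COLUMNS) + "\n"
records = list(read_sequence_index(io.StringIO(text)))
print(records[0]["RUN_ID"], records[0]["BASE_COUNT"])
```

When reading a real file, open it with newline="" as the csv module recommends.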

Any run_id can have up to 3 files associated with it. Single runs have one file. 
Paired runs can have anywhere from 1 to 3 files depending on the success of the 
pairing. The paired run files are associated as *_1.fastq and *_2.fastq and any reads where one
of the two reads failed the QC will be found in the unnumbered fragment file.
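
The file naming described above can be used to group files by run. The sketch
below is illustrative: group_by_run is a hypothetical name, the file name
pattern (run accession, optional _1 or _2, optional .filt, then .fastq and
optional .gz) is an assumption, and the paths are made up.

```python
import re
from collections import defaultdict

# Assumption: names look like <run_id>[_1|_2][.filt].fastq[.gz].
FASTQ_NAME = re.compile(
    r"^(?P<run>[A-Za-z]+\d+)(?:_(?P<mate>[12]))?(?:\.filt)?\.fastq(?:\.gz)?$"
)

ROLES = {"1": "mate1", "2": "mate2", None: "unpaired"}


def group_by_run(paths):
    """Map each run id to its files, keyed by mate1, mate2 or unpaired."""
    runs = defaultdict(dict)
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        match = FASTQ_NAME.match(name)
        if match is None:
            continue  # not a fastq name this sketch recognises
        # "unpaired" is the only file of a single run, or the fragment
        # file of a paired run.
        runs[match.group("run")][ROLES[match.group("mate")]] = path
    return dict(runs)


demo = group_by_run([
    "data/NA00000/sequence_read/XRR000001_1.filt.fastq.gz",
    "data/NA00000/sequence_read/XRR000001_2.filt.fastq.gz",
    "data/NA00000/sequence_read/XRR000001.filt.fastq.gz",
    "data/NA00000/sequence_read/XRR000002.filt.fastq.gz",
])
for run_id, files in sorted(demo.items()):
    print(run_id, sorted(files))
```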

There is a record of all sequence index files created in 

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/

Each file is dated for its release using the form YYYYMMDD.sequence.index 
so 20091216.sequence.index was released on the 16th December 2009. The most
recent file should also match ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence.index
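
The release date can be recovered from the file name. A minimal sketch,
assuming names that start with YYYYMMDD.sequence.index as described above; the
function name release_date is hypothetical.

```python
import re
from datetime import datetime

INDEX_NAME = re.compile(r"^(\d{8})\.sequence\.index")


def release_date(filename):
    """Return the release date encoded at the start of an index file name."""
    match = INDEX_NAME.match(filename)
    if match is None:
        raise ValueError("not a dated sequence index name: %r" % filename)
    return datetime.strptime(match.group(1), "%Y%m%d").date()


names = ["20091216.sequence.index", "20120522.sequence.index"]
latest = max(names, key=release_date)
print(latest, release_date(latest).isoformat())
```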

There is also an associated statistics file, e.g.
	
 20120522.sequence.index.low_coverage.stats
 20120522.sequence.index.exome.stats

This contains summary statistics for each index using the base count data in column 25.

Please be aware that in the exome.stats files, the "COVERAGE" column refers to whole genome coverage, not coverage over targeted
regions. The exome coverage over targeted regions is better captured in *.HsMetrics.gz files under the alignment_indices directory. 

The summaries it presents are

Withdrawn summary, how many files have been withdrawn for each given reason
Study id summary, how many basepairs and xcoverage of data per Study
Population summary, how many basepairs and xcoverage of data by Population
Center summary, how many basepairs and xcoverage of data by Submission Center, this is further broken down by study id
Sample summary, how many basepairs and xcoverage of data by Sample name
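
A summary of this kind can be approximated from columns of the index. The
sketch below is not the DCC's stats code: summarise_bases is a hypothetical
name, the genome size of 3,000,000,000 bases used to convert basepairs to
xcoverage is a round-number assumption, skipping withdrawn files is also an
assumption, and the records and population codes are made up.

```python
from collections import defaultdict

# Assumption: a round genome size; the value used by the DCC may differ.
ASSUMED_GENOME_SIZE = 3_000_000_000


def summarise_bases(records, key):
    """Sum BASE_COUNT per value of `key`, returning (bases, xcoverage)."""
    totals = defaultdict(int)
    for record in records:
        # Assumption: withdrawn files are left out of the totals.
        if record.get("WITHDRAWN") == "1":
            continue
        totals[record[key]] += int(record["BASE_COUNT"])
    return {
        name: (bases, bases / ASSUMED_GENOME_SIZE)
        for name, bases in totals.items()
    }


# Made-up records with only the columns this sketch needs.
records = [
    {"POPULATION": "AAA", "BASE_COUNT": "3000000000", "WITHDRAWN": "0"},
    {"POPULATION": "AAA", "BASE_COUNT": "6000000000", "WITHDRAWN": "0"},
    {"POPULATION": "BBB", "BASE_COUNT": "1500000000", "WITHDRAWN": "0"},
    {"POPULATION": "BBB", "BASE_COUNT": "9000000000", "WITHDRAWN": "1"},
]
summary = summarise_bases(records, "POPULATION")
for name in sorted(summary):
    print(name, summary[name])
```

The same function can be keyed on STUDY_ID, CENTER_NAME or SAMPLE_NAME.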

csv files containing statistics are also present in the form

20091216_201001025.stats.csv

This contains total numbers and differences between the two dated index files.

     Numbers of accessions
     Numbers of samples
     Numbers of samples with more than 4x coverage
     Number of gigabases per population
     Number of gigabases per platform
     Number of gigabases per submission center

These files will be generated for each new release of a sequence index 


Process of creating the sequence data collection in DCC

The DCC gets its sequence files from the SRA. Before publishing the files they undergo some filtering to ensure good quality data is used.

These are the checks the DCC makes on the archive fastq files.

        Syntax Checks:

        -Each header line begins with @
        -The third line always starts with a +
        -There are four lines in each entry (implied by the above two rules)
        -On line3, if a name follows the + sign, the name has to match the one found in line1
        -The sequence and quality lines are the same length
        -For paired end files, the _1 and _2 files have the same number of reads in them. 
        -For SOLID colourspace fastq, each read starts with a base followed by a string of numbers
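
        The per-entry syntax rules above can be sketched as follows. This is an
        illustration, not the DCC's checker: check_fastq_syntax is a
        hypothetical name, base-space reads are assumed, and the paired read
        count and SOLID colourspace rules are left out.

```python
def check_fastq_syntax(lines):
    """Return a list of problems found in fastq lines (newlines removed)."""
    problems = []
    if len(lines) % 4 != 0:
        problems.append("line count is not a multiple of four")
    complete = len(lines) - len(lines) % 4
    for start in range(0, complete, 4):
        header, sequence, plus, quality = lines[start:start + 4]
        entry = start // 4 + 1
        if not header.startswith("@"):
            problems.append("entry %d: header does not begin with @" % entry)
        if not plus.startswith("+"):
            problems.append("entry %d: third line does not start with +" % entry)
        elif len(plus) > 1 and plus[1:] != header[1:]:
            problems.append("entry %d: name after + differs from header" % entry)
        if len(sequence) != len(quality):
            problems.append("entry %d: sequence and quality lengths differ" % entry)
    return problems


good_problems = check_fastq_syntax(["@r1", "ACGT", "+", "IIII"])
bad_problems = check_fastq_syntax(["@r1", "ACGT", "+r2", "III"])
print(good_problems)
print(bad_problems)
```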

        Sequence Checks:

        -Read is longer than 35bp for Solexa, 25bp for Solid, and 30 bp for 454
        -Read does not contain any N's in the first 25, 30 or 35bp
        -Quality values are all 2 or higher in the first 25bp, 30bp or 35bp
        -The reads contain more than one type of base in the first 25, 30, or 35bp
        -Read does not contain more than 50% Ns in its whole length
        -Read does not contain characters other than ATGCN (this rule does not apply to SOLID reads)
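
        The sequence checks above can be sketched for a single base-space read.
        This is an illustration under stated assumptions, not the DCC's filter:
        passes_sequence_checks is a hypothetical name, the window (25, 30 or
        35) is assumed to equal the platform's minimum length, "longer than" is
        read as a strict comparison, quality strings are assumed to be Phred+33
        encoded, and SOLID colourspace reads are not handled.

```python
def passes_sequence_checks(sequence, quality, window=35, phred_offset=33):
    """Apply the sequence checks to one read; True means the read is kept."""
    head = sequence[:window]
    # Assumption: quality characters decode as ord(char) - phred_offset.
    head_quals = [ord(char) - phred_offset for char in quality[:window]]
    if len(sequence) <= window:            # read must be longer than window
        return False
    if "N" in head:                        # no Ns in the first `window` bases
        return False
    if any(q < 2 for q in head_quals):     # qualities all 2 or higher
        return False
    if len(set(head)) < 2:                 # more than one type of base
        return False
    if sequence.count("N") * 2 > len(sequence):   # more than 50% Ns overall
        return False
    if set(sequence) - set("ACGTN"):       # only A, C, G, T and N allowed
        return False
    return True


results = {
    "good": passes_sequence_checks("ACGT" * 10, "I" * 40),
    "too_short": passes_sequence_checks("ACGT" * 8, "I" * 32),
    "early_n": passes_sequence_checks("N" + "ACGT" * 10, "I" * 41),
    "low_quality": passes_sequence_checks("ACGT" * 10, "!" + "I" * 39),
    "one_base": passes_sequence_checks("A" * 40, "I" * 40),
}
print(results)
```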

The output files get the extension .filt.fastq.gz to indicate they have been filtered.

The DCC marks some files as withdrawn. These have a flag in column 21 of the sequence.index file. There are several reasons for this, and 
they are given in column 23 of the sequence.index file. This is what those reasons mean.

	FAILED GENOTYPE QC, The fastq file was shown not to be of the specified sample.

	FAILED ONE OF THE DCC PROCESSES, There were meta data consistency problems with the file or no reads in the file passed our qc checks.

	NOT YET AVAILABLE FROM ARCHIVE, The fastq files aren't yet available from the SRA

	SUPPRESSED IN ARCHIVE, The run has been suppressed by the submitter in the archive

	EXCESSIVE SMALL INDELS, The run has an overly large number of small indels suggesting there was a bubble on the flow cell

	FAILED VERIFYBAM QC, The run failed its post alignment quality check for contamination or sample swapping

	RELATED SAMPLE, The sample has been withdrawn due to being related to another sample in the project
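
Withdrawn files can be separated from the rest using columns 21 and 23. A
minimal sketch, assuming records are dicts keyed by the column names listed
earlier and that a withdrawn file has "1" in the WITHDRAWN column; the function
names and the example records are made up.

```python
from collections import Counter


def withdrawn_reasons(records):
    """Count withdrawn files per reason given in the COMMENT column."""
    return Counter(
        record["COMMENT"]
        for record in records
        if record.get("WITHDRAWN") == "1"
    )


def not_withdrawn(records):
    """Return only the records that have not been withdrawn."""
    return [record for record in records if record.get("WITHDRAWN") != "1"]


# Made-up records with only the columns this sketch needs.
records = [
    {"RUN_ID": "XRR000001", "WITHDRAWN": "0", "COMMENT": ""},
    {"RUN_ID": "XRR000002", "WITHDRAWN": "1", "COMMENT": "SUPPRESSED IN ARCHIVE"},
    {"RUN_ID": "XRR000003", "WITHDRAWN": "1", "COMMENT": "FAILED GENOTYPE QC"},
    {"RUN_ID": "XRR000004", "WITHDRAWN": "1", "COMMENT": "SUPPRESSED IN ARCHIVE"},
]
reasons = withdrawn_reasons(records)
kept = not_withdrawn(records)
print(reasons.most_common())
print([record["RUN_ID"] for record in kept])
```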

The analysis.sequence.index has two additional failure statuses

        TOO LOW RAW COVERAGE, This means the sample has less than 8250000000 bases of raw sequence for a low coverage run, or less than 1000000000 bases
        of raw sequence for an exome run. Such samples cannot possibly meet our completion criteria of at least 3x non duplicated aligned coverage for 
        low coverage and at least 70% of exome targets covered to 20x or more for exome, so they are left out of our alignment processing.

        TOO LOW ALIGNED COVERAGE, This means the sample has failed our aligned coverage criteria
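
The raw coverage thresholds above can be applied to per-sample base totals.
This is a sketch, not the DCC's implementation: the function names are
hypothetical, the ANALYSIS_GROUP spellings "low coverage" and "exome" are
assumed to match the index, withdrawn files are assumed to be excluded from the
totals, and the records are made up.

```python
from collections import defaultdict

# Minimum raw bases per analysis group, taken from the text above.
MIN_RAW_BASES = {
    "low coverage": 8_250_000_000,
    "exome": 1_000_000_000,
}


def raw_bases_per_sample(records):
    """Sum BASE_COUNT per (SAMPLE_NAME, ANALYSIS_GROUP) pair."""
    totals = defaultdict(int)
    for record in records:
        # Assumption: withdrawn files do not count towards raw coverage.
        if record.get("WITHDRAWN") == "1":
            continue
        key = (record["SAMPLE_NAME"], record["ANALYSIS_GROUP"])
        totals[key] += int(record["BASE_COUNT"])
    return dict(totals)


def too_low_raw_coverage(records):
    """Return the (sample, group) pairs below the threshold for their group."""
    return sorted(
        key
        for key, bases in raw_bases_per_sample(records).items()
        if key[1] in MIN_RAW_BASES and bases < MIN_RAW_BASES[key[1]]
    )


# Made-up records with only the columns this sketch needs.
records = [
    {"SAMPLE_NAME": "NA00001", "ANALYSIS_GROUP": "low coverage", "BASE_COUNT": "5000000000"},
    {"SAMPLE_NAME": "NA00001", "ANALYSIS_GROUP": "low coverage", "BASE_COUNT": "4000000000"},
    {"SAMPLE_NAME": "NA00002", "ANALYSIS_GROUP": "low coverage", "BASE_COUNT": "8000000000"},
    {"SAMPLE_NAME": "NA00003", "ANALYSIS_GROUP": "exome", "BASE_COUNT": "2000000000"},
]
flagged = too_low_raw_coverage(records)
print(flagged)
```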