#### README ####

IMPORTANT: Please note you can download subsets of data via the BioMart
data mining tool. See https://www.ensembl.org/info/data/biomart/ for more
information.

The genome assembly represented here corresponds to INSDC Assembly ID
GCA_000002285.4

#######################
Fasta DNA dumps
#######################

-----------
FILE NAMES
-----------
The files are consistently named following this pattern:
   <species>.<assembly>.<sequence type>.<id type>.<id>.fa.gz

<species>:  The systematic name of the species.
<assembly>: The assembly build name.
<sequence type>:
  * 'dna'    - unmasked genomic DNA sequences.
  * 'dna_rm' - masked genomic DNA. Interspersed repeats and low
    complexity regions are detected with the RepeatMasker tool and
    masked by replacing repeats with 'N's.
  * 'dna_sm' - soft-masked genomic DNA. All repeats and low complexity
    regions have been replaced with lowercased versions of their
    nucleic bases.
<id type>: One of the following:
  * 'chromosome'     - The top-level coordinate system in most species
    in Ensembl.
  * 'nonchromosomal' - Contains DNA that has not been assigned to a
    chromosome.
  * 'seqlevel'       - This is usually sequence scaffolds, chunks or
    clones.
     -- 'scaffold' - Larger sequence contigs from the assembly of
        shorter sequencing reads (often from whole genome shotgun, WGS)
        which could not yet be assembled into chromosomes. Often more
        genome sequencing is needed to narrow gaps and establish a
        tiling path.
     -- 'chunk' - While contig sequences can be assembled into large
        entities, they sometimes have to be artificially broken down
        into smaller entities called 'chunks'. This is due to
        limitations in the annotation pipeline and the finite record
        size imposed by MySQL, which stores the sequence and annotation
        information.
     -- 'clone' - In general this is the smallest sequence entity. It
        is often identical to the sequence of one BAC clone, or to the
        sequence region of one BAC clone which forms the tiling path.
<id>: The actual sequence identifier. Depending on the <id type>, the
      <id> could represent the name of a chromosome, a scaffold, a
      contig, a clone ..
      Field is empty for seqlevel files.
fa: All files in these directories represent FASTA database files.
gz: All files are compacted with GNU Zip for storage efficiency.

EXAMPLES
  The genomic sequence of human chromosome 1:
    Homo_sapiens.GRCh37.dna.chromosome.1.fa.gz

  The masked versions of the genomic sequence of human chromosome 1
  (these contain '_rm' or '_sm' in the name):
    Homo_sapiens.GRCh37.dna_rm.chromosome.1.fa.gz
    Homo_sapiens.GRCh37.dna_sm.chromosome.1.fa.gz

  Non-chromosomal assembly sequences:
  e.g. mitochondrial genome, sequence contigs not yet mapped on chromosomes
    Homo_sapiens.GRCh37.dna.nonchromosomal.fa.gz
    Homo_sapiens.GRCh37.dna_rm.nonchromosomal.fa.gz
    Homo_sapiens.GRCh37.dna_sm.nonchromosomal.fa.gz

---------
TOPLEVEL
---------
These files contain all sequence regions flagged as toplevel in an Ensembl
schema. This includes chromosomes, regions not assembled into chromosomes
and N-padded haplotype/patch regions.

EXAMPLES
  Toplevel sequences unmasked:
    Homo_sapiens.GRCh37.dna.toplevel.fa.gz

  Toplevel soft/hard masked sequences:
    Homo_sapiens.GRCh37.dna_sm.toplevel.fa.gz
    Homo_sapiens.GRCh37.dna_rm.toplevel.fa.gz

-----------------
PRIMARY ASSEMBLY
-----------------
Primary assembly contains all toplevel sequence regions excluding
haplotypes and patches. This file is best used for performing sequence
similarity searches where patch and haplotype sequences would confuse
analysis. If the primary assembly file is not present, there are no
haplotype/patch regions, and the 'toplevel' file is equivalent.
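
The naming pattern and the primary assembly fallback described above can
be sketched in Python. This is only an illustration, not an Ensembl tool:
the names parse_dump_name and pick_search_file are hypothetical, and the
sketch assumes that the species field contains no dots and that the id
type is one of the values named in this README.

```python
import re

# Hypothetical helpers (not part of any Ensembl tool). The pattern follows
# the FILE NAMES section: species.assembly.sequence type.id type.id.fa.gz
# Assumption: the species field contains no dots. The assembly field is
# matched loosely and the match is anchored on the sequence type instead.
_DUMP_NAME = re.compile(
    r"(?P<species>[^.]+)\."
    r"(?P<assembly>.+)\."
    r"(?P<seq_type>dna_rm|dna_sm|dna)\."
    r"(?P<id_type>chromosome|nonchromosomal|seqlevel|toplevel|primary_assembly)"
    r"(?:\.(?P<id>.+))?"  # the id field may be absent
    r"\.fa\.gz"
)


def parse_dump_name(file_name):
    """Split a dump file name into its fields; 'id' is None when absent."""
    match = _DUMP_NAME.fullmatch(file_name)
    if match is None:
        raise ValueError(f"not a recognised FASTA dump name: {file_name}")
    return match.groupdict()


def pick_search_file(file_names, seq_type="dna"):
    """Prefer the primary assembly file; fall back to the toplevel file."""
    by_id_type = {}
    for name in file_names:
        fields = parse_dump_name(name)
        if fields["seq_type"] == seq_type and fields["id"] is None:
            by_id_type[fields["id_type"]] = name
    return by_id_type.get("primary_assembly") or by_id_type.get("toplevel")


if __name__ == "__main__":
    print(parse_dump_name("Homo_sapiens.GRCh37.dna_sm.chromosome.1.fa.gz"))
    print(pick_search_file(["Homo_sapiens.GRCh37.dna.toplevel.fa.gz"]))
```

pick_search_file returns None when neither file is in the listing, so a
caller can tell "nothing suitable" apart from a real file name.
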

EXAMPLES
  Primary assembly sequences unmasked:
    Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz

  Primary assembly soft/hard masked sequences:
    Homo_sapiens.GRCh37.dna_sm.primary_assembly.fa.gz
    Homo_sapiens.GRCh37.dna_rm.primary_assembly.fa.gz

--------------
SPECIAL CASES
--------------
Some chromosomes have alternate haplotypes which are presented in files
with the haplotype sequence only:
   Homo_sapiens.GRCh37.dna_rm.chromosome.HSCHR6_MHC_QBL.fa.gz
   Homo_sapiens.GRCh37.dna_rm.chromosome.HSCHR17_1.fa.gz

All alternative assembly and patch regions have their sequence padded
with N's to ensure alignment programs can report the correct index
regions.

e.g. A patch region with a start position of 1,000,001 will have 1e6 N's
added to its start so an alignment program will report coordinates with
respect to the whole chromosome.

Human has sequenced Y chromosomes, and the pseudoautosomal region (PAR)
on the Y is annotated. By definition the PAR is identical on the X and Y
chromosomes. The Y chromosome file contains the Y chromosome minus these
repeated PAR regions, i.e. the unique portion of Y.
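
The effect of the N padding on coordinates can be illustrated with a
short Python sketch. It is an illustration only: the function name
first_real_base is hypothetical, and the sketch assumes an unmasked
('dna') sequence whose real content does not itself begin with N.

```python
def first_real_base(padded_sequence):
    """Return the 1-based position of the first non-N base.

    Assumption: every leading N is padding, i.e. the region's own
    sequence does not start with N and the file is not hard-masked.
    """
    stripped = padded_sequence.lstrip("Nn")
    if not stripped:
        raise ValueError("sequence consists only of N's")
    return len(padded_sequence) - len(stripped) + 1


if __name__ == "__main__":
    # Toy patch region: 1e6 N's of padding, then four real bases.
    patch = "N" * 1_000_000 + "ACGT"
    print(first_real_base(patch))  # start of the region on the chromosome
    # A hit found in the padded sequence is already in chromosome
    # coordinates: 0-based offset + 1 gives the 1-based position.
    print(patch.find("CG") + 1)
```

Because the padding stays in the sequence, no offset has to be added to
positions reported against the padded file.
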