Reference Input File Format
PHASTER accepts either raw nucleotide sequence data in FASTA format or annotated genome data in GenBank format. If the input file is in FASTA format, locus tags will be assigned sequentially to each identified CDS. Note that this is one FASTA file for the entire genome. We currently do not support annotated cDNAs in FASTA format; please use the GenBank format for annotated genomes. If the input file is in GenBank format, it must provide at least the CDS features and the "/gene", "/locus_tag" and "/translation" tags inside the CDS features.
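As a rough illustration of telling the two accepted formats apart before upload, the sketch below inspects the first non-blank line of a file. The function name guess_input_format is hypothetical and is not part of PHASTER; the sketch only assumes that FASTA headers begin with ">" and that GenBank records begin with a "LOCUS" line. A headerless sequence file is reported as "unknown" here.

```python
def guess_input_format(text: str) -> str:
    """Return 'fasta', 'genbank', or 'unknown' based on the first non-blank line.

    Hypothetical helper for illustration; not PHASTER's own detection logic.
    """
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip leading blank lines
        if stripped.startswith(">"):
            return "fasta"
        if stripped.startswith("LOCUS"):
            return "genbank"
        return "unknown"  # first real line matches neither convention
    return "unknown"  # empty input
```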
Raw Nucleotide Sequence Data Format
Example #1: This is an example of a generic FASTA file. The entire file contains more than 5M nucleotides, and only the top portion is shown here. The header is not important for PHASTER processing, though it will be displayed as the title of the genome. If no header is provided, an arbitrary header will be created. The input header must be placed on the first line. The DNA sequence must conform to IUPAC coding. A metagenomic file must contain contigs, each with its own header. From this input, only contigs of length >=2000 will be processed. If the metagenomic option is not checked, the first 10 sequences will be processed individually.
>gi|16445223|ref|NC_002655.2| Escherichia coli O157:H7 str. EDL933 chromosome, complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTCTCTGACAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCACCACCATCACCATTACCATTACCACAGGTAACGGTGCGGGCTGACGCGTAC
AGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGGCTTTTTTTTCGACCAAAGGTAACGAGGTAACA
ACCATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGGGTTGCCGATA
TTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCA
CCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGT
ATTTTTGCCGAACTTCTGACGGGACTCGCCGCCGCCCAGCCGGGATTCCCGCTGGCGCAATTGAAAACTT
TCGTCGACCAGGAATTTGCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTAGGGCAGTGCCCGGA
TAGCATTAACGCTGCGCTGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAA
GCGCGCGGTCACAACGTTACCGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAAT
CTACTGTCGATATTGCAGAGTCCACCCGCCGTATTGCGGCAAGTCGTATTCCGGCTGATCACATGGTGCT
GATGGCAGGTTTCACCGCCGGTAATGAAAAAGGCGAACTGGTGGTACTTGGACGCAACGGTTCCGACTAC
TCCGCGGCGGTGCTGGCTGCCTGTTTACGCGCCGATTGTTGCGAGATTTGGACGGACGTTGACGGGGTAT
ATACCTGCGACCCGCGTCAGGTGCCCGATGCGAGGTTGTTGAAATCGATGTCCTACCAGGAAGCGATGGA
GCTTTCCTACTTCGGCGCTAAAGTTCTTCACCCCCGCACCATTACCCCCATCGCCCAGTTCCAGATCCCT
TGCCTGATTAAAAATACCGGAAATCCTCAAGCTCCAGGTACGCTCATTGGTGCCAGTCGTGATGAAGACG
AATTACCGGTCAAGGGCATTTCCAATCTGAATAATATGGCAATGTTCAGCGTTTCCGGCCCGGGGATGAA
AGGAATGGTCGGCATGGCGGCGCGCGTCTTTGCTGCAATGTCACGCGCCCGTATTTCCGTGGTGCTGATT
ACGCAATCATCTTCCGAATACAGTATCAGTTTCTGCGTTCCGCAAAGCGACTGTGTGCGAGCTGAACGGG
CAATGCAGGAAGAGTTCTACCTGGAACTGAAAGAAGGCTTACTGGAGCCGCTGGCGGTGACGGAACGGCT
GGCCATTATCTCGGTGGTAGGTGATGGTATGCGCACCTTGCGTGGGATCTCGGCGAAATTCTTTGCCGCG
CTGGCCCGCGCCAATATCAACATTGTCGCTATTGCTCAGGGATCTTCTGAACGCTCAATCTCTGTCGTGG
TAAATAACGATGATGCGACCACTGGCGTGCGCGTTACTCATCAGATGCTGTTCAATACCGATCAGGTTAT
Example #2: Another example in FASTA format.
>unknown genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTCTCTGACAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCACCACCATCACCATTACCATTACCACAGGTAACGGTGCGGGCTGACGCGTAC
AGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGGCTTTTTTTTCGACCAAAGGTAACGAGGTAACA
ACCATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGGGTTGCCGATA
TTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCA
CCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGT
ATTTTTGCCGAACTTCTGACGGGACTCGCCGCCGCCCAGCCGGGATTCCCGCTGGCGCAATTGAAAACTT
TCGTCGACCAGGAATTTGCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTAGGGCAGTGCCCGGA
TAGCATTAACGCTGCGCTGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAA
GCGCGCGGTCACAACGTTACCGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAAT
CTACTGTCGATATTGCAGAGTCCACCCGCCGTATTGCGGCAAGTCGTATTCCGGCTGATCACATGGTGCT
GATGGCAGGTTTCACCGCCGGTAATGAAAAAGGCGAACTGGTGGTACTTGGACGCAACGGTTCCGACTAC
TCCGCGGCGGTGCTGGCTGCCTGTTTACGCGCCGATTGTTGCGAGATTTGGACGGACGTTGACGGGGTAT
ATACCTGCGACCCGCGTCAGGTGCCCGATGCGAGGTTGTTGAAATCGATGTCCTACCAGGAAGCGATGGA
GCTTTCCTACTTCGGCGCTAAAGTTCTTCACCCCCGCACCATTACCCCCATCGCCCAGTTCCAGATCCCT
TGCCTGATTAAAAATACCGGAAATCCTCAAGCTCCAGGTACGCTCATTGGTGCCAGTCGTGATGAAGACG
AATTACCGGTCAAGGGCATTTCCAATCTGAATAATATGGCAATGTTCAGCGTTTCCGGCCCGGGGATGAA
AGGAATGGTCGGCATGGCGGCGCGCGTCTTTGCTGCAATGTCACGCGCCCGTATTTCCGTGGTGCTGATT
ACGCAATCATCTTCCGAATACAGTATCAGTTTCTGCGTTCCGCAAAGCGACTGTGTGCGAGCTGAACGGG
CAATGCAGGAAGAGTTCTACCTGGAACTGAAAGAAGGCTTACTGGAGCCGCTGGCGGTGACGGAACGGCT
GGCCATTATCTCGGTGGTAGGTGATGGTATGCGCACCTTGCGTGGGATCTCGGCGAAATTCTTTGCCGCG
CTGGCCCGCGCCAATATCAACATTGTCGCTATTGCTCAGGGATCTTCTGAACGCTCAATCTCTGTCGTGG
TAAATAACGATGATGCGACCACTGGCGTGCGCGTTACTCATCAGATGCTGTTCAATACCGATCAGGTTAT
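The FASTA rules described in this section (a header per sequence, IUPAC nucleotide symbols, and a minimum contig length of 2000 for metagenomic input) can be pre-checked locally before submission. The following is a minimal sketch, not PHASTER's own parser. The function names, the placeholder header for headerless input, and the decision to treat gap characters as invalid are all assumptions made for illustration.

```python
# IUPAC nucleotide symbols; gap characters are treated as invalid (assumption).
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")
MIN_CONTIG_LENGTH = 2000  # threshold stated in this section for metagenomic input


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            if header is None:
                header = "unnamed sequence"  # hypothetical placeholder header
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def invalid_symbols(sequence):
    """Return the set of characters that are not IUPAC nucleotide symbols."""
    return set(sequence) - IUPAC_NUCLEOTIDES


def select_contigs(records, min_length=MIN_CONTIG_LENGTH):
    """Keep only the records whose sequence is at least min_length long."""
    return [(h, s) for h, s in records if len(s) >= min_length]
```

Calling read_fasta on the file contents, then invalid_symbols on each sequence, shows which contigs contain unexpected characters; select_contigs shows which contigs would pass the stated length threshold.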
Annotated Genome Format
Example #1: This is an example of a generic GenBank file. The entire file contains more than 20k lines, and only the header and the first few CDSs are shown here. Only the "LOCUS" and "DEFINITION" tags in the header are used, and they are used only for naming. All annotated genes must be recorded in the "FEATURES" section and associated with the "gene" and "CDS" sub-tags. The "CDS" portion may or may not have "/translation". The "ORIGIN" tag and the DNA sequence section must be present in the file.
LOCUS NC_002655 5528445 bp DNA circular BCT 14-DEC-2010
DEFINITION Escherichia coli O157:H7 str. EDL933 chromosome, complete genome.
ACCESSION NC_002655
VERSION NC_002655.2 GI:16445223
DBLINK Project: 57831
KEYWORDS .
SOURCE Escherichia coli O157:H7 str. EDL933
ORGANISM Escherichia coli O157:H7 str. EDL933
Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
Enterobacteriaceae; Escherichia.
REFERENCE 1 (bases 1 to 5528445)
AUTHORS Perna,N.T., Plunkett,G. III, Burland,V., Mau,B., Glasner,J.D.,
Rose,D.J., Mayhew,G.F., Evans,P.S., Gregor,J., Kirkpatrick,H.A.,
Posfai,G., Hackett,J., Klink,S., Boutin,A., Shao,Y., Miller,L.,
Grotbeck,E.J., Davis,N.W., Lim,A., Dimalanta,E., Potamousis,K.,
Apodaca,J., Anantharaman,T.S., Lin,J., Yen,G., Schwartz,D.C.,
Welch,R.A. and Blattner,F.R.
TITLE Genome sequence of enterohaemorrhagic Escherichia coli O157:H7
JOURNAL Nature 409 (6819), 529-533 (2001)
PUBMED 11206551
REFERENCE 2 (bases 1 to 5528445)
CONSRTM NCBI Genome Project
TITLE Direct Submission
JOURNAL Submitted (28-SEP-2001) National Center for Biotechnology
Information, NIH, Bethesda, MD 20894, USA
REFERENCE 3 (bases 1 to 5528445)
AUTHORS Perna,N.T., Plunkett,G. III, Burland,V., Mau,B., Glasner,J.D.,
Rose,D.J., Mayhew,G.F., Evans,P.S., Gregor,J., Kirkpatrick,H.A.,
Posfai,G., Hackett,J., Klink,S., Boutin,A., Shao,Y., Miller,L.,
Grotbeck,E.J., Davis,N.W., Lim,A., Dimalanta,E., Potamousis,K.,
Apodaca,J., Anantharaman,T.S., Lin,J., Yen,G., Schwartz,D.C.,
Welch,R.A. and Blattner,F.R.
TITLE Direct Submission
JOURNAL Submitted (22-OCT-2000) Laboratory of Genetics, University of
Wisconsin, 445 Henry Mall, Madison, WI 53706, USA
COMMENT PROVISIONAL REFSEQ: This record has not yet been subject to final
NCBI review. The reference sequence was derived from AE005174.
On Oct 26, 2001 this sequence version replaced gi:15799680.
COMPLETENESS: full length.
FEATURES Location/Qualifiers
source 1..5528445
/organism="Escherichia coli O157:H7 str. EDL933"
/mol_type="genomic DNA"
/strain="EDL933"
/serotype="O157:H7"
/db_xref="taxon:155864"
/note="enterohemorrhagic"
gene 190..273
/gene="thrL"
/locus_tag="Z0001"
/db_xref="GeneID:962112"
CDS 190..273
/gene="thrL"
/locus_tag="Z0001"
/function="leader; Amino acid biosynthesis: Threonine"
/note="involved in threonine biosynthesis; controls the
expression of the thrLABC operon"
/codon_start=1
/transl_table=11
/product="thr operon leader peptide"
/protein_id="NP_285693.1"
/db_xref="GI:15799681"
/db_xref="GeneID:962112"
/translation="MKRISTTITTTITTTITITITTGNGAG"
gene 354..2816
/gene="thrA"
/locus_tag="Z0002"
/db_xref="GeneID:962110"
CDS 354..2816
/gene="thrA"
/locus_tag="Z0002"
/EC_number="2.7.2.4"
/EC_number="1.1.1.13"
/function="enzyme; Amino acid biosynthesis: Threonine"
/note="multifunctional homotetrameric enzyme that
catalyzes the phosphorylation of aspartate to form
aspartyl-4-phosphate as well as conversion of aspartate
semialdehyde to homoserine; functions in a number of amino
acid biosynthetic pathways"
/codon_start=1
/transl_table=11
/product="bifunctional aspartokinase I/homeserine
dehydrogenase I"
/protein_id="NP_285694.1"
/db_xref="GI:15799682"
/db_xref="GeneID:962110"
/translation="MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKIT
NHLVAMIEKTISGQDALPNISDAERIFAELLTGLAAAQPGFPLAQLKTFVDQEFAQIK
HVLHGISLLGQCPDSINAALICRGEKMSIAIMAGVLEARGHNVTVIDPVEKLLAVGHY
LESTVDIAESTRRIAASRIPADHMVLMAGFTAGNEKGELVVLGRNGSDYSAAVLAACL
RADCCEIWTDVDGVYTCDPRQVPDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQF
QIPCLIKNTGNPQAPGTLIGASRDEDELPVKGISNLNNMAMFSVSGPGMKGMVGMAAR
VFAAMSRARISVVLITQSSSEYSISFCVPQSDCVRAERAMQEEFYLELKEGLLEPLAV
TERLAIISVVGDGMRTLRGISAKFFAALARANINIVAIAQGSSERSISVVVNNDDATT
GVRVTHQMLFNTDQVIEVFVIGVGGVGGALLEQLKRQQSWLKNKHIDLRVCGVANSKA
LLTNVHGLNLENWQEELAQAKEPFNLGRLIRLVKEYHLLNPVIVDCTSSQAVADQYAD
FLREGFHVVTPNKKANTSSMDYYHLLRHAAEKSRRKFLYDTNVGAGLPVIENLQNLLN
AGDELMKFSGILSGSLSYIFGKLDEGMSFSEATTLAREMGYTEPDPRDDLSGMDVARK
LLILARETGRELELADIEIEPVLPAEFNAEGDVAAFMANLSQLDDLFAARVAKARDEG
KVLRYVGNIDEDGVCRVKIAEVDGNDPLFKVKNGENALAFYSHYYQPLPLVLRGYGAG
NDVTAAGVFADLLRTLSWKLGV"
gene 2818..3750
/gene="thrB"
/locus_tag="Z0003"
/db_xref="GeneID:962111"
CDS 2818..3750
/gene="thrB"
/locus_tag="Z0003"
/EC_number="2.7.1.39"
/function="enzyme; Amino acid biosynthesis: Threonine"
/note="catalyzes the formation of O-phospho-L-homoserine
from L-homoserine in threonine biosynthesis from asparate"
/codon_start=1
/transl_table=11
/product="homoserine kinase"
/protein_id="NP_285695.1"
/db_xref="GI:15799683"
/db_xref="GeneID:962111"
/translation="MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVESAETF
SLNNLGRFADKLPSEPRENIVYQCWERFCQELGKQIPVAMTLEKNMPIGSGLGSSACS
VVAALMAMNEHCGKPLNDTRLLALMGELEGRISGSIHYDNVAPCFLGGMQLMIEENDI
ISQQVPGFDEWLWVLAYPGIKVSTAEARAILPAQYRRQDCIAHGRHLAGFIHACYSRQ
PELAAKLMKDVIAEPYRERLLPGFRQARQAVAEIGAVASGISGSGPTLFALCDKPDTA
QRVADWLGKNYLQNQEGFVHICRLDTAGARVLEN"
gene 3751..5037
/gene="thrC"
/locus_tag="Z0004"
/db_xref="GeneID:956660"
CDS 3751..5037
/gene="thrC"
/locus_tag="Z0004"
/EC_number="4.2.3.1"
/function="enzyme; Amino acid biosynthesis: Threonine"
/note="catalyzes the formation of L-threonine from
O-phospho-L-homoserine"
/codon_start=1
/transl_table=11
/product="threonine synthase"
/protein_id="NP_285696.1"
/db_xref="GI:15799684"
/db_xref="GeneID:956660"
/translation="MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLTEID
EMLKLDFVTRSAKILSAFIGDEIPQEILEERVRAAFAFPAPVANVESDVGCLELFHGP
TLAFKDFGGRFMAQMLTHIAGDKPVTILTATSGDTGAAVAHAFYGLPNVKVVILYPRG
KISPLQEKLFCTLGGNIETVAIDGDFDACQALVKQAFDDEELKVALGLNSANSINISR
LLAQICYYFEAVAQLPQEARNQLVVSVPSGNFGDLTAGLLAKSLGLPVKRFIAATNVN
DTVPRFLHDGQWSPKATQATLSNAMDVSQPNNWPRVEELFRRKIWQLKELGYAAVDDE
TTQQTMRELKELGYTSEPHAAVAYRALRDQLNPGEYGLFLGTAHPAKFKESVEAILGE
TLDLPKELAERADLPLLSHNLPADFAALRKLMMNHQ"
gene 5251..5547
/locus_tag="Z0005"
/db_xref="GeneID:956661"
CDS 5251..5547
/locus_tag="Z0005"
/function="orf; Unknown function"
/codon_start=1
/transl_table=11
/product="hypothetical protein"
/protein_id="NP_285697.1"
/db_xref="GI:15799685"
/db_xref="GeneID:956661"
/translation="MKKMQSIVLALSLVLVAPMATQAAEITLVPSVKLQIGDRDNRGY
YWDGGHWRDHGWWKQHYEWRGNRWHPHGPPPPPRHHKKAHHDHHGGHGPGKHHR"
gene complement(5700..6476)
/gene="yaaA"
/locus_tag="Z0006"
/db_xref="GeneID:956662"
CDS complement(5700..6476)
/gene="yaaA"
/locus_tag="Z0006"
/function="orf; Unknown function"
/codon_start=1
/transl_table=11
/product="hypothetical protein"
/protein_id="NP_285698.1"
/db_xref="GI:15799686"
/db_xref="GeneID:956662"
/translation="MLILISPAKTLDYQSPLTTTRYTLPELLDNSQQLIHEARKLTPP
QISTLMRISDKLAGINAARFHDWQPDFTPENARQAILAFKGDVYTGLQAETFSEDDFD
FAQQHLRMLSGLYGVLRPLDLMQPYRLEMGIRLENARGKDLYQFWGDIITNKLNEALA
AQGDNVVINLASDEYFKSVKPKKLNAEIIKPVFLDEKNGKFKIISFYAKKARGLMSRF
IIENRLTKPEQLTGFNSEGYFFDEASSSNGELVFKRYEQR"
ORIGIN
1 AAAACCCGGGTTT .......
60 CCCGGGTTTAAAG .......
......
//
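The GenBank requirements described in this section (the "LOCUS", "DEFINITION", "FEATURES" and "ORIGIN" keywords, plus CDS features carrying "/gene", "/locus_tag" and "/translation") can be checked with a plain text scan before submission. The sketch below is an illustration only, not PHASTER's parser. The function name check_genbank is hypothetical, and the code assumes the standard GenBank column layout (feature keys indented 5 spaces, qualifiers indented 21 spaces). Its findings are best read as advisory: the example above contains a CDS without "/gene" (locus tag Z0005), and this section also states that "/translation" may be absent.

```python
import re

REQUIRED_CDS_QUALIFIERS = {"gene", "locus_tag", "translation"}


def check_genbank(text):
    """Return a list of advisory notes about a GenBank record given as text."""
    notes = []
    lines = text.splitlines()
    for keyword in ("LOCUS", "DEFINITION", "FEATURES", "ORIGIN"):
        if not any(line.startswith(keyword) for line in lines):
            notes.append(f"missing {keyword} line")

    cds_list = []       # one (location, qualifier names) pair per CDS feature
    current = None      # the CDS currently being read, if any
    in_features = False
    in_quote = False    # True while inside a multi-line quoted qualifier value
    for line in lines:
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features:
            continue
        if line and not line[0].isspace():
            break  # reached the next top-level keyword, such as ORIGIN
        stripped = line.strip()
        if not stripped:
            continue
        is_feature_key = (
            len(line) > 5 and line[:5].isspace() and not line[5].isspace()
        )
        if is_feature_key and not in_quote:
            parts = stripped.split(None, 1)
            location = parts[1] if len(parts) > 1 else ""
            current = (location, set()) if parts[0] == "CDS" else None
            if current is not None:
                cds_list.append(current)
            continue
        if current is not None and not in_quote:
            match = re.match(r"/(\w+)", stripped)
            if match:
                current[1].add(match.group(1))
        # An odd number of double quotes opens or closes a wrapped value.
        if stripped.count('"') % 2 == 1:
            in_quote = not in_quote

    if not cds_list:
        notes.append("no CDS feature found")
    for location, names in cds_list:
        missing = REQUIRED_CDS_QUALIFIERS - names
        if missing:
            notes.append(f"CDS {location} lacks /" + ", /".join(sorted(missing)))
    return notes
```

An empty list means the scan found none of these issues; it does not guarantee that the file will be accepted.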