Reference Input File Format
PHASTER accepts either raw nucleotide sequence data in FASTA format or annotated genome data in GenBank format. If the input file is in FASTA format, locus tags are assigned sequentially to each identified CDS. Note that a single FASTA file is expected for the entire genome. Annotated cDNAs in FASTA format are not currently supported; please use the GenBank format for annotated genomes. If the input file is in GenBank format, it must provide at least the CDS features, with the "/gene", "/locus_tag" and "/translation" tags inside each CDS feature.
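As an illustration of the two accepted formats, the following minimal Python sketch guesses which one a file is in by looking at its first non-blank line. The function name guess_input_format is hypothetical, and the rule it applies (a leading ">" or a bare sequence line means FASTA, a leading "LOCUS" means GenBank) is an assumption made for this example, not a description of how PHASTER itself detects the format.

```python
# Upper-case IUPAC nucleotide codes; gap symbols are deliberately left out
# (assumption made for this sketch).
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def guess_input_format(text):
    """Guess 'fasta', 'genbank' or 'unknown' from the first non-blank line."""
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip leading blank lines
        if stripped.startswith(">"):
            return "fasta"
        if stripped.startswith("LOCUS"):
            return "genbank"
        # A headerless file that starts directly with sequence is treated
        # as FASTA here (assumption).
        if set(stripped.upper()) <= IUPAC_NUCLEOTIDES:
            return "fasta"
        return "unknown"
    return "unknown"


if __name__ == "__main__":
    print(guess_input_format(">unknown genome\nAGCTTTTCATTCTGACTGCA\n"))
    print(guess_input_format("LOCUS       NC_002655\n"))
```

Checking a file this way before upload can catch obvious mix-ups, such as submitting a protein FASTA file or a different annotation format.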
Raw Nucleotide Sequence Data Format
Example #1: This is an example of a generic FASTA file. The entire file contains more than 5 million nucleotides, and only the top portion is shown here. The header is not important for PHASTER processing, though it will be displayed as the title of the genome. If no header is provided, an arbitrary header will be created. If a header is supplied, it must be placed on the first line. The DNA sequence must conform to IUPAC nucleotide coding. A metagenomic file must contain contigs, each with its own header. From such input, only contigs of length >= 2000 will be processed. If the metagenomic option is not checked, the first 10 sequences will be processed individually.
>gi|16445223|ref|NC_002655.2| Escherichia coli O157:H7 str. EDL933 chromosome, complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTCTCTGACAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCACCACCATCACCATTACCATTACCACAGGTAACGGTGCGGGCTGACGCGTAC
AGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGGCTTTTTTTTCGACCAAAGGTAACGAGGTAACA
ACCATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGGGTTGCCGATA
TTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCA
CCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGT
ATTTTTGCCGAACTTCTGACGGGACTCGCCGCCGCCCAGCCGGGATTCCCGCTGGCGCAATTGAAAACTT
TCGTCGACCAGGAATTTGCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTAGGGCAGTGCCCGGA
TAGCATTAACGCTGCGCTGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAA
GCGCGCGGTCACAACGTTACCGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAAT
CTACTGTCGATATTGCAGAGTCCACCCGCCGTATTGCGGCAAGTCGTATTCCGGCTGATCACATGGTGCT
GATGGCAGGTTTCACCGCCGGTAATGAAAAAGGCGAACTGGTGGTACTTGGACGCAACGGTTCCGACTAC
TCCGCGGCGGTGCTGGCTGCCTGTTTACGCGCCGATTGTTGCGAGATTTGGACGGACGTTGACGGGGTAT
ATACCTGCGACCCGCGTCAGGTGCCCGATGCGAGGTTGTTGAAATCGATGTCCTACCAGGAAGCGATGGA
GCTTTCCTACTTCGGCGCTAAAGTTCTTCACCCCCGCACCATTACCCCCATCGCCCAGTTCCAGATCCCT
TGCCTGATTAAAAATACCGGAAATCCTCAAGCTCCAGGTACGCTCATTGGTGCCAGTCGTGATGAAGACG
AATTACCGGTCAAGGGCATTTCCAATCTGAATAATATGGCAATGTTCAGCGTTTCCGGCCCGGGGATGAA
AGGAATGGTCGGCATGGCGGCGCGCGTCTTTGCTGCAATGTCACGCGCCCGTATTTCCGTGGTGCTGATT
ACGCAATCATCTTCCGAATACAGTATCAGTTTCTGCGTTCCGCAAAGCGACTGTGTGCGAGCTGAACGGG
CAATGCAGGAAGAGTTCTACCTGGAACTGAAAGAAGGCTTACTGGAGCCGCTGGCGGTGACGGAACGGCT
GGCCATTATCTCGGTGGTAGGTGATGGTATGCGCACCTTGCGTGGGATCTCGGCGAAATTCTTTGCCGCG
CTGGCCCGCGCCAATATCAACATTGTCGCTATTGCTCAGGGATCTTCTGAACGCTCAATCTCTGTCGTGG
TAAATAACGATGATGCGACCACTGGCGTGCGCGTTACTCATCAGATGCTGTTCAATACCGATCAGGTTAT
Example #2: Another example in FASTA format.
>unknown genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTCTCTGACAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCACCACCATCACCATTACCATTACCACAGGTAACGGTGCGGGCTGACGCGTAC
AGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGGCTTTTTTTTCGACCAAAGGTAACGAGGTAACA
ACCATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGGGTTGCCGATA
TTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCA
CCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGT
ATTTTTGCCGAACTTCTGACGGGACTCGCCGCCGCCCAGCCGGGATTCCCGCTGGCGCAATTGAAAACTT
TCGTCGACCAGGAATTTGCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTAGGGCAGTGCCCGGA
TAGCATTAACGCTGCGCTGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAA
GCGCGCGGTCACAACGTTACCGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAAT
CTACTGTCGATATTGCAGAGTCCACCCGCCGTATTGCGGCAAGTCGTATTCCGGCTGATCACATGGTGCT
GATGGCAGGTTTCACCGCCGGTAATGAAAAAGGCGAACTGGTGGTACTTGGACGCAACGGTTCCGACTAC
TCCGCGGCGGTGCTGGCTGCCTGTTTACGCGCCGATTGTTGCGAGATTTGGACGGACGTTGACGGGGTAT
ATACCTGCGACCCGCGTCAGGTGCCCGATGCGAGGTTGTTGAAATCGATGTCCTACCAGGAAGCGATGGA
GCTTTCCTACTTCGGCGCTAAAGTTCTTCACCCCCGCACCATTACCCCCATCGCCCAGTTCCAGATCCCT
TGCCTGATTAAAAATACCGGAAATCCTCAAGCTCCAGGTACGCTCATTGGTGCCAGTCGTGATGAAGACG
AATTACCGGTCAAGGGCATTTCCAATCTGAATAATATGGCAATGTTCAGCGTTTCCGGCCCGGGGATGAA
AGGAATGGTCGGCATGGCGGCGCGCGTCTTTGCTGCAATGTCACGCGCCCGTATTTCCGTGGTGCTGATT
ACGCAATCATCTTCCGAATACAGTATCAGTTTCTGCGTTCCGCAAAGCGACTGTGTGCGAGCTGAACGGG
CAATGCAGGAAGAGTTCTACCTGGAACTGAAAGAAGGCTTACTGGAGCCGCTGGCGGTGACGGAACGGCT
GGCCATTATCTCGGTGGTAGGTGATGGTATGCGCACCTTGCGTGGGATCTCGGCGAAATTCTTTGCCGCG
CTGGCCCGCGCCAATATCAACATTGTCGCTATTGCTCAGGGATCTTCTGAACGCTCAATCTCTGTCGTGG
TAAATAACGATGATGCGACCACTGGCGTGCGCGTTACTCATCAGATGCTGTTCAATACCGATCAGGTTAT
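The FASTA rules described in this section (a header line per sequence, IUPAC nucleotide coding, and the 2000-nucleotide minimum for metagenomic contigs) can be checked locally before submission. The following is a minimal Python sketch under stated assumptions: the function names, the placeholder title "untitled" and the exclusion of gap symbols from the allowed alphabet are all hypothetical choices for this example and do not describe PHASTER's own implementation.

```python
# Upper-case IUPAC nucleotide codes; gap symbols are not included
# (assumption made for this sketch).
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")
MIN_CONTIG_LENGTH = 2000  # threshold stated for metagenomic input


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            if header is None:
                # No header line was given: use a placeholder title.
                header = "untitled"
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def invalid_symbols(sequence):
    """Return the sorted characters that are not IUPAC nucleotide codes."""
    return sorted(set(sequence.upper()) - IUPAC_NUCLEOTIDES)


def select_contigs(records, min_length=MIN_CONTIG_LENGTH):
    """Keep only the records whose sequence has at least min_length symbols."""
    return [(h, s) for h, s in records if len(s) >= min_length]


if __name__ == "__main__":
    demo = ">contig_1\nACGTNNRY\n>contig_2\n" + "A" * 2000 + "\n"
    recs = parse_fasta(demo)
    for name, seq in recs:
        print(name, len(seq), invalid_symbols(seq))
    print([name for name, _ in select_contigs(recs)])
```

In this demo, contig_1 is dropped by select_contigs because it is shorter than the threshold, while contig_2 is kept.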
Annotated Genome Format
Example #1: This is an example of a generic GenBank file. The entire file contains more than 20k lines, and only the header and the first few CDSs are shown here. Only the "LOCUS" and "DEFINITION" tags in the header are used, and they are used only for naming. All annotated genes must be recorded in the "FEATURES" section and associated with the "gene" and "CDS" sub-tags. The "CDS" portion may or may not have "/translation". The "ORIGIN" tag and the DNA sequence section must be present in the file.
LOCUS       NC_002655            5528445 bp    DNA     circular BCT 14-DEC-2010
DEFINITION  Escherichia coli O157:H7 str. EDL933 chromosome, complete genome.
ACCESSION   NC_002655
VERSION     NC_002655.2  GI:16445223
DBLINK      Project: 57831
KEYWORDS    .
SOURCE      Escherichia coli O157:H7 str. EDL933
  ORGANISM  Escherichia coli O157:H7 str. EDL933
            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
            Enterobacteriaceae; Escherichia.
REFERENCE   1  (bases 1 to 5528445)
  AUTHORS   Perna,N.T., Plunkett,G. III, Burland,V., Mau,B., Glasner,J.D.,
            Rose,D.J., Mayhew,G.F., Evans,P.S., Gregor,J., Kirkpatrick,H.A.,
            Posfai,G., Hackett,J., Klink,S., Boutin,A., Shao,Y., Miller,L.,
            Grotbeck,E.J., Davis,N.W., Lim,A., Dimalanta,E., Potamousis,K.,
            Apodaca,J., Anantharaman,T.S., Lin,J., Yen,G., Schwartz,D.C.,
            Welch,R.A. and Blattner,F.R.
  TITLE     Genome sequence of enterohaemorrhagic Escherichia coli O157:H7
  JOURNAL   Nature 409 (6819), 529-533 (2001)
   PUBMED   11206551
REFERENCE   2  (bases 1 to 5528445)
  CONSRTM   NCBI Genome Project
  TITLE     Direct Submission
  JOURNAL   Submitted (28-SEP-2001) National Center for Biotechnology
            Information, NIH, Bethesda, MD 20894, USA
REFERENCE   3  (bases 1 to 5528445)
  AUTHORS   Perna,N.T., Plunkett,G. III, Burland,V., Mau,B., Glasner,J.D.,
            Rose,D.J., Mayhew,G.F., Evans,P.S., Gregor,J., Kirkpatrick,H.A.,
            Posfai,G., Hackett,J., Klink,S., Boutin,A., Shao,Y., Miller,L.,
            Grotbeck,E.J., Davis,N.W., Lim,A., Dimalanta,E., Potamousis,K.,
            Apodaca,J., Anantharaman,T.S., Lin,J., Yen,G., Schwartz,D.C.,
            Welch,R.A. and Blattner,F.R.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-OCT-2000) Laboratory of Genetics, University of
            Wisconsin, 445 Henry Mall, Madison, WI 53706, USA
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from AE005174.
            On Oct 26, 2001 this sequence version replaced gi:15799680.
            COMPLETENESS: full length.
FEATURES             Location/Qualifiers
     source          1..5528445
                     /organism="Escherichia coli O157:H7 str.
                     EDL933"
                     /mol_type="genomic DNA"
                     /strain="EDL933"
                     /serotype="O157:H7"
                     /db_xref="taxon:155864"
                     /note="enterohemorrhagic"
     gene            190..273
                     /gene="thrL"
                     /locus_tag="Z0001"
                     /db_xref="GeneID:962112"
     CDS             190..273
                     /gene="thrL"
                     /locus_tag="Z0001"
                     /function="leader; Amino acid biosynthesis: Threonine"
                     /note="involved in threonine biosynthesis; controls the
                     expression of the thrLABC operon"
                     /codon_start=1
                     /transl_table=11
                     /product="thr operon leader peptide"
                     /protein_id="NP_285693.1"
                     /db_xref="GI:15799681"
                     /db_xref="GeneID:962112"
                     /translation="MKRISTTITTTITTTITITITTGNGAG"
     gene            354..2816
                     /gene="thrA"
                     /locus_tag="Z0002"
                     /db_xref="GeneID:962110"
     CDS             354..2816
                     /gene="thrA"
                     /locus_tag="Z0002"
                     /EC_number="2.7.2.4"
                     /EC_number="1.1.1.13"
                     /function="enzyme; Amino acid biosynthesis: Threonine"
                     /note="multifunctional homotetrameric enzyme that
                     catalyzes the phosphorylation of aspartate to form
                     aspartyl-4-phosphate as well as conversion of aspartate
                     semialdehyde to homoserine; functions in a number of amino
                     acid biosynthetic pathways"
                     /codon_start=1
                     /transl_table=11
                     /product="bifunctional aspartokinase I/homeserine
                     dehydrogenase I"
                     /protein_id="NP_285694.1"
                     /db_xref="GI:15799682"
                     /db_xref="GeneID:962110"
                     /translation="MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKIT
                     NHLVAMIEKTISGQDALPNISDAERIFAELLTGLAAAQPGFPLAQLKTFVDQEFAQIK
                     HVLHGISLLGQCPDSINAALICRGEKMSIAIMAGVLEARGHNVTVIDPVEKLLAVGHY
                     LESTVDIAESTRRIAASRIPADHMVLMAGFTAGNEKGELVVLGRNGSDYSAAVLAACL
                     RADCCEIWTDVDGVYTCDPRQVPDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQF
                     QIPCLIKNTGNPQAPGTLIGASRDEDELPVKGISNLNNMAMFSVSGPGMKGMVGMAAR
                     VFAAMSRARISVVLITQSSSEYSISFCVPQSDCVRAERAMQEEFYLELKEGLLEPLAV
                     TERLAIISVVGDGMRTLRGISAKFFAALARANINIVAIAQGSSERSISVVVNNDDATT
                     GVRVTHQMLFNTDQVIEVFVIGVGGVGGALLEQLKRQQSWLKNKHIDLRVCGVANSKA
                     LLTNVHGLNLENWQEELAQAKEPFNLGRLIRLVKEYHLLNPVIVDCTSSQAVADQYAD
                     FLREGFHVVTPNKKANTSSMDYYHLLRHAAEKSRRKFLYDTNVGAGLPVIENLQNLLN
                     AGDELMKFSGILSGSLSYIFGKLDEGMSFSEATTLAREMGYTEPDPRDDLSGMDVARK
                     LLILARETGRELELADIEIEPVLPAEFNAEGDVAAFMANLSQLDDLFAARVAKARDEG
                     KVLRYVGNIDEDGVCRVKIAEVDGNDPLFKVKNGENALAFYSHYYQPLPLVLRGYGAG
                     NDVTAAGVFADLLRTLSWKLGV"
     gene            2818..3750
                     /gene="thrB"
                     /locus_tag="Z0003"
                     /db_xref="GeneID:962111"
     CDS             2818..3750
                     /gene="thrB"
                     /locus_tag="Z0003"
                     /EC_number="2.7.1.39"
                     /function="enzyme; Amino acid biosynthesis: Threonine"
                     /note="catalyzes the formation of O-phospho-L-homoserine
                     from L-homoserine in threonine biosynthesis from asparate"
                     /codon_start=1
                     /transl_table=11
                     /product="homoserine kinase"
                     /protein_id="NP_285695.1"
                     /db_xref="GI:15799683"
                     /db_xref="GeneID:962111"
                     /translation="MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVESAETF
                     SLNNLGRFADKLPSEPRENIVYQCWERFCQELGKQIPVAMTLEKNMPIGSGLGSSACS
                     VVAALMAMNEHCGKPLNDTRLLALMGELEGRISGSIHYDNVAPCFLGGMQLMIEENDI
                     ISQQVPGFDEWLWVLAYPGIKVSTAEARAILPAQYRRQDCIAHGRHLAGFIHACYSRQ
                     PELAAKLMKDVIAEPYRERLLPGFRQARQAVAEIGAVASGISGSGPTLFALCDKPDTA
                     QRVADWLGKNYLQNQEGFVHICRLDTAGARVLEN"
     gene            3751..5037
                     /gene="thrC"
                     /locus_tag="Z0004"
                     /db_xref="GeneID:956660"
     CDS             3751..5037
                     /gene="thrC"
                     /locus_tag="Z0004"
                     /EC_number="4.2.3.1"
                     /function="enzyme; Amino acid biosynthesis: Threonine"
                     /note="catalyzes the formation of L-threonine from
                     O-phospho-L-homoserine"
                     /codon_start=1
                     /transl_table=11
                     /product="threonine synthase"
                     /protein_id="NP_285696.1"
                     /db_xref="GI:15799684"
                     /db_xref="GeneID:956660"
                     /translation="MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLTEID
                     EMLKLDFVTRSAKILSAFIGDEIPQEILEERVRAAFAFPAPVANVESDVGCLELFHGP
                     TLAFKDFGGRFMAQMLTHIAGDKPVTILTATSGDTGAAVAHAFYGLPNVKVVILYPRG
                     KISPLQEKLFCTLGGNIETVAIDGDFDACQALVKQAFDDEELKVALGLNSANSINISR
                     LLAQICYYFEAVAQLPQEARNQLVVSVPSGNFGDLTAGLLAKSLGLPVKRFIAATNVN
                     DTVPRFLHDGQWSPKATQATLSNAMDVSQPNNWPRVEELFRRKIWQLKELGYAAVDDE
                     TTQQTMRELKELGYTSEPHAAVAYRALRDQLNPGEYGLFLGTAHPAKFKESVEAILGE
                     TLDLPKELAERADLPLLSHNLPADFAALRKLMMNHQ"
     gene            5251..5547
                     /locus_tag="Z0005"
                     /db_xref="GeneID:956661"
     CDS             5251..5547
                     /locus_tag="Z0005"
                     /function="orf; Unknown function"
                     /codon_start=1
                     /transl_table=11
                     /product="hypothetical protein"
                     /protein_id="NP_285697.1"
                     /db_xref="GI:15799685"
                     /db_xref="GeneID:956661"
                     /translation="MKKMQSIVLALSLVLVAPMATQAAEITLVPSVKLQIGDRDNRGY
                     YWDGGHWRDHGWWKQHYEWRGNRWHPHGPPPPPRHHKKAHHDHHGGHGPGKHHR"
     gene            complement(5700..6476)
                     /gene="yaaA"
                     /locus_tag="Z0006"
                     /db_xref="GeneID:956662"
     CDS             complement(5700..6476)
                     /gene="yaaA"
                     /locus_tag="Z0006"
                     /function="orf; Unknown function"
                     /codon_start=1
                     /transl_table=11
                     /product="hypothetical protein"
                     /protein_id="NP_285698.1"
                     /db_xref="GI:15799686"
                     /db_xref="GeneID:956662"
                     /translation="MLILISPAKTLDYQSPLTTTRYTLPELLDNSQQLIHEARKLTPP
                     QISTLMRISDKLAGINAARFHDWQPDFTPENARQAILAFKGDVYTGLQAETFSEDDFD
                     FAQQHLRMLSGLYGVLRPLDLMQPYRLEMGIRLENARGKDLYQFWGDIITNKLNEALA
                     AQGDNVVINLASDEYFKSVKPKKLNAEIIKPVFLDEKNGKFKIISFYAKKARGLMSRF
                     IIENRLTKPEQLTGFNSEGYFFDEASSSNGELVFKRYEQR"
ORIGIN
        1 AAAACCCGGGTTT .......
       60 CCCGGGTTTAAAG .......
......
//
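The GenBank requirements listed in this document (the "LOCUS", "DEFINITION", "FEATURES" and "ORIGIN" tags, at least one CDS feature, and the "/gene", "/locus_tag" and "/translation" qualifiers) can be pre-checked with a short script. The following Python sketch is only an illustration: the function names are hypothetical, it assumes the standard GenBank layout in which qualifiers are indented by 21 spaces, and it treats any indented line starting with "/" as a qualifier. It reports missing qualifiers as a list of messages and does not reject the file, because the example record above contains a CDS (Z0005) that has no "/gene" qualifier.

```python
REQUIRED_KEYWORDS = ("LOCUS", "DEFINITION", "FEATURES", "ORIGIN")
EXPECTED_QUALIFIERS = ("/gene", "/locus_tag", "/translation")
QUALIFIER_INDENT = 21  # standard GenBank column for qualifiers (assumption)


def cds_features(text):
    """Return one dict per CDS feature: its location and qualifier names."""
    features, current, in_table = [], None, False
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            in_table = True
            continue
        if not in_table or not line.strip():
            continue
        if not line[0].isspace():
            break  # next top-level keyword (e.g. ORIGIN) ends the table
        indent = len(line) - len(line.lstrip())
        stripped = line.strip()
        if indent < QUALIFIER_INDENT:
            # A feature key line such as "CDS  190..273".
            parts = stripped.split()
            current = None
            if parts[0] == "CDS":
                location = parts[1] if len(parts) > 1 else ""
                current = {"location": location, "qualifiers": set()}
                features.append(current)
        elif stripped.startswith("/") and current is not None:
            current["qualifiers"].add(stripped.split("=", 1)[0])
    return features


def check_genbank(text):
    """Return a list of human-readable problems found in GenBank text."""
    problems = []
    top_level = {
        line.split()[0]
        for line in text.splitlines()
        if line.strip() and not line[0].isspace()
    }
    for keyword in REQUIRED_KEYWORDS:
        if keyword not in top_level:
            problems.append(f"missing {keyword}")
    features = cds_features(text)
    if not features:
        problems.append("no CDS feature found")
    for feature in features:
        missing = [q for q in EXPECTED_QUALIFIERS
                   if q not in feature["qualifiers"]]
        if missing:
            problems.append(
                f"CDS {feature['location']} lacks {', '.join(missing)}")
    return problems
```

Calling check_genbank on the full text of a GenBank file returns an empty list when nothing is flagged. Because the parser relies on column positions, a file whose indentation was altered (for example by copy and paste from a web page) may need to be re-exported before this check is meaningful.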