Description, Instructions, and Tips for Fa-Index


Purpose
This document provides instructions for Fa-Index.

You need not bother reading this document unless you are administering a server running the ProteinProspector programs.

Introduction

FA-Index was developed for five main reasons:
  1. To enable an internal means for the ProteinProspector programs to store an index number when a hit is recorded during a search, then later use that number to retrieve that database entry for output/report generation purposes. This cuts down the memory requirements for program execution.
  2. To provide indices which can be used to accelerate searches that are pre-filtered by intact protein MW, protein pI and/or species.
  3. To aid the ProteinProspector programs in addressing some of the hindrances inherent in FASTA comment line format heterogeneity.
  4. To allow users to create subset databases based on either a Species/Protein MW pre-filter or the results of a previous search. Searches performed on these smaller databases are often much faster than searches performed on complete databases.
  5. To allow users to create databases containing user defined proteins.


Background on the FASTA format

The FASTA format for sequence databases was originally developed by Pearson for use with the FASTA program. Today it is probably the most widely used standard format, primarily because its brevity keeps sequence files small.

An example of the format is shown below:

>sp|P28190|AA1R_BOVIN ADENOSINE A1 RECEPTOR.
MPPSISAFQAAYIGIEVLIALVSVPGNVLVIWAVKVNQALRDATFCFIVSLAVADVAVGA
LVIPLAILINIGPRTYFHTCLKVACPVLILTQSSILALLAMAVDRYLRVKIPLRYKTVVT
PRRAVVAITGCWILSFVVGLTPMFGWNNLSAVERDWLANGSVGEPVIECQFEKVISMEYM
VYFNFFVWVLPPLLLMVLIYMEVFYLIRKQLSKKVSASSGDPQKYYGKELKIAKSLALIL
FLFALSWLPLHILNCITLFCPSCHMPRILIYIAIFLSHGNSAMNPIVYAFRIQKFRVTFL
KIWNDHFRCQPAPPIDEDAPAERPDD

As a standard it leaves something to be desired. The only fixed rule is that each entry has a single comment line, which must begin with the ">" character, and that all subsequent lines of the entry contain sequence. There are, however, many competing "standards" for arranging and delimiting the fields within the comment line. Often the comment line is used to describe basic information like entry name, accession number (or other unique identifier), and the species or organism from which the sequence was obtained.
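
The one rule that all FASTA files share can be sketched in a few lines of Python. This is only an illustration of the format, not the code used by FA-Index; the function name read_fasta is hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (comment_line, sequence) pairs from an iterable of text lines.

    The comment line is returned without its leading ">" character.
    """
    comment = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new comment line closes the previous entry, if there was one.
            if comment is not None:
                yield comment, "".join(chunks)
            comment = line[1:]
            chunks = []
        elif comment is not None and line.strip():
            # Every other non-blank line belongs to the current sequence.
            chunks.append(line.strip())
    if comment is not None:
        yield comment, "".join(chunks)
```

Everything beyond this (which part of the comment line is the accession number, the name or the species) depends on the database.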

The FASTA format was chosen for use with ProteinProspector primarily because of its universality, brevity, and the expected ease with which database files could be shared on the same computer with other programs for sequence analysis.

The FA-Index program creates several indices which are much smaller files than the FASTA database file. These indices aid the ProteinProspector programs in addressing some of the hindrances inherent in the FASTA comment line format heterogeneity.
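
The idea of storing a small index number during a search and using it later to fetch the entry can be sketched as follows. The layout of the real FA-Index files is not described in this document, so this is an assumption-laden illustration only; build_offsets and fetch_entry are hypothetical names, and the index is simply a list of byte offsets held in memory.

```python
import io
from typing import BinaryIO, List


def build_offsets(db: BinaryIO) -> List[int]:
    """Return the byte offset of every comment line in a FASTA file."""
    offsets = []
    db.seek(0)
    while True:
        pos = db.tell()
        line = db.readline()
        if not line:
            break
        if line.startswith(b">"):
            offsets.append(pos)
    return offsets


def fetch_entry(db: BinaryIO, offsets: List[int], index: int) -> str:
    """Return the raw text of entry number `index` (1 for the first entry)."""
    start = offsets[index - 1]
    db.seek(start)
    if index < len(offsets):
        # Read up to the start of the following entry.
        data = db.read(offsets[index] - start)
    else:
        # Last entry: read to the end of the file.
        data = db.read()
    return data.decode("ascii")


# Usage with a tiny in-memory database.
demo = io.BytesIO(b">a one\nMPPS\nISAF\n>b two\nQAAY\n")
demo_offsets = build_offsets(demo)
```

A search then only needs to remember the integer 2 for a hit on the second entry; fetch_entry(demo, demo_offsets, 2) reads that entry back when the report is generated.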


Using other programs with the same FASTA database files

There is no reason that we know of that should prevent use of the FASTA database files by both ProteinProspector programs and other programs which accept FASTA format. Further, we believe it should be possible for the files to be simultaneously read by more than one program at a time. It may be of interest to some users that the SEQUEST program from John Yates' group at the University of Washington also uses FASTA formatted databases.


ProteinProspector filenaming conventions for public FASTA databases

Often the comment line in a FASTA database is used to describe basic information like entry name, accession number (or other unique identifier), and the species or organism from which the sequence was obtained. However, this information is NOT consistently organized into fields in the comment lines of different FASTA databases, though within a specific database it is sometimes consistent.

The way ProteinProspector programs "know" which dialect of FASTA to "speak" with a particular database is via the filename. Acceptable filename prefixes are shown below in bold and the associated comment line format described.
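
Selecting a dialect from the filename might look like the sketch below. The prefix list is taken from this document; the matching rule (case-sensitive, longest prefix first, applied to the basename) and the example filenames are assumptions, and dialect_for is a hypothetical name.

```python
import os
from typing import Optional

# Filename prefixes described in this document.
KNOWN_PREFIXES = [
    "Genpept", "Owl", "SwissProt", "NCBInr", "dbEST", "Ludwignr",
    "DDefault", "PDefault", "DN", "PN", "DA", "PA",
]


def dialect_for(filename: str) -> Optional[str]:
    """Return the known prefix that the database filename starts with."""
    base = os.path.basename(filename)
    # Try longer prefixes first so that a short prefix cannot shadow a long one.
    for prefix in sorted(KNOWN_PREFIXES, key=len, reverse=True):
        if base.startswith(prefix):
            return prefix
    return None
```

With these assumptions, dialect_for("seqdb/SwissProt.r36.fasta") gives "SwissProt", and a file whose name starts with none of the prefixes gives None.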

Genpept

Sample entries:

>gi|216790 (D13314) arginine deiminase [Mycoplasma hominis]
>gi|261706|bbs|120303 (S50809) protein LG=immunoglobulin binding protein {immunoglobulin binding domains} [streptococcus, Peptide Recombinant, 455 aa]

ProteinProspector programs designate:

  • accession number, 216790 in the first example, as the number after the first | in the line. This can be delimited by a | or a space.
  • species, Mycoplasma hominis in the first example, as the string between the last set of square brackets in the line.
  • name, arginine deiminase in the first example, as the string between the first space and the last "[" in the line.

    Some entries cause problems:

    >gi|3928883 unknown

    Previously the accession number was taken to be the number between the first set of round brackets in the line. However, entries like the one above don't have this field. This entry also doesn't contain a species field.

    Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index to the file seqdb/Genpept....usp.

    Some other entries are also potentially problematic.

    This entry is very long and has been truncated, with a > character marking the cut.

    >gi|1387979 (L77099) 44% identity over 302 residues with hypothetical protein from Synechocystis sp, accession D64006_CD; expression induced by environmental stress; some similarity to glycosyl transferases; two potential membrane-spanning helices [Bacillus subtil>

    Neither of the following contains an easily extractable species.

    >gi|1123088 (U42436) coded for by C. elegans cDNA yk56a1.3; coded for by C. elegans cDNA CEMSG41FB; coded for by C. elegans cDNA yk81f4.5; coded for by C. elegans cDNA yk56a1.5; coded for by C. elegans cDNA yk81f4.3; similar to the S5P family of ribosomal proteins
    >gi|2330745|gnl|PID|e334350 (Z98598) SPAC1B3.11c, ras-related protein, len:234aa, similar eg. to RB4B_RAT, P51146, ra

    This entry doesn't have anything in the name field but the species is OK.

    >gi|1575686 (U70379) [Synechococcus PCC7942]

    The accession numbers previously used (those in the round brackets) are identical for these two entries.

    >gi|3928875 (AF093611) putative chloroplast desaturase [Acetabularia acetabulum]
    >gi|3928876 (AF093611) putative chloroplast desaturase [Acetabularia acetabulum]

    This entry contains a zone delimited by [ ] characters which is not at the end of the line and doesn't contain a species.

    >gi|3881286|gnl|PID|e1350785 (AL021507) [980325 dl] : Prediction spanned chimera, modified based on new 3' sequence information (o/l with F14D1); cDNA EST EMBL:D34402 comes from this gene; cDNA EST EMBL:D37454 comes from this gene; cDNA EST EMBL:D68054 comes from this gene; cDNA E>

    Owl

    This database should now be downloaded from NCBI and the comment line format has changed. The entries come from four different sources:

    >owl|Q62671|100K_RAT 100 KD PROTEIN (EC 6.3.2.-). - RATTUS NORVEGICUS (RAT).

    ProteinProspector programs designate:

  • accession number, 100K_RAT as the string between the second | and the first space in the line.
  • species, RAT, as the characters after the underscore in the accession number.
  • name: 100 KD PROTEIN (EC 6.3.2.-). as the string between the first space and the last dash " -" in the line.

    >owl|B40638|B40638 isocytochrome c2 - Rhodobacter sphaeroides

    ProteinProspector programs designate:

  • accession number, B40638 as the string between the second | and the first space in the line.
  • species, Rhodobacter sphaeroides, as the text string following the last space-dash-space (" - ") in the line.
  • name: isocytochrome c2 as the string between the first space and the last dash " -" in the line.

    >owl|Z31371|A7120FTSZ1 A7120FTSZ NID: g1100793 - Anabaena PCC7120.

    ProteinProspector programs designate:

  • accession number, A7120FTSZ1 as the string between the second | and the first space in the line.
  • species, Anabaena PCC7120. as the text string following the last space-dash-space (" - ") in the line. Note that there is a full stop after the species which must be deleted.
  • name: A7120FTSZ NID: g1100793 as the string between the first space and the last dash " -" in the line.

    >owl||NRL_1A00B hemoglobin beta chain mutant (V1M, W37Y) (deoxy), chain B - human

    ProteinProspector programs designate:

  • accession number, NRL_1A00B as the string between the second | and the first space in the line.
  • species, human, as the text string following the last space-dash-space (" - ") in the line.
  • name: hemoglobin beta chain mutant (V1M, W37Y) (deoxy), chain B as the string between the first space and the last dash " -" in the line.

    Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index to the file seqdb/Owl....usp.

    A typical comment line causing problems is:

    >owl|P15455|12S1_ARATH 12S SEED STORAGE PROTEIN PRECURSOR. - ARABIDOPSIS THALIANA (MOUSE-EAR...

    There are three full stops at the end of the line.

    Previously the comment lines had the following format:

    >10KD_VIGUN 10 KD PROTEIN PRECURSOR (CLONE PSAS10). - VIGNA UNGUICULATA (COWPEA).
    >AEOHFPA AEOHFPA NID: g141875 - A.hydrophila DNA, clone pPH4.
    >pir|Q62671|100K_RAT 100 KD PROTEIN (EC 6.3.2.-). - RATTUS NORVEGICUS (RAT).

    SwissProt

    Sample entry as output by the sp2fasta program:

    >sp|P16105|H32_BOVIN HISTONE H3 (H3.2)

    Sample entry if the database is downloaded from NCBI:

    >gi|122068|sp|P16105|H32_BOVIN HISTONE H3 (H3.2)

    ProteinProspector programs designate:

  • accession number, P16105, as the alphanumeric string between the first sp| and the next | in the comment line.
  • species, BOVIN, as the string between the underscore and the space in the next field.
  • name: HISTONE H3 (H3.2), as the string following the species

    Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire comment line (this usually does not happen for any entries in SwissProt). All of these UNREADABLE lines are then written by FA-Index to the file seqdb/SwissProt....usp.

    A few entries are of the following form:

    >gi|400027|sp||HYEP_PSESP_2 [Segment 2 of 3] EPOXIDE HYDROLASE (EPOXIDE HYDRATASE)

    In these cases the accession number is taken as being the first number, i.e. 400027 for the example shown.

    The species is extracted from the code between the last vertical bar and the first space. It appears after the first underscore and before the second underscore (if present). The name is the rest of the line.
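
The SwissProt rules above, including the special case with an empty accession field, can be sketched as follows. This is a minimal illustration under stated assumptions, not the FA-Index source: parse_swissprot is a hypothetical name, the comment line is passed without its leading ">", and the result is returned as an (accession, species, name) tuple.

```python
import re
from typing import Optional, Tuple


def parse_swissprot(comment: str) -> Tuple[Optional[str], str, str]:
    """Split a SwissProt comment line into (accession, species, name)."""
    # sp|<accession>|<entry name> <rest of line>
    m = re.search(r"sp\|([^|]*)\|(\S+)(?: (.*))?$", comment)
    if not m:
        return None, "UNREADABLE", comment
    accession, entry_name, name = m.group(1), m.group(2), m.group(3) or ""
    if not accession:
        # Empty accession field: fall back to the first number in the line.
        first_number = re.search(r"\d+", comment)
        accession = first_number.group(0) if first_number else None
    # The species code sits after the first underscore of the entry name
    # and before the second underscore, if there is one.
    parts = entry_name.split("_")
    if len(parts) < 2 or not parts[1]:
        return accession, "UNREADABLE", comment
    return accession, parts[1], name.strip()
```

For the sample entry this gives P16105, BOVIN and HISTONE H3 (H3.2), whether or not the line starts with a gi| field.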

    NCBInr

    The comment lines from this database are tricky to handle because it is a non-redundant database which collects entries from several databases; thus there are several formats present in the final database.

    Further information is available from the NCBI site.

    1. Entries of the following format are from Genpept:

    Note that this format has now been discontinued and all Genpept entries are now in the format described in section 2. Support for this format will be continued for a while for people who have an old copy of the database. The corresponding comment lines in the new format are given in section 2.

    >gi|304881 (L07596) alaS [Escherichia coli]

    ProteinProspector programs designate:

  • accession number, 304881, as all consecutive digits following the first "|"
  • species, Escherichia coli, as the text string inside the last set of square brackets.
  • name, (L07596) alaS, as the text string between the first space and the last set of square brackets.

    Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index to the file seqdb/NCBInr....usp.
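
The square-bracket rules above can be sketched as follows. This is an illustration only, not the FA-Index source: parse_bracketed is a hypothetical name, the comment line is passed without its leading ">", and treating a truncated line with no closing "]" as UNREADABLE is an assumption.

```python
import re
from typing import Optional, Tuple


def parse_bracketed(comment: str) -> Tuple[Optional[str], str, str]:
    """Split a comment line into (accession, species, name)."""
    # Accession: all consecutive digits following the first "|".
    m = re.match(r"[^|]*\|(\d+)", comment)
    accession = m.group(1) if m else None

    first_space = comment.find(" ")
    open_pos = comment.rfind("[")
    close_pos = comment.rfind("]")
    if first_space == -1 or open_pos < first_space or close_pos < open_pos:
        # No usable species field: the name is the entire comment line.
        return accession, "UNREADABLE", comment

    # strip() also removes a stray space such as "[Ilyanassa obsoleta ]".
    species = comment[open_pos + 1:close_pos].strip()
    if not species:
        return accession, "UNREADABLE", comment
    name = comment[first_space + 1:open_pos].strip()
    return accession, species, name
```

Applied to the sample entry, this returns 304881, Escherichia coli and (L07596) alaS; the entry "gi|3928883 unknown" comes back with the species UNREADABLE.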

    Lines which are too long are terminated by three full stops:

    >gi|2429520 (AF025469) Similar to acetyl-CoA carboxylase; coded for by C. elegans cDNA yk16c3.3; coded for by C. elegans cDNA yk36b11.3; coded for by C. elegans cDNA yk43h8.3; coded for by C. elegans cDNA yk24d2.3; coded for by C. elegans cDNA yk24d2.5;...

    This line has a space at the end of the species:

    >gi|520517 (U10338) RNA polymerase II, largest subunit [Ilyanassa obsoleta ]

    The following entry has no species:

    >gi|3928883 unknown

    Here are some more examples. Note that what is in the species field isn't always a species.

    >gi|149575 (M76708) L(+)-lactate dehydrogenase [Lactobacillus casei]
    >gi|45803 (X04609) gamma subunit (3'terminus); pid:g45803 [thermophilic bacterium PS3]
    >gi|289135 (L10036) unknown [Anabaena PCC7120]
    >gi|402254 (U01238) beta subunit of the molybdenum-iron nitrogenase [Frankia sp.]
    >gi|414523 (U02284) beta-lactamase [Cloning vector pSP65]
    >gi|439619 (L25848) [Salmonella typhimurium IS200 insertion sequence from SARA17, partial.], gene product [Salmonella typhimurium]
    >gi|431128 (L15633) start [Transposon Tn916]
    >gi|466378 (U07618) SSB [Unknown]
    >gi|403947 (U01693) (M90060); Homology to GenBank Accession numORF-X from STRATPASEA [Mycoplasma genitalium]
    >gi|405516 (L22217) This ORF is homologous to nitroreductase from Enterobacter cloacae, Accession Number A38686, and Salmonella, Accession Number P15888. [Mycoplasma-like organism]
    >gi|457139 (L29100) transposase [Insertion sequence IS150 homolog]
    >gi|468279 (L31491) nreA [pTOM9]
    >gi|413733 (L25424) orf 1 [Plasmid pCB2.4]
    >gi|144453 (M94320) very similar to DNA polymerase of Bacillus subtilis bacteriophage SPO2; potential DNA polymerase; putative [Citrus greening disease-associated bacterium-like organism]
    >gi|971400 (X88862) immunogenic polyprotein with 2A protease [Foot-and-mouth disease virus]
    >gi|1008449 (L19624) envelope glycoprotein [Human immunodeficiency virus type 1]
    >gi|1718307 (U75698) ORF 54; dUTPase homolog; EBV BLLF3 homolog [Kaposi's sarcoma-associated herpesvirus]
    >gi|2271117 (AF008696) hemagglutinin [influenza A virus (A/South_Australia/68/92(H3N2))]
    >gi|2444119 (U88974) ORF40 [Streptococcus thermophilus temperate bacteriophage O1205]
    >gi|2662546 (AF036688) No definition line found [Caenorhabditis elegans]
    >gi|4206510 (AF066801) ribulose 1,5-bisphosphate carboxylase [Dictamnus sp. M.W.Chase-1820K]

    2. Entries of the following format are from GenBank:

    >gi|1680564|gb||S58174_1 (S58174) putative RNA polymerase [Pelargonium leaf curl virus]

    gb|accession|locus

    ProteinProspector programs designate:

  • accession number, 1680564, as all consecutive digits following the first "|"
  • species, Pelargonium leaf curl virus, as the text string inside the last set of square brackets.
  • name, (S58174) putative RNA polymerase, as the text string between the first space and the last set of square brackets.

    Here are some more examples.

    >gi|1683178|gb||S69825_2 (S69825) coat/capsid protein [Sweet potato feathery mottle virus (strain CH)]
    >gi|1683615|gb||S81342_1 (S81342) unnamed protein product [Mus sp.]

    Here are some example entries which have changed from format 1 to this format:

    >gi|304881|gb|AAA71918.1| (L07596) alaS [Escherichia coli]
    >gi|520517|gb|AAA50229.1| (U10338) RNA polymerase II, largest subunit [Ilyanassa obsoleta]
    >gi|3928883|gb|AAC79708.1| unknown
    >gi|289135|gb|AAD04186.1| (L10036) unknown [Anabaena PCC7120]
    >gi|402254|gb|AAA03325.1| (U01238) beta subunit of the molybdenum-iron nitrogenase [Frankia sp.]
    >gi|414523|gb|AAB60535.1| (U02284) beta-lactamase [Cloning vector pSP65].gi|644827|gb|AAA64566.1| (U19867) beta-lactamase [Cloning vector pSPL3]
    >gi|431128|gb|AAC36978.1| (L15633) start [Transposon Tn916]
    >gi|466378|gb|AAA17041.1| (U07618) SSB [Plasmid R751]
    >gi|403947|gb|AAB01006.1| (U01693) (M90060); Homology to GenBank Accession numORF-X from STRATPASEA [Mycoplasma genitalium]
    >gi|405516|gb|AAA18506.1| (L22217) This ORF is homologous to nitroreductase from Enterobacter cloacae, Accession Number A38686, and Salmonella, Accession Number P15888. [Phytoplasma sp.]
    >gi|457139|gb|AAA98137.1| (L29100) transposase [Bacillus thuringiensis]
    >gi|468279|gb|AAA72440.1| (L31491) nreA [Plasmid pTOM9]
    >gi|413733|gb|AAA97418.1| (L25424) orf 1 [Plasmid pCB2.4]
    >gi|144453|gb|AAA23103.1| (M94320) very similar to DNA polymerase of Bacillus subtilis bacteriophage SPO2; potential DNA polymerase; putative [Citrus greening disease-associated bacterium]
    >gi|1008449|gb|AAA78793.1| (L19624) envelope glycoprotein [Human immunodeficiency virus type 1]
    >gi|1718307|gb|AAC57136.1| (U75698) ORF 54; dUTPase homolog; EBV BLLF3 homolog [Kaposi's sarcoma-associated herpesvirus].gi|2246506|gb|AAB62631.1| (U93872) ORF 54, dUTPase homolog [Kaposi's sarcoma-associated herpesvirus]
    >gi|2271117|gb|AAB66763.1| (AF008696) hemagglutinin [influenza A virus (A/South_Australia/68/92(H3N2))]
    >gi|4206510|gb|AAD11686.1| (AF066801) ribulose 1,5-bisphosphate carboxylase [Dictamnus sp. M.W.Chase-1820K]

    3. Entries of the following format are from SWISS-PROT:

    >gi|132349|sp|P15394|REPA_AGRTU REPLICATING PROTEIN

    sp|accession|entry name

    ProteinProspector programs designate:

  • accession number, 132349, as all consecutive digits following the first "|"
  • species, AGRTU, as the text string between the underscore and the next space when preceded by "sp|...|"
  • name, REPLICATING PROTEIN, as the text string following the species.

    Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index to the file seqdb/NCBInr....usp.

    Lines which are too long are terminated by three full stops:

    >gi|123494|sp|P22291|SULD_STRPN BIFUNCTIONAL FOLATE SYNTHESIS PROTEIN (DIHYDRONEOPTERIN ALDOLASE (DHNA) / 2-AMINO-4-HYDROXY-6-HYDROXYMETHYLDIHYDROPTERIDINE PYROPHOSPHOKINASE (7,8-DIHYDRO-6-HYDROXYMETHYLPTERIN PYROPHOSPHOKINASE) (HPPK) (6-HYDROXYMETHYL-7...

    A few entries are of the following form:

    >gi|4033439|sp||LEC_VICVI_1 [Segment 1 of 4] LECTIN B4 (VVLB4)

    In these cases the species field is terminated in an underscore (VICVI for the example shown).

    4. Entries of the following format are GNL Entries:

    >gi|216351|gnl|PID|d1003451 (D13793) ORF [Bacillus subtilis]

    Here gnl stands for general and the next field, PID in the above case, identifies the database.

    gnl|database|identifier

    ProteinProspector programs designate:

  • accession number, 216351, as all consecutive digits following the first "|"
  • species, Bacillus subtilis, as the text string inside the last set of square brackets.
  • name, (D13793) ORF, as the text string between the first space and the last set of square brackets.

    5. Entries of the following format are from NBRF PIR:

    >gi|282349|pir||A41961 chitinase (EC 3.2.1.14) D - Bacillus circulans

    pir||entry

    ProteinProspector programs designate:

  • accession number, 282349, as all consecutive digits following the first "|"
  • species, Bacillus circulans, as the text string following the last space-dash-space (" - ") in the line.
  • name, chitinase (EC 3.2.1.14) D, as the text string between the first space and the last dash " -" in the line.
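
The space-dash-space rules for PIR entries can be sketched as follows. Again this is only an illustration under stated assumptions: parse_pir is a hypothetical name and the comment line is passed without its leading ">".

```python
import re
from typing import Optional, Tuple


def parse_pir(comment: str) -> Tuple[Optional[str], str, str]:
    """Split a PIR-style comment line into (accession, species, name)."""
    # Accession: all consecutive digits following the first "|".
    m = re.match(r"[^|]*\|(\d+)", comment)
    accession = m.group(1) if m else None

    first_space = comment.find(" ")
    dash_pos = comment.rfind(" - ")
    if first_space == -1 or dash_pos < first_space:
        return accession, "UNREADABLE", comment

    # Species: everything after the last " - ".
    species = comment[dash_pos + 3:].strip()
    if not species:
        return accession, "UNREADABLE", comment
    # Name: everything between the first space and the last " -".
    name = comment[first_space + 1:dash_pos].strip()
    return accession, species, name
```

Note that the species string keeps any trailing qualifier, so an entry ending in "- Bacillus subtilis (fragment)" yields the species "Bacillus subtilis (fragment)".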

    Here are some more examples:

    >gi|80297|pir||JN0146 hypothetical protein (div+ 3' region) - Bacillus subtilis (fragment)
    >gi|77616|pir||A36125 branched-chain amino acid transport protein braC - Pseudomonas aeruginosa (strain PAO)
    >gi|538696|pir||A40613 avirulence protein avrRpt2 - Pseudomonas syringae (strain DC3000, pv. tomato)
    >gi|98505|pir||S21241 oligo-1,6-glucosidase (EC 3.2.1.10) - Bacillus "thermoamyloliquefaciens" (strain KP1071) (fragment)
    >gi|320384|pir||A37388 probable DNA-binding protein 1A - Thermus aquaticus (strain HB8) insertion sequence IS1000
    >gi|477498|pir||A49131 releasechannel homolog - fruit fly (Drosophila melanogaster) (fragment)

    6. Entries of the following format are GenInfo Backbone Id Entries:

    >gi|3712669|bbs|85194 (S85224) vascular endothelial growth factor; VEGF 206 [Homo sapiens]

    bbs|number

    ProteinProspector programs designate:

  • accession number, 3712669, as all consecutive digits following the first "|"
  • species, Homo sapiens, as the text string inside the last set of square brackets.
  • name, (S85224) vascular endothelial growth factor; VEGF 206, as the text string between the first space and the last set of square brackets.

    Sometimes the species field isn't present:

    >gi|386067|bbs|133197 cytochrome c3

    Sometimes it contains extra text apart from the species name:

    >gi|386065|bbs|133195 cytochrome c3 {N-terminal} [Desulfovibrio vulgaris, NCIMB 8303, Peptide Partial, 22 aa]

    Sometimes it appears twice:

    >gi|236142|bbs|57690 (S57688) EF-G=elongation factor G [Thermotoga maritima, Peptide, 682 aa] [Thermotoga maritima]

    Sometimes the species is recorded as unidentified:

    >gi|913316|bbs|163145 (S76565) T-cell receptor beta chain VJ region {clone N4} [not specified, vesicular stomatitis virus-specific CTL, Peptide Partial, 15 aa] [unidentified]

    Here are a couple of examples where the comment line has been truncated. In such cases it is terminated by three full stops:

    >gi|435743|bbs|139151 (S66567) alpha-atrial natriuretic factor/coat protein, alpha-ANF/coat protein=fusion polypeptide(coat protein, alpha-atrial natriuretic factor, alpha-ANF) [human, bacteriophage fr, expression vector pFAN15, Peptide PlasmidSynthetic...
    >gi|833965|bbs|160632 (S75335) polyprotein(structural protein C, structural protein E, structural protein M, structural protein PreM, nonstructural protein NS1) [dengue type 1 D1 virus, Mochizuki, Peptide Partial, 50 aa, segment 2 of 2] [Dengue virus ty...

    7. Entries of the following format are from the Brookhaven Protein Data Bank:

    >gi|230242|pdb|1PFK|A Escherichia coli
    >gi|4139942|pdb|1BC5|T Chain T, Chemotaxis Receptor Recognition By Protein Methyltransferase Cher
    >gi|231004|pdb|4ER4|I synthetic construct
    >gi|494001|pdb|1EGF| Epidermal Growth Factor (Egf) (Nmr, 16 Structures)
    >gi|493782|pdb|146L| Lysozyme (E.C.3.2.1.17) Mutant With Cys 54 Replaced By Thr, Cys 97 Replaced By Ala, Leu 121 Replaced By Met, Ala 129 Replaced By Leu, Leu 133 Replaced By Met, Val 149 Replaced By Ile, Phe 153 Replaced By Trp (C54t,C97a,L121m,A129l,...
    >gi|230275|pdb|1R1A|1 Human rhinovirus 1A

    pdb|entry|chain

    ProteinProspector programs designate:

  • accession number, 230242 in the first example, as all consecutive digits following the first "|"
  • species, as UNREADABLE because it isn't reliably positioned within the comment line.
  • name, as the entire comment line.

    All the comment lines of this format are written by FA-Index to the file seqdb/NCBInr....usp.

    8. Entries of the following format are from the Protein Research Foundation:

    >gi|742246|prf||2009326A beta glucosidase [Cellvibrio gilvus]

    prf||name

    ProteinProspector programs designate:

  • accession number, 742246, as all consecutive digits following the first "|"
  • species, Cellvibrio gilvus, as the text string inside the last set of square brackets.
  • name, beta glucosidase, as the text string between the first space and the last set of square brackets.

    Here is another example.

    >gi|225172|prf||1210227A amylase subtilisin inhibitor alpha [Hordeum vulgare var. distichum]

    9. Entries of the following format are from the DNA Database of Japan (DDBJ):

    >gi|2440229|dbj||AB006689_5 (AB006689) ORF13 [Agrobacterium rhizogenes]

    ProteinProspector programs designate:

  • accession number, 2440229, as all consecutive digits following the first "|"
  • species, Agrobacterium rhizogenes, as the text string inside the last set of square brackets.
  • name, (AB006689) ORF13, as the text string between the first space and the last set of square brackets.

    dbj|accession|locus

    Here is another example.

    >gi|1805521|dbj||D90852_18 (D90852) ORF_ID:o250#11; similar to [SwissProt Accession Number P19779]; start codon is not identified yet [Escherichia coli]

    10. Entries of the following format are from the EMBL Data Library:

    >gi|6|emb|CAA42669.1| (X60065) beta-2-glycoprotein I [Bos taurus]

    emb|accession|locus

    ProteinProspector programs designate:

  • accession number, 6, as all consecutive digits following the first "|"
  • species, Bos taurus, as the text string inside the last set of square brackets.
  • name, (X60065) beta-2-glycoprotein I, as the text string between the first space and the last set of square brackets.

    Sometimes the species field isn't present:

    >gi|6065756|emb|CAB58425.1| (AJ238324) Clostridium difficile binary toxin A

    Here is an example where the comment line has been truncated. In such cases it is terminated by a > character:

    >gi|6018922|emb|CAB58111.1| (AL121806) /prediction=(method:""genefinder"", version:""084"", score:""32.36"")~/prediction=(method:""genscan"", version:""1.0"")~/match=(desc:""EUKARYOTIC TRANSLATION INITIATION FACTOR 4E (EIF-4E) (EIF4E) (MRNA CAP-BINDING PROTEIN) (EIF-4F 25 KD SUBU>

    11. Entries of the following format are NCBI Reference Sequences:

    >gi|5713315|ref|NP_002060.1| guanine nucleotide binding protein (G protein), alpha inhibiting activity polypeptide 1

    ref|accession|locus

    ProteinProspector programs designate:

  • accession number, 5713315 as all consecutive digits following the first "|"
  • species, as UNREADABLE because it isn't generally present in the comment line.
  • name, as the entire comment line.

    All the comment lines of this format are written by FA-Index to the file seqdb/NCBInr....usp.
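
Since one NCBInr file mixes all eleven formats, the species rule has to be chosen line by line. The sketch below summarises the sections above as a lookup on the database tag that follows the gi number. The rule labels ("brackets", "underscore", "dash", "unreadable") and the name species_rule are hypothetical; only the mapping itself is taken from this document.

```python
import re
from typing import Optional

# Database tag -> where the species is looked for, as described above.
SPECIES_RULE = {
    "gb": "brackets", "gnl": "brackets", "bbs": "brackets",
    "prf": "brackets", "dbj": "brackets", "emb": "brackets",
    "sp": "underscore",   # code after the underscore in the entry name
    "pir": "dash",        # text after the last " - "
    "pdb": "unreadable",  # species not reliably positioned
    "ref": "unreadable",  # species not generally present
}


def species_rule(comment: str) -> Optional[str]:
    """Return the species rule for one NCBInr comment line (without ">")."""
    m = re.match(r"gi\|\d+(?:\|([a-z]+)\|)?", comment)
    if not m:
        return None
    tag = m.group(1)
    if tag is None:
        # Discontinued Genpept format: "gi|number (accession) name [species]".
        return "brackets"
    return SPECIES_RULE.get(tag)
```

A line such as "gi|282349|pir||A41961 ..." is routed to the dash rule, while pdb and ref lines go straight to UNREADABLE.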

    dbEST

    This database wins the booby prize as the one with the least consistent comment lines.

    Sample entry:

    >gi|1705383|gb|N20717|N20717 SMNHADA002044SK SmAW Schistosoma mansoni cDNA 5'

    ProteinProspector programs designate:

  • accession number, 1705383, as all consecutive digits following "gi|"
  • species, Schistosoma mansoni; since this database is so haphazard in its placement of the species, FA-Index does a string search in the line after first consulting the file dbEST.spl.txt for valid species names. The string search method is possible with this particular database because there is a more limited range of species represented. However, this means that a server administrator needs to keep the dbEST.spl.txt file up to date to ensure continuous high quality species searching of dbEST with ProteinProspector programs. This task, though annoying, is made somewhat easier by consulting the seqdb/dbEST.usp file. Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index to the file seqdb/dbEST.usp.
  • name, N20717 SMNHADA002044SK SmAW Schistosoma mansoni cDNA 5', as the string following the first space.
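
The string search for dbEST species can be sketched as follows. The format of dbEST.spl.txt is not described here, so the sketch assumes the valid species names have already been loaded into a list; preferring the longest matching name is also an assumption, and find_species is a hypothetical name.

```python
from typing import Iterable


def find_species(comment: str, valid_species: Iterable[str]) -> str:
    """Return the longest valid species name that occurs in the comment line."""
    hits = [name for name in valid_species if name and name in comment]
    if not hits:
        return "UNREADABLE"
    # Prefer "Schistosoma mansoni" over a shorter hit such as "Schistosoma".
    return max(hits, key=len)
```

A species missing from the list always ends up as UNREADABLE, which is why the species list has to be kept up to date.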

    Sometimes the comment lines are very long and appear to consist of two comment lines appended together. The two comment lines are separated by a non-printable binary character (ASCII code Control A) shown here as a full stop. In such cases ProteinProspector only considers the first part of the comment line.

    >gi|3771232|gb|AI209290|AI209290 SWOvAFCAP09G09SK Onchocerca volvulus adult female cDNA (SAW98MLW-OvAF) Onchocerca volvulus cDNA clone SWOvAFCAP09G09 5', mRNA sequence [Onchocerca volvulus].gi|3789602|gb|AI216948|AI216948 SWOvAFCAP10G11SK Onchocerca volvulus adult female cDNA (SAW98MLW-OvAF) Onchocerca volvulus cDNA clone SWOvAFCAP10G115', mRNA sequence [Onchocerca volvulus]
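
Keeping only the first part of such a line is a one-line operation; Control A is the character with ASCII code 1. The name first_comment is hypothetical.

```python
CONTROL_A = "\x01"  # ASCII code 1, shown as a full stop in the example above


def first_comment(comment: str) -> str:
    """Return the part of a comment line before the first Control A."""
    return comment.split(CONTROL_A, 1)[0]
```

Lines without a Control A character are returned unchanged.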

    Ludwignr

    This is another non-redundant database. The entries are of the following format:

    db|accno|ID|CRC Description[species]

    db - database
    CRC - 64-bit cyclic redundancy check

    Here are some examples, one from each of the components of the database:

    >gp|M84711|182775|000037AE195F7A9D v-fos transformation effector protein [Homo sapiens]
    >gp|AL391014|9716128|0006579AD1B1EEE8 putative DNA-binding protein [Streptomyces coelicolor A3(2)]
    >pir|A91719|GGIC1A|0027F62F6F36BA36 globin CTT-IA - midge (Chironomus thummi thummi)[Chironomus thummi thummi]
    >pir|JX0361|JX0361|00013E4475F84453 subtilisin-trypsin inhibitor, SIL10 - Streptomyces sp.[Streptomyces sp.]
    >pir|JC7193|PC7055|0154D83E82AA822B cell division protein FtsQ - Streptomyces collinus (fragment)[Streptomyces collinus]
    >pir|A29526|A29526|02AC2025766BCBC7 ubiquitin B processed pseudogene - human[Homo sapiens]
    >sp|P55820|SN25_RABIT|00014F740FEB29C5 (SNAP..)SYNAPTOSOMAL-ASSOCIATED PROTEIN 25 (SNAP-25) (SUPER PROTEIN) (SUP) (FRAGMENTS).[Oryctolagus cuniculus]
    >sp_vs|P16157-01|P16157|004EDB42F81EBDE8 ISOFORM 2.2 OF P16157[Homo sapiens]
    >tr|AF247519|AAF71733|0001F06BB33BD2E8 Gag protein (Fragment).[Human immunodeficiency virus type 1]
    >tr|U83613|O09751|0000148C132C06BD (POL)REVERSE TRANSCRIPTASE (FRAGMENT).[Human immunodeficiency virus type 1]
    >tr_vs|P70390-01|P70390|0172F8C6825A0023 ISOFORM OG12B/PRX3B OF P70390[Mus musculus]
    >wp|CE24847|C44C3.3|0205CAE438EE8B14 (ST.LOUIS) TR:P91157 protein_id:AAB37360.1[C. elegans]
    >yp|ORFP:YDR094W|0642CC1F954A58E2 YDR094W, Chr IV from 635833-636168[S. cerevisiae]

    ProteinProspector programs designate (example from first line above):

  • accession number, M84711 as the text string between the first vertical bar and the next vertical bar.
  • species, Homo sapiens as the text string inside the last set of square brackets.
  • name, v-fos transformation effector protein, as the text string between the first space and the last set of square brackets.
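
The Ludwignr rules above can be sketched as follows. This is an illustration only: parse_ludwignr is a hypothetical name and the comment line is passed without its leading ">".

```python
from typing import Optional, Tuple


def parse_ludwignr(comment: str) -> Tuple[Optional[str], str, str]:
    """Split a Ludwignr comment line into (accession, species, name)."""
    # Accession: the text between the first vertical bar and the next one.
    bars = comment.split("|", 2)
    accession = bars[1] if len(bars) == 3 else None

    first_space = comment.find(" ")
    open_pos = comment.rfind("[")
    close_pos = comment.rfind("]")
    if first_space == -1 or open_pos < first_space or close_pos < open_pos:
        return accession, "UNREADABLE", comment

    species = comment[open_pos + 1:close_pos].strip()
    if not species:
        return accession, "UNREADABLE", comment
    # Name: everything between the first space and the last "[".
    name = comment[first_space + 1:open_pos].strip()
    return accession, species, name
```

The same rule copes with the yp entries, which have one field fewer before the CRC.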


    ProteinProspector filenaming conventions for proprietary/generic FASTA databases

    Often the comment line in a FASTA database is used to describe basic information like entry name, accession number (or other unique identifier), and the species or organism from which the sequence was obtained. With well curated databases, this information is consistently organized into fields in the comment line of a FASTA formatted database.

    For ProteinProspector programs the sequence field is subject to only two constraints: 1) it must be in CAPITAL letters, and 2) it must be in single-letter code (some people express amino acids in three-letter code).

    The way ProteinProspector programs "know" which dialect of FASTA to "speak" with a particular database's comment line is via the filename. Generic filename prefixes are shown below in bold and the associated comment line format described. These formats are handled in a relatively robust manner, to allow for the absence of fields or the presence of additional fields. The formats basically consist of "|" delimited fields of accession number, name, and species in that order.

    DN and PN

    The D forms designate that the sequence is DNA and will be translated into protein sequence by ProteinProspector programs. The P forms indicate protein sequence.

    > 417909| Better than sliced bread growth factor beta|Mouse|pancreas|

    ProteinProspector programs designate:

  • accession number, 417909, as the integer before the first "|"
  • name, Better than sliced bread growth factor beta, as the string between the first "|" and second "|" (or the end of the line, if no second "|")
  • species, Mouse, as the string between the second "|" and third "|" (or the end of the line, if no third "|")

    Whenever the species cannot be found, it is assigned as UNREADABLE and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index to the file seqdb/DN.usp or seqdb/PN.usp.
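
    As a rough illustration of the DN/PN rules, the sketch below splits a comment line on "|" and falls back to UNREADABLE when no species field is present. The function name is hypothetical, and two details are assumptions rather than documented behaviour: a non-integer accession number simply raises ValueError, and the "entire comment line" is taken without the leading ">".

```python
def parse_dn_pn_comment(line):
    """Return (accession, name, species) for a DN/PN style comment line."""
    body = line.lstrip(">").strip()
    fields = [field.strip() for field in body.split("|")]
    # Accession number: the integer before the first "|".
    # Assumption: int() raising ValueError stands in for whatever FA-Index does.
    accession = int(fields[0])
    # Name and species: second and third fields, if they exist.
    name = fields[1] if len(fields) > 1 else ""
    species = fields[2] if len(fields) > 2 else ""
    if not species:
        # Species not found: mark it UNREADABLE and use the whole line as the name.
        species, name = "UNREADABLE", body
    return accession, name, species


print(parse_dn_pn_comment("> 417909| Better than sliced bread growth factor beta|Mouse|pancreas|"))
print(parse_dn_pn_comment("> 417909| Orphan entry"))
```

    The second call has no species field, so it shows the UNREADABLE fallback.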

    DA and PA

    The D forms designate that the sequence is DNA and will be translated into protein sequence by ProteinProspector programs. The P forms indicate protein sequence.

    Note that the DA and PA set differs from the DN and PN set only in that the accession number can be alphanumeric rather than numeric. This second set is thus more robust. However, for large, frequently updated databases, FA-Index can take an hour to run rather than several minutes, simply because creating the dbfilename.acc file involves sorting strings, which is much slower than sorting integers.

    > SlowSort909| Better than sliced bread growth factor beta|Mouse|pancreas|

    ProteinProspector programs designate:

  • accession number, SlowSort909, as the alphanumeric string before the first "|"
  • name, Better than sliced bread growth factor beta, as the string between the first "|" and second "|" (or the end of the line, if no second "|")
  • species, Mouse, as the string between the second "|" and third "|" (or the end of the line, if no third "|")

    Whenever the species cannot be found, it is assigned as UNREADABLE and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index to the file seqdb/DA.usp or seqdb/PA.usp.

    Any number of proprietary databases may be created with DA, DN, PA or PN prefixes. You must also create species alias lists and accession number links for any databases which you create.

    DDefault and PDefault

    If these prefixes are used then all attempts at trying to extract information from the comment line are abandoned.

    ProteinProspector programs designate:

  • accession number, as the entry number (1 for the first entry, 2 for the second entry, etc).
  • species, as UNREADABLE.
  • name, as the entire comment line.

    DDefault is used for a database containing DNA sequences and PDefault for one containing protein sequences.
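
    The mapping from generic filename prefix to comment line handling can be summarised in a small lookup table. The sketch below assumes a simple case-sensitive "starts with" match, which may not be exactly how the ProteinProspector programs match prefixes; the function name and the example filenames PN.mylab and DDefault.test are hypothetical.

```python
# Generic prefixes from this section: (sequence type, accession number rule).
GENERIC_PREFIXES = {
    "DDefault": ("DNA", "entry number"),
    "PDefault": ("protein", "entry number"),
    "DN": ("DNA", "integer"),
    "PN": ("protein", "integer"),
    "DA": ("DNA", "alphanumeric"),
    "PA": ("protein", "alphanumeric"),
}


def generic_dialect(filename):
    """Return (prefix, sequence type, accession rule), or None if no generic prefix matches."""
    # Try the longer prefixes first so that the Default forms are checked
    # before the two-letter forms.
    for prefix in sorted(GENERIC_PREFIXES, key=len, reverse=True):
        if filename.startswith(prefix):
            return (prefix,) + GENERIC_PREFIXES[prefix]
    return None


for name in ("PN.mylab", "DDefault.test", "Genpept.r95"):
    print(name, generic_dialect(name))
```

    Genpept.r95 returns None here because it uses one of the public database prefixes rather than a generic one.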


    FA-Index output files (the indices)

    Suffix (databasefilename.xxx)    Description

    .idc    Contains a list of byte offsets for the start of the comment line for each entry in the database.
    .idp    Contains a list of byte offsets for the start of the sequences for each entry in the database.
    .idi    Contains the number of entries in the database, the length of the longest comment line in the database and the length of the longest sequence in the database.
    .idx    Used in previous versions of Protein Prospector. Now obsolete.
    .unk    Index which keeps track of all foreign characters in the sequence field for each database entry.
            For protein databases any characters other than the 20 standard amino acids are foreign characters.
            For DNA databases any characters other than A, G, C, T, and N are foreign characters.
            Note that the sequences must be in CAPITAL letters, and in single letter code (some people express amino acids in 3-letter code).
    .mw     Index containing the calculated protein Molecular Weight (MW) of each sequence in the database. For DNA sequences this MW is calculated by translating in frame 1 and ignoring stop codons. The amino acid C is treated as unmodified, the amino acid X is treated as L, the amino acid B is treated as E, the amino acid J is treated as Q. The .mw file is used to accelerate searches that are constrained by intact MW.
    .pi     Index containing the calculated protein pI of each sequence in the database. For DNA sequences this pI is calculated by translating in frame 1 and ignoring stop codons. The amino acid C is treated as unmodified, the amino acid X is treated as L, the amino acid B is treated as E, the amino acid J is treated as Q. The .pi file is used to accelerate searches that are constrained by intact pI.
    .sp     Index containing the Species of each sequence in the database. Used to accelerate searches that are constrained by species.
    .sl     Contains a list in alphabetical order of the text strings used to denote different species. A text string has to occur at least ten times to appear in this file. This file is never used by the ProteinProspector programs. The text strings are the ones you should use in MS-Pattern if you have the Search Mode set to Species.
    .usp    File created to list the comment lines of each entry for which FA-Index cannot read the species. This file is never used by the ProteinProspector programs; it is created only for use by server administrators in troubleshooting species problems.
    .acc    Index of alphanumeric accession numbers, created only for database filename prefixes: Genpept, gen, SwissProt, swp, Owl, owl, DA, PA.
    .acn    Index of integer accession numbers, created only for database filename prefixes: NCBInr, nr, dbEST, dbest, DN, PN.
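
    The residue substitutions described for the .mw index can be illustrated with a short calculation. This is a sketch under stated assumptions, not the FA-Index implementation: it uses commonly quoted average residue masses (whether FA-Index uses average or monoisotopic masses is not stated here), adds one water for the intact chain, and the function name is hypothetical.

```python
# Average residue masses in Da (assumption: average, not monoisotopic, masses).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153

# Substitutions stated for the .mw index: X as L, B as E, J as Q.
SUBSTITUTIONS = {"X": "L", "B": "E", "J": "Q"}


def protein_mw(sequence):
    """Sum the residue masses of an upper-case, single-letter sequence plus one water."""
    total = WATER
    for residue in sequence:
        residue = SUBSTITUTIONS.get(residue, residue)
        # Any other foreign character raises KeyError in this sketch.
        total += AVERAGE_RESIDUE_MASS[residue]
    return total


print(round(protein_mw("MPPSISAFQAAYIGIEVLIALVSVPGNVLVIWAVKVNQALRDATFCFIVSLAVADVAVGA"), 1))
```

    Cysteine is left unmodified, as the table describes. A pI calculation would follow the same substitution rules but needs a set of pK values, which this document does not specify.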


    Ignore/bypass the indices?

    Suffix (databasefilename.xxx)    Bypassable    How to by-pass if possible

    .idc          no     Necessary for any ProteinProspector program that searches/consults a database file.
    .idp          no     Necessary for any ProteinProspector program that searches/consults a database file.
    .idi          no     Necessary for any ProteinProspector program that searches/consults a database file.
    .idx          n/a    Obsolete.
    .unk          no     Necessary for any ProteinProspector program that searches/consults a database file.
    .mw           yes    Select All in the MW search parameters.
    .pi           yes    Select All in the pI search parameters.
    .sp           yes    Select All in the Species search parameters.
    .sl           yes    This file is never used by the ProteinProspector programs; it is used to report the contents of the species fields in the database file.
    .usp          yes    This file is never used by the ProteinProspector programs; it is created only for use by server administrators in troubleshooting species problems.
    .acc, .acn    yes    Don't choose retrieve by Accession number in MS-Digest, or set the search mode to Accession number in MS-Pattern.


    The Browser Version of FA-Index


    Creating Indices for a New Database

    Once you've downloaded a new database into the seqdb directory you need to create the index files described above before you can start to use it. To do this:

    1). Type the name of the database into the Newly Downloaded Database field.

    2). Press the Create Indicies For New Database button.

    3). Update the database list.
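
    To give a feel for what index creation involves, the sketch below scans a FASTA file once and collects the values described earlier for the .idc, .idp and .idi files. It keeps everything in memory, because the on-disk layout of the real index files is not described in this document. The function name is hypothetical, and measuring the comment line with its leading ">" is an assumption.

```python
import io


def index_fasta(stream):
    """Scan a binary FASTA stream; return comment offsets, sequence offsets and a summary."""
    comment_offsets = []    # byte offset of the start of each comment line (.idc)
    sequence_offsets = []   # byte offset of the start of each sequence (.idp)
    longest_comment = 0
    longest_sequence = 0
    current_sequence = 0
    offset = 0
    for raw in stream:
        if raw.startswith(b">"):
            # A new entry begins, so close off the previous sequence length.
            longest_sequence = max(longest_sequence, current_sequence)
            current_sequence = 0
            comment_offsets.append(offset)
            sequence_offsets.append(offset + len(raw))
            longest_comment = max(longest_comment, len(raw.rstrip(b"\r\n")))
        else:
            current_sequence += len(raw.strip())
        offset += len(raw)
    longest_sequence = max(longest_sequence, current_sequence)
    # Entry count, longest comment line, longest sequence (.idi)
    summary = (len(comment_offsets), longest_comment, longest_sequence)
    return comment_offsets, sequence_offsets, summary


demo = io.BytesIO(b">1|a|Mouse\nMPPS\nISAF\n>2|b|Mouse\nACD\n")
print(index_fasta(demo))
```

    With the byte offsets available, a program can seek straight to entry N instead of reading the whole database, which is how an index number recorded during a search can later be turned back into a database entry.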


    Updating the Database and Species Lists in the HTML Forms

    The lists of databases and species used by the forms in the Protein Prospector package are held in JavaScript files; the default location of these files is shown on the FA-Index form. To update the contents of the files, press the Update Database and Species Lists in Forms button. After doing this you will probably have to reload the relevant HTML form before the new lists appear. If this doesn't work, place the cursor in the URL location box of the browser and press return. If even this doesn't work, investigate the cache settings on your browser.

    The species JavaScript file is generated from the information in the species.txt file.

    The JavaScript files are automatically updated after performing the following operations on the FA-Index form:

  • Create Indicies For New Database
  • Create a Pre-Search Subset Database
  • Create Subset Database with Indices from Saved Hits
  • Create or Append to User Database

    You will still have to reload the form as described above to be able to select a newly created database.


    Creating a Pre-Search Subset Database

    ProteinProspector licensees can create their own subset databases which have been pre-filtered for species, species codes, molecular weight, pI and accession number. For example, to create a subset database of human proteins between 1000 and 100000 Da from the SwissProt database:

    1). Choose a suitable suffix for the database such as human.

    2). Select SwissProt.rxx as the existing database.

    3). Select HOMO SAPIENS as the species.

    4). Enter 1000 to 100000 as the MW of the Protein and deselect All.

    5). Press the Create Subset Database button.

    6). Update the database list.

    Using subset databases is likely to dramatically decrease search times.

    This feature is only available to ProteinProspector licensees.
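
    Conceptually, the pre-filter selects the entries whose species and MW fall within the chosen limits. The sketch below shows that selection on a small in-memory list. The record layout, the function name and the example values are hypothetical; treating the MW limits as inclusive and comparing species names without regard to case are assumptions.

```python
def select_subset(entries, species=None, mw_range=None):
    """Return the 1-based entry numbers that pass the species and MW pre-filters.

    Passing None for a filter corresponds to selecting All on the form.
    """
    selected = []
    for number, entry in enumerate(entries, start=1):
        if species is not None and entry["species"].upper() != species.upper():
            continue
        if mw_range is not None:
            low, high = mw_range
            if not (low <= entry["mw"] <= high):
                continue
        selected.append(number)
    return selected


# Hypothetical records standing in for the species and MW index values.
entries = [
    {"species": "HOMO SAPIENS", "mw": 36512.0},
    {"species": "BOS TAURUS", "mw": 36600.0},
    {"species": "HOMO SAPIENS", "mw": 250000.0},
]
print(select_subset(entries, species="HOMO SAPIENS", mw_range=(1000, 100000)))
```

    Because the filter only consults precomputed values, it never has to read the sequences themselves, which is why the .mw, .pi and .sp indices speed up constrained searches.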


    Creating a Subset Database with Indices from Saved Hits

    The Hits (index numbers for matching database entries) from ProteinProspector search programs can be saved to a user-specified file. This file can then be used to create a subset database containing only the Hit proteins from the search.

    1). Choose a suitable suffix for the database. The suffix must be unique; if you use the same suffix twice then the previously created subset database will be overwritten.

    2). Identify the database that was used in the original search.

    3). Identify the file containing the saved hits by entering the Program and File Name.

    4). Press the Create Subset Database with Indices from Saved Hits button.

    5). Update the database list.

    This feature is only available to ProteinProspector licensees.
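
    The idea behind a hits-based subset can be sketched as follows: given the index numbers of the hits, copy only those entries into a new FASTA file. The format of the saved hits file is not described in this document, so the sketch takes the index numbers as a plain list. The function name is hypothetical, and numbering entries from 1 and wrapping sequences at 60 characters are assumptions.

```python
def subset_from_hits(entries, hit_numbers, width=60):
    """Return FASTA text containing only the entries whose 1-based numbers are hits."""
    lines = []
    for number in sorted(set(hit_numbers)):
        comment, sequence = entries[number - 1]
        lines.append(comment)
        # Wrap the sequence to a fixed line width.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "".join(line + "\n" for line in lines)


# Hypothetical database contents as (comment line, sequence) pairs.
database = [
    (">1|first protein|Mouse", "MPPS"),
    (">2|second protein|Mouse", "ACD"),
    (">3|third protein|Mouse", "KLM"),
]
print(subset_from_hits(database, [3, 1]), end="")
```

    The new file would then need its own set of indices before it could be searched.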


    Creating or Appending to a Database Containing User Supplied Protein or DNA Sequences

    It is possible to create your own FASTA format database which can be searched by the ProteinProspector search programs. An entry for a single protein or DNA sequence is made up of a comment line containing accession number, species and name fields followed by one or more lines containing the sequence.

    1). Enter the database name. There are several dialects of FASTA, the essential difference between them being the format of the comment line. You are strongly advised to use a proprietary format, but it is also possible to use a public format. If you choose a database name that already exists on the disk then subsequent proteins will be appended to the end of the file; otherwise a new database file will be created. It is possible to append entries to the end of the publicly available databases but this is not advisable: firstly because the index files are remade after each entry, secondly because newer versions of the database won't contain your entries, and thirdly because any errors in the information you supply when adding the entry could potentially damage the whole database. If you want to use a public database format you should use a database name such as NCBInr.user.

    2). Enter a name for the entry. Whether you are using a proprietary format or a public format, make sure you don't use characters in the name which might give the ProteinProspector programs problems in sorting out the fields in the comment line.

    3). Enter a species for the entry. This should be consistent with the information in the species.txt file.

    4). Enter an accession number for the entry. The accession number must be unique; the program will alert you if it isn't. If your database uses numeric accession numbers then the accession number must be numeric.

    5). Enter the protein or DNA sequence using only the upper case symbols for the 20 naturally occurring amino acids or the four bases as appropriate. X may also be used if the sequence is unknown at a particular point.

    6). Press the Create or Append to User Database button.
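
    The checks implied by the steps above can be sketched for a protein entry in the proprietary PA format (accession number, name and species separated by "|"). This is an illustration only: the function name and example values are hypothetical, and rejecting "|" in the fields is one reading of the advice to avoid characters that would confuse the comment line fields.

```python
# The 20 amino acid symbols plus X for unknown positions, upper case only.
PROTEIN_SYMBOLS = set("ACDEFGHIKLMNPQRSTVWYX")


def make_pa_entry(accession, name, species, sequence, existing_accessions):
    """Validate the fields and return the text of one PA-format entry."""
    if accession in existing_accessions:
        raise ValueError("accession number %s is already in use" % accession)
    for label, value in (("accession", accession), ("name", name), ("species", species)):
        if "|" in value or "\n" in value:
            raise ValueError("%s must not contain '|' or a line break" % label)
    bad = sorted(set(sequence) - PROTEIN_SYMBOLS)
    if bad:
        raise ValueError("unexpected sequence symbols: %s" % "".join(bad))
    return ">%s|%s|%s\n%s\n" % (accession, name, species, sequence)


print(make_pa_entry("User001", "Test protein", "Mouse", "MPPSISAFQX", set()), end="")
```

    A lower-case sequence, or one written in 3-letter code, fails the symbol check, which mirrors the constraints on the sequence field described earlier.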


    Database Summary Report

    The database summary report option is used to list the accession numbers, species and name fields for a selected index number range of a selected database. Deselect the Hide Protein Sequence checkbox if you also want to see the protein sequences. You can also select the DNA Reading Frame if you are looking at a DNA database.


    The Command Line Version of FA-Index

    FA-Index can also be run from the command line. You might want to do this if you want to set up a batch job to automatically update the databases or if running it from the web page interface causes a time out.

    FA-Index and the ProteinProspector Directory Structure

    On all operating systems the FA-Index program is expected to reside in the same directory as all other ProteinProspector programs (i.e. ). FA-Index accepts a single input argument (the name of the database file). Upon execution FA-Index issues an instruction to read the database file from seqdb/database_filename and write the indices to seqdb/database_filename.suffix.

    This can cause some confusion about which directory to launch FA-Index from and the syntax for launching it. We've tried to make this as simple as possible; however, system administrators can easily outsmart themselves, particularly if they want to alter the ProteinProspector directory structure.

    Basically you should launch FA-Index from the directory immediately above the seqdb directory, without specifying the path to the database file. FA-Index inserts only seqdb/ in front of the filename, and it "knows" whether to put a forward slash or a back slash for your particular operating system.
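
    The path handling described above amounts to joining the launch directory, seqdb and the bare filename with the separator appropriate to the operating system. The sketch below uses Python's os.path.join to show the resulting file locations; the function name is hypothetical, and the .acc/.acn files are left out because they depend on the database prefix.

```python
import os

# Index suffixes from the table of FA-Index output files (.idx is obsolete).
INDEX_SUFFIXES = ["idc", "idp", "idi", "unk", "mw", "pi", "sp", "sl", "usp"]


def index_paths(database_filename, launch_dir="."):
    """Return the database path and index file paths relative to the launch directory."""
    # launch_dir is the directory immediately above seqdb.
    database_path = os.path.join(launch_dir, "seqdb", database_filename)
    return database_path, [database_path + "." + suffix for suffix in INDEX_SUFFIXES]


database_path, indices = index_paths("Genpept.r95")
print(database_path)
for path in indices:
    print(path)
```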

    If the FA-Index program does not reside in the directory immediately above the seqdb directory (the normal case on Windows NT systems) then you may need to specify the path to faindex (but not to the database).

    On UNIX systems there is no reason why seqdb cannot be a symbolic link to another directory.

    Running FA-Index

    Examples:

    On SunOS UNIX systems issue a command of the form:
          /home/httpd//faindex.cgi Genpept.r95

    On Windows NT systems use an MS-DOS command prompt to issue a command of the form:
          C:\http> faindex.cgi Genpept.r95
    (you may first need to type)
          path=C:\http\
    or try
          C:\http> \faindex.cgi Genpept.r95
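
    For a batch job, the commands above could be wrapped in a small script. The sketch below is one possible approach, not part of the ProteinProspector distribution. The function names and the paths in the example are hypothetical, and it assumes that faindex.cgi returns a non-zero exit status on failure, which this document does not confirm.

```python
import subprocess


def faindex_command(faindex_path, database_filename):
    """Build the argument list: the program followed by the bare database filename."""
    if "/" in database_filename or "\\" in database_filename:
        raise ValueError("pass the database filename only, without a path")
    return [faindex_path, database_filename]


def reindex_all(faindex_path, database_filenames, launch_dir):
    """Run FA-Index once per database from the directory immediately above seqdb."""
    for name in database_filenames:
        # check=True stops the batch at the first failing command (assumption:
        # FA-Index signals failure through its exit status).
        subprocess.run(faindex_command(faindex_path, name), cwd=launch_dir, check=True)


# Hypothetical installation path; nothing is executed here.
print(faindex_command("/opt/prospector/faindex.cgi", "Genpept.r95"))
```

    Calling reindex_all with the real location of faindex and the directory above seqdb would then rebuild the indices for each listed database in turn.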

    It is now possible to run all the Protein Prospector programs from a command line interface. The parameters for the programs can be specified as name value pairs. In this way you can specify further parameters such as a different path for the seqdb directory. See the Protein Prospector Automation Manual for details.