Description, Instructions, and Tips for FA-Index
Purpose
This document provides instructions for FA-Index.
You need not bother reading this document unless you are administering a server running
the ProteinProspector programs.
Introduction
FA-Index was developed for five main reasons:
- To enable an internal means for the ProteinProspector programs to store an index
number when a hit is recorded during a search, then later use that number to retrieve
that database entry for output/report generation purposes. This cuts down the memory
requirements for program execution.
- To provide indices which can be used to accelerate searches that are pre-filtered
by intact protein MW, protein pI and/or species.
- To aid the ProteinProspector programs in addressing some of the hindrances
inherent in FASTA comment line format heterogeneity.
- To allow users to create subset databases based on either a Species/Protein MW
pre-filter or the results of a previous search. Searches performed on these smaller
databases are often much faster than searches performed on complete databases.
- To allow users to create databases containing user defined proteins.
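The first reason above, storing only an index number at search time and fetching the entry later, can be illustrated with a minimal Python sketch. This is not FA-Index's actual code or file layout: the function names `build_offsets` and `fetch_entry` are hypothetical, and the offsets are held in an in-memory list rather than in index files.

```python
import io

def build_offsets(handle):
    """Return the byte offset of every '>' comment line in a binary FASTA stream."""
    offsets = []
    pos = handle.tell()
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            offsets.append(pos)
        pos += len(line)
    return offsets

def fetch_entry(handle, offsets, index):
    """Return (comment, sequence) for the entry with the given 0-based index."""
    handle.seek(offsets[index])
    comment = handle.readline().decode("ascii").rstrip("\r\n")
    parts = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break  # reached the next entry
        parts.append(line.decode("ascii").strip())
    return comment, "".join(parts)

# Tiny in-memory database standing in for a FASTA file opened in binary mode.
db = io.BytesIO(b">e1 first\nMPPS\nISAF\n>e2 second\nQAAY\n")
offsets = build_offsets(db)
print(fetch_entry(db, offsets, 1))
```

A search only has to remember the integer index of each hit; the full comment line and sequence are read from disk when the report is generated.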
The FASTA format for sequence databases was originally developed by Pearson for use
with the FASTA program. Today it is probably the most widely used standard format,
primarily because its brevity keeps sequence files small.
An example of the format is shown below:
>sp|P28190|AA1R_BOVIN ADENOSINE A1 RECEPTOR.
MPPSISAFQAAYIGIEVLIALVSVPGNVLVIWAVKVNQALRDATFCFIVSLAVADVAVGA
LVIPLAILINIGPRTYFHTCLKVACPVLILTQSSILALLAMAVDRYLRVKIPLRYKTVVT
PRRAVVAITGCWILSFVVGLTPMFGWNNLSAVERDWLANGSVGEPVIECQFEKVISMEYM
VYFNFFVWVLPPLLLMVLIYMEVFYLIRKQLSKKVSASSGDPQKYYGKELKIAKSLALIL
FLFALSWLPLHILNCITLFCPSCHMPRILIYIAIFLSHGNSAMNPIVYAFRIQKFRVTFL
KIWNDHFRCQPAPPIDEDAPAERPDD
As a standard it leaves something to be desired, because the "standard" is that there
is a single comment line per entry which must begin with the ">" character and all
subsequent lines for an entry contain sequence. However, there are many "standards"
as to the arrangement of fields and/or de-limiting of fields in the comment line. Often
the comment line is used to describe basic information like entry name, accession
number (or other unique identifier), and the species or organism from which the sequence
was obtained.
The FASTA format was chosen for use with ProteinProspector primarily because of its universality,
brevity, and expected ease with which database files could be shared on the same computer with
other programs for sequence analysis.
The FA-Index program creates several indices which are much smaller files than the FASTA database
file. These indices aid the ProteinProspector programs in addressing some of the hindrances
inherent in the FASTA comment line format heterogeneity.
There is no reason that we know of that should prevent use of the FASTA database files
by both ProteinProspector programs and other programs which accept FASTA format. Further, we believe
it should be possible for the files to be simultaneously read by more than one program at
a time. It may be of interest to some users that the SEQUEST program from John Yates'
group at the University of Washington also uses FASTA formatted databases.
Often the comment line in a FASTA database is used to describe basic information like
entry name, accession number (or other unique identifier), and the species or organism from which the sequence
was obtained. However, this information is NOT consistently organized into
fields across the comment lines of different FASTA databases, though within a specific database
it is sometimes consistent.
The way ProteinProspector programs "know" which dialect of FASTA to "speak" with a particular database
is via the filename. Acceptable filename prefixes are shown below in bold and the associated
comment line format described.
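As a rough illustration of choosing a dialect from the filename, the Python sketch below matches a database filename against the prefixes that appear as section headings in this document. The function name `dialect_for`, the case-sensitive comparison, and the `None` result for an unrecognised name are assumptions made for the example; the sample filenames are hypothetical.

```python
import os

# Prefixes named in the section headings of this document.
KNOWN_PREFIXES = ("Genpept", "Owl", "SwissProt", "NCBInr", "dbEST", "Ludwignr",
                  "DDefault", "PDefault", "DN", "PN", "DA", "PA")

def dialect_for(path):
    """Return the known prefix that the database filename starts with, else None."""
    name = os.path.basename(path)
    # Try longer prefixes first so a long prefix is never shadowed by a short one.
    for prefix in sorted(KNOWN_PREFIXES, key=len, reverse=True):
        if name.startswith(prefix):
            return prefix
    return None

print(dialect_for("seqdb/SwissProt.sample"))
```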
Genpept
Sample entries:
>gi|216790 (D13314) arginine deiminase [Mycoplasma hominis]
>gi|261706|bbs|120303 (S50809) protein LG=immunoglobulin binding protein {immunoglobulin binding domains} [streptococcus, Peptide Recombinant, 455 aa]
ProteinProspector programs designate:
accession number, 216790 in the first example, as the number after the first | in the line. This
can be delimited by a | or a space.
species, Mycoplasma hominis in the first example, as the string between the last set of
square brackets in the line.
name, arginine deiminase in the first example, as the string between the first space
and the last "[" in the line.
Some entries cause problems:
>gi|3928883 unknown
Previously the accession number was taken to be the number between the first set of round brackets in the line.
However entries like the one above don't have this field. This entry also doesn't contain a species
field.
Whenever the species cannot be found the species is assigned as UNREADABLE, and the name
is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index
to the file seqdb/Genpept....usp.
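The Genpept rules above, including the UNREADABLE fallback, can be sketched in Python as follows. The function name `parse_genpept` is hypothetical and this is not FA-Index's own code. The name rule is applied literally (everything between the first space and the last "["), so a leading field such as (D13314) stays in the name; whether FA-Index strips it is not stated here.

```python
import re

def parse_genpept(comment):
    """Split a Genpept comment line into (accession, species, name)."""
    line = comment.lstrip(">")
    # Accession: the digits after the first "|", ended by "|" or a space.
    match = re.match(r"[^|]*\|(\d+)", line)
    accession = match.group(1) if match else ""
    open_br = line.rfind("[")
    close_br = line.rfind("]")
    first_space = line.find(" ")
    if open_br == -1 or close_br < open_br or first_space == -1:
        # No usable species field: species is UNREADABLE, name is the whole line.
        return accession, "UNREADABLE", comment
    species = line[open_br + 1:close_br].strip()
    name = line[first_space + 1:open_br].strip()
    return accession, species, name

print(parse_genpept(">gi|216790 (D13314) arginine deiminase [Mycoplasma hominis]"))
print(parse_genpept(">gi|3928883 unknown"))
```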
Some other entries are also potentially problematic.
This entry is very long and has been truncated by a > character.
>gi|1387979 (L77099) 44% identity over 302 residues with hypothetical protein from Synechocystis sp, accession D64006_CD; expression induced by environmental stress; some similarity to glycosyl transferases; two potential membrane-spanning helices [Bacillus subtil>
Neither of the following entries contains an easily extractable species.
>gi|1123088 (U42436) coded for by C. elegans cDNA yk56a1.3; coded for by C. elegans cDNA CEMSG41FB; coded for by C. elegans cDNA yk81f4.5; coded for by C. elegans cDNA yk56a1.5; coded for by C. elegans cDNA yk81f4.3; similar to the S5P family of ribosomal proteins
>gi|2330745|gnl|PID|e334350 (Z98598) SPAC1B3.11c, ras-related protein, len:234aa, similar eg. to RB4B_RAT, P51146, ra
This entry doesn't have anything in the name field but the species is OK.
>gi|1575686 (U70379) [Synechococcus PCC7942]
The previously used accession number in the round brackets is identical for these two entries.
>gi|3928875 (AF093611) putative chloroplast desaturase [Acetabularia acetabulum]
>gi|3928876 (AF093611) putative chloroplast desaturase [Acetabularia acetabulum]
This entry contains a zone delimited by [ ] characters which is not at the end of the line and
doesn't contain a species.
>gi|3881286|gnl|PID|e1350785 (AL021507) [980325 dl] : Prediction spanned chimera, modified based on new 3' sequence information (o/l with F14D1); cDNA EST EMBL:D34402 comes from this gene; cDNA EST EMBL:D37454 comes from this gene; cDNA EST EMBL:D68054 comes from this gene; cDNA E>
Owl
This database should now be downloaded from NCBI and the comment line format has changed.
The entries come from four different sources:
>owl|Q62671|100K_RAT 100 KD PROTEIN (EC 6.3.2.-). - RATTUS NORVEGICUS (RAT).
ProteinProspector programs designate:
accession number, 100K_RAT as the string
between the second | and the first space in the line.
species, RAT, as the characters after the underscore
in the accession number.
name: 100 KD PROTEIN (EC 6.3.2.-).
as the string between the first space and the last dash " -" in the line.
>owl|B40638|B40638 isocytochrome c2 - Rhodobacter sphaeroides
ProteinProspector programs designate:
accession number, B40638 as the string
between the second | and the first space in the line.
species, Rhodobacter sphaeroides, as the
text string following the last space-dash-space (" - ") in the line.
name: isocytochrome c2
as the string between the first space and the last dash " -" in the line.
>owl|Z31371|A7120FTSZ1 A7120FTSZ NID: g1100793 - Anabaena PCC7120.
ProteinProspector programs designate:
accession number, A7120FTSZ1 as the string
between the second | and the first space in the line.
species, Anabaena PCC7120. as the text string
following the last space-dash-space (" - ") in the line. Note that
there is a full stop after the species which must be deleted.
name: A7120FTSZ NID: g1100793
as the string between the first space and the last dash " -" in the line.
>owl||NRL_1A00B hemoglobin beta chain mutant (V1M, W37Y) (deoxy), chain B - human
ProteinProspector programs designate:
accession number, NRL_1A00B as the string
between the second | and the first space in the line.
species, human, as the text string following the last
space-dash-space (" - ") in the line.
name: hemoglobin beta chain mutant (V1M, W37Y) (deoxy), chain B
as the string between the first space and the last dash " -" in the line.
Whenever the species cannot be found the species is assigned as UNREADABLE, and the name
is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index
to the file seqdb/Owl....usp.
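A possible Python sketch of the Owl rules above is given below. The function name `parse_owl` and the `species_from_code` switch are hypothetical: the text does not say how FA-Index decides between taking the species from the accession code and taking it from the text after the last " - ", so the caller chooses here. Removing every trailing full stop from the species is also a simplification.

```python
def parse_owl(comment, species_from_code=False):
    """Split an Owl comment line into (accession, species, name).

    species_from_code=True takes the species from the text after the
    underscore in the accession (as in the 100K_RAT example); otherwise the
    species is the text after the last " - " in the line.
    """
    line = comment.lstrip(">")
    head, _, rest = line.partition(" ")
    fields = head.split("|")
    # Accession: the string between the second "|" and the first space.
    accession = fields[2] if len(fields) > 2 else ""
    dash = rest.rfind(" - ")
    if not accession or dash == -1:
        return accession, "UNREADABLE", comment
    name = rest[:dash].strip()
    if species_from_code and "_" in accession:
        species = accession.split("_", 1)[1]
    else:
        # Drop a trailing full stop such as the one in "Anabaena PCC7120."
        species = rest[dash + 3:].strip().rstrip(".")
    return accession, species, name

print(parse_owl(">owl|B40638|B40638 isocytochrome c2 - Rhodobacter sphaeroides"))
```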
A typical comment line causing problems is:
>owl|P15455|12S1_ARATH 12S SEED STORAGE PROTEIN PRECURSOR. - ARABIDOPSIS THALIANA (MOUSE-EAR...
There are three full stops at the end of the line.
Previously the comment lines had the following format:
>10KD_VIGUN 10 KD PROTEIN PRECURSOR (CLONE PSAS10). - VIGNA UNGUICULATA (COWPEA).
>AEOHFPA AEOHFPA NID: g141875 - A.hydrophila DNA, clone pPH4.
>pir|Q62671|100K_RAT 100 KD PROTEIN (EC 6.3.2.-). - RATTUS NORVEGICUS (RAT).
SwissProt
Sample entry as output by the sp2fasta program:
>sp|P16105|H32_BOVIN HISTONE H3 (H3.2)
Sample entry if the database is downloaded from NCBI:
>gi|122068|sp|P16105|H32_BOVIN HISTONE H3 (H3.2)
ProteinProspector programs designate:
accession number, P16105, as the alphanumeric string
between the first sp| and the next | in the comment line.
species, BOVIN, as the string between the underscore and
the space in the next field.
name: HISTONE H3 (H3.2), as the string following the species
Whenever the species cannot be found the species is assigned as UNREADABLE, and the name
is assigned as the entire comment line (this usually does not happen for any entries in SwissProt).
All of these UNREADABLE lines are then written by FA-Index
to the file seqdb/SwissProt....usp.
A few entries are of the following form:
>gi|400027|sp||HYEP_PSESP_2 [Segment 2 of 3] EPOXIDE HYDROLASE (EPOXIDE HYDRATASE)
In these cases the accession number is taken to be the first number, i.e. 400027
for the example shown.
The species is extracted from the code between the last vertical bar and the first space.
It appears after the first underscore and before the second underscore (if present).
The name is the rest of the line.
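The SwissProt rules above, covering both the sp2fasta form and the NCBI form with its empty-accession segment entries, might be sketched in Python like this. The function name `parse_swissprot` is hypothetical, and returning an empty accession in the UNREADABLE case is an assumption of the example.

```python
def parse_swissprot(comment):
    """Split a SwissProt comment line into (accession, species, name)."""
    line = comment.lstrip(">")
    head, _, name = line.partition(" ")
    fields = head.split("|")
    if "sp" not in fields:
        return "", "UNREADABLE", comment
    sp = fields.index("sp")
    if len(fields) < sp + 3 or "_" not in fields[sp + 2]:
        return "", "UNREADABLE", comment
    accession = fields[sp + 1]
    if not accession:
        # Entries such as "sp||HYEP_PSESP_2": use the first number in the line.
        accession = next((f for f in fields if f.isdigit()), "")
    # Species: after the first underscore and before the second, if present.
    species = fields[sp + 2].split("_")[1]
    return accession, species, name.strip()

print(parse_swissprot(">gi|122068|sp|P16105|H32_BOVIN HISTONE H3 (H3.2)"))
```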
NCBInr
The comment lines from this database are tricky to handle because it is a non-redundant
database which collects entries from several databases, thus there are several formats
present in the final database.
Further information is
available from the NCBI site.
1. Entries of the following format are from Genpept:
Note that this format has now been discontinued and all Genpept entries are now in the
format described in section 2. Support for this format will be continued for a while for
people who have an old copy of the database. The corresponding comment lines in the new
format are given in section 2.
>gi|304881 (L07596) alaS [Escherichia coli]
ProteinProspector programs designate:
accession number, 304881, as all consecutive digits following the first "|"
species, Escherichia coli, as the text string inside the last set of square brackets.
name, (L07596) alaS, as the text string between the first space and the last set of square brackets.
Whenever the species cannot be found the species is assigned as UNREADABLE, and the name
is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index
to the file seqdb/NCBInr....usp.
Lines which are too long are terminated by three full stops:
>gi|2429520 (AF025469) Similar to acetyl-CoA carboxylase; coded for by C. elegans cDNA yk16c3.3; coded for by C. elegans cDNA yk36b11.3; coded for by C. elegans cDNA yk43h8.3; coded for by C. elegans cDNA yk24d2.3; coded for by C. elegans cDNA yk24d2.5;...
This line has a space at the end of the species:
>gi|520517 (U10338) RNA polymerase II, largest subunit [Ilyanassa obsoleta ]
The following entry has no species:
>gi|3928883 unknown
Here are some more examples. Note that what is in the species field isn't always a species.
>gi|149575 (M76708) L(+)-lactate dehydrogenase [Lactobacillus casei]
>gi|45803 (X04609) gamma subunit (3'terminus); pid:g45803 [thermophilic bacterium PS3]
>gi|289135 (L10036) unknown [Anabaena PCC7120]
>gi|402254 (U01238) beta subunit of the molybdenum-iron nitrogenase [Frankia sp.]
>gi|414523 (U02284) beta-lactamase [Cloning vector pSP65]
>gi|439619 (L25848) [Salmonella typhimurium IS200 insertion sequence from SARA17, partial.], gene product [Salmonella typhimurium]
>gi|431128 (L15633) start [Transposon Tn916]
>gi|466378 (U07618) SSB [Unknown]
>gi|403947 (U01693) (M90060); Homology to GenBank Accession numORF-X from STRATPASEA [Mycoplasma genitalium]
>gi|405516 (L22217) This ORF is homologous to nitroreductase from Enterobacter cloacae, Accession Number A38686, and Salmonella, Accession Number P15888. [Mycoplasma-like organism]
>gi|457139 (L29100) transposase [Insertion sequence IS150 homolog]
>gi|468279 (L31491) nreA [pTOM9]
>gi|413733 (L25424) orf 1 [Plasmid pCB2.4]
>gi|144453 (M94320) very similar to DNA polymerase of Bacillus subtilis bacteriophage SPO2; potential DNA polymerase; putative [Citrus greening disease-associated bacterium-like organism]
>gi|971400 (X88862) immunogenic polyprotein with 2A protease [Foot-and-mouth disease virus]
>gi|1008449 (L19624) envelope glycoprotein [Human immunodeficiency virus type 1]
>gi|1718307 (U75698) ORF 54; dUTPase homolog; EBV BLLF3 homolog [Kaposi's sarcoma-associated herpesvirus]
>gi|2271117 (AF008696) hemagglutinin [influenza A virus (A/South_Australia/68/92(H3N2))]
>gi|2444119 (U88974) ORF40 [Streptococcus thermophilus temperate bacteriophage O1205]
>gi|2662546 (AF036688) No definition line found [Caenorhabditis elegans]
>gi|4206510 (AF066801) ribulose 1,5-bisphosphate carboxylase [Dictamnus sp. M.W.Chase-1820K]
2. Entries of the following format are from GenBank:
>gi|1680564|gb||S58174_1 (S58174) putative RNA polymerase [Pelargonium leaf curl virus]
gb|accession|locus
ProteinProspector programs designate:
accession number, 1680564, as all consecutive digits following the first "|"
species, Pelargonium leaf curl virus, as the text string inside the last set of square brackets.
name, (S58174) putative RNA polymerase, as the text string between the first space and the last set of square brackets.
Here are some more examples.
>gi|1683178|gb||S69825_2 (S69825) coat/capsid protein [Sweet potato feathery mottle virus (strain CH)]
>gi|1683615|gb||S81342_1 (S81342) unnamed protein product [Mus sp.]
Here are some example entries which have changed from format 1 to this format:
>gi|304881|gb|AAA71918.1| (L07596) alaS [Escherichia coli]
>gi|520517|gb|AAA50229.1| (U10338) RNA polymerase II, largest subunit [Ilyanassa obsoleta]
>gi|3928883|gb|AAC79708.1| unknown
>gi|289135|gb|AAD04186.1| (L10036) unknown [Anabaena PCC7120]
>gi|402254|gb|AAA03325.1| (U01238) beta subunit of the molybdenum-iron nitrogenase [Frankia sp.]
>gi|414523|gb|AAB60535.1| (U02284) beta-lactamase [Cloning vector pSP65].gi|644827|gb|AAA64566.1| (U19867) beta-lactamase [Cloning vector pSPL3]
>gi|431128|gb|AAC36978.1| (L15633) start [Transposon Tn916]
>gi|466378|gb|AAA17041.1| (U07618) SSB [Plasmid R751]
>gi|403947|gb|AAB01006.1| (U01693) (M90060); Homology to GenBank Accession numORF-X from STRATPASEA [Mycoplasma genitalium]
>gi|405516|gb|AAA18506.1| (L22217) This ORF is homologous to nitroreductase from Enterobacter cloacae, Accession Number A38686, and Salmonella, Accession Number P15888. [Phytoplasma sp.]
>gi|457139|gb|AAA98137.1| (L29100) transposase [Bacillus thuringiensis]
>gi|468279|gb|AAA72440.1| (L31491) nreA [Plasmid pTOM9]
>gi|413733|gb|AAA97418.1| (L25424) orf 1 [Plasmid pCB2.4]
>gi|144453|gb|AAA23103.1| (M94320) very similar to DNA polymerase of Bacillus subtilis bacteriophage SPO2; potential DNA polymerase; putative [Citrus greening disease-associated bacterium]
>gi|1008449|gb|AAA78793.1| (L19624) envelope glycoprotein [Human immunodeficiency virus type 1]
>gi|1718307|gb|AAC57136.1| (U75698) ORF 54; dUTPase homolog; EBV BLLF3 homolog [Kaposi's sarcoma-associated herpesvirus].gi|2246506|gb|AAB62631.1| (U93872) ORF 54, dUTPase homolog [Kaposi's sarcoma-associated herpesvirus]
>gi|2271117|gb|AAB66763.1| (AF008696) hemagglutinin [influenza A virus (A/South_Australia/68/92(H3N2))]
>gi|4206510|gb|AAD11686.1| (AF066801) ribulose 1,5-bisphosphate carboxylase [Dictamnus sp. M.W.Chase-1820K]
3. Entries of the following format are from SWISS-PROT:
>gi|132349|sp|P15394|REPA_AGRTU REPLICATING PROTEIN
sp|accession|entry name
ProteinProspector programs designate:
accession number, 132349, as all consecutive digits following the first "|"
species, AGRTU, as the text string between the underscore and the next space when preceded by "sp|...|"
name, REPLICATING PROTEIN, as the text string following the species.
Whenever the species cannot be found the species is assigned as UNREADABLE, and the name
is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index
to the file seqdb/NCBInr....usp.
Lines which are too long are terminated by three full stops:
>gi|123494|sp|P22291|SULD_STRPN BIFUNCTIONAL FOLATE SYNTHESIS PROTEIN (DIHYDRONEOPTERIN ALDOLASE (DHNA) / 2-AMINO-4-HYDROXY-6-HYDROXYMETHYLDIHYDROPTERIDINE PYROPHOSPHOKINASE (7,8-DIHYDRO-6-HYDROXYMETHYLPTERIN PYROPHOSPHOKINASE) (HPPK) (6-HYDROXYMETHYL-7...
A few entries are of the following form:
>gi|4033439|sp||LEC_VICVI_1 [Segment 1 of 4] LECTIN B4 (VVLB4)
In these cases the species field is terminated by an underscore (VICVI for the example
shown).
4. Entries of the following format are GNL Entries:
>gi|216351|gnl|PID|d1003451 (D13793) ORF [Bacillus subtilis]
Here gnl stands for general and the next field, PID in the above case, identifies the
database.
gnl|database|identifier
ProteinProspector programs designate:
accession number, 216351, as all consecutive digits following the first "|"
species, Bacillus subtilis, as the text string inside the last set of square brackets.
name, (D13793) ORF, as the text string between the first space and the last set of square brackets.
5. Entries of the following format are from NBRF PIR:
>gi|282349|pir||A41961 chitinase (EC 3.2.1.14) D - Bacillus circulans
pir||entry
ProteinProspector programs designate:
accession number, 282349, as all consecutive digits following the first "|"
species, Bacillus circulans, as the text string
following the last space-dash-space (" - ") in the line.
name, chitinase (EC 3.2.1.14) D, as the text string between the first space
and the last dash " -" in the line.
Here are some more examples:
>gi|80297|pir||JN0146 hypothetical protein (div+ 3' region) - Bacillus subtilis (fragment)
>gi|77616|pir||A36125 branched-chain amino acid transport protein braC - Pseudomonas aeruginosa (strain PAO)
>gi|538696|pir||A40613 avirulence protein avrRpt2 - Pseudomonas syringae (strain DC3000, pv. tomato)
>gi|98505|pir||S21241 oligo-1,6-glucosidase (EC 3.2.1.10) - Bacillus "thermoamyloliquefaciens" (strain KP1071) (fragment)
>gi|320384|pir||A37388 probable DNA-binding protein 1A - Thermus aquaticus (strain HB8) insertion sequence IS1000
>gi|477498|pir||A49131 releasechannel homolog - fruit fly (Drosophila melanogaster) (fragment)
6. Entries of the following format are GenInfo Backbone Id Entries:
>gi|3712669|bbs|85194 (S85224) vascular endothelial growth factor; VEGF 206 [Homo sapiens]
bbs|number
ProteinProspector programs designate:
accession number, 3712669, as all consecutive digits following the first "|"
species, Homo sapiens, as the text string inside the last set of square brackets.
name, (S85224) vascular endothelial growth factor; VEGF 206, as the text string between the first space and the last set of square brackets.
Sometimes the species field isn't present:
>gi|386067|bbs|133197 cytochrome c3
Sometimes it contains extra text apart from the species name:
>gi|386065|bbs|133195 cytochrome c3 {N-terminal} [Desulfovibrio vulgaris, NCIMB 8303, Peptide Partial, 22 aa]
Sometimes it appears twice:
>gi|236142|bbs|57690 (S57688) EF-G=elongation factor G [Thermotoga maritima, Peptide, 682 aa] [Thermotoga maritima]
Sometimes the species is recorded as unidentified:
>gi|913316|bbs|163145 (S76565) T-cell receptor beta chain VJ region {clone N4} [not specified, vesicular stomatitis virus-specific CTL, Peptide Partial, 15 aa] [unidentified]
Here are a couple of examples where the comment line has been truncated. In such cases
it is terminated by three full stops:
>gi|435743|bbs|139151 (S66567) alpha-atrial natriuretic factor/coat protein, alpha-ANF/coat protein=fusion polypeptide(coat protein, alpha-atrial natriuretic factor, alpha-ANF) [human, bacteriophage fr, expression vector pFAN15, Peptide PlasmidSynthetic...
>gi|833965|bbs|160632 (S75335) polyprotein(structural protein C, structural protein E, structural protein M, structural protein PreM, nonstructural protein NS1) [dengue type 1 D1 virus, Mochizuki, Peptide Partial, 50 aa, segment 2 of 2] [Dengue virus ty...
7. Entries of the following format are from the Brookhaven Protein Data Bank:
>gi|230242|pdb|1PFK|A Escherichia coli
>gi|4139942|pdb|1BC5|T Chain T, Chemotaxis Receptor Recognition By Protein Methyltransferase Cher
>gi|231004|pdb|4ER4|I synthetic construct
>gi|494001|pdb|1EGF| Epidermal Growth Factor (Egf) (Nmr, 16 Structures)
>gi|493782|pdb|146L| Lysozyme (E.C.3.2.1.17) Mutant With Cys 54 Replaced By Thr, Cys 97 Replaced By Ala, Leu 121 Replaced By Met, Ala 129 Replaced By Leu, Leu 133 Replaced By Met, Val 149 Replaced By Ile, Phe 153 Replaced By Trp (C54t,C97a,L121m,A129l,...
>gi|230275|pdb|1R1A|1 Human rhinovirus 1A
pdb|entry|chain
ProteinProspector programs designate:
accession number, 230242 in the first example, as all consecutive digits following the first "|"
species, as UNREADABLE because it isn't reliably positioned within the comment line.
name, as the entire comment line.
All the comment lines of this format are written by FA-Index to the file seqdb/NCBInr....usp.
8. Entries of the following format are from the Protein Research Foundation:
>gi|742246|prf||2009326A beta glucosidase [Cellvibrio gilvus]
prf||name
ProteinProspector programs designate:
accession number, 742246, as all consecutive digits following the first "|"
species, Cellvibrio gilvus, as the text string inside the last set of square brackets.
name, beta glucosidase, as the text string between the first space and the last set of square brackets.
Here is another example.
>gi|225172|prf||1210227A amylase subtilisin inhibitor alpha [Hordeum vulgare var. distichum]
9. Entries of the following format are from the DNA Database of Japan (DDBJ):
>gi|2440229|dbj||AB006689_5 (AB006689) ORF13 [Agrobacterium rhizogenes]
ProteinProspector programs designate:
accession number, 2440229, as all consecutive digits following the first "|"
species, Agrobacterium rhizogenes, as the text string inside the last set of square brackets.
name, (AB006689) ORF13, as the text string between the first space and the last set of square brackets.
dbj|accession|locus
Here is another example.
>gi|1805521|dbj||D90852_18 (D90852) ORF_ID:o250#11; similar to [SwissProt Accession Number P19779]; start codon is not identified yet [Escherichia coli]
10. Entries of the following format are from the EMBL Data Library:
>gi|6|emb|CAA42669.1| (X60065) beta-2-glycoprotein I [Bos taurus]
emb|accession|locus
ProteinProspector programs designate:
accession number, 6, as all consecutive digits following the first "|"
species, Bos taurus, as the text string inside the last set of square brackets.
name, (X60065) beta-2-glycoprotein I, as the text string between the first space and the last set of square brackets.
Sometimes the species field isn't present:
>gi|6065756|emb|CAB58425.1| (AJ238324) Clostridium difficile binary toxin A
Here is an example where the comment line has been truncated. In such cases it is terminated by a > character:
>gi|6018922|emb|CAB58111.1| (AL121806) /prediction=(method:""genefinder"", version:""084"", score:""32.36"")~/prediction=(method:""genscan"", version:""1.0"")~/match=(desc:""EUKARYOTIC TRANSLATION INITIATION FACTOR 4E (EIF-4E) (EIF4E) (MRNA CAP-BINDING PROTEIN) (EIF-4F 25 KD SUBU>
11. Entries of the following format are NCBI Reference Sequences:
>gi|5713315|ref|NP_002060.1| guanine nucleotide binding protein (G protein), alpha inhibiting activity polypeptide 1
ref|accession|locus
ProteinProspector programs designate:
accession number, 5713315 as all consecutive digits following the first "|"
species, as UNREADABLE because it isn't generally present in the comment line.
name, as the entire comment line.
All the comment lines of this format are written by FA-Index to the file seqdb/NCBInr....usp.
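The eleven NCBInr formats above reduce to a handful of rules selected by the database tag that follows the gi number. The Python sketch below shows one way to dispatch on that tag; the function name `parse_ncbinr` and the grouping into `BRACKET_TAGS` are hypothetical, and it does not attempt to handle truncated lines or several entries joined on one comment line.

```python
import re

# Database tags whose species sits in the last pair of square brackets.
BRACKET_TAGS = {"gb", "gnl", "bbs", "prf", "dbj", "emb"}

def parse_ncbinr(comment):
    """Split an NCBInr comment line into (accession, species, name)."""
    line = comment.lstrip(">")
    # Accession: all consecutive digits following the first "|".
    match = re.match(r"gi\|(\d+)", line)
    accession = match.group(1) if match else ""
    head, _, rest = line.partition(" ")
    fields = head.split("|")
    tag = fields[2] if len(fields) > 2 else ""  # "" = old Genpept style
    unreadable = (accession, "UNREADABLE", comment)

    if tag == "sp":
        # Species is the part of the entry name after the first underscore.
        code = fields[4] if len(fields) > 4 else ""
        if "_" not in code:
            return unreadable
        return accession, code.split("_")[1], rest.strip()

    if tag == "pir":
        # Species follows the last space-dash-space.
        dash = rest.rfind(" - ")
        if dash == -1:
            return unreadable
        return accession, rest[dash + 3:].strip(), rest[:dash].strip()

    if tag in BRACKET_TAGS or tag == "":
        open_br, close_br = rest.rfind("["), rest.rfind("]")
        if open_br == -1 or close_br < open_br:
            return unreadable
        return accession, rest[open_br + 1:close_br].strip(), rest[:open_br].strip()

    # pdb, ref and any unrecognised tag: species is not reliably present.
    return unreadable

print(parse_ncbinr(">gi|282349|pir||A41961 chitinase (EC 3.2.1.14) D - Bacillus circulans"))
```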
dbEST
This database wins the booby prize as the one with the least consistent comment lines.
Sample entry:
>gi|1705383|gb|N20717|N20717 SMNHADA002044SK SmAW Schistosoma mansoni cDNA 5'
ProteinProspector programs designate:
accession number, 1705383, as all consecutive digits following "gi|"
species, Schistosoma mansoni; since this database is so haphazard in its placement
of the species, FA-Index does a string search in the line after first consulting
the file dbEST.spl.txt
for valid species names. The string search method is possible with this particular
database because there is a more limited range of species represented. However, this
means that a server administrator needs to keep the dbEST.spl.txt file up to date to
ensure continuous high quality species searching of dbEST with ProteinProspector
programs. This task, though annoying, is made somewhat easier by consulting the
seqdb/dbEST.usp file.
Whenever the species cannot be found the species is assigned as UNREADABLE, and the name
is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index
to the file seqdb/dbEST.usp.
name, N20717 SMNHADA002044SK SmAW Schistosoma mansoni cDNA 5', as the string
following the first space.
Sometimes the comment lines are very long and appear to consist of two comment lines appended
together. The two comment lines are separated by a non-printable binary character (ASCII code Control A)
shown here as a full stop. In such cases ProteinProspector only considers the first part of
the comment line.
>gi|3771232|gb|AI209290|AI209290 SWOvAFCAP09G09SK Onchocerca volvulus adult female cDNA (SAW98MLW-OvAF) Onchocerca volvulus cDNA clone SWOvAFCAP09G09 5', mRNA sequence [Onchocerca volvulus].gi|3789602|gb|AI216948|AI216948 SWOvAFCAP10G11SK Onchocerca volvulus adult female cDNA (SAW98MLW-OvAF) Onchocerca volvulus cDNA clone SWOvAFCAP10G115', mRNA sequence [Onchocerca volvulus]
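The dbEST handling described above (keep only the text before the Control A character, then search it for a known species name) could look roughly like the Python sketch below. The function name `parse_dbest` is hypothetical, the species list is passed in as a plain Python list because the layout of dbEST.spl.txt is not described here, and preferring the longest matching name is an assumption. The name follows the stated rule (the text after the first space).

```python
def parse_dbest(comment, known_species):
    """Split a dbEST comment line into (accession, species, name)."""
    # Only the part before the first Control A (ASCII 1) is considered.
    first = comment.split("\x01", 1)[0]
    line = first.lstrip(">")
    fields = line.split("|")
    # Accession: the digits following "gi|".
    accession = fields[1] if len(fields) > 1 and fields[1].isdigit() else ""
    # Case-sensitive substring search for each known species name.
    hits = [s for s in known_species if s in line]
    if not hits:
        return accession, "UNREADABLE", first
    species = max(hits, key=len)  # assumption: the longest match wins
    _, _, name = line.partition(" ")
    return accession, species, name

species_list = ["Schistosoma mansoni", "Onchocerca volvulus"]
print(parse_dbest(">gi|1705383|gb|N20717|N20717 SMNHADA002044SK SmAW Schistosoma mansoni cDNA 5'", species_list))
```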
Ludwignr
This is another non-redundant database. The entries are of the following format:
db|accno|ID|CRC Description[species]
db - database
CRC - 64-bit cyclic redundancy check
Here are some examples, one from each of the components of the database:
>gp|M84711|182775|000037AE195F7A9D v-fos transformation effector protein [Homo sapiens]
>gp|AL391014|9716128|0006579AD1B1EEE8 putative DNA-binding protein [Streptomyces coelicolor A3(2)]
>pir|A91719|GGIC1A|0027F62F6F36BA36 globin CTT-IA - midge (Chironomus thummi thummi)[Chironomus thummi thummi]
>pir|JX0361|JX0361|00013E4475F84453 subtilisin-trypsin inhibitor, SIL10 - Streptomyces sp.[Streptomyces sp.]
>pir|JC7193|PC7055|0154D83E82AA822B cell division protein FtsQ - Streptomyces collinus (fragment)[Streptomyces collinus]
>pir|A29526|A29526|02AC2025766BCBC7 ubiquitin B processed pseudogene - human[Homo sapiens]
>sp|P55820|SN25_RABIT|00014F740FEB29C5 (SNAP..)SYNAPTOSOMAL-ASSOCIATED PROTEIN 25 (SNAP-25) (SUPER PROTEIN) (SUP) (FRAGMENTS).[Oryctolagus cuniculus]
>sp_vs|P16157-01|P16157|004EDB42F81EBDE8 ISOFORM 2.2 OF P16157[Homo sapiens]
>tr|AF247519|AAF71733|0001F06BB33BD2E8 Gag protein (Fragment).[Human immunodeficiency virus type 1]
>tr|U83613|O09751|0000148C132C06BD (POL)REVERSE TRANSCRIPTASE (FRAGMENT).[Human immunodeficiency virus type 1]
>tr_vs|P70390-01|P70390|0172F8C6825A0023 ISOFORM OG12B/PRX3B OF P70390[Mus musculus]
>wp|CE24847|C44C3.3|0205CAE438EE8B14 (ST.LOUIS) TR:P91157 protein_id:AAB37360.1[C. elegans]
>yp|ORFP:YDR094W|0642CC1F954A58E2 YDR094W, Chr IV from 635833-636168[S. cerevisiae]
ProteinProspector programs designate (example from first line above):
accession number, M84711 as the text string between the first
vertical bar and the next vertical bar.
species, Homo sapiens as the text string inside the last set of square brackets.
name, v-fos transformation effector protein,
as the text string between the first space and the last set of square brackets.
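A small Python sketch of the Ludwignr layout follows. The names `LudwigEntry` and `parse_ludwignr` are hypothetical. Two points are assumptions rather than documented behaviour: the last header field is treated as the CRC (the yp example has only three header fields), and the UNREADABLE fallback is borrowed from the other database sections.

```python
from typing import NamedTuple

class LudwigEntry(NamedTuple):
    db: str
    accession: str
    entry_id: str
    crc: str
    name: str
    species: str

def parse_ludwignr(comment):
    """Split a Ludwignr comment line of the form db|accno|ID|CRC Description[species]."""
    line = comment.lstrip(">")
    head, _, rest = line.partition(" ")
    fields = head.split("|")
    db = fields[0]
    # Accession: the string between the first and second vertical bars.
    accession = fields[1] if len(fields) > 1 else ""
    # Assumption: the last header field is the CRC; ID is present only with 4 fields.
    crc = fields[-1] if len(fields) > 2 else ""
    entry_id = fields[2] if len(fields) > 3 else ""
    open_br, close_br = rest.rfind("["), rest.rfind("]")
    if open_br == -1 or close_br < open_br:
        return LudwigEntry(db, accession, entry_id, crc, comment, "UNREADABLE")
    return LudwigEntry(db, accession, entry_id, crc,
                       rest[:open_br].strip(), rest[open_br + 1:close_br].strip())

entry = parse_ludwignr(">gp|M84711|182775|000037AE195F7A9D v-fos transformation effector protein [Homo sapiens]")
print(entry.accession, entry.species, entry.name, sep=" | ")
```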
Often the comment line in a FASTA database is used to describe basic information like
entry name, accession number (or other unique identifier), and the species or organism from which the sequence
was obtained. With well curated databases, this information is consistently
organized into fields in the comment line of a FASTA formatted database.
For ProteinProspector programs the sequence field is subject to only two constraints:
1) it must be in CAPITAL letters, and 2) it must be in single-letter code
(some people express amino acids in 3-letter code).
The way ProteinProspector programs "know" which dialect of FASTA to "speak" with a
particular database's comment line
is via the filename. Generic filename prefixes are shown below in bold and the associated
comment line format described. These formats are handled in a relatively robust manner,
to allow for the absence of fields or the presence of additional fields. The formats basically consist
of "|" delimited fields of accession number, name, and species in that order.
DN and PN
The D forms designate that the sequence is DNA and will be translated into protein sequence
by ProteinProspector programs. The P forms indicate protein sequence.
> 417909| Better than sliced bread growth factor beta|Mouse|pancreas|
ProteinProspector programs designate:
accession number, 417909, as the integer before the first "|"
name, Better than sliced bread growth factor beta, as the string between the
first "|" and second "|" (or the end of the line, if no second "|")
species, Mouse, as the string between the
second "|" and third "|" (or the end of the line, if no third "|")
Whenever the species cannot be found the species is assigned as UNREADABLE, and the name
is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index
to the file seqdb/DN.usp, or seqdb/PN.usp.
DA and PA
The D forms designate that the sequence is DNA and will be translated into protein sequence
by ProteinProspector programs. The P forms indicate protein sequence.
Note that the DA and PA differ from the DN and PN set only in that the accession number
can be alphanumeric rather than numeric. This second set is thus more robust. However, for
large, frequently updated databases FA-Index can take an hour to run rather than several minutes
simply because creation of the dbfilename.acc file involves the much slower process of
sorting strings rather than integers.
> SlowSort909| Better than sliced bread growth factor beta|Mouse|pancreas|
ProteinProspector programs designate:
accession number, SlowSort909, as the alphanumeric string before the first "|"
name, Better than sliced bread growth factor beta, as the string between the
first "|" and second "|" (or the end of the line, if no second "|")
species, Mouse, as the string between the
second "|" and third "|" (or the end of the line, if no third "|")
Whenever the species cannot be found the species is assigned as UNREADABLE, and the name
is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index
to the file seqdb/DA.usp or seqdb/PA.usp.
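The generic "|"-delimited layout shared by the DN, PN, DA and PA prefixes can be sketched in Python as below. The function name `parse_generic` is hypothetical, and the accession is simply returned as a string; how FA-Index treats a non-numeric accession in a DN or PN database is not stated here.

```python
def parse_generic(comment):
    """Split a DN/PN/DA/PA comment line into (accession, species, name)."""
    fields = [f.strip() for f in comment.lstrip(">").split("|")]
    accession = fields[0]
    # Name: between the first and second "|", or to the end of the line.
    name = fields[1] if len(fields) > 1 else ""
    # Species: between the second and third "|", or to the end of the line.
    species = fields[2] if len(fields) > 2 else ""
    if not species:
        # Species missing: species is UNREADABLE, name is the whole line.
        return accession, "UNREADABLE", comment
    return accession, species, name

print(parse_generic("> SlowSort909| Better than sliced bread growth factor beta|Mouse|pancreas|"))
```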
Any number of proprietary databases may be created with DA, DN, PA or PN prefixes. You must also create
species alias lists and accession number links for any databases which you create.
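The field rules for the DN/PN and DA/PA prefixes can be sketched in Python as follows. This is only an illustration of the rules as described above, not the FA-Index source: the function name parse_comment_line and the numeric_accession flag are hypothetical, and the trimming of the leading ">" and surrounding spaces is an assumption.

```python
UNREADABLE = "UNREADABLE"


def parse_comment_line(line, numeric_accession):
    """Split a proprietary-format comment line into (accession, name, species).

    numeric_accession=True mimics the DN/PN rules (integer accession);
    False mimics the DA/PA rules (alphanumeric accession).
    """
    # Assumption: the leading ">" and surrounding spaces are not part of any field.
    text = line.lstrip(">").strip()
    fields = text.split("|")

    accession = fields[0].strip()
    if numeric_accession:
        # Assumption: a non-integer accession raises ValueError here.
        accession = int(accession)

    # Species is the text between the second and third "|", or up to the end
    # of the line if there is no third "|".
    species = fields[2].strip() if len(fields) > 2 else ""
    if not species:
        # Species not found: the name becomes the entire comment line.
        return accession, text, UNREADABLE

    name = fields[1].strip()
    return accession, name, species
```

For example, calling parse_comment_line on the DN/PN example line with numeric_accession=True yields the accession 417909, the name "Better than sliced bread growth factor beta" and the species "Mouse". A line with no species field is returned with the species UNREADABLE, which is the case FA-Index records in the .usp file.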
DDefault and PDefault
If these prefixes are used then all attempts at trying to extract information from the
comment line are abandoned.
ProteinProspector programs designate:
- accession number, as the entry number (1 for the first entry, 2 for the second entry, etc.).
- species, as UNREADABLE.
- name, as the entire comment line.
DDefault is used for a database containing DNA sequences and PDefault for one containing protein
sequences.
| Suffix (databasefilename.xxx) | Description |
|---|---|
| .idc | Contains a list of byte offsets for the start of the comment line for each entry in the database. |
| .idp | Contains a list of byte offsets for the start of the sequences for each entry in the database. |
| .idi | Contains the number of entries in the database, the length of the longest comment line in the database and the length of the longest sequence in the database. |
| .idx | Used in previous versions of Protein Prospector. Now obsolete. |
| .unk | Index which keeps track of all foreign characters in the sequence field for each database entry. For protein databases any characters other than the 20 standard amino acids are foreign characters. For DNA databases any characters other than A, G, C, T, and N are foreign characters. Note that the sequences must be in CAPITAL letters, and in single-letter code (some people express amino acids in 3-letter code). |
| .mw | Index containing the calculated protein Molecular Weight (MW) of each sequence in the database. For DNA sequences this MW is calculated by translating in frame 1 and ignoring stop codons. The amino acid C is treated as unmodified, the amino acid X is treated as L, the amino acid B is treated as E, the amino acid J is treated as Q. The .mw file is used to accelerate searches that are constrained by intact MW. |
| .pi | Index containing the calculated protein pI of each sequence in the database. For DNA sequences this pI is calculated by translating in frame 1 and ignoring stop codons. The amino acid C is treated as unmodified, the amino acid X is treated as L, the amino acid B is treated as E, the amino acid J is treated as Q. The .pi file is used to accelerate searches that are constrained by intact pI. |
| .sp | Index containing the Species of each sequence in the database. Used to accelerate searches that are constrained by species. |
| .sl | Contains a list in alphabetical order of the text strings used to denote different species. A text string has to occur at least ten times to appear in this file. This file is never used by the ProteinProspector programs. The text strings are the ones you should use in MS-Pattern if you have the Search Mode set to Species. |
| .usp | File created to list the comment lines of each entry for which FA-Index cannot read the species. This file is never used by the ProteinProspector programs; it is created only for use by server administrators in troubleshooting species problems. |
| .acc | Index of alphanumeric accession numbers, created only for database filename prefixes: Genpept, gen, SwissProt, swp, Owl, owl, DA, PA. |
| .acn | Index of integer accession numbers, created only for database filename prefixes: NCBInr, nr, dbEST, dbest, DN, PN. |
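The information held in the .idc, .idp and .idi files can be illustrated with a short Python sketch. The on-disk layout of these files is not described in this document, so the sketch simply returns Python lists and a dictionary. The function name build_offsets is hypothetical, and counting the leading ">" in the comment-line length is an assumption.

```python
def build_offsets(fasta_bytes):
    """Return (comment_offsets, sequence_offsets, summary) for FASTA data.

    comment_offsets  : byte offset of each ">" comment line (the .idc idea)
    sequence_offsets : byte offset where each sequence starts (the .idp idea)
    summary          : entry count, longest comment line, longest sequence (.idi)
    """
    comment_offsets = []
    sequence_offsets = []
    longest_comment = 0
    longest_sequence = 0
    current_sequence = 0
    position = 0

    for raw_line in fasta_bytes.splitlines(keepends=True):
        line = raw_line.rstrip(b"\r\n")
        if line.startswith(b">"):
            # Close off the previous entry before starting a new one.
            longest_sequence = max(longest_sequence, current_sequence)
            current_sequence = 0
            comment_offsets.append(position)
            # The sequence begins on the line right after the comment line.
            sequence_offsets.append(position + len(raw_line))
            # Assumption: the length counts the ">" but not the line ending.
            longest_comment = max(longest_comment, len(line))
        else:
            current_sequence += len(line)
        position += len(raw_line)

    longest_sequence = max(longest_sequence, current_sequence)
    summary = {
        "entries": len(comment_offsets),
        "longest_comment": longest_comment,
        "longest_sequence": longest_sequence,
    }
    return comment_offsets, sequence_offsets, summary
```

Storing offsets rather than the entries themselves is what lets a search program record only an index number for a hit and seek directly to that entry later when generating the report.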
| Suffix (databasefilename.xxx) | Bypassable | How to bypass, if possible |
|---|---|---|
| .idc | no | Necessary for any ProteinProspector program that searches/consults a database file. |
| .idp | no | Necessary for any ProteinProspector program that searches/consults a database file. |
| .idi | no | Necessary for any ProteinProspector program that searches/consults a database file. |
| .idx | n/a | Obsolete |
| .unk | no | Necessary for any ProteinProspector program that searches/consults a database file. |
| .mw | yes | Select All in the MW search parameters. |
| .pi | yes | Select All in the pI search parameters. |
| .sp | yes | Select All in the Species search parameters. |
| .sl | yes | This file is never used by the ProteinProspector programs; it is used to report the contents of the species fields in the database file. |
| .usp | yes | This file is never used by the ProteinProspector programs; it is created only for use by server administrators in troubleshooting species problems. |
| .acc .acn | yes | Don't choose retrieve by Accession number in MS-Digest, or set the search mode to Accession number in MS-Pattern. |
Once you've downloaded a new database into the seqdb directory you need to create
the index files described above before you can start to use it. To do this:
1). Type the name of the database into the Newly Downloaded Database field.
2). Press the Create Indicies For New Database button.
3). Update the database list.
The lists of databases and species used by the forms in the Protein Prospector package
are held in JavaScript files; the default
location of these files is shown on the FA-Index form. To update the contents of
the files, press the Update Database and Species Lists in Forms button. After doing this
you will probably have to reload the relevant HTML form before the new lists
appear. If this doesn't work, place the cursor in the URL location box of the browser and press
return. If even this doesn't work, investigate the cache settings on your browser.
The species JavaScript file is generated from the information in the
species.txt file.
The JavaScript files are automatically updated after performing the following operations on
the FA-Index form:
Create Indicies For New Database
Create a Pre-Search Subset Database
Create Subset Database with Indices from Saved Hits
Create or Append to User Database
You will still have to reload the form as described above to be able to select a newly
created database.
ProteinProspector licensees can create their own subset databases which have been pre-filtered
for species, species codes, molecular weight, pI and accession number. For example, to create
a subset database of human proteins between 1000 and 100000 Da from the SwissProt database:
1). Choose a suitable suffix for the database such as human.
2). Select SwissProt.rxx as the existing database.
3). Select HOMO SAPIENS as the species.
4). Enter 1000 to 100000 as the MW of the Protein and deselect All.
5). Press the Create Subset Database button.
6). Update the database list.
Using subset databases is likely to dramatically decrease search times.
This feature is only available to ProteinProspector licensees.
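The idea behind a species and MW pre-filter can be sketched in Python as follows. This is an assumption-laden illustration rather than the FA-Index algorithm: the names protein_mw and select_subset are hypothetical, the table holds standard average residue masses (this document does not say whether FA-Index uses average or monoisotopic masses), and species are compared as case-insensitive exact strings. The substitutions of X as L, B as E and J as Q follow the description of the .mw index.

```python
# Standard average residue masses in Da. Whether FA-Index uses average or
# monoisotopic masses is not stated in this document (assumption).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153

# Substitutions described for the .mw index: X as L, B as E, J as Q.
SUBSTITUTE = {"X": "L", "B": "E", "J": "Q"}


def protein_mw(sequence):
    """Approximate intact MW of an unmodified protein sequence."""
    total = WATER
    for residue in sequence:
        residue = SUBSTITUTE.get(residue, residue)
        # Assumption: any other foreign character contributes no mass.
        total += RESIDUE_MASS.get(residue, 0.0)
    return total


def select_subset(entries, species, mw_low, mw_high):
    """Keep (species, sequence) pairs that match the species and MW window."""
    wanted = species.upper()
    return [
        entry for entry in entries
        if entry[0].upper() == wanted
        and mw_low <= protein_mw(entry[1]) <= mw_high
    ]
```

With the example settings above, select_subset(entries, "HOMO SAPIENS", 1000, 100000) would keep only the human entries whose calculated MW falls inside the window. Because every later search scans only the surviving entries, the subset database is searched faster than the complete one.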
The Hits (index numbers for matching database entries) from ProteinProspector search programs
can be saved to a user-specified file. This file
can then be used to create a subset database containing only the Hit proteins from the search.
1). Choose a suitable suffix for the database. The suffix must be unique; if you use the same
suffix twice then the previously created subset database will be overwritten.
2). Identify the database that was used in the original search.
3). Identify the file containing the saved hits by entering the Program and
File Name.
4). Press the Create Subset Database with Indices from Saved Hits button.
5). Update the database list.
This feature is only available to ProteinProspector licensees.
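Building a subset database from saved hits amounts to copying the entries whose index numbers were recorded. The Python sketch below shows that idea under stated assumptions: the format of the saved hits file is not described here, so the hits are passed in as a plain list of integers, index numbers are taken to be 1-based (matching the entry numbering used for DDefault and PDefault), and the function names are hypothetical.

```python
def split_entries(fasta_text):
    """Split FASTA text into a list of entries, each starting with '>'."""
    entries = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            entries.append([line])
        elif entries:
            entries[-1].append(line)
    return ["\n".join(lines) + "\n" for lines in entries]


def subset_from_hits(fasta_text, hit_indices):
    """Return FASTA text containing only the entries named in hit_indices."""
    entries = split_entries(fasta_text)
    # Assumption: duplicates are collapsed and database order is preserved.
    chosen = sorted(set(hit_indices))
    # Index numbers outside the database are silently ignored (assumption).
    return "".join(entries[i - 1] for i in chosen if 1 <= i <= len(entries))
```

The result would still need to be indexed before it could be searched, which is why the form button creates the subset database together with its indices.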
It is possible to create your own FASTA-format database which can be searched by
the ProteinProspector search programs. An entry for a single protein or DNA sequence is made up of a
comment line containing accession number, species and name fields, followed by one or
more lines containing the sequence.
1). Enter the database name. There are several dialects of FASTA, with the essential difference
between them being the format of the comment line. You are strongly advised to use
a proprietary format but it is also possible to use a
public format. If you choose a database name that already
exists on the disk then subsequent proteins will be appended to the end of the file;
otherwise a new database file will be created. It is possible to append entries to the end of
the publicly available databases but this is not advisable: firstly because the index
files are remade after each entry, secondly because newer versions of the database won't
contain your entries, and thirdly because any errors in the information you supply when
adding the entry could potentially damage the whole database. If you want to use a
public database format you should use a database name such as NCBInr.user.
2). Enter a name for the entry. Whether you are using a proprietary format
or a public format make sure you don't use characters in the
name which might give the ProteinProspector programs problems in sorting out the fields in
the comment line.
3). Enter a species for the entry. This should be consistent with the information in the
species.txt file.
4). Enter an accession number for the entry. The accession number must be unique; the program
will alert you if it isn't. If your database uses numeric accession numbers then the
accession number must be numeric.
5). Enter the protein or DNA sequence using only the upper case symbols for the 20 naturally occurring
amino acids or the four bases as appropriate. X may also be used if the sequence is unknown at a
particular point.
6). Press the Create or Append to User Database button.
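The checks implied by the steps above can be sketched in Python. This is a hypothetical helper, not part of ProteinProspector: the name format_user_entry, the accession|name|species| field order (taken from the proprietary-format example), the 60-character line width, and allowing X in DNA as well as protein sequences are all assumptions.

```python
PROTEIN_SYMBOLS = set("ACDEFGHIKLMNPQRSTVWY") | {"X"}
# Assumption: X is accepted for unknown positions in DNA as well as protein.
DNA_SYMBOLS = set("ACGT") | {"X"}


def format_user_entry(accession, name, species, sequence, existing, is_dna=False):
    """Validate one user entry and return it as FASTA text.

    existing is the set of accession numbers already in the database.
    """
    if str(accession) in existing:
        raise ValueError("accession number already in use")
    if "|" in name or "|" in species:
        # "|" separates the comment-line fields, so it must not appear in them.
        raise ValueError('"|" would break the comment-line fields')
    if not sequence:
        raise ValueError("sequence is empty")

    allowed = DNA_SYMBOLS if is_dna else PROTEIN_SYMBOLS
    bad = sorted(set(sequence) - allowed)
    if bad:
        # Lower case letters and 3-letter codes end up here too.
        raise ValueError("unsupported sequence symbols: " + "".join(bad))

    # Assumption: wrap the sequence at 60 characters per line.
    lines = [sequence[i:i + 60] for i in range(0, len(sequence), 60)]
    return ">{}|{}|{}|\n{}\n".format(accession, name, species, "\n".join(lines))
```

Rejecting a bad entry before it is written matters because, as noted in step 1, an error in an appended entry could potentially damage the whole database.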
The database summary report option is used to list the accession numbers, species and name fields for a
selected index number range of a selected database. Deselect the Hide Protein Sequence checkbox
if you also want to see the protein sequences. You can also select the DNA Reading Frame if you are
looking at a DNA database.
FA-Index can also be run from the command line. You might want to do this if you
want to set up a batch job to automatically update the databases or if running it
from the web page interface causes a time out.
On all operating systems the FA-Index program is expected to reside in the same directory as all other ProteinProspector
programs (i.e. ). FA-Index accepts a single input argument (the name of the database file). Upon execution FA-Index
issues an instruction to read the database file from seqdb/database_filename and write
the indices to seqdb/database_filename.suffix.
This can cause confusion about which directory to launch FA-Index from and the syntax for launching it.
We've tried to make this as simple as possible; however, system administrators can
easily outsmart themselves, particularly if they want to alter the ProteinProspector
directory structure.
Basically you should launch FA-Index from the directory immediately above the
seqdb directory, without specifying the path to the database file. FA-Index
inserts only seqdb/ in front of the filename, and it "knows" whether to put
a forward slash or a back slash for your particular operating system.
If the FA-Index program does not reside in the directory immediately above the seqdb
directory (the normal case on Windows NT systems) then you may need to specify the path
to faindex (but not to the database).
On UNIX systems there is no reason why seqdb cannot be a symbolic link to another
directory.
Examples:
On SunOS UNIX systems issue a command of the form:
      /home/httpd//faindex.cgi Genpept.r95
On Windows NT systems use an MS-DOS command prompt to issue a command of
the form:
      C:\http> faindex.cgi Genpept.r95
(you may first need to type)
      path=C:\http\
or try
      C:\http> \faindex.cgi Genpept.r95
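For a batch job, the launch rules above can be wrapped in a small Python script. This is a sketch under stated assumptions: the function names are hypothetical, the location of faindex.cgi must be supplied by the administrator, and the only argument passed to FA-Index is the database filename, as described above.

```python
import os
import subprocess


def build_faindex_command(faindex_path, database_filename):
    """Return the argument list for one FA-Index run."""
    # FA-Index prepends seqdb/ itself, so the database must be a bare filename.
    has_path = (
        os.path.basename(database_filename) != database_filename
        or "\\" in database_filename
    )
    if has_path:
        raise ValueError("pass the database filename without a path")
    return [faindex_path, database_filename]


def run_faindex(faindex_path, database_filename, launch_dir):
    """Run FA-Index from launch_dir, the directory immediately above seqdb."""
    if not os.path.isdir(os.path.join(launch_dir, "seqdb")):
        raise FileNotFoundError("no seqdb directory under " + launch_dir)
    command = build_faindex_command(faindex_path, database_filename)
    # cwd makes the relative seqdb/ prefix resolve correctly; check=True
    # raises CalledProcessError if FA-Index exits with a non-zero status.
    return subprocess.run(command, cwd=launch_dir, check=True)
```

Setting the working directory in the script, rather than relying on where the batch job happens to start, avoids the launch-directory confusion described above. A symbolic link named seqdb also passes the directory check on UNIX systems.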
It is now possible to run all the Protein Prospector programs from a command line interface. The parameters
for the programs can be specified as name-value pairs. In this way you can specify further parameters
such as a different path for the seqdb directory. See the Protein Prospector
Automation Manual for details.