Protein Prospector Server Administration

Purpose

This document provides instructions for Protein Prospector administrative tasks on both LINUX and Microsoft Windows platforms.

Most of Protein Prospector's configuration files are in the directory. They are all plain text files and must be edited with a text editor, such as Notepad on a Windows platform or vi or emacs on a LINUX platform. Do not use a word processor to edit the files.

A list of all the parameter files is shown below with a link to the relevant manual section.

Configuration files required by all versions of Protein Prospector:

Configuration files used by Batch-Tag/Search Compare:

One other file that you may need to modify is


  1. Obtain FASTA formatted sequence database files for the seqdb directory (specified in the main configuration file):

    Locations to download FASTA formatted database files via ftp:

    •                                uniprot_sprot.fasta  uniprot_trembl.fasta       Combined
      # entries (4.24.2008)                      362,782             5,577,054      5,939,836
      Size in Bytes
      Downloaded File (.gz)                   57,052,256         1,045,825,538
      Uncompressed Database File             166,064,417         2,336,384,305  2,502,448,722
      Protein Prospector acc File              5,330,625            88,121,760     93,926,272
      Protein Prospector idc File              2,902,256            44,616,432     47,518,688
      Protein Prospector idi File                     12                    12             12
      Protein Prospector idp File              2,902,256            44,616,432     47,518,688
      Protein Prospector mw File               5,804,512            89,232,864     95,037,376
      Protein Prospector pi File               5,804,512            89,232,864     95,037,376
      Protein Prospector sl File                  12,091                42,195         47,195
      Protein Prospector sp File               2,895,083            49,217,079     52,509,341
      Protein Prospector usp File                      0                     0              0
      Total Disk Space Requirement           191,715,764         2,741,463,943  2,934,043,670


    • # entries (4.24.2008) 320,363
      Size in Bytes
      Downloaded File (swissprot.gz) 87,957,808
      Uncompressed Database File 156,573,096
      Protein Prospector acc File 5,500,896
      Protein Prospector idc File 2,562,904
      Protein Prospector idi File 12
      Protein Prospector idp File 2,562,904
      Protein Prospector mw File 5,125,808
      Protein Prospector pi File 5,125,808
      Protein Prospector sl File 11,865
      Protein Prospector sp File 2,845,949
      Protein Prospector usp File 0
      Total Disk Space Requirement 180,309,242


    • # entries (4.24.2008) 8,137,734
      Size in Bytes
      Downloaded File (est_human.gz) 1,429,171,219
      Uncompressed Database File 5,251,493,836
      Protein Prospector acn File 65,101,872
      Protein Prospector idc File 65,101,872
      Protein Prospector idi File 12
      Protein Prospector idp File 65,101,872
      Protein Prospector sl File 14
      Protein Prospector sp File 72,128,524
      Protein Prospector usp File 0
      Total Disk Space Requirement 5,518,928,002


    • # entries (4.24.2008) 4,850,258
      Size in Bytes
      Downloaded File (est_mouse.gz) 793,259,019
      Uncompressed Database File 2,975,655,157
      Protein Prospector acn File 38,802,064
      Protein Prospector idc File 38,802,064
      Protein Prospector idi File 12
      Protein Prospector idp File 38,802,064
      Protein Prospector sl File 28
      Protein Prospector sp File 42,541,260
      Protein Prospector usp File 0
      Total Disk Space Requirement 3,134,602,649


    • # entries (4.24.2008) 38,517,496
      Size in Bytes
      Downloaded File (est_others.gz) 7,259,517,702
      Uncompressed Database File 27,332,614,195
      Protein Prospector acn File 308,139,968
      Protein Prospector idc File 308,139,968
      Protein Prospector idi File 12
      Protein Prospector idp File 308,139,968
      Protein Prospector sl File 28,715
      Protein Prospector sp File 374,101,979
      Protein Prospector usp File 0
      Total Disk Space Requirement 28,631,164,805


    • # entries (4.24.2008) 6,468,149
      Size in Bytes
      Downloaded File (nr.gz) 1,531,595,824
      Uncompressed Database File 3,473,723,692
      Protein Prospector acn File 107,801,264
      Protein Prospector idc File 51,745,192
      Protein Prospector idi File 12
      Protein Prospector idp File 51,745,192
      Protein Prospector mw File 103,490,384
      Protein Prospector pi File 103,490,384
      Protein Prospector sl File 719,554
      Protein Prospector sp File 123,120,741
      Protein Prospector usp File 18,535,381
      Total Disk Space Requirement 4,034,371,796


    • # entries (4.24.2008) 13,382,652
      Size in Bytes
      Downloaded File (rel165.fsa_aa.gz) 1,976,647,053
      Uncompressed Database File 4,520,769,427
      Protein Prospector acn File 107,061,216
      Protein Prospector idc File 107,061,216
      Protein Prospector idi File 12
      Protein Prospector idp File 107,061,216
      Protein Prospector mw File 214,122,432
      Protein Prospector pi File 214,122,432
      Protein Prospector sl File 671,143
      Protein Prospector sp File 128,095,968
      Protein Prospector usp File 70,039,278
      Total Disk Space Requirement 5,469,004,340


    • File Description Tag Size in Bytes
      Aaegypti_nr.seq Aedes aegypti from EnsEMBL ens 8,635
      Agambiae_nr.seq Anopheles gambiae from EnsEMBL ens 7,474,716
      Amellifera_nr.seq Apis mellifera from EnsEMBL ens 13,743,616
      Btaurus_nr.seq Bos taurus from EnsEMBL ens 15,412,324
      Cbriggsae_nr.seq Caenorhabditis briggsae from EnsEMBL ens 5,572,499
      Celegans_nr.seq Caenorhabditis elegans from EnsEMBL ens 304,209
      Cfamiliaris_nr.seq Canis familiaris from EnsEMBL ens 16,172,880
      Cintestinalis_nr.seq Ciona intestinalis from EnsEMBL ens 10,745,415
      Cporcellus_nr.seq Cavia porcellus from EnsEMBL ens 9,734,021
      Csavignyi_nr.seq Ciona savignyi from EnsEMBL ens 12,428,143
      Dmelanogaster_nr.seq Drosophila melanogaster from EnsEMBL ens 414,193
      Dnovemcinctus_nr.seq Dasypus novemcinctus from EnsEMBL ens 9,816,788
      Drerio_nr.seq Danio rerio from EnsEMBL ens 15,365,065
      Ecaballus_nr.seq Equus caballus from EnsEMBL ens 15,443,676
      Eeuropaeus_nr.seq Erinaceus europaeus from EnsEMBL ens 10,043,146
      Etelfairi_nr.seq Echinops telfairi from EnsEMBL ens 11,047,074
      Fcatus_nr.seq Felis catus from EnsEMBL ens 8,815,964
      Gaculeatus_nr.seq Gasterosteus aculeatus from EnsEMBL ens 17,265,152
      Ggallus_nr.seq Gallus gallus from EnsEMBL ens 12,988,589
      Hsapiens_nr.seq Homo sapiens from EnsEMBL ens 8,789,424
      Lafricana_nr.seq Loxodonta africana from EnsEMBL ens 10,406,380
      Mdomestica_nr.seq Monodelphis domestica from EnsEMBL ens 23,137,550
      Mlucifugus_nr.seq Myotis lucifugus from EnsEMBL ens 11,143,933
      Mmulatta_nr.seq Macaca mulatta from EnsEMBL ens 21,669,491
      Mmurinus_nr.seq Microcebus murinus from EnsEMBL ens 11,230,548
      Mmusculus_nr.seq Mus musculus from EnsEMBL ens 8,208,552
      Oanatinus_nr.seq Ornithorhynchus anatinus from EnsEMBL ens 15,580,576
      Ocuniculus_nr.seq Oryctolagus cuniculus from EnsEMBL ens 10,189,512
      Ogarnettii_nr.seq Otolemur garnettii from EnsEMBL ens 10,672,503
      Olatipes_nr.seq Oryzias latipes from EnsEMBL ens 15,275,454
      Oprinceps_nr.seq Ochotona princeps from EnsEMBL ens 10,970,832
      Pberghei_nr.seq Plasmodium berghei ANKA from PlasmoDB plasmo 283,581
      Pchabaudi_nr.seq Plasmodium chabaudi from PlasmoDB plasmo 190,435
      Pfalciparum_nr.seq Plasmodium falciparum 3D7 from PlasmoDB plasmo 411,261
      Pknowlesi_nr.seq Plasmodium knowlesi H from PlasmoDB plasmo 4,529,623
      Ppygmaeus_nr.seq Pongo pygmaeus from EnsEMBL ens 14,226,824
      Ptroglodytes_nr.seq Pan troglodytes from EnsEMBL ens 20,749,232
      Pvivax_nr.seq Plasmodium vivax SaI-1 from PlasmoDB plasmo 4,384,825
      Pyoelii_nr.seq Plasmodium yoelii 17XNL from PlasmoDB plasmo 59,354
      Rnorvegicus_nr.seq Rattus norvegicus from EnsEMBL ens 17,694,720
      Saraneus_nr.seq Sorex araneus from EnsEMBL ens 8,941,578
      Scerevisiae_nr.seq Saccharomyces cerevisiae from EnsEMBL ens 23,957
      Stridecemlineatus_nr.seq Spermophilus tridecemlineatus from EnsEMBL ens 10,168,502
      Tbelangeri_nr.seq Tupaia belangeri from EnsEMBL ens 10,448,535
      Tgondii_nr.seq Toxoplasma gondii from PlasmoDB plasmo 6,537,597
      Tnigroviridis_nr.seq Tetraodon nigroviridis from EnsEMBL ens 31,802
      Trubripes_nr.seq Takifugu rubripes from EnsEMBL ens 36,821,428
      Xtropicalis_nr.seq Xenopus tropicalis from EnsEMBL ens 16,617,008
      sludge_aus_nr.seq Australian sludge sludge 9,944,026
      sludge_us1_nr.seq US sludge, Jazz Assembly sludge 4,961,736
      sludge_us2_nr.seq US sludge, Phrap Assembly sludge 8,221,065
      swiss_nr.seq SwissProt + updates sp 175,836,033
      swiss_varsplic_nr.seq SwissProt splice variants sp_vs 17,359,720
      trembl_nr.seq TrEMBL + updates tr 2,287,644,410
      wormpep_nr.seq WormPep from the Sanger center wp 326,234
      yeastpep_nr.seq Yeast ORFs from Stanford yp 139,418
      nr_prot.tar.gz Compressed tarball of above files and documentation 1,311,126,484
      nr_prot.tar Uncompressed tarball 3,006,679,040


    • # entries (4.24.2008) 312,942
      Size in Bytes
      Downloaded File (owl.fasta.Z) 68,452,223
      Uncompressed Database File 126,681,299
      Protein Prospector acc File 5,278,314
      Protein Prospector idc File 2,503,536
      Protein Prospector idi File 12
      Protein Prospector idp File 2,503,536
      Protein Prospector mw File 5,007,072
      Protein Prospector pi File 5,007,072
      Protein Prospector sl File 37,988
      Protein Prospector sp File 3,125,345
      Protein Prospector usp File 3,854,820
      Total Disk Space Requirement 153,998,994

    The UniProtKB database is a concatenation of uniprot_sprot.fasta.gz and uniprot_trembl.fasta.gz from the directory ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase.

    The Ludwignr database is a non-redundant database made up from several smaller databases contained in the directory ftp://ftp.ch.embnet.org/pub/databases/nr_prot. You need to download the ones you are interested in individually and then concatenate them together to make one file. The database files currently have a .seq suffix.

    To concatenate files on the LINUX operating system you can use the cat command from the command line. For Windows you could install cygwin and use its cat command. Alternatively you could use the Windows copy command from a command window. For example:

    		copy file1 + file2 + file3 DestFile
    		copy *.seq FinalDatabase
    		

  2. Uncompress and rename the database files according to the format: UniProt.##, Genpept.##, Owl.##, SwissProt.##, NCBInr.##, dbEST.##, Ludwignr.##, IPI.##. The prefix (UniProt, Genpept, Owl, SwissProt, NCBInr, dbEST, Ludwignr or IPI) is a necessary part of the name because it tells the software which dialect of the FASTA format comment line is used in the database. You may also use the corresponding lowercase prefixes gen, owl, swp, ipi, nr, or dbest. They can also be used for a second database that has the same format as the one named with the uppercase prefix. For more details, please read the FA-Index manual, particularly the file naming sections.

  3. Create indices in the seqdb directory for each database by using the FA-Index program. The indices are necessary for preliminary filtering by species, protein MW and protein pI. FA-Index must be run after each update of a database, even if the update only adds new entries to the end of the original file.

    If you really want to know what FA-Index does and why, please read the manual. Don't even think about trying to use proprietary databases or update databases daily, UNLESS you read the FA-Index manual, particularly the generic database filenaming sections.

    FA-Index will create a file with a .usp suffix (e.g. Genpept.r95.usp) to which it writes the comment line of each FASTA entry whose species it cannot parse. Viewing this file can help anyone using proprietary databases to troubleshoot FASTA format problems.
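
Steps 1 and 2 above (concatenating downloaded files and giving the result a recognised name) can also be scripted. The following is a minimal Python sketch, not part of Protein Prospector. The function names and the example paths are hypothetical, and the prefix check only mirrors the list of prefixes given in step 2; it does not reproduce any further rules from the FA-Index manual.

```python
import shutil
from pathlib import Path

# Prefixes listed in step 2, followed by the lowercase alternatives it mentions.
KNOWN_PREFIXES = (
    "UniProt", "Genpept", "Owl", "SwissProt", "NCBInr", "dbEST", "Ludwignr", "IPI",
    "gen", "owl", "swp", "ipi", "nr", "dbest",
)


def has_known_prefix(filename):
    """Return True if filename has the form <prefix>.<suffix> with a listed prefix."""
    prefix, dot, suffix = filename.partition(".")
    return bool(dot) and bool(suffix) and prefix in KNOWN_PREFIXES


def concatenate(parts, destination):
    """Write each part, in order, to destination and return its path.

    Files are copied in binary mode so that no bytes are added or re-encoded.
    """
    destination = Path(destination)
    if not has_known_prefix(destination.name):
        raise ValueError(f"unrecognised database prefix: {destination.name}")
    with destination.open("wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return destination
```

A call such as concatenate(sorted(Path("downloads").glob("*.seq")), "seqdb/Ludwignr.2008.04.24") would build one Ludwignr file from the downloaded .seq files; both paths are placeholders. FA-Index still has to be run on the result as described in step 3.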


The main Protein Prospector configuration file is info.txt. Although the parameters defined in this file don't need to be in any particular order, it is best to retain the order used in the distributed version of the file. This will make diagnosis easier if problems occur.

The parameters in info.txt are name-value pairs. A name-value pair is a line in the file where the name is followed by a space character and the rest of the line is the value. The value may contain space characters. If just the name is specified then the value is assumed to be an empty string.

For example:

ucsf_banner false

Here ucsf_banner is the name and false is the value.
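
The name-value rule described above is simple enough to sketch in a few lines of Python, which can be handy when checking an info.txt file from a script. This is an illustration only and not the parser Protein Prospector uses. The function name parse_info is hypothetical, and skipping blank lines is an assumption. Values are kept in a list per name because a name such as preload_database may appear on more than one line.

```python
def parse_info(text):
    """Parse name-value lines into a dict mapping each name to a list of values."""
    params = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines carry no parameter
        # The name ends at the first space; the rest of the line is the value.
        # A line with no space gives an empty string as the value.
        name, _, value = line.partition(" ")
        params.setdefault(name, []).append(value)
    return params
```

With this sketch the line "r_command C:\Program Files\R\R-2.2.1\bin\R" yields the name r_command and a value that keeps its embedded space.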

Each parameter has a default value which is used if the parameter is missing from the file. When the parameters are listed below, the default value is listed after the parameter. In some cases the default value is an empty string. Sometimes it is not appropriate to use the default value.

If the parameter is a directory name it is permissible to use UNC paths for Windows systems.

Some of these parameters are relevant to all Protein Prospector installations whereas others are only relevant if the installation includes Batch-Tag searching.

The Sequence Database Directory
name: seqdb
default value: seqdb

This is the directory containing the sequence databases. It is almost always appropriate to specify this. In most cases it is best from a performance point of view to have the sequence databases on a separate disk partition and administrators need to make sure this is big enough for current and likely future needs. One reason for this is to stop the database files becoming fragmented.

The sequence database directory can be on a network drive and UNC paths are permitted. However this is not recommended.

If several Prospector instances have been installed as a computing cluster then it is recommended that each of the cluster nodes has its own sequence database directory with identical copies of any databases used.

The Upload Temporary Directory
name: upload_temp
default value: temp

The MS-Fit Upload and Batch-Tag Web forms both have an Upload Data From File option. When the file is first uploaded it is copied into the upload temporary directory.

By default the upload temporary directory is simply set to the temp directory in the Protein Prospector distribution. If you have the basic Protein Prospector package (without the Batch-Tag option) there is no particular reason to change this. The only relevant program is MS-Fit Upload and this program will delete the file as soon as it has extracted the relevant information from it.

If you are using the Batch-Tag Web program then any successfully uploaded files are copied to a user data repository from the upload temporary directory. Thus it may be appropriate to locate the upload temporary directory on the same disk partition or network drive as the user data repository.

The Maximum Size of an Uploaded File
name: max_upload_content_length
default value: 0

It is possible to restrict the size in bytes of any uploaded file via the max_upload_content_length parameter. If an uploaded file exceeds this length then the search will be rejected and no files will be generated on the system.

If this parameter is set to zero then the size of the uploaded file is not restricted by Protein Prospector. It may however be restricted by the web server software.

The Path of the R Executable
name: r_command
default value:

The R statistics package is used for drawing some of the plots in the Protein Prospector output. In order for this to work the R package needs to be installed and the r_command parameter needs to contain the full path to the R executable file.

For a Windows system this might be:

r_command C:\Program Files\R\R-2.2.1\bin\R

For a LINUX system it could be:

r_command /usr/bin/R

If the r_command parameter is missing from the info.txt file then Protein Prospector assumes that R is not installed and the relevant plots will be missing from the reports.

Whether the UCSF Banner Should Be Displayed
name: ucsf_banner
default value: false

A black UCSF banner can be displayed at the top of the search forms and results pages. You can choose whether or not to display this based on the ucsf_banner parameter. Note that this parameter will not turn the banner on or off on static web pages. To do this you need to modify the html/js/info.js file.

Logging Parameters

It is possible to create log files when search forms are submitted to the server. These can be used to diagnose problems.

The log files are created in a subdirectory of the logs directory. The subdirectory is named after the date the search form was submitted. The date format is yyyy_mm_dd to enable easy sorting of the directories.

Each binary (eg mssearch.cgi, msform.cgi, etc) can write out a log file. This will contain some of the CGI environment variables, the process ID, the program start and end times and optionally the search parameters.

The log files can be automatically deleted after a specified period. For example to delete the log files after 7 days the following name-value pair should be specified:

delete_log_days 7

If the delete_log_days parameter is set to zero the log files are never deleted. This is the default situation.
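
Because the log subdirectories are named yyyy_mm_dd, it is easy to work out from a script which ones have passed a given age. The Python sketch below is only an illustration of that naming scheme. The function names are hypothetical, and treating a directory as expired when it is strictly more than delete_log_days days old is an assumption; the exact cut-off used by Protein Prospector is not stated here.

```python
from datetime import date, datetime

LOG_DIR_FORMAT = "%Y_%m_%d"  # yyyy_mm_dd, as used for the log subdirectories


def log_dir_name(day):
    """Return the log subdirectory name for a given date."""
    return day.strftime(LOG_DIR_FORMAT)


def expired_log_dirs(dir_names, today, delete_log_days):
    """Return the names that are more than delete_log_days days older than today."""
    if delete_log_days <= 0:
        return []  # zero means the log files are never deleted
    expired = []
    for name in dir_names:
        try:
            day = datetime.strptime(name, LOG_DIR_FORMAT).date()
        except ValueError:
            continue  # not a dated log directory, leave it alone
        if (today - day).days > delete_log_days:
            expired.append(name)
    return sorted(expired)
```

The function only reports names; it does not delete anything.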

To write a log file for the mssearch.cgi binary which contains the basic logging information the following name-value pair should be specified:

mssearch_logging true

If you additionally want to record the parameters from the search form in the log file then you also need to specify the following name-value pair:

mssearch_parameter_logging true

The equivalent name-value pairs for msform.cgi and searchCompare.cgi are:

msform_logging true
msform_parameter_logging true
searchCompare_logging true
searchCompare_parameter_logging true

The log files are in XML format. However as they are not valid XML files until the associated search has finished they are first created with a .txt suffix which changes to a .xml suffix at the end of the search. Thus a file with .txt suffix either represents a search that is in progress or one that has failed.

A typical log file name is:

mssearch_000107_4264.xml

Here mssearch is the program binary name, 000107 is the form submission time in hhmmss format and 4264 is the process ID number.
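
The parts of a log file name can be separated with a regular expression. The following Python sketch assumes that every log file name follows the pattern of the example above (binary name, six-digit time, process ID, then a .txt or .xml suffix); the function name parse_log_name is hypothetical.

```python
import re
from datetime import time

# <program>_<hhmmss>_<pid>.<txt|xml>
LOG_NAME = re.compile(
    r"^(?P<program>.+)_(?P<hhmmss>\d{6})_(?P<pid>\d+)\.(?P<suffix>txt|xml)$"
)


def parse_log_name(filename):
    """Split a log file name into program, submission time, pid and status."""
    match = LOG_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a recognised log file name: {filename}")
    hhmmss = match.group("hhmmss")
    return {
        "program": match.group("program"),
        "submitted": time(int(hhmmss[0:2]), int(hhmmss[2:4]), int(hhmmss[4:6])),
        "pid": int(match.group("pid")),
        # A .txt suffix means the search is still running or has failed.
        "finished": match.group("suffix") == "xml",
    }
```

Listing the names for which "finished" is False in an old log directory is one way to find searches that failed.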

Typical contents of a basic log file:

<?xml version="1.0" encoding="UTF-8"?>
<?Tue Apr 01 00:01:07 2008, ProteinProspector Version 5.0.0?>
<program_log>
<pid>4264</pid>
<start_time>Tue Apr 01 00:01:07 2008</start_time>
<SCRIPT_NAME>/prospector/cgi-bin/mssearch.cgi</SCRIPT_NAME>
<REMOTE_HOST></REMOTE_HOST>
<REMOTE_ADDR>127.0.0.1</REMOTE_ADDR>
<HTTP_USER_AGENT>Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.13)
                 Gecko/20080311 Firefox/2.0.0.13
</HTTP_USER_AGENT>
<HTTP_REFERER>http://localhost/prospector/cgi-bin/msform.cgi?form=mspattern</HTTP_REFERER>
<end_time>Tue Apr 01 00:01:46 2008</end_time>
<search_time>39 sec</search_time>
</program_log>
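
A finished log file like the one above can be read with any XML parser. The Python sketch below uses only the standard library and is an illustration rather than a supported tool; the function name read_log is hypothetical. It also handles the optional parameters element that is written when parameter logging is switched on. The values in that element are URL-encoded (for example %20 for a space), so they are decoded, and they are collected in lists because a parameter name can occur more than once.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote


def read_log(xml_text):
    """Return (info, params) from the text of a finished .xml log file.

    info maps top-level element names (pid, start_time, ...) to their text.
    params maps each search parameter name to a list of decoded values.
    """
    root = ET.fromstring(xml_text.encode("utf-8"))
    info = {}
    params = {}
    for child in root:
        if child.tag == "parameters":
            for param in child:
                params.setdefault(param.tag, []).append(unquote(param.text or ""))
        else:
            info[child.tag] = (child.text or "").strip()
    return info, params
```

A log file that still has a .txt suffix is not complete XML, so passing its contents to this function raises xml.etree.ElementTree.ParseError.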

Typical contents of a log file which also contains the search parameters:

<?xml version="1.0" encoding="UTF-8"?>
<?Tue Apr 15 12:57:35 2008, ProteinProspector Version 5.0.0?>
<program_log>
<pid>1612</pid>
<start_time>Tue Apr 15 12:57:35 2008</start_time>
<SCRIPT_NAME>/prospector/cgi-bin/mssearch.cgi</SCRIPT_NAME>
<REMOTE_HOST>127.0.0.1</REMOTE_HOST>
<REMOTE_ADDR>127.0.0.1</REMOTE_ADDR>
<HTTP_USER_AGENT>Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.13)
                 Gecko/20080311 Firefox/2.0.0.13
</HTTP_USER_AGENT>
<HTTP_REFERER>http://localhost/prospector/cgi-bin/msform.cgi?form=msfitstandard</HTTP_REFERER>
<parameters>
<const_mod>Carbamidomethyl%20%28C%29</const_mod>
<data>842.5100%0D%0A
      856.5220%0D%0A
      864.4733%0D%0A
      870.5317%0D%0A
      940.4754%0D%0A
      943.4885%0D%0A
      959.4934%0D%0A
      970.4308%0D%0A
      975.4785%0D%0A
      1045.5580%0D%0A
      1048.5716%0D%0A
      1063.5712%0D%0A
      1064.5892%0D%0A
      1098.6185%0D%0A
      1147.5876%0D%0A
      1163.5996%0D%0A
      1178.6280%0D%0A
      1179.6014%0D%0A
      1187.6316%0D%0A
      1193.5461%0D%0A
      1211.6607%0D%0A
      1248.5664%0D%0A
      1280.5561%0D%0A
      1289.7670%0D%0A
      1314.7019%0D%0A
      1328.6521%0D%0A
      1332.7121%0D%0A
      1360.6820%0D%0A
      1406.6617%0D%0A
      1447.7010%0D%0A
      1459.7311%0D%0A
      1475.7471%0D%0A
      1508.8107%0D%0A
      1576.7986%0D%0A
      1624.7649%0D%0A
      1699.9255%0D%0A
      1721.9134%0D%0A
      1767.9147%0D%0A
      1776.8961%0D%0A
      1783.9077%0D%0A
      1794.8293%0D%0A
      1799.9017%0D%0A
      1816.9798%0D%0A
      1859.8805%0D%0A
      2088.9872%0D%0A
      2211.1046%0D%0A
      2240.1851%0D%0A
      2256.2412%0D%0A
      2284.2079%0D%0A
      2299.2019%0D%0A
      2808.4450%0D%0A
      3156.6352%0D%0A
</data>
<data_format>PP%20M%2FZ%20Charge</data_format>
<data_source>Data%20Paste%20Area</data_source>
<database>SwissProt.2007.12.04</database>
<detailed_report>1</detailed_report>
<dna_frame_translation>3</dna_frame_translation>
<enzyme>Trypsin</enzyme>
<full_pi_range>1</full_pi_range>
<high_pi>10.0</high_pi>
<input_filename>lastres</input_filename>
<input_program_name>msfit</input_program_name>
<instrument_name>ESI-Q-TOF</instrument_name>
<low_pi>3.0</low_pi>
<met_ox_factor>1.0</met_ox_factor>
<min_matches>4</min_matches>
<min_parent_ion_matches>1</min_parent_ion_matches>
<missed_cleavages>1</missed_cleavages>
<mod_AA>Peptide%20N-terminal%20Gln%20to%20pyroGlu</mod_AA>
<mod_AA>Oxidation%20of%20M</mod_AA>
<mod_AA>Protein%20N-terminus%20Acetylated</mod_AA>
<mowse_on>1</mowse_on>
<mowse_pfactor>0.4</mowse_pfactor>
<ms_mass_exclusion>0</ms_mass_exclusion>
<ms_matrix_exclusion>0</ms_matrix_exclusion>
<ms_max_modifications>2</ms_max_modifications>
<ms_max_reported_hits>5</ms_max_reported_hits>
<ms_parent_mass_systematic_error>0</ms_parent_mass_systematic_error>
<ms_parent_mass_tolerance>20</ms_parent_mass_tolerance>
<ms_parent_mass_tolerance_units>ppm</ms_parent_mass_tolerance_units>
<ms_peak_exclusion>0</ms_peak_exclusion>
<ms_prot_high_mass>125000</ms_prot_high_mass>
<ms_prot_low_mass>1000</ms_prot_low_mass>
<msms_deisotope>0</msms_deisotope>
<msms_join_peaks>0</msms_join_peaks>
<msms_mass_exclusion>0</msms_mass_exclusion>
<msms_matrix_exclusion>0</msms_matrix_exclusion>
<msms_peak_exclusion>0</msms_peak_exclusion>
<output_filename>lastres</output_filename>
<output_type>HTML</output_type>
<parent_mass_convert>monoisotopic</parent_mass_convert>
<report_title>MS-Fit</report_title>
<search_name>msfit</search_name>
<sort_type>Score%20Sort</sort_type>
<species>All</species>
<user1_name>Acetyl%20%28K%29</user1_name>
<user2_name>Acetyl%20%28K%29</user2_name>
<user3_name>Acetyl%20%28K%29</user3_name>
<user4_name>Acetyl%20%28K%29</user4_name>
</parameters>
<end_time>Tue Apr 15 12:57:46 2008</end_time>
<search_time>11 sec</search_time>
</program_log>

The Search Timeout in Seconds
name: timeout
default value: 0

The timeout parameter can be used to abort searches that have exceeded a given number of seconds. If this parameter is set to zero then search times are not restricted by Protein Prospector. They may however be restricted by the Web Server software. Note that Batch-Tag search times are never restricted by web server software as they are controlled by a search daemon.

The Root Directory of the Centroid Data File Repository
name: centroid_dir
default value:

This is the root directory for the repository of centroided data. This directory will typically contain a subdirectory for each instrument for which you have centroided data. If you are using several computers in a cluster this parameter will typically be a directory accessible by all computers in the cluster (eg. a UNC directory on a Windows system).

Data uploaded to the server are stored in a separate repository for uploaded files, which is organized by user.

The Root Directory of the Raw Data File Repository
name: raw_dir
default value:

This is the root directory for the repository of raw data. This directory will typically contain the same subdirectories as the repository for centroided data. If you are using several computers in a cluster this parameter will typically be a directory accessible by all computers in the cluster (eg. a UNC directory on a Windows system).

The Root Directory of the Repository for Uploaded Files
name: upload_repository
default value:

The repository for uploaded files is used to store search results files and project files along with data files which are uploaded using the Batch-Tag Web program. Every user has a separate directory where this information is stored. The upload_repository parameter is used to specify the root directory of this repository. If you are using several computers in a cluster this parameter will typically be a directory accessible by all computers in the cluster (eg. a UNC directory on a Windows system).

Whether Multiple Processors are Used for Batch-Tag Searches
name: multi_process
default value: false

Protein Prospector can optionally use MPICH2 to make use of multiple processors and hence speed up Batch-Tag searches. If you are using this then the multi_process parameter should be set to true.

The Maximum Number of MSMS Spectra in a Group in a Batch-Tag Search
name: msms_max_spectra
default value: 500

When doing Batch-Tag searching this is the maximum number of spectra that a single process deals with in a pass through the database. If the MPI option is used then a single search will use multiple processes. Thus the number of passes through the database that are required depends on this parameter, on the number of spectra in the dataset and the number of processes that MPI has been set up to use.
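
As a rough guide to how these three quantities interact, the number of passes can be estimated as shown in the Python sketch below. This rests on assumptions that are not stated above: that every MPI process searches spectra, and that the spectra are shared out evenly so that each pass handles up to msms_max_spectra spectra per process. The real scheduling may differ, so treat the result as an estimate only. The function name is hypothetical.

```python
import math


def estimated_passes(n_spectra, msms_max_spectra=500, n_processes=1):
    """Estimate the number of passes through the database for one search."""
    if msms_max_spectra < 1 or n_processes < 1:
        raise ValueError("msms_max_spectra and n_processes must be at least 1")
    # Assumed capacity of one pass: every process takes a full group of spectra.
    spectra_per_pass = msms_max_spectra * n_processes
    return math.ceil(n_spectra / spectra_per_pass)
```

Under these assumptions, raising msms_max_spectra reduces the number of passes, presumably at the cost of each process having to hold more spectra at once.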

Whether to Duplicate Scans if the Charge isn't Specified in an Uploaded Centroided File
name: duplicate_scans
default value: false

When data files are uploaded using the Batch-Tag Web program they are converted to MGF format before being stored in the upload repository. Generally, when there is no precursor charge specified in the centroid file then the Precursor Charge Range option on the Batch-Tag Web form is used to supply the charges and the MGF file created doesn't contain charge information. If the duplicate_scans parameter is set to true then the MGF file that is created will contain duplicate peak lists for every charge from the Precursor Charge Range and the corresponding charge information will be placed in the MGF file. mzXML files often don't have charge information stored in the file.
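
The effect of duplicate_scans can be pictured with a small Python sketch that writes one spectrum in MGF form. This is not Protein Prospector's converter. The function name and arguments are hypothetical, and the exact lines that Protein Prospector writes to its MGF files are an assumption; only the general MGF keywords (BEGIN IONS, TITLE, PEPMASS, CHARGE, END IONS) are used.

```python
def mgf_entries(title, pepmass, peaks, charge_range, duplicate_scans):
    """Return MGF text for one spectrum that has no precursor charge.

    peaks is a list of (m/z, intensity) pairs. If duplicate_scans is true,
    the peak list is repeated once for every charge in charge_range and a
    CHARGE line is written; otherwise a single entry without a charge is made.
    """
    charges = list(charge_range) if duplicate_scans else [None]
    blocks = []
    for charge in charges:
        lines = ["BEGIN IONS", f"TITLE={title}", f"PEPMASS={pepmass}"]
        if charge is not None:
            lines.append(f"CHARGE={charge}+")
        lines.extend(f"{mz} {intensity}" for mz, intensity in peaks)
        lines.append("END IONS")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"
```

With a Precursor Charge Range of 2 to 3, setting duplicate_scans to true therefore doubles the number of entries for each spectrum that lacks a charge.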

The MPICH2 Run Executable (Windows Only)
name: mpi_run
default value:

If MPICH2 is being used to allow parallel Batch-Tag searches on a Windows platform the mpi_run parameter needs to contain the full path to the mpiexec executable file. This parameter is only relevant if the multi_process parameter is set to true.

A typical definition could be:

mpi_run C:\Program Files\MPICH2\bin\mpiexec.exe

On LINUX installations this is dealt with by the PATH environment variable so this parameter is ignored.

The Arguments Used When Running MPICH2 (Windows Only)
name: mpi_args
default value:

mpi_args contains the command line arguments used by mpiexec when using MPICH2 to run a parallel Batch-Tag search. This parameter is only relevant if the multi_process parameter is set to true.

A typical definition could be:

mpi_args -n 3 -localroot

This parameter is ignored on LINUX installations where the Perl script cgi-bin/mssearchmpi.pl is used instead.

The Minimum Password Length for the Batch-Tag Search Database
name: min_password_length
default value: 0

Users have to log in to do Batch-Tag Searching and to view the results in Search Compare. When creating a new user a password has to be selected. The min_password_length is the minimum number of characters that a password can contain. If this is set to 0 the password field can be left blank if the user doesn't want to protect their data with a password.

The Batch-Tag Search Database Login Parameters

These are the parameters that Protein Prospector uses to log into the Batch-Tag Daemon MySQL database.

name: db_host
default value: localhost

db_host is the computer on which the database resides. If you have several instances of Prospector installed on a computer cluster then this needs to be set to the computer where the database has been installed.

name: db_port
default value: 0

db_port is the port used to access the database. If the default value of 0 is used then the default MySQL port is used.

name: db_name
default value: ppsd

db_name is the database name. You can have more than one database but only one can be used at a time. Generally you should only change this parameter if you know what you are doing.

name: db_user
default value: prospector
name: db_password
default value: pp

db_user and db_password are the user name and password that Protein Prospector uses to log into the Batch-Tag Daemon MySQL database. These parameters are set when the Protein Prospector package is installed; a random password is selected at that time.

The Batch-Tag Daemon Parameters
name: btag_daemon_name
default value (Windows): btag_daemon
default value (UNIX): btag-daemon

For Windows this parameter defines the name of the Batch-Tag Daemon service whereas for UNIX it defines the name of the Batch-Tag Daemon binary.

The only reason for changing this is if you have more than one instance of Protein Prospector installed on the same computer. In this case the daemons would have to have different names.

name: btag_daemon_remote
default value: false

Protein Prospector will normally try to start the Batch-Tag Daemon if you submit a Batch-Tag search and it isn't running. If you set btag_daemon_remote to true then the daemon is assumed to be running on a remote computer so no attempt is made to start it. This makes it possible to set up one computer as a web server and some other computers as compute nodes. These don't even need to have the same Operating Systems running on them. Thus you could have a Windows Web Server that can deal with quantitation and LINUX compute nodes.

name: max_btag_searches
default value: 1

This is the maximum number of Batch-Tag searches that can run at one time on the current computer. If more searches are submitted then they will be placed in a queue. If you want to stop the daemon for any reason but want to make sure any ongoing searches complete you can temporarily set this parameter to 0.

name: email
default value: false

If this parameter is set to true Protein Prospector attempts to send an email to the user once a search has either completed or has been aborted. The computer has to be set up to send email for this to work.

name: server_name
default value: localhost
name: server_port
default value: 80
name: virtual_dir
default value:

These parameters are used to create the URL for running Search Compare when users are sent an email after a Batch-Tag search has finished.
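
How the three parameters fit together can be sketched in Python as follows. This is an illustration, not the code Protein Prospector uses. The function name is hypothetical, the search key is passed in as a plain argument, and the behaviour for an empty virtual_dir (no extra path component) is an assumption.

```python
def results_url(search_key, server_name="localhost", server_port=80, virtual_dir=""):
    """Build a Search Compare results URL from the info.txt parameters."""
    # The port is only written into the URL when it is not the default of 80.
    host = server_name if server_port == 80 else f"{server_name}:{server_port}"
    prefix = f"/{virtual_dir}" if virtual_dir else ""
    return (
        f"http://{host}{prefix}/cgi-bin/msform.cgi"
        f"?form=search_compare&search_key={search_key}"
    )
```

If the URLs in the emails sent to users do not work, comparing them against this pattern is a quick way to see which of the three parameters is wrong.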

For example for a results retrieval URL of:

http://prospector.ucsf.edu/prospector/cgi-bin/msform.cgi?form=search_compare&search_key=Md7XxQhUQ4R7HQ9i

The following parameters would need to be defined:

server_name prospector.ucsf.edu
virtual_dir prospector

whereas a results retrieval URL of:

http://prospector.ucsf.edu:8888/prospector/cgi-bin/msform.cgi?form=search_compare&search_key=Md7XxQhUQ4R7HQ9i

would require:

server_name prospector.ucsf.edu
server_port 8888
virtual_dir prospector
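The way the three parameters combine into the URL can be sketched in Python. This is an illustration inferred from the two examples above, not Protein Prospector's own code. The function name search_compare_url is hypothetical, and leaving the port out of the URL when it is 80 is an assumption based on the first example.

```python
def search_compare_url(search_key, server_name="localhost",
                       server_port=80, virtual_dir=""):
    """Assemble a Search Compare results URL (hypothetical helper)."""
    # Assumption: port 80 is the HTTP default, so it is not written out.
    host = server_name if server_port == 80 else f"{server_name}:{server_port}"
    # An empty virtual_dir contributes no path segment.
    prefix = f"/{virtual_dir}" if virtual_dir else ""
    return (f"http://{host}{prefix}/cgi-bin/msform.cgi"
            f"?form=search_compare&search_key={search_key}")


print(search_compare_url("Md7XxQhUQ4R7HQ9i", "prospector.ucsf.edu", 8888, "prospector"))
```

With server_port left at 80 the same call gives the first URL shown above.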
name: job_status_refresh_time
default value: 5

After a Batch-Tag search is submitted a Job Status page is displayed which reports on the progress of the job. By default the information is updated every 5 seconds. You can change the update rate by changing the job_status_refresh_time parameter.

name: daemon_loop_time
default value: 5

The daemon_loop_time is the time the Batch-Tag Daemon sleeps between the times when it checks if it has anything to do. The default value for this parameter is 5 sec.

name: aborted_jobs_delete_days
default value: 0

Information on aborted searches is kept in a database table. You can delete this information after a certain time via the aborted_jobs_delete_days parameter. If the default value of 0 is used then the information is not deleted.

name: session_delete_days
default value: 0

Every time a user logs into Protein Prospector an entry is added to a table in the Batch-Tag search database. A key into the table is stored in a cookie in the user's browser which is deleted once the user closes the browser. The entries can be deleted from the database after a time controlled by the session_delete_days parameter. Once the entry has been deleted from the database then the user will have to log in again whether or not they have closed the browser. If the default value of 0 is used then the entries are never deleted from the table. A value of 2 is recommended for this parameter.

name: preload_database
default value: none defined

The Batch-Tag Daemon can load sequence databases into a memory mapped file which the database search programs can access. Multiple databases can be preloaded in this way.

For example:

preload_database SwissProt.2007.12.04
preload_database NCBInr.11.Dec.2007

would preload the SwissProt.2007.12.04 and NCBInr.11.Dec.2007 databases into memory mapped files.

The following parameters can be modified whilst the daemon is running.

email
server_name
server_port
virtual_dir
max_btag_searches
daemon_loop_time
session_delete_days
aborted_jobs_delete_days

On Windows systems the file just needs to be saved for it to be reread. Thus you need to be careful when saving the file that there are no errors in it.

On LINUX systems, after saving the file, you also need to send a HUP signal to the btag-daemon process, i.e.:

kill -HUP pid

where pid is the process ID of the btag-daemon process.


The file html/js/info.js controls some aspects of what is displayed on static web pages such as the home page mshome.htm. There are some variables near the top of the file that can be modified.
pubWebServer

If pubWebServer is set to false then the links to FA-Index on static web pages are not shown.

batchMSMSSearching

If batchMSMSSearching is set to false then all the links in the Batch MSMS Searching section of the home page are not shown.

sciexAnalystRawData

If sciexAnalystRawData is set to false then the link to Wiff Read on the home page is not shown.

ABITOFTOFRawData

If ABITOFTOFRawData is set to false then the link to Peak Spotter on the home page is not shown.

ucsfBanner

If ucsfBanner is set to false then the black UCSF area of the web page is not shown on static web pages.

feedbackEmail

The feedbackEmail variable is used to control the email address that users are prompted to send queries to.


The database accession number in the search results has an HTML link to retrieve the complete entry, including comments, from a remote database. In order for this link to be created the programs need to know the URL for the remote database. This is accomplished through parameters contained in the acclinks.txt file. Occasionally the URLs for the remote databases may need to be updated, or new ones added for a new database. This requires editing the acclinks.txt file.

Within the acclinks.txt file an entry for an HTML link from the accession number MUST contain 1 line:

The line must contain the following information:

  1. The prefix name for the database as listed in the HTML input page for each program. The prefix should be long enough to uniquely identify the database or set of databases you wish to refer to.
  2. The URL to link to if the accession number for the entry is added to the end of the URL. The URL addition is internal to the programs and is expected to retrieve a fully annotated entry from a remote database.

    Note that this link need not be to a sequence database. The link could be to whatever a Protein Prospector server administrator specifies.

Example:

Below is an example of the entry for UniProtKB in acclinks.txt:

UniProt http://www.pir.uniprot.org/cgi-bin/upEntry?id=

The lowercase prefixes gen, owl, swp, or nr are intended to be used for a second database that is of the same format as the uppercase one. See Linking for creating links into NCBI databases.

As mentioned above the prefix name can refer to a single database or a set of databases. For example if you have two user created databases called PA3_mouse and PA33_mouse, an entry in the acclinks.txt file of the form:

PA3 some_url_prefix

would give the databases the same accession number link. On the other hand entries of the following form:

PA3 some_url_prefix
PA33 another_url_prefix

would give the databases different accession number links.
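One way to picture the behaviour described above is a longest-prefix lookup. The Python sketch below is only an illustration: the source states the outcome for PA3_mouse and PA33_mouse but not the matching rule, so the longest-match rule, the case-sensitive comparison and the function name find_link_url are all assumptions.

```python
def find_link_url(database_name, acclinks):
    """Return the URL of the longest prefix that matches database_name.

    acclinks maps prefix -> URL, one item per line of acclinks.txt.
    Assumption: when several prefixes match, the longest one wins.
    """
    matches = [prefix for prefix in acclinks if database_name.startswith(prefix)]
    if not matches:
        return None  # no link would be created for this database
    return acclinks[max(matches, key=len)]


shared = {"PA3": "some_url_prefix"}
separate = {"PA3": "some_url_prefix", "PA33": "another_url_prefix"}
print(find_link_url("PA33_mouse", shared))    # same link as PA3_mouse
print(find_link_url("PA33_mouse", separate))  # its own link
```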

Protein Prospector server administrators who find improved options for links to publicly available databases are encouraged to submit the modified parameter files for inclusion in subsequent Protein Prospector releases.


The genelinks.txt file contains the remote database URL definitions from gene names in the Protein Prospector results pages. Currently gene names are only reported in the Search Compare output.

The instructions for modifying this file are essentially the same as those for modifying the acclinks.txt file.

Some examples are given below:

SwissProt http://www.pir.uniprot.org/cgi-bin/upEntry?id=
swp http://www.pir.uniprot.org/cgi-bin/upEntry?id=
UniProt http://www.pir.uniprot.org/cgi-bin/upEntry?id=

The MS-Digest index number in the search results has an HTML link to retrieve an MS-Digest listing for the matched database entry. In order for this link to be created the programs need to know the URL to MS-Digest and some default parameters. This is accomplished through information contained in the idxlinks.txt file. A server administrator can customize these parameters by editing the idxlinks.txt file.

Within the idxlinks.txt file an entry for an HTML link from the MS-Digest index number MUST contain 2 lines:

The lines must contain the following information:

  1. The program name for which the specified HTML link will be created from the index number link in the program's output.
  2. The URL to link to if the enzyme, MS-Digest index number, and modified AA parameters (from MS-Fit only) for the entry are added to the end of the provided URL. The URL addition is internal to the programs and is expected to provide an MS-Digest listing for the database entry corresponding to the index number.

    Note that this link need not be the same for each Protein Prospector program creating the link, and that the MS-Digest parameters can be customized. Furthermore, this link need not be to MS-Digest at all; the link could be to whatever a Protein Prospector server administrator specifies.

Example:

Below is an example of the entries for msfit and mstag in idxlinks.txt:

msfit
MSDIGEST?
mstag
MSDIGEST?mod_AA=Peptide+N-terminal+Gln+to+pyroGlu&mod_AA=Oxidation+of+M&mod_AA=Protein+N-terminus+Acetylated

In order to limit searches to a particular species, or a collection of species, the programs have to correlate the species name selected in the HTML form with the species names in the database entries. This is accomplished through the species alias file species.txt.

There are three types of entry in the species.txt file:

         
  • single species entries
  • multiple species entries
  • excluded species entries

Within the species.txt file a single species entry must contain at least ONE line. Entries are separated by a line with only the ">" symbol.

The first line of an entry contains the species name as it is to appear on the Species menu. All other lines should contain names (aliases) by which the species may be found in the databases. The aliases can be in any order and should be in upper case.

Examples:

HELICOBACTER PYLORI
HELPY
HELICOBACTER PYLORI
>
HOMO SAPIENS
HUMAN
H. SAPIENS
H.SAPIENS
HUMHBC
HOMO SAPIENS
>

In the first example HELPY is a typical SwissProt species alias and HELICOBACTER PYLORI is typical of what might be found in Genpept. A database such as NCBInr, which contains entries from several sources, would typically use several aliases.

Multiple species entries allow you to group species together in a search. A typical example which restricts the search to the HOMO SAPIENS, BOS TAURUS and SUS SCROFA species is:

     
[Mammals]
HOMO SAPIENS
BOS TAURUS
SUS SCROFA
>

The first line of an entry contains the identifier for the multiple species entry as it is to appear on the Species menu. The identifier is enclosed by the '[' character and the ']' character as in the example.

The other lines should contain the names of the species that you wish to include in the search. These can either be multiple or single species entries from the species.txt file.

Excluded species entries allow you to exclude species from a search. A typical example which includes all species except HOMO SAPIENS, BOS TAURUS and SUS SCROFA is:

     
]Model Organisms[
HOMO SAPIENS
BOS TAURUS
SUS SCROFA
>

The menu item is enclosed by the ']' character and the '[' as in the example.

The other lines should contain the names of the species that you wish to exclude. The species that you wish to exclude MUST have single species entries from the species.txt file.
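The three entry types can be told apart mechanically from the first line of each entry. The Python sketch below is a minimal reader for text in the species.txt layout described above. It is not the parser used by Protein Prospector, the function name parse_species is hypothetical, and skipping blank lines is an assumption.

```python
def parse_species(text):
    """Split species.txt-style text into single, multiple and excluded entries."""
    single, multiple, excluded = {}, {}, {}
    entry = []
    # A trailing ">" sentinel closes a final entry that lacks its separator.
    for raw in text.splitlines() + [">"]:
        line = raw.strip()
        if line != ">":
            if line:  # assumption: blank lines carry no meaning
                entry.append(line)
            continue
        if entry:
            name, members = entry[0], entry[1:]
            if name.startswith("[") and name.endswith("]"):
                multiple[name[1:-1]] = members   # [Mammals]
            elif name.startswith("]") and name.endswith("["):
                excluded[name[1:-1]] = members   # ]Model Organisms[
            else:
                single[name] = members           # menu name followed by aliases
        entry = []
    return single, multiple, excluded
```

For a single species entry the value is the list of aliases. For the other two types it is the list of species names to include or exclude.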


Detailed information on all amino acids used in the programs is located on the server in the file aa.txt.

You can edit this file to change the attributes shown below. This is not recommended unless you know what you are doing.

An entry for an amino acid MUST contain 9 lines:
line 1) contains a name for the amino acid. This isn't currently used by the programs.
line 2) contains a single letter code for the amino acid.
line 3) contains the elemental formula of the amino acid.
lines 4) and 5) contain elemental formulae for side-chains that are used in calculating d and w ions. If there are no beta substituents, or they are irrelevant, then use 0 (zero) on these lines.
line 6) contains the pk_C_term for the amino acid.
line 7) contains the pk_N_term for the amino acid.
line 8) contains the pk_acidic_sc for the amino acid. You should enter n/a for not applicable.
line 9) contains the pk_basic_sc for the amino acid. You should enter n/a for not applicable.

The pK values are taken from:

Bjellqvist, B., Hughes, G. J., Pasquali, C., Paquet, N., Ravier, F., Sanchez, J.-C., Frutiger, S., Hochstrasser, D. (1993) The focusing positions of polypeptides in immobilized pH gradients can be predicted from their amino acid sequences, Electrophoresis, Vol. 14, Pp. 1023-1031

Bjellqvist, B., Basse, B., Olsen, E. and Celis, J. E. (1994) Reference points for comparisons of two-dimensional maps of proteins from different human cell types defined in a pH scale where isoelectric points correlate with polypeptide compositions, Electrophoresis, Vol. 15, Pp. 529-539

Below is an example of the entry for Isoleucine:

Isoleucine
I
C6 H11 N1 O1
C1 H3
C2 H5
3.55
7.5
n/a
n/a

Make sure the elements in your amino acid are present in the file elements.txt. See also, To Add/Change Elements.

It is not possible to add new amino acids. The ones currently defined are:

Alanine (A)
Cysteine (C)
Aspartic Acid (D)
Glutamic Acid (E)
Phenylalanine (F)
Glycine (G)
Histidine (H)
Isoleucine (I)
Lysine (K)
Leucine (L)
Methionine (M)
Asparagine (N)
Proline (P)
Glutamine (Q)
Arginine (R)
Serine (S)
Threonine (T)
Valine (V)
Tryptophan (W)
Tyrosine (Y)
Homoserine Lactone (h)
Met Sulfoxide (m)
Phosphorylated Serine (s)
Phosphorylated Threonine (t)
Phosphorylated Tyrosine (y)
Selenocysteine (U)

The file usermod.txt contains the variable modifications used on the search forms. An administrator can add new modifications to this file or edit existing ones.

Within this file an entry for a variable modification MUST contain 3 lines:
line 1) contains a name for the modification;
line 2) contains an elemental formula for the modification (elements can be negative - eg Amidation would be N H O-1);
line 3) contains a list of amino acids/termini to check for the modification.

Although the software doesn't require it, we suggest that the modifications are kept in the same order as in the supplied file, where the modification names are in alphabetic order.

It is strongly recommended that you use names which follow the PSI-MOD standard for naming modifications. You should also check the Unimod website to see if the modification you want to add already has a name. If, after adding a modification, you change either its name or its elemental formula then all previous search results using this modification will be invalid and should be deleted.

Some examples of what line 3) can contain are:

1). Restricting the modification to the protein N or C terminus:

Protein N-term
Protein C-term

2). Restricting the modification to one of a list of amino acids at the protein N or C terminus:

Protein N-term M

3). Modification to the peptide N or C terminus:

C-term
N-term

4). Modification to one of a list of amino acids at the peptide N or C terminus:

N-term Q
C-term M

5). Neutral loss modification:

Neutral loss

6). Modification to one of a list of amino acids:

STY

Below is an example of the entry for Phosphorylation of S, T and Y:

Phospho
P O3 H
STY
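The elemental formula on line 2) is a space-separated list of element symbols, each optionally followed by a count that may be negative. The Python sketch below shows one way to read such a formula. It is an illustration, not Protein Prospector's parser: the function name parse_formula is hypothetical, and the pattern only covers plain one- or two-letter symbols (not stable isotope symbols such as 13C, nor the 0 placeholder used in aa.txt).

```python
import re

# One token: an element symbol, then an optional (possibly negative) count.
_TOKEN = re.compile(r"^([A-Z][a-z]?)(-?\d+)?$")


def parse_formula(formula):
    """Turn a formula such as 'P O3 H' into a dict of element counts."""
    counts = {}
    for token in formula.split():
        match = _TOKEN.match(token)
        if match is None:
            raise ValueError(f"unrecognised formula token: {token!r}")
        symbol, count = match.groups()
        # A symbol without a count stands for one atom.
        counts[symbol] = counts.get(symbol, 0) + (int(count) if count else 1)
    return counts


print(parse_formula("P O3 H"))   # Phospho
print(parse_formula("N H O-1"))  # Amidation, with a negative oxygen count
```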

The list of possible constant modifications is generated automatically from the list of possible variable modifications.


Detailed information on all elements used in the programs is located on the server in the elements.txt file. You must edit this file to add or modify an element.

Within the elements.txt file an entry for an element MUST contain 1 line:

The line contains the following information:
a). The symbol for the element.
b). The valency of the element.
c). The number of isotopes listed on the line.
d). A mass/abundance pair for each isotope.

Below is an example of the entry for hydrogen:

H 1 2 1.007825035 .99985 2.014101779 0.00015
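To make the field layout concrete, the Python sketch below splits an element line into its symbol, valency and mass/abundance pairs, and derives an abundance-weighted average mass from them. The function names are hypothetical and the code is an illustration of the format only, not the reader used by the programs.

```python
def parse_element_line(line):
    """Return (symbol, valency, [(mass, abundance), ...]) for one element line."""
    fields = line.split()
    symbol, valency, n_isotopes = fields[0], int(fields[1]), int(fields[2])
    values = [float(v) for v in fields[3:]]
    if len(values) != 2 * n_isotopes:
        raise ValueError(f"expected {n_isotopes} mass/abundance pairs for {symbol}")
    # Pair up alternating mass and abundance values.
    isotopes = list(zip(values[0::2], values[1::2]))
    return symbol, valency, isotopes


def average_mass(isotopes):
    """Abundance-weighted mean of the isotope masses."""
    return sum(mass * abundance for mass, abundance in isotopes)


symbol, valency, isotopes = parse_element_line(
    "H 1 2 1.007825035 .99985 2.014101779 0.00015")
print(symbol, valency, isotopes, average_mass(isotopes))
```

The first pair is the monoisotopic mass of hydrogen and its abundance. The weighted sum gives the average mass.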

If you add a new element, please submit the modified parameter file for inclusion in subsequent Protein Prospector releases.

Stable Isotope elements may also be added. For example:

2H 1 1 2.014101778 1.0
13C 4 1 13.003354838 1.0
15N 3 1 15.000108898 1.0
18O 2 1 17.999160419 1.0

The masses and isotopic abundances currently used are from:

Audi, G. and Wapstra, A. H. (1995) The 1995 update to the atomic mass evaluation, Nucl. Phys. A, Vol. 595, Pp. 409-480


Detailed information on all enzymatic digests used in the programs is located on the server in the enzyme.txt file. You must edit this file to add or modify the rules for an enzymatic digest.

Within this file an entry for an enzymatic digest MUST contain 4 lines:
line 1) contains a name for the enzymatic digest which will appear on the digest menu;
line 2) contains a list of cleavage amino acids;
line 3) contains a list of exception amino acids (a '-' character indicates no exceptions);
line 4) either C for cleavage on the C terminus side of an amino acid or N for cleavage on the N terminus side.

Below is an example of the entry for Trypsin:

Trypsin
KR
P
C 
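As an illustration of how the four lines might be applied, the Python sketch below digests a sequence with a C-terminal-side rule. It assumes that an exception amino acid blocks cleavage when it immediately follows the cleavage site (the usual reading of the Trypsin/P rule). The source does not spell this out, the function name digest is hypothetical, and N-terminal-side enzymes and missed cleavages are not covered.

```python
def digest(sequence, cleavage_aas, exception_aas):
    """Cleave on the C-terminal side of each cleavage amino acid.

    Assumption: no cleavage occurs if the next residue is an exception
    amino acid. A '-' means there are no exceptions.
    """
    exceptions = "" if exception_aas == "-" else exception_aas
    peptides, start = [], 0
    # The last residue is skipped: cleaving after it would yield nothing.
    for i, residue in enumerate(sequence[:-1]):
        if residue in cleavage_aas and sequence[i + 1] not in exceptions:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    peptides.append(sequence[start:])
    return peptides


print(digest("AKRPGKM", "KR", "P"))  # Trypsin rules from the entry above
```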

The file enzyme_comb.txt is used to specify enzyme combinations. You can combine the cleavage rules for two or more enzymes by having them on the same line in this file separated by a '/' character. For example to have an option which combines the cleavage rules for CNBr and Trypsin you would need the following line:

Trypsin/CNBr

The enzyme combinations will appear on the digest menu after the enzymes that have been defined in the enzyme.txt file.

Any enzyme used in the enzyme_comb.txt must have been defined in the enzyme.txt file.

It is possible to mix enzymes which cleave on the N-terminus side with those that cleave on the C-terminus side.

If you add a new enzymatic digest please submit the modified parameter file for inclusion in subsequent Protein Prospector releases.


The imm.txt file contains the immonium ion elemental formulae and corresponding compositional information for use by Protein Prospector programs.

The first 2 entries in the file are for the immonium tolerance and the minimum fragment ion mass (both in Da). This is followed by a list of immonium ions.

An entry for an immonium ion contains:

1). The elemental formula using elements defined in elements.txt.

2). The compositional information. List all the amino acids corresponding to the elemental formula.

3). Ions labelled as M are major peaks; these are used to include an amino acid when using immonium ions to extract compositional ions in MS-Tag and MS-Seq. Minor ions are labelled m and are only likely to be present alongside major ions. They are reported in the immonium and related ions section of the MS-Product report.

4). Use I if the ion is an immonium ion or - otherwise.

5). A list of amino acids to exclude if the mass is missing or a dash (-) character if there are no amino acids to exclude. Excluding amino acids on the basis of missing peaks is a feature that can be turned off.

The fields must be separated by the | character.

For example:

C2 H6 N O|S|M|I|-
C4 H8 N|P|M|I|P
C4 H8 N|R|M|-|-
C4 H10 N|V|M|I|-
C3 H8 N O|T|M|I|-
C5 H10 N|KQ|M|-|-
C5 H12 N|IL|M|I|IL
C3 H7 N2 O|N|M|I|-
C4 H11 N2|R|M|-|-
C3 H6 N O2|D|M|I|-
C4 H10 N3|R|m|-|-
C5 H13 N2|K|M|I|-
C4 H9 N2 O|Q|M|I|-
C4 H8 N O2|E|M|I|-
C4 H10 N S|M|M|I|-
C5 H8 N3|H|M|I|H
C5 H10 N3|R|M|-|R
C8 H10 N|F|M|I|-
C6 H8 N O2|P|M|-|-
C6 H13 N2 O|K|m|-|-
C5 H9 N2 O2|Q|m|-|-
C8 H10 N O|Y|M|I|-
C6 H8 N3 O|H|m|-|-
C10 H11 N2|W|M|I|-
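The five |-separated fields of an immonium ion line can be unpacked as in the Python sketch below. The class and function names are hypothetical, and the sketch only illustrates the field layout described above.

```python
from typing import NamedTuple


class ImmoniumEntry(NamedTuple):
    formula: str       # field 1: elemental formula
    amino_acids: str   # field 2: amino acids with this formula
    major: bool        # field 3: True for M (major), False for m (minor)
    is_immonium: bool  # field 4: True for I, False for -
    exclude: str       # field 5: amino acids to exclude, '' if none


def parse_immonium_line(line):
    """Split one imm.txt ion line into its five fields."""
    formula, amino_acids, peak, kind, exclude = line.strip().split("|")
    return ImmoniumEntry(formula, amino_acids, peak == "M", kind == "I",
                         "" if exclude == "-" else exclude)


print(parse_immonium_line("C5 H12 N|IL|M|I|IL"))
```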

Suggestions for improving this scheme may be submitted for inclusion in subsequent Protein Prospector releases.


MS-Fit/MS-Bridge/MS-NonSpecific

Edit the fit_graph.par.txt file.

MS-Product/MS-Display

Edit the pr_graph.par.txt file.

MS-Isotope

Edit the sp_graph.par.txt file.

DB-Stat

Edit the dbstat_hist.par.txt file.

Search Compare Discriminant Score Histogram/MS-Tag Score Histogram

Edit the hist.par.txt file.

Search Compare MSMS Precursor Mass Error Histogram

Edit the error_hist.par.txt file.

The graphs in the package are Java applets which use the information in their corresponding parameter file to control their appearance.

The files contain comment lines (starting with a # character) explaining the information fields beneath them. The following information is stored in each file:

  • The graph width in pixels.
  • The graph height in pixels.
  • The width of the graph axes and the lines used to draw the graph in pixels.
  • The graph background color (red green and blue values which must be between 0 and 255).
  • The graph axes color.
  • The default peak color.
  • The number of application colors (should be set to zero for MS-Isotope).
  • The application colors (not relevant for MS-Isotope).
  • The default font - the font for all text except the peak labels.
  • The peak label font.
  • The X-Axis label.

Colors are specified as 3 integers for the red, green and blue intensities respectively. The intensity values must be between 0 and 255.

A font specification is made up of a font family (Dialog, Helvetica, TimesRoman, Courier or Symbol), a font style identifier (PLAIN, BOLD or ITALIC) and a point size.


Fragmentation types are stored in the file fragmentation.txt. The information corresponding to a fragmentation type consists of one or more lines in this file. Individual fragment type entries in the file are separated by a line with only the ">" symbol.

The first line for an entry contains the fragmentation type name. This can be followed by lines (some optional) which override the default fragmentation type parameters. The additional lines have the form of name value pairs separated by a space. The possible parameters are listed below:

1). A list of fragment ion types (one per line) which occur in MS/MS fragmentation.

name: it
possible values: a
                 a-H2O
                 a-NH3
                 a-H3PO4
                 a-SOCH4
                 b
                 b-H2O
                 b-NH3
                 b+H2O
                 b-H3PO4
                 b-SOCH4
                 bp2                   Doubly charged b ion for data where the charge can't be determined
                                       from the peak list.
                 bp2-H2O
                 bp2-NH3
                 bp2-H3PO4
                 bp2-SOCH4
                 c
                 x
                 y
                 y-H2O
                 y-NH3
                 y-H3PO4
                 y-SOCH4
                 yp2                   Doubly charged y ion for data where the charge can't be determined
                                       from the peak list.
                 yp2-H2O
                 yp2-NH3
                 yp2-H3PO4
                 yp2-SOCH4
                 Y
                 z
                 I                      Internal ions.
                 C                      C-ladder ions.
                 N                      N-ladder ions.
                 i                      Immonium and low mass ions.
                 m
                 d
                 v
                 w
                 h                      MH-H2O, b-H2O if b, y-H2O if y.
                 n                      a-NH3 if a, b-NH3 if b, y-NH3 if y.
                 B                      b+H2O if b.
                 P                      a-H3PO4 if a, b-H3PO4 if b, y-H3PO4 if y.
                 S                      b-SOCH4 if b, y-SOCH4 if y.
                 MH-H2O
                 MH-NH3
                 MH-H3PO4
                 MH-SOCH4

The following ion types are possible in MS-Tag.

a,a-NH3,a-H2O,a-H3PO4,b,b-H2O,b-NH3,b+H2O,b-H3PO4,b-SOCH4,c,d
bp2,bp2-H2O,bp2-NH3,bp2-H3PO4,bp2-SOCH4
x,y,y-NH3,y-H2O,y-H3PO4,y-SOCH4,Y,z
yp2,yp2-H2O,yp2-NH3,yp2-H3PO4,yp2-SOCH4
I,C,N,h,n,B,P,S

None are defined by default.

2). A list of amino acids which lose NH3 in MS/MS fragmentation.

name: nh3_loss
default value: RKNQ

3). A list of amino acids which lose H2O in MS/MS fragmentation.

name: h2o_loss
default value: STED

4). A list of positive charge bearing amino acids.

name: pos_charge
default value: RHK

5). A list of amino acids that don't generate d ions.

name: d_ion_exclude
default value: FHPWY

6). A list of amino acids that don't generate v ions.

name: v_ion_exclude
default value: GP

7). A list of amino acids that don't generate w ions.

name: w_ion_exclude
default value: FHWY

8). The maximum internal ion mass.

name: max_internal_ion_mass
default value: 700.0

9). MS-Tag/Batch-Tag scores for various ion types

name: unmatched_score
name: immonium_score
name: related_ion_score
name: m_score
name: a_score
name: a_loss_score
name: a_phos_loss_score
name: b_score
name: b_plus_h2o_score
name: b_loss_score
name: b_phos_loss_score
name: c_ladder_score
name: c_score
name: d_score
name: v_score
name: w_score
name: x_score
name: n_ladder_score
name: y_score
name: y_loss_score
name: y_phos_loss_score
name: Y_score
name: z_score
name: bp2_score
name: bp2_loss_score
name: bp2_phos_loss_score
name: yp2_score
name: yp2_loss_score
name: yp2_phos_loss_score
name: internal_a_score
name: internal_b_score
name: internal_loss_score
name: mh3po4_score
name: msoch4_score
default value: 0

Below is an example of the entry for ESI-Q-CID:

ESI-Q-CID
it a
it a-NH3
it a-H2O
it b
it b-NH3
it b-H2O
it b+H2O
it y
it y-NH3
it y-H2O
it I
it i
it P
it S
it M-H2O
it M-NH3
it M-SOCH4
unmatched_score -0.1
immonium_score 0.5
related_ion_score 0.5
a_score 0.5
a_loss_score 0.0
a_phos_loss_score 0.5
b_score 1.5
b_plus_h2o_score 1.0
b_loss_score 0.5
b_phos_loss_score 1.5
y_score 3.0
y_loss_score 1.5
y_phos_loss_score 3.0
internal_a_score 0.25
internal_b_score 0.5
internal_loss_score 0.25
max_internal_ion_mass 500.0
>
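The entry layout (a name line, then name value pairs, then a ">" line) is simple to read mechanically. The Python sketch below collects the repeated it lines into a list and keeps the other parameters as strings. It is an illustration of the format only: the function name parse_fragmentation is hypothetical and skipping blank lines is an assumption.

```python
def parse_fragmentation(text):
    """Return {fragmentation type name: {parameter name: value}}."""
    entries = {}
    name, params = None, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line == ">":
            name = None  # the next non-blank line starts a new entry
            continue
        if name is None:
            name = line
            params = entries.setdefault(name, {"it": []})
            continue
        key, _, value = line.partition(" ")
        if key == "it":
            params["it"].append(value.strip())  # ion types may repeat
        else:
            params[key] = value.strip()
    return entries


sample = "ESI-Q-CID\nit a\nit b\nit y\ny_score 3.0\nmax_internal_ion_mass 500.0\n>\n"
print(parse_fragmentation(sample))
```

The same approach would suit any of the files that separate entries with a ">" line and use name value pairs.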

The file instrument.txt contains the information for the items on the instrument menu.

An entry for an instrument option typically extends over several lines. Individual entries in the file are separated by a line with only the ">" symbol. The first line for an entry contains the instrument name as it appears on the instrument menu. This can be followed by lines (some optional) which override the default instrument parameters. The additional lines have the form of name value pairs separated by a space. The possible parameters are listed below:

1). A mandatory entry from the file fragmentation.txt.

name: frag
default value:

For example:

frag ESI-Q-CID

2). The number of decimal places used when printing out parent ion masses in reports.

name: parent_precision
default value: 4

3). The number of significant figures used when printing out parent ion mass errors in reports.

name: parent_error_significant_figures
default value: 3

4). The number of significant figures used when printing out parent ion intensities in reports.

name: parent_intensity_significant_figures
default value: 3

5). The number of decimal places used when printing out fragment ion masses in reports.

name: fragment_precision
default value: 4

6). The number of significant figures used when printing out fragment ion mass errors in reports.

name: fragment_error_significant_figures
default value: 2

7). The number of significant figures used when printing out fragment ion intensities in reports.

name: fragment_intensity_significant_figures
default value: 3

8). The mass window used when doing quantitation based on MSMS reporter ions (eg. iTRAQ).

name: quan_tolerance
default value: 0.2

If for example a value of 0.2 Da is used then all signals in the range ±0.2 Da of the expected exact mass are summed.
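The summation can be sketched in Python as follows. The function name reporter_intensity, the peak values and the expected mass of 114.11 are made up for illustration, and treating the window edges as inclusive is an assumption.

```python
def reporter_intensity(peaks, expected_mass, quan_tolerance=0.2):
    """Sum the intensities of all peaks within +/- quan_tolerance Da."""
    return sum(intensity for mz, intensity in peaks
               if abs(mz - expected_mass) <= quan_tolerance)


# (m/z, intensity) pairs; the last peak lies outside the 0.2 Da window.
peaks = [(114.05, 10.0), (114.11, 100.0), (114.25, 20.0), (114.40, 5.0)]
print(reporter_intensity(peaks, 114.11))
```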

9). Whether to allow incorrect charges when reporting matches in MS-Product.

name: allow_incorrect_charge
default value: false

It is appropriate to set this to true if you generally can't reliably work out the charge of fragment ions from the peak list.

10). MS peak filtering parameters.

Note that all these parameters can also be used as CGI parameters to the MS-Fit, MS-Bridge and MS-NonSpecific programs. CGI parameters will override what is in the instrument.txt file.

name: ms_peak_exclusion
default value: false

This flag controls whether or not to apply peak intensity filtering and filtering based on the number of peaks in the MS spectrum.

name: ms_min_intensity
default value: 0.0

If the ms_peak_exclusion flag is set then any peaks with intensities less than the ms_min_intensity will be excluded.

name: ms_matrix_exclusion
default value: false
name: ms_max_matrix_mass
default value: 1300.0

If the ms_matrix_exclusion flag is set to true then the software attempts to detect and remove any peaks less than or equal to ms_max_matrix_mass that the software judges from their mass offset to be from non-peptide peaks.

name: ms_mass_exclusion
default value: false
name: ms_min_mass
default value: 50.0
name: ms_max_mass
default value: 10000.0

If the ms_mass_exclusion flag is set to true then peaks with a mass less than ms_min_mass or greater than ms_max_mass are filtered out.

name: ms_max_peaks
default value: 200
name: ms_min_peaks
default value: 5

If the ms_peak_exclusion flag is set then only ms_max_peaks peaks are retained via an intensity filter. Also any spectra with fewer than ms_min_peaks peaks will not be processed.

11). MSMS peak filtering parameters.

Note that all these parameters can also be used as CGI parameters to the MS-Tag, MS-Product and Batch-Tag programs. CGI parameters will override what is in the instrument.txt file.

name: msms_min_precursor_mass
default value: 0.0

Any spectrum where the M+H of the precursor ion (as calculated from the m/z and the charge) is less than msms_min_precursor_mass will not be processed.

name: msms_raw_spectrum
default value: false

If this flag is set to true then all peak filtering is disabled and the value of all other MSMS peak filtering flags is ignored. This is generally used as a CGI parameter by MS-Product to display an unprocessed peak list.

name: msms_ft_peak_exclusion
default value: false

If this flag is set to true then the isotope distributions for the precursor peak, the charge reduced peak and the resonant peak are removed from the peak list.

name: msms_peak_exclusion
default value: false

This flag controls whether or not to apply peak intensity filtering and filtering based on the number of peaks in the MSMS spectrum.

name: msms_min_intensity
default value: 0.0

If the msms_peak_exclusion flag is set then any peaks with intensities less than the msms_min_intensity will be excluded.

name: msms_join_peaks
default value: false

The next stage of the peak list processing is to attempt to join together split peaks if the msms_join_peaks flag is set to true.

name: msms_matrix_exclusion
default value: false
name: msms_max_matrix_mass
default value: 400.0

If the msms_matrix_exclusion flag is set to true then the software attempts to detect and remove any peaks less than or equal to msms_max_matrix_mass that the software judges from their mass offset to be from non-peptide peaks.

name: msms_deisotope
default value: false

The next stage is to deisotope the spectrum if the msms_deisotope flag is set to true.

name: msms_mass_exclusion
default value: false
name: msms_min_mass
default value: 50.0
name: msms_precursor_exclusion
default value: 15.0

If the msms_mass_exclusion flag is set to true then peaks with a mass less than msms_min_mass or within msms_precursor_exclusion of the precursor mass are filtered out.

name: msms_max_peaks
default value: 60
name: msms_min_peaks
default value: 5

If the msms_peak_exclusion flag is set then only msms_max_peaks peaks are retained via an intensity filter. Before applying the filter the spectrum is split into 2 halves and the same number of peaks are retained in each half. Also any spectra with fewer than msms_min_peaks peaks will not be processed.
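The split-in-two intensity filter can be sketched in Python as below. Several details are assumptions, because the text above does not specify them: the spectrum is split at the midpoint of its m/z range, each half keeps up to msms_max_peaks / 2 of its most intense peaks, and the function name retain_most_intense is hypothetical.

```python
def retain_most_intense(peaks, max_peaks=60):
    """Keep up to max_peaks // 2 of the most intense peaks in each half.

    peaks is a list of (m/z, intensity) pairs.
    Assumption: the halves are divided at the midpoint of the m/z range.
    """
    if not peaks:
        return []
    mzs = [mz for mz, _ in peaks]
    midpoint = (min(mzs) + max(mzs)) / 2.0
    low = [p for p in peaks if p[0] <= midpoint]
    high = [p for p in peaks if p[0] > midpoint]
    per_half = max_peaks // 2
    kept = []
    for half in (low, high):
        # Rank by intensity, highest first, and keep the top per_half.
        kept.extend(sorted(half, key=lambda p: p[1], reverse=True)[:per_half])
    return sorted(kept)  # back in m/z order


spectrum = [(100.0, 1.0), (150.0, 5.0), (200.0, 3.0),
            (600.0, 2.0), (800.0, 9.0), (1000.0, 4.0)]
print(retain_most_intense(spectrum, max_peaks=4))
```

Filtering each half separately stops a few very intense peaks at one end of the spectrum from crowding out all the peaks at the other end.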


The file homology.txt contains the information for the matrix modification options.

An entry for a matrix modification option MUST contain at least TWO lines. Individual entries in the file are separated by a line with only the ">" symbol. The first line for an entry contains the matrix modification option name as it appears in the Matrix Modification section of the Batch-Tag or MS-Tag form. Subsequent lines (of which there must be at least one) should contain the following information separated by a space:

a). an amino acid;

b). a list of amino acids that the amino acid in a) can mutate or be modified to.

Below are examples of entries for a comprehensive homology option and for an option which allows B, X and Z codes in the database to become the relevant standard amino acid.

Homology
A CDEFGHIKLMNPQRSTVWY
C ADEFGHIKLMNPQRSTVWY
D ACEFGHIKLMNPQRSTVWY
E ACDFGHIKLMNPQRSTVWY
F ACDEGHIKLMNPQRSTVWY
G ACDEFHIKLMNPQRSTVWY
H ACDEFGIKLMNPQRSTVWY
I ACDEFGHKLMNPQRSTVWY
K ACDEFGHILMNPQRSTVWY
L ACDEFGHIKMNPQRSTVWY
M ACDEFGHIKLNPQRSTVWY
N ACDEFGHIKLMPQRSTVWY
P ACDEFGHIKLMNQRSTVWY
Q ACDEFGHIKLMNPRSTVWY
R ACDEFGHIKLMNPQSTVWY
S ACDEFGHIKLMNPQRTVWY
T ACDEFGHIKLMNPQRSVWY
V ACDEFGHIKLMNPQRSTWY
W ACDEFGHIKLMNPQRSTVY
Y ACDEFGHIKLMNPQRSTVW
>
Unknown Amino Acid
B DN
X ACDEFGHIKLMNPQRSTVWY
Z EQ
>

Computer optimisation options are currently only relevant to the Windows version. They are contained in the computer.txt file.

The following parameters are currently available:

1). The default memory block size used in memory mapping.

name: block_size
default value: 65536

This number is applicable for Windows systems and should not be changed.

2). The number of blocks to use as a default memory map size when reading a database.

name: num_blocks
minimum value: 1
default value: 256
maximum value: 16384

The default value of 256 blocks maps in 16 MBytes at a time (256 blocks of 65536 bytes). The maximum value of 16384 blocks corresponds to 1 GByte. You might want to vary this parameter to see whether it affects search times. If the server has a lot of RAM then a much bigger number could be appropriate.
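The relationship between num_blocks and the size of the memory map is simple arithmetic, shown here for the default and maximum values. The function name map_size_bytes is hypothetical.

```python
BLOCK_SIZE = 65536  # default block_size from computer.txt


def map_size_bytes(num_blocks, block_size=BLOCK_SIZE):
    """Size in bytes of the memory map for a given num_blocks setting."""
    if not 1 <= num_blocks <= 16384:
        raise ValueError("num_blocks must be between 1 and 16384")
    return num_blocks * block_size


print(map_size_bytes(256) // 2 ** 20)    # 16 (MBytes)
print(map_size_bytes(16384) // 2 ** 30)  # 1 (GByte)
```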


MS-Homology uses scoring matrices like those used in the BLAST or FASTA programs. The user is offered a choice of which one to use via the Score Matrix menu.

Users can add new scoring matrices or edit existing ones by editing the mat_score.txt file.

An example of a score matrix as defined in the file is given below:

BLOSUM62MS
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  4  4
K -1  2  0 -1 -3  1  1 -2 -1 -2 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  2  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -2 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  2  1 -2  1 -1 -2 -2  0 -3 -1  4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4
X  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
#  A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X
>

The first line is the name of the scoring matrix as it will appear on the Score Matrix menu.

Subsequent lines contain the scores assigned by the MS-Homology program to the mutation of one amino acid to another. The scores must be separated by space or tab characters. The scores may be positive, negative or zero.

Lines starting with a "#" character are treated as comments.

Separate entries are separated by a line with only the ">" symbol.

If MS-Homology encounters an amino acid that is not present in the score matrix then a default value of zero is used.
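The following sketch reads one entry in the layout shown above and looks up a score, returning zero for an amino acid that is not in the matrix. It assumes that the matrix is lower triangular and symmetric, with columns in the same order as the rows (as the commented column line in the example suggests), and that scores are integers. The function names are hypothetical.

```python
def parse_score_matrix(lines):
    """Parse one mat_score.txt entry into (name, {(aa1, aa2): score})."""
    body = [ln.strip() for ln in lines
            if ln.strip() and not ln.lstrip().startswith("#") and ln.strip() != ">"]
    name = body[0]
    rows = [ln.split() for ln in body[1:]]
    order = [row[0] for row in rows]
    scores = {}
    for i, row in enumerate(rows):
        values = [int(v) for v in row[1:]]
        # ASSUMPTION: row i holds scores against the first i + 1 amino acids.
        if len(values) != i + 1:
            raise ValueError("row %s should contain %d scores" % (row[0], i + 1))
        for other, value in zip(order, values):
            scores[(row[0], other)] = value
            scores[(other, row[0])] = value
    return name, scores


def mutation_score(scores, aa1, aa2):
    """Score for mutating aa1 to aa2, or zero if the pair is not in the matrix."""
    return scores.get((aa1, aa2), 0)


name, scores = parse_score_matrix(["BLOSUM62MS", "A  4", "R -1  5", "N -2  0  6",
                                   "#  A  R  N", ">"])
print(name, mutation_score(scores, "N", "A"), mutation_score(scores, "A", "J"))
```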


A list of species for dbEST prefix databases is maintained in the file dbEST.spl.txt

This file is necessary because of the lack of a standardized species field in the dbEST fasta file. The file contains a list of species strings found in the comment lines of the database entries. The FA-Index program scans through this list and tries to find one of these strings in the comment line. If it finds one, it assigns that string as the species for the entry. The order of species listed in this file is not crucial, but FA-Index will run faster if the most common species are listed at the top of the file in decreasing order of number of occurrences. For example, in the distributed file the species with over 10000 occurrences are at the top of the list and everything else is in roughly alphabetic order. Note, however, that Rat has to appear after Rattus norvegicus because the string Rat is contained within it. The comment line at the end of the file is to make sure there is a carriage return after the last entry.

If an entry doesn't contain one of these species strings it is labelled as an UNREADABLE species. A list of the searched fields from these UNREADABLE entries is contained in the file seqdb\dbEST*.usp after every FA-Index run. You can look through the dbEST*.usp file to see if you can add any more fields to this file when you update the database. FA-Index must then be run again in order to assign the new species correctly.

The first few lines of a typical dbEST.spl.txt are shown below. Note how some very common species are listed first and then the remaining species are listed in alphabetic order.

Homo sapiens
Mus musculus
H. sapiens
Human
C.elegans
Arabidopsis thaliana
Rice
A.gambiae
A.muscaria
A.thaliana
A. thaliana
Acacia mangium
Acanthamoeba healyi
Acanthopanax sessiliflorus
Acetabularia acetabulum
Acorus americanus
Acropora cervicornis
Acyrthosiphon pisum
Aedes aegypti
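The effect of the list order on species assignment can be illustrated with a short sketch. It assumes that FA-Index performs a plain, case-sensitive substring test and stops at the first match; the function name and the comment line used here are hypothetical.

```python
def assign_species(comment_line, species_list):
    """Return the first species string found in the comment line."""
    for species in species_list:
        if species in comment_line:
            return species
    return "UNREADABLE"


comment = "EST00001 Rattus norvegicus cDNA clone"  # hypothetical comment line
print(assign_species(comment, ["Rattus norvegicus", "Rat"]))  # Rattus norvegicus
print(assign_species(comment, ["Rat", "Rattus norvegicus"]))  # Rat
```

The second call shows why Rat has to be listed after Rattus norvegicus.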

MS-Digest can currently report Bull and Breese (% hydrophobicity) and HPLC indices for peptides. The corresponding coefficients used by MS-Digest for each amino acid are contained in the file indicies.txt. These can be edited if desired.

The relevant publications are:

Bull, Henry B. and Breese, Keith (1974) "Surface Tension of Amino Acid Solutions: A Hydrophobicity Scale of the Amino Acid Residues", Arch. Biochem. Biophys, 161, 665-670

Browne, C. A., Bennett, H. P. J. and Solomon, S. (1982) "The Isolation of Peptides by High-Performance Liquid Chromatography Using Predicted Elution Positions", Anal. Biochem., 124, 201-208

The file indicies.txt also contains amino acid coefficients from the following publications:

Hopp, T. P. and Woods, K.R. (1981) Proc. Natl. Acad. Sci., 78, 3824-

Kyte, Jack and Doolittle, Russell F. (1982) "A Simple Method for Displaying the Hydropathic Character of a Protein", J. Mol. Biol., 157, 105-132

Engelman, D. M., Steitz, T. A. and Goldman, A. (1986) "Identifying Nonpolar Transbilayer Helices in Amino Acid Sequences of Membrane Proteins", Ann. Rev. Biophys. Chem, 15, 321-353

These additional coefficients are not currently used by any of the Protein Prospector programs.


The file links.txt contains the information required by the Links Search Type option of the MS-Bridge form. Editing this file is not recommended unless the cross-linker you want to add is very similar to one of the current options.

If the first character of a line in the file is a '#' character it is treated as a comment.

The entries in the file are separated by a line containing a '>' character.

The first line of an entry is the string that is to appear on the Links Search Type menu on MS-Bridge.

The Xlink:Dehydro (C) entry deals with disulfide bonds and should not be edited.

Subsequent lines for an entry are parameters and are in the form of name-value pairs. A name-value pair is a line in the file where the name is followed by a space character and the rest of the line is the value. The value may contain space characters. If just the name is specified then the value is assumed to be an empty string.

All the parameters need to be specified.

name: link_aa_1

This is the amino acid that the cross-linker attaches to. Single letter amino acid codes are used when specifying this.

name: link_aa_1_n_term_flag

A flag used to specify whether the cross-linker can also attach to the N-terminus.

name: bridge_formula

The elemental formula of the cross-linker.

name: usermod

Entries from the usermod.txt file which define modified amino acids that can occur as a result of the cross linking. These need to be on the amino acid specified by the link_aa_1 parameter.

An example entry is shown below:

DSS
link_aa_1 K
link_aa_1_n_term_flag 1
bridge_formula C8 H10 O2
usermod Xlink:DSS1 (K)
usermod Xlink:DSS2 (K)
>
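The name-value layout of an entry can be read as sketched below. This is an illustration only; the function name parse_links is hypothetical, and storing each parameter as a list (so that repeated usermod lines are all kept) is a design choice of the sketch.

```python
def parse_links(text):
    """Return {menu name: {parameter name: [values]}} for a links.txt style file."""
    entries = {}
    block = []
    # A final ">" is appended so that an unterminated last entry is still read.
    for line in text.splitlines() + [">"]:
        if not line.strip() or line.startswith("#"):
            continue  # comment lines start with '#'
        if not line.startswith(">"):
            block.append(line)
            continue
        if block:
            params = {}
            for pair in block[1:]:
                # The name ends at the first space; the rest of the line is the
                # value, which may contain spaces or be empty.
                name, _, value = pair.partition(" ")
                params.setdefault(name, []).append(value)
            entries[block[0]] = params
            block = []
    return entries


EXAMPLE = """DSS
link_aa_1 K
link_aa_1_n_term_flag 1
bridge_formula C8 H10 O2
usermod Xlink:DSS1 (K)
usermod Xlink:DSS2 (K)
>
"""
print(parse_links(EXAMPLE)["DSS"]["usermod"])
```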

The file quan.txt contains the quantitation options.

An entry for a quantitation type MUST contain at least TWO lines. Individual quantitation types in the file are separated by a line with only the ">" symbol. The first line for an entry contains the quantitation type name as it appears on the Quantitation menu on the Search Compare form.

The iTRAQ4plex, iTRAQ8plex and O18 entries should not be modified. For the other quantitation types the subsequent lines contain modifications from the usermod.txt followed by the modified amino acid in brackets.

Example entry for ICAT.

ICAT-C:13C (C)
ICAT-C:13C(9) (C)
>

Example entry for SILAC K.

Label:13C (K)
Label:13C(6) (K)
>

Example entry for SILAC C of R and SILAC NC of L.

Label:13C (R) 13C 15N (L)
Label:13C(6)15N(1) (L)
Label:13C(6) (R)
>

itraq.txt contains the iTRAQ™ purity coefficients for 4-plex iTRAQ™.

itraq8.txt contains the iTRAQ™ purity coefficients for 8-plex iTRAQ™.

iTRAQ™ reagent batches are labelled with purity values indicating the percentages of each reporter ion that have masses differing by -2 Da, -1 Da, +1 Da and +2 Da from the reporter ion mass. This allows the software to make the necessary corrections before reporting the quantitation ratios.

The files contain one or more entries which will appear on a menu on the Search Compare form. The entries are separated from each other by a line which just contains a ">" symbol.

The first line of an entry contains the string which will appear on the menu. Subsequent lines contain the nominal reporter ion mass followed by the percentages corresponding to -2 Da, -1 Da, +1 Da and +2 Da mass shifts.

An example from the itraq.txt is shown below:

Default iTRAQ4plex
114 0.0 1.0 5.9 0.2
115 0.0 2.0 5.6 0.1
116 0.0 3.0 4.5 0.1
117 0.1 4.0 3.5 0.1
>

In the itraq8.txt file there also needs to be an entry for the phenylalanine immonium ion at 120 Da. For example:

Default iTRAQ8plex
113 0.0 1.0 5.9 0.2
114 0.0 1.0 5.9 0.2
115 0.0 2.0 5.6 0.1
116 0.0 3.0 4.5 0.1
117 0.1 4.0 3.5 0.1
118 0.1 4.0 3.5 0.1
119 0.1 4.0 3.5 0.1
120 0.0 0.0 3.5 0.1
121 0.1 4.0 3.5 0.1
>

Obviously there is no component at -2 Da and -1 Da for the Phenylalanine immonium ion.
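The sketch below reads one purity entry and works out the percentage of each reporter ion that stays at its nominal mass. It assumes that whatever is not listed under the four mass shifts remains at the nominal mass; the manual does not state this explicitly. The function names are hypothetical.

```python
def parse_purity_entry(lines):
    """Return (menu name, {reporter mass: (-2 Da, -1 Da, +1 Da, +2 Da percentages)})."""
    name = lines[0].strip()
    table = {}
    for line in lines[1:]:
        fields = line.split()
        if not fields or fields[0] == ">":
            continue
        table[int(fields[0])] = tuple(float(f) for f in fields[1:5])
    return name, table


def percent_at_nominal_mass(shifts):
    # ASSUMPTION: the remainder of the signal is at the nominal reporter mass.
    return 100.0 - sum(shifts)


ENTRY = ["Default iTRAQ4plex",
         "114 0.0 1.0 5.9 0.2",
         "115 0.0 2.0 5.6 0.1",
         "116 0.0 3.0 4.5 0.1",
         "117 0.1 4.0 3.5 0.1",
         ">"]
name, table = parse_purity_entry(ENTRY)
for mass in sorted(table):
    print(mass, round(percent_at_nominal_mass(table[mass]), 1))
```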

The following publication outlines the purity correction method for 4-plex iTRAQ™:

Shadforth, I. P., Dunkley, T. P. J., Lilley, K. and Bessant, C. (2005) i-Tracker: For Quantitative Proteomics Using iTRAQ™, BMC Genomics, Vol. 6, Article 145


When the mass modifications option is used in MS-Tag or Batch-Tag, hits containing a mass modification are displayed with the mass in brackets after the modified amino acid. For example:

STTTGHLIYK(14.0067)

Clicking on the hit peptide brings up the MS-Product report. In the sequence displayed at the top of that report the mass is a link to the Unimod web site, which lists modifications from the Unimod database that have a similar mass shift.

The file unimod.txt has 3 parameters that define the url used for this link:

main_url http://www.unimod.org/modifications_list.php?a=advsearch&asearchfield[]=mono_mass&asearchopt_mono_mass=Between&
start_range value_mono_mass=
end_range value1_mono_mass=

main_url is the initial part of the url.

start_range is the parameter used to define the start of the mass range.

end_range is the parameter used to define the end of the mass range.

It is possible to edit these values if you want something else to happen when a user follows this link.
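The way the three parameters could combine into a link is sketched below. The manual does not describe the exact construction, so the "&" placed between the two range parameters and the width of the mass window are assumptions, and the function name unimod_link is hypothetical.

```python
MAIN_URL = ("http://www.unimod.org/modifications_list.php?a=advsearch"
            "&asearchfield[]=mono_mass&asearchopt_mono_mass=Between&")
START_RANGE = "value_mono_mass="
END_RANGE = "value1_mono_mass="


def unimod_link(mass_shift, window=0.01):
    """Build a Unimod search URL for masses within +/- window of mass_shift."""
    # ASSUMPTION: the window of 0.01 Da is illustrative only.
    low = mass_shift - window
    high = mass_shift + window
    return "%s%s%.4f&%s%.4f" % (MAIN_URL, START_RANGE, low, END_RANGE, high)


print(unimod_link(14.0067))
```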


The MGF parameters are used to enable Protein Prospector to extract information from the TITLE line in an MGF file. They are stored in the file mgf.xml.

Several different TITLE line formats are supported. Users should not generally edit the existing ones but it is possible to add new ones. A typical TITLE line might look like this (this is produced by the Mascot dll in Sciex Analyst 2):

TITLE=File: F25uLUCSF.wiff, Sample: F2 26_5-28002 (sample number 1), Elution: 26.813 to 28.437 min,
   Period: 1, Cycle(s): 1129, 1139, 1150 (Experiment 3), 1125 (Experiment 4)

The parameters for each different format which is supported are contained between <mgf_type> tags. The parameters are explained below:

<name>

Each format that is supported has to be given a unique name. You should not change the names of any of the formats in the supplied file.

<start>, <end> and <contains>

Protein Prospector uses the information in these tags to work out which of the supported formats the current title line corresponds to. The <start> parameter is what is at the start of the title line after the TITLE= identifier. The <end> parameter is what is at the end of the title line. One or more <contains> parameters can be used to specify other identifying strings that would distinguish this title line format from the other supported title line formats. It is not always possible to specify <start> and <end> tags.

The different formats are considered in the order they appear in the file. Thus a more specific format should be placed before a more general format. For example:

<mgf_type>
   <name>ANALYST_DISTILLER</name>
   <contains>S</contains>
   <contains>(rt=</contains>
   <contains>p=</contains>
   <contains>c=</contains>
   <contains>e=</contains>
   <contains>[</contains>
   <contains>]</contains>
   <spot_start>rt=</spot_start>
   <spot_end>,</spot_end>
</mgf_type>

This format would recognize:

TITLE=1: Scan 5 (rt=4.106, p=0, c=1, e=1) [C:\MSDATA\QS20060131_S_18mix_02.wiff]

and should be placed before the more generic:

<mgf_type>
   <name>DISTILLER</name>
   <contains>S</contains>
   <contains>(rt=</contains>
   <contains>[</contains>
   <contains>]</contains>
   <spot_start>rt=</spot_start>
   <spot_end>)</spot_end>
</mgf_type>

<spot_start> and <spot_end>

These tags are used to delimit the "spot" information which is used in the S column in the Search Compare output. This should preferably be a retention time. If the title line contains a retention time window the start of the window is generally preferable. If no retention time is available a scan number should be used. If the sample is on a spotting plate a spot number could be used.
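The format recognition and spot extraction described above can be sketched with the two example formats. This is an illustration under stated assumptions, not the code Protein Prospector uses: the enclosing <mgf_types> element and the function names are hypothetical, and the matching is assumed to be a plain substring test.

```python
import xml.etree.ElementTree as ET

# The <mgf_types> wrapper is hypothetical; it only gives the snippet one root.
MGF_XML = """<mgf_types>
<mgf_type>
   <name>ANALYST_DISTILLER</name>
   <contains>S</contains>
   <contains>(rt=</contains>
   <contains>p=</contains>
   <contains>c=</contains>
   <contains>e=</contains>
   <contains>[</contains>
   <contains>]</contains>
   <spot_start>rt=</spot_start>
   <spot_end>,</spot_end>
</mgf_type>
<mgf_type>
   <name>DISTILLER</name>
   <contains>S</contains>
   <contains>(rt=</contains>
   <contains>[</contains>
   <contains>]</contains>
   <spot_start>rt=</spot_start>
   <spot_end>)</spot_end>
</mgf_type>
</mgf_types>"""


def load_formats(xml_text):
    """Read the formats in file order, which is also the matching order."""
    formats = []
    for node in ET.fromstring(xml_text).iter("mgf_type"):
        formats.append({
            "name": node.findtext("name"),
            "start": node.findtext("start"),
            "end": node.findtext("end"),
            "contains": [c.text for c in node.findall("contains")],
            "spot_start": node.findtext("spot_start"),
            "spot_end": node.findtext("spot_end"),
        })
    return formats


def classify(title, formats):
    """Return the first format whose start, end and contains strings all match."""
    for fmt in formats:
        if fmt["start"] and not title.startswith(fmt["start"]):
            continue
        if fmt["end"] and not title.endswith(fmt["end"]):
            continue
        if all(text in title for text in fmt["contains"]):
            return fmt
    return None


def spot(title, fmt):
    """Text between the first spot_start and the next spot_end."""
    begin = title.index(fmt["spot_start"]) + len(fmt["spot_start"])
    return title[begin:title.index(fmt["spot_end"], begin)]


formats = load_formats(MGF_XML)
title = r"1: Scan 5 (rt=4.106, p=0, c=1, e=1) [C:\MSDATA\QS20060131_S_18mix_02.wiff]"
matched = classify(title, formats)
print(matched["name"], spot(title, matched))
```

If the two formats were listed in the opposite order, the title above would match DISTILLER first, which is why the more specific format has to come first.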


This section describes the inst_dir.txt file.

The Batch-Tag program can make use of a data repository. This is a browsable area from which one or more MSMS peak list files can be selected to make a project which will be searched in a batch. In this way it is possible to search multiple LC fractions in the same search.

The base directories of the repository are specified in the info.txt file via the centroid_dir and raw_dir directives (see modifying the main configuration file).

The base directories would typically contain a directory for each physical instrument that you have. The inst_dir.txt maps the directory names you choose for each physical instrument to the generic names specified in the instrument.txt file.

A typical example is:

LCQ ESI-ION-TRAP-low-res
QStarPulsar ESI-Q-TOF
QStarXL ESI-Q-TOF
TOFTOF1 MALDI-TOFTOF
TOFTOF2 MALDI-TOFTOF
TOFTOF3 MALDI-TOFTOF
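A file in this layout maps each directory name to a generic instrument name, as the following sketch shows. It assumes that directory names contain no whitespace; the function name parse_inst_dir is hypothetical.

```python
def parse_inst_dir(text):
    """Return {repository directory name: generic instrument name}."""
    mapping = {}
    for line in text.splitlines():
        if line.strip():
            # ASSUMPTION: the directory name is the first whitespace-free token.
            directory, instrument = line.split(None, 1)
            mapping[directory] = instrument.strip()
    return mapping


EXAMPLE = """LCQ ESI-ION-TRAP-low-res
QStarPulsar ESI-Q-TOF
QStarXL ESI-Q-TOF
TOFTOF1 MALDI-TOFTOF
"""
mapping = parse_inst_dir(EXAMPLE)
print(mapping["QStarXL"])
```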

The default parameters for the Batch-Tag and Batch-Tag Web forms are stored in the file batchtag/default.xml.

This file contains the cgi parameters used by the program and their default values. An example of typical entries in the file is shown below:

<const_mod>Carbamidomethyl%20%28C%29</const_mod>
<database>SwissProt</database>
<species>All</species>

The parameters for the expectation value search are stored in the file expectation.xml. The contents of the current default file are shown below.

<?xml version="1.0" encoding="UTF-8"?>
<parameters>
<database>SwissProt</database>
<full_pi_range>1</full_pi_range>
<max_hits>2000000</max_hits>
<missed_cleavages>3</missed_cleavages>
<msms_full_mw_range>1</msms_full_mw_range>
<msms_max_modifications>0</msms_max_modifications>
<msms_max_reported_hits>5</msms_max_reported_hits>
<msms_parent_mass_tolerance>0.5</msms_parent_mass_tolerance>
<msms_parent_mass_tolerance_units>Da</msms_parent_mass_tolerance_units>
<parent_mass_convert>monoisotopic</parent_mass_convert>
<report_title>BatchTag</report_title>
<search_name>batchtag</search_name>
<species>All</species>
<use_instrument_ion_types>1</use_instrument_ion_types>
</parameters>
<copy_parameter>fragment_masses_tolerance</copy_parameter>
<copy_parameter>fragment_masses_tolerance_units</copy_parameter>
<copy_parameter>instrument_name</copy_parameter>
<copy_parameter>allow_non_specific</copy_parameter>
<copy_parameter>enzyme</copy_parameter>
<copy_parameter>expect_calc_method</copy_parameter>
<copy_parameter>const_mod</copy_parameter>
<copy_parameter>project_name</copy_parameter>
<copy_parameter>msms_precursor_charge_range</copy_parameter>

The search parameters that are shown between the <parameters> tags are used in every expectation value search. Thus the database is always SwissProt and the species is always All. The parameters listed in <copy_parameter> tags are copied from the search form. If an expectation value search has previously been done with the same values for all the copy parameters then a new expectation value search is not performed.
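The combination of fixed and copied parameters, and the test for whether an earlier expectation value search can be reused, can be sketched as follows. How Protein Prospector actually records earlier searches is not described here, so the function names and the tuple used as a key are hypothetical.

```python
# A subset of the fixed values from the <parameters> block of expectation.xml.
FIXED_PARAMETERS = {"database": "SwissProt", "species": "All", "missed_cleavages": "3"}

# The names listed in the <copy_parameter> tags of expectation.xml.
COPY_PARAMETERS = [
    "fragment_masses_tolerance", "fragment_masses_tolerance_units",
    "instrument_name", "allow_non_specific", "enzyme", "expect_calc_method",
    "const_mod", "project_name", "msms_precursor_charge_range",
]


def expectation_search_parameters(form):
    """Fixed parameters plus the values copied from the search form."""
    params = dict(FIXED_PARAMETERS)
    for name in COPY_PARAMETERS:
        params[name] = form.get(name, "")
    return params


def reuse_key(form):
    """Two forms with equal keys would share one expectation value search."""
    return tuple(form.get(name, "") for name in COPY_PARAMETERS)


form_a = {"database": "NCBInr", "enzyme": "Trypsin"}
form_b = {"database": "UniProtKB", "enzyme": "Trypsin"}
print(reuse_key(form_a) == reuse_key(form_b))
```

The database chosen on the search form does not affect the key because database is not one of the copy parameters.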


In the Protein Prospector Batch-Tag program expectation values are calculated by a linear tail fit method. This involves collecting a distribution of the scores for all peptides that fall within a Precursor m/z tolerance specified in the file expectation.xml. The scores are plotted as a histogram and the gradient and offset of a survival curve of the tail of the distribution are obtained to enable expectation values to be calculated. Some aspects of the tail fit calculation can be modified via parameters in the expectation.txt file. Modifying this file is not generally necessary or recommended.

tail_percent

The tail_percent parameter has a default value of 10. This is the percentage of the scores from the distribution that are used for the linear tail fit.

max_used_peptides

The max_used_peptides parameter has a default value of 10000. A search against a randomized SwissProt database (using the parameters in expectation.xml) is used to generate peptides from which to assemble the score distribution. The program stops generating new peptides for a particular spectrum when max_used_peptides different peptides have been processed.

min_used_peptides

The min_used_peptides parameter has a default value of 2800. A search against a randomized SwissProt database is used to generate peptides from which to assemble the score distribution. The program keeps cycling through the database to generate new peptides until at least min_used_peptides peptides have been generated for each spectrum. In some cases it may not be possible to generate min_used_peptides peptides so the database cycling will stop after 5 cycles. If min_used_peptides peptides haven't been generated then an expectation value is not calculated for this spectrum.
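A linear tail fit of this general kind can be sketched as below. It is not the Batch-Tag implementation: it assumes that the survival curve is the number of peptides scoring at least as well as a given score, that its base-10 logarithm is fitted against score by ordinary least squares over the top tail_percent of scores, and that the fitted line is read directly as the expectation value. The function names are hypothetical.

```python
import math


def linear_tail_fit(scores, tail_percent=10.0, min_used_peptides=2800):
    """Return (gradient, offset) of log10(survival) against score, or None."""
    if len(scores) < min_used_peptides:
        return None  # too few peptides, so no expectation value
    ordered = sorted(scores, reverse=True)
    n_tail = max(2, int(len(ordered) * tail_percent / 100.0))
    xs = ordered[:n_tail]
    # The k-th best score is matched or beaten by k peptides (ties ignored).
    ys = [math.log10(rank) for rank in range(1, n_tail + 1)]
    mean_x = sum(xs) / n_tail
    mean_y = sum(ys) / n_tail
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None  # all tail scores identical, no line can be fitted
    gradient = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    offset = mean_y - gradient * mean_x
    return gradient, offset


def expectation_value(score, gradient, offset):
    """Expected number of random peptides scoring at least this well."""
    return 10 ** (gradient * score + offset)


# Synthetic, exponentially distributed scores used only to exercise the sketch.
N = 5000
synthetic = [-math.log((i + 0.5) / N) for i in range(N)]
gradient, offset = linear_tail_fit(synthetic)
print(expectation_value(15.0, gradient, offset) < expectation_value(10.0, gradient, offset))
```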

A fairly similar approach to calculating expectation values by a tail fit method is outlined in the following publication:

Fenyo, D. and Beavis, R. C. (2003) A Method for Assessing the Statistical Significance of Mass Spectrometry-Based Protein Identifications Using General Scoring Schemes, Anal. Chem., Vol. 75, Pp. 768-774


The coefficients for calculating discriminant scores are stored in the files disc_score.txt and disc_score2.txt.

The discriminant score is calculated using the coefficients in disc_score2.txt if an expectation value is available. Otherwise it uses the coefficients in disc_score.txt. Expectation values will not be available if you did the Batch-Tag search with the Expectation Calc Method parameter set to None. They will also not be available if you set the Expectation Calc Method parameter to Linear Tail Fit and there were fewer than min_used_peptides peptides (from the expectation.txt file) for a particular MSMS precursor m/z.

There should be entries in both disc_score.txt and disc_score2.txt for all the instrument entries in instrument.txt.

The possible coefficients in disc_score.txt are:

best_score
maximum_best_score
score_diff
offset

and the discriminant score equation is:

d = ( x × max ( b, m ) ) + ( y × s ) + z;

where

d = discriminant score
x = best_score coefficient
b = best peptide score for protein
m = maximum_best_score coefficient
y = score_diff coefficient
s = score difference between score for the peptide hit and the 6th best peptide hit
    (similar hits aren't counted when counting up to 6)
z = offset coefficient 

If maximum_best_score is not defined in the file then b will be used in the equation.

The possible coefficients in disc_score2.txt are:

best_score
maximum_best_score
expectation
offset

and the discriminant score equation is:

d = ( x × max ( b, m ) ) + ( y × log10 ( e )) + z;

where

d = discriminant score
x = best_score coefficient
b = best peptide score for protein
m = maximum_best_score coefficient
y = expectation coefficient
e = expectation value
z = offset coefficient 

If maximum_best_score is not defined in the file then b will be used in the equation.
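The two equations can be written as short functions. The sketch follows the equations exactly as printed, including max ( b, m ), and assumes that the expectation-based equation is used whenever an expectation value is available and the score difference equation otherwise. The coefficient values in the example are hypothetical, not values from the distributed files.

```python
import math


def _best_score_term(coeffs, best_score):
    # If maximum_best_score is not defined then b is used on its own.
    if "maximum_best_score" in coeffs:
        return max(best_score, coeffs["maximum_best_score"])
    return best_score


def disc_score(coeffs, best_score, score_diff):
    """d = (x * max(b, m)) + (y * s) + z, with coefficients as in disc_score.txt."""
    return (coeffs["best_score"] * _best_score_term(coeffs, best_score)
            + coeffs["score_diff"] * score_diff
            + coeffs["offset"])


def disc_score2(coeffs, best_score, expectation):
    """d = (x * max(b, m)) + (y * log10(e)) + z, as in disc_score2.txt."""
    return (coeffs["best_score"] * _best_score_term(coeffs, best_score)
            + coeffs["expectation"] * math.log10(expectation)
            + coeffs["offset"])


def discriminant(best_score, score_diff, expectation, coeffs1, coeffs2):
    """Use the expectation-based equation only when an expectation value exists."""
    if expectation is not None:
        return disc_score2(coeffs2, best_score, expectation)
    return disc_score(coeffs1, best_score, score_diff)


# Hypothetical coefficients, for illustration only.
C1 = {"best_score": 0.5, "score_diff": 0.25, "offset": -2.0}
C2 = {"best_score": 0.5, "expectation": -1.0, "offset": 0.0}
print(discriminant(20.0, 8.0, None, C1, C2))
```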