Description, Instructions, and Tips for MS-Pattern


Purpose
This document provides instructions for MS-Pattern.

Instructions for ProteinProspector Programs

Introduction

MS-Pattern began life as a simple utility for retrieving the database entries associated with a text string (protein name, sequence, or accession number). Because the algorithm we use for this is very similar to the way the UNIX grep command treats regular expressions, the implementation lends itself well to describing the ambiguity often present in data obtained from an Edman degradation protein sequencing experiment. We have since added features such as peptide mass filtering and tolerance for mismatched amino acids. We then broadened the regular expression concept to include a list of sequences. This list is expected to contain multiple similar sequences, as would result from de novo interpretation of an MS/MS spectrum.


Search Mode

Sequence Only
MS-Pattern finds amino acid sequences in the selected database which match the regular expression entered.
In this mode the sequence should be in CAPITAL LETTERS.

Sequence and Mass
MS-Pattern first finds amino acid sequences in the selected database which match the regular expression entered, then eliminates those that do not contain one of the specified peptide masses WITHIN the matched sequence. Hence, not all of the specified sequence must be contained in the region defined by the mass, and residues outside of the peptide in question can be specified. Note that unless No enzyme is specified, the cleavage rule may prevent a match in such cases.
In this mode the sequence should be in CAPITAL LETTERS.
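The two-step filter described above can be sketched as follows. This is a minimal illustration, not MS-Pattern's actual implementation: it assumes the No enzyme case (any sub-peptide of the matched region is allowed), unmodified residues, and neutral monoisotopic masses. The names peptide_mass, mass_within, and sequence_and_mass are hypothetical, and the residue masses are standard values rather than values taken from MS-Pattern.

```python
import re

# Standard monoisotopic residue masses in Da (assumed; not taken from MS-Pattern).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def mass_within(region, target, tol):
    """Return the first sub-peptide of `region` within `tol` Da of `target`, else None."""
    for start in range(len(region)):
        for end in range(start + 1, len(region) + 1):
            pep = region[start:end]
            if abs(peptide_mass(pep) - target) <= tol:
                return pep
    return None


def sequence_and_mass(protein, pattern, target, tol=0.5):
    """Step 1: match the regular expression. Step 2: look for the mass inside the match."""
    for m in re.finditer(pattern, protein):
        pep = mass_within(m.group(0), target, tol)
        if pep is not None:
            return m.group(0), pep
    return None


protein = "AAFMQGGKLLR"
target = peptide_mass("MQGGK")
print(sequence_and_mass(protein, "FMQ.*K", target, tol=0.01))
```

Here the expression FMQ.*K matches FMQGGK, while the peptide carrying the mass is MQGGK: the leading F lies outside the peptide but is still part of the specified sequence.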

List of Sequences
This list is expected to contain multiple similar sequences, as would result from de novo interpretation of an MS/MS spectrum (all sequences would thus have the same mass). Furthermore, it should be possible to match sequences which are homologous to one of the sequences in the list if the number of mismatched AA's is set to a value greater than 0 (2 is a good first choice).
This mode does NOT allow use of non-alphabetic characters from regular expressions ( [, ], ^, ., .* ). In this mode the sequences should be in CAPITAL LETTERS.
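As a rough illustration of why de novo candidates share one mass, the sketch below checks a candidate list before it is submitted. It is only a pre-check under stated assumptions, not part of MS-Pattern: the function names are hypothetical, residues are unmodified, and the residue masses are standard monoisotopic values. Typical ambiguities are I versus L (identical mass) and GG versus N (nearly identical mass).

```python
# Standard monoisotopic residue masses in Da (assumed; not taken from MS-Pattern).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(seq):
    """Neutral monoisotopic mass; rejects anything but capital amino acid letters."""
    for aa in seq:
        if aa not in RESIDUE_MASS:
            raise ValueError(f"unsupported character {aa!r} in {seq!r}")
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def same_mass(sequences, tol=0.01):
    """True if all candidate sequences agree in mass to within `tol` Da."""
    masses = [peptide_mass(s) for s in sequences]
    return max(masses) - min(masses) <= tol


candidates = ["SGGLK", "SNLK", "SNIK"]
print(same_mass(candidates))          # GG vs N and I vs L are isobaric here
print(same_mass(["SGGLK", "SALK"]))   # A is not isobaric with GG or N
```

Because peptide_mass accepts only capital amino acid letters, a list entry containing a regular expression character such as . or [ is rejected, in line with the restriction stated for this mode.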


Regular Expressions

Square brackets have special meaning in a regular expression. The regular expressions used are of the form used by the UNIX grep facility. Examples (type man grep on a UNIX system for full details):
[EF]    The amino acid is either E or F.
[^EF]   The amino acid is anything but E or F.
.       Any single amino acid is possible.
.*      Used to represent a sequence of one or more unknown amino acids. Note that this is "dot-star", not just "star". This wildcard has some consequences that are not entirely obvious. A match is to the longest sequence fitting the condition (for example, FMQ.*K will match through the last K in the sequence following FMQ). In Sequence and Mass mode the sequence is matched first, then a mass WITHIN the sequence is found. Hence, not all of the specified sequence must be contained in the region defined by the mass, and residues outside of the peptide in question can be specified. Note that unless No enzyme is specified, the cleavage rule may prevent a match in such cases.
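The expressions above can be tried outside MS-Pattern with any grep-like regular expression engine. The sketch below uses Python's re module as a stand-in; it is an assumption that MS-Pattern behaves identically in every detail. The protein string and the find_matches helper are made up for illustration. One known difference in wording: in Python (as in grep), .* also matches zero characters.

```python
import re


def find_matches(pattern, protein):
    """Return every non-overlapping match of `pattern` as (start, matched text)."""
    return [(m.start(), m.group(0)) for m in re.finditer(pattern, protein)]


# Hypothetical protein sequence, in capital letters.
protein = "MKAFMQLEKTGKW"

print(find_matches("A[EF]M", protein))   # E or F in the second position
print(find_matches("K[^EF]", protein))   # K followed by anything but E or F
print(find_matches("L.K", protein))      # any single amino acid between L and K
print(find_matches("FMQ.*K", protein))   # longest match: runs to the last K
```

The last expression shows the longest-match behaviour: the match runs from FMQ through the final K (FMQLEKTGK), not just to the first K after FMQ.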


Mismatched AA's

Setting the Max. # of Mismatched AA's parameter to a value other than 0 allows homologous sequences to be matched. A database sequence is then accepted even if up to that number of positions differ from the sequence entered. In future revisions of MS-Pattern this parameter may be replaced by PAM matrices of the kind used in sequence homology programs like BLAST. This parameter is active in the following search modes:

  • Sequence Only
  • Sequence and Mass
  • List of Sequences
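A minimal sketch of mismatch-tolerant matching, assuming plain sequences without regular expression characters and a simple position-by-position comparison (no insertions or deletions). The function names and the example protein are hypothetical, and MS-Pattern's own algorithm may differ.

```python
def find_with_mismatches(query, protein, max_mismatches=0):
    """Slide `query` along `protein`; keep windows with at most `max_mismatches` differences."""
    hits = []
    n = len(query)
    for start in range(len(protein) - n + 1):
        window = protein[start:start + n]
        mismatches = sum(1 for q, p in zip(query, window) if q != p)
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


def find_list_with_mismatches(queries, protein, max_mismatches=0):
    """List of Sequences analogue: report hits for every sequence in the list."""
    return {q: find_with_mismatches(q, protein, max_mismatches) for q in queries}


# Hypothetical protein sequence, in capital letters.
protein = "MKAFMQLEKTGKW"

print(find_with_mismatches("FMQLDK", protein, max_mismatches=0))  # no exact match
print(find_with_mismatches("FMQLDK", protein, max_mismatches=1))  # D vs E tolerated
```

With the parameter at 0 the query FMQLDK finds nothing; raising it to 1 recovers FMQLEK at position 3, which differs from the query at a single residue.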