| |
|
Scores in Scansite start at 0.000 if
the sequence optimally matches the given motif, and the scores
increase for sequences as they diverge from the optimal match.
The score histograms and the percentile are the main ways
to tell how good a score is for a given motif. If your site
ranks in the best 0.2% of all sites, that is quite good (this
is the "high stringency" threshold value we use for most
of the motifs). The Z score will also tell you how many standard
deviations from the mean it is (the Z score will be negative
because lower Scansite scores are better). For database searches,
the percentile is calculated from all proteins you searched
(a whole database, for example). For the Motif Scan program,
the percentile is calculated from a reference database search,
which is the vertebrate section of the Swiss-Prot database.
It may be worthwhile to consider the
surface accessibility of the predicted site, as shown by
the plot below the graph of sites found. A site that can
actually be reached by an interacting protein is better
than a buried one. However, our surface accessibility is
calculated from the sequence, and not with reference to
any known structures, so its accuracy may be limited.
Another way to assess the validity of a
site found is to check whether it is conserved in related
organisms. Clicking on the sequence will show you the position
of the site within the protein, and on that page is a convenient
link to submit the site to BLAST.
|
|
| |
|
To use your own binding motif in a database search, you
will need to define it in a text file and import the text
file into Scansite. The format is a matrix of typically 20
columns and 15 rows. The columns are scores for the 20 amino
acids at each position, and the 15 rows are positions of
each residue in the binding pocket. Scansite requires one
residue to be invariant in the motif sequence, such as a
tyrosine for motifs recognized by tyrosine-kinases. (For
motifs that recognize more than one residue in this position,
such as kinases that phosphorylate both serine and threonine,
multiple residues can be treated as equivalent in the fixed
position.) The middle position, row 8, holds the fixed residue
in the matrix. Rows 1 through 7 hold the scores for positions
to the left (N-terminal side) of the fixed position, and
rows 9 through 15 hold the scores for those to the right
(C-terminal) side. Finally, at the top of this 20 by 15 matrix
should be a row of column headings indicating the one-letter
amino acid code.
The actual number of columns in the
matrix can be higher or lower than 20. If your peptide
library screen did not include some residues, such as cysteine
or tryptophan, these columns can be left out of your matrix
and Scansite will assign them default values of 1 everywhere
except in the fixed position, where the default score is
0. Alternatively, if your motif targets the C terminus
(as PDZ domains do), you can include an extra column giving
scores for the C terminus, labeled with the symbol "*" (asterisk). The N terminus can
be given scores in a column labeled "$" (dollar sign). (When
absent, Scansite assigns the N and C terminus columns values
of zero by default.) Selenoprotein researchers can add a
column of scores labeled "U" for selenocysteine (the U symbol
is present in recent releases of the Genpept database). When
this column is absent, Scansite scores selenocysteines as
cysteines. Finally, columns can be included for the degenerate
symbols "B" (aspartate/asparagine), "Z" (glutamate/glutamine),
and "X" (any residue) to score sequences containing these
symbols in the public databases, although there is probably
no research reason to include these in an input matrix. As
a result of all these options, an input matrix can contain
as many as 26 columns, or even only 1 column naming the fixed
residue (such a matrix would not be very informative, however).
The order of the residue columns is not important ("A" does
not have to be first, and so on).
An example will make this more clear.
Below is a sample matrix. The full matrix might have 21
columns in this case (20 residues plus the C terminus),
but only seven are shown here in order to fit well on the
page. The first line in the text file contains the column
headers (residue letters and the asterisk). These should
be separated by tabs. In practice, it is easiest to do
this by creating the whole matrix in a spreadsheet program
and saving it as a tab-delimited text file. The next 15
lines are the 15 positions relative to the fixed center
at row 8. This motif is specific for tyrosine (or phosphotyrosine)
in the center position, so tyrosine is the fixed residue.
It must be given the value 21 in row 8 for Scansite to
recognize it as the fixed residue. All other values in
row 8 must be zero. (For multiple residue possibilities
at the fixed position, each of them must have the value
21 and the remaining residues must all be zero.) This motif
also requires the C terminus to be four residues away from
the center tyrosine (so the column "*" is given
the value 21 at that position). In all the other rows, scores
are given to represent the relative binding affinity for
each residue type at each position in the sequence.
example:
| A |
C |
D |
E |
... |
W |
Y |
* |
| 1 |
1 |
1 |
1 |
... |
1 |
1 |
0 |
| 1 |
1 |
1 |
1 |
... |
1 |
1 |
0 |
| 1.63 |
0.29 |
9.08 |
9.4 |
... |
2.91 |
3.67 |
0 |
| 8.88 |
0.83 |
0.28 |
3.92 |
... |
5.93 |
1.45 |
0 |
| 1.56 |
5.81 |
6.49 |
3.31 |
... |
5.35 |
7.07 |
0 |
| 0.11 |
5.57 |
15.0 |
0.03 |
... |
0.31 |
4.95 |
0 |
| 6.79 |
0.38 |
8.2 |
2.17 |
... |
3.37 |
0.59 |
0 |
| 0 |
0 |
0 |
0 |
... |
0 |
21 |
0 |
| 5.12 |
3.61 |
5.53 |
1.91 |
... |
8.63 |
7.03 |
0 |
| 2.18 |
4.53 |
5.52 |
6.11 |
... |
8.94 |
0.71 |
0 |
| 0.63 |
4.07 |
5.34 |
2.51 |
... |
4.49 |
3.63 |
0 |
| 4.74 |
5.9 |
1.56 |
3.1 |
... |
2.39 |
6.82 |
0 |
| 0 |
0 |
0 |
0 |
... |
0 |
0 |
21 |
| 1 |
1 |
1 |
1 |
... |
1 |
1 |
0 |
| 1 |
1 |
1 |
1 |
... |
1 |
1 |
0 |
|
The scoring system ranges from 0 to
roughly 21. Giving an individual amino acid a score of
1 at one position in the motif indicates that no preference
exists, positive or negative, for that particular amino
acid in that position. Giving all amino acids in one position
of the motif a score of 1 (i.e. making all values in a
single row of the matrix equal to one, as in the first
two rows in the table shown above) indicates no preference
exists for any particular residue type at that position
in the motif. A score of 15 means that residue has 15 chances
out of 21 to be the selected residue, which is a very strong
affinity. It is not necessary to normalize the scores at
one position to add up to 21; Scansite will normalize them
during program execution. Values higher than 21 are permitted
if you wish to indicate very strong affinities. Values
between 0 and 1 can be used for very low affinities, but
negative numbers cannot. Beware that the scoring function
uses natural logarithms, so values less than 1, particularly
those less than 0.5, strongly penalize for that particular
residue in a motif. In fact, the penalty of negative selection
from a matrix value of 0.1 in the final score is equivalent
(though opposite) to the positive selection obtained with
a value of 10 for another residue in the motif. Consequently,
choose values less than 1 (other than 0) carefully. This
matrix favors the C terminus, but for motifs not showing
any special affinity for the N or C terminus, scores in the "$" and "*" columns
(if present) should be zero. No commas or other punctuation
should separate matrix values. Only tabs are permitted between
columns.
While the center position must have a fixed residue, it
is also possible to fix other positions as well (as done
in this example for the C terminus). As another example,
suppose your motif requires aspartate in row 5 (three residues
N-terminal to the fixed center). To do this, give that aspartate
the value 21 in row 5, and make all other amino acid values
equal to 0. Conversely, you should avoid giving residues
the special score of 21 if they are not intended to be fixed
positions; use values like 20.9 or 21.1 instead.
|
|