Scansite Tutorial

Protein ID Input

Using the Scansite batch submission program starts with one or more input protein IDs or sequences. For example, the human p53 protein can be submitted using any of the identifiers below:

P53_HUMAN (Swiss-Prot ID)
P04637 (Swiss-Prot accession)
AAA61212 (GenPept accession)
339816 (GenPept GI)
CAA25652 (GenPept accession)
642241 (GenPept GI)
NP_000537 (RefSeq accession)
8400738 (RefSeq GI)
DNHU53 (PIR accession)
ENSP00000269305 (Ensembl accession)

To submit p53 to Scansite, use one of these IDs and a code indicating which database to retrieve the sequence from. The codes are "ST" (Swiss-Prot/TrEMBL), "GP" (GenPept), "RS" (RefSeq), "PIR" (Protein Information Resource), and "EN" (Ensembl). Scansite recognizes identifiers from all these databases. You must upload a text file containing one protein entry per line. Each entry consists of a protein ID, a space, and a database code. (The space could just as easily be several spaces or a tab character.) Here are a few of the ways p53 would be entered:

P53_HUMAN ST
AAA61212 GP
8400738 RS
ENSP00000269305 EN
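
Before uploading, it can help to check that every line of the ID file follows the format above. The following is a minimal Python sketch, not part of Scansite itself. The function names `parse_id_line` and `format_id_file` are hypothetical, and the sketch assumes the database codes must be written in upper case exactly as listed above.

```python
# Database codes listed in the tutorial text.
VALID_CODES = {"ST", "GP", "RS", "PIR", "EN"}


def parse_id_line(line):
    """Split one input line into (protein_id, database_code)."""
    # str.split() with no argument treats any run of spaces or tabs
    # as a single separator, matching the format described above.
    fields = line.split()
    if len(fields) != 2:
        raise ValueError(f"expected 'ID CODE', got: {line!r}")
    protein_id, code = fields
    # Assumption: codes are case-sensitive and upper case.
    if code not in VALID_CODES:
        raise ValueError(f"unknown database code {code!r} in: {line!r}")
    return protein_id, code


def format_id_file(entries):
    """Return file contents: one 'ID CODE' entry per line."""
    return "".join(f"{protein_id} {code}\n" for protein_id, code in entries)


if __name__ == "__main__":
    raw_lines = ["P53_HUMAN ST", "AAA61212\tGP", "8400738   RS"]
    entries = [parse_id_line(line) for line in raw_lines]
    print(format_id_file(entries), end="")
```

The normalized text returned by `format_id_file` can be saved to a file and uploaded as described below.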

Here is a sample file of twenty protein IDs. Save this on your computer and give it a name (such as "twenty.ids"). To run these through Scansite, go to the Scansite home page, find the heading "Scan a List of Protein IDs or Accession Numbers", and enter the path to "twenty.ids" in the input for "File of Protein IDs:". It is usually easier to click the "Browse..." button and navigate to this file.

Next, select a stringency level to use. The stringency level determines how closely a potential site has to match the motif description in order to be reported. The default setting is "High stringency", which requires the closest match and results in few false positives. However, the high stringency setting sometimes overlooks known sites, so you may occasionally want to use "Medium" or "Low" stringency settings, though you will get more false positives this way.

With the input file specified and the stringency setting selected, click "Submit". You will see a new page headed "Scansite Job Submitted", which gives an estimate of the time needed to complete the job. This is usually less than a minute. Click the "Check Results" button to see if the job is finished. If it is, a page headed "Scansite Job Complete" will be displayed, and a link called "Results" will be presented for you to view or download. A link called "Input" also lets you download your input file, making it easy for you to store the input and output files together.

The format of the output is a tab-delimited text file containing eight columns, as shown below:

O19594 DNA_PK DNA_dam_kin S50 0.2369 0.116% PVAEYWNSQKDILED 3.119
O19594 PDZ_class2 PDZ F76 0.1568 0.000% NYGVGESFTV*     1.017
33383432 Fyn_SH2 SH2 Y69 0.1328 0.140% DIFTGKKYEDICPST 2.911
AAH10698 PKC_delta Baso_ST_kin S362 0.2307 0.130% KTFTKKESMKIASSV 1.476
AAF14000 PDZ_class1 PDZ S349 0.2825 0.000% NRNLVQFSRL*     1.534
21264341 ATM_Kin DNA_dam_kin S101 0.3120 0.146% EERMKELSQDSTGRV 1.700
21264341 Grb2_SH2 SH2 Y147 0.2358 0.189% VYDEDSPYQNIKILH 1.508
21264341 Nck_SH2 SH2 Y141 0.1460 0.034% YDIDEVVYDEDSPYQ 0.955
A57147 PDZ_class1 PDZ T313 0.2721 0.000% RRIVIPSTLA*     1.177
AAD33991 Grb2_SH2 SH2 Y83 0.2393 0.208% DLGTLRGYYNQSEDG 3.154
CAD86576 PKA_Kin Baso_ST_kin S41 0.0806 0.138% GDRGRRKSRFALYKR 2.061
CAD86576 PKC_mu Baso_ST_kin T88 0.2926 0.176% YTLSRNQTVVVEYTH 0.391
AI2191 Erk1_Kin Pro_ST_kin S334 0.2706 0.027% ACHSGKLSPSPILLA 1.623
AI2191 PDZ_class1 PDZ S385 0.1585 0.000% QRLIPDVSLV*     0.480
AI2191 PDZ_nNOS_1 PDZ S385 0.2083 0.000% QRLIPDVSLV*     0.480

Here is a description of the information in the eight columns, in order:

  1. The protein ID, as submitted in your input file.
  2. The motif found, and thus a predicted interaction. For example, the motif "Fyn_SH2" indicates that a tyrosine residue on the input protein, once phosphorylated, is recognized by the SH2 domain of the kinase Fyn.
  3. The motif family. Scansite organizes motifs of similar types into families. For example, the proline-directed serine/threonine kinases group (Pro_ST_kin) is currently composed of Cdc2, Cdk5, Erk1, and p38 MAPK. These have similar motifs, and a predicted interaction with one of them may involve one of the others instead.
  4. The site found, such as "Y69".
  5. The calculated score. Lower numbers indicate better matches. A score of 0.000 means the site matches the motif description perfectly; the score increases when lower-scoring residues are substituted at positions in the motif.
  6. The percentile rank of this site compared to others in a reference set. Lower numbers mean better specificity; a percentile of 0.130% indicates that this site is a better match to the motif description than 99.870% of potential sites in the reference data. The reference used is the vertebrate category of Swiss-Prot, chosen for its low redundancy. Specifically, for a motif with a central serine or threonine, all serine and threonine residues in all vertebrate Swiss-Prot proteins are scored as potential sites. A low percentile indicates the site is very rare in the vertebrate proteome.
  7. The 15-mer sequence surrounding the site. If the residues are numbered 1 to 15, the "site" is the central residue, at position 8. For sites very near the N or C termini of the protein, the N or C termini are indicated with the characters "$" (for N terminus) or "*" (for C terminus). If there are still fewer than 15 residues to display, spaces fill in the rest of the positions.
  8. The calculated surface accessibility. This is calculated from the relative hydrophobicity of nearby residues, and is intended to help judge whether the site found is near the protein surface and thus available for an interaction. Sites with values lower than 1.0 are typically buried, whereas higher values indicate increasingly hydrophilic regions that are likely to be near the surface. This calculation is not foolproof, but is useful when no protein structure is available.
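
The percentile and 15-mer conventions in columns 6 and 7 can be illustrated with a short Python sketch. The function names `describe_window` and `percent_outscored` are hypothetical. The sketch assumes that space padding always keeps the site at position 8, including for sites near the N terminus; the sample output above only confirms this for C-terminal sites.

```python
def describe_window(site, window):
    """Interpret the 15-character window (column 7) for a site such as 'Y69'."""
    residue, position = site[0], int(site[1:])
    if len(window) != 15:
        raise ValueError("window must be exactly 15 characters")
    # Position 8 in 1-based numbering is index 7 in Python.
    if window[7] != residue:
        raise ValueError("central residue does not match the site")
    return {
        "residue": residue,
        "position": position,
        "near_n_terminus": "$" in window,  # "$" marks the N terminus
        "near_c_terminus": "*" in window,  # "*" marks the C terminus
        "residues_shown": sum(ch.isalpha() for ch in window),
    }


def percent_outscored(percentile_text):
    """Convert e.g. '0.130%' into the percentage of reference sites outscored."""
    return 100.0 - float(percentile_text.rstrip("%"))


if __name__ == "__main__":
    # Window taken from the sample output, padded with spaces to 15 characters.
    print(describe_window("F76", "NYGVGESFTV*    "))
    print(f"{percent_outscored('0.130%'):.3f}%")
```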

If you want to analyze the Scansite results in a program, you can easily parse this output file to read the data into program variables. Here is an example in Perl.
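
For readers who prefer Python, the same parsing can be sketched as follows. This is an illustrative sketch rather than code supplied by Scansite: the `Hit` type and the `parse_results` function are hypothetical names, and the sketch assumes every data line holds exactly eight tab-separated fields as described above.

```python
import io
from typing import NamedTuple


class Hit(NamedTuple):
    protein_id: str
    motif: str
    family: str
    site: str
    score: float
    percentile: float      # in percent, e.g. 0.116 for "0.116%"
    window: str            # 15 characters, may contain "$", "*" or spaces
    accessibility: float


def parse_results(handle):
    """Read tab-delimited Scansite output from an open text file."""
    hits = []
    for line in handle:
        # Strip only the line ending so padding spaces in column 7 survive.
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 8:
            continue  # skip blank or malformed lines
        pid, motif, family, site, score, pct, window, access = fields
        hits.append(Hit(pid, motif, family, site, float(score),
                        float(pct.rstrip("%")), window, float(access)))
    return hits


# Three rows copied from the sample output, with explicit tab separators.
SAMPLE = (
    "O19594\tDNA_PK\tDNA_dam_kin\tS50\t0.2369\t0.116%\tPVAEYWNSQKDILED\t3.119\n"
    "O19594\tPDZ_class2\tPDZ\tF76\t0.1568\t0.000%\tNYGVGESFTV*    \t1.017\n"
    "CAD86576\tPKC_mu\tBaso_ST_kin\tT88\t0.2926\t0.176%\tYTLSRNQTVVVEYTH\t0.391\n"
)

if __name__ == "__main__":
    hits = parse_results(io.StringIO(SAMPLE))
    # Keep sites whose accessibility suggests they are not buried.
    for hit in (h for h in hits if h.accessibility >= 1.0):
        print(hit.protein_id, hit.motif, hit.site, hit.percentile)
```

To read a downloaded results file instead of the embedded sample, pass an open file object, for example `parse_results(open(path))` with `path` pointing to the saved output.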

Sequence Input

The above instructions assume you are entering a list of protein IDs. You can just as easily enter a list of raw sequences. In this case, each line of the input file should consist of a protein name (any name you want), a space, and the sequence using the standard single-letter amino acid code. "X" can be used for unknown residues, and "U" can be used for selenocysteine. One restriction on the protein name is that it should not contain spaces, or the line will not be parsed correctly by Scansite. Here is what the format should look like:

MyProtein MGHHHHHHDYDIPTTENLYFQGAHMGIQRPTSTSSLVAAASRGSLEACGTKLGCFGG
Protein2 VVFDEDEIPSGVDVAKISMDEQDLLNGAGETYEVALTEPGTYSFYCAPHQGAGMVGKVTVN
Fragment5 MKRFLFLLLTISLLVMVQIQTGLSGQNDTSQTSSPSASSSMSGGIFLFFVANAIIHLFCFSL
C_terminus MSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGESL
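
A sequence file in this format can be checked or generated with a short script. The Python sketch below is illustrative only; `format_sequence_line` is a hypothetical helper. It assumes the accepted alphabet is the twenty standard amino acid letters plus "X" and "U", which are the only codes this tutorial mentions, and it converts sequences to upper case to match the examples above.

```python
# Twenty standard amino acids plus "X" (unknown) and "U" (selenocysteine).
# Assumption: no other letters are accepted.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY" + "XU")


def format_sequence_line(name, sequence):
    """Return one 'name sequence' input line, or raise ValueError."""
    if not name or any(ch.isspace() for ch in name):
        raise ValueError(f"protein name must be non-empty without spaces: {name!r}")
    # Remove line breaks or spaces left over from wrapped sequences.
    cleaned = "".join(sequence.split()).upper()
    if not cleaned:
        raise ValueError(f"empty sequence for {name!r}")
    unexpected = sorted(set(cleaned) - ALLOWED)
    if unexpected:
        raise ValueError(f"unexpected characters in {name!r}: {unexpected}")
    return f"{name} {cleaned}"


if __name__ == "__main__":
    proteins = {
        "C_terminus": "MSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGESL",
        "Wrapped": "MGHHHHHH\nDYDIPTTENLYFQG",
    }
    for name, sequence in proteins.items():
        print(format_sequence_line(name, sequence))
```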

Here is an example file you can run. Save it on your computer and give it a name, such as "twenty.seq". To submit these to Scansite, go to the Scansite home page, find the heading "Scan a List of Protein Sequences", and enter the path to "twenty.seq" in the input for "File of Protein Sequences:". It is usually easier to click the "Browse..." button and navigate to this file. Then select the desired stringency (see discussion above) and click "Submit". The procedure after this point is identical to the entry-by-IDs version described above, and the output format is the same as well.