Scansite Tutorial

Protein ID Input

Using the Scansite batch submission program starts with one or more input protein IDs or sequences. For example, the human p53 protein can be submitted using any of the identifiers below:

    P53_HUMAN (Swiss-Prot ID)
    P04637 (Swiss-Prot accession)
    AAA61212 (GenPept accession)
    339816 (GenPept GI)
    CAA25652 (GenPept accession)
    642241 (GenPept GI)
    NP_000537 (RefSeq accession)
    8400738 (RefSeq GI)
    DNHU53 (PIR accession)
    ENSP00000269305 (Ensembl accession)

To submit p53 to Scansite, use one of these IDs together with a code indicating which database to retrieve the sequence from. The codes are "ST" (Swiss-Prot/TrEMBL), "GP" (GenPept), "RS" (RefSeq), "PIR" (Protein Information Resource), and "EN" (Ensembl). Scansite recognizes identifiers from all these databases.

You must upload a text file containing one protein entry per line. Each entry consists of a protein ID, a space, and a database code. (Several spaces or a tab character work equally well as the separator.) Here are a few of the ways p53 would be entered:

    P53_HUMAN ST
    AAA61212 GP
    8400738 RS
    ENSP00000269305 EN

Here is a sample file of twenty protein IDs. Save this on your computer and give it a name (such as "twenty.ids"). To run these through Scansite, go to the Scansite home page, find the heading "Scan a List of Protein IDs or Accession Numbers", and enter the path to "twenty.ids" in the input for "File of Protein IDs:". It is usually easier to click the "Browse..." button and navigate to this file.

Next, select a stringency level. The stringency level determines how closely a potential site has to match the motif description in order to be reported. The default setting is "High stringency", which requires the closest match and results in few false positives. However, the high stringency setting sometimes overlooks known sites, so you may occasionally want to use the "Medium" or "Low" stringency settings, at the cost of more false positives.
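Before uploading, it can help to check that every line of the ID file follows the format described above: one ID, whitespace, one database code. The following is a minimal Python sketch of such a check, not part of Scansite itself. The function names are hypothetical, and two behaviors are assumptions rather than documented Scansite rules: database codes are treated as case-sensitive, and blank lines are skipped.

```python
# Database codes listed in the tutorial.
DB_CODES = {"ST", "GP", "RS", "PIR", "EN"}


def parse_id_line(line):
    """Split one input line into (protein_id, db_code).

    Raises ValueError if the line does not have exactly two fields
    or if the database code is not one of the five listed codes.
    """
    fields = line.split()  # splits on any run of spaces or tabs
    if len(fields) != 2:
        raise ValueError(f"expected 'ID CODE', got: {line!r}")
    protein_id, db_code = fields
    if db_code not in DB_CODES:  # assumption: codes are case-sensitive
        raise ValueError(f"unknown database code {db_code!r} in: {line!r}")
    return protein_id, db_code


def check_id_file(text):
    """Return (protein_id, db_code) pairs for every non-blank line of text."""
    return [parse_id_line(line) for line in text.splitlines() if line.strip()]


# The p53 entries from the tutorial, separated by a space, a tab, or several spaces.
sample = "P53_HUMAN ST\nAAA61212\tGP\n8400738   RS\nENSP00000269305 EN\n"
entries = check_id_file(sample)
```

To check a saved file such as "twenty.ids", read it with `open("twenty.ids").read()` and pass the text to `check_id_file`; a `ValueError` points at the first malformed line.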
With the input file specified and the stringency setting selected, click "Submit". You will see a new page headed "Scansite Job Submitted", which gives an estimate of the time needed to complete the job. This is usually less than a minute. Click the "Check Results" button to see if the job is finished. If it is, a page headed "Scansite Job Complete" will be displayed, with a link called "Results" that lets you view or download the output. A link called "Input" also lets you download your input file, making it easy to store the input and output files together.

The output is a tab-delimited text file containing eight columns, as shown below:

    O19594    DNA_PK      DNA_dam_kin  S50   0.2369  0.116%  PVAEYWNSQKDILED  3.119
    O19594    PDZ_class2  PDZ          F76   0.1568  0.000%  NYGVGESFTV*      1.017
    33383432  Fyn_SH2     SH2          Y69   0.1328  0.140%  DIFTGKKYEDICPST  2.911
    AAH10698  PKC_delta   Baso_ST_kin  S362  0.2307  0.130%  KTFTKKESMKIASSV  1.476
    AAF14000  PDZ_class1  PDZ          S349  0.2825  0.000%  NRNLVQFSRL*      1.534
    21264341  ATM_Kin     DNA_dam_kin  S101  0.3120  0.146%  EERMKELSQDSTGRV  1.700
    21264341  Grb2_SH2    SH2          Y147  0.2358  0.189%  VYDEDSPYQNIKILH  1.508
    21264341  Nck_SH2     SH2          Y141  0.1460  0.034%  YDIDEVVYDEDSPYQ  0.955
    A57147    PDZ_class1  PDZ          T313  0.2721  0.000%  RRIVIPSTLA*      1.177
    AAD33991  Grb2_SH2    SH2          Y83   0.2393  0.208%  DLGTLRGYYNQSEDG  3.154
    CAD86576  PKA_Kin     Baso_ST_kin  S41   0.0806  0.138%  GDRGRRKSRFALYKR  2.061
    CAD86576  PKC_mu      Baso_ST_kin  T88   0.2926  0.176%  YTLSRNQTVVVEYTH  0.391
    AI2191    Erk1_Kin    Pro_ST_kin   S334  0.2706  0.027%  ACHSGKLSPSPILLA  1.623
    AI2191    PDZ_class1  PDZ          S385  0.1585  0.000%  QRLIPDVSLV*      0.480
    AI2191    PDZ_nNOS_1  PDZ          S385  0.2083  0.000%  QRLIPDVSLV*      0.480

Here is a description of the information in the eight columns, in order:
If you want to analyze the Scansite results in a program, you can easily parse this output file to read the data into program variables. Here is an example in Perl.

Sequence Input

The above instructions assume you are entering a list of protein IDs. You can just as easily enter a list of raw sequences. In this case, each line of the input file should consist of a protein name (any name you want), a space, and the sequence in the standard single-letter amino acid code. "X" can be used for unknown residues, and "U" can be used for selenocysteine. The protein name must not contain spaces, or Scansite will not parse the line correctly. Here is what the format should look like:

    MyProtein MGHHHHHHDYDIPTTENLYFQGAHMGIQRPTSTSSLVAAASRGSLEACGTKLGCFGG
    Protein2 VVFDEDEIPSGVDVAKISMDEQDLLNGAGETYEVALTEPGTYSFYCAPHQGAGMVGKVTVN
    Fragment5 MKRFLFLLLTISLLVMVQIQTGLSGQNDTSQTSSPSASSSMSGGIFLFFVANAIIHLFCFSL
    C_terminus MSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGESL

Here is an example file you can run. Save it on your computer and give it a name, such as "twenty.seq". To submit these to Scansite, go to the Scansite home page, find the heading "Scan a List of Protein Sequences", and enter the path to "twenty.seq" in the input for "File of Protein Sequences:". It is usually easier to click the "Browse..." button and navigate to this file. Then select the desired stringency (see discussion above) and click "Submit". From this point on, the procedure is identical to the ID-based submission described above, and the output format is the same as well.
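Because ID-based and sequence-based runs produce the same tab-delimited output, a single parser can cover both. The following is a minimal sketch in Python (rather than the Perl mentioned earlier) that reads each eight-column row into a named record. The field names are illustrative guesses based on the sample rows, not names defined by Scansite, and the last column is left generically named because its meaning is not given in this text.

```python
import csv
import io
from typing import NamedTuple


class Hit(NamedTuple):
    # Field names are assumptions inferred from the sample output rows.
    protein: str        # protein ID or user-supplied name
    motif: str          # e.g. "Grb2_SH2"
    group: str          # e.g. "SH2"
    site: str           # residue letter plus position, e.g. "Y147"
    score: float
    percentile: float   # stored without the trailing "%"
    sequence: str       # sequence around the site
    last_column: float  # eighth column; meaning not stated here


def parse_results(handle):
    """Read tab-delimited Scansite output from an open text handle."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:  # skip blank lines
            continue
        if len(row) != 8:
            raise ValueError(f"expected 8 fields, got {len(row)}: {row!r}")
        hits.append(Hit(
            row[0], row[1], row[2], row[3],
            float(row[4]),
            float(row[5].rstrip("%")),
            row[6],
            float(row[7]),
        ))
    return hits


# Two rows copied from the sample output, joined with tabs.
sample = (
    "O19594\tDNA_PK\tDNA_dam_kin\tS50\t0.2369\t0.116%\tPVAEYWNSQKDILED\t3.119\n"
    "O19594\tPDZ_class2\tPDZ\tF76\t0.1568\t0.000%\tNYGVGESFTV*\t1.017\n"
)
hits = parse_results(io.StringIO(sample))
```

For a downloaded results file, pass an open file handle instead of the `StringIO` object, for example `parse_results(open(path, newline=""))`, where `path` is wherever you saved the "Results" link.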