Screen Scraping

1: Analyzing the web page

First you will try using a screen scraper to access the NNPREDICT secondary structure prediction server. NNPREDICT (http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html) is a nice example because it runs quickly and the HTML is very simple.

Unfortunately, the original server is no longer available, so I have created a stand-in version. It returns a random prediction rather than a real secondary structure prediction, but its input form and its output use exactly the same format as the original NNPREDICT. You can try the web page and download the Perl screen scraper.

First try out the server. Enter a protein sequence (a random series of letters will do) and observe the results. Also make a note of the URL.

Once you have tried the server, you need to record the information about the form on the web page. Take a look at the HTML source code for the page:

<html><head>
<title>nnpredict input form</title>
</head>
<body bgcolor="F2F2F2">
<H1 align=center>NNPREDICT<br>Protein Secondary Structure Prediction</H1>
<H2 align=center><IMG SRC="protein.gif"></H2>
<H3 align=center>Enter a protein sequence and nnpredict will predict the secondary structure.</H3>
<H3 align=center><a href="http://www.cmpharm.ucsf.edu/~nomi/nnpredict-instrucs.html">Click here for instructions</a></H3>
<HR>
<form method="POST" action="./nnpredict.cgi">
<b>Tertiary structure class:</b>
<input TYPE="radio" NAME="option" VALUE="none" CHECKED> none 
<input TYPE="radio" NAME="option" VALUE="all-alpha"> all-alpha 
<input TYPE="radio" NAME="option" VALUE="all-beta"> all-beta 
<input TYPE="radio" NAME="option" VALUE="alpha/beta"> alpha/beta 
<dl>
  <dt><b>Name of sequence</b> 
	(optional)
  <dd><input name="name" size="70">
       <p>
<dt><b>Sequence</b>
<dd><i>(Use single-letter amino acid codes or three-letter codes separated by spaces)</i><br>
     <textarea name="text" rows=14 cols=70></textarea>
     <p>   
     <dd><input type="submit" value="Submit">
         <input type="reset" value="Clear">
</dl>
</form>
<hr>
<b>Other Local Homes</b><p>
<a href="http://www.cmpharm.ucsf.edu/cohen.html">Cohen Group Home</a> *
<a href="http://www.cmpharm.ucsf.edu/">Cohen Group Welcome </a>
<br>
<a href="http://www.pharm.ucsf.edu/">Department of Pharmaceutical Chemistry Home</a> *
<a href="http://www.ucsf.edu/">UCSF Home</a>
<hr>
<em>nnpredict was written by Donald Kneller<br>
Copyright (C) 1991 Regents of the University of California
<p>Web interface by <a href=http://www.fruitfly.org/~nomi/>Nomi Harris</a> (nomi@cgl.ucsf.edu, <A HREF="mailto:nlharris@lbl.gov">nlharris@lbl.gov</a>)
<P>
 (Mon Feb  5 15:32:54 1996)
</body> </html>

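Record the following information from the form:

The submission method (the method attribute of the form tag)
The URL the form is submitted to (the action attribute of the form tag)
The name of each input field, together with its permitted values where these are fixed

You should have obtained:

Method: POST
Action: ./nnpredict.cgi (a relative URL, resolved against the URL of the page you noted earlier)
Field "option": radio button, one of none, all-alpha, all-beta or alpha/beta (none is selected by default)
Field "name": optional name of the sequence
Field "text": the sequence itself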
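
As a cross-check on your notes, the same details can be pulled out of the HTML programmatically. The following is a minimal sketch in Python using only the standard library (the downloadable scraper for this tutorial is written in Perl; Python is used here purely for illustration). The class name FormFieldParser and the page URL are hypothetical: replace PAGE_URL with the URL you noted when you tried the server. The form markup is a trimmed copy of the page source shown above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormFieldParser(HTMLParser):
    """Collect the method, action and named inputs of the first form on a page."""

    def __init__(self):
        super().__init__()
        self.method = None
        self.action = None
        self.fields = {}       # field name -> list of values given in the HTML
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)    # tag and attribute names arrive lower-cased
        if tag == "form" and self.method is None:
            self._in_form = True
            self.method = attrs.get("method", "GET").upper()
            self.action = attrs.get("action", "")
        elif self._in_form and tag in ("input", "textarea", "select"):
            name = attrs.get("name")
            if name:           # the submit and reset buttons here have no name
                values = self.fields.setdefault(name, [])
                if attrs.get("value") is not None:
                    values.append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False


# Trimmed copy of the form from the page source above.
FORM_HTML = """
<form method="POST" action="./nnpredict.cgi">
<input TYPE="radio" NAME="option" VALUE="none" CHECKED> none
<input TYPE="radio" NAME="option" VALUE="all-alpha"> all-alpha
<input TYPE="radio" NAME="option" VALUE="all-beta"> all-beta
<input TYPE="radio" NAME="option" VALUE="alpha/beta"> alpha/beta
<input name="name" size="70">
<textarea name="text" rows=14 cols=70></textarea>
<input type="submit" value="Submit">
<input type="reset" value="Clear">
</form>
"""

# Hypothetical page URL: substitute the one you noted when trying the server.
PAGE_URL = "http://www.example.org/nnpredict/nnpredict.html"

parser = FormFieldParser()
parser.feed(FORM_HTML)

print("Method:", parser.method)
# The action is relative, so resolve it against the URL of the page.
print("Action:", urljoin(PAGE_URL, parser.action))
for name, values in parser.fields.items():
    print("Field:", name, values)
```

The radio buttons all share the name "option", so their values are gathered into a single list, while the free-text fields "name" and "text" have no preset values. The unnamed submit and reset buttons are skipped because a browser would not send them with the form data.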
