Using REST

Representational State Transfer


We will now look at using REST instead of screen scraping. This relies on the service provider offering a REST interface, but the idea is very simple: for our purposes, a REST interface is a URL (here served by a CGI script) that returns content in a simple, easy-to-parse, guaranteed format. This might be XML, JSON, or keyword: value pairs.

We will use the REST services provided by the PDBSWS server available at
http://www.bioinf.org.uk/servers/pdbsws/

PDBSWS is a server which provides a chain- and residue-level mapping between the sequences of structures in the Protein Data Bank (PDB) and the equivalent entries in UniProt. PDBSWS starts with the cross-links provided in the two databanks (i.e. UniProt codes provided in the PDB and PDB codes provided in UniProt). However, these can sometimes become outdated (particularly the cross-links from the PDB to UniProt), and the information is provided at the chain level, not at the level of individual amino acids. Why is this important? Because PDB files often do not include the whole protein sequence present in UniProt (i.e. only part of the protein has been solved as a crystal structure), and the numbering in the PDB file often does not account for this. This may occur for genuine biological reasons (e.g. part of the protein is naturally cleaved off to create the active form) or for practical reasons (e.g. part of the protein is flexible, making it difficult to crystallize, so it is removed artificially). Thus, for example, residue number 5 in the PDB structure file might be the 20th residue in the UniProt sequence file.

Suppose we want to write some code that takes information from a PDB file about a residue (perhaps identifying residues that are on the surface, or within a certain distance of a binding site) and then looks up information about that residue in UniProtKB/SwissProt. To do this, we need the correct UniProt accession code and residue number.

For this example, we are going to look at hen egg white lysozyme, a very widely studied protein for which many structures have been solved. The first 18 residues form a signal peptide which is cleaved off, so all the structures start with residue 19 of the UniProt sequence, but this is almost invariably numbered as residue 1 in the structures. Consequently, you need to add 18 to all the PDB residue numbers to get the UniProt residue number.
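The arithmetic for this particular protein can be sketched in a few lines of Python. The function name and constant below are hypothetical, and the fixed offset of 18 is an assumption that only holds for structures like these, where the PDB numbering starts at 1 immediately after the signal peptide and has no gaps or insertions.

```python
# Hypothetical constant: length of the hen egg white lysozyme signal peptide.
# This offset is specific to this protein; other proteins need other values.
SIGNAL_PEPTIDE_LENGTH = 18


def pdb_to_uniprot_resnum(pdb_resnum, offset=SIGNAL_PEPTIDE_LENGTH):
    """Convert a PDB residue number to a UniProt residue number.

    Assumes the PDB numbering is a simple shift of the UniProt numbering,
    which is not true in general.
    """
    return pdb_resnum + offset


print(pdb_to_uniprot_resnum(1))   # 19
print(pdb_to_uniprot_resnum(35))  # 53
```

A hard-coded offset like this breaks as soon as a structure is numbered differently, which is why a proper residue-level mapping is needed.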

This is where PDBSWS comes in. You can enter a PDB code and (optionally) a residue number to obtain the correct UniProt accession code and residue number.

First try out the server. In the Query by PDB code section, enter the PDB code 1bwi and press the Submit button.

You will see the mapping to the UniProt accession (AC) and identifier (ID) and an arrow to the right which will display the alignment.

In the alignment you will see the Sequential Number of the residue in the PDB file, the Residue Number as it appears in the PDB file and the Residue Number in the UniProt file.

Return to the server. Again, in the Query by PDB code section, enter the PDB code 1bwi, but also enter the Residue ID 35 and press the Submit button.

This time you will also see the PDB Residue ID you entered (35) and the corresponding UniProt residue number (53). If you look at the UniProt entry, you will find that this residue (residue 53) is part of the active site. (This information could be extracted computationally from the downloadable UniProt data files.)

You could get straight to the same page of information from PDBSWS by putting the parameters in the URL:
http://www.bioinf.org.uk/servers/pdbsws/query.cgi?qtype=pdb&id=1bwi&res=35
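If you want to construct such URLs from a program, a minimal Python sketch might look like the following. The function name build_query_url is hypothetical; the parameter names (qtype, id, res) are taken from the URL above. Leaving out res when no residue is wanted is an assumption based on the residue number being optional on the web form.

```python
from urllib.parse import urlencode

# Base address of the query script, taken from the URL shown above
BASE_URL = "http://www.bioinf.org.uk/servers/pdbsws/query.cgi"


def build_query_url(pdb_code, resid=None):
    """Build a PDBSWS query URL for a PDB code and an optional residue ID."""
    params = {"qtype": "pdb", "id": pdb_code}
    if resid is not None:
        # Assumption: 'res' can simply be omitted when no residue is given
        params["res"] = resid
    # urlencode joins the key=value pairs with '&' and escapes special characters
    return f"{BASE_URL}?{urlencode(params)}"


print(build_query_url("1bwi", 35))
# http://www.bioinf.org.uk/servers/pdbsws/query.cgi?qtype=pdb&id=1bwi&res=35
```

Using urlencode rather than joining strings by hand means any unusual characters in the values are escaped correctly.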

Clearly you could write a screen scraper to obtain the information from the resulting table, but PDBSWS also provides a REST interface, simply by adding the parameter plain=1 to the URL.

Add plain=1 to the URL and you will see the same information in a simple keyword: value format:
http://www.bioinf.org.uk/servers/pdbsws/query.cgi?qtype=pdb&id=1bwi&res=35&plain=1

The results should look something like this:

PDB: 1bwi
CHAIN: A
RESID: 35
PDBAA: E
AC: P00698
ID: LYSC_CHICK
UPCOUNT: 53
UPAA: E
//

As you see, this is a very simple format that is easy to parse.
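To illustrate how little code is needed, here is a minimal Python sketch that fetches and parses this output. The function names fetch_plain and parse_plain are hypothetical. The parser assumes each line is a keyword: value pair and that a line containing only // ends a record, as in the example above. The assumption that the response is UTF-8 text, and the handling of more than one record, are guesses and should be checked against the real server.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://www.bioinf.org.uk/servers/pdbsws/query.cgi"


def fetch_plain(pdb_code, resid, timeout=30):
    """Fetch the plain-text result for a PDB code and residue ID."""
    params = urlencode({"qtype": "pdb", "id": pdb_code,
                        "res": resid, "plain": 1})
    with urlopen(f"{BASE_URL}?{params}", timeout=timeout) as response:
        # Assumption: the server returns UTF-8 (or plain ASCII) text
        return response.read().decode("utf-8")


def parse_plain(text):
    """Parse 'keyword: value' lines into a list of dictionaries.

    A line containing only '//' is treated as the end of a record.
    """
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "//":
            if current:
                records.append(current)
            current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that are not keyword: value pairs
            current[key.strip()] = value.strip()
    if current:  # keep a final record that has no '//' terminator
        records.append(current)
    return records


# Parse the example output shown above (no network access needed)
SAMPLE = """PDB: 1bwi
CHAIN: A
RESID: 35
PDBAA: E
AC: P00698
ID: LYSC_CHICK
UPCOUNT: 53
UPAA: E
//
"""

records = parse_plain(SAMPLE)
print(records[0]["AC"], records[0]["UPCOUNT"])  # P00698 53
```

To query the live server instead of the sample text, you would call parse_plain(fetch_plain("1bwi", 35)). Note that all values are returned as strings, so UPCOUNT needs converting with int() before doing any arithmetic with it.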
