Primary databanks

Primary databanks store raw data (such as sequence information). This may be at the DNA or protein level, or may be other data such as interactions or expression levels. Examples include:

GenBank, EMBL, DDBJ - DNA sequences
SwissProt, TrEMBL, PIR - Protein sequences
PDB - protein structure

They can be downloaded as flat text files and must be used with a suitable search tool to extract data. Examples of search tools are:

BLAST, PSI-BLAST, FASTA (Sequence searches)
GLIMPSE (Annotation text searches)

Practical work

You will now try the following searches:

DNA search against GenBank using BLAST
Protein search against GenPept using BLAST
Protein search against SwissProt using FASTA
A simple text search against SwissProt using Glimpse
A more complex text search against SwissProt using SRS

Searching SwissProt with FASTA

Here is a protein sequence which you might have derived from a sequencing experiment:

>myseq
MIKKIGVLTSGGDAPGMNAAIRGVVRSALTEGLEVMGIYDGYLGLYEDRMVQLDRYSVSD
MINRGGTFLGSARFPEFRDENIRAVAIENLKKRGIDALVVIGGDGSYMGAMRLTEMGFPC
IGLPGTIDNDIKGTDYTIGFFTALSTVVEAIDRLRDTSSSHQRISVVEVMGRYCGDLTLA
AAIAGGCEFVVVPEVEFSREDLVNEIKAGIAKGKKHAIVAITEHMCDVDELAHFIEKETG
RETRATVLGHIQRGGSPVPYDRILASRMGAYAIDLLLAGYGGRCVGIQNEQLVHHDIIDA
IENMKRPFKGDWLDCAKKLY

The sequence is shown in a format known as FASTA. The first line is a header introduced by a > sign with a label for the sequence. The following lines contain the sequence using the standard one-letter code. After the header line, formatting (line breaks and spaces) is ignored.
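To make the format rules concrete, here is a minimal Python sketch of a FASTA reader. It is an illustration only, not part of any of the tools used in this practical; the function name parse_fasta is our own. It treats each line starting with > as a header and joins the lines that follow, discarding line breaks and spaces.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (label, sequence) pairs."""
    records = []
    label = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines carry no information
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if label is not None:
                records.append((label, "".join(chunks)))
            label = line[1:].strip()
            chunks = []
        elif label is not None:
            # Drop any spaces inside the line; line breaks are already gone.
            chunks.append("".join(line.split()))
    if label is not None:
        records.append((label, "".join(chunks)))
    return records


# The first two lines of the query sequence shown above.
example = """>myseq
MIKKIGVLTSGGDAPGMNAAIRGVVRSALTEGLEVMGIYDGYLGLYEDRMVQLDRYSVSD
MINRGGTFLGSARFPEFRDENIRAVAIENLKKRGIDALVVIGGDGSYMGAMRLTEMGFPC
"""

records = parse_fasta(example)
label, sequence = records[0]
print(label, len(sequence))
```

Because spaces and line breaks are dropped, the same function also reads sequences laid out in blocks of ten letters, as is common for DNA.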

Start FASTA by clicking the following link: http://www.ebi.ac.uk/fasta33/

Paste the sequence in FASTA format (including the header line) into the sequence box.

Ensure you set the Database to UniProtKB/Swiss-Prot to search SwissProt.

Leave all options at their default settings and click the Run button to run the search. It may take some time to run (but usually no more than a few minutes) so be patient.

The page will keep refreshing until the results are ready.

Note: Sometimes the server may run particularly slowly, because lots of people are trying to access it at once. Alternatively, there may be a temporary problem meaning it is offline. There are alternative FASTA servers at other sites such as:

http://fasta.genome.jp/ (make sure you click 'Swiss-Prot' as the database to search).

You can use this server instead if the EBI server isn't working.

The results show you which sequences in the database are similar to the sequence you used for searching.

You should find that the top hit is to pfkA from E. coli. Since this protein sequence is already in the UniProt/SwissProt database, you will see that your sequence is 100% identical. You will also find that some other sequences are 100% identical. What does this suggest to you about these organisms?

Ask yourself what the other hits mean. In particular, look at the e-values. For a given score, the e-value indicates the number of hits that one would expect to see by chance with this score or better in this database. This accounts for the size and content of the database being searched. In other words, this is a measure of the chance that this is a false positive hit: the lower the value, the more likely it is that this is a genuine hit.

If you are unfamiliar with the notation used for small numbers, a value of 1.5e-6 means 1.5 x 10^-6 (0.0000015).

The fact that database size is taken into account can be confusing. For example, a hit found when searching the PDB might have an e-value of 0.0001, while exactly the same alignment found when searching nr (the non-redundant composite database maintained by the NCBI) might have an e-value of 0.1, simply because nr is so much bigger.
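The effect of database size can be sketched numerically. The example below assumes the BLAST-style relationship E = m x n x 2^-S, where m is the query length, n is the total length of the database and S is the bit score. FASTA estimates its statistics differently, so treat this as an illustration of the scaling rather than a description of what either server computes. The database sizes and the bit score are made-up values chosen only to show the 1000-fold effect described above.

```python
import math


def expected_hits(bit_score, query_length, database_length):
    """E-value under the assumed relationship E = m * n * 2**(-S)."""
    return query_length * database_length * 2.0 ** (-bit_score)


def chance_of_at_least_one(e_value):
    """Probability of seeing one or more chance hits, given the expected count."""
    return -math.expm1(-e_value)  # equals 1 - exp(-E), accurate for small E


query_length = 320             # length of the query protein used in this practical
small_db = 1_000_000           # hypothetical database size in residues
large_db = 1_000 * small_db    # hypothetical database 1000 times bigger
bit_score = 40.0               # hypothetical bit score, the same in both searches

e_small = expected_hits(bit_score, query_length, small_db)
e_large = expected_hits(bit_score, query_length, large_db)

print(f"small database: E = {e_small:.2e}")
print(f"large database: E = {e_large:.2e}")
print(f"ratio: {e_large / e_small:.0f}")
print(f"chance of at least one such hit (large database): {chance_of_at_least_one(e_large):.3f}")
```

The alignment and its score are identical in both cases; only the number of opportunities for a chance match has changed. Note also that an e-value is an expected count, not a probability, although for small values the two are almost the same.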

Ensure that the meaning of e-values is absolutely clear to you. Make sure you understand the effect of database size. Discuss with a demonstrator if you are at all unsure!

By clicking a database identifier (in the DB:ID column), you can view details of the protein that was hit.

If you used the EBI server...

Towards the top of the FASTA results page, there is a table headed SUBMISSION PARAMETERS. At the top of the page is a tab labelled Tool Output.

Assuming you used the EBI server, click the Tool Output tab and then click Send to MView and Submit.

After a short pause, a multiple alignment of all the displayed hits will be shown. You can look at the multiple alignment to see conserved features of the sequences. At the bottom of the multiple alignment, consensus sequences are shown at different levels of conservation. Thus for an amino acid to be shown in consensus/100%, it must be seen at that position in every sequence.

Lower-case letters and symbols are used to indicate partially conserved residues (or groupings of similar residues) and full stops are used to indicate non-conserved residues.
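The idea of a consensus at a given level of conservation can be sketched in a few lines of Python. This is a simplified illustration and not the MView algorithm: it reports a residue only when that exact residue reaches the threshold in a column, and it does not reproduce the lower-case groupings of similar residues. The short alignment is a toy example based on the start of the query sequence, with variations invented for illustration.

```python
from collections import Counter


def consensus(alignment, threshold=1.0):
    """Return a consensus string for a list of equal-length aligned sequences.

    A residue is shown when its frequency in the column is at least
    `threshold`; otherwise a full stop is shown. Gaps ('-') are never reported.
    """
    if not alignment:
        return ""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("aligned sequences must all have the same length")
    result = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        conserved = residue != "-" and count / len(column) >= threshold
        result.append(residue if conserved else ".")
    return "".join(result)


# Toy alignment: the variations are hypothetical.
toy_alignment = [
    "MIKKIGVLTSGG",
    "MIKRIGVLTSGG",
    "MVKKIAVLTSGG",
    "MIKKIGVL-SGG",
]

print("consensus/100%", consensus(toy_alignment, 1.0))
print("consensus/70% ", consensus(toy_alignment, 0.7))
```

Lowering the threshold fills in positions where most, but not all, sequences agree.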

Later, you will learn how to create your own multiple alignments.

If you used the Japanese server...

At the top of the results page is an option to select a set of sequences (top-10, top-20, etc) and to perform an operation on those sequences.

Select the Top 20 sequences. Change the Select Operation to CLUSTALW and click Exec.

After a short pause, a multiple alignment of all the displayed hits will be shown. You can look at the multiple alignment to see conserved features of the sequences. At the bottom of the multiple alignment, the level of conservation is indicated with a '*' for a completely conserved residue, a ':' for a highly conserved residue and a '.' for a partially conserved residue.

Later, you will learn how to create your own multiple alignments.

The most important thing to remember is the meaning of e-values: the number of hits with this score (or better) that you would expect to obtain by random chance when searching the current database. You should always look carefully at this value.
Take a look at this paper by Jones & Swindells to clarify some of the problems of working with BLAST and PSI-BLAST.

Searching the 'nr' database with BLAST

The nr database is a 'non-redundant' database (i.e. with duplicated sequences removed). nr contains non-redundant sequences from GenBank translations (i.e. GenPept) together with sequences from other databanks (RefSeq, PDB, SwissProt, PIR and PRF).

BLAST is a program similar to FASTA for searching sequence databases. It is rather faster than FASTA, but somewhat less sensitive at finding remote similarities. (The variant PSI-BLAST is much more sensitive, but much slower.)
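As a rough picture of what 'non-redundant' means, the sketch below merges entries whose sequences are exactly identical, keeping every identifier that pointed to the sequence. This is only an illustration of the idea; it does not describe how the NCBI actually builds nr, and the identifiers and short sequences are hypothetical.

```python
def make_non_redundant(records):
    """Merge records with identical sequences, keeping all of their identifiers.

    `records` is a list of (identifier, sequence) pairs. The result is a list
    of (identifiers, sequence) pairs in order of first appearance.
    """
    merged = {}
    for identifier, sequence in records:
        merged.setdefault(sequence.upper(), []).append(identifier)
    return [(identifiers, sequence) for sequence, identifiers in merged.items()]


# Hypothetical entries drawn from two imaginary source databanks.
records = [
    ("db1|A", "MIKKIGVL"),
    ("db1|B", "MIKRIGVL"),
    ("db2|X", "MIKKIGVL"),  # same sequence as db1|A
]

for identifiers, sequence in make_non_redundant(records):
    print(", ".join(identifiers), sequence)
```

Three entries collapse to two, and the merged entry still lists both of its original identifiers.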

Here is the protein sequence that you used for the FASTA search again:

>myseq
MIKKIGVLTSGGDAPGMNAAIRGVVRSALTEGLEVMGIYDGYLGLYEDRMVQLDRYSVSD
MINRGGTFLGSARFPEFRDENIRAVAIENLKKRGIDALVVIGGDGSYMGAMRLTEMGFPC
IGLPGTIDNDIKGTDYTIGFFTALSTVVEAIDRLRDTSSSHQRISVVEVMGRYCGDLTLA
AAIAGGCEFVVVPEVEFSREDLVNEIKAGIAKGKKHAIVAITEHMCDVDELAHFIEKETG
RETRATVLGHIQRGGSPVPYDRILASRMGAYAIDLLLAGYGGRCVGIQNEQLVHHDIIDA
IENMKRPFKGDWLDCAKKLY

Using this protein sequence, we will now perform a search of the nr database using BLAST at the NCBI web site.

Start BLAST by clicking the link: http://www.ncbi.nlm.nih.gov/BLAST/

Select the type of comparison (protein blast from the Basic BLAST section).

This will take you to a page where you can run a search of a protein sequence against a protein database. Later you will do a nucleotide search.

On the blastp page, leave all options at their default settings (including leaving the database as Non-redundant protein sequences (nr) and the algorithm as blastp (protein-protein BLAST)).

Paste the protein sequence into the Search box.

Press the BLAST button.

This will take you to a page which will keep refreshing until the results are complete. As before, you can now find the best matches (hits).

Ask yourself what the hits mean.

Scroll down and see what the alignments tell you.

In this case, the best hits are to E. coli and Shigella flexneri pfkA.

In earlier versions of the database there was an E. coli hit with a lower score and viewing the alignment showed a number of discrepancies in the sequence. This doesn't occur any more, but follow the instructions below to see what was happening.

Click the following link to view the UniProt/SwissProt entry that you used in the searches: http://www.uniprot.org/uniprot/P0A796

Scroll down to the section headed Sequence and below that Experimental Info.

You will see a number of entries labelled Sequence Conflict.

Each 'conflict' refers back to the same PubMed reference.

Click the Sequence Conflict text for more information and then the PubMed Reference to be taken to the reference.

Here, you will find that the discrepancies, or conflicts, are between the sequence in this entry and the sequence derived from the DNA in the specified reference.

If you click back to the UniProt page and then scroll down to the section headed Cross-references, you will find references to the EMBL databank. Here you can compare the sequences from different data sources.
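Comparing two versions of a sequence position by position is easy to do yourself. The sketch below lists every position at which two sequences of equal length differ. The first fragment is taken from the query sequence used above; the second is a hypothetical variant invented for illustration and is not taken from any databank entry.

```python
def find_conflicts(reference, other):
    """List (position, reference_residue, other_residue) for every mismatch.

    Positions are numbered from 1, as in databank annotations. The two
    sequences must be the same length (no insertions or deletions).
    """
    if len(reference) != len(other):
        raise ValueError("sequences must be the same length")
    return [
        (position, a, b)
        for position, (a, b) in enumerate(zip(reference, other), start=1)
        if a != b
    ]


reference_fragment = "GTFLGSARFPEFRDE"   # fragment of the query sequence
variant_fragment = "GTFLGSARCPEFRDE"     # hypothetical variant

for position, a, b in find_conflicts(reference_fragment, variant_fragment):
    print(f"position {position}: {a} -> {b}")
```

For sequences that differ in length you would first need to align them, which is what BLAST and FASTA do for you.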

The important take-home message is that there can be discrepancies in the data stored in different databanks!

You must always exercise caution in interpreting results of searches, especially looking at individual sequence differences.

Searching GenBank with a DNA sequence using BLAST

In practice, we will again search a combined database containing data from GenBank, EMBL and DDBJ (all of which exchange data on a regular basis), together with sequences from RefSeq and the PDB.

Here is a DNA sequence you might have obtained from sequencing in the lab:

>myseq
agtcatgatt aagaaaatcg gtgtgttgac aagcggcggt gatgcgccag gcatgaacgc
cgcaattcgc ggggttgttc gttctgcgct gacagaaggt ctggaagtaa tgggtattta
tgacggctat ctgggtctgt atgaagaccg tatggtacag ctagaccgtt acagcgtgtc
tgacatgatc aaccgtggcg gtacgttcct cggttctgcg cgttgtccgg aattccgcga
cgagaacatc cgcgccgtgg ctatcgaaaa cctgaaaaaa cgtggtatcg acgcgctggt
ggttatcggc gatggcggtt cctacatggg tgcaatgcgt ctgaccgaaa tgggcttccc
gtgcatcggt ctgccgggca ctatcgacaa cgacatcaaa ggcactgact acactatcgg
tttcttcact gcgctgagca ccgttgtaga agcgatcgac cgtctgcgtg acacctcttc
ttctcaccag cctatttccg tggtggaagt gatgggccgt tattgtggag atctgacgtt
ggctgcggcc attgccggtg gctgtgaatt cgttgtggtt ccggaagttg aattcagccg
tgaagacctg gtaaacgaaa tcaaagcggg tatcgcgaaa ggtaaaaaac acgcgatcgt
ggcgattacc gaacatatgt gtgatgttga cgaactggcg catttcatcg agaaagaaac
cggtcgtgaa acccgcgcaa ctgtgctggg ccacatccag cgcggtggtt ctccggtgcc
ttacgaccgt attctggctt cccgtatggg cgcttacgct atcgatctgc tgctggcagg
ttacggcggt cgttgtgtag gtatccagaa cgaacagctg gttcaccacg acatcatcga
cgctatcgaa aacatgaagc gtccgttcaa aggtgactgg ctggactgcg ccgaaaaaat
gtattaatga

Using this DNA sequence, we will now perform a search of GenBank using BLAST at the NCBI web site.

Start BLAST by clicking the link: http://www.ncbi.nlm.nih.gov/BLAST/

Select the type of comparison (nucleotide blast from the Basic BLAST section).

This will take you to a page where you can run a search of a DNA sequence against a DNA database.

On this page, under Choose database, click Others (nr etc.) then make sure the database selected in the pull-down menu is Nucleotide collection (nr/nt).

Under Program Selection, select Somewhat similar sequences (blastn).

Leave all other options at their default settings.

Paste the DNA sequence into the Search box.

Press the BLAST button.

This will take you to a page which will keep refreshing until the results are complete. As before, you can now find the best matches (hits).

As before, you will see that you obtain first a graphical representation of the hits, followed by details of the hits together with bit scores and e-values. By clicking on a score, you are taken to the alignment.

Performing DNA searches in BLAST is much the same as performing protein searches. However, when looking for distant relatives, it is always better to work at the protein level. There are a number of alternative versions of BLAST which allow different types of searches. For example:

a DNA sequence to be searched against a protein database (blastx)
a protein sequence to be searched against a DNA database (tblastn)
a DNA sequence to be searched against a DNA database, but performing translations such that comparisons are done at the protein level (tblastx)
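The translated searches listed above all rely on translating DNA in six reading frames (three on each strand). The following sketch shows the idea using the standard genetic code. It is an illustration of the principle only, not the code used by BLAST; codons containing unexpected characters are shown as X and stop codons as *.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def clean(dna):
    """Remove spaces and line breaks and convert to upper case."""
    return "".join(dna.split()).upper()


def reverse_complement(dna):
    return clean(dna).translate(COMPLEMENT)[::-1]


def translate(dna, offset=0):
    """Translate complete codons starting `offset` bases into the sequence."""
    dna = clean(dna)
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(offset, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    forward = clean(dna)
    reverse = reverse_complement(forward)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward, offset)
        frames[f"-{offset + 1}"] = translate(reverse, offset)
    return frames


# The first 40 bases of the DNA sequence used in this practical.
dna_start = "agtcatgatt aagaaaatcg gtgtgttgac aagcggcggt"

for name, protein in six_frames(dna_start).items():
    print(name, protein)
```

One of the forward frames contains MIKKIGVLTSGG, the start of the protein sequence you searched with earlier. The other frames give unrelated peptides, often interrupted by stop codons.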

Again, you must always look at the e-values of the hits you obtain.

Text searches

Often you want to perform much simpler searches - you don't want to search with a sequence, but simply with a keyword or the identifier of a sequence (perhaps one you have read about in a paper). In other words, you want to perform a text search of the annotations rather than a sequence search.

There are a number of resources which allow text searches - even a Google search may help you! (Try searching for the SwissProt identifier P00001).

The NCBI provides an integrated search facility within its Entrez system. You use this simply by going to the NCBI home page and entering the term(s) in the search box.

Expasy also provides a simple SwissProt text search facility.

Visit the NCBI by clicking this link: http://www.ncbi.nlm.nih.gov/

Enter pfka into the search box at the top of the page.

Leave the Search pull-down set to All Databases

Click Go

You will be taken to a page which shows the hits for the search term in the various databases maintained at the NCBI. These include the literature database (PubMed) as well as the sequence databases at both Nucleotide and Protein level.

Follow the Proteins: Protein link to obtain a full list of all the hits in the protein sequence database. You can follow any one of the resulting links for details of the protein.

Expasy

The Expasy web site also provides a simple text search of SwissProt and TrEMBL (SwissProt is a highly annotated database while TrEMBL contains automated translations of DNA sequences in EMBL).

Visit the Expasy web site: http://www.expasy.org/

Enter pfka into the search box at the top of the page.

Set the Search pull-down to UniProtKB.

Click Go

Just as with the NCBI search, you are presented with a list of the sequences which match the search term. You can click one of these to obtain details of the sequence entry.

Since this is a simple text search, internally Expasy uses a tool called Glimpse which is a general text indexing and search tool.
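To give a feel for what text indexing involves, here is a minimal sketch of an inverted index: a table that maps each word to the entries whose annotation contains it, so that a keyword search becomes a simple lookup. This is a generic illustration and makes no claim about how Glimpse itself is implemented. The entry identifiers and annotation text are hypothetical.

```python
import re
from collections import defaultdict


def words_in(text):
    """Split text into lower-case words made of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(annotations):
    """Map every word to the set of entry identifiers that contain it."""
    index = defaultdict(set)
    for identifier, text in annotations.items():
        for word in words_in(text):
            index[word].add(identifier)
    return index


def search(index, query):
    """Return the entries containing every word of the query."""
    words = words_in(query)
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits


# Hypothetical annotation text for three imaginary entries.
annotations = {
    "ENTRY_A": "pfkA 6-phosphofructokinase Escherichia coli",
    "ENTRY_B": "pfkA 6-phosphofructokinase Shigella flexneri",
    "ENTRY_C": "cytochrome c Homo sapiens",
}

index = build_index(annotations)
print(sorted(search(index, "pfka")))
print(sorted(search(index, "pfkA coli")))
```

The search ignores case, which is why entering pfka finds entries annotated as pfkA.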

Simple text searches can return a lot of hits. If you know you are looking for a sequence in UniProtKB/SwissProt it is generally easier to search UniProtKB/SwissProt directly rather than using the general search interface at the NCBI.