Sequence comparison searches: BLAST

You will now look at the second method for finding similar sequences and inferring homology: BLAST. It is similar to FASTA, but it is rather faster and somewhat less sensitive at finding remote similarities. It is generally considered one of the most important programs in bioinformatics.

Here is the same sequence (E. coli pfkA) that you used with FASTA:

>myseq
MIKKIGVLTSGGDAPGMNAAIRGVVRSALTEGLEVMGIYDGYLGLYEDRMVQLDRYSVSD
MINRGGTFLGSARFPEFRDENIRAVAIENLKKRGIDALVVIGGDGSYMGAMRLTEMGFPC
IGLPGTIDNDIKGTDYTIGFFTALSTVVEAIDRLRDTSSSHQRISVVEVMGRYCGDLTLA
AAIAGGCEFVVVPEVEFSREDLVNEIKAGIAKGKKHAIVAITEHMCDVDELAHFIEKETG
RETRATVLGHIQRGGSPVPYDRILASRMGAYAIDLLLAGYGGRCVGIQNEQLVHHDIIDA
IENMKRPFKGDWLDCAKKLY
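If you ever want to check a sequence like this before pasting it into a search form, the FASTA format is simple enough to read with a few lines of Python. The following is a minimal sketch, not part of BLAST or FASTA themselves; the function names `parse_fasta` and `unexpected_residues` are made up for this illustration.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the identifier is the first word after '>'.
            words = line[1:].split()
            name = words[0] if words else "unnamed"
            records[name] = []
        elif name is not None:
            # Sequence line: may be wrapped over several lines.
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def unexpected_residues(sequence):
    """Return characters that are not one of the 20 standard amino acids."""
    return set(sequence.upper()) - STANDARD_AMINO_ACIDS


if __name__ == "__main__":
    example = ">toy example record\nMIKK\nIGVL\n"
    for identifier, sequence in parse_fasta(example).items():
        print(identifier, len(sequence), unexpected_residues(sequence))
```

An empty set from `unexpected_residues` suggests the sequence was copied cleanly, with no stray characters.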

Start BLAST by clicking the following link: http://www.ncbi.nlm.nih.gov/BLAST/

You need to select the appropriate type of comparison (protein blast from the Web BLAST section). If you had been searching for a nucleotide sequence match you would have chosen nucleotide blast.

Once you are on the BLAST page, make sure that blastp is selected from the tabs near the top. Leave all other options at their default settings, and check that the Database being searched is Non-redundant protein sequences (nr).

Paste your E. coli pfkA sequence into the Enter Query Sequence box, as you did with FASTA.

Now press the BLAST button. This will take you to a page which will keep refreshing until the results are complete.
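The web form is all you need for this exercise, but the same search can be described as a set of form parameters. The sketch below only builds the request body and does not contact NCBI. It assumes the parameter names used by NCBI's documented BLAST URL API (`CMD`, `PROGRAM`, `DATABASE`, `QUERY`); the function name `build_submit_body` is hypothetical, and you should check the current NCBI documentation and usage guidelines before sending real requests.

```python
from urllib.parse import urlencode, parse_qs

# Assumed endpoint of the NCBI BLAST URL API; nothing is sent in this sketch.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_submit_body(query, program="blastp", database="nr"):
    """Return the URL-encoded form body for submitting a search."""
    params = {
        "CMD": "Put",          # submit a new search
        "PROGRAM": program,    # blastp = protein query against protein database
        "DATABASE": database,  # nr = non-redundant protein sequences
        "QUERY": query,        # the sequence itself (or FASTA text)
    }
    return urlencode(params)


if __name__ == "__main__":
    body = build_submit_body("MIKKIGVLTSGG")
    print(BLAST_URL)
    print(parse_qs(body))
```

The refreshing page you see in the browser corresponds to repeatedly asking the server whether the submitted search has finished.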

When the search has completed, the results page will first show a graphical representation of the hits. Your query is shown as a bold red line with residue numbers printed along it (amino acid positions, since this is a protein query). Below this are coloured lines representing the 'hits' in the database. The colour of each line represents the score for the hit (as explained in the key at the top). Some lines will be broken into parts with different scores for different sections.
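The colour key is simply a lookup from alignment score to a colour band. As a rough sketch, assuming the five bands that the NCBI key has traditionally shown (below 40, 40 to 50, 50 to 80, 80 to 200, and 200 or above), the mapping looks like this. The band boundaries and colour names are an assumption here, so trust the key on your own results page over this code.

```python
def colour_for_score(score):
    """Map an alignment score to the colour band assumed for the graphic key."""
    if score < 40:
        return "black"
    if score < 50:
        return "blue"
    if score < 80:
        return "green"
    if score < 200:
        return "magenta"
    return "red"


if __name__ == "__main__":
    for score in (35, 45, 60, 150, 600):
        print(score, colour_for_score(score))
```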

Below this is the list of hits together with their scores, Query Coverage, E-value, Per. Ident (percentage identity) and Accession.

As before, you can now find the best matches (hits). Again, ask yourself what the e-values mean, then scroll down to see what the alignments tell you.

If you click on the Description for a hit you will jump down the page to see the alignment between your Query sequence and the Sbjct (subject) sequence found in the database.

Click the Description for the best hit that has a Query Coverage of 100% (i.e. the whole of your query sequence was aligned to the hit in the database) but a Per. Ident (percentage identity) of less than 100%. In other words, you are finding the first sequence which covers the whole query but does not match it exactly.
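The selection rule in this step can be written down precisely: walk down the hit list in the order shown and stop at the first row with 100% coverage and less than 100% identity. The sketch below uses invented placeholder rows (labels `hit_1` to `hit_4` with made-up numbers), not real BLAST output, and the function name is hypothetical.

```python
def first_full_length_non_identical(hits):
    """Return the first hit with 100% query coverage but under 100% identity.

    `hits` is a list of dicts in the order shown in the results table
    (best hit first). Returns None if no row qualifies.
    """
    for hit in hits:
        if hit["coverage"] == 100 and hit["identity"] < 100:
            return hit
    return None


if __name__ == "__main__":
    # Placeholder rows for illustration only; these are not real results.
    table = [
        {"label": "hit_1", "coverage": 100, "identity": 100.0},
        {"label": "hit_2", "coverage": 98, "identity": 99.0},
        {"label": "hit_3", "coverage": 100, "identity": 99.7},
        {"label": "hit_4", "coverage": 100, "identity": 99.4},
    ]
    print(first_full_length_non_identical(table))
```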

The line shown between the query and subject shows where the two sequences match: identical residues are printed as the residue letter. Where there is a mismatch the line is left blank, and where the two residues differ but are similar amino acids there is a '+'. Take a look at what these differences are; it may take you a while to spot a '+' sign.
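To make the meaning of the middle line concrete, here is a small sketch that rebuilds it from two aligned sequences. It assumes that '+' marks a non-identical pair with a positive score in the substitution matrix, and it uses a hand-transcribed list of the positive off-diagonal pairs in BLOSUM62 for the 20 standard amino acids. Treat that list as an assumption and consult the full matrix for real work.

```python
# Non-identical residue pairs assumed to score above zero in BLOSUM62.
POSITIVE_PAIRS = {
    frozenset(pair)
    for pair in (
        "AS RQ RK ND NH NS DE QE QK EK HY "
        "IL IM IV LM LV MV FW FY ST WY"
    ).split()
}


def midline(query, subject):
    """Build the match line for two aligned, equal-length sequences."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    symbols = []
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            symbols.append(" ")   # alignment gap in one sequence
        elif q == s:
            symbols.append(q)     # identical residue: print the letter
        elif frozenset((q, s)) in POSITIVE_PAIRS:
            symbols.append("+")   # different but similar residues
        else:
            symbols.append(" ")   # mismatch
    return "".join(symbols)


if __name__ == "__main__":
    print("Query  MIKKIG")
    print("       " + midline("MIKKIG", "MLKRIA"))  # prints "M+K+I "
    print("Sbjct  MLKRIA")
```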

Sometimes data in different databanks may disagree - either because of genuine differences in data that have been collected or because people make mistakes!

The UniProt/SwissProt entry that you used in the searches is shown in the same view of a UniProt entry that you saw earlier. Click the following link to open it: https://www.uniprot.org/uniprot/P0A796

Click the Sequence link on the left to scroll down to the section headed Sequence. Below that is a section labelled Features, which contains a number of entries labelled Sequence Conflict.

Record how many conflict records there are.

Each conflict refers to the same reference stored in PubMed. Click the PubMed reference number to be taken to the reference.

In many cases, when you run a BLAST search and look at the alignment, you will see that your query sequence contains a string of X characters. These result from a process known as masking or filtering: when you run BLAST, your query sequence is masked with a program called 'SEG'. Look at the documentation for SEG by clicking this link: http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=FAQ#LCR

Make sure you understand the function of masking.

The masked regions contain repetitive sequences which are biased in their sequence composition. Such regions can confuse sequence matching programs like BLAST since they are more likely to contain chance matches. For example, leucine zippers occur in many unrelated proteins and contain a particularly high leucine content.
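To get a feel for what "biased in sequence composition" means, the following sketch masks windows whose Shannon entropy is low. This is NOT the SEG algorithm: it is a deliberately simplified stand-in, and the window length of 12 and threshold of 2.2 bits are illustrative values chosen here as assumptions. Because it masks every whole window that falls below the threshold, it will also mask some residues on either side of a repetitive run.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    total = len(segment)
    counts = Counter(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(sequence, window=12, threshold=2.2):
    """Replace residues in low-entropy windows with 'X'."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) <= threshold:
            # Mask the whole window, not just its centre.
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


if __name__ == "__main__":
    # A varied stretch followed by a run of a single residue.
    print(mask_low_complexity("ACDEFGHIKLMN" + "Q" * 12))
```

A window of twelve different residues has an entropy of about 3.6 bits, while a window made of a single repeated residue has an entropy of zero, which is why the run is masked.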

When you perform a sequence similarity search, it is critical that you look at the e-value for each of the 'hits' that your search returns. The e-value (or 'expectation value') is defined as the number of hits you would expect to see by chance, in a database of this size, with this score or better. The lower the e-value, the less likely it is that the match is a chance one.
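The dependence on both score and database size can be seen in the standard relationship between a bit score S' and the e-value, E = m × n × 2^(−S'), where m is the query length and n is the total length of the database. The sketch below is a simplification: BLAST itself uses corrected "effective" lengths, so the numbers it reports will differ somewhat, and the function name `expected_hits` is made up for this illustration.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate e-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    m, n = 4, 256                         # toy lengths chosen for easy arithmetic
    print(expected_hits(10, m, n))        # 1.0
    print(expected_hits(11, m, n))        # 0.5: one extra bit halves E
    print(expected_hits(10, m, 2 * n))    # 2.0: doubling the database doubles E
```

This is why the same alignment can have a different e-value when you search a larger or smaller database, and why e-values from different searches should be compared with care.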

An e-value of ≤0.01 is generally indicative of homology. Proteins that are homologues can have much higher e-values (e.g. 5.0), but the search gives no statistical support to indicate that this is the case.
