BIOC0003 - Introduction to Bioinformatics

You will now look at the second method for finding similar sequences and inferring homology: BLAST. It is similar to FASTA, but it is faster and somewhat less sensitive at finding remote similarities. It is generally considered one of the most important programs in bioinformatics.

Run the BLAST search

You need to select the appropriate type of comparison (Protein BLAST from the Web BLAST section). If you had been searching for a nucleotide sequence match, you would have chosen Nucleotide BLAST.

Once you are on the BLAST page, check that the blastp tab near the top is selected. Leave all other options at their default settings, making sure that the Database being searched is Non-redundant protein sequences (nr).

Paste your sequence for pfkA from E. coli into the Enter Query Sequence box, as you did with FASTA.

Now press the BLAST button. This will take you to a page which will keep refreshing until the results are complete.
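The same submit-then-wait pattern can also be scripted. The sketch below only builds the request parameters and does not contact NCBI. It assumes the parameter names used by NCBI's BLAST URL API (CMD, PROGRAM, DATABASE, QUERY, RID, FORMAT_OBJECT); the function names and the short placeholder sequence are hypothetical and are not part of this practical.

```python
from urllib.parse import urlencode

# Address of the NCBI BLAST service (the same host used elsewhere in this practical).
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_submit_params(query_sequence, program="blastp", database="nr"):
    """Parameters for submitting a search (the equivalent of pressing BLAST)."""
    return {
        "CMD": "Put",
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": query_sequence,
    }


def build_poll_params(request_id):
    """Parameters for asking whether a submitted search has finished."""
    return {"CMD": "Get", "RID": request_id, "FORMAT_OBJECT": "SearchInfo"}


# Placeholder query for illustration only; paste your real pfkA sequence instead.
placeholder_query = "MKTAYIAKQR"
print(BLAST_URL + "?" + urlencode(build_submit_params(placeholder_query)))
```

A script would send the first request, read the request ID from the response, and then repeat the second request until the search is reported as complete, which is what the refreshing web page does for you.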

Results

When the search has completed, the results page will first show a graphical representation of the hits. Your query is shown as a bold red line with residue numbers printed along it (these are amino acid positions, since this is a protein search). Below this are coloured lines representing the 'hits' in the database. The colour of each line represents the score for the hit (as explained in the key at the top). Some lines will be broken into parts with different scores for different sections.

Below this is the list of hits together with their scores, Query Coverage, E-value, Identity and Accession.

As before, you can now find the best matches (hits). Again, ask yourself what the E-values mean, then scroll down and see what the alignments tell you.
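To get a feel for E-values, it can help to see how they relate to the bit score. A minimal sketch, assuming the standard relationship E = m * n / 2^S', where S' is the bit score and m and n are the query and database lengths in residues. BLAST itself uses corrected "effective" lengths, so its reported values will differ somewhat. The function name and the database size below are hypothetical.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """Approximate E-value: number of hits with at least this bit score
    expected purely by chance in a database of the given total length."""
    return query_length * database_length / 2.0 ** bit_score


# Illustrative numbers only: a 300-residue query against a database
# containing 100 million residues in total.
for bit_score in (30, 50, 100):
    print(bit_score, expected_chance_hits(bit_score, 300, 100_000_000))
```

Each extra bit of score halves the E-value, and doubling the size of the database doubles it. This is why the same alignment can look less significant when searched against a larger database.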

If you click on the Description for a hit you will jump down the page to see the alignment between your Query sequence and the Sbjct (subject) sequence found in the database.

Click the Description for the best hit that has a Query Coverage of 100% (i.e. all of your probe sequence matched a hit in the database) but a Per. Ident (Percentage Identity) of less than 100%. In other words, you are finding the first sequence which does not exactly match the query sequence.
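The two percentages measure different things, which the following sketch tries to make concrete. It assumes a single aligned region, with identity counted over the alignment columns and coverage computed from the query positions spanned by the alignment. The function names and example strings are hypothetical.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percentage of alignment columns in which both sequences have the same residue."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    return 100.0 * identical / len(aligned_query)


def query_coverage(query_start, query_end, query_length):
    """Percentage of the query spanned by the alignment (positions are 1-based, inclusive)."""
    return 100.0 * (query_end - query_start + 1) / query_length


# Made-up four-residue alignment with one substitution:
print(percent_identity("MIKK", "MIRK"))
print(query_coverage(1, 4, 4))
```

Here the whole query is aligned, so coverage is 100%, yet one of the four positions differs, so identity is below 100%. This is exactly the kind of hit you are asked to find.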

The line shown between the query and subject sequences shows where the two sequences match. Identical residues are shown as the residue letter, similar amino acid residues are marked with a '+', and mismatches are left as a blank space. Take a look at what these differences are; it may take you a while to spot a '+' sign.
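The way this middle line is built can be sketched in a few lines. BLAST marks a pair with '+' when the substitution has a positive score in the scoring matrix. The example below assumes only a small excerpt of positive-scoring BLOSUM62 pairs rather than the full matrix, and the names POSITIVE_PAIRS and midline are hypothetical.

```python
# Excerpt only: a few non-identical residue pairs that score above zero in BLOSUM62.
POSITIVE_PAIRS = {
    frozenset("IV"),
    frozenset("IL"),
    frozenset("KR"),
    frozenset("DE"),
    frozenset("ST"),
    frozenset("FY"),
}


def midline(aligned_query, aligned_subject):
    """Build the line displayed between two aligned sequences."""
    symbols = []
    for q, s in zip(aligned_query, aligned_subject):
        if q == s and q != "-":
            symbols.append(q)      # identical residues: show the letter
        elif frozenset((q, s)) in POSITIVE_PAIRS:
            symbols.append("+")    # similar residues: positive substitution score
        else:
            symbols.append(" ")    # mismatch or gap: leave blank
    return "".join(symbols)


query = "MIKKIG"
subject = "MVRA-G"
print(query)
print(midline(query, subject))
print(subject)
```

In this made-up example the I/V and K/R pairs are marked '+', while the K/A mismatch and the gap column are left blank.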

Sequence differences

Sometimes data in different databanks may disagree - either because of genuine differences in data that have been collected or because people make mistakes!

Click the following link to view the UniProt/SwissProt entry that you used in the searches (this is the same view of the entry that you saw earlier): https://www.uniprot.org/uniprot/P0A796

Click the Sequence link on the left to scroll down to the section headed Sequence. Below that is a section labelled Features, which contains a number of entries labelled Sequence Conflict.

Record how many conflict records there are.

Each conflict refers to the same reference stored in PubMed - click the PubMed reference number to be taken to the reference.

Masking

In many cases, when you run a BLAST search and look at the alignment, you will see that your query sequence contains a run of X characters. These result from a process known as masking or filtering. When you run BLAST, your query sequence is masked with a program called 'SEG'. Look at the documentation for SEG by clicking this link: http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=FAQ#LCR

Make sure you understand the function of masking.

The masked regions contain repetitive sequences which are biased in their sequence composition. Such regions can confuse sequence matching programs like BLAST since they are more likely to contain chance matches. For example, leucine zippers occur in many unrelated proteins and contain a particularly high leucine content.
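The idea of masking can be illustrated with a much-simplified sketch. This is not the SEG algorithm itself: it simply slides a window along the sequence, measures how varied the residues in that window are (Shannon entropy), and replaces low-variety windows with X. The function names, window size and threshold are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Compositional variety of a stretch of sequence, in bits per residue."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(sequence, window=12, threshold=2.2):
    """Replace every window whose entropy is at or below the threshold with X."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) <= threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# Made-up sequence: a varied stretch, a run of glutamines, then another varied stretch.
example = "ACDEFGHIKLMN" + "Q" * 15 + "ACDEFGHIKLMN"
print(mask_low_complexity(example))
```

The run of identical residues is replaced by X characters, along with a few flanking positions that fall in windows dominated by the repeat, while the varied regions further away are left untouched.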
