BLAST is one of the most important programs in bioinformatics. It lets you search a protein or DNA sequence against a large database of sequences very rapidly. Similar sequences are identified and the statistical significance of each similarity is calculated. A high degree of similarity can be used to infer homology.

When you perform a BLAST search, it is critical that you look at the e-value for each of the 'hits' that your search returns. The e-value (or 'expectation value') is the number of hits you would expect to see by chance, in a database of this size, with this score or better. The smaller the e-value, the less likely it is that the match arose by chance.

An e-value of ≤0.01 is generally indicative of homology. Sequences that are true homologues can have much higher e-values (e.g. 5.0), but in that case the BLAST search gives no statistical support for the relationship.
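To make the definition concrete, the sketch below uses the standard approximation E = m × n × 2^(−S'), where m is the query length, n is the total length of the database and S' is the bit score. The function name and the lengths used are illustrative assumptions, and real BLAST applies length corrections that are ignored here, so the numbers will not match a results page exactly.

```python
def expected_hits(bit_score: float, query_length: int, database_length: int) -> float:
    """Approximate e-value from a bit score: E = m * n * 2**(-S').

    Ignores the edge-effect corrections that BLAST itself applies.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical sizes, chosen only for illustration.
query_length = 1_000
database_length = 1_000_000_000

for bit_score in (30, 50, 100):
    e_value = expected_hits(bit_score, query_length, database_length)
    verdict = "supports homology" if e_value <= 0.01 else "no statistical support"
    print(f"bit score {bit_score:>3}: E = {e_value:.3g} ({verdict})")
```

The same bit score gives a larger e-value in a larger database, which is why the e-value, rather than the raw score, is the figure to compare against the 0.01 guideline.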

Click the headings below for instructions on running BLAST and examining the results

  • Visit the BLAST page at the NCBI: http://blast.ncbi.nlm.nih.gov/
  • Select the type of comparison. Click Nucleotide BLAST from the section headed Web BLAST.
  • Paste your sequence into the large search box
  • Under Choose Search Set, ensure that the Database being searched is set to Nucleotide collection (nr/nt).
  • Under Program Selection, ensure that Highly similar sequences (megablast) is selected.
  • Leave all other parameters at their default values.
  • Because this is a long sequence and the 'nr/nt' sequence collection is now extremely large, this search can take a very long time with the standard 'blastn' program; 'megablast' is much faster, though less sensitive.

    To save NCBI's computing resources and your own time, you can instead follow a link to precalculated results.

    View the results here
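The form settings chosen above can also be expressed as a scripted request. The sketch below only builds the request body and does not contact the server. The parameter names (CMD, PROGRAM, MEGABLAST, DATABASE, QUERY) follow NCBI's BLAST URL API as commonly documented, but treat them, the database name "nt" and the helper function name as assumptions to check against the current NCBI documentation before use.

```python
from urllib.parse import urlencode

# Submission endpoint of the NCBI BLAST URL API (assumed; verify before use).
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_megablast_request(sequence: str, database: str = "nt") -> str:
    """Return a URL-encoded body mirroring the web form choices above."""
    params = {
        "CMD": "Put",          # submit a new search
        "PROGRAM": "blastn",   # nucleotide query against nucleotide database
        "MEGABLAST": "on",     # 'Highly similar sequences (megablast)'
        "DATABASE": database,  # assumed name for 'Nucleotide collection (nr/nt)'
        "QUERY": sequence,
    }
    return urlencode(params)


# Placeholder sequence, not the gene used in this exercise.
body = build_megablast_request("ACGTACGTACGT")
print(BLAST_URL)
print(body)
```

Nothing is sent here on purpose: as noted above, a real search against 'nr/nt' is slow and uses shared resources, so the precalculated results remain the sensible route for this exercise.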

When the search has completed, the results page shows a tabular list of the hits together with their scores, query coverage, E-value, Identity and Accession.
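If you save a hit list as tab-separated text, the same triage can be done in a few lines of code. The sketch below assumes the 12-column tabular layout produced by command-line BLAST+ (query id, subject id, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score); a table downloaded from the web page may contain extra comment lines or different columns. The three hits are made up for illustration.

```python
import csv
import io

# Column order of BLAST+ tabular output (assumed layout, see note above).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# Made-up hits, for illustration only.
EXAMPLE_TABLE = (
    "query1\thitA\t100.0\t500\t0\t0\t1\t500\t1\t500\t0.0\t924\n"
    "query1\thitB\t92.5\t200\t15\t0\t10\t209\t40\t239\t2e-30\t150\n"
    "query1\thitC\t85.0\t40\t6\t0\t300\t339\t5\t44\t4.2\t30.1\n"
)


def significant_hits(table_text: str, max_evalue: float = 0.01) -> list:
    """Return hits with e-value <= max_evalue, best (smallest) first."""
    reader = csv.DictReader(io.StringIO(table_text),
                            fieldnames=COLUMNS, delimiter="\t")
    hits = [row for row in reader if float(row["evalue"]) <= max_evalue]
    return sorted(hits, key=lambda row: float(row["evalue"]))


for hit in significant_hits(EXAMPLE_TABLE):
    print(hit["sseqid"], hit["evalue"], hit["pident"])
```

The default cut-off of 0.01 is the guideline given earlier on this page; hits above it are dropped rather than treated as evidence against homology.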

If you have used the actual server, rather than the pre-calculated results, you can also click Graphic Summary which gives a graphical representation of the hits. Your query is shown as a bold turquoise line with nucleotide base numbers printed along it. Below this are coloured lines representing the 'hits' in the database. The colour of these lines represents the score for the hit (as explained in the key at the top). Some lines will be broken into parts with different scores for different sections.

  • From the Descriptions list of hits, record the best hit in the database
  • Click on its Accession code

Because the best hit is a perfect match to your sequence, the page that opens is the database record for the gene we have been using.

  • From the FEATURES section, look for the location of the TATA box. If you click the regulatory link to the left of the TATA_box information, it will highlight the TATA box within the DNA sequence. Does this correspond to what you found from the promoter predictions?
  • Now scroll back and click the mRNA link within the FEATURES section. The exons which are spliced together to form the mature mRNA will be highlighted in the sequence.
  • Looking at the FEATURES information, is there anything that suggests this gene undergoes alternative splicing, i.e. are there multiple possible mature mRNA sequences? In particular, the CDS ('CoDing Sequence') record has a join section which shows you which regions of the DNA are exons and are joined (spliced) together to form the mature RNA sequence. Is there anything here that suggests there are different ways of splicing the exons together and therefore multiple different coding sequences?
  • Make a note of the GenBank accession code for this gene
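The join section of a CDS feature can also be read programmatically, which helps when comparing several CDS records for signs of alternative splicing. The sketch below is a minimal parser for simple locations of the form join(start..end,...); the function name and the example coordinates are hypothetical and are not taken from the record used in this exercise. It does not handle complement(...) wrappers or single-base positions.

```python
import re


def parse_join(location: str) -> list[tuple[int, int]]:
    """Return (start, end) exon pairs, 1-based inclusive, from a join(...) string."""
    return [(int(start), int(end))
            for start, end in re.findall(r"(\d+)\.\.(\d+)", location)]


# Hypothetical CDS location, for illustration only.
cds_location = "join(201..350,900..1019,1500..1733)"

exons = parse_join(cds_location)
coding_length = sum(end - start + 1 for start, end in exons)

for number, (start, end) in enumerate(exons, start=1):
    print(f"exon {number}: {start}..{end} ({end - start + 1} bp)")
print(f"total coding length: {coding_length} bp")
```

If a record lists more than one CDS feature and their parsed exon lists differ, that is the kind of evidence for alternative splicing the question above asks you to look for.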

 
