Sequence comparison searches: FASTA

In this part of the tutorial, you will learn how to find sequences that are similar to a search sequence. A high degree of similarity can be used to infer homology.

Both FASTA and BLAST can be used to compare either protein or DNA sequences. However, if possible, it is always preferable to work at the protein level. In trying to identify homology, it is not possible to go back in time, so we must use sequence similarity as a surrogate: the more similar two sequences are, the more likely it is that they are homologues. Because the genetic code is redundant (several different codons encode the same amino acid) and because some amino acids can generally substitute for one another without affecting function, sequence similarity can be assessed much more effectively at the protein level than at the DNA level.
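
As a rough illustration of why redundancy matters, the sketch below compares two short DNA sequences and their translations. The two sequences are made up for this example, and the codon table is only the small subset of the standard genetic code needed here; the function names are likewise just illustrative.

```python
# Small subset of the standard genetic code (only the codons used below).
CODON_TABLE = {
    "TCT": "S", "AGC": "S",  # two of the six serine codons
    "CTT": "L", "TTG": "L",  # two of the six leucine codons
}

def translate(dna):
    """Translate a DNA string codon by codon using CODON_TABLE."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))

def percent_identity(a, b):
    """Percentage of positions at which two equal-length strings match."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

# Hypothetical example sequences (not taken from any database).
dna_1 = "TCTCTT"
dna_2 = "AGCTTG"

print("DNA identity:    ", percent_identity(dna_1, dna_2))
print("Protein identity:", percent_identity(translate(dna_1), translate(dna_2)))
```

The two DNA sequences agree at only one position in six, yet both translate to the same two residues (serine followed by leucine), so a comparison at the protein level picks up a relationship that is largely hidden at the DNA level.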

Sequence similarity scoring

Some amino acids have more similar physico-chemical properties to one another than others. For example, serine and threonine are very similar, while serine and leucine are very different. This means that, during evolution, some amino acids can substitute for one another more easily than others and this can be compared with the substitutions one would expect purely from random chance. Thus, serine will be replaced by threonine more often than one would expect by random chance, but serine will be replaced by leucine much less frequently than one would expect by random chance.

Margaret Dayhoff analyzed homologous sequences in 1970. She aligned the sequences and looked to see how frequently substitutions occurred (i.e. the observed number of times a substitution occurred) and how this compared with random changes (i.e. the expected number of times a substitution occurred). She then calculated log(observed/expected): if the observed count is greater than the expected count, this value is positive; if the observed count is less than the expected count, this value is negative. These scores were placed in the Dayhoff 'Mutation Data Matrix'.
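
The sign behaviour of log(observed/expected) can be checked with a few lines of Python. This is only a sketch: the counts are invented for illustration, and base-10 logarithms are an assumption, since the text above does not specify the base (the sign of the result is the same for any base greater than 1).

```python
import math

def log_odds(observed, expected):
    """Return log10(observed / expected) for a substitution.

    Base 10 is an assumption made for this sketch.
    """
    return math.log10(observed / expected)

# Hypothetical counts, chosen only to show the sign of the score.
print(log_odds(200, 100))  # seen more often than expected: positive
print(log_odds(100, 100))  # seen exactly as often as expected: zero
print(log_odds(25, 100))   # seen less often than expected: negative
```

A serine-to-threonine substitution would fall into the first case, and a serine-to-leucine substitution into the last.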

Similar scoring matrices have since been developed by Henikoff and Henikoff and are known as the BLOSUM matrices. The main difference is that they were derived from a different set of homologous sequences. These BLOSUM matrices are used by the FASTA and BLAST search tools described below.

You will examine two methods for finding similar sequences, FASTA and BLAST.

Start FASTA by clicking the following link: http://www.ebi.ac.uk/Tools/sss/fasta/

Here is the protein sequence, in FASTA format, that you should have obtained from the 6-frame translation of the E. coli pfkA sequence:

>myseq
MIKKIGVLTSGGDAPGMNAAIRGVVRSALTEGLEVMGIYDGYLGLYEDRMVQLDRYSVSD
MINRGGTFLGSARFPEFRDENIRAVAIENLKKRGIDALVVIGGDGSYMGAMRLTEMGFPC
IGLPGTIDNDIKGTDYTIGFFTALSTVVEAIDRLRDTSSSHQRISVVEVMGRYCGDLTLA
AAIAGGCEFVVVPEVEFSREDLVNEIKAGIAKGKKHAIVAITEHMCDVDELAHFIEKETG
RETRATVLGHIQRGGSPVPYDRILASRMGAYAIDLLLAGYGGRCVGIQNEQLVHHDIIDA
IENMKRPFKGDWLDCAKKLY
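
FASTA format is simply a header line starting with '>' followed by one or more lines of sequence. If you ever need to read such a file yourself, a minimal parser might look like the sketch below. The function name parse_fasta is just illustrative, and to keep the example short only the first two sequence lines of the record above are used.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Shortened copy of the record above (first two sequence lines only).
example = """>myseq
MIKKIGVLTSGGDAPGMNAAIRGVVRSALTEGLEVMGIYDGYLGLYEDRMVQLDRYSVSD
MINRGGTFLGSARFPEFRDENIRAVAIENLKKRGIDALVVIGGDGSYMGAMRLTEMGFPC
"""

records = parse_fasta(example)
header, sequence = records[0]
print(header, len(sequence))
```

The line breaks inside the sequence carry no meaning; the parser joins the lines into a single string.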

Run the FASTA search

Paste the sequence in FASTA format (including the header line) into the sequence box.

Important: Under 'Step 1' ensure that ONLY UniProtKB/Swiss-Prot is ticked in the list of databases that could be searched.

Now click the Submit button to run the search. It may take some time to run (but usually no more than a few minutes) so be patient.

The page will keep refreshing until the results are ready. Once the results are displayed, proceed to the next step of the tutorial.

Note: there are a number of alternative FASTA servers at other sites, such as http://fasta.genome.jp/ and http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=select&pgm=fa. You could use one of these if the EBI server is not working, but it is better to wait until the EBI server is available again. Servers being unavailable is, of course, a complete pain, but it is the reality of working with bioinformatics tools over the web!

Results

The results show you which sequences in the database are similar to the sequence you used for searching.

You should find that the top hits are to pfkA from various strains of E. coli and other very closely related bacteria. This is the enzyme you looked at earlier in the KEGG database. Since this protein sequence is already in the UniProtKB/Swiss-Prot database, you will see that your sequence is 100% identical to the top hit.

Ask yourself what the other hits mean. In particular, look at the e-values. For a given score, the e-value indicates the number of hits that one would expect to see by chance with this score or better. This accounts for the size and content of the database being searched. In other words, this is a measure of the chance that this is a false positive hit: the lower the value, the more likely it is that this is a genuine hit.

If you are unfamiliar with the notation used for small numbers, a value of 1.5e-6 means 1.5×10^-6 (i.e. 0.0000015). In written work you should never write it as 1.5e-6!
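
If you process search results in a script, the 'e' notation is read directly as a number. The short sketch below, in which the helper name to_written_form is hypothetical, converts such a value into the form suitable for written work.

```python
# Python reads the 'e' notation directly as a floating-point number.
e_value = float("1.5e-6")

def to_written_form(value):
    """Format a number as mantissa×10^exponent for written work."""
    mantissa, exponent = f"{value:e}".split("e")
    return f"{float(mantissa):g}×10^{int(exponent)}"

print(e_value == 0.0000015)
print(to_written_form(e_value))
```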

By clicking a database identifier (in the DB:ID column), you can view details of the protein that was hit.

Other than E. coli, which species has a hit with 100% identity to the search sequence in the results of the FASTA search of UniProtKB/Swiss-Prot? Record your answer.

Viewing an alignment

Note! You won't be able to do this section if you didn't use the EBI server.

On the left of the FASTA results page, there is a box titled Apply to selection:. Under Tools: at the bottom, change the selector to Clustal Omega, click Launch and then Submit.

This runs a multiple sequence alignment on the results.

When the results come back, click the Results Viewers button at the top of the results, then View in MView and then Submit. This takes you to the MView multiple alignment viewer. You can look at the multiple alignment to see conserved features of the sequences.

The alignment shows 50 sequences. The first 80 positions of the alignment are shown horizontally with each row representing a different sequence. The alignment then continues in another block with the next 80 aligned positions.

At the bottom of each block of the multiple alignment, consensus sequences are shown at different levels of conservation. Thus for an amino acid to be shown in consensus/100%, it must be conserved at that position in every sequence. Upper case letters are used to indicate completely conserved residues. Lower case letters and symbols are used to indicate partially conserved residues and full-stops are used to indicate non-conserved residues.
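
To make the idea of the consensus/100% line concrete, here is a simplified sketch. It is not MView's own algorithm: it handles only the 100% level, printing the residue where a column is identical in every sequence and a full-stop otherwise, and it ignores the lower case letters and symbols used for partial conservation. The toy alignment is invented for illustration.

```python
def consensus_100(aligned):
    """Consensus at the 100% level for equal-length aligned sequences.

    A column yields its residue if every row has the same residue,
    and '.' otherwise (gap characters '-' never count as conserved).
    """
    result = []
    for column in zip(*aligned):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            result.append(column[0])
        else:
            result.append(".")
    return "".join(result)

# Hypothetical toy alignment of three short sequences.
alignment = [
    "MIKKIGVLTSGG",
    "MIKRIGVLTSGG",
    "MVKKIAVLTSGG",
]

print(consensus_100(alignment))  # M.K.I.VLTSGG
```

Columns 2, 4 and 6 differ in at least one sequence, so they appear as full-stops; every other column is completely conserved.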

Note! If all your hits have an identical sequence, you have probably forgotten to select only UniProtKB/Swiss-Prot in the databases to search using FASTA. In this case, go back and correct your search.

Later, you will learn how to create your own multiple alignments.

When you perform a sequence similarity search, it is critical that you look at the e-value for each of the 'hits' that your search returns. The e-value (or 'expectation value') is defined as the number of hits you expect to see, in this database, with this score or better.

An e-value of ≤0.01 is generally indicative of homology. Proteins that are homologues can have much higher e-values (e.g. 5.0), but the search gives no statistical support to indicate that this is the case.
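
If you save the hits from a search, applying this rule of thumb takes only a few lines. The sketch below uses made-up hit names and e-values purely for illustration; only the 0.01 threshold comes from the text above.

```python
THRESHOLD = 0.01  # e-value cut-off quoted above

# Hypothetical hits as (identifier, e-value) pairs; not real search results.
hits = [
    ("hit_A", 3.2e-120),
    ("hit_B", 1.5e-6),
    ("hit_C", 0.01),
    ("hit_D", 5.0),
]

# Hits at or below the threshold have statistical support for homology.
supported = [name for name, e_value in hits if e_value <= THRESHOLD]
# The rest may still be homologues, but the search does not support it.
unsupported = [name for name, e_value in hits if e_value > THRESHOLD]

print("Supported:  ", supported)
print("Unsupported:", unsupported)
```

A hit in the second list is not necessarily unrelated to the search sequence; the search simply provides no statistical evidence either way.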