Birkbeck MSc Bioinformatics With Systems Biology

Testing our hypothesis using assert

Our hypothesis is that the error is a result of the variable tag being set, so we can use assert to raise an exception if tag should ever be set.

#!/usr/bin/python3

def removeHtmlMarkup(s):
    tag   = False                           
    quote = False
    out   = ""

    for c in s:
        assert not tag                      # NEW

        if c == '<' and not quote:          # Start of markup
            tag = True
        elif c == '>' and not quote:        # End of markup  
            tag = False
        elif c == '"' or c == "'" and tag:  # Quote          
            quote = not quote               
        elif not tag:
            out = out + c

    return out

""" We know this failed """
if __name__ == "__main__":
    print (removeHtmlMarkup('"foo"'), '\t["foo"]')

[Download strip8.py]

If our hypothesis is correct, we would expect the assert to raise an AssertionError.
If no AssertionError is raised, then tag has never been set.

Result

There was no assert exception
We reject our hypothesis
We need a new hypothesis!
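
As a control for this experiment, it helps to confirm that the assert is able to fire at all. The sketch below is not one of the course files: remove_html_markup_checked is a copy of the function above under an illustrative name, and raises_assertion is a hypothetical helper that reports whether a call raised AssertionError. It assumes a normal run of Python, since assert statements are skipped when Python is started with the -O option.

```python
def remove_html_markup_checked(s):
    """Copy of the buggy function with the 'tag is never set' assertion."""
    tag = False
    quote = False
    out = ""

    for c in s:
        assert not tag, "tag was set"

        if c == '<' and not quote:          # Start of markup
            tag = True
        elif c == '>' and not quote:        # End of markup
            tag = False
        elif c == '"' or c == "'" and tag:  # Quote (still contains the bug)
            quote = not quote
        elif not tag:
            out = out + c

    return out


def raises_assertion(func, arg):
    """Hypothetical helper: True if func(arg) raises AssertionError."""
    try:
        func(arg)
    except AssertionError:
        return True
    return False


# The failing input never sets tag, so the assert stays silent.
print(raises_assertion(remove_html_markup_checked, '"foo"'))       # False
# Input with real markup does set tag, so the assert fires.
print(raises_assertion(remove_html_markup_checked, '<b>foo</b>'))  # True
```

The second call shows that the assertion does fire when tag is set, so its silence on '"foo"' is meaningful evidence against the hypothesis.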

Here again is the only place where quotes are handled:


        elif c == '"' or c == "'" and tag:  # Quote          
            quote = not quote

We know (from our assert) that tag is always False.
If that is the case, and if our code is correct, we should never enter the block that changes the quote mode.

We can test this with another assert:

#!/usr/bin/python3

def removeHtmlMarkup(s):
    tag   = False                           
    quote = False
    out   = ""

    for c in s:
        assert not quote                    # NEW

        if c == '<' and not quote:          # Start of markup
            tag = True
        elif c == '>' and not quote:        # End of markup  
            tag = False
        elif c == '"' or c == "'" and tag:  # Quote          
            quote = not quote               
        elif not tag:
            out = out + c

    return out

""" We know this failed """
if __name__ == "__main__":
    print (removeHtmlMarkup('"foo"'), '\t["foo"]')

[Download strip9.py]

We find that the assertion raises an exception, so we know that we are entering the block where the quote variable is changed.
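
To see exactly where the state changes, one option is to record tag and quote at the top of every loop iteration. This is a minimal sketch rather than a course file: trace_states is a hypothetical name, and it reuses the same buggy logic so that the trace reflects the program under test.

```python
def trace_states(s):
    """Return (char, tag, quote) as seen at the top of each iteration."""
    tag = False
    quote = False
    states = []

    for c in s:
        states.append((c, tag, quote))      # State before c is processed

        if c == '<' and not quote:          # Start of markup
            tag = True
        elif c == '>' and not quote:        # End of markup
            tag = False
        elif c == '"' or c == "'" and tag:  # Quote (still contains the bug)
            quote = not quote
        # Output handling is omitted: only the state is of interest here.

    return states


for c, tag, quote in trace_states('"foo"'):
    print(repr(c), "tag =", tag, "quote =", quote)
```

For the input '"foo"', quote is already True when the second character (f) is reached, while tag is False throughout. That is the point at which assert not quote fails.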

Hypothesis 2

We make a new hypothesis:

The error is due to the condition always evaluating to True

An experiment to verify this:

If the test on the quotes evaluates to True, then the next block will be executed.
Add an assert False into the block. This will always throw an exception when the block is entered.

#!/usr/bin/python3

def removeHtmlMarkup(s):
    tag   = False                           
    quote = False
    out   = ""

    for c in s:
        if c == '<' and not quote:          # Start of markup
            tag = True
        elif c == '>' and not quote:        # End of markup  
            tag = False
        elif c == '"' or c == "'" and tag:  # Quote          
            assert False                    # Should never be reached NEW
            quote = not quote               
        elif not tag:
            out = out + c

    return out

""" We know this failed """
if __name__ == "__main__":
    print (removeHtmlMarkup('"foo"'), '\t["foo"]')

[Download strip10.py]

The program does indeed throw an exception
The condition is evaluating to True

Revisiting our hypotheses

Double quotes are stripped from the input (confirmed)
The error is a result of tag being set (rejected)
The error is due to the condition always evaluating to True (confirmed for the double-quote input)

That is, we know that:

tag is not being set
The condition is evaluating to True even though tag is not set

Are all quotes stripped?

Is the problem with quotes general? Are single quotes stripped in the same way?

We make a new hypothesis:

Single quotes are stripped from the input

We modify our test code to test this hypothesis:

#!/usr/bin/python3

def removeHtmlMarkup(s):
    tag   = False                           
    quote = False
    out   = ""

    for c in s:
        if c == '<' and not quote:          # Start of markup
            tag = True
        elif c == '>' and not quote:        # End of markup  
            tag = False
        elif c == '"' or c == "'" and tag:  # Quote          
            quote = not quote               
        elif not tag:
            out = out + c

    return out

""" Our tests """
if __name__ == "__main__":
    print (removeHtmlMarkup('"foo"'), '\t["foo"]')  
    print (removeHtmlMarkup("'foo'"), "\t['foo']")   # NEW TEST

[Download strip11.py]

We find that the single quotes are not stripped
We reject the hypothesis!

Revisiting our hypotheses

Double quotes are stripped from the input (confirmed)
The error is a result of tag being set (rejected)
The error is due to the condition always evaluating to True (confirmed for the double-quote input)
Single quotes are stripped from the input (rejected)

What have we learned?

The condition


        elif c == '"' or c == "'" and tag:  # Quote          
            quote = not quote

is True when c is a double quote (even though tag is False)
is False when c is a single quote (because tag is False)
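
These observations can be checked directly by evaluating the condition on its own for each kind of character and each value of tag. The function name quote_condition is illustrative; its body is the condition copied unchanged from the code above.

```python
def quote_condition(c, tag):
    """The quote test exactly as written in the buggy code."""
    return c == '"' or c == "'" and tag


# Tabulate the condition for a double quote, a single quote and a letter.
table = {}
for c in ('"', "'", 'x'):
    for tag in (False, True):
        table[(c, tag)] = quote_condition(c, tag)
        print(repr(c), "tag =", tag, "->", table[(c, tag)])
```

The row for a double quote is True for both values of tag, whereas the row for a single quote is True only when tag is True.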

Solving the bug

We now have enough information to see exactly what is going on.

When the character is a double quote, the condition is evaluating to True even when tag is False.
When the character is a single quote, the condition is not evaluating to True when tag is False.

The explanation

'and' takes precedence over 'or'. Consequently the code:


        elif c == '"' or c == "'" and tag:  # Quote          
            quote = not quote

is equivalent to:


        elif c == '"' or (c == "'" and tag):  # Quote          
            quote = not quote

when what we wanted was:


        elif (c == '"' or c == "'") and tag:  # Quote          
            quote = not quote
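
The precedence rule can be verified by brute force over all combinations of three Boolean values. In this sketch the function names bare, and_first and or_first are illustrative; a stands for c == '"', b for c == "'" and c for tag.

```python
from itertools import product


def bare(a, b, c):
    return a or b and c        # No parentheses, as in the buggy code


def and_first(a, b, c):
    return a or (b and c)      # How Python actually groups it


def or_first(a, b, c):
    return (a or b) and c      # What was intended


combos = list(product((False, True), repeat=3))

# The unparenthesised form agrees with and_first on every input.
same_as_and_first = all(bare(*x) == and_first(*x) for x in combos)
print(same_as_and_first)       # True

# It disagrees with the intended form whenever a is True and c is False.
differs_from_or_first = [x for x in combos if bare(*x) != or_first(*x)]
print(differs_from_or_first)   # [(True, False, False), (True, True, False)]
```

The combination (True, False, False) is the failing case: the character is a double quote and tag is False.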

The final code

#!/usr/bin/python3

def removeHtmlMarkup(s):
    tag   = False                           
    quote = False
    out   = ""

    for c in s:
        if c == '<' and not quote:            # Start of markup
            tag = True
        elif c == '>' and not quote:          # End of markup  
            tag = False
        elif (c == '"' or c == "'") and tag:  # Quote          
            quote = not quote               
        elif not tag:
            out = out + c

    return out

""" Our tests """
if __name__ == "__main__":
    print (removeHtmlMarkup('"foo"'),                      '\t["foo"]')
    print (removeHtmlMarkup("'foo'"),                      "\t['foo']")

    # Old tests
    print ("Old tests...")
    print (removeHtmlMarkup('<b>foo</b>'),                 '\t[foo]')
    print (removeHtmlMarkup('<em>foo</em>'),               '\t[foo]')
    print (removeHtmlMarkup('<a href="foo.html">foo</a>'), '\t[foo]')
    print (removeHtmlMarkup('<a href="">foo</a>'),         '\t[foo]')
    print (removeHtmlMarkup('<a href=">">foo</a>'),        '\t[foo]')
    print (removeHtmlMarkup('<b>foo</b>'),                 '\t[foo]')
    print (removeHtmlMarkup('<b>"foo"</b>'),               '\t["foo"]')
    print (removeHtmlMarkup('"<b>foo</b>"'),               '\t["foo"]')
    print (removeHtmlMarkup('<"b">foo</"b">'),             '\t[foo]')

[Download strip12.py]
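
The printed checks above have to be compared by eye. As a possible alternative, the same inputs and expected outputs can be written as assertions so that any regression stops the program. This is a sketch, not a course file: remove_html_markup is the final code under an illustrative name, and CASES is a hypothetical table built from the tests listed above.

```python
def remove_html_markup(s):
    """The corrected function from the final code."""
    tag = False
    quote = False
    out = ""

    for c in s:
        if c == '<' and not quote:            # Start of markup
            tag = True
        elif c == '>' and not quote:          # End of markup
            tag = False
        elif (c == '"' or c == "'") and tag:  # Quote inside a tag
            quote = not quote
        elif not tag:
            out = out + c

    return out


# (input, expected output) pairs taken from the tests above.
CASES = [
    ('"foo"',                      '"foo"'),
    ("'foo'",                      "'foo'"),
    ('<b>foo</b>',                 'foo'),
    ('<em>foo</em>',               'foo'),
    ('<a href="foo.html">foo</a>', 'foo'),
    ('<a href="">foo</a>',         'foo'),
    ('<a href=">">foo</a>',        'foo'),
    ('<b>"foo"</b>',               '"foo"'),
    ('"<b>foo</b>"',               '"foo"'),
    ('<"b">foo</"b">',             'foo'),
]

if __name__ == "__main__":
    for given, expected in CASES:
        actual = remove_html_markup(given)
        assert actual == expected, (given, actual, expected)
    print("All", len(CASES), "cases passed")
```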

An exercise

What if we want to keep tags that appear inside quotes? For example,

   <b>We want to keep "<thistag>"</b>

should give

   We want to keep "<thistag>"

We would need a state machine with four states:

States: no-tag,no-quote / tag,no-quote / tag,quote / no-tag,quote
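
One possible approach to the exercise is sketched below. It is an assumption about how the four states might be handled, not a model answer from the course. The name remove_html_markup_keep_quoted is hypothetical, and the choice to remember which quote character opened the quote is a design decision of this sketch.

```python
def remove_html_markup_keep_quoted(s):
    """Strip tags, but keep anything that sits inside quotes in the text."""
    tag = False    # True while inside <...>
    quote = None   # None, or the quote character that opened the quote
    out = []

    for c in s:
        if quote is not None:
            # States "tag,quote" and "no-tag,quote":
            # only the matching quote character ends the quote.
            if c == quote:
                quote = None
            if not tag:
                out.append(c)
        elif c == '"' or c == "'":
            # A quote opens; it is echoed only when outside a tag.
            quote = c
            if not tag:
                out.append(c)
        elif tag:
            # State "tag,no-quote": skip everything up to '>'.
            if c == '>':
                tag = False
        elif c == '<':
            # State "no-tag,no-quote": a tag starts.
            tag = True
        else:
            out.append(c)

    return "".join(out)


print(remove_html_markup_keep_quoted('<b>We want to keep "<thistag>"</b>'))
# We want to keep "<thistag>"
```

With this behaviour the earlier test input '"<b>foo</b>"' is returned unchanged, because its tags are now inside quotes. A known limitation of the sketch is that an apostrophe in ordinary text (as in don't) opens a quote, so any tags that follow it are kept until another single quote appears.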
