Using Assertions

Assertions are explicit statements of what you assume your program will have to deal with. An assertion prevents the impossible from being asked of your code (at least the impossible things you can think of!).

The assert statement...
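As a minimal sketch of the statement itself (the function average and its message are made up for illustration and are not part of the strip programs): assert evaluates a condition and raises AssertionError, with an optional message, when the condition is false.

```python
def average(values):
    # Assumption made explicit: callers never pass an empty list.
    assert len(values) > 0, "average() needs at least one value"
    return sum(values) / len(values)

print(average([2, 4, 6]))            # 4.0

try:
    average([])                      # violates the assumption
except AssertionError as err:
    print("AssertionError:", err)
```

Note that Python skips assert statements when it is run with the -O option, so they are a debugging aid rather than a substitute for input validation.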

Our hypothesis is that the error is a result of the variable tag being set, so we can use assert to raise an exception if tag should ever be set.

#!/usr/bin/python3

def removeHtmlMarkup(s):
    tag   = False                           
    quote = False
    out   = ""

    for c in s:
        assert not tag                      #                    NEW

        if c == '<' and not quote:          # Start of markup
            tag = True
        elif c == '>' and not quote:        # End of markup  
            tag = False
        elif c == '"' or c == "'" and tag:  # Quote          
            quote = not quote               
        elif not tag:
            out = out + c

    return out

""" We know this failed """
if __name__ == "__main__":
    print (removeHtmlMarkup('"foo"'), '\t["foo"]')


[Download strip8.py]

  • If our hypothesis is correct, we would expect an assert exception.
  • If there is no assert exception, then tag has never been set.
  • There was no assert exception
  • We reject our hypothesis
  • We need a new hypothesis!
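The experiment above can also be run mechanically. The sketch below repeats the function from strip8.py and adds a hypothetical helper, assertionFires (not part of the original programs), that reports whether a call raised AssertionError. The second call is a control: it shows that the assert does fire on input that really contains a tag.

```python
def removeHtmlMarkup(s):
    tag   = False
    quote = False
    out   = ""

    for c in s:
        assert not tag                      # The hypothesis under test

        if c == '<' and not quote:          # Start of markup
            tag = True
        elif c == '>' and not quote:        # End of markup
            tag = False
        elif c == '"' or c == "'" and tag:  # Quote (still the buggy condition)
            quote = not quote
        elif not tag:
            out = out + c

    return out

def assertionFires(func, text):
    """Hypothetical helper: True if func(text) raises AssertionError."""
    try:
        func(text)
    except AssertionError:
        return True
    return False

print(assertionFires(removeHtmlMarkup, '"foo"'))       # False: tag was never set
print(assertionFires(removeHtmlMarkup, '<b>foo</b>'))  # True: control case
```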

Here again is the only place where quotes are handled:


        elif c == '"' or c == "'" and tag:  # Quote          
            quote = not quote

  • We know (from our assert) that tag is always False.
  • If that is the case (and our code is correct), we should never enter the block that changes the quote mode

We can test this with another assert:

#!/usr/bin/python3

def removeHtmlMarkup(s):
    tag   = False                           
    quote = False
    out   = ""

    for c in s:
        assert not quote
#        assert quote

        if c == '<' and not quote:          # Start of markup
            tag = True
        elif c == '>' and not quote:        # End of markup  
            tag = False
        elif c == '"' or c == "'" and tag:  # Quote          
            quote = not quote               
        elif not tag:
            out = out + c

    return out

""" We know this failed """
if __name__ == "__main__":
    print (removeHtmlMarkup('"foo"'), '\t["foo"]')


[Download strip9.py]

We find that the assertion raises an exception, so we know that we are entering the block where the quote variable is changed.

We make a new hypothesis:

  • The error is due to the condition always evaluating to True

An experiment to verify this:

  • If the test on the quotes evaluates to true then the next block will be executed.
  • Add an assert False into the block; this will always raise an exception when the block is entered

#!/usr/bin/python3

def removeHtmlMarkup(s):
    tag   = False                           
    quote = False
    out   = ""

    for c in s:
        if c == '<' and not quote:          # Start of markup
            tag = True
        elif c == '>' and not quote:        # End of markup  
            tag = False
        elif c == '"' or c == "'" and tag:  # Quote          
            assert False                    # Should never be reached NEW
            quote = not quote               
        elif not tag:
            out = out + c

    return out

""" We know this failed """
if __name__ == "__main__":
    print (removeHtmlMarkup('"foo"'), '\t["foo"]')


[Download strip10.py]

  • The program does indeed throw an exception
  • The condition is evaluating to True

Revisiting our hypotheses

  1. Double quotes are stripped from the input
  2. The error is a result of tag being set
  3. The error is due to the condition always evaluating to True

That is, we know that:

  • tag is not being set
  • The condition is evaluating to True even though tag is not set

Is the problem with quotes general? Are single quotes stripped in the same way?

We make a new hypothesis:

  • Single quotes are stripped from the input

We modify our test code to test this hypothesis:

#!/usr/bin/python3

def removeHtmlMarkup(s):
    tag   = False                           
    quote = False
    out   = ""

    for c in s:
        if c == '<' and not quote:          # Start of markup
            tag = True
        elif c == '>' and not quote:        # End of markup  
            tag = False
        elif c == '"' or c == "'" and tag:  # Quote          
            quote = not quote               
        elif not tag:
            out = out + c

    return out

""" Our tests """
if __name__ == "__main__":
    print (removeHtmlMarkup('"foo"'), '\t["foo"]')  
    print (removeHtmlMarkup("'foo'"), "\t['foo']")   # NEW TEST



[Download strip11.py]

  • We find that the single quotes are not stripped
  • We reject the hypothesis!

Revisiting our hypotheses

  1. Double quotes are stripped from the input
  2. The error is a result of tag being set
  3. The error is due to the condition always evaluating to True
  4. Single quotes are stripped from the input

What have we learned?

The condition


        elif c == '"' or c == "'" and tag:  # Quote          
            quote = not quote

is, while tag is False:

  • True when c is a double quote
  • False when c is a single quote

We now have enough information to see exactly what is going on.

  • When the character is a double quote, the condition is evaluating to True even when tag is False.
  • When the character is a single quote, the condition is not evaluating to True when tag is False.
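These observations can be checked in isolation. The sketch below copies the condition into a small function (quoteCondition is a name chosen here for illustration) and evaluates it for both quote characters and both values of tag.

```python
def quoteCondition(c, tag):
    # The condition exactly as it is written in removeHtmlMarkup
    return c == '"' or c == "'" and tag

for c in ('"', "'"):
    for tag in (False, True):
        print(repr(c), "tag =", tag, "->", quoteCondition(c, tag))
```

Only the row for a double quote with tag False is wrong: the condition is True there although we are not inside a tag.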

The explanation

'and' takes precedence over 'or' (it binds more tightly). Consequently the code:


        elif c == '"' or c == "'" and tag:  # Quote          
            quote = not quote

is equivalent to:


        elif c == '"' or (c == "'" and tag):  # Quote          
            quote = not quote

when what we wanted was:


        elif (c == '"' or c == "'") and tag:  # Quote          
            quote = not quote
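A short way to convince yourself of the grouping is to try every combination of truth values. In this sketch the three names stand in for c == '"', c == "'" and tag; the names are illustrative only.

```python
from itertools import product

differences = []
for isDouble, isSingle, tag in product((False, True), repeat=3):
    asWritten  = isDouble or isSingle and tag
    asParsed   = isDouble or (isSingle and tag)
    asIntended = (isDouble or isSingle) and tag

    # The unparenthesised form always agrees with 'or (… and …)'
    assert asWritten == asParsed

    if asWritten != asIntended:
        differences.append((isDouble, isSingle, tag))

# Combinations where the written condition differs from the intended one
print(differences)   # [(True, False, False), (True, True, False)]
```

The first reported combination is exactly our failing case: a double quote while tag is False. The second cannot occur in practice, since a character cannot be both kinds of quote.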

#!/usr/bin/python3

def removeHtmlMarkup(s):
    tag   = False                           
    quote = False
    out   = ""

    for c in s:
        if c == '<' and not quote:            # Start of markup
            tag = True
        elif c == '>' and not quote:          # End of markup  
            tag = False
        elif (c == '"' or c == "'") and tag:  # Quote          
            quote = not quote               
        elif not tag:
            out = out + c

    return out

""" Our tests """
if __name__ == "__main__":
    print (removeHtmlMarkup('"foo"'),                      '\t["foo"]')
    print (removeHtmlMarkup("'foo'"),                      "\t['foo']")

    # Old tests
    print ("Old tests...")
    print (removeHtmlMarkup('<b>foo</b>'),                 '\t[foo]')
    print (removeHtmlMarkup('<em>foo</em>'),               '\t[foo]')
    print (removeHtmlMarkup('<a href="foo.html">foo</a>'), '\t[foo]')
    print (removeHtmlMarkup('<a href="">foo</a>'),         '\t[foo]')
    print (removeHtmlMarkup('<a href=">">foo</a>'),        '\t[foo]')
    print (removeHtmlMarkup('<b>foo</b>'),                 '\t[foo]')
    print (removeHtmlMarkup('<b>"foo"</b>'),               '\t["foo"]')
    print (removeHtmlMarkup('"<b>foo</b>"'),               '\t["foo"]')
    print (removeHtmlMarkup('<"b">foo</"b">'),             '\t[foo]')

[Download strip12.py]
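Since this section is about assertions, the printed comparisons can also be turned into checks that fail loudly. The following is a sketch only: the CASES table and runTests are names introduced here, and the expected values are the ones shown in brackets in the tests above.

```python
def removeHtmlMarkup(s):
    tag   = False
    quote = False
    out   = ""

    for c in s:
        if c == '<' and not quote:            # Start of markup
            tag = True
        elif c == '>' and not quote:          # End of markup
            tag = False
        elif (c == '"' or c == "'") and tag:  # Quote inside a tag
            quote = not quote
        elif not tag:
            out = out + c

    return out

# (input, expected output) pairs
CASES = [
    ('"foo"',                      '"foo"'),
    ("'foo'",                      "'foo'"),
    ('<b>foo</b>',                 'foo'),
    ('<em>foo</em>',               'foo'),
    ('<a href="foo.html">foo</a>', 'foo'),
    ('<a href="">foo</a>',         'foo'),
    ('<a href=">">foo</a>',        'foo'),
    ('<b>"foo"</b>',               '"foo"'),
    ('"<b>foo</b>"',               '"foo"'),
    ('<"b">foo</"b">',             'foo'),
]

def runTests():
    for text, expected in CASES:
        actual = removeHtmlMarkup(text)
        assert actual == expected, f"{text!r}: got {actual!r}, expected {expected!r}"
    return len(CASES)

print(runTests(), "tests passed")
```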

What if we want to keep tags that appear inside quotes? For example,

   <b>We want to keep "<thistag>"</b>

should give

   We want to keep "<thistag>"

We would need a state machine with four states:

States: no-tag,no-quote / tag,no-quote / tag,quote / no-tag,quote
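One possible shape for such a state machine is sketched below. Only the four states come from the text above; the transitions, the State enum and the function name removeMarkupKeepQuoted are assumptions made for illustration. In particular, this sketch treats any quote character in ordinary text as opening a quoted region, so an apostrophe in running text would also start one.

```python
from enum import Enum

class State(Enum):
    TEXT       = "no-tag,no-quote"
    TEXT_QUOTE = "no-tag,quote"
    TAG        = "tag,no-quote"
    TAG_QUOTE  = "tag,quote"

def removeMarkupKeepQuoted(s):
    state     = State.TEXT
    quoteChar = ""        # which quote character opened the current quote
    out       = []

    for c in s:
        if state is State.TEXT:
            if c == '<':                       # Start of markup
                state = State.TAG
            else:
                out.append(c)
                if c == '"' or c == "'":       # Quote opens in plain text
                    state, quoteChar = State.TEXT_QUOTE, c
        elif state is State.TEXT_QUOTE:
            out.append(c)                      # Keep everything, tags included
            if c == quoteChar:
                state = State.TEXT
        elif state is State.TAG:
            if c == '>':                       # End of markup
                state = State.TEXT
            elif c == '"' or c == "'":         # Quote opens inside a tag
                state, quoteChar = State.TAG_QUOTE, c
        else:                                  # State.TAG_QUOTE
            if c == quoteChar:
                state = State.TAG

    return "".join(out)

print(removeMarkupKeepQuoted('<b>We want to keep "<thistag>"</b>'))
```

With these transitions the earlier test '"<b>foo</b>"' would keep its tags as well, because the whole string sits inside quotes; whether that is the desired behaviour is a design decision the original tests do not settle.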
