Assertions are explicit statements of what you assume your program will have to deal with. An assertion stops the program as soon as one of those assumptions is violated, so the impossible is never asked of your code (at least the impossible things you can think of!).
The assert statement...
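As a minimal illustration of the assert statement in general, the sketch below guards a small helper with a precondition. The function average and its message text are hypothetical and are not part of the program being debugged here. Note that Python skips assert statements entirely when run with the -O option, so they suit debugging checks rather than validation of user input.

```python
def average(values):
    # Hypothetical helper, used only to illustrate assert.
    # Precondition: an empty list would otherwise fail with ZeroDivisionError.
    assert len(values) > 0, "average() needs at least one value"
    return sum(values) / len(values)


print(average([1, 2, 3]))

try:
    average([])
except AssertionError as err:
    # A failed assertion raises AssertionError carrying the optional message.
    print("Assertion failed:", err)
```

The general form is assert condition, message, where the message is optional.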
Our hypothesis is that the error is a result of the variable tag being set, so we can use assert to raise an exception if tag should ever be set.
#!/usr/bin/python3

def removeHtmlMarkup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        assert not tag  # NEW
        if c == '<' and not quote:  # Start of markup
            tag = True
        elif c == '>' and not quote:  # End of markup
            tag = False
        elif c == '"' or c == "'" and tag:  # Quote
            quote = not quote
        elif not tag:
            out = out + c

    return out

""" We know this failed """
if __name__ == "__main__":
    print(removeHtmlMarkup('"foo"'), '\t["foo"]')
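The input contains no '<', so tag is never set and this assertion does not fire: tag cannot be the cause. The next suspect is quote. Here again is the only place where quotes are handled:
elif c == '"' or c == "'" and tag:  # Quote
    quote = not quote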
We can test this with another assert:
#!/usr/bin/python3

def removeHtmlMarkup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        assert not quote  # NEW: fails as soon as quote has been set
        if c == '<' and not quote:  # Start of markup
            tag = True
        elif c == '>' and not quote:  # End of markup
            tag = False
        elif c == '"' or c == "'" and tag:  # Quote
            quote = not quote
        elif not tag:
            out = out + c

    return out

""" We know this failed """
if __name__ == "__main__":
    print(removeHtmlMarkup('"foo"'), '\t["foo"]')
We find that the assertion raises an exception, so we know that we are entering the block where the quote variable is changed. An assert False inside that block confirms it:
#!/usr/bin/python3

def removeHtmlMarkup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        if c == '<' and not quote:  # Start of markup
            tag = True
        elif c == '>' and not quote:  # End of markup
            tag = False
        elif c == '"' or c == "'" and tag:  # Quote
            assert False  # Should never be reached NEW
            quote = not quote
        elif not tag:
            out = out + c

    return out

""" We know this failed """
if __name__ == "__main__":
    print(removeHtmlMarkup('"foo"'), '\t["foo"]')
i.e. We know that:
Is the problem with quotes general? Are single quotes stripped in the same way?
We modify our test code to test this hypothesis:
#!/usr/bin/python3

def removeHtmlMarkup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        if c == '<' and not quote:  # Start of markup
            tag = True
        elif c == '>' and not quote:  # End of markup
            tag = False
        elif c == '"' or c == "'" and tag:  # Quote
            quote = not quote
        elif not tag:
            out = out + c

    return out

""" Our tests """
if __name__ == "__main__":
    print(removeHtmlMarkup('"foo"'), '\t["foo"]')
    print(removeHtmlMarkup("'foo'"), "\t['foo']")  # NEW TEST
The condition
elif c == '"' or c == "'" and tag:  # Quote
    quote = not quote
is
We now have enough information to see exactly what is going on.
'and' takes precedence over 'or', that is, it binds more tightly. Consequently the code:
elif c == '"' or c == "'" and tag:  # Quote
    quote = not quote
is equivalent to:
elif c == '"' or (c == "'" and tag):  # Quote
    quote = not quote
when what we wanted was:
elif (c == '"' or c == "'") and tag:  # Quote
    quote = not quote
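To make the precedence argument concrete, the following sketch enumerates every combination of the three boolean sub-conditions and compares the three groupings. The function and parameter names (as_written, as_parsed, as_intended, is_double, is_single) are illustrative only and do not appear in the original program.

```python
from itertools import product


def as_written(is_double, is_single, tag):
    # The condition exactly as it appears in the buggy code.
    return is_double or is_single and tag


def as_parsed(is_double, is_single, tag):
    # How Python groups it: 'and' binds more tightly than 'or'.
    return is_double or (is_single and tag)


def as_intended(is_double, is_single, tag):
    # What we wanted: either kind of quote, but only inside a tag.
    return (is_double or is_single) and tag


differences = []
for is_double, is_single, tag in product([False, True], repeat=3):
    args = (is_double, is_single, tag)
    # The unparenthesised form always agrees with the 'as parsed' grouping.
    assert as_written(*args) == as_parsed(*args)
    if as_written(*args) != as_intended(*args):
        differences.append(args)

for is_double, is_single, tag in differences:
    print("differs when is_double =", is_double,
          "is_single =", is_single, "tag =", tag)
```

The two groupings disagree only when is_double is true and tag is false, which is the case of a double quote outside a tag: exactly the input '"foo"' that exposed the bug.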
With the parentheses in place, the complete program and its tests become:
#!/usr/bin/python3

def removeHtmlMarkup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        if c == '<' and not quote:  # Start of markup
            tag = True
        elif c == '>' and not quote:  # End of markup
            tag = False
        elif (c == '"' or c == "'") and tag:  # Quote
            quote = not quote
        elif not tag:
            out = out + c

    return out

""" Our tests """
if __name__ == "__main__":
    print(removeHtmlMarkup('"foo"'), '\t["foo"]')
    print(removeHtmlMarkup("'foo'"), "\t['foo']")

    # Old tests
    print("Old tests...")
    print(removeHtmlMarkup('<b>foo</b>'), '\t[foo]')
    print(removeHtmlMarkup('<em>foo</em>'), '\t[foo]')
    print(removeHtmlMarkup('<a href="foo.html">foo</a>'), '\t[foo]')
    print(removeHtmlMarkup('<a href="">foo</a>'), '\t[foo]')
    print(removeHtmlMarkup('<a href=">">foo</a>'), '\t[foo]')
    print(removeHtmlMarkup('<b>foo</b>'), '\t[foo]')
    print(removeHtmlMarkup('<b>"foo"</b>'), '\t["foo"]')
    print(removeHtmlMarkup('"<b>foo</b>"'), '\t["foo"]')
    print(removeHtmlMarkup('<"b">foo</"b">'), '\t[foo]')
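The printed comparisons in the listing above still have to be checked by eye. As one possible sketch, the same cases can be turned into assertions so that a regression stops the program immediately. The CASES table and the loop are additions made here for illustration; the function body is the corrected version from the text.

```python
def removeHtmlMarkup(s):
    tag = False
    quote = False
    out = ""
    for c in s:
        if c == '<' and not quote:            # Start of markup
            tag = True
        elif c == '>' and not quote:          # End of markup
            tag = False
        elif (c == '"' or c == "'") and tag:  # Quote inside a tag
            quote = not quote
        elif not tag:
            out = out + c
    return out


# (input, expected output) pairs taken from the printed tests.
CASES = [
    ('"foo"', '"foo"'),
    ("'foo'", "'foo'"),
    ('<b>foo</b>', 'foo'),
    ('<em>foo</em>', 'foo'),
    ('<a href="foo.html">foo</a>', 'foo'),
    ('<a href="">foo</a>', 'foo'),
    ('<a href=">">foo</a>', 'foo'),
    ('<b>"foo"</b>', '"foo"'),
    ('"<b>foo</b>"', '"foo"'),
    ('<"b">foo</"b">', 'foo'),
]

for source, expected in CASES:
    actual = removeHtmlMarkup(source)
    # The message shows the failing input if an expectation is ever violated.
    assert actual == expected, (source, actual, expected)

print("All", len(CASES), "cases passed")
```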
What if we want to keep tags that appear inside quotes? For example,
<b>We want to keep "<thistag>"</b>
should give
We want to keep "<thistag>"
We would need a state machine with four states:
States: no-tag,no-quote / tag,no-quote / tag,quote / no-tag,quote
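The four states listed above could be sketched as follows. This is one possible reading rather than a definitive design: the State enumeration, the function name remove_html_markup_keep_quoted, and the choice to let either quote character open or close a quoted region are all assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    TEXT = "no-tag,no-quote"
    TAG = "tag,no-quote"
    TAG_QUOTE = "tag,quote"
    TEXT_QUOTE = "no-tag,quote"


QUOTES = ('"', "'")


def remove_html_markup_keep_quoted(s):
    # Hypothetical variant: markup inside quoted text is kept.
    state = State.TEXT
    out = ""
    for c in s:
        if state is State.TEXT:
            if c == '<':
                state = State.TAG          # Start of markup, emit nothing
            else:
                if c in QUOTES:
                    state = State.TEXT_QUOTE
                out += c
        elif state is State.TEXT_QUOTE:
            # Inside quoted text everything is kept, including '<' and '>'.
            if c in QUOTES:
                state = State.TEXT
            out += c
        elif state is State.TAG:
            if c == '>':
                state = State.TEXT         # End of markup
            elif c in QUOTES:
                state = State.TAG_QUOTE    # Quoted attribute value
        else:  # State.TAG_QUOTE
            # Inside a quoted attribute a '>' does not end the tag.
            if c in QUOTES:
                state = State.TAG
    return out


print(remove_html_markup_keep_quoted('<b>We want to keep "<thistag>"</b>'))
```

Under these assumptions the earlier test input '"<b>foo</b>"' would keep its tags, and an apostrophe in ordinary text (as in "don't") would open a quoted region, so a real implementation would need a more careful rule for what counts as a quote outside a tag.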