This is a programming and computer science blog, so my posts, past and future, have code embedded in them, and for readability I thought it would be nice to have some form of syntax highlighting. Now, this blog is hosted on Blogger, and Blogger doesn’t have any support for syntax highlighting, so I was forced to look elsewhere for a solution.
I really wasn’t happy with the idea of wading through all of this rubbish every time I had to edit my posts’ HTML, so I made two choices:
- I was never going to use Blogger’s built-in writer again. At least not until it was improved.
- I was going to clean the HTML of all my past posts.
Putting the first choice into action is on hold at the moment; I can’t find a decent blogging client for Linux. If anyone knows of one, please tell me. Moving on.
The second choice seemed like a lot of work that I didn’t want to do manually, so I started up Vim and got to writing a Python script to clean up Blogger’s HTML markup automatically.
Now, this script isn’t perfect, but it gets the HTML markup to a point where it is navigable and easier to read. Lots of useless markup gets deleted and what remains gets reformatted a bit. It does the job I wanted done, and that’s an achievement to me. You can view the whole project here.
Despite the fact that the script is technically unfinished, and I will get to that, there is one part that might be useful or interesting to you readers: using a stack to find unclosed HTML tags.
USING A STACK TO FIND UNCLOSED HTML TAGS
This script uses a stack, or at least a list used in a stack-like manner, to find opening HTML tags that aren’t paired with the corresponding closing tag.
It does this by going through all the tags in the document and pushing opening tags onto a stack. When the matching closing tag is reached, the opening tag is popped off again. If, however, a different closing tag turns up first, the opening tag on top of the stack was never closed (or at least not in the right place), and the function returns its position.
import re

# tag_match() is a helper from the rest of the script (not shown here).

def find_next_unclosed(text):
    """Finds the next unclosed HTML tag"""
    tag_stack = []

    # Get an iterator of all tags in file.
    tag_regex = re.compile(r'<[^>]*>', re.DOTALL)
    tags = tag_regex.finditer(text)

    for tag in tags:
        # If it is a closing tag check if it matches the last opening tag.
        if re.match(r'</[^>]*>', tag.group()):
            top_tag = tag_stack[-1]
            if tag_match(top_tag.group(), tag.group()):
                tag_stack.pop()
            else:
                unclosed = tag_stack.pop()
                return (unclosed.start(), unclosed.end())
        else:
            tag_stack.append(tag)
SO WHAT’S GOING ON?
# Get an iterator of all tags in file.
tag_regex = re.compile(r'<[^>]*>', re.DOTALL)
tags = tag_regex.finditer(text)
First, using a regular expression with Python’s built-in re.compile() and the compiled pattern’s finditer() method (both are covered in the Python documentation), we get an iterator over all the tags found in the text we are searching.
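To make that iterator a little more concrete, here is a small illustration on a made-up snippet (the sample string and the name found are mine, not part of the script). Each item the iterator yields is a match object, which is why the script can later call .group(), .start() and .end() on the tags it stores.

```python
import re

# Same pattern as in the script: "<", then anything that is not ">", then ">".
tag_regex = re.compile(r'<[^>]*>', re.DOTALL)

sample = '<div><p>Hello</p></div>'  # made-up input for illustration

# Collect (tag text, start index, end index) for every tag in the sample.
found = [(m.group(), m.start(), m.end()) for m in tag_regex.finditer(sample)]

for tag_text, start, end in found:
    print(tag_text, start, end)
```

A side note: the pattern contains no ".", so the re.DOTALL flag doesn’t change what it matches; the negated character class [^>] already matches newlines.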
for tag in tags:
    # If it is a closing tag check if it matches the last opening tag.
    if re.match(r'</[^>]*>', tag.group()):
Next we loop over each tag and check to see if it is a closing tag, e.g. </body>.
If it isn’t a closing tag, it must be an opening tag, so it is pushed onto a stack that contains only opening tags.
if tag_match(top_tag.group(), tag.group()):
    tag_stack.pop()
If it is a closing tag, the script checks whether the tag on top of the opening tags stack matches it. Note that <div> and </div> match, but <div> and <div>, or <div> and </body>, don’t. If it matches, the current closing tag closes the opening tag on top of the stack and everything is fine, so the top element is popped off the opening tags stack.
else:
    unclosed = tag_stack.pop()
    return (unclosed.start(), unclosed.end())
However, if they don’t match, the top tag on the opening tags stack is unclosed, or at least not closed in the right place, and the position of this tag is returned (as a start and end point).
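The walkthrough leans on tag_match(), which isn’t shown in this post. As a rough idea of what such a helper could look like, here is a minimal sketch. It is an assumption on my part rather than the version in the project: the tag_name() helper and its regex are hypothetical, and it simply compares element names case-insensitively while ignoring attributes.

```python
import re

def tag_name(tag):
    """Return the lower-cased element name of a tag string, or None."""
    # Hypothetical helper: skip "<" or "</", then capture the element name.
    m = re.match(r'</?\s*([A-Za-z][A-Za-z0-9-]*)', tag)
    return m.group(1).lower() if m else None

def tag_match(opening, closing):
    """True if `closing` is the closing tag for `opening` (sketch only)."""
    return (not opening.startswith('</')
            and closing.startswith('</')
            and tag_name(opening) is not None
            and tag_name(opening) == tag_name(closing))

print(tag_match('<div>', '</div>'))   # True
print(tag_match('<div>', '<div>'))    # False
print(tag_match('<div>', '</body>'))  # False
```

With a helper like this, tags such as <br> or <img>, which never get a closing tag, would still be pushed onto the stack and reported as unclosed, so a fuller version would probably want to skip them.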
There we have it, a simple but effective way to search for unclosed HTML tags, or if slightly adapted, brackets. This is good for checking the syntax of a markup file or some file written in a programming language like Java, which makes heavy use of curly braces.
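As a rough sketch of that adaptation, the same push and pop idea can be applied to brackets. This is my own illustration, not code from the project; the names PAIRS and find_unmatched_bracket are made up, and it naively treats every bracket character as significant, including ones inside string literals or comments.

```python
# Maps each closing bracket to the opening bracket it must pair with.
PAIRS = {')': '(', ']': '[', '}': '{'}

def find_unmatched_bracket(text):
    """Return the index of the first bracket that breaks the nesting, or None."""
    stack = []  # indices of opening brackets that are not closed yet
    for i, ch in enumerate(text):
        if ch in PAIRS.values():
            stack.append(i)
        elif ch in PAIRS:
            if not stack:
                return i  # closing bracket with nothing open
            if text[stack[-1]] != PAIRS[ch]:
                return stack[-1]  # opener on top was not closed properly
            stack.pop()
    # Anything left on the stack was never closed; report the innermost one.
    return stack[-1] if stack else None

print(find_unmatched_bracket('f(a[0]) { }'))  # None
print(find_unmatched_bracket('(]'))           # 0
```

Unlike the HTML function above, this version also reports a stray closing bracket and any opener still on the stack when the text ends.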
‘Ello, I’m Jamal – a Tokyo-based, indie-hacking, FinTech software developer with a dependence on data.
I’m friendly, so feel free to say hello!