Yes, in temporary code this kind of quick and dirty solution is good enough. The "proper" alternative in your example would entail a full-fledged HTML parser.

RBerenguel wrote:
A similar kind of problem, I think, appears in the life of every programmer: semi-parsing stuff while worrying about closing tags. So, for example, detecting everything inside a <p> until the "equivalent" instance of </p> appears. The first time I coded something for this (it was a funky C parser I wrote partially in C and partially in Lex, and I used it to generate "visual code patterns" I could visually scan to detect code cheating in an assignment) I just used addition: every time you see a <p> after the first one, add 1. Every time you see </p>, subtract 1. Once you get to 0 you are done. There are better ways to do it if you are interested in the real structure of the document, but if you just want to know the content this is as easy as it gets.
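For what it's worth, the counting trick from the quote could be sketched in Python roughly like this. The function name `paragraph_content` and the tag-matching regex are my own assumptions, not anything from the original C/Lex code, and this version counts the first <p> as depth 1 rather than 0, which comes to the same thing:

```python
import re
from typing import Optional

def paragraph_content(text: str, tag: str = "p") -> Optional[str]:
    """Return the text between the first <tag> and its matching </tag>.

    Hypothetical helper: a depth counter, not a real HTML parser.
    """
    # Matches <p>, <p attr="...">, and </p>, but not <pre> (because of \b).
    token = re.compile(rf"<(/?){tag}\b[^>]*>", re.IGNORECASE)
    depth = 0
    start = None
    for m in token.finditer(text):
        if m.group(1):            # closing tag
            if depth == 0:
                continue          # stray closing tag before any opening tag
            depth -= 1
            if depth == 0:        # back to zero: this closes the first <p>
                return text[start:m.start()]
        else:                     # opening tag
            if depth == 0:
                start = m.end()   # content begins right after the first <p>
            depth += 1
    return None                   # never got back to zero

print(paragraph_content("a<p>one<p>two</p>three</p>four"))
```

It returns None when the first tag is never closed, which is about as much error handling as a throwaway script deserves.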
Quick and dirty for the escaping issue I mentioned would probably be to count the number of contiguous backslashes preceding the "]" character. If it is odd, the bracket is escaped.
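A minimal sketch of that backslash-counting idea, assuming the data is already a decoded string. The names `is_escaped` and `find_value_end` are made up for illustration:

```python
def is_escaped(data: str, pos: int) -> bool:
    """True if data[pos] is preceded by an odd number of contiguous backslashes."""
    count = 0
    i = pos - 1
    while i >= 0 and data[i] == "\\":
        count += 1
        i -= 1
    return count % 2 == 1

def find_value_end(data: str, start: int = 0) -> int:
    """Index of the first unescaped ']' at or after start, or -1 if there is none."""
    pos = data.find("]", start)
    while pos != -1 and is_escaped(data, pos):
        pos = data.find("]", pos + 1)
    return pos

print(find_value_end(r"a\]b]c"))
```

In the example the first bracket has one backslash in front of it, so it is skipped and the second bracket is reported instead.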
The regular expression I use in my code for this is: (?:\\.|[^\\\]])+\]

In this comment parsing it's (somewhat) easier: each \ escapes the next character (UTF8 may hurt here, but anyway) so the idea would be to pick (in "regular expression syntax") \\(.) and write instead &1, i.e. whatever is after the slash when parsing the comment. Once there are no slashes left, any closing bracket closes the comment. Now should come the plan to implement this (and worse, in an efficient way if it has to run with many files).
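To make the two ideas above concrete, here is a rough Python sketch that uses the regular expression to find the end of a value and then the \\(.) substitution to drop the escapes. The function name `read_value` and the use of re.DOTALL (so that an escaped line break also counts as "any character") are my assumptions, and any special treatment of escaped line breaks is ignored:

```python
import re
from typing import Optional, Tuple

# The pattern from the post: escaped pairs or ordinary characters, then "]".
VALUE_RE = re.compile(r"(?:\\.|[^\\\]])+\]", re.DOTALL)

def read_value(data: str, start: int) -> Optional[Tuple[str, int]]:
    """Read one value starting just after its opening '['.

    Returns (unescaped text, index just past the closing ']'), or None
    if the pattern does not match at that position.
    """
    m = VALUE_RE.match(data, start)
    if m is None:
        return None
    raw = m.group(0)[:-1]                                 # drop the closing "]"
    text = re.sub(r"\\(.)", r"\1", raw, flags=re.DOTALL)  # "\x" becomes "x"
    return text, m.end()

sample = r"C[he said \[hi\] ok]"
print(read_value(sample, 2))
```

Note that, as written, the pattern needs at least one character before the closing bracket, so an empty value would have to be handled separately in this sketch.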
UTF8, or more to the point, character encodings in general, can be a huge pita. SGF allows you to define the character encoding with the CA tag. My parser makes the assumption (I know, I know) that property values cannot contain a "]" equivalent byte as part of some funky unicode codepoint, which is true for UTF8 and common latin/windows encodings (but might be false for e.g. UTF-16 or BIG5).
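If anyone wants to check that assumption for a given encoding, something like the following would do. The helper names are hypothetical; the check only asks whether the byte 0x5D (the ASCII "]") shows up inside the encoded form of some other character:

```python
def bracket_byte_inside(ch: str, encoding: str) -> bool:
    """True if encoding ch (which is not ']' itself) produces a 0x5D byte."""
    return ch != "]" and b"]" in ch.encode(encoding)

def big5_candidates() -> list:
    """Collect characters whose two-byte Big5 form ends in 0x5D.

    Lead bytes that do not decode with Python's "big5" codec are skipped.
    """
    found = []
    for lead in range(0xA1, 0xFA):
        try:
            found.append(bytes([lead, 0x5D]).decode("big5"))
        except UnicodeDecodeError:
            pass
    return found

# U+015D encodes as the bytes 5D 01 in UTF-16-LE but as C5 9D in UTF-8.
print(bracket_byte_inside("\u015d", "utf-16-le"))
print(bracket_byte_inside("\u015d", "utf-8"))
print(len(big5_candidates()))
```

A byte-level scan for "]" is therefore safe on UTF-8 input, but for encodings like these you would want to decode to text first and scan the characters.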
But anyway, at Loons' level we're just talking gibberish now