Working with SGFs (as a rote beginner programmer)

HermanHiddema · Post by **HermanHiddema** » Wed Dec 04, 2013 9:31 am

RBerenguel wrote:A similar kind of problem I think appears in the life of every programmer: semi-parsing stuff worrying about closing tags. So, detecting everything inside a <p> until the "equivalent" instance of </p> appears for example. The first time I coded something for this (it was a funky C parser I wrote partially in C and partially in Lex and I used it to generate "visual code patterns" I could visually scan to detect code cheating in an assignment) I just used addition: every time you see a <p> after the first one, add 1. Every time you see </p> substract 1. Once you get to 0 you are done. There are better ways to do it if you are interested in the real structure of the document, but if you just want to know the content this is as easy as it gets.

Yes, in temporary code this kind of quick and dirty solution is good enough. The "proper" alternative in your example would entail a full-fledged HTML parser.

Quick and dirty for the escaping issue I mentioned would probably be to count the number of contiguous backslashes preceding the "]" character. If it is odd, the bracket is escaped.

In this comment parsing it's (somewhat) easier: each \ escapes the next character (UTF8 may hurt here, but anyway) so the idea would be to pick (in "regular expression syntax") \\(.) and write instead &1, i.e. whatever is after the slash when parsing the comment. Once there are no slashes left, any closing bracket closes the comment. Now should come the plan to implement this (and worse, in an efficient way if it has to run with many files.)

The regular expression I use in my code for this is: (?:\\.|[^\\\]])+\]

UTF8, or more to the point, character encodings in general, can be a huge pita. SGF allows you to define the character encoding with the CA tag. My parser makes the assumption (I know, I know) that property values cannot contain a "]" equivalent byte as part of some funky unicode codepoint, which is true for UTF8 and common latin/windows encodings (but might be false for e.g. UTF-16 or BIG5).

But anyway, at Loons' level we're just talking gibberish now

RBerenguel · Post by **RBerenguel** » Wed Dec 04, 2013 10:28 am

HermanHiddema wrote: (?:\\.|[^\\\]])+\]

But anyway, at Loons' level we're just talking gibberish now

It's the best way to learn: what's this gibberish they are talking about? Let's see... (never, ever try to read the UTF8 or any other encoding reference, unless forced to do so.)

Also, I don't get the (? part: what's the conditional about? The parenthesis? Lately my regex abilities (which have never been that much) have been severely crippled by writing more or less concurrently emacs lisp style regex, Go regexes and PHP regexes (the three have different rules for some sets of extended operators!) so anything ever so slightly fancy doesn't ring any bell any more...

HermanHiddema · Post by **HermanHiddema** » Wed Dec 04, 2013 11:08 am

RBerenguel wrote:
HermanHiddema wrote: (?:\\.|[^\\\]])+\]

But anyway, at Loons' level we're just talking gibberish now
It's the best way to learn: what's this gibberish they are talking about? Let's see... (never, ever try to read the UTF8 or any other encoding reference, unless forced to do so.)

Also, I don't get the (? part: what's the conditional about? The parenthesis? Lately my regex abilities (which have never been that much) have been severely crippled by writing more or less concurrently emacs lisp style regex, Go regexes and PHP regexes (the three have different rules for some sets of extended operators!) so anything ever so slightly fancy doesn't ring any bell any more...

(?: starts a non-capturing group. You can leave out the ?: and it'll work just as well, except afterwards you would be able to reference the part between the parenthesis separately. Since I only need the full match, I used a non-capturing group.

So the expression is:

(...)+\] Match at least one occurrence, and as many as possible, of the sub-expression between the parentheses, followed by a closing bracket.

And the sub-expression has two options:

[^\\\]] any single character that is not backslash or closing bracket
\\. any pair of characters where the first is a backslash

SmoothOper · Post by **SmoothOper** » Wed Dec 04, 2013 11:23 am

Planning is good, and so is research. I would find what other sgf parsers are out there, and also what kind of data structures people have divined for storing the moves, and see if any of those help you out conceptually, if not with actual code. I also suspect that C# may not be the best language for this particular application, since I haven't seen many Go programs in C#, which is more commercially oriented, so you may not have much help from the open source community, though you should be able to read the code without much problem. I have seen python(Kombillo) and C/C++ (Gnu go, Kombillo query engine) and I know there are java programs out there like KGS.

moyoaji · Post by **moyoaji** » Wed Dec 04, 2013 11:37 am

RBerenguel wrote:I don't really get this comment. Object oriented programming is just a paradigm that makes some things easier, usually for some largish-scale codebases, not for all. People have been programming without OO for a lot of time, and some languages don't even have the concept of object and work perfectly fine for many projects and many lines of code.

If I were to write this in a language without OO (which are the languages I usually use, let's say C to pick the barest thing) I'd write functions which act on a struct which will be the board (what OO would probably call methods) and work from there.

I know you can do this without OO. I even said that later in my post. However, creating a collection of patterns for the program to search could be more tedious without using objects. The problem is not working with the board, the problem is adding new patterns to your collection that can be searched.

It would take more time to write a function that takes a pattern object and figures out how to use that pattern than it would to just tell the program how to scan for one particular pattern. This is true. But once the function is written you can add new patterns in a fraction of the time it would take. Without having the luxury of creating a pattern object that conforms to an interface that can be understood by your function you'll need to program how to search for each pattern.

I learned to code in Java, which tries to force you to do things with object orientation. When I work in C I do sometimes miss the ability to create objects to work with. A struct can do some of what I would want an object to do in this case, but it would be easier (in my mind) to use objects. I'm not saying I couldn't figure out a way to define patterns with a struct, but I can't see how it would be faster than setting up a pattern interface and then making objects that conform to it.

There is nothing wrong with procedural programming, but for a project like this I think an object oriented approach would be easier. There are times when being object oriented just slows you down, but, depending on how many patterns, and different types of patterns, are going to be searched, I think OO would be nicer.

Loons · Post by **Loons** » Wed Dec 04, 2013 2:06 pm

Thanks so much for the comments everyone, once I have done more significant work (fun) I will report back!

I did a really cursory mission statement, paper prototype and pseudocode for the program, though it is definitely too incomplete - I know that I am going to ad lib some as I go along.

Structs are just value-type, cut-down classes (in C#) right? So I think I might already be being a little object oriented. I may have a look online and through my textbook and start using classes proper.

(Now to begin thinking about escape characters in comments...!)

Mike Novack · Post by **Mike Novack** » Wed Dec 04, 2013 3:59 pm

OK, just a warning. I may be using language about programming a little differently than many of you. Some examples .....

I wouldn't think of "object oriented" and "procedural" as opposed. To me, the opposite of "procedural" is "functional" (do you define a process or define a function; does you code do something or evaluate something). In any case, many of our "object oriented" languages aren't (objects do not exist at run time).

And when I say abstract I mean much more abstract. Thus here we have a parsing problem. OK, what is the FSM (finite state machine) that solves the problem. This is as much a matter of finding the right place to divide a problem up, to be sure of correctness both in design and that you have tested the coding of that design. In the final program it might no longer be obvious "this is a finite state machine", especially if there are just a couple states and you don't use obvious names. Or let it remain obvious (an example for the junior programmers).

Language is pretty much irrelevant and in the real world you will be restricted to the couple used in your shop. It is very rare that the slightly greater ease using one language over another for a particular problem is worth the trouble (when they call the standby programmer in the middle of the night because program X hung, how many languages does this person need to know?).

But if you want to see extreme examples of "more suited" don't think about language features so much as "what are the native variables". In other words,is fundamental to this language numbers, lists, or strings. If you want an example of this try the following problem in a number of languages:

The program is to accept input of a string, determine whether a palindrome according to the usual rules for TEXT (not mathematical) palindromes, report the result and be willing to accept another string. Notice the problem did not define things like "strings of length less that M".

Try that with any of your favorite languages and then try it using just the bash shell and standard 'nix library. We don't have any languages devoted to string processing (like SNOBOL) around anymore but bash + the library will do very well as a string processing language.

You might end up with fewer characters in that script than lines of code in most other languages. Been a while (and I lost it in a house fire) but I think my solution was maybe a five line script of somewhat over 100 characters.

The point here is that a language like C++ will come provided with objects to support abstract structures like lists (or queues, o ...) but if you were using a language like LISP you don't need anything special for lists (the fundamental data type of the language).

oren · Post by **oren** » Wed Dec 04, 2013 4:02 pm

Mike Novack wrote:OK, just a warning. I may be using language about programming a little differently than many of you. Some examples .....

I wouldn't think of "object oriented" and "procedural" as opposed. To me, the opposite of "procedural" is "functional" (do you define a process or define a function; does you code do something or evaluate something).

I think this is quite a bit different than what a majority of software engineers think.

RBerenguel · Post by **RBerenguel** » Wed Dec 04, 2013 4:35 pm

oren wrote:
Mike Novack wrote:OK, just a warning. I may be using language about programming a little differently than many of you. Some examples .....

I wouldn't think of "object oriented" and "procedural" as opposed. To me, the opposite of "procedural" is "functional" (do you define a process or define a function; does you code do something or evaluate something).
I think this is quite a bit different than what a majority of software engineers think.

I think it's a matter of tastes. I don't think there are opposites. Lisp has an object system, you can certainly extend Forth to do so, and you can write almost object-like code in plain C using structs and function pointers. There is nothing special about object oriented programming. Of course there is a clear distinction between functional and non-functional languages, just like stack based are different from variable based, or between Prolog and everything else. Each one has its strong points and its weak points, and deep inside all are equivalent. Choose the best, but don't settle on a hammer thinking all to cabinetmaking is driving nails. You also need planes, screws and rasps. Learn as much as possible, as different fields as possible.

vier · Post by **vier** » Wed Dec 04, 2013 7:28 pm

HermanHiddema wrote:According to the spec (http://www.red-bean.com/sgf/sgf4.html#text):
Any char following "\" is inserted verbatim

It is not clear what a char is in this context. Many East Asiatic character sets have multibyte characters that properly contain \ or ].
For example, "Honinbo" written in Big5 contains a ']'.

HermanHiddema · Post by **HermanHiddema** » Thu Dec 05, 2013 4:38 am

vier wrote:
HermanHiddema wrote:According to the spec (http://www.red-bean.com/sgf/sgf4.html#text):
Any char following "\" is inserted verbatim
It is not clear what a char is in this context. Many East Asiatic character sets have multibyte characters that properly contain \ or ].
For example, "Honinbo" written in Big5 contains a ']'.

SGF has the CA tag, which specifies the character set, and which therefore theoretically should make this clear. But the SGF standard specifies that the charset used should be in RFC 1345, which is over 20 years old and does not mention UTF-8, UTF-16, UTF-32, Big 5 and many other character sets.

And of course, since the CA tag is inside the SGF, you don't know the encoding until after you've started reading the SGF, so to do it correctly, you should switch to the specified encoding specifically when you start reading the content of a text property, you can't just read the whole file with a specific encoding.

Text parsing and processing outside simple ASCII is a pain in the ass.

Mike posted about writing a palindrome checker in five lines of code, but I bet that assumed simple ASCII input

@Loons: Just assume it is safe to read the file via the standard library file reading methods, there are far more interesting things to write code for than weird edge cases in character encodings

RBerenguel · Post by **RBerenguel** » Thu Dec 05, 2013 5:11 am

Writing short code for the sake of it is... well, I won't say stupid, because it is not, but usually it's just a fun thing you should do on the side, when bored. After all, if terseness is the only thing you want, learn to program in J or APL. The J function to check if a (lower-cased or upp-ercased) string is a palyndrome is this beauty of 5 characters:

Code: Select all

-: |.

I don't know enough J to say much about this aside from knowing J is very hard and here we probably have a fork and an adverb in place.

And that's like:

Code: Select all

palin =: -: |. NB Here we define the function with =: and NB is the J way of putting a comment
palin 'abba'
1
palin 'abc'
0
palin 'abcba'
1

Mike Novack · Post by **Mike Novack** » Thu Dec 05, 2013 7:54 am

RBerenguel wrote: Each one has its strong points and its weak points, and deep inside all are equivalent.

If you are a computer science major (as opposed to "data processing") you will unboubtably prove this (or at least learn to follow the proof).

A "Turing Machine" (procedural) and a "Wang Machine" (stack) are equivalent.

For a naive "it's obvious", picture how a Turing machine is defined with its state machine moving back and forth on an infinite "tape" and a state machine placed between two stacks (a "Wang"). If one can compute anything that is computable so can the other.

IMHO the greatest usefulness of "object oriented" is to shortcut learning time. But there are arguments both ways. For example, if you learned C from the original Ritchie text then you would know how to construct your own version of all the standard 'nix library things (because exercises of the text). If we had had been using an objct oreinted language in the shop where I worked all that meant would be that one or two of us "top hands" would be deifning the objects to be used by the programmers.

It shouldn't take very long before an experienced programmer (having started from scratch) had in a personal library all those useful bits and pieces or at least knew from which preciously written program they could be grabbed as necessary. When I say I wrote about 300,000 lines of code in my day I don't mean 300,000 lines of new code. After the first ten years probably 80% of the average new program written would be from old programs, slighttly modified.

And you really can write anything in anything. To replace a great deal of hard coding in many programs (check if this policy is an exception case) I wrote fast search routine (that when first called read in data from a file users could add exception cases to) that was a "hash table of lists" and this was a COBOL* shop! << specially written with instructions in the comments so that the programmers could do any necessary future maintenance without understanding what a hash table was, what lists were, etc. >>

* OK, assembler was allowed too, but there were even fewer of us around who maintained those. We tried really hard to avoid adding any new assembler programs.

RBerenguel · Post by **RBerenguel** » Thu Dec 05, 2013 9:01 am

Mike Novack wrote:
RBerenguel wrote: Each one has its strong points and its weak points, and deep inside all are equivalent.
If you are a computer science major (as opposed to "data processing") you will unboubtably prove this (or at least learn to follow the proof).

I'm no computer science major (I'm a mathematician,) this is just a fact I know. If I'd need to show a line of proof from stuff I know, I know S-expressions as defined by McCarthy in his seminal Lisp paper are Turing complete, hence any language where his eval/apply (and the 6-7 basic primitives) can be programmed is also Turing complete. Since writing this in Forth (stack-based language) is relatively straightforward and Forth is almost as bare a high level stack based language gets (save Brainf*ck,) stack languages are also Turing complete.

What is a "data processing"?

vier · Post by **vier** » Thu Dec 05, 2013 9:34 am

HermanHiddema wrote:
vier wrote:
HermanHiddema wrote:According to the spec (http://www.red-bean.com/sgf/sgf4.html#text):
Any char following "\" is inserted verbatim
It is not clear what a char is in this context. Many East Asiatic character sets have multibyte characters that properly contain \ or ].
For example, "Honinbo" written in Big5 contains a ']'.
SGF has the CA tag, which specifies the character set, and which therefore theoretically should make this clear.

Perhaps. There are two interpretations of the FF4 standard, and both occur in real life. One uses char=byte. That makes life easy for a parser, and the CA property is not needed to parse the game, only to display the text fields. After inserting escapes, the resulting SGF file need no longer be a text file, and ordinary text editors may not be able to handle it. The other uses char=(possibly multibyte) character. Now the SGF file remains a plain text file, easy to handle, but the CA property is needed to be able to parse the game. A parser must know about all possible text encodings.
Since both interpretations occur, FF4 is not a well-defined format.

Life In 19x19

Working with SGFs (as a rote beginner programmer)

Re: Working with SGFs (as a rote beginner programmer)

Re: Working with SGFs (as a rote beginner programmer)

Re: Working with SGFs (as a rote beginner programmer)

Re: Working with SGFs (as a rote beginner programmer)

Re: Working with SGFs (as a rote beginner programmer)

Re: Working with SGFs (as a rote beginner programmer)

Re: Working with SGFs (as a rote beginner programmer)

Re: Working with SGFs (as a rote beginner programmer)

Re: Working with SGFs (as a rote beginner programmer)

Re: Working with SGFs (as a rote beginner programmer)

Re: Working with SGFs (as a rote beginner programmer)

Re: Working with SGFs (as a rote beginner programmer)

Re: Working with SGFs (as a rote beginner programmer)

Re: Working with SGFs (as a rote beginner programmer)

Re: Working with SGFs (as a rote beginner programmer)