Life In 19x19
http://www.lifein19x19.com/

Scraping library for KGS Game Archives written in Perl
http://www.lifein19x19.com/viewtopic.php?f=24&t=8651

Author:  anazawa [ Sat Jun 21, 2014 6:03 am ]
Post subject:  Scraping library for KGS Game Archives written in Perl

Hi all,

I've uploaded WWW::KGS::GameArchives to CPAN.
This module lets you send a query to the archives and parses the result into a Perl data structure.

Although the module itself is not harmful, abusing it can harm the archives server.
See also http://www.gokgs.com/robots.txt.

If this module causes you any inconvenience, please let me know and I'll improve it.
I don't intend to violate KGS's policy.

By the way, I think gameArchives.jsp should send the Last-Modified response header.
It would be useful for caching.
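
To illustrate why the header would help: if gameArchives.jsp sent Last-Modified, a client could store that value and send it back as If-Modified-Since, so an unchanged page would not need to be downloaded again. A minimal sketch, assuming HTTP::Request is installed. The query string and the cached header value are placeholders, and the request is only built here, not sent.

```perl
use strict;
use warnings;
use HTTP::Request;

# Build a conditional GET that asks the server to skip the body
# when nothing has changed since $last_modified.
sub conditional_get {
    my ($url, $last_modified) = @_;
    my $request = HTTP::Request->new(GET => $url);
    $request->header('If-Modified-Since' => $last_modified)
        if defined $last_modified;
    return $request;
}

# Hypothetical values: 'user=foo' and the date are placeholders,
# standing in for a Last-Modified value saved from an earlier response.
my $url           = 'http://www.gokgs.com/gameArchives.jsp?user=foo';
my $last_modified = 'Sat, 21 Jun 2014 06:03:00 GMT';

my $request = conditional_get($url, $last_modified);
print $request->as_string;
```

A server that supports conditional requests would answer 304 Not Modified when the page is unchanged, and the client would then reuse its cached copy.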

Enjoy!

Ryo

-----

EDIT:

WWW::KGS::GameArchives was renamed to WWW::GoKGS::Scraper::GameArchives.
The WWW::GoKGS distribution also provides scrapers which can scrape KGS Tournament pages.
The scrapers are tested by Travis CI once a day.

EDIT2:

As of 0.12, WWW::GoKGS#user_agent defaults to LWP::RobotUA, which consults /robots.txt
before sending HTTP requests and also sets a proper delay between requests.
To use this module, users must provide an email address, which is used
to generate the From request header. They can still set their own
user agent whenever they want.
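
For reference, this is roughly what a stand-alone LWP::RobotUA looks like. It is a sketch of the LWP side only, not of how WWW::GoKGS builds its default internally. The agent name and email address are placeholders.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Placeholder identity: substitute your own agent name and email address.
# LWP::RobotUA requires both 'agent' and 'from'.
my $ua = LWP::RobotUA->new(
    agent => 'my-kgs-client/0.01',
    from  => 'you@example.com',
);

# delay() is expressed in minutes; 1 minute is LWP::RobotUA's default.
$ua->delay(1);

# Sleep between requests (the default) rather than returning an
# internally generated 503 response when a request comes too soon.
$ua->use_sleep(1);

printf "agent=%s from=%s delay=%s min\n", $ua->agent, $ua->from, $ua->delay;
```

An object like this is the kind of thing that can be assigned to the #user_agent attribute when the default needs different settings.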

NOTE: LWP::RobotUA fails to read /robots.txt on KGS because the web server
doesn't return the Content-Type response header. This module cannot
work around this problem.

Author:  skydyr [ Sun Jun 22, 2014 2:29 pm ]
Post subject:  Re: Scraping library for KGS Game Archives written in Perl

anazawa wrote:
I've uploaded WWW::KGS::GameArchives to CPAN.
This module lets you send a query to the archives and parses the result into a Perl data structure.

Although the module itself is not harmful, abusing it can harm the archives server.
See also http://www.gokgs.com/robots.txt.

If this module causes you any inconvenience, please let me know and I'll improve it.
I don't intend to violate KGS's policy.


I haven't looked at the code, but you may be able to build in a delay between requests so that the server doesn't get flooded accidentally because of poor coding practices on the part of module users.

Author:  anazawa [ Sun Jun 22, 2014 4:04 pm ]
Post subject:  Re: Scraping library for KGS Game Archives written in Perl

@skydyr

Thanks for your suggestion. I agree with you.
Fortunately, I can easily build in a delay between requests
by replacing the web user agent used by my module.
(Concretely, I'll replace LWP::UserAgent with LWP::RobotUA
which consults /robots.txt before sending HTTP requests).

On the other hand, even if the default user agent is replaced,
I think module users should still be able to set their own user agent as before.
(They need to set the #user_agent attribute *explicitly* if they want to.)

EDIT:

Since "GET /robots.txt" doesn't return the Content-Type header (!),
LWP::RobotUA fails to read the resource, so the user agent
doesn't treat /gameArchives.jsp as "Disallowed". I don't know how to solve
this problem :sad:

Try:

$ curl -I http://www.gokgs.com/robots.txt
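
One possible direction, as a sketch only: WWW::RobotRules (the rules parser that LWP::RobotUA uses) accepts the robots.txt text directly through parse(), so the Content-Type header plays no part at that level. The robots.txt body below is an illustrative stand-in rather than a copy of the real file, and the agent name and query string are placeholders.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Illustrative stand-in for the body of /robots.txt (not the real file).
my $robots_txt = <<'END';
User-agent: *
Disallow: /gameArchives.jsp
END

my $robots_url = 'http://www.gokgs.com/robots.txt';

# 'my-kgs-client/0.01' is a placeholder agent name.
my $rules = WWW::RobotRules->new('my-kgs-client/0.01');

# parse() works on the text itself, so a missing Content-Type
# header does not matter here.
$rules->parse($robots_url, $robots_txt);

for my $url (
    'http://www.gokgs.com/gameArchives.jsp?user=foo',
    'http://www.gokgs.com/',
) {
    # allowed() returns 0 when the rules forbid the URL.
    my $verdict = $rules->allowed($url) == 0 ? 'disallowed' : 'allowed';
    printf "%-48s %s\n", $url, $verdict;
}
```

LWP::RobotUA's constructor also takes a rules option, so in principle a pre-parsed object like this could be handed to it. Whether that fits this module's design is a separate question.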

All times are UTC - 8 hours [ DST ]