
All times are UTC - 8 hours [ DST ]




 Post subject: Scraping library for KGS Game Archives written in Perl
Post #1 Posted: Sat Jun 21, 2014 6:03 am 
Hi all,

I've uploaded WWW::KGS::GameArchives to CPAN.
This module lets you send a query to the archives and parses the result into a Perl data structure.

Although the module itself is not harmful, abusing it can harm the archives server.
See also http://www.gokgs.com/robots.txt.

If this module causes you any inconvenience, please let me know and I'll improve it.
I don't intend to violate KGS's policy.

By the way, I think gameArchives.jsp should send the Last-Modified response header.
It would be useful for caching.

Enjoy!

Ryo

-----

EDIT:

WWW::KGS::GameArchives was renamed to WWW::GoKGS::Scraper::GameArchives.
The WWW::GoKGS distribution also provides scrapers for the KGS Tournament pages.
The scrapers are tested by Travis CI once a day.

EDIT2:

As of 0.12, WWW::GoKGS#user_agent defaults to LWP::RobotUA, which consults /robots.txt
before sending HTTP requests and also sets a proper delay between requests.
To use this module, users must provide an email address, which is used
to generate the From request header. They can still set their own
user agent whenever they want.

NOTE: LWP::RobotUA fails to read /robots.txt on KGS because the web server
doesn't return the Content-Type response header. This module cannot
solve this problem.
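
For readers who have not used LWP::RobotUA, here is a minimal sketch of configuring one directly, outside of WWW::GoKGS. The agent name, the email address, and the one-minute delay are placeholder assumptions, not values taken from the module.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Placeholder identity: replace with your own robot name and email address.
my $ua = LWP::RobotUA->new(
    agent => 'my-kgs-robot/0.1',
    from  => 'you@example.com',   # sent as the From request header
);

# Minimum wait between requests to the same server, in minutes.
$ua->delay(1);

# Sleep until the delay has passed instead of returning a 503 response.
$ua->use_sleep(1);

printf "agent=%s from=%s delay=%s min\n", $ua->agent, $ua->from, $ua->delay;
```

A later call such as $ua->get(...) would first request /robots.txt from the target host and then wait as needed between requests to that host.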


Last edited by anazawa on Mon Jun 23, 2014 5:45 am, edited 1 time in total.

 Post subject: Re: Scraping library for KGS Game Archives written in Perl
Post #2 Posted: Sun Jun 22, 2014 2:29 pm 
Universal go server handle: skydyr
anazawa wrote:
I've uploaded WWW::KGS::GameArchives to CPAN.
This module lets you send a query to the archives and parses the result into a Perl data structure.

Although the module itself is not harmful, abusing it can harm the archives server.
See also http://www.gokgs.com/robots.txt.

If this module causes you any inconvenience, please let me know and I'll improve it.
I don't intend to violate KGS's policy.


I haven't looked at the code, but you may be able to build in a delay between requests so that module users don't accidentally flood the server through poor coding practices.

 Post subject: Re: Scraping library for KGS Game Archives written in Perl
Post #3 Posted: Sun Jun 22, 2014 4:04 pm 
@skydyr

Thanks for your suggestion. I agree with you.
Fortunately, I can easily build in a delay between requests
by replacing the web user agent used by my module.
(Concretely, I'll replace LWP::UserAgent with LWP::RobotUA,
which consults /robots.txt before sending HTTP requests.)

On the other hand, even if the default user agent is replaced,
I think module users should still be able to set their own user agent as before.
(They need to set the #user_agent attribute *explicitly* if they want to.)

EDIT:

Since the response to "GET /robots.txt" doesn't include the Content-Type header (!),
LWP::RobotUA fails to read the resource, and so the user agent
doesn't treat /gameArchives.jsp as "Disallowed". I don't know how to solve
this problem :(

Try:

$ curl -I http://www.gokgs.com/robots.txt
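
One possible workaround, offered only as a sketch and not as something the module does: fetch the robots.txt body yourself and hand it to WWW::RobotRules, which parses the text regardless of the response headers. The robots.txt content and the robot name below are hypothetical stand-ins; the real file on KGS may differ.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Hypothetical robots.txt body, standing in for text fetched by other means.
my $robots_txt = <<'END';
User-agent: *
Disallow: /gameArchives.jsp
END

# The robot name is a placeholder.
my $rules = WWW::RobotRules->new('my-kgs-robot/0.1');

# Register the rules for the host named in the robots.txt URL.
$rules->parse('http://www.gokgs.com/robots.txt', $robots_txt);

for my $path ('/gameArchives.jsp?user=foo', '/') {
    my $url = "http://www.gokgs.com$path";
    printf "%-28s %s\n", $path,
        $rules->allowed($url) ? 'allowed' : 'disallowed';
}
```

The populated $rules object could then be passed to LWP::RobotUA->new through its rules option, so that the user agent consults these rules instead of an empty set.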
