Life In 19x19 http://www.lifein19x19.com/ |
Cleaning Sensei's Library Webpages for Offline Storage http://www.lifein19x19.com/viewtopic.php?f=18&t=389 |
Page 1 of 1 |
Author: | RobertJasiek [ Sat May 01, 2010 11:45 pm ] |
Post subject: | Cleaning Sensei's Library Webpages for Offline Storage |
If you want to clean potentially dangerous (JavaScript, forms) or superfluous (header, footer, left pane, TOC) source code from Sensei's Library webpages for your offline storage, do the following before viewing the webpages offline in your HTML viewer or browser:

1. Delete all JavaScript *.js files. On Windows, put all the files and their subdirectories in a directory, open the command line, go to that directory and type:

   del /S *.js

   The parameter /S deletes in all the subdirectories as well.

2. Edit the source code by means of (regular) expressions. Use a program that allows batch processing of files and lists of (regular) expressions. As of 2010-05-02, set these expressions, where you will have to use your program's suitable syntax instead of the placeholders FROM, TO, REPLACEBY.

   Deleted text:

   FROM <!-- TO -->
   FROM <script TO </script>
   FROM <div id="pageheaders"> TO </div>
   FROM <table id="toc" TO </table>
   FROM <form TO </form>
   FROM <div class="editsection"> TO </div>
   FROM <div class='editsection'> TO </div>

   Replaced text:

   FROM <div id="pgfooter"> TO </body> REPLACEBY </body>
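The FROM ... TO pairs above behave like non-greedy "delete from the start marker to the first occurrence of the end marker" rules. As a rough sketch only (not Robert's actual tool, which uses a batch FROM-TO editor), the same list could be expressed in Python with the standard re module:

```python
import re

# Patterns mirroring the FROM ... TO pairs from the post. Non-greedy
# matches with DOTALL make each span end at the FIRST occurrence of the
# TO marker, exactly like the FROM-TO semantics described above (so
# nested divs are NOT handled; see the parser-based replies below).
DELETE_PATTERNS = [
    r"<!--.*?-->",
    r"<script.*?</script>",
    r'<div id="pageheaders">.*?</div>',
    r'<table id="toc".*?</table>',
    r"<form.*?</form>",
    r'<div class="editsection">.*?</div>',
    r"<div class='editsection'>.*?</div>",
]

def clean_html(source: str) -> str:
    """Apply the deletion list, then the pgfooter REPLACEBY rule."""
    for pattern in DELETE_PATTERNS:
        source = re.sub(pattern, "", source, flags=re.DOTALL | re.IGNORECASE)
    # REPLACEBY rule: cut from the footer div through </body>, then
    # restore the closing </body> tag.
    return re.sub(r'<div id="pgfooter">.*?</body>', "</body>",
                  source, flags=re.DOTALL | re.IGNORECASE)
```

This is a line-of-least-effort sketch with the same limitations as any FROM-TO editor: it cuts to the first closing marker, so a `</div>` nested inside the page-header div would end the match early.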
Author: | Phelan [ Sun May 02, 2010 4:23 am ] |
Post subject: | Re: Cleaning Sensei's Library Webpages for Offline Storage |
I know you're not a fan of Java, if I remember correctly, but have you tried http://senseis.xmp.net/?SenseisLibraryOnTour or http://senseis.xmp.net/?SLSnapshot ? I don't know how different these are from your method. |
Author: | kirkmc [ Sun May 02, 2010 5:02 am ] |
Post subject: | Re: Cleaning Sensei's Library Webpages for Offline Storage |
Maybe this is not the ideal place to post this; it should be in Off Topic or something. Robert, you _can_ post in forums other than the Rules forum.
Author: | Harleqin [ Sun May 02, 2010 6:31 am ] |
Post subject: | Re: Cleaning Sensei's Library Webpages for Offline Storage |
You should not try to parse HTML with regular expressions, because HTML is not a regular language (please note the very specific meaning of "regular" here). Every popular language has a proper HTML parsing library. Browsers usually support the disabling of JavaScript anyway. For Firefox, the NoScript addon gives you the ability to disable/enable it for selected sites. |
Author: | RobertJasiek [ Sun May 02, 2010 6:49 am ] |
Post subject: | Re: Cleaning Sensei's Library Webpages for Offline Storage |
Snapshots are not suitable for me: I do not want a copy of the entire SL, only the pages that interest me.

This topic is hard to place in the right forum; I find Go Rules the most fitting because it is about getting the expressions, a.k.a. rules, right.

Actually, I do not use classical regular expressions for this purpose, but others might, because it is much easier to find an RE editor than a FROM-TO expressions editor.

Disabling JavaScript does not prevent it from being stored locally. NoScript does, but not everybody (myself included) uses NoScript. For those who use it, it may solve the JavaScript problem, but it does not treat the other undesired parts of a webpage. Since I do not use NoScript, editing the source code with expressions is the most fitting approach for me. Presumably not for everybody, but everybody has to know his preferred way anyway.

I think my expressions list is not yet complete for SL. Does somebody have a more complete one?
Author: | Harleqin [ Sun May 02, 2010 7:37 am ] |
Post subject: | Re: Cleaning Sensei's Library Webpages for Offline Storage |
I would not use a blacklist of things I do not want from a page, but a whitelist of the things I do want. In other words, parse the HTML into a tree data structure (this is what an HTML parser does), then select the nodes of interest. |
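The whitelist idea can be sketched with Python's standard-library html.parser, which tokenizes the markup properly instead of pattern-matching it. The target id "content" below is a placeholder assumption, not the actual Sensei's Library markup; inspect a saved page to find the real main-content container:

```python
from html.parser import HTMLParser

class WhitelistExtractor(HTMLParser):
    """Keep only the HTML inside one whitelisted element.

    target_id is hypothetical ("content"); adjust it to the id of the
    main-content element in the pages you actually saved.
    """
    def __init__(self, target_id="content"):
        super().__init__()
        self.target_id = target_id
        self.depth = 0          # nesting depth inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
            self.chunks.append(self.get_starttag_text())
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1      # entered the whitelisted subtree

    def handle_startendtag(self, tag, attrs):
        if self.depth:          # self-closing tags like <br/> need no close
            self.chunks.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth:      # do not emit the target element's own close
                self.chunks.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def extract_content(html: str, target_id: str = "content") -> str:
    parser = WhitelistExtractor(target_id)
    parser.feed(html)
    return "".join(parser.chunks)
```

Because the parser tracks nesting depth, nested divs inside the whitelisted element are kept correctly, which is exactly where first-match FROM-TO deletions go wrong.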
Author: | RobertJasiek [ Sun May 02, 2010 9:42 am ] |
Post subject: | Re: Cleaning Sensei's Library Webpages for Offline Storage |
Which short whitelist would work for all webpages? I do not know. Therefore I use a substitute for whitelisting: checking in a plain text editor whether the edited HTML source code still contains dubious tags.
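That manual check can be partly automated with a quick census of the tag names remaining in a cleaned file, so leftovers such as script or form stand out at a glance. A minimal sketch:

```python
import re
from collections import Counter

def tag_census(html: str) -> Counter:
    """Count opening tag names in an HTML string.

    Matches '<' followed by a tag name; closing tags (</p>) are skipped
    because '/' is not a letter. Crude, but enough to spot dubious
    leftovers like 'script' or 'form' after cleaning.
    """
    return Counter(m.lower()
                   for m in re.findall(r"<\s*([a-zA-Z][\w-]*)", html))
```

Printing `tag_census(open("page.html").read()).most_common()` gives a one-screen summary per file instead of a full read-through.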
Author: | willemien [ Fri May 14, 2010 4:44 pm ] |
Post subject: | Re: Cleaning Sensei's Library Webpages for Offline Storage |
Maybe the easiest is to start with the SL snapshot and copy from there everything you want to keep...
All times are UTC - 8 hours [ DST ]
Powered by phpBB © 2000, 2002, 2005, 2007 phpBB Group http://www.phpbb.com/ |