Webbots, Spiders, and Screen Scrapers: book review
written by Craig, 14 May 2012
I was initially surprised to discover this title. “Webbots, Spiders, and Screen Scrapers” is a niche topic which the majority of web developers never consider. Michael Schrenk apparently does little else and his enthusiasm is evident throughout the book.
While browsers are great, they are general-purpose tools which display web pages and have no concept of the underlying information. Schrenk will inspire you to use your web skills in a different context and achieve a lot for a small investment. It certainly inspired me to write a small PHP program to check several thousand pages for holiday availability; it saved hours of hunting.
The 362 pages are split into four main sections. The first introduces fundamental concepts such as PHP's cURL functions, fopen and fgets. The author provides a downloadable library which simplifies most of the example code.
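For anyone who hasn't used them, a minimal fetch with the cURL functions looks something like this. It's my own sketch of the fundamentals rather than code from the book, and the URL is a placeholder:

    <?php
    // Fetch a page with PHP's cURL functions (a sketch, not the book's code).
    $ch = curl_init('https://example.com/');             // placeholder URL
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);      // return the page as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);      // follow redirects
    curl_setopt($ch, CURLOPT_USERAGENT, 'test-bot/0.1'); // identify your bot
    $html = curl_exec($ch);

    if ($html === false) {
        echo 'Fetch failed: ' . curl_error($ch) . "\n";
    }
    curl_close($ch);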
Part two provides a number of sample projects such as link verifiers, image capture bots, FTP downloaders and email analysers. The chapters are concise but provide enough information to get you started.
Part three covers more advanced topics such as spiders, procurement, SSL, authentication, cookies, scheduling (in Windows) and browser macros. Part four describes larger considerations such as proxy servers, fault tolerance, redirection handling and legal implications. The author gives good suggestions and useful pointers. There’s little practical code, but that’s not unexpected given the wide scope of the topics.
Finally, there are a number of useful appendices including a good cURL reference, email-to-SMS gateways, and HTTP and NNTP status codes. The 18-page index illustrates just how many topics Schrenk had to cover.
There are a number of issues which concerned me…
The author dislikes regular expressions. He admits it's a controversial stance but states they are hard to use and don't convey the context of the information they match. My opinion: if regular expressions are useful for anything, it's web page parsing. I accept they can be difficult to comprehend but it's possible to create a series of simpler expressions if that becomes an issue. I'm not convinced by the 'lack of context' argument either; a single regular expression can extract titles and their associated data.
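For example, one pattern can grab every title and its link in a single pass. A sketch, where the markup structure is an assumption for the demonstration:

    <?php
    // One regular expression extracting titles and their associated links.
    $html = '<a class="story" href="/one">First story</a>'
          . '<a class="story" href="/two">Second story</a>';

    preg_match_all(
        '/<a class="story" href="([^"]+)">([^<]+)<\/a>/',
        $html,
        $matches,
        PREG_SET_ORDER
    );

    foreach ($matches as $m) {
        echo $m[2] . ' => ' . $m[1] . "\n";  // title => link
    }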
There is also a chapter dedicated to reverse-engineering HTML forms. It recommends using an online form parser provided by the author but makes no mention of HTTP analysis tools such as Fiddler, Firebug, HttpFox or Live HTTP Headers, which make the process immeasurably easier.
I was also surprised that hashes and checksums weren't described in the chapters about data storage. File compression is covered, but there may be no need to store or compress a file at all if a previously generated checksum indicates it hasn't changed.
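A sketch of what I mean, with placeholder file names and URL; sha1() is just one of several suitable hash functions:

    <?php
    // Only store (and compress) a page when its checksum has changed.
    $html     = file_get_contents('https://example.com/');  // placeholder URL
    $checksum = sha1($html);
    $previous = is_file('page.sha1') ? file_get_contents('page.sha1') : '';

    if ($checksum !== $previous) {
        file_put_contents('page.html.gz', gzencode($html)); // store compressed copy
        file_put_contents('page.sha1', $checksum);          // remember the checksum
    } else {
        echo "Page unchanged, nothing to store.\n";
    }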
Finally, my biggest concern is the use of the author's own libraries, which he admits are not particularly elegant. The samples describe how to use his code rather than the lower-level PHP and cURL calls. That's a shame; the book covers a heavy-duty topic and is aimed at an advanced audience, yet Schrenk abstracts the technicalities away from curious developers.
“Webbots, Spiders, and Screen Scrapers” is well-written and easy to read. Schrenk will encourage you to look at the web as a data resource and inspire you to write useful code which saves time and money. It’s ideal for those new to the subject, but is a little too lightweight for experienced developers.
Note for Amazon: you really should look at your prices for paper and Kindle versions in the US!