Webbots, Spiders, and Screen Scrapers, 2nd Edition

A Guide to Developing Internet Agents with PHP/CURL
by Michael Schrenk
March 2012, 392 pp.
ISBN-13: 978-1-59327-397-2

Webbots, Spiders, and Screen Scrapers is "unmatched to my knowledge in how it covers PHP/CURL. It explains to great details on how to write web clients using PHP/CURL, what pitfalls there are, how to make your code behave well and much more."
Daniel Stenberg, creator of cURL

There's a wealth of data online, but sorting and gathering it by hand can be tedious and time consuming. Rather than click through page after endless page, why not let bots do the work for you?

Webbots, Spiders, and Screen Scrapers will show you how to create simple programs with PHP/CURL to mine, parse, and archive online data to help you make informed decisions. Michael Schrenk, a highly regarded webbot developer, teaches you how to develop fault-tolerant designs, how best to launch and schedule the work of your bots, and how to create Internet agents that:

  • Send email or SMS notifications to alert you to new information quickly
  • Search different data sources and combine the results on one page, making the data easier to interpret and analyze
  • Automate purchases, auction bids, and other online activities to save time

Sample projects for automating tasks like price monitoring and news aggregation will show you how to put the concepts you learn into practice.

This second edition of Webbots, Spiders, and Screen Scrapers includes tricks for dealing with sites that are resistant to crawling and scraping, writing stealthy webbots that mimic human search behavior, and using regular expressions to harvest specific data. As you discover the possibilities of web scraping, you'll see how webbots can save you precious time and give you much greater control over the data available on the Web.

Author Bio 

Michael Schrenk has developed webbots for over 15 years, working just about everywhere from Silicon Valley to Moscow, for clients like the BBC, foreign governments, and many Fortune 500 companies. He's a frequent Defcon speaker and lives in Las Vegas, Nevada.

Table of contents 

Introduction

Part I: Fundamental Concepts and Techniques
Chapter 1: What’s in It for You
Chapter 2: Ideas for Webbot Projects
Chapter 3: Downloading Web Pages
Chapter 4: Basic Parsing Techniques
Chapter 5: Advanced Parsing with Regular Expressions
Chapter 6: Automating Form Submission
Chapter 7: Managing Large Amounts of Data

Part II: Projects
Chapter 8: Price-Monitoring Webbots
Chapter 9: Image-Capturing Webbots
Chapter 10: Link-Verification Webbots
Chapter 11: Search-Ranking Webbots
Chapter 12: Aggregation Webbots
Chapter 13: FTP Webbots
Chapter 14: Webbots That Read Email
Chapter 15: Webbots That Send Email
Chapter 16: Converting a Website into a Function

Part III: Advanced Technical Considerations
Chapter 17: Spiders
Chapter 18: Procurement Webbots and Snipers
Chapter 19: Webbots and Cryptography
Chapter 20: Authentication
Chapter 21: Advanced Cookie Management
Chapter 22: Scheduling Webbots and Spiders
Chapter 23: Scraping Difficult Websites with Browser Macros
Chapter 24: Hacking iMacros
Chapter 25: Deployment and Scaling

Part IV: Larger Considerations
Chapter 26: Designing Stealthy Webbots and Spiders
Chapter 27: Proxies
Chapter 28: Writing Fault-Tolerant Webbots
Chapter 29: Designing Webbot-Friendly Websites
Chapter 30: Killing Spiders
Chapter 31: Keeping Webbots out of Trouble

Appendix A: PHP/CURL Reference
Appendix B: Status Codes
Appendix C: SMS Gateways

Index

Reviews 

"Webbots, Spiders, and Screen Scrapers is well-written and easy to read. Schrenk will encourage you to look at the web as a data resource and inspire you to write useful code which saves time and money"
Craig Buckler, SitePoint

"This book is a great resource for those looking to move beyond the Internet browser with automated solutions for collecting and using data. It should prove to stimulate your imagination with the possibilities of what can be done."
iComputebetter.com

"There are certainly many ways that a web developer can learn to code webbots and spiders, but one would be hard pressed to find a better starting point than reading Schrenk's second edition. The text and its associated code library lay an excellent foundation from which almost no webbot project is out of reach."
Ecommerce Developer

"Overall the book is interesting and readable, and the code is straightforward and easy to follow even for those without a solid grounding in PHP."
TechBookReport

"This book is one of the few that attempts to gather together the range of techniques that you need to write programs that work with web sites intended to be used by humans."
I Programmer

"Overall, I found this a very clear, very readable, and thorough presentation of the topic. Given that this is the second edition of this volume, others before have realized that Schrenk has written probably the definitive introduction to this topic and made the whole field of crawlers, spiders, and bots an approachable and interesting area to explore. Highly recommended."
Andrew Binstock, Dr. Dobbs

Updates 

Page 83
In Table 7-1, where the code is:

$array = exe_sql(DATABASE, "select *
from people where ID='2'");
$array['ID']="2";

The result code is...

$array['NAME']="Sabrina Duncan";
$array['CITY']="Anaheim";
$array['STATE']="CA";
$array['ZIP']="92604";

The last line of the result, $array['ZIP']="92604";, should instead read $array['ZIP']="92812";