Abstract
 
A web spider is an automated socket application that requests data from a network server. Based on the presence and context of the collected data, the spider can take new actions. In this way, spiders are autonomous robots that surf the web and collect data according to pre-defined logic and criteria. A common application is to automate requests to a web server by fetching its URLs over HTTP. Once the data is received, it is parsed for additional URLs or any other desired content. The new URLs are saved to a list that the web spider crawls through; each new URL is visited and the process is repeated until all available URLs have been visited.
 
 
1. Introduction
 
A web spider acts like a robot surfing the Internet. You can give a web spider simple instructions for fetching, reading, and handling content served by any available web server, typically via the HTTP and HTTPS protocols. There are two basic kinds of spiders: indexers and scrapers.
 
Indexers require very little logic to operate and understand little beyond the basics of HTML. So indexers do just that: they index what they find. Their advantage is that they can index large numbers of URLs, and the attributes associated with them, using relatively little data and in a relatively short amount of time for the resources given.
 
Scrapers require much more logic; their operation and context handling are more specific and geared toward the desired data set. The application “scrapes” data off the web page much as you would if you were viewing the site in a browser and wanted to take the desired content off the screen. You would “scrape” it off. These scraping spiders have grown into what is now called the data mining industry on the Internet.
 
 
 
2. Indexers
 
Indexers (also known as dumb spiders) require simple logic to operate and not much input other than a URL to start from. The spider starts on the given URL, parses the page like a browser would, locates and returns the available linked URLs and repeats the process. Along the way, URLs are logged for indexing. Any data that can be matched for collection can be saved. A common thing to do is log all URLs the spider visited and their return code to a database. This way, you can see all available links on a website and their availability to an end user.
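A rough sketch of this loop in Python is shown below, assuming the third-party requests and beautifulsoup4 packages; the function name and page limit are illustrative, not part of any particular spider. It fetches a URL, logs its status code, queues any links it finds within the starting domain, and repeats.

# Minimal indexing-spider sketch; assumes the third-party requests and
# beautifulsoup4 packages are installed.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def index_site(start_url, max_pages=100):
    """Crawl from start_url, recording each visited URL and its status code."""
    seen = {start_url}
    queue = deque([start_url])
    index = []  # (url, status_code) pairs; a real indexer would log these to a database

    while queue and len(index) < max_pages:
        url = queue.popleft()
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            index.append((url, None))  # record unreachable URLs as well
            continue
        index.append((url, response.status_code))

        # Parse the page like a browser would and collect the linked URLs.
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            # Stay within the starting domain, as described above.
            if urlparse(link).netloc == urlparse(start_url).netloc and link not in seen:
                seen.add(link)
                queue.append(link)
    return index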
 
A working example of a dumb spider is ASPIDER [1], which crawls a given URL and attempts to download any file that matches a given MIME type. Indexers attempt to index URLs within a given range. Someone might want to index a particular website, so the indexing spider knows to crawl only URLs from that domain. In this way, indexing spiders can autonomously control which URLs to visit.
 
Thus, indexers are used to determine what information exists on a given website. The pages returned from requesting each URL are parsed into what the end user would see and indexed based on keyword patterns. The indexing spider tries to determine what text content is visible to the end user and computes the keyword ratio for various keywords. Indexers can also be automated to download any matching file or URL. Perhaps you want to create an indexing spider that crawls a web page and downloads a given file type, such as PDFs or MP3s. Indexing spiders perform this role quite well.
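As a small illustration of that downloading role, the sketch below (assuming the third-party requests package; the wanted MIME types and download directory are placeholders) checks a URL's Content-Type header and saves the response body only when it matches a wanted type such as PDF or MP3.

# Download a URL only if its Content-Type matches a wanted MIME type;
# assumes the third-party requests package.
import os
from urllib.parse import urlparse

import requests

WANTED_TYPES = ("application/pdf", "audio/mpeg")  # e.g. PDFs and MP3s

def download_if_matching(url, dest_dir="downloads"):
    response = requests.get(url, timeout=10)
    content_type = response.headers.get("Content-Type", "").split(";")[0].strip()
    if content_type not in WANTED_TYPES:
        return None
    os.makedirs(dest_dir, exist_ok=True)
    filename = os.path.basename(urlparse(url).path) or "index"
    path = os.path.join(dest_dir, filename)
    with open(path, "wb") as f:
        f.write(response.content)
    return path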
 
 
3. Data Mining
 
Scrapers, or data miners, are also known as smart spiders. Ironically, their logic often breaks and is more difficult to maintain because it is typically based on a website's design and layout. These spiders have earned the reputation of being the "duct tape" of the Internet because there is often no official protocol for collecting the desired data. However, their existence, and the fact that they work, is what some entire online industries are built on. Instead of having the spider run over every available link, a specific set of data is usually desired and the whole site need not be visited. This also requires some customization: site-specific modules or logic for parsing the desired data. Essentially, a smart spider is like a macro for a browser. You want to automate the clicking and downloading of specific content on a site, not all the available links.
 
 
These spiders can be used to collect a specific set of data from a given online resource, for example, gathering information from a television guide listing website. You would create a smart spider to target the listing page only, then download and parse the data containing the television guide listings. Large numbers of data sources can be monitored and collected, with the drawback that the data collection depends on website layout and design. Thus the goal of a smart spider writer is to base these applications on as much generic information as possible. Instead of writing code to look for data in one particular table tag, you might parse all data in all tables and grab whatever matches a time format, 9:00 a.m. for example. It depends on the data that needs to be gathered, but generalizing the aggregating logic is important when creating these spiders. Too much site-specific logic can cause endless rewriting. Of course, if the entire layout and functionality of the site changes drastically, then no generic logic in the world can save the smart spider from a rewrite.
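A minimal sketch of that generic approach is shown below, assuming the third-party beautifulsoup4 package; the regular expression is only one possible time pattern. Rather than targeting one specific table, it scans the text of every table cell and keeps anything that looks like a time of day.

# Scan every table cell for a time-of-day pattern such as "9:00 a.m.";
# assumes the third-party beautifulsoup4 package.
import re

from bs4 import BeautifulSoup

TIME_PATTERN = re.compile(r"\b\d{1,2}:\d{2}\s*(?:a\.m\.|p\.m\.|am|pm)", re.IGNORECASE)

def extract_times(html):
    soup = BeautifulSoup(html, "html.parser")
    times = []
    for cell in soup.find_all(["td", "th"]):
        times.extend(TIME_PATTERN.findall(cell.get_text(" ", strip=True)))
    return times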
 
 
4. Applications
 
Data consolidation, or data aggregation, is big business on the Internet, and it is often powered by web spiders. For example, in any given industry you might have hundreds of different websites all over the Internet serving related content. You would build a smart spider to collect the desired content from each website, then summarize and display it in one location. This keeps users from having to surf all over the Internet, saving them valuable searching time. Let the spiders do the leg work and let the back-end processes save, format, and display the data to the user as a web subscription service instead.
 
Many businesses use data-aggregating applications, including stock ticker information services, real estate listing services, store-locator services, and school or local business information services.
 
The Internet holds a wide variety of important information, but how quickly you can gather that information and analyze it in the right context determines how successful and relevant it becomes. A real-time, up-to-date web service for any type of information on the Internet is a valuable resource. If no real-time service yet exists for a given kind of information, one can probably be set up with little effort and become successful. Up-to-date information is what the Internet is now widely used for.
 
 
5. Examples
 
Working examples of smart spiders include the following: 
♦Updating a database of new businesses from chamber of commerce websites used for targeted mailings.
♦Information service for updating a database of new real estate agents from public license files used for marketing mailings.
♦Updating a database for U.S. census data from census.gov used for local area information and analysis.
♦Services for updating a database of local school information from on-line government resources used for area information.
♦Updating business listings from yellowpages.com used for area information.
♦Providing an updated list of real estate listings for real estate listing websites.
♦Consolidating user account data, such as financial, work, or personal data, from various web interfaces into one interface.

There are still more uses for these types of spiders, and new ones keep emerging as more content appears on the web. Other examples of dumb spiders in use are Google and other search engines performing content indexing. These spiders run all over the Internet and attempt to "index" all of its available content, and they generally obey and respect the /robots.txt file. While Google's spiders are highly advanced and can probably do complex pattern matching, they still perform the same function on each page and rank pages based on keywords. It should be noted that Google possesses modular logic for indexing websites, meaning sites can be categorized against a predefined list of URLs and categories. If a URL falls into a known category, for example, an online shopping mall, then the Google spider can apply that logic, look for the expected data sets, and give matching data more meaning and value. Indexers are quickly becoming smarter in nature.
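A minimal sketch of honoring /robots.txt before fetching a page, using the urllib.robotparser module from the Python standard library (the user-agent string is a placeholder):

from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(url, user_agent="ExampleSpider"):
    """Return True if the site's robots.txt permits this user-agent to fetch url."""
    parser = RobotFileParser()
    parser.set_url(urljoin(url, "/robots.txt"))
    parser.read()  # downloads and parses the site's robots.txt
    return parser.can_fetch(user_agent, url)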
 
As HTML formatting and cascading style sheets come into wider use, more predictable parsing routines can be written to understand web content. As indexers become more and more like scrapers, a library of artificially intelligent logic will emerge that closely mimics how a human surfs the web and gathers and understands content. Until that day, hard-coding site- or page-specific configuration is still necessary to gather unique and unformatted information on the Internet.
 
 
6. Server-Side Point of View
 
Most web servers are not expecting spiders to scrape their data. Some websites do want their content scraped so it becomes more widespread, such as banks publishing their rates, real estate listings, business listings, and online advertisers. Other websites contain content that is free to take, such as public information: census data, public real estate tax roll information, public court documents, or government-funded research documents.
 
There are also websites that display un-copyrightable information yet do not want to be spidered. These could be your competition, or large search engines and websites that block spiders. They want to spend their bandwidth only on live people viewing content, not on data-collecting spiders, and some do not want their data saved into a database for other websites to use. Dealing with this situation takes us to a whole new area of web spiders: stealth spiders.
 
 
7. Stealth Spiders
 
Spidering stealthily without a web browser involves many techniques. The first hurdle is IP blocking. A good solution is to use HTTP or web proxies. Open proxies exist all over the Internet; some are compromised servers set up through malicious attacks, while others are legitimate free services that adhere to standards and protocols. Take care in choosing the right proxies. The rule of thumb is to know the proxy you are using. While that isn't always possible, you should at least test the proxies you are going to use before relying on them. By creating a proxy library that can manage and test open proxy lists, you can build a good network of proxy farms at your disposal. By using multiple open proxies, a spider can look like a thousand individual users surfing a website.
 
 
Of course, if you use the same proxy over and over, that could get flagged as malicious usage. Take as much care as you can to randomize both the order in which the spider crawls links and the number of links each proxy handles. The idea is to have each proxy look like a separate individual user. Using unknown proxies can also result in unreliable service, so the library that manages the proxies should also be responsible for updating their status to avoid using bad ones.
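A minimal sketch of such a proxy library is shown below, assuming the third-party requests package; the class name and test URL are illustrative placeholders. It tests each open proxy before use, hands out a random working proxy per request, and drops proxies that fail.

import random

import requests

class ProxyPool:
    """Keep a list of tested open proxies and rotate through them at random."""

    def __init__(self, proxies, test_url="http://example.com/"):
        self.test_url = test_url
        # Only keep proxies that respond, as suggested above.
        self.working = [p for p in proxies if self._works(p)]

    def _works(self, proxy):
        try:
            r = requests.get(self.test_url,
                             proxies={"http": proxy, "https": proxy}, timeout=5)
            return r.status_code == 200
        except requests.RequestException:
            return False

    def fetch(self, url):
        """Fetch url through a randomly chosen working proxy."""
        if not self.working:
            raise RuntimeError("no working proxies left")
        proxy = random.choice(self.working)
        try:
            return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
        except requests.RequestException:
            self.working.remove(proxy)  # update proxy status to avoid bad proxies
            return None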
 
Modifying the user-agent is critical in order to resemble a real web surfer. Alternatively, it is possible to drive an actual web browser to surf a website and gather its content. The ideal web spider would automate a number of real web browsers in concert with a packet sniffer like Snort [2] to collect the data. Using Windows COM libraries, you can easily automate a Windows web browser application. 
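A minimal sketch of the user-agent side of this, assuming the third-party requests package (the header values are illustrative placeholders, not any particular browser's real string):

import requests

BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # placeholder browser-like string
    "Accept": "text/html,application/xhtml+xml",
}

def fetch_as_browser(url):
    """Request a page with browser-like headers instead of the library's default user-agent."""
    return requests.get(url, headers=BROWSER_HEADERS, timeout=10)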
 
 
8. Overcoming Hurdles
 
Often websites will put their content in a binary file in an attempt to hide it from unwanted data mining applications like web spiders. For example, PDFs are binary application files meant to be read by humans in a PDF viewer. That was the intention, anyway; many open source applications exist to read and parse a PDF and extract at least the text. Most Microsoft Office documents, like Excel spreadsheets, can also be opened using existing standards, and Java has a good library for handling Excel files. Even Shockwave files can be parsed to extract their audio, video, and textual content. For PDFs you can use the open source tools pdftohtml or pstotext to extract text. For Excel spreadsheets, you can use xls2xml or Java Excel libraries. Shockwave files are a little trickier: you need to launch a Windows application called swfcatcher (a Flash decompiler) and automate the parsing of the Shockwave file to get the content.
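A minimal sketch of calling the pdftohtml tool mentioned above from a spider, assuming pdftohtml is installed and on the PATH (the file names are placeholders):

import subprocess

def pdf_to_html(pdf_path, output_prefix="extracted"):
    """Convert a downloaded PDF to HTML so its text can be parsed like any other page."""
    # Basic invocation: pdftohtml <input.pdf> <output prefix>
    subprocess.run(["pdftohtml", pdf_path, output_prefix], check=True)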
 
 
9. Conclusion
 
Web spiders are still an emerging area of application development on the Internet today. The path of evolution seems to be adding more and more logic to indexing spiders so that they behave more like scraping spiders.
 
 
[1] ASPIDER – open source web spider implemented in Java, http://aspider.sourceforge.net/
[2] Snort – open source network intrusion detection software, http://www.snort.org/