A web spider is an automated socket application that requests data from a network server. Based on the presence and context of the collected data, the spider can take new actions. In this way, spiders are autonomous robots surfing the web, collecting data according to pre-defined logic and criteria. A common application is to automate requests by having the spider fetch a web server's URL over HTTP. Once the data is received, it is parsed for additional URLs or any other desired data. New URLs are saved to a list that the web spider crawls through; each new URL is visited and the process repeats until all available URLs have been visited.
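The fetch-parse-repeat loop described above amounts to a simple worklist algorithm. The sketch below simulates it over an in-memory link graph (the URLs and links are made up) instead of making real HTTP requests:

```python
from collections import deque

# A fake "web": each URL maps to the links found on its page.
# In a real spider, this lookup would be an HTTP fetch plus HTML parsing.
LINKS = {
    "http://site/a": ["http://site/b", "http://site/c"],
    "http://site/b": ["http://site/a"],
    "http://site/c": [],
}

def crawl(start):
    visited, queue = set(), deque([start])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue                          # skip URLs we already logged
        visited.add(url)                      # "log" the visited URL
        queue.extend(LINKS.get(url, []))      # parse the page, enqueue new URLs
    return visited

seen = crawl("http://site/a")
```

The visited set is what keeps the process finite: the loop stops once every reachable URL has been seen exactly once.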
A web spider acts like a robot surfing the Internet. You can give a web spider a simple instruction for fetching, reading, and handling content served by any available web server, typically via the HTTP and HTTPS protocols. There are two basic kinds of spiders: indexers and scrapers.
Indexers require very little logic to operate and contextually cannot understand much beyond the basics of HTML. So indexers do just that: they index what they find. Their advantage is that they can index large numbers of URLs, and their associations to various attributes, with relatively little data and in a relatively short amount of time given the available resources.
Scrapers require much more logic, and their operation and context management are more specific and geared toward the desired data set. In this way, the application “scrapes” the data off the web page much as you would if you were viewing the site in a browser and wanted to take the desired content off the screen. You would “scrape” it off. These scraping spiders have grown into what is now called the data mining industry on the Internet.
Indexers (also known as dumb spiders) require simple logic to operate and little input other than a URL to start from. The spider starts on the given URL, parses the page as a browser would, locates and returns the available linked URLs, and repeats the process. Along the way, URLs are logged for indexing, and any data that matches the collection criteria can be saved. A common practice is to log every URL the spider visits, along with its HTTP return code, to a database. This way, you can see all available links on a website and their availability to an end user.
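The parsing step can be sketched with Python's standard-library HTML parser; the page content and URLs below are invented for illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href attributes from anchor tags, as a dumb spider would."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

# A page the spider might have fetched; in a real run this comes from the server.
page = ('<html><body><a href="/about.html">About</a>'
        '<a href="http://other.example.com/">Other</a></body></html>')

extractor = LinkExtractor("http://www.example.com/index.html")
extractor.feed(page)

# Keep only same-domain links, as an indexer restricted to one site would.
same_site = [u for u in extractor.links
             if u.startswith("http://www.example.com")]
```

Each URL in `same_site` would then be queued for the next round of crawling, while all of `extractor.links` could be logged to the index.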
A working example of a dumb spider is ASPIDER, which simply spiders a given URL and attempts to download any file that matches a given MIME type. Indexers attempt to index a given range of URLs. Perhaps someone wants to index a single website, so the indexing spider knows to crawl only URLs from the given domain. In this way, indexing spiders can autonomously control which URLs to visit.
Thus, indexers are used to determine what information exists on a given website. The pages returned from requesting each URL are parsed into what the end user would see and indexed based on keyword patterns. The indexing spider tries to determine what text content is visible to the end user and computes the keyword ratio for various keywords. Indexers can also be automated to download any matching file or URL. Perhaps you want to create an indexing spider that crawls a web page and downloads a given file type such as PDFs or MP3s; indexing spiders perform this role quite well.
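The download filter can be as simple as matching the extension on the URL's path. A sketch, with a hypothetical list of discovered URLs:

```python
from urllib.parse import urlparse

WANTED = (".pdf", ".mp3")   # file types this spider is told to fetch

def wants_download(url):
    """Return True when the URL path ends in a file type we want."""
    path = urlparse(url).path.lower()   # ignore query strings and case
    return path.endswith(WANTED)

# URLs the spider might have found while crawling (made up for illustration).
found = [
    "http://example.com/report.PDF",
    "http://example.com/page.html",
    "http://example.com/music/song.mp3?track=1",
]
to_fetch = [u for u in found if wants_download(u)]
```

Using `urlparse` rather than testing the raw string means a query string like `?track=1` does not hide a matching file.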
3. Data Mining
Scrapers, or data miners, are also known as smart spiders. Ironically, these spiders' logic often breaks and is more difficult to maintain because it is (typically) based on website design and layout. These spiders have earned the reputation of being the "duct tape" of the Internet because there is often no official protocol for collecting the desired data. However, their existence, and the fact that they do work, is what some entire online industries are based on. Instead of having the spider run over every available link, a specific set of data is usually desired and the whole site need not be visited. This also requires some customization: site-specific modules or logic for parsing the desired data. Essentially, a smart spider is like a macro for a browser: you want to automate the clicking and downloading of specific content on a site, not all the available links.
These spiders can be used to gather a specific set of data from a given online resource, for example, a television guide listing website. You would create a smart spider to target only the listing page and download and parse the data containing the television guide listings. Large numbers of data sources can be monitored and collected, with the drawback that data collection is dependent on website layout and design. Thus, the goal of a smart spider writer is to base these applications on as generic information as possible. Instead of writing code to look for data in a particular table tag, you might parse all data in all tables and grab whatever matches a time format, 9:00 a.m. for example. It depends on the data that needs to be gathered, but generalizing the aggregating logic is important when creating these spiders. Too much site-specific logic can cause endless rewriting. Of course, if the entire layout and functionality of the site changes drastically, then no generic logic in the world can save the smart spider from a logical rewrite.
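Matching on the shape of the data rather than the page layout might look like this; the HTML fragment is a made-up stand-in for a listings page:

```python
import re

# A hypothetical chunk of a TV-listing page. The table markup may change
# between redesigns, so we match the data's shape (a time like "9:00 a.m.")
# instead of depending on specific tags or layout.
html = """
<table><tr><td>9:00 a.m.</td><td>Morning News</td></tr>
<tr><td>10:30 p.m.</td><td>Late Movie</td></tr></table>
"""

TIME_RE = re.compile(r"\b(\d{1,2}:\d{2}\s*[ap]\.m\.)", re.IGNORECASE)
times = TIME_RE.findall(html)
```

If the site later wraps each listing in a `<div>` instead of a table cell, this pattern keeps working, which is exactly the point of generalizing the aggregating logic.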
Data consolidation, or data aggregation, is a big business on the Internet, and it is often accomplished through the use of web spiders. For example, in any given industry you might have hundreds of different websites all over the Internet serving related content. You would build a smart spider to collect the desired content from each website, then summarize and display it in one location. This spares users from having to surf all over the Internet, saving them valuable searching time. Let the spiders do the legwork and let the back-end processes save, format, and display the data to the user as a web subscription service instead.
Many businesses use data-aggregating applications, including stock ticker information services, real estate listing services, store locator services, and school or local business information services.
The Internet holds a wide variety of important information, but how quickly you can gather this information and analyze it in the right context determines how successful and relevant the information can become. A real-time, up-to-date web service for any type of information on the Internet is a valuable resource. If no real-time service yet exists for some information-based need on the Internet, one can probably be set up with little effort and become successful. Up-to-date information is what the Internet is now widely used for.
Working examples of smart spiders include:
♦Updating a database of new businesses from chamber of commerce websites used for targeted mailings.
♦Information service for updating a database of new real estate agents from public license files used for marketing mailings.
♦Updating a database for U.S. census data from census.gov used for local area information and analysis.
♦Services for updating a database of local school information from on-line government resources used for area information.
♦Updating business listings from yellowpages.com used for area information.
♦Providing an updated list of real estate listings for real estate listing websites, or consolidating user account data from various web interfaces, such as financial, work, or personal data, into one web interface.
There are still more uses for these types of spiders, and new uses emerge as more content comes to the web. Another example of dumb spiders in use is Google, or other search engines, for content indexing. These spiders run all over the Internet and attempt to "index" the entire Internet's available content. They generally obey and respect the /robots.txt file. While Google's spiders are highly advanced and can probably do complex pattern matching, they still perform the same function on each page and rank pages based on keywords. It should be noted that Google possesses modular logic for indexing websites, meaning websites can be categorized based on a predefined list of URLs and categories. If a URL exists in a list of categories, for example, if a website is an online shopping mall, then the Google spider can apply that logic, look for the corresponding data sets, and give matching data more meaning and value. Indexers are quickly becoming smarter in nature.
As HTML formatting and cascading style sheets come into wider use, more predictable data-parsing routines can be built to understand web content. As indexers become more and more like scrapers, a library of artificially intelligent logic will emerge that closely mimics how a human surfs the web and gathers and understands content. But until that day, hard-coding site- or page-specific configuration is still necessary to gather unique and unformatted information on the Internet.
6. Server Side Point of View
Most web servers are not expecting spiders to scrape their data. Some websites want their content scraped so it becomes more widespread, such as banks publishing their rates, real estate listings, business listings, and online advertisers. Some websites even contain content that is free to take, such as public information: census data, public real estate tax rolls, public court documents, or government-funded research documents.
There are also websites that display un-copyrightable information yet do not want to be spidered. This could be your competition, or large search engines and websites that block spiders. They only want to spend their bandwidth on live people viewing content, not data-collecting spiders. Some do not want their data saved into a database for other websites to use. Dealing with this situation takes us to a whole new area of web spiders: stealth spiders.
7. Stealth Spiders
Spidering with stealth spiders, without using a web browser, involves many techniques. The first hurdle is IP blocking. A good solution is to use HTTP or web proxies. Open proxies exist all over the Internet: some are servers compromised through malicious attacks, while others are legitimate free-service proxies that adhere to standards and protocols. Take care to use the right proxy. The rule of thumb is to make sure you know the proxy you are using; while this isn't always possible, you should at least test the proxies before using them. By creating a proxy library that can manage and test open proxy lists, you can build a reliable network of proxy farms at your disposal. By using multiple open proxies, a spider can look like 1,000 individual users surfing a website.
Of course, if you use the same proxy over and over, it could get flagged for malicious usage. Take as much care as you can to randomize both the order in which the spider crawls the links and the number of links each proxy handles. The idea is to have each proxy look like a separate individual user. Using unknown proxies can also result in unreliable service, so the library that manages the proxies should also be responsible for updating the status of each proxy to avoid using bad ones.
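A sketch of such a proxy-management library, reduced to rotation and status tracking; the proxy addresses are placeholders, and a real version would also test each proxy over the network before trusting it:

```python
import random

class ProxyPool:
    """Hands out a random live proxy per request, so the crawl looks like
    many separate users, and retires proxies that fail."""
    def __init__(self, proxies):
        self.live = set(proxies)
        self.dead = set()

    def pick(self):
        if not self.live:
            raise RuntimeError("no live proxies left")
        # sorted() gives a stable list for random.choice to draw from
        return random.choice(sorted(self.live))

    def mark_bad(self, proxy):
        """Record a failed request so this proxy is never handed out again."""
        self.live.discard(proxy)
        self.dead.add(proxy)

pool = ProxyPool(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
proxy = pool.pick()              # a different proxy may come back each call
pool.mark_bad("10.0.0.2:8080")   # a failed request retires that proxy
```

The spider would route each request through `pool.pick()` and call `mark_bad` on any timeout or error, keeping the farm healthy without manual bookkeeping.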
Modifying the user-agent is critical in order to resemble a normal web surfer. Alternatively, it is possible to use an actual web browser to surf a website and gather its content. The ideal stealth spider would automate a number of real web browsers in concert with a packet sniffer such as Snort to collect the data. Using Windows COM libraries, you can easily automate a Windows web browser application.
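As a sketch of the user-agent change itself (not the COM-driven browser approach), using Python's standard urllib; the browser string is just one plausible value:

```python
import urllib.request

# A spider that sets no User-Agent announces itself as "Python-urllib",
# which many servers flag. Sending a common browser string blends in instead.
BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/120.0 Safari/537.36")

req = urllib.request.Request(
    "http://www.example.com/",
    headers={"User-Agent": BROWSER_UA},
)
# urllib.request.urlopen(req) would now send the browser-like header.
```

Combined with proxy rotation, each request then looks like an ordinary browser session from an ordinary address.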
8. Overcoming Hurdles
Often websites will put their content in a binary file in an attempt to hide it from unwanted data mining applications like web spiders. For example, PDFs are binary application files that can only be read by humans in a PDF viewer; that was the intention, anyway. Many open source applications exist to read and parse a PDF and extract at least the text. Most Microsoft Office documents, such as Excel spreadsheets, can also be opened using existing standards; Java has a good library for handling Excel files. Even Shockwave files can be parsed to extract their audio, video, and textual content. For PDFs you can use the open source applications pdftohtml or pstotext to extract text. For Excel spreadsheets, you can use xls2xml or Java Excel libraries. Shockwave files are a little trickier: you need to launch a Windows application called swfcatcher (a Flash decompiler) and automate the parsing of the Shockwave file to get the content.
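One way to wire those tools into a spider is a small dispatch table mapping file types to extraction commands. The command templates below assume the tools mentioned above are installed and accept the source file as their argument, which may not match every version; treat them as placeholders:

```python
import shlex

# Hypothetical command templates: each maps a binary format to an external
# extraction tool. Adjust to the actual flags of your installed versions.
EXTRACTORS = {
    ".pdf": "pstotext {src}",
    ".xls": "xls2xml {src}",
}

def extraction_command(src):
    """Return the argv list for extracting text from src, or None if
    the file type has no registered extractor."""
    for ext, template in EXTRACTORS.items():
        if src.lower().endswith(ext):
            return shlex.split(template.format(src=shlex.quote(src)))
    return None

cmd = extraction_command("report.pdf")   # ready for subprocess.run(cmd, ...)
```

New formats are then supported by adding one entry to the table, rather than by touching the spider's crawl logic.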
Web spiders are still an emerging field of application development on the Internet today. The path of evolution seems to be adding more and more logical functionality to indexing spiders to make them more like scraping spiders.
Since computer networks and the Internet have existed, they have been vulnerable to malicious attacks. The trend in network security has been to prevent known network attacks while monitoring the network for new ones. Recently, network administrators have also been trying to react to and prevent attacks dynamically, that is, as the attacks occur, so the action to block the intruder is taken immediately. Over the years, network administrators have been closing ports and securing resources via firewalls and access control lists. This has pushed attackers to increase the sophistication of their attacks and their knowledge of the target network and systems. It has been an ongoing struggle between attacker and network administrator ever since. When the attacker finds a vulnerability in the network and exploits it, the network administrator must identify the intrusion and launch an application to take the appropriate action and block future attacks. This leaves the attacker to find a new, more sophisticated attack, and the cycle continues.
As it stands now, the main tool network security personnel use to identify attackers is an IDS, or "Intrusion Detection System". And the main problem they face is parsing and analyzing the torrent of log entries the IDS produces in order to make meaningful sense of it all. The data should be analyzed and parsed in a timely manner so action can be taken to block the intruder in time if necessary. Network administrators are dealing with the problem through a few commercial software solutions and by fine-tuning IDS rule sets to eliminate false positives and return a more valid alert ratio.
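A sketch of that parsing step: grouping alerts per source address so a human can triage them. The line format is loosely modeled on Snort's fast-alert output and the sample entries are fabricated, so treat the pattern as an assumption rather than a drop-in parser:

```python
import re

# Matches "[**] message [**] ... src_ip:port ->" in a simplified alert line.
ALERT_RE = re.compile(
    r'\[\*\*\]\s*(?P<msg>.+?)\s*\[\*\*\].*?(?P<src>\d+\.\d+\.\d+\.\d+):\d+\s*->'
)

log = [
    '03/15-12:00:01 [**] ICMP PING NMAP [**] {ICMP} 10.1.2.3:0 -> 192.168.0.5:0',
    '03/15-12:00:09 [**] WEB-IIS cmd.exe access [**] {TCP} 10.1.2.3:4444 -> 192.168.0.5:80',
]

hits = {}
for line in log:
    m = ALERT_RE.search(line)
    if m:
        # Group alert messages by attacking source address for triage.
        hits.setdefault(m.group("src"), []).append(m.group("msg"))
```

A summary keyed by source address turns thousands of raw lines into a short list of hosts worth blocking, which is exactly the timeliness problem described above.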
2. Popular Methods of Intrusion
As of late, popular methods of intrusion or attack are viruses and DDoS, or Distributed Denial of Service, attacks. Viruses often succeed through the proliferation of email messages and popular insecure email clients like Microsoft Outlook, along with malicious web pages and insecure web browsers such as Microsoft Internet Explorer. The virus typically appears as a valid email attachment, such as a zip file, and upon opening the file, the virus installs itself. Or, if a malicious URL is accessed with an insecure browser, the application can simply be installed by the web browser. Once the virus is installed, any number of resources can be exploited. One common application is to log and transmit the infected user's keystrokes in hopes of gaining sensitive information such as passwords and financial data. Another is to run a TCP proxy on the machine to give remote users access to the infected user's resources: using the infected machine as a mail proxy for spam, as a web proxy in concert with a DDoS attack, or for any number of TCP/UDP services. DDoS attacks can target any number of hosts on the network. The attacker simply overloads an exposed, vulnerable network service with legitimate requests from multiple clients. If the number of clients and requests exceeds the load capacity of the server, the server can become too loaded to serve regular requests, resulting in latency or packet loss. As of March 2005, these attacks occur every few months on a wide scale and prove successful in causing latency and occasional downtime. This is mostly due to the popularity of Microsoft and the fact that its applications frequently have security flaws.
3. Other Methods of Intrusion
For Unix users, flaws in Microsoft applications aren't a problem, but they still face the same issues. On a seemingly much less frequent basis, security flaws are found in popular open source applications and open source operating system kernels. These exploits are typically patched within hours of discovery, and usually the exploit does no more damage than a DoS; rarely is information or system control compromised. It is worth noting that certain Unix operating systems are historically known to have chronic problems with kernel security; the biggest and most popular one is the Linux kernel. It is probable that its popularity makes it a more frequent target for malicious attackers looking for exploits, but the regular occurrence of serious kernel bugs is unmistakable. Another method of intrusion on Unix operating systems is a Trojan, also called a "root kit". The root kit is an application usually written around a known security flaw in a kernel or an application running as root; it exploits the flaw to give the attacker partial or full control over the system as the superuser.
4. Methods of Defense and Prevention
When it comes to protecting networks, the security administrator has a few mechanisms. One method is to use a firewall, which blocks traffic based on rule sets consisting of host addresses and/or port numbers. The more firewalls you set up between hosts on a network, the more you can isolate traffic and protect surrounding networks if an attack were to try to spread throughout your network. Another mechanism is to use access control lists for your hosts and their applications. An access control list, or ACL, is simply a list of hosts you want to allow access. In this way, you can ensure only known host addresses are allowed to use the system and its services. A third line of defense isn't really a defense but more of an alert system: the IDS, or "Intrusion Detection System". An IDS is used to monitor and report known attack signatures contained in network data packets. Firewalls, ACLs, and an IDS are good for blocking known problems, but what about preventing the unknown? As far as prevention goes, there really isn't a good, solid solution yet. Open source and commercial vendors are attempting to bring forth something called an IPS, or Intrusion Prevention System. The main difference between an IDS and an IPS is that an IPS will not only log the attack but also take appropriate action to stop the current attack and prevent further damage. While this sounds like a good idea, it is not yet being implemented successfully. The problem is still distinguishing an attack from random network "noise" (false positives). Attackers are always trying to disguise their intent, so by the nature of the problem it is hard to determine what is really an attack and what isn't. Another problem for large non-homogeneous networks is coordinating the response between hosts on multiple large networks.
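The ACL mechanism described above reduces to a membership test on the client address. A minimal sketch, with made-up networks standing in for a real allow list:

```python
import ipaddress

# Illustrative allow list: only hosts inside these networks may reach
# the service. Real deployments load this from configuration.
ALLOWED = [
    ipaddress.ip_network("192.168.1.0/24"),
    ipaddress.ip_network("10.0.0.0/8"),
]

def is_permitted(host):
    """Return True if the client address falls inside any allowed network."""
    addr = ipaddress.ip_address(host)
    return any(addr in net for net in ALLOWED)
```

A server would call `is_permitted` on each incoming connection's source address and drop anything that fails the check, which is exactly the "known hosts only" guarantee an ACL provides.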
On a large network, there might be many different hosts and many different firewalls all running different operating systems (such is the Internet), and they all need to take action to block a certain malicious host or vulnerable port or application. Negotiating the proper response sometimes gets bogged down in semantics and the non-homogeneous security APIs coexisting on one network. It is preferable to have all hosts and firewalls use the same APIs or protocols. This is possible on a business's networks but not likely to happen on the Internet as a whole. Even on a network where all hosts run the same systems and security APIs, problems can arise when preventing an ongoing attack.
Network security is still an emerging and fast-evolving profession. Commercial vendors still fall short of their promises, and the only real way to secure a network is to have as many sets of human eyes on security log files as often as possible. Most of the process can be automated, but a human is still needed to look at the packet or intrusion attempt to determine whether it is malicious.