QuadSucker/Web: User's Manual

(c) Scott M Baker, smbaker@quadsucker.com


This documentation also serves for UltraSucker/Web, and you may find specific differences between QuadSucker and UltraSucker below.

Purpose:

QuadSucker/Web downloads files from websites to your computer. Unattended operation is supported, allowing entire websites to be downloaded automatically without user intervention. Here is a list of some possible uses:

Special Features:

This program is optimized for downloading from image-intensive sites and provides built-in jpeg and gif image viewing during the download process. Multithreading is used to download multiple files in parallel to maximize transfer efficiency. Specific benefits include:


Table of Contents


  1. Preface
    1. Purpose
    2. Features
  2. Getting Started
    1. Installation
    2. Quick Start
  3. Setup
    1. General Settings
      1. Authentication
      2. Directory Structure Creation
      3. Download Directory
      4. Misc Settings
      5. Spider
    2. Filtering
      1. Duplicates
      2. Filter
    3. Advanced Settings
      1. Misc Advanced
      2. Link Relativizer
      3. Proxy Server Settings
  4. Menu Options
  5. Interactive Form Fill-out
  6. Command Line Options
  7. QuadSucker vs UltraSucker
  8. Registration
  9. Contacting the Author
  10. Revision History

Getting Started


Installation:

Installation is very simple and mostly automated:

  1. Unzip the program archive into a directory with a decompressor, such as WinZip.
  2. Run the program Setup.Exe. You may run it from the Windows 95 Explorer, with the Start menu's Run command, or from a shell prompt.
  3. Installation will be performed by the automated setup program

Quick Start:

  1. Execute QuadWeb
  2. Type the URL you want into the Website: field in the main dialog. An example of a URL is http://www.yahoo.com
  3. Press the <Start> button

QuadSucker/Web will begin downloading files from the URL you provided. As each file is downloaded, it will be parsed and its hyperlinks extracted to form new URLs, which will be downloaded in turn. This process continues until the entire site has been downloaded.

When files are downloaded, they are (by default) placed in a directory structure that mirrors the web server's. For example, suppose you were to download files from http://www.yahoo.com/science/astronomy, and QuadSucker/Web was installed into c:\program files\QuadWeb. Your files would be placed in:

c:\program files\QuadWeb\www.yahoo.com\science\astronomy

Note that the web includes hyperlinks which may reach outside of that directory. For example, Yahoo places many of its images at http://us.yimg.com. Therefore, some files would be placed in c:\program files\QuadWeb\us.yimg.com

To view the files you have downloaded, use the viewer program of your choice. If you've downloaded a large collection of images, then try SB-Software's SBJV or Sortpics utilities.

Note: Most options above, such as the download directory settings and spidering behaviour, may be modified in QuadSucker/Web's setup menu.


Setup


General Setup:

 

Authentication:

Some websites protect access with authentication which requires you to enter a name and password to access the site. Normally your browser would pop up a window asking for this information when authentication is required, but QuadSucker/Web is intended for automated operation, so it would be best if you entered your authentication information ahead of time.

Press the <add> button to add a new authentication setting. You will be prompted for the following:

Note that some adult sites, such as AdultCheck, do not normally use authentication. On these sites you are actually sent to an AdultCheck form which asks you for information and then directs you back to a "special page" on the originating site. All you need to do is follow this link in your browser, and note which "special page" is returned -- then enter the special page's URL into QuadSucker/Web.

Directory Structure Creation:

If this option is enabled, then a directory structure will be created which mirrors the host's directory structure. (See the notes under Quick Start). If this option is disabled then all of the files will be placed in the same directory.

Most users will probably want directory structure creation enabled. The drawback is that files sometimes have to be hunted down in the directory structure, because web servers sometimes place inline images in funny places. Users who are just downloading bulk files (e.g. from a picture site) may wish to disable directory creation so all of the images are lumped into the same spot.
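
As a rough illustration of the two modes, the sketch below maps a URL to a local path. The function name local_path_for and its parameters are hypothetical; QuadSucker/Web's actual rules (for example, how it treats query strings or characters that are illegal in file names) are not described in this manual.

```python
from pathlib import PureWindowsPath
from urllib.parse import urlsplit


def local_path_for(url: str, download_dir: str,
                   mirror_structure: bool = True) -> PureWindowsPath:
    """Map a URL to a local path (hypothetical helper, not QuadSucker's code)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    root = PureWindowsPath(download_dir)
    if mirror_structure:
        # Host name becomes the top-level folder, URL path the subfolders.
        return root.joinpath(parts.netloc, *segments)
    # Flat mode: keep only the last path segment, everything in one folder.
    return root.joinpath(*segments[-1:])


url = "http://www.yahoo.com/science/astronomy/m31.jpg"
print(local_path_for(url, r"c:\program files\QuadWeb"))
print(local_path_for(url, r"c:\program files\QuadWeb", mirror_structure=False))
```

With mirroring enabled the file lands under www.yahoo.com\science\astronomy; with it disabled the file lands directly in the download directory.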

Download Directory:

The download path is where files will be stored on your computer. You should use a hard drive where there is plenty of disk space! The default is the program directory where QuadSucker/Web is installed.

Misc Options

Delete "short" files

When a web server sends a file, it (usually) includes a count of how many bytes are supposed to arrive in the file. Sometimes a file is completed with fewer bytes than there are supposed to be. In this case, the server has probably aborted the transfer and just not told us about it. The file is probably worthless and should be deleted.
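
The check itself is simple and can be sketched as follows. The function name is_short_file is made up for illustration; for HTTP the announced byte count is the Content-Length header, which a server may omit.

```python
from typing import Optional


def is_short_file(expected_bytes: Optional[int], received_bytes: int) -> bool:
    """Return True when fewer bytes arrived than the server announced.

    expected_bytes is the announced length (for HTTP, the Content-Length
    header); None means the server did not announce one.
    """
    if expected_bytes is None:
        return False  # nothing to compare against
    return received_bytes < expected_bytes


print(is_short_file(10_000, 7_312))   # True: transfer stopped early
print(is_short_file(10_000, 10_000))  # False: complete file
print(is_short_file(None, 7_312))     # False: no announced length
```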

Delete files that time out

A file times out if no bytes are received within a certain interval. The file is most probably worthless and should be deleted.

Delete files that are aborted

Files are aborted when the user presses the <Stop> button, right-clicks and aborts a specific file, or closes the program. The files are probably ruined and should be deleted.

 

Spider:

Spidering is the process of following links in a website to download all of the files. Options in this section control what kind of links are followed and how many of them are followed:

Spider Link Depth Limit:

The link depth limit allows you to set how 'deep' the links will be followed. For example, if the maximum link depth was set to '3', and QuadSucker/Web followed the links a->b->c, and c contained another hyperlink, then the new hyperlink would not be added since the depth is already three levels deep.
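
The a->b->c example can be sketched as a breadth-first walk with a depth counter. This is only an illustration: the dictionary stands in for downloading and parsing real pages, the name crawl_order is hypothetical, and counting the starting page as depth 1 is an assumption chosen to match the example.

```python
from collections import deque


def crawl_order(links, start, max_depth):
    """Breadth-first walk over a link graph, honoring a depth limit.

    links maps each page to the pages it links to. The start page
    counts as depth 1 (an assumption made for this sketch).
    """
    queue = deque([(start, 1)])
    seen = {start}
    visited = []
    while queue:
        page, depth = queue.popleft()
        visited.append(page)
        if depth >= max_depth:
            continue  # already max_depth levels deep: do not add its links
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


site = {"a": ["b"], "b": ["c"], "c": ["d"]}
print(crawl_order(site, "a", max_depth=3))  # ['a', 'b', 'c']
```

Page d is never queued because c is already three levels deep.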

Duplicate Checking:

The duplicate checker maintains a database of recently downloaded items to prevent downloading of those items in the future. The database is maintained independently of the files themselves, so you may delete the files from your hard drive and still "remember" them in the duplicate list.

Duplicate Checker Mode:

The duplicate checker may be disabled, in which case no duplicate checking is performed (however, URLs will still be remembered), or the duplicate checking may be set to either 'all' or 'not-modified'. The 'all' setting will always reject a duplicate if it is present in the database. The 'not-modified' setting will ask the web server if the document has been modified. Asking the web server does incur significant overhead, but if you are downloading large files, the savings will be substantial.
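
The three modes amount to a small decision function. The sketch below is an assumption about the logic, not QuadSucker/Web's code: the names should_download and modified_since are invented, and the callback stands in for asking the web server (HTTP offers conditional requests such as If-Modified-Since for this; the manual does not say which mechanism is used).

```python
from typing import Callable, Dict, Optional


def should_download(url: str,
                    database: Dict[str, float],
                    mode: str,
                    modified_since: Optional[Callable[[str, float], bool]] = None) -> bool:
    """Decide whether to fetch url given the duplicate database.

    database maps URL -> time of the last download (seconds since epoch).
    mode is 'disabled', 'all' or 'not-modified'.
    modified_since(url, timestamp) stands in for asking the web server.
    """
    if mode == "disabled" or url not in database:
        return True
    if mode == "all":
        return False  # always reject a known URL
    if mode == "not-modified":
        if modified_since is None:
            raise ValueError("'not-modified' mode needs a modified_since callback")
        return modified_since(url, database[url])
    raise ValueError(f"unknown mode: {mode}")


db = {"http://myhost.com/big.jpg": 946684800.0}
print(should_download("http://myhost.com/big.jpg", db, "all"))       # False
print(should_download("http://myhost.com/new.jpg", db, "all"))       # True
print(should_download("http://myhost.com/big.jpg", db, "disabled"))  # True
```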

Duplicate Checker Entries/Count:

This is a tally of how many URLs are stored in the duplicate checker. You may also selectively delete some of the entries if you wish to get rid of them.

Filter:

The filter options control which types of files are downloaded to your computer. HTML (Hypertext Markup Language) files are always downloaded since they are required for spidering to function. Other file types, such as GIFs, JPEGs, and sound files, may be individually turned on and off.

You may also limit by the byte size of the object, rejecting files that are either too large or too small.
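
A filter of this kind might look like the sketch below. Several things here are assumptions rather than documented behaviour: the name passes_filter, classifying files by URL extension, and letting HTML pages bypass the size limits as well as the type filter.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def passes_filter(url, size_bytes, allowed_extensions,
                  min_bytes=0, max_bytes=None):
    """Return True if a file of size_bytes at url should be kept."""
    ext = PurePosixPath(urlsplit(url).path).suffix.lower()
    if ext in (".html", ".htm"):
        return True  # HTML is always needed for spidering
    if ext not in allowed_extensions:
        return False  # this file type is turned off
    if size_bytes < min_bytes:
        return False  # too small
    if max_bytes is not None and size_bytes > max_bytes:
        return False  # too large
    return True


images = {".jpg", ".gif"}
print(passes_filter("http://myhost.com/pics/a.JPG", 50_000, images, min_bytes=1024))     # True
print(passes_filter("http://myhost.com/pics/thumb.gif", 300, images, min_bytes=1024))    # False
print(passes_filter("http://myhost.com/song.wav", 50_000, images))                       # False
```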

Misc Advanced:

The advanced options are intended for people who feel the need to tamper with things that are best left alone. I suggest you not modify any of the advanced options unless you specifically know what you are doing!

Link Relativizer:

(Please forgive me while I get a bit technical here; if you don't understand the following, then don't worry about it)

Web pages contain several different types of links, such as relative links and absolute links. A relative link specifies a location relative to the page which you are currently viewing. For example, if you were viewing http://myhost.com/mypage.html, and it contained a link called "page2.html", this would be a relative link, and would resolve to http://myhost.com/page2.html.

However, an absolute link specifies an entirely new location. For example, if you came across the link http://anotherhost.com/page2.html, then this would be an absolute link and have no relation to the source document.
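
For readers who like to experiment, Python's standard urljoin function applies the same resolution rules a browser does. The variable names below are only for illustration.

```python
from urllib.parse import urljoin

base = "http://myhost.com/mypage.html"

# A relative link is resolved against the page that contains it.
resolved_relative = urljoin(base, "page2.html")
print(resolved_relative)  # http://myhost.com/page2.html

# An absolute link ignores the containing page entirely.
resolved_absolute = urljoin(base, "http://anotherhost.com/page2.html")
print(resolved_absolute)  # http://anotherhost.com/page2.html
```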

Now, assume that QuadSucker/Web has downloaded a web page to your computer, and you open that page, located on your local disk, with Netscape. Any relative links will be resolved properly by Netscape and point to copies of pages which QuadSucker/Web has downloaded. However, absolute links always point to an entirely new location. If you follow an absolute link with your browser, you would find yourself back out on the Internet, downloading the linked page from scratch.

(end of the technical discussion, now let's talk about the impact)

The link relativizer translates absolute links into relative links. This is essential for viewing pages with your web browser offline. What this means:

All that being said, it should also be noted that the link relativizer is EXPERIMENTAL. I'm still working out the bugs, so some pages might not display properly. I suggest you try to get what you want done without using the link relativizer, and only enable it if you determine there is a problem that absolutely requires the relativizer to work.
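
To make the idea concrete, here is a minimal sketch of turning an absolute link into a relative one. It assumes directory structure creation is enabled, so that each host gets its own folder under the download directory; the helper names mirror_path and relativize are invented, and the real relativizer may work quite differently.

```python
import posixpath
from urllib.parse import urlsplit


def mirror_path(url: str) -> str:
    """Path of a URL's local copy, relative to the download directory."""
    parts = urlsplit(url)
    return posixpath.join(parts.netloc, parts.path.lstrip("/"))


def relativize(page_url: str, link_url: str) -> str:
    """Rewrite an absolute link so it points at the local copy of its target."""
    page_dir = posixpath.dirname(mirror_path(page_url))
    return posixpath.relpath(mirror_path(link_url), start=page_dir)


print(relativize("http://myhost.com/mypage.html",
                 "http://anotherhost.com/page2.html"))
# ../anotherhost.com/page2.html
print(relativize("http://myhost.com/a/index.html",
                 "http://myhost.com/img/logo.gif"))
# ../img/logo.gif
```

A browser opening the local copy of mypage.html would then follow the rewritten link to the local copy of page2.html instead of going back out to the Internet.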

Proxy Server Configuration:

A proxy server is a computer which services web requests on behalf of its network clients. Proxy servers are common in large corporations, and even in small companies where one computer is set up with an Internet connection and other computers on the LAN use it. If you don't have a proxy server, then you won't need to use this option.


Menu Options


File: Open Bookmark File:

Many people have asked me for a way to automatically download all pages stored in their Netscape bookmark file. This option has been added specifically for that reason. In case you don't use Netscape, or don't know what a Netscape bookmark file is, it is an HTML file stored on your hard drive containing links to your favorite (or otherwise notable) pages. In fact, any HTML file stored on your computer may be used as the bookmark file.

If you want to download each page in your bookmark file plus its images, but not spider the sites:

  1. Go into setup, find the <Spider> tab, and set the spider link depth to 1. This will cause QuadSucker/Web to only download the pages, and not spider the websites.
  2. Use File:Open Bookmark File pull-down menu option to set the URL to your bookmark file.
  3. Press Start.

If you want to spider each site in your bookmark file:

  1. Go into setup, find the <Spider> tab, and disable the spider link depth, or set it to the number of levels deep that you want to spider each site plus one.
  2. Use File:Open Bookmark File pull-down menu option to set the URL to your bookmark file.
  3. Press Start.

You don't need to use Netscape to create bookmark files -- you can create them with your favorite HTML editing tool. You can even create them by hand. For example, a bookmark file for the QuadSucker, Sortpics, and SB-Software websites might look like the following:

<a href="http://www.quadsucker.com/">QuadSucker</a>
<a href="http://www.sb-software.com/">SB-Software</a>
<a href="http://www.sortpics.com/">Sortpics</a>
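
If you want to check which links a hand-made bookmark file contains before feeding it to QuadSucker/Web, a few lines of Python will list them. This is a sketch using Python's standard HTML parser; the names LinkCollector and extract_links are made up, and it says nothing about how QuadSucker/Web itself parses the file.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.urls.append(value)


def extract_links(html_text):
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.urls


bookmark_html = """
<a href="http://www.quadsucker.com/">QuadSucker</a>
<a href="http://www.sb-software.com/">SB-Software</a>
<a href="http://www.sortpics.com/">Sortpics</a>
"""
for url in extract_links(bookmark_html):
    print(url)
```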

File: URL Sequence Wizard:

The URL Sequence Wizard lets you download a series of URLs (pages) that have a numerical order to them. For example, if you know a website numbers its pages, then you might want to download the following:

http://website.com/page1.html
http://website.com/page2.html
....
http://website.com/page500.html

The sequence wizard requires you to separate the URL into three parts:

beginning part
middle part
end part

In our example, the beginning part is "http://website.com/page", the middle part is a number from 1 to 500, and the end part is ".html".

The sequence wizard also lets you choose padding. Padding will add zeroes until a number is a certain length. For example, if you set padding to 5, then the following numbers would be used:

00001
00002
...
00500
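
The three parts and the padding combine in a straightforward way, sketched here. The function name sequence_urls and its parameter names are hypothetical and only mirror the wizard's fields.

```python
def sequence_urls(beginning, end, first, last, padding=0):
    """Build beginning + number + end for every number from first to last.

    padding is the minimum digit count; shorter numbers get leading zeroes.
    """
    return [f"{beginning}{str(n).zfill(padding)}{end}"
            for n in range(first, last + 1)]


plain = sequence_urls("http://website.com/page", ".html", 1, 500)
print(plain[0])    # http://website.com/page1.html
print(plain[-1])   # http://website.com/page500.html

padded = sequence_urls("http://website.com/page", ".html", 1, 500, padding=5)
print(padded[0])   # http://website.com/page00001.html
print(padded[-1])  # http://website.com/page00500.html
```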

Setup: Priority Keywords:

QuadSucker/Web maintains a list, also known as a queue, of web pages that are waiting to be downloaded. Normally, this queue is serviced in first in/first out order. The first page to be retrieved is the first page that was entered into the waiting queue, and new pages are inserted at the tail of the queue. It works like a line at a grocery store -- the customer at the front of the line is served first and new customers are added at the back of the line (to do otherwise would be chaos!).

However, sometimes you might want some URLs to have a greater priority than others. For example, you are downloading from a fine art website, and Rafael (a good choice) is your favorite painter. The fine art website is very large and you're anxious to see the Rafael pictures as soon as possible. You'd rather they make their way to the front of the waiting queue, ahead of other less favored painters.

This is where the priority keywords option is used. If a URL is encountered which contains a priority keyword, then it automatically gets placed at the front of the waiting queue. Continuing our example above, if you entered "rafael" into the priority keywords dialog, then any URL which contained the substring "rafael" would get pushed to the front of the list. It's as simple as that.
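
In programming terms this is a double-ended queue: ordinary URLs join at the back, priority URLs jump to the front. The sketch below is illustrative only; the name enqueue is invented, and ignoring upper/lower case when matching is an assumption.

```python
from collections import deque


def enqueue(queue, url, priority_keywords):
    """Add url to the waiting queue, at the front if it matches a keyword."""
    lowered = url.lower()
    if any(keyword.lower() in lowered for keyword in priority_keywords):
        queue.appendleft(url)  # jumps the line
    else:
        queue.append(url)      # normal first in/first out behaviour


waiting = deque()
for found in ["http://art.example/monet/lilies.jpg",
              "http://art.example/rafael/madonna.jpg",
              "http://art.example/degas/dancers.jpg"]:
    enqueue(waiting, found, ["rafael"])

print(list(waiting))
```

The Rafael picture ends up first in the list even though it was found second.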

Setup: Kill Strings:

Kill strings may be used to prevent certain files from being downloaded. For example, if you entered "yahoo" as a kill string, any URL that contained the phrase "yahoo" would not be downloaded. These would include:

Pretty simple, eh? Just enter the things you don't like and they won't be downloaded.

A technical note: Requests are killed as they are retrieved from the queues. This means you can use the queue viewer to see what's coming up and add kill strings for it. The items won't be killed immediately -- they will remain in the queue until QuadSucker gets to them and then will be killed. (If this doesn't make sense to you then just ignore it)
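
The timing described in the technical note can be sketched like this: the kill strings are consulted only when a URL is taken off the queue, so a string added after the URL was queued still takes effect. The name next_url is hypothetical and the matching is a plain substring test.

```python
from collections import deque


def next_url(queue, kill_strings):
    """Pop URLs until one survives the kill strings; None if the queue empties."""
    while queue:
        url = queue.popleft()
        if any(kill in url for kill in kill_strings):
            continue  # killed at retrieval time, never downloaded
        return url
    return None


queue = deque(["http://www.yahoo.com/index.html",
               "http://myhost.com/page2.html"])
kill_strings = []

# The yahoo URL is already queued; adding the kill string now still works.
kill_strings.append("yahoo")
print(next_url(queue, kill_strings))  # http://myhost.com/page2.html
print(next_url(queue, kill_strings))  # None
```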

View: Log Window

The log window presents a list of errors and undesirable events that have occurred since you pressed the start button.

View: Status

The status window presents a summary of how many files have been downloaded, how many bytes have been downloaded, and a count for various types of errors which QuadSucker/Web has had to deal with while processing your requests.

View: Image Indexer

The Image Indexer is designed to allow you to browse image files downloaded via QuadSucker/Web with your web browser. The Image Indexer produces index files containing links and thumbnails for each image. When you execute the image indexer, it will create two index files in each directory in your download path. These files will be used by your browser when you wish to view the images. Everything is automatic, and should be fairly self-explanatory.

 


Interactive Form Fill-out


Many sites use forms to log people in and out of the site. For example, you may come across a page where you have to enter your name, account id, and/or password in order to access a site.

There is a distinction between form-based access and authentication. Forms typically request their credentials in the main window of your browser. Many forms are very elaborate with graphics and logos for the website, textual descriptions of the website, and/or a disclaimer. Authentication, however, presents itself as a small pop-up window which will appear over the top of your browser window. This section describes forms; if you're looking for information on authentication, see the Authentication section above.

The problem with Forms is that they are difficult to automate -- each form is different. QuadSucker cannot determine how a form needs to be filled out by itself, since forms are website-dependent. However, there is an interactive feature that allows you to log into a website using a browser built into QuadSucker/Web.

Here are some steps to get you started:

  1. First you must identify the page with the form -- navigate to the site with your browser and find the page that requests the login details. You can usually copy the URL directly out of your web browser's URL window.
  2. Next, enter that URL into QuadSucker's main window.
  3. Press the <Form> button, located next to the <Start> and <Resume> buttons.
  4. QuadSucker will bring up the form in a built-in web browser and let you fill it out.
  5. Press a <Submit> button on the form. Sometimes a website may call the button something other than <Submit>, but it should be fairly obvious.
  6. QuadSucker will close the browser window and begin downloading with the very next page that occurs after the form.

 


Command Line Options


QuadWeb supports a number of command line options that can be used to automate the program. You may, for example, cause QuadWeb to launch, automatically download a specific URL, and then automatically exit. The following command line options are available:

-autoexit   automatically exit the program when downloading is complete  (registered users only)
-autominimize   automatically minimize the program window (registered users only)
-autostart   automatically connect and start downloading
-singlepage   set single page mode (download page plus images)
-singlefile   set single file mode
-entiresite   set entire site mode (all pages and all files on one site)
-multisite   set multiple site mode (spider multiple web sites)

In addition, you can specify a URL on the command line. Some Examples:


UltraSucker vs QuadSucker


UltraSucker is a special stripped-down high performance version of QuadSucker. It is extremely similar in function and configuration with the following differences:

QuadSucker should suffice for most people. The thumbnails are nice, and four threads achieve very good utilization of the network connection. However, if you have a good high-speed connection to the Internet (ISDN, cable modem, DirectPC, T1, DSL, etc), and you are dealing with a slow end-server, then UltraSucker/Web may gain you some performance at the expense of losing a few bells and whistles.

Your Internet connection has a fixed capacity. If you're using a 56kbps modem, then you can only download at 56kbps. It doesn't matter if you're pulling 4 web pages at a time, 12 web pages at a time, or a million pages at once. Only 56kbps is going to come down the line. QuadSucker/Web will work just fine for you.

However, let's say that you're using a faster connection, such as DSL at 512kbps. Assume the web server you're connecting to is very slow, giving you huge latencies and dishing out documents at about 50kbps. That means 462kbps of your DSL connection is under-utilized. If you could pull twelve web pages at a time, you could fill your DSL line to the maximum. That's where UltraSucker fits in. There are other factors to consider, which make the argument more complex, but the basic truth is the same.
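
The arithmetic behind this argument can be worked through in a few lines. This is an idealized model that ignores the "other factors" mentioned above: it assumes every connection delivers the same rate and that the line capacity is the only cap. The function name throughput_kbps is invented for the example.

```python
def throughput_kbps(line_kbps, per_connection_kbps, connections):
    """Total download rate: what the server delivers, capped by the line."""
    return min(line_kbps, per_connection_kbps * connections)


# 512kbps DSL line, slow server delivering about 50kbps per connection.
for connections in (1, 4, 12):
    print(connections, throughput_kbps(512, 50, connections))
# 1 50
# 4 200
# 12 512

# 56kbps modem: the line is the bottleneck however many connections you open.
print(throughput_kbps(56, 50, 4), throughput_kbps(56, 50, 12))  # 56 56
```

With one connection the DSL line carries 50kbps (462kbps idle), four connections reach 200kbps, and twelve connections fill the line.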

You have a 56kbps modem connection to the Internet: use QuadSucker/Web.
You have a fast connection (128kbps or more) to the Internet: try UltraSucker/Web and see if performance increases. If not, go back to QuadSucker/Web.
The web server you're talking to is really, really slow: use UltraSucker/Web.
You like thumbnail images to be displayed while you're downloading so you can see what is going on: use QuadSucker/Web.

Registration


This program is distributed as "Shareware". This means you are allowed to try it for a limited amount of time to determine if it suits your needs, and then you must pay for it in order to continue using it. I don't set a specific evaluation period -- you may take as long as you feel necessary, but I think 30 days is a good guideline, and if you're using the unregistered version after 30 days, then you ought to feel very very guilty. :(

However, you are allowed to fully evaluate the program during the evaluation period. Make sure it works to your satisfaction before you register -- that's what the evaluation period is for.

QuadSucker registration is included in my SB-Software registration package. By paying the modest $26.95 registration fee, you are registered to not only QuadSucker/Web, but all of my 'SB' series shareware programs as well. Likewise, if you've already registered for one of my other programs (such as SBNews or SBWcc), your registration is automatically good for QuadSucker/Web.

I feel this is the fairest and simplest policy -- other authors will make you register for each and every program separately, even when there is considerable overlap between the products. With SB-Software, you get it all!

Registration is good for life, and includes all past, present, and future versions.

For full details of the registration policy, and a list of programs included in the registration fee, see http://www.sb-software.com/credit/register.html.

You may register online via credit card at http://www.sb-software.com/credit/


Contacting the Author


You may reach me at the QuadSucker website, http://www.quadsucker.com/, or the SB-Software website, http://www.sb-software.com/.

I can be reached via email at smbaker@quadsucker.com or smbaker@sb-software.com.

My US mailing address is available on the website somewhere.


Revision History