For this Project 4 Crawl milestone, your project must maintain the functionality from the Project v3.1 Tests assignment, as well as create a web crawler that can add a single web page to the inverted index.
You must complete the following assignments before beginning to work on this one:
Your main method must be placed in a class named Driver and must process the following additional command-line arguments:
-html [seed] where the flag -html indicates the next argument [seed] is the seed URI the web crawler should download and process to build the index. See the “HTML Processing” section for how the download and processing must be completed.
If the -html flag is present, assume multithreading is enabled as if the -threads flag was also provided.
In other words, both the -threads and -html flags will trigger multithreaded classes to be initialized. If the -threads flag is not present, use the default number of threads to initialize the work queue.
These are in addition to the command-line arguments from the previous Project v3.1 Tests assignment.
The command-line flag/value pairs may be provided in any order or not at all. Do not convert paths to absolute form when processing command-line input!
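The flag rules above can be sketched roughly as follows. This is only an illustration, not the required design: the class name CrawlArguments, its methods, and the DEFAULT_THREADS value of 5 are all assumptions made for the example (use whatever default your earlier project defines).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the required API. */
class CrawlArguments {
    /** Assumed value; the real default comes from the earlier project. */
    static final int DEFAULT_THREADS = 5;

    /** Maps each flag (such as "-html") to its value, or null if no value follows. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> flags = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (isFlag(args[i])) {
                // Flags may appear in any order; a value is optional.
                boolean hasValue = i + 1 < args.length && !isFlag(args[i + 1]);
                flags.put(args[i], hasValue ? args[++i] : null);
            }
        }
        return flags;
    }

    /** A flag starts with a dash followed by a non-digit character. */
    static boolean isFlag(String arg) {
        return arg != null && arg.length() > 1
                && arg.charAt(0) == '-' && !Character.isDigit(arg.charAt(1));
    }

    /** Either -threads or -html turns on the multithreaded classes. */
    static boolean isMultithreaded(Map<String, String> flags) {
        return flags.containsKey("-threads") || flags.containsKey("-html");
    }

    /** Falls back to the default when -threads is missing or not a positive integer. */
    static int threadCount(Map<String, String> flags) {
        try {
            int threads = Integer.parseInt(flags.get("-threads"));
            return threads > 0 ? threads : DEFAULT_THREADS;
        }
        catch (NumberFormatException e) {
            return DEFAULT_THREADS;
        }
    }
}
```

Note that the seed value is stored exactly as given, and no path is converted to absolute form.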
Output user-friendly error messages in the case of exceptions or invalid input. Under no circumstance should your main() method output a stack trace to the user!
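One possible way to meet this requirement for the seed argument is sketched below. The SeedValidator class, its parseSeed method, and the message wording are hypothetical and only illustrate the idea of catching the exception and printing a short message instead of a stack trace.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper for illustration; names and messages are assumptions. */
class SeedValidator {
    /** Returns the parsed seed URI, or null after printing a short message. */
    static URI parseSeed(String seed) {
        if (seed == null || seed.isBlank()) {
            System.err.println("Missing seed URI after the -html flag.");
            return null;
        }
        try {
            return new URI(seed);
        }
        catch (URISyntaxException e) {
            // Report the problem in plain language; never print the stack trace.
            System.err.println("Unable to parse seed URI: " + seed);
            return null;
        }
    }
}
```

A main() method using this sketch would check for null and skip the crawl rather than let the exception propagate.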
Web pages must be requested using sockets and HTTP/S from the web server as follows:
If the response has a 200 OK HTTP/S response status code and HTML content type, then download, process, and add the HTML content to the inverted index.
If the response is a redirect, follow the redirects until a 200 OK is returned. Associate the final response with the original cleaned URI and process it. For example, the URI ~cs212/redirect/one eventually redirects to ~cs212/simple/hello.html. The web crawler will associate the HTTPS response of ~cs212/simple/hello.html with the original URI ~cs212/redirect/one when processing.
For efficiency (and to avoid being blocked or rate-limited by the web server), do not download unnecessary content and only download necessary content exactly once from the web server. Specifically:
Download the content only if the response headers report an HTML content type and a 200 OK status code. For example, only the headers (not the content) will be downloaded for a large text file without the text/html content-type, or for a web page with a 404 status.
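The header-first rule can be illustrated with a small decision function. This is a minimal sketch under stated assumptions: the status line and header lines have already been read from the socket (up to the blank line that ends the headers), and the ResponseCheck class, its Action enum, and its method names are hypothetical rather than part of the assignment.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: decides what to do after reading only the headers. */
class ResponseCheck {
    enum Action { DOWNLOAD, REDIRECT, SKIP }

    /** Parses the code from a status line such as "HTTP/1.1 200 OK"; -1 if malformed. */
    static int statusCode(String statusLine) {
        String[] parts = statusLine.trim().split("\\s+");
        try {
            return parts.length > 1 ? Integer.parseInt(parts[1]) : -1;
        }
        catch (NumberFormatException e) {
            return -1;
        }
    }

    /** Returns the value of the first matching header (case-insensitive), or null. */
    static String header(List<String> headerLines, String name) {
        String prefix = name.toLowerCase(Locale.ROOT) + ":";
        for (String line : headerLines) {
            if (line.toLowerCase(Locale.ROOT).startsWith(prefix)) {
                return line.substring(prefix.length()).trim();
            }
        }
        return null;
    }

    /** Downloads only 200 OK HTML; follows 3xx with a Location; skips everything else. */
    static Action decide(String statusLine, List<String> headerLines) {
        int status = statusCode(statusLine);
        if (status == 200) {
            String type = header(headerLines, "Content-Type");
            boolean html = type != null
                    && type.toLowerCase(Locale.ROOT).startsWith("text/html");
            return html ? Action.DOWNLOAD : Action.SKIP;
        }
        if (status >= 300 && status < 400 && header(headerLines, "Location") != null) {
            return Action.REDIRECT;
        }
        return Action.SKIP;
    }
}
```

With this shape, a caller would close the socket without reading the body whenever the result is SKIP, and on REDIRECT would request the Location value while keeping the original cleaned URI as the key used for the index.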