For this Project 4 Crawl milestone, your project must extend the functionality from the Project v4.0 Tests assignment to support a multithreaded web crawl that can add multiple pages to the inverted index.
You must complete the prior project assignments (including Project v4.0 Tests) before beginning to work on this one.
Your `main` method must be placed in a class named `Driver` and must process the following additional command-line arguments:
- `-html [seed]` must be modified such that it also enables multithreading even if the `-threads` flag is not present. It must also modify how links are processed on web pages. See the "Link Processing" section below for details.
- `-crawl [total]` may optionally be provided such that the flag `-crawl` indicates the next argument `[total]` is the total number of URLs to crawl when the `-html` flag is provided. If the `-crawl` flag is not provided, or the `[total]` argument is not provided or is an invalid number, then the `-html` flag should download and process only 1 web page (the seed). See the "Web Crawl" section below for how to determine which pages to crawl.
These are in addition to the command-line arguments from the previous Project v3.1 Tests assignment.
The command-line flag/value pairs may be provided in any order or not at all. Do not convert paths to absolute form when processing command-line input!
Output user-friendly error messages in the case of exceptions or invalid input. Under no circumstance should your `main()` method output a stack trace to the user!
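As an illustration only (not the required implementation), the flag handling described above might look like the following minimal sketch. It assumes a simple hand-rolled parser; the `CrawlArguments` class and its method names are hypothetical, so adapt the idea to whatever argument parsing your project already provides.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Minimal sketch of the -html and -crawl flag handling described above.
 * CrawlArguments is a hypothetical stand-in for your project's own parser.
 */
public class CrawlArguments {
    /** Stores flag/value pairs parsed from the command line. */
    private final Map<String, String> map = new HashMap<>();

    public CrawlArguments(String[] args) {
        // Treat any argument starting with '-' as a flag; the next
        // argument (if it is not another flag) becomes its value.
        for (int i = 0; i < args.length; i++) {
            if (args[i].startsWith("-")) {
                String value = null;
                if (i + 1 < args.length && !args[i + 1].startsWith("-")) {
                    value = args[i + 1];
                }
                map.put(args[i], value);
            }
        }
    }

    public boolean hasFlag(String flag) {
        return map.containsKey(flag);
    }

    /** Returns the flag value as an int, or the default if missing or invalid. */
    public int getInteger(String flag, int defaultValue) {
        try {
            return Integer.parseInt(map.get(flag));
        }
        catch (NumberFormatException e) {
            // Covers a missing flag, a missing value, and a non-numeric value.
            return defaultValue;
        }
    }

    public static void main(String[] args) {
        CrawlArguments parsed = new CrawlArguments(args);

        if (parsed.hasFlag("-html")) {
            // -html always enables multithreading, even without -threads.
            // If -crawl or its [total] value is missing or invalid, crawl
            // only the seed page (a total of 1).
            int total = parsed.getInteger("-crawl", 1);
            System.out.println("Crawling up to " + total + " page(s).");
        }
    }
}
```

Note how defaulting to `1` inside `getInteger` handles all three failure cases (no `-crawl` flag, no `[total]` value, invalid number) in one place, so only the seed page is processed.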
The way that web pages are processed from Project v4.0 Tests must be modified such that:

- Links must not be processed from content inside comments or the `head`, `style`, `script`, `noscript`, and `svg` block elements.
- Links must be processed from each `a` anchor tag and `href` property within the HTML content, in the order they are provided on the page.