For this homework, you will use the HttpsFetcher code from lecture to create an HtmlFetcher class that efficiently downloads only HTML content from web servers using sockets and the HTTP/S protocol.

Motivation

Your search engine project, starting with Project 4 Crawl, will index web pages instead of text files. Before doing that, it must be able to download HTML web pages from a web server over a socket connection. For efficiency, content should not be downloaded unless certain conditions are met first.

<aside> Before using this homework with Project 4 Crawl, you need to make some modifications. Specifically, you eventually need to make your HtmlFetcher, HtmlCleaner, and LinkFinder classes work together so that you fetch HTML, then parse links after stripping block elements, but before stripping tags and entities. (This is not required for the homework, however.)

</aside>

Hints

Below are some hints that may help with this homework assignment:

It will help to have an HTTP reference. The MDN Web Docs have nice HTTP references, including references for HTTP headers and HTTP status codes.
Do not fetch the entire page unless necessary! For the most efficient solution, do not directly use HttpsFetcher.fetch(URI uri) in your implementation. Instead, set up the sockets and get the headers. Based on those headers, decide how to proceed.
Some of these methods can be implemented using regular expressions, but that is not required.
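
The "headers first" hint above could be sketched roughly as follows. This is a minimal illustration, not the expected solution: the class name HeaderSketch, its Response holder, and its method names are all hypothetical, and the condition used here (status 200 with a Content-Type starting with text/html) is an assumption about what "certain conditions" might mean. It does not use HttpsFetcher at all, since that class's methods are not shown in this handout.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.Socket;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import javax.net.ssl.SSLSocketFactory;

/** Hypothetical sketch: read only the HTTP headers, then decide whether to continue. */
class HeaderSketch {
    /** Hypothetical holder for the status code and the header fields. */
    static class Response {
        final int status;
        final Map<String, String> headers;

        Response(int status, Map<String, String> headers) {
            this.status = status;
            this.headers = headers;
        }
    }

    /** Returns the numeric code from a status line such as "HTTP/1.1 200 OK", or -1. */
    static int parseStatus(String statusLine) {
        if (statusLine == null) return -1;
        String[] parts = statusLine.trim().split("\\s+");
        if (parts.length < 2) return -1;
        try {
            return Integer.parseInt(parts[1]);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** Reads up to the blank line that ends the headers; the body is left unread. */
    static Response readHeaders(BufferedReader reader) throws IOException {
        int status = parseStatus(reader.readLine());
        Map<String, String> headers = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null && !line.isEmpty()) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                // Header names are case-insensitive; a repeated header overwrites the earlier one.
                String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
                headers.put(name, line.substring(colon + 1).trim());
            }
        }
        return new Response(status, headers);
    }

    /** Convenience wrapper for trying the parser on a canned response string. */
    static Response parse(String raw) {
        try {
            return readHeaders(new BufferedReader(new StringReader(raw)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Assumed condition: only a 200 response with an HTML content type is worth downloading. */
    static boolean isHtml(Response response) {
        String type = response.headers.get("content-type");
        return response.status == 200 && type != null
                && type.toLowerCase(Locale.ROOT).startsWith("text/html");
    }

    /** Opens a TLS socket, sends a GET request, and reads only the headers. */
    static Response fetchHeaders(URI uri) throws IOException {
        int port = uri.getPort() < 0 ? 443 : uri.getPort();
        String path = uri.getRawPath() == null || uri.getRawPath().isEmpty() ? "/" : uri.getRawPath();
        if (uri.getRawQuery() != null) path += "?" + uri.getRawQuery();
        try (Socket socket = SSLSocketFactory.getDefault().createSocket(uri.getHost(), port);
                PrintWriter out = new PrintWriter(
                        new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8));
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {
            out.print("GET " + path + " HTTP/1.1\r\n");
            out.print("Host: " + uri.getHost() + "\r\n");
            out.print("Connection: close\r\n\r\n");
            out.flush();
            // Closing the socket here discards the unread body. A real fetcher would
            // keep reading from the same reader when the headers look acceptable.
            return readHeaders(in);
        }
    }

    public static void main(String[] args) {
        Response response = parse("HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<p>hi</p>");
        System.out.println(response.status + " " + isHtml(response));
    }
}
```

The main method only exercises the parser on a canned response, so it runs without network access. The point of the design is that the body is never read unless the headers pass the check.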

These hints are optional. There may be multiple approaches to solving this homework.
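
As one illustration of the regular expression hint, the status code and content type could be extracted with patterns like the ones below. This is only a sketch under assumptions: the class name RegexHeaderSketch, the method names, and the exact patterns are hypothetical, and plain string methods would work just as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: regex-based checks on an HTTP status line and Content-Type value. */
class RegexHeaderSketch {
    // Matches status lines such as "HTTP/1.1 200 OK" and captures the 3-digit code.
    private static final Pattern STATUS =
            Pattern.compile("^HTTP/\\d(?:\\.\\d)?\\s+(\\d{3})\\b");

    // Matches a Content-Type value that starts with text/html, ignoring case.
    private static final Pattern HTML_TYPE =
            Pattern.compile("^\\s*text/html\\b", Pattern.CASE_INSENSITIVE);

    /** Returns the status code from the status line, or -1 if the line does not match. */
    static int statusCode(String statusLine) {
        if (statusLine == null) return -1;
        Matcher matcher = STATUS.matcher(statusLine);
        return matcher.find() ? Integer.parseInt(matcher.group(1)) : -1;
    }

    /** Returns true if the Content-Type header value names the text/html media type. */
    static boolean isHtmlType(String contentType) {
        return contentType != null && HTML_TYPE.matcher(contentType).find();
    }

    public static void main(String[] args) {
        System.out.println(statusCode("HTTP/1.1 301 Moved Permanently"));
        System.out.println(isHtmlType("text/html; charset=UTF-8"));
    }
}
```

The word boundary after text/html keeps a value like text/htmlx from matching while still allowing parameters such as a charset to follow.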