For this homework, you will use the HttpsFetcher code from lecture to create an HtmlFetcher class that efficiently downloads only HTML content from web servers using sockets and the HTTP/S protocol.

Motivation

Starting with Project 4 Crawl, your search engine project will index web pages instead of text files. Before doing that, it must be able to download HTML web pages from a web server over a socket connection. For efficiency, content should not be downloaded unless certain conditions are met first.
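As a rough illustration of checking conditions before downloading, here is a minimal sketch. It assumes the conditions are a 200 status code and an HTML content type; the homework itself may require different or additional checks. The class name HeaderCheckSketch and its methods are hypothetical and are not part of the HttpsFetcher code from lecture.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not from lecture) that inspects already-parsed response headers. */
public class HeaderCheckSketch {
    /** Returns true if any Content-Type header value starts with "text/html". */
    public static boolean isHtml(Map<String, List<String>> headers) {
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            String name = entry.getKey();
            // Header names are case-insensitive in HTTP.
            if (name != null && name.equalsIgnoreCase("Content-Type")) {
                for (String value : entry.getValue()) {
                    if (value.trim().toLowerCase(Locale.ROOT).startsWith("text/html")) {
                        return true;
                    }
                }
            }
        }
        return false;
    }

    /** Parses the code from a status line such as "HTTP/1.1 200 OK"; returns -1 if absent. */
    public static int statusCode(String statusLine) {
        String[] parts = statusLine.trim().split("\\s+");
        if (parts.length >= 2 && parts[1].matches("\\d{3}")) {
            return Integer.parseInt(parts[1]);
        }
        return -1;
    }

    /** Assumed conditions: only read the body for a 200 response that is HTML. */
    public static boolean shouldDownload(String statusLine, Map<String, List<String>> headers) {
        return statusCode(statusLine) == 200 && isHtml(headers);
    }

    public static void main(String[] args) {
        Map<String, List<String>> headers =
                Map.of("Content-Type", List.of("text/html; charset=UTF-8"));
        System.out.println(shouldDownload("HTTP/1.1 200 OK", headers));
    }
}
```

The idea is that the status line and headers arrive before the body, so a fetcher can read them first and stop reading from the socket when the checks fail.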

<aside> Before using this homework with Project 4 Crawl, you need to make some modifications. Specifically, you eventually need to make your HtmlFetcher, HtmlCleaner, and LinkFinder classes work together so that you fetch HTML, then parse links after stripping block elements, but before stripping tags and entities. (This is not required for the homework, however.)

</aside>

Hints

Below are some hints that may help with this homework assignment:

These hints are optional. There may be multiple approaches to solving this homework.