For this homework, you will clean HTML of all markup, leaving only the text content. This includes converting or removing HTML entities, removing HTML tags and comments, and removing some HTML block elements.

Motivation

Starting with Project 4 Crawl, your search engine project will index web pages instead of text files. Before doing that, it must be able to remove certain HTML tags so that it indexes only the content (not the markup). Because this task does not require validating the HTML tags or parsing the DOM (document object model), your project can use simple regular expressions to detect and remove these tags.
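As a rough illustration of the regular-expression approach, here is a minimal sketch that removes comments and then tags. The class name TagSketch and its method names are hypothetical and not part of the homework template. The patterns assume that a tag never contains a ">" character inside an attribute value, which is acceptable only because the HTML is not being validated.

```java
import java.util.regex.Pattern;

/** Hypothetical helper; the names here are illustrative, not from the assignment. */
class TagSketch {
    // DOTALL lets '.' match newlines so a comment may span several lines;
    // the reluctant '*?' stops at the first closing '-->'.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // '<', then any characters other than '>' (newlines included), then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Removes HTML comments, including multi-line ones. */
    static String stripComments(String html) {
        return COMMENT.matcher(html).replaceAll("");
    }

    /** Removes anything that looks like an opening or closing tag. */
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Hello, <b>world</b>!<!-- a\nnote --></p>";
        // Comments are removed first so their contents are never mistaken for tags.
        System.out.println(stripTags(stripComments(html))); // prints "Hello, world!"
    }
}
```

Whether removed markup should become an empty string or a space is a design choice this sketch does not settle; it uses an empty string throughout.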

<aside> Before using this homework with Project 4 Crawl, you need to make some modifications. Specifically, you eventually need to make your HtmlFetcher, HtmlCleaner, and LinkFinder classes work together so that you fetch HTML, then parse links after stripping block elements, but before stripping tags and entities. (This is not required for the homework, however.)

</aside>

Handling Entities

HTML entities encode special characters in HTML. For example, the entity &copy; renders as the copyright symbol © and corresponds to the Unicode character U+00A9.

Each HTML entity should be converted from the HTML representation to its Unicode character when possible using the StringEscapeUtils.unescapeHtml4 method of the Apache Commons Text third-party library.

Any entities that cannot be converted must be removed. (This includes some valid HTML 5 entities, as those are not yet supported in this library.)
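The removal step can be sketched with the standard library alone. The example below assumes the text has already been passed through StringEscapeUtils.unescapeHtml4 and that any entity the library did not recognize is still present in the string. The class name EntitySketch, the method name stripEntities, and the made-up entity &bogus; are all hypothetical, and the pattern is one possible definition of "looks like an entity", not the required one.

```java
import java.util.regex.Pattern;

/** Hypothetical helper; the class and method names are not part of the assignment. */
class EntitySketch {
    // An ampersand, then one or more characters that are not whitespace,
    // ';', or '&', then a closing semicolon (for example &bogus; or &#9829;).
    private static final Pattern ENTITY = Pattern.compile("&[^\\s;&]+;");

    /** Removes anything that still looks like an HTML entity. */
    static String stripEntities(String text) {
        return ENTITY.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // &bogus; is a made-up entity standing in for one the library cannot convert.
        System.out.println(stripEntities("hello&bogus;world")); // prints "helloworld"

        // A bare ampersand followed by whitespace is not treated as an entity.
        System.out.println(stripEntities("fish & chips; yum")); // prints "fish & chips; yum"
    }
}
```

Running the conversion before the removal matters: if the order were reversed, entities the library could have converted would be deleted instead.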

Hints

Below are some hints that may help with this homework assignment:

These hints are optional. There may be multiple approaches to solving this homework.