For this homework, you will clean HTML of all markup, leaving only the text content. This includes converting or removing HTML entities, removing HTML tags and comments, and removing some HTML block elements.

Motivation

Starting with Project 4 Crawl, your search engine project will index web pages instead of text files. Before doing that, it must be able to remove certain HTML tags so that it indexes only the content (not the markup). Since this task does not require validating the HTML tags or parsing the DOM (document object model), your project can use simple regular expressions to detect and remove these tags.

<aside> Before using this homework with Project 4 Crawl, you need to make some modifications. Specifically, you eventually need to make your HtmlFetcher, HtmlCleaner, and LinkFinder classes work together so that you fetch HTML, then parse links after stripping block elements, but before stripping tags and entities. (This is not required for the homework, however.)

</aside>

Handling Entities

HTML entities are how special characters are encoded in HTML. For example, the entity &copy; renders as the © copyright symbol in HTML and is the U+00A9 Unicode character.

Each HTML entity should be converted from the HTML representation to its Unicode character when possible using the StringEscapeUtils.unescapeHtml4 method of the Apache Commons Text third-party library.

Any entities that cannot be converted must be removed. (This includes some valid HTML5 entities that the library does not yet support.)
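
The removal step described above can be sketched with a single regular expression. This is only one possible approach, not the required solution: the class name, method name, and pattern are hypothetical, and the sketch assumes StringEscapeUtils.unescapeHtml4 has already been called on the text so that anything still shaped like an entity is one the library could not convert.

```java
final class LeftoverEntitySketch {
    /**
     * Removes anything that still looks like an HTML entity.
     * Assumption: known entities were already converted by
     * StringEscapeUtils.unescapeHtml4 before this method runs.
     */
    static String removeLeftoverEntities(String text) {
        // "&", then one or more characters that are not whitespace, ";" or "&",
        // then the closing ";". A lone ampersand followed by a space is left alone.
        return text.replaceAll("&[^\\s;&]+;", "");
    }

    public static void main(String[] args) {
        System.out.println(removeLeftoverEntities("fish &bogus; chips"));
        System.out.println(removeLeftoverEntities("Tom & Jerry"));
    }
}
```

Excluding whitespace from the character class keeps an ordinary ampersand in running text from being treated as the start of an entity. Whether a removed entity should be replaced with nothing or with a space depends on the homework's Javadoc and tests, which this sketch does not assume.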

Hints

Below are some hints that may help with this homework assignment:

View the Javadoc comments using the "Javadoc" view in Eclipse. It will render the HTML in the comments properly.
Each method can be completed using a combination of regular expressions and the String.replaceAll method.
You will have to generate a regular expression for the HtmlCleaner.stripElement method based on the element name provided as a parameter. It can be helpful to use Java format strings for this, but it is not required.
If you are getting stack overflow exceptions, it is likely your regular expressions are too complicated, too broad (e.g. using . to match any character too often), or too greedy.

These hints are optional. There may be multiple approaches to solving this homework.
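
As an illustration of the format-string hint, here is a minimal sketch of building a regular expression from an element name. The class name, the exact pattern, and the choice to replace each match with a single space are assumptions made for this example; check the Javadoc and tests of HtmlCleaner.stripElement for the behavior actually expected.

```java
import java.util.regex.Pattern;

final class StripElementSketch {
    /**
     * Removes an element, including its opening tag, body, and closing tag.
     * Hypothetical helper; the single-space replacement is an assumption.
     */
    static String stripElement(String html, String name) {
        // (?is)      case-insensitive, and let "." match line breaks too
        // <%1$s\b    opening tag name, ending at a word boundary
        // [^>]*>     any attributes, up to the end of the opening tag
        // .*?        reluctant match of the element body
        // </%1$s\s*> closing tag, allowing whitespace before ">"
        String regex = String.format(
                "(?is)<%1$s\\b[^>]*>.*?</%1$s\\s*>", Pattern.quote(name));
        return html.replaceAll(regex, " ");
    }

    public static void main(String[] args) {
        String html = "<p>Hello</p><style type=\"text/css\">\n"
                + "body { color: red; }\n</style><p>World</p>";
        System.out.println(stripElement(html, "style"));
    }
}
```

The reluctant quantifier stops at the first closing tag instead of the last one, and the (?s) flag avoids the need for a repeated group such as (.|\n)*, which is a common source of stack overflow errors in Java regular expressions.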