For this homework, your code must find all of the HTTP/S URIs from well-formed hyperlinks (or just links) in a validating HTML web page. Specifically, your code must return a list of the HTTP or HTTPS URIs (or links) found in the href attribute of the a (anchor) tags within an HTML web page. This should not include URIs in the href attribute of the link tag!
Helper methods are provided to make sure URIs are in a properly-encoded normalized form and to convert from relative to absolute URIs. See below for more.
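The provided helper methods are not shown in this section, so the following is only a rough stand-in, assuming the standard java.net.URI class is an acceptable way to illustrate the idea. The class name UriResolveSketch, the method toAbsoluteHttp, and the base path /docs/index.html are hypothetical and are not part of the homework template.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriResolveSketch {
    /**
     * Resolves an href value against the URI of the page it came from.
     * Returns the absolute, normalized URI if its scheme is http or https,
     * or null if the href cannot be parsed or uses another scheme.
     */
    static URI toAbsoluteHttp(URI base, String href) {
        try {
            // resolve() turns a relative reference into an absolute URI;
            // normalize() removes "." and ".." path segments.
            URI resolved = base.resolve(new URI(href)).normalize();
            String scheme = resolved.getScheme();
            boolean isHttp = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            return isHttp ? resolved : null;
        } catch (URISyntaxException e) {
            // Malformed href values (for example, ones with raw spaces) are skipped.
            return null;
        }
    }

    public static void main(String[] args) {
        // Hypothetical page location, used only for illustration.
        URI base = URI.create("https://www.cs.usfca.edu/docs/index.html");
        System.out.println(toAbsoluteHttp(base, "style.css"));        // https://www.cs.usfca.edu/docs/style.css
        System.out.println(toAbsoluteHttp(base, "../about/"));        // https://www.cs.usfca.edu/about/
        System.out.println(toAbsoluteHttp(base, "mailto:a@b.edu"));   // null
    }
}
```

For the homework itself, prefer the provided helper methods, which may handle encoding details that this sketch ignores.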
Your search engine project starting with Project 4 Crawl will index web pages instead of text files. To support crawling, it must be able to find and extract hyperlinks on the web page. Since this task does not require validating the HTML tags or parsing the DOM (document object model), your code can use regular expressions to find and extract those hyperlinks.
<aside> Before using this homework with Project 4 Crawl, you need to make some modifications. Specifically, you eventually need to make your HtmlFetcher, HtmlCleaner, and LinkFinder classes work together so that you fetch HTML, then parse links after stripping block elements, but before stripping tags and entities. (This is not required for the homework, however.)
</aside>
You will need to be familiar with the HTML anchor tag <a> and URI/URL protocols for this assignment. Some resources include:
The anchor tag is used to create hyperlinks (or just links) on web pages. For example:
<a href="https://www.cs.usfca.edu/">USF CS</a>
The above code will generate the link USF CS, where the link text is USF CS and the link destination is the URL https://www.cs.usfca.edu/.
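To make the two parts of an anchor tag concrete, here is a minimal sketch that pulls the destination and the link text out of the example above with a regular expression. It assumes href is the only attribute and that its value is double quoted. The class name AnchorPartsSketch and the method destinationAndText are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnchorPartsSketch {
    // Group 1 captures the quoted href value; group 2 captures the link text.
    // Deliberately narrow: href must be the only attribute of the tag.
    static final Pattern SIMPLE_ANCHOR = Pattern.compile(
            "<a\\s+href\\s*=\\s*\"([^\"]*)\"\\s*>(.*?)</a\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns {destination, text} for the first matching anchor, or null if none. */
    static String[] destinationAndText(String html) {
        Matcher matcher = SIMPLE_ANCHOR.matcher(html);
        return matcher.find() ? new String[] { matcher.group(1), matcher.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] parts = destinationAndText("<a href=\"https://www.cs.usfca.edu/\">USF CS</a>");
        System.out.println("destination: " + parts[0]); // destination: https://www.cs.usfca.edu/
        System.out.println("text: " + parts[1]);        // text: USF CS
    }
}
```

Only the destination matters for this homework; the link text is captured here just to show where each piece sits in the tag.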
For simplification, assume the value of the href attribute will always be enclosed in double quotation marks ("). For example, it is not required to parse this link:
<a href=https://www.cs.usfca.edu/>USF CS</a>
The link must be placed in the href attribute of the a tag, but not all a tags will have this attribute. For example, this is a valid a tag without the href attribute:
<a name="home" class="bookmark">Home</a>
The href attribute may also appear in other tags. For example, this is a valid link tag that includes a style sheet:
<link rel="stylesheet" type="text/css" href="style.css">
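The rules above (double-quoted values only, a tags that may lack href, and href attributes on other tags such as link) can be sketched with a single regular expression. This is one possible approach, not the required solution. The class name HrefFilterSketch, the method anchorHrefs, and the example.com entry in the sample HTML are assumptions made for illustration. The sketch returns raw href values and leaves the HTTP/S check and the relative-to-absolute conversion to a separate step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HrefFilterSketch {
    // "<a" must be followed by whitespace, so tags like <link> or <abbr> never match.
    // The optional group skips earlier attributes without crossing the closing ">",
    // and the whitespace right before "href" rules out names such as data-href.
    // The value must start with a double quote, so unquoted values are ignored.
    static final Pattern ANCHOR_HREF = Pattern.compile(
            "<a\\s+(?:[^>]*?\\s)?href\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    /** Returns the quoted href values of all anchor tags, in document order. */
    static List<String> anchorHrefs(String html) {
        List<String> found = new ArrayList<>();
        Matcher matcher = ANCHOR_HREF.matcher(html);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String html = String.join("\n",
                "<link rel=\"stylesheet\" type=\"text/css\" href=\"style.css\">",
                "<a name=\"home\" class=\"bookmark\">Home</a>",
                "<a href=https://www.cs.usfca.edu/>USF CS</a>",
                "<a class=\"nav\" href=\"https://www.cs.usfca.edu/\">USF CS</a>",
                "<A HREF=\"http://example.com/page\">Example</A>");
        // Only the last two tags are quoted anchor links.
        System.out.println(anchorHrefs(html)); // [https://www.cs.usfca.edu/, http://example.com/page]
    }
}
```

A pattern like this can still be fooled by unusual markup, such as a ">" character inside an earlier attribute value, which is one reason to test against a wide range of inputs.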