For this homework assignment, you will create a class that is able to clean and parse text into stemmed words using the SnowballStemmer class. Use UTF_8 and try-with-resources when writing your files. Do not use the java.io.File class.
Before we can index and store data for our search engine, we need to figure out how to process text into a consistent form (such as converting to lowercase). Many search engines also stem words (converting words like practicing and practiced to a common root practic) to help return relevant results no matter the form of the word used within the text.
We will rely on the Snowball stemmer algorithm found within the Apache OpenNLP library for stemming in this class.
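The cleaning step described above might be sketched as follows. This is only one possible approach, not the required one. The class name TextCleanerSketch, the method names clean and parse, and the decision to discard every character that is not a letter or whitespace are all assumptions made for illustration.

```java
import java.util.regex.Pattern;

class TextCleanerSketch {
    // Assumption: any character that is not a letter or whitespace is discarded.
    private static final Pattern NON_LETTERS = Pattern.compile("(?U)[^\\p{Alpha}\\p{Space}]+");

    // Runs of whitespace separate words.
    private static final Pattern WHITESPACE = Pattern.compile("(?U)\\p{Space}+");

    /** Removes non-letter characters and converts the text to lowercase. */
    static String clean(String text) {
        return NON_LETTERS.matcher(text).replaceAll("").toLowerCase();
    }

    /** Cleans the text and splits it into words; returns an empty array for blank text. */
    static String[] parse(String text) {
        String cleaned = clean(text).strip();
        return cleaned.isEmpty() ? new String[0] : WHITESPACE.split(cleaned);
    }

    public static void main(String[] args) {
        // prints hello|world|its
        System.out.println(String.join("|", parse("  Hello, World! It's 2024.  ")));
    }
}
```

Cleaning before splitting means punctuation attached to a word never reaches the stemmer, so each word arrives in the same lowercase, letters-only form.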
<aside> <img src="/icons/git_gray.svg" alt="/icons/git_gray.svg" width="40px" /> This homework assignment is directly useful for your project. Consider copying this class into your project repository when done!
</aside>
Below are some hints that may help with this homework assignment:
You need to use the third-party Apache OpenNLP library. The library should be set up automatically in Eclipse by Maven. See the main(String[]) method and the Apache OpenNLP Tools Javadoc for how to use this library. For example:
// ENGLISH refers to SnowballStemmer.ALGORITHM.ENGLISH (via a static import)
Stemmer stemmer = new SnowballStemmer(ENGLISH);
String demo = "practicing";
String stem = stemmer.stem(demo).toString();
System.out.println("Word: " + demo); // practicing
System.out.println("Stem: " + stem); // practic
System.out.println();
When working with files, you should use try-with-resources, the UTF-8 character encoding, and buffered readers and writers.
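As a rough illustration of that hint, the sketch below reads one file and writes another using java.nio.file.Path and Files rather than java.io.File. The class name FileIoSketch, the method name lowercaseFile, and the lowercase transformation are hypothetical placeholders; the real assignment would stem the words instead.

```java
import static java.nio.charset.StandardCharsets.UTF_8;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class FileIoSketch {
    /**
     * Copies the input file to the output file one line at a time,
     * converting each line to lowercase. Returns the number of lines written.
     */
    static int lowercaseFile(Path input, Path output) throws IOException {
        int count = 0;

        // Both resources are closed automatically, even if an exception is thrown.
        try (
            BufferedReader reader = Files.newBufferedReader(input, UTF_8);
            BufferedWriter writer = Files.newBufferedWriter(output, UTF_8)
        ) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line.toLowerCase());
                writer.newLine();
                count++;
            }
        }

        return count;
    }
}
```

Reading line by line keeps memory use small for large files, and declaring both the reader and the writer in one try-with-resources header ensures neither is left open.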
Look for opportunities to reduce duplicate code. For example, you could reuse some existing methods and/or create a new helper method that is reused in the methods that work with files.
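One way to picture this kind of reuse is a single helper that does the stemming work and adds results to whatever collection it is given, with thin methods built on top of it. The names StemHelperSketch, addStems, listStems, and uniqueStems are hypothetical, and the stemmer is passed in as a UnaryOperator stand-in so the sketch runs without OpenNLP. With the library available, a lambda such as word -> stemmer.stem(word).toString() could presumably be passed instead.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

class StemHelperSketch {
    /** Shared helper: stems each whitespace-separated word and adds it to the given collection. */
    static void addStems(String line, UnaryOperator<String> stemmer, Collection<String> stems) {
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                stems.add(stemmer.apply(word));
            }
        }
    }

    /** Keeps every stem in its original order, duplicates included. */
    static List<String> listStems(String line, UnaryOperator<String> stemmer) {
        List<String> stems = new ArrayList<>();
        addStems(line, stemmer, stems);
        return stems;
    }

    /** Keeps each stem once, in sorted order. */
    static Set<String> uniqueStems(String line, UnaryOperator<String> stemmer) {
        Set<String> stems = new TreeSet<>();
        addStems(line, stemmer, stems);
        return stems;
    }

    public static void main(String[] args) {
        // Stand-in stemmer (an assumption): strips a trailing "ing" or "ed" only.
        UnaryOperator<String> fake = word -> word.replaceAll("(ing|ed)$", "");
        System.out.println(listStems("Practicing practiced practic", fake));
        System.out.println(uniqueStems("Practicing practiced practic", fake));
    }
}
```

Because the list and set versions differ only in the collection they create, methods that work with files could call the same helper once per line instead of repeating the loop.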
These hints are optional. There may be multiple approaches to solving this homework.