For this project, you will write a Java program that processes all text files in a directory and its subdirectories, cleans and parses the text into word stems, builds a word count and an in-memory inverted index from that processed text, and outputs the results in JSON format.
The process of stemming reduces a word to a base form (or “stem”), so that words like interesting, interested, and interests all map to the stem interest. Stemming is a common preprocessing step in many search engines.
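To make the idea concrete, here is a minimal sketch of suffix stripping. It is a toy illustration only, not a real stemming algorithm and not necessarily what this project expects you to use; the class name ToyStemmer, the suffix list, and the minimum stem length of three characters are all assumptions made for this example.

```java
import java.util.Locale;

final class ToyStemmer {
    // Hypothetical suffix list, tried in order. A real stemmer uses far more rules.
    private static final String[] SUFFIXES = {"ing", "ed", "s"};

    // Lowercases the word and strips the first matching suffix, if any.
    static String stem(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            int stemLength = lower.length() - suffix.length();
            // Keep at least three characters so short words like "is" or "red" survive.
            if (lower.endsWith(suffix) && stemLength >= 3) {
                return lower.substring(0, stemLength);
            }
        }
        return lower;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"interesting", "interested", "interests"}) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

All three example words reduce to interest under these rules. A naive approach like this breaks down quickly on other words (for example, it leaves "running" as "runn"), which is why search engines rely on an established stemming algorithm instead.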
An inverted index is a nested data structure that stores the mapping from words to the documents and positions within those documents where those words were found. It is a common in-memory data structure used by many search engines.
For example, suppose we have the following inverted index:
{
  "capybara": {
    "input/mammals.txt": [
      11
    ]
  },
  "platypus": {
    "input/dangerous/venomous.txt": [
      2
    ],
    "input/mammals.txt": [
      3,
      8
    ]
  }
}
This indicates that the word capybara is found in the file input/mammals.txt in position 11. The word platypus is found in two files, input/mammals.txt and input/dangerous/venomous.txt. In the file input/mammals.txt, the word platypus appears twice, in positions 3 and 8. In the file input/dangerous/venomous.txt, the word platypus is in position 2.
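One possible way to hold this nested mapping in memory is a sorted map from each word to a sorted map from each file to a sorted set of positions. The sketch below rebuilds the example index above and prints it as JSON. The class name InvertedIndexSketch and its methods add, contains, and toJson are hypothetical names chosen for this illustration, and the JSON writer is deliberately simplified: it assumes words and paths contain no characters that need escaping.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

final class InvertedIndexSketch {
    // word -> (file location -> positions); TreeMap and TreeSet keep every level sorted.
    private final TreeMap<String, TreeMap<String, TreeSet<Integer>>> index = new TreeMap<>();

    // Records that the word was found in the file at the given position.
    void add(String word, String location, int position) {
        index.computeIfAbsent(word, key -> new TreeMap<>())
             .computeIfAbsent(location, key -> new TreeSet<>())
             .add(position);
    }

    // Returns true if the word was recorded in the file at the given position.
    boolean contains(String word, String location, int position) {
        TreeMap<String, TreeSet<Integer>> locations = index.get(word);
        if (locations == null) {
            return false;
        }
        TreeSet<Integer> positions = locations.get(location);
        return positions != null && positions.contains(position);
    }

    // Simplified pretty-printer: performs no escaping of quotes or backslashes.
    String toJson() {
        StringBuilder json = new StringBuilder("{\n");
        Iterator<Map.Entry<String, TreeMap<String, TreeSet<Integer>>>> words =
                index.entrySet().iterator();
        while (words.hasNext()) {
            var wordEntry = words.next();
            json.append("  \"").append(wordEntry.getKey()).append("\": {\n");
            var locations = wordEntry.getValue().entrySet().iterator();
            while (locations.hasNext()) {
                var locationEntry = locations.next();
                json.append("    \"").append(locationEntry.getKey()).append("\": [\n");
                var positions = locationEntry.getValue().iterator();
                while (positions.hasNext()) {
                    json.append("      ").append(positions.next());
                    json.append(positions.hasNext() ? ",\n" : "\n");
                }
                json.append(locations.hasNext() ? "    ],\n" : "    ]\n");
            }
            json.append(words.hasNext() ? "  },\n" : "  }\n");
        }
        return json.append("}").toString();
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        // Insertion order does not matter; the sorted collections order the output.
        index.add("platypus", "input/mammals.txt", 8);
        index.add("platypus", "input/mammals.txt", 3);
        index.add("platypus", "input/dangerous/venomous.txt", 2);
        index.add("capybara", "input/mammals.txt", 11);
        System.out.println(index.toJson());
    }
}
```

Sorted collections are used here so that words, files, and positions come out in a stable order regardless of the order in which files are processed, which makes the output easy to compare against an expected result.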
This project is broken into multiple milestones. See each milestone for more details.