For this Project 1 Index milestone, your project must also store and output an in-memory inverted index of the processed file(s) alongside the previously computed word counts.
You must complete the following assignments before beginning to work on this one:
Your main method must be placed in a class named Driver and must process the following additional command-line arguments:
-text [path] where the existing flag/value pair now triggers the calculation of both the word counts and an inverted index. See the “Text Processing” section below for details.
<aside> Calculating the word counts and the inverted index should be part of the same process, not two separate processes. In other words, files should be opened and processed only once.
</aside>
-index [path] where the flag -index indicates the next argument [path] is the path to use when writing the inverted index to a file in “pretty JSON” format. See the “Output Format” section below for details.
If the [path] argument is not provided, use index.json as the default output filename. If the -index flag is not provided, your code should still calculate the inverted index but should not produce an output file of that index.
These are in addition to the command-line arguments from the previous release of the project.
The command-line flag/value pairs may be provided in any order or not at all. Do not convert paths to absolute form when processing command-line input!
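The flag handling described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class ArgumentSketch and its methods parse, isFlag, and indexPath are hypothetical names, not part of the project's provided code, and the rule used here to decide what counts as a flag (a dash followed by a non-digit) is an assumption as well.

```java
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of flag/value parsing; not the project's required design. */
public class ArgumentSketch {
    /** Maps each flag to its value, or to null when no value follows the flag. */
    public static Map<String, String> parse(String[] args) {
        Map<String, String> flags = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (!isFlag(args[i])) {
                continue; // stray value without a flag: ignored in this sketch
            }
            String flag = args[i];
            String value = null;
            // Consume the next argument only if it is not itself a flag.
            if (i + 1 < args.length && !isFlag(args[i + 1])) {
                value = args[i + 1];
                i++;
            }
            flags.put(flag, value);
        }
        return flags;
    }

    /** Assumption: a flag is a dash followed by a non-digit character. */
    static boolean isFlag(String arg) {
        return arg != null && arg.length() > 1
                && arg.charAt(0) == '-' && !Character.isDigit(arg.charAt(1));
    }

    /** Returns the index output path, or null when -index was not given. */
    public static Path indexPath(Map<String, String> flags) {
        if (!flags.containsKey("-index")) {
            return null; // the index is still built, just not written to a file
        }
        String value = flags.get("-index");
        // The path stays relative; it is never converted to absolute form.
        return Path.of(value == null ? "index.json" : value);
    }
}
```

Because the flags are collected into a map before anything is processed, the order in which they appear on the command line does not matter, and a missing -index flag is easy to tell apart from an -index flag that has no value.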
Output user-friendly error messages in the case of exceptions or invalid input. Under no circumstances should your main() method output a stack trace to the user!
The input files should be cleaned, parsed, and stemmed as before. However, your code must now also create an in-memory inverted index data structure alongside the word counts. The inverted index must store a mapping from each word to the document location(s) where the word was found, and to the numeric position(s) at which the word occurs in each of those documents. The positions should start at 1. This will require nesting multiple built-in data structures.
Each file should only be opened once; the word counts and the inverted index should be built at the same time.
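As a minimal sketch of the nesting involved, the class below updates both structures in a single pass over a file's stems. Everything here is an assumption made for illustration: the class name IndexSketch, the method addAll, the choice of TreeMap and TreeSet, and the idea that the word counts map a location to its total number of stems. Cleaning, parsing, and stemming are assumed to have already happened, so the method receives a list of stems.

```java
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: word counts and inverted index built in one pass. */
public class IndexSketch {
    /** Assumed shape of the word counts: location -> number of stems. */
    public final TreeMap<String, Integer> counts = new TreeMap<>();

    /** word -> location -> sorted positions (positions start at 1). */
    public final TreeMap<String, TreeMap<String, TreeSet<Integer>>> index = new TreeMap<>();

    /** Adds already-stemmed words from one location to both structures. */
    public void addAll(String location, List<String> stems) {
        int position = 1; // positions are 1-based, not 0-based
        for (String stem : stems) {
            index.computeIfAbsent(stem, key -> new TreeMap<>())
                 .computeIfAbsent(location, key -> new TreeSet<>())
                 .add(position);
            position++;
        }
        int found = position - 1;
        if (found > 0) {
            // Same loop, same file read: the count falls out of the position counter.
            counts.merge(location, found, Integer::sum);
        }
    }
}
```

Sorted maps and sets are used here only so that iteration order is predictable; whether the real output requires sorted order is determined by the “Output Format” section, not by this sketch.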
For example, suppose we have the following inverted index:
{
  "capybara": {
    "input/mammals.txt": [
      11
    ]
  },
  "platypus": {
    "input/dangerous/venomous.txt": [
      2
    ],
    "input/mammals.txt": [
      3,
      8
    ]
  }
}
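One way to read this example is as a nested lookup: first by word, then by location, which yields the set of positions. The sketch below rebuilds the example in memory and looks up positions in it. The class IndexLookupSketch and its methods exampleIndex, add, and positions are hypothetical names used only for illustration.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: querying the example inverted index. */
public class IndexLookupSketch {
    /** Rebuilds the example index shown above. */
    public static Map<String, Map<String, Set<Integer>>> exampleIndex() {
        Map<String, Map<String, Set<Integer>>> index = new TreeMap<>();
        add(index, "capybara", "input/mammals.txt", 11);
        add(index, "platypus", "input/dangerous/venomous.txt", 2);
        add(index, "platypus", "input/mammals.txt", 3);
        add(index, "platypus", "input/mammals.txt", 8);
        return index;
    }

    /** Records that a word occurs at a position within a location. */
    static void add(Map<String, Map<String, Set<Integer>>> index,
            String word, String location, int position) {
        index.computeIfAbsent(word, key -> new TreeMap<>())
             .computeIfAbsent(location, key -> new TreeSet<>())
             .add(position);
    }

    /** Returns the positions of a word in a location, or an empty set if none. */
    public static Set<Integer> positions(Map<String, Map<String, Set<Integer>>> index,
            String word, String location) {
        return index.getOrDefault(word, Collections.emptyMap())
                    .getOrDefault(location, Collections.emptySet());
    }
}
```

With this sketch, looking up "platypus" in "input/mammals.txt" yields the positions 3 and 8, while looking up "capybara" in "input/dangerous/venomous.txt" yields an empty set because that word does not occur in that file.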