For this project, you will write a Java program that processes all text files in a directory and its subdirectories, cleans and parses the text into word stems, builds word counts and an in-memory inverted index from that processed text, and outputs the results in JSON format.


Introduction

The process of stemming reduces a word to a base form (or “stem”), so that words like interesting, interested, and interests all map to the stem interest. Stemming is a common preprocessing step in many search engines.
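To make the idea concrete, here is a deliberately simplified sketch of suffix stripping. It is not a real stemming algorithm (established stemmers such as the Porter or Snowball stemmers apply many more rules), and the class name ToyStemmer, its suffix list, and the minimum stem length are assumptions made only for illustration.

```java
import java.util.Locale;

class ToyStemmer {
    // Hypothetical suffix list for illustration only; checked in this order.
    private static final String[] SUFFIXES = {"ing", "ed", "s"};

    // Assumed minimum length of the remaining stem, so short words stay intact.
    private static final int MIN_STEM_LENGTH = 3;

    /** Lowercases the word and strips the first matching suffix, if any. */
    static String stem(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            int stemLength = lower.length() - suffix.length();
            if (lower.endsWith(suffix) && stemLength >= MIN_STEM_LENGTH) {
                return lower.substring(0, stemLength);
            }
        }
        return lower;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"interesting", "interested", "interests"}) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

All three example words reduce to interest under this sketch. A toy like this mishandles many ordinary words, which is why search engines rely on a tested stemmer instead.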

An inverted index is a nested data structure that stores the mapping from words to the documents and positions within those documents where those words were found. It is a common in-memory data structure used by many search engines.

For example, suppose we have the following inverted index:

{
  "capybara": {
    "input/mammals.txt": [
      11
    ]
  },
  "platypus": {
    "input/dangerous/venomous.txt": [
      2
    ],
    "input/mammals.txt": [
      3,
      8
    ]
  }
}

This indicates that the word capybara is found in the file input/mammals.txt at position 11. The word platypus is found in two files, input/mammals.txt and input/dangerous/venomous.txt. In input/mammals.txt, platypus appears twice, at positions 3 and 8. In input/dangerous/venomous.txt, it appears once, at position 2.
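One way to hold this nested structure in memory is with sorted maps and sets, as in the minimal sketch below. The class name InvertedIndexSketch and its method names are hypothetical, chosen only for illustration, and are not a required design for this project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {
    // word -> (file path -> positions); the sorted collections keep
    // words, paths, and positions in a consistent order for output.
    private final TreeMap<String, TreeMap<String, TreeSet<Integer>>> index = new TreeMap<>();

    /** Records that the word was found in the file at the given position. */
    void add(String word, String path, int position) {
        index.computeIfAbsent(word, key -> new TreeMap<>())
             .computeIfAbsent(path, key -> new TreeSet<>())
             .add(position);
    }

    /** Returns the file paths containing the word, in sorted order. */
    List<String> paths(String word) {
        TreeMap<String, TreeSet<Integer>> found = index.get(word);
        return found == null ? new ArrayList<>() : new ArrayList<>(found.keySet());
    }

    /** Returns the sorted positions of the word in the file, or an empty list. */
    List<Integer> positions(String word, String path) {
        TreeMap<String, TreeSet<Integer>> found = index.get(word);
        if (found == null || !found.containsKey(path)) {
            return new ArrayList<>();
        }
        return new ArrayList<>(found.get(path));
    }

    public static void main(String[] args) {
        InvertedIndexSketch sketch = new InvertedIndexSketch();
        sketch.add("capybara", "input/mammals.txt", 11);
        sketch.add("platypus", "input/mammals.txt", 3);
        sketch.add("platypus", "input/mammals.txt", 8);
        sketch.add("platypus", "input/dangerous/venomous.txt", 2);

        System.out.println(sketch.paths("platypus"));
        System.out.println(sketch.positions("platypus", "input/mammals.txt"));
    }
}
```

The main method rebuilds the example index above. Using TreeMap and TreeSet here is a design choice that keeps keys and positions sorted, so the order of entries does not depend on the order in which files were processed.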

Milestones

This project is broken into multiple milestones:

Project v1.0 Tests

Project v1.1 Tests

Project v1.2 Review

Project v1.3 Review

Project v1.4 Design

See each milestone for more details.

Homework