For this project, you will extend your Project 1 index code to process multi-word search queries from a query text file, conduct an exact or partial search of the inverted index for each query, and write the search results, ranked by term frequency, to file in a pretty JSON format.


Introduction

The primary reason to create an inverted index is to enable search. A search query indicates what we want to search for. For example, suppose we want to search for the following multi-word query in our search engine:

Observers observing 99 HIDDEN               capybaras!

Just as the build process involved processing the text into stems before indexing it, we must process the query in a similar way before conducting the search.

This query line will be processed into unique query stems as follows:

[capybara, hidden, observ]
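
The query processing above can be sketched roughly as follows. This is a minimal illustration, not the required implementation: `process_query` and `toy_stem` are hypothetical names, the cleaning rule (lowercase, drop everything except letters and whitespace) is an assumption inferred from the example output, and `toy_stem` is only a stand-in for whatever real stemmer your index build already uses.

```python
import re


def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a few common
    # suffixes so the example query reduces to the stems shown above.
    for suffix in ("ers", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def process_query(line, stem=toy_stem):
    # Assumed cleaning: lowercase, then remove anything that is not a
    # letter or whitespace, so "99" and "!" disappear entirely.
    cleaned = re.sub(r"[^a-z\s]", "", line.lower())
    # split() with no arguments collapses runs of whitespace.
    words = cleaned.split()
    # A set removes duplicate stems; sorting gives a stable order.
    return sorted({stem(word) for word in words})


query = "Observers observing 99 HIDDEN               capybaras!"
print(process_query(query))
```

With the toy stemmer, both "observers" and "observing" collapse to the same stem, which is why the processed query has three stems rather than four.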

We can then search our inverted index for exact matches, or for partial matches that start with our query stems.
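
The difference between the two search modes can be sketched as below. The index layout (a dictionary from stem to per-location counts) and the function name `search` are assumptions made for illustration; your Project 1 data structure may store positions rather than counts.

```python
from collections import Counter


def search(index, query_stems, partial=False):
    # index: assumed shape {stem: {location: times the stem appears}}
    totals = Counter()
    for stem in query_stems:
        if partial:
            # Partial search: every indexed word that starts with the stem.
            matches = [word for word in index if word.startswith(stem)]
        else:
            # Exact search: only the stem itself, if it is indexed.
            matches = [stem] if stem in index else []
        for word in matches:
            # Add this word's per-location counts to the running totals.
            totals.update(index[word])
    return dict(totals)


# Hypothetical toy index for demonstration only.
index = {
    "hidden": {"a.txt": 1},
    "observ": {"a.txt": 2},
    "observatori": {"b.txt": 1},
}
print(search(index, ["hidden", "observ"]))
print(search(index, ["hidden", "observ"], partial=True))
```

Scanning every key for partial search is the simplest approach; if the index keys are kept in sorted order, the scan can start at the first key that is not less than the query stem and stop at the first key that no longer has that prefix.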

A search result is a location (file path) where we found a match for one or more of the query words. The search engine needs to return these search results in sorted order, so that the most relevant results are listed first.

To sort the search results, we need to collect metadata alongside them: the number of times a query word was found (the count), the ratio of words in the file that are relevant (the score), and the location itself (where). For example:

{
  "capybara hidden observ": [
    {
      "count": 1,
      "score": 1.00000000,
      "where": "input/text/simple/.txt/hidden.txt"
    },
    {
      "count": 1,
      "score": 1.00000000,
      "where": "input/text/simple/capital_extension.TXT"
    },
    {
      "count": 13,
      "score": 0.36111111,
      "where": "input/text/simple/words.tExT"
    }
  ]
}
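
One way to compute and order results like those above is sketched here. Several details are assumptions rather than stated requirements: the score is taken to be the match count divided by the total number of words in the file (which agrees with 13 matches giving 0.36111111 if that file has 36 words, a total inferred from the sample), ties are broken by count and then by case-insensitive location, and the names `rank_results` and `format_score` are hypothetical. See the milestones for the actual ranking rules.

```python
def rank_results(counts, word_totals):
    # counts: {location: matches found}; word_totals: {location: words in file}
    results = [
        {"count": count, "score": count / word_totals[where], "where": where}
        for where, count in counts.items()
    ]
    # Assumed ordering: highest score first, then highest count,
    # then location compared without regard to case.
    results.sort(key=lambda r: (-r["score"], -r["count"], r["where"].lower()))
    return results


def format_score(score):
    # The sample output shows eight digits after the decimal point.
    return f"{score:.8f}"


counts = {
    "input/text/simple/words.tExT": 13,
    "input/text/simple/capital_extension.TXT": 1,
    "input/text/simple/.txt/hidden.txt": 1,
}
# Word totals are inferred from the sample scores, not given in the text.
word_totals = {
    "input/text/simple/words.tExT": 36,
    "input/text/simple/capital_extension.TXT": 1,
    "input/text/simple/.txt/hidden.txt": 1,
}
for result in rank_results(counts, word_totals):
    print(result["count"], format_score(result["score"]), result["where"])
```

Note that a general-purpose JSON writer would typically emit `1.0` rather than `1.00000000`, so matching the sample format likely requires formatting the score yourself.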

See the milestones below for more details on how to process the queries, conduct an exact or partial search, as well as rank and format the search results.

Milestones

This project is broken into multiple milestones:

Project v2.0 Tests

Project v2.1 Tests