This post reproduces the methods of a paper by Mühleisen et al., Old Dogs Are Great at New Tricks: Column Stores for IR Prototyping.
The premise of Mühleisen et al. is that instead of using custom made index structures, researchers should represent their documents in a column store relational database. The ranking functions can then be represented as SQL queries. This allows for easy comparison of different ranking functions as they only have to be represented as a SQL query. We can focus on the retrieval functions, and let the column store database take care of efficiently retrieving the documents. Another advantage is that all advances in the underlying database technology benefit our system for free.
This post will specifically focus on reproduction of the methods. Later posts will describe how to extend this approach to a different ranking function and how this approach can be used in a streaming context instead of a static one.
The paper implements the Okapi BM25 retrieval model, which also will be presented in this post. The implementation can easily be extended to other ranking functions, however depending on which ranking functions will be implemented, it might be useful to represent data differently in the database.
Wikipedia gives a clear definition of this ranking formula. Given a document and a query , BM25 is defined as:
In this formula, is the term frequency of in document
This corresponds to the assumption that word frequency indicates importance.
is the number of words in document , it is useful to take
this into account because a word occurring a couple of times in a short
document probably should give a stronger signal than a word the same amount of
times in a very long document.
is the average document length of the documents in the
collection, this parameter is useful to determine how the document you are
analyzing relates to the other documents in the collection in terms of length.
and are free parameters that can be optimized for the collection.
Often, when no optimization is carried out, respectively and
are chosen for and .
, the inverse document frequency, is the weight given to
individual query terms in the formula.
The occurrence of each query term should be given more weight if it is less
common in the collection. Imagine for example the query:
wizard is less common in the collection than the word
therefore more informative for the information need of the user.
The IDF used is defined as:
is the amount of document is the collection, and is the amount of documents the query term appears in.
In order to capture all these parameters in the SQL query, we need to be able to calculate them from the database tables. and are independent of the query terms and can be hard-coded in the SQL query or be extracted directly from the document representations in relational tables. This holds for and assuming the collection is static. However if the collection is not static, it is still possible to update these values when new documents are added or removed. The other parameters depend on the query terms and should be calculated query time.
Let’s look at the proposed method in more detail.
doc1 with the text
I put on my robe and wizard
hat, an example also used by Mühleisen et al. The following tables are
Note that not all terms appear in the table as an effect of stop word removal.
dict table keeps track of how many times terms appear in the collection.
The parameter in the IDF can be queried using this table.
The document term count is saved in table
terms, allowing for the
calculation of .
Document length is saved in the
docs table, making it possible to
Ideally the tables would be created directly while parsing the documents, it should be possible to insert all information using a one-pass approach. For this post a different approach was taken. Anserini was used to create a Lucene index. Anserini is a toolkit for reproducible information retrieval. The advantage of using Anserini to create a classic index, before creating the tables, is that the same preprocessing pipeline that Anserini uses can be taken advantage of. After the Lucene index is created, the relevant information can easily be extracted to build database tables.
For this experiment the Robust04 data set was used. Anserini supports easy indexing for Robust04, among other data sets often used for information retrieval research.
The following snippet shows an example of how BM25 scores can be calculated for a document given a query. Note that the implementation shown in the snippet is conjunctive BM25 for a query of length three, a variant of BM25 with similar effectiveness as a disjunctive variant, but faster. Some parameters are hard-coded in this query in order to represent the BM25 formula as a SQL query without adding unnecessary SQL code.
WITH qterms AS (SELECT termid, docid, df FROM terms WHERE termid IN (10575, 1285, 191)), subscores AS (SELECT docs.docid, len, term_tf.termid, tf, df, (log((528155-df+0.5)/(df+0.5))*((tf*(1.2+1)/ (tf+1.2*(1-0.75+0.75*(len/188.33)))))) AS subscore FROM (SELECT termid, docid, df AS tf FROM qterms) AS term_tf JOIN (SELECT docid FROM qterms GROUP BY docid HAVING COUNT(distinct termid) = 3) AS cdocs ON term_tf.docid = cdocs.docid JOIN docs ON term_tf.docid=docs.docid JOIN dict ON term_tf.termid=dict.termid) SELECT scores.docid, score FROM (SELECT docid, sum(subscore) AS score FROM subscores GROUP BY docid) AS scores JOIN docs ON scores.docid=docs.docid ORDER BY score DESC;
This post shows how to represent documents in a column store relational database. Using this representation it is easy to calculate BM25, and rank the documents accordingly. Later posts will describe how to extend this to a streaming collection and a different ranking function.
The code written for this post can be found on github. Specifically, the code takes a Lucene index and creates the tables as csv files that directly map to tables described in this post. This can for example be imported by the MonetDB column store.