6 March 2019

Teaching An Old Dog A New Trick

by Chris

This post reproduces the methods of a paper by Mühleisen et al., Old Dogs Are Great at New Tricks: Column Stores for IR Prototyping.

The premise of Mühleisen et al. is that instead of using custom made index structures, researchers should represent their documents in a column store relational database. The ranking functions can then be represented as SQL queries. This allows for easy comparison of different ranking functions as they only have to be represented as a SQL query. We can focus on the retrieval functions, and let the column store database take care of efficiently retrieving the documents. Another advantage is that all advances in the underlying database technology benefit our system for free.

This post will specifically focus on reproduction of the methods. Later posts will describe how to extend this approach to a different ranking function and how this approach can be used in a streaming context instead of a static one.

Okapi BM25

The paper implements the Okapi BM25 retrieval model, which also will be presented in this post. The implementation can easily be extended to other ranking functions, however depending on which ranking functions will be implemented, it might be useful to represent data differently in the database.

Wikipedia gives a clear definition of this ranking formula. Given a document $D$ and a query $Q$ , BM25 is defined as:

$\begin{align*} \textit{BM25}(D, Q) = \sum^n_{i=1}IDF(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)} {f(q_i, D) + k_1 \cdot \Big(1 - b + b \cdot \frac{|D|}{avgdl}\Big)} \end{align*}$

In this formula, $f(q_i, D)$ is the term frequency of $q_i$ in document $D$ . This corresponds to the assumption that word frequency indicates importance. $|D|$ is the number of words in document $D$ , it is useful to take this into account because a word occurring a couple of times in a short document probably should give a stronger signal than a word the same amount of times in a very long document. $avgdl$ is the average document length of the documents in the collection, this parameter is useful to determine how the document you are analyzing relates to the other documents in the collection in terms of length. $k_1$ and $b$ are free parameters that can be optimized for the collection. Often, when no optimization is carried out, respectively $1.2$ and $0.75$ are chosen for $k_1$ and $b$ . $IDF(q_i)$ , the inverse document frequency, is the weight given to individual query terms in the formula. The occurrence of each query term should be given more weight if it is less common in the collection. Imagine for example the query: blue wizard. The word wizard is less common in the collection than the word blue and therefore more informative for the information need of the user. The IDF used is defined as:

$\begin{align*} IDF = \log\Big(\frac{N-df+0.5}{df+0.5}\Big) \end{align*}$

$N$ is the amount of document is the collection, and $df$ is the amount of documents the query term appears in.

In order to capture all these parameters in the SQL query, we need to be able to calculate them from the database tables. $k_1$ and $b$ are independent of the query terms and can be hard-coded in the SQL query or be extracted directly from the document representations in relational tables. This holds for $avgdl$ and $N$ assuming the collection is static. However if the collection is not static, it is still possible to update these values when new documents are added or removed. The other parameters depend on the query terms and should be calculated query time.

An example

Let’s look at the proposed method in more detail. Consider document doc1 with the text I put on my robe and wizard hat, an example also used by Mühleisen et al. The following tables are constructed:

Table: dict
termid	term	df
1	put	1
2	robe	1
3	wizard	1
4	hat	1

Note that not all terms appear in the table as an effect of stop word removal.

Table: terms
termid	docid	count
1	1	1
2	1	1
3	1	1
4	1	1

Table: docs
docid	name	length
1	doc1	4

The dict table keeps track of how many times terms appear in the collection. The $df$ parameter in the IDF can be queried using this table. The document term count is saved in table terms, allowing for the calculation of $f(q_i, D)$ . Document length is saved in the docs table, making it possible to lookup $|D|$ .

Creating the tables

Ideally the tables would be created directly while parsing the documents, it should be possible to insert all information using a one-pass approach. For this post a different approach was taken. Anserini was used to create a Lucene index. Anserini is a toolkit for reproducible information retrieval. The advantage of using Anserini to create a classic index, before creating the tables, is that the same preprocessing pipeline that Anserini uses can be taken advantage of. After the Lucene index is created, the relevant information can easily be extracted to build database tables.

For this experiment the Robust04 data set was used. Anserini supports easy indexing for Robust04, among other data sets often used for information retrieval research.

How to query

The following snippet shows an example of how BM25 scores can be calculated for a document given a query. Note that the implementation shown in the snippet is conjunctive BM25 for a query of length three, a variant of BM25 with similar effectiveness as a disjunctive variant, but faster. Some parameters are hard-coded in this query in order to represent the BM25 formula as a SQL query without adding unnecessary SQL code.

WITH qterms AS (SELECT termid, docid, df FROM terms 
	WHERE termid IN (10575, 1285, 191)), 
	subscores AS (SELECT docs.docid, len, term_tf.termid, 
	tf, df, (log((528155-df+0.5)/(df+0.5))*((tf*(1.2+1)/
	(tf+1.2*(1-0.75+0.75*(len/188.33)))))) AS subscore 
	FROM (SELECT termid, docid, df AS tf FROM qterms) AS term_tf 
	JOIN (SELECT docid FROM qterms 
		GROUP BY docid HAVING COUNT(distinct termid) = 3) 
		AS cdocs ON term_tf.docid = cdocs.docid 
	JOIN docs ON term_tf.docid=docs.docid 
	JOIN dict ON term_tf.termid=dict.termid) 
SELECT scores.docid, score FROM (SELECT docid, sum(subscore) AS score 
	FROM subscores GROUP BY docid) AS scores JOIN docs ON 
	scores.docid=docs.docid ORDER BY score DESC;

Conclusion

This post shows how to represent documents in a column store relational database. Using this representation it is easy to calculate BM25, and rank the documents accordingly. Later posts will describe how to extend this to a streaming collection and a different ranking function.

Source

The code written for this post can be found on github. Specifically, the code takes a Lucene index and creates the tables as csv files that directly map to tables described in this post. This can for example be imported by the MonetDB column store.