Create a data search engine from a relational database

What allowed Google to triumph over AltaVista, the first comprehensive Internet search engine, was its ability to sort results by relevance to the user. Google used the fact that popular pages are linked to more often than less relevant ones to compute a static score, the famous PageRank, and display pages in that order. The rest is history: Google is a multi-billion-dollar business and AltaVista is gone.

The techniques used by web search engines are also relevant when implementing search on everyday databases. In a relational database, the data are described in tables that relate to each other. This means that, just as web search was built on page links, it is possible to sort data by the way they link to each other.

For this example, I will use an extraction of the famous IMDB. Using the public data files and IMDbPY, I have reconstructed a 5GB+ relational database. The database has this (very simplified) schema:
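
In rough SQL terms, the part that matters here boils down to movies, people, and the cast relation between them. A sketch (the table and column names are illustrative, not the exact IMDbPY layout):

    import sqlite3

    conn = sqlite3.connect("imdb.db")
    conn.executescript("""
        -- Illustrative subset: movies, people, and the relation between them.
        CREATE TABLE Movie     (MovieID INTEGER PRIMARY KEY, Title TEXT, Year INTEGER);
        CREATE TABLE Person    (NameID  INTEGER PRIMARY KEY, Name  TEXT);
        CREATE TABLE MovieCast (MovieID INTEGER REFERENCES Movie,
                                NameID  INTEGER REFERENCES Person,
                                Role    TEXT);
    """)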

From here it is easy to create a full text search table using the excellent FTS4 extension. However, like the ancient AltaVista, the results are not sorted and not always relevant.
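
A minimal sketch of this step, assuming the illustrative schema above and an SQLite build with FTS4 enabled:

    import sqlite3

    conn = sqlite3.connect("imdb.db")
    # Build the full text index over the titles.
    conn.execute("CREATE VIRTUAL TABLE MovieText USING fts4(Title)")
    conn.execute("INSERT INTO MovieText SELECT Title FROM Movie")
    # A raw MATCH query: the rows come back in no useful order.
    for (title,) in conn.execute(
            "SELECT Title FROM MovieText WHERE Title MATCH 'matrix'"):
        print(title)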

The results are simply returned in database order, making “M.A.N. Matrix Adjusted Normal” from 1992 appear before “The Matrix” from 1999. Obviously, this is not very useful.

However, it is possible to take advantage of the relational information to build a graph of all the results and link them together. For example, with “Matrix”, we get something like this:
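
A hedged illustration of a fragment of such a graph (only assuming the well-known fact that the Matrix films share their cast):

    # Each full-text hit is a node, linked to the people it relates to in
    # the database; the people in turn link back to the movies. Heavily
    # connected results such as "The Matrix" end up at the centre of the
    # graph, while one-off matches stay on the fringe.
    graph = {
        "The Matrix (1999)":          ["Keanu Reeves", "Carrie-Anne Moss"],
        "The Matrix Reloaded (2003)": ["Keanu Reeves", "Carrie-Anne Moss"],
        "Keanu Reeves":               ["The Matrix (1999)", "The Matrix Reloaded (2003)"],
        "Carrie-Anne Moss":           ["The Matrix (1999)", "The Matrix Reloaded (2003)"],
    }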

From this graph, it is possible to compute a relevance score based on the different links. The best-known algorithm for that is the famous PageRank (from Larry Page, now CEO of Google). However, PageRank is designed to score a static set (the snapshot of the Internet that early Google took at regular intervals). In our case, computing PageRank on the generated graph would be slow and not very relevant.

The HITS algorithm, on the other hand, was designed to run on a dynamically built graph. For each node, it computes two scores: an “Authority” score measuring how much a node is linked to, and a “Hub” score measuring how much a node links to others.
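
In code, a minimal sketch of the iterative HITS computation (my own illustration, not the engine's actual implementation):

    def hits(graph, iterations=20):
        """Iterative HITS over a graph given as {node: [linked nodes]}.
        Returns (authority, hub) score dictionaries."""
        nodes = set(graph) | {t for targets in graph.values() for t in targets}
        incoming = {n: [] for n in nodes}
        for n, targets in graph.items():
            for t in targets:
                incoming[t].append(n)
        auth = {n: 1.0 for n in nodes}
        hub = {n: 1.0 for n in nodes}
        for _ in range(iterations):
            # Authority: how strongly a node is pointed to by good hubs.
            auth = {n: sum(hub[m] for m in incoming[n]) for n in nodes}
            norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
            auth = {n: v / norm for n, v in auth.items()}
            # Hub: how strongly a node points to good authorities.
            hub = {n: sum(auth[t] for t in graph.get(n, ())) for n in nodes}
            norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
            hub = {n: v / norm for n, v in hub.items()}
        return auth, hub

    # Ranking a result graph then reduces to sorting by Authority:
    # auth, hub = hits(graph)
    # ranked = sorted(auth, key=auth.get, reverse=True)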

If we compute HITS on the “Matrix” result set, we suddenly get a much more relevant order:

The process works remarkably well; it even handles more elusive queries:

Implementation

The search engine processes a text query in 4 sequential steps (sketched in code after the list):

  • The text query is passed to the full text search engine of the database. Here I’m using the FTS extension of SQLite, but any equivalent from another vendor will do.
  • The list of entries is converted to a graph and the engine links the entries together.
  • If the maximum number of graph nodes has not been reached, a second generation of nodes is created.
  • The HITS values are computed for each node of the graph and the best results are extracted.
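
Put together, the pipeline looks roughly like this. It is a sketch under assumptions: hits is the function shown earlier, item_links is a hypothetical link-fetching helper (one possible shape is sketched further down), and the table layout is the one described in the next paragraphs:

    def search(conn, query, max_nodes=500):
        # 1. Full text search for the seed items. (The DocID-to-ItemID
        #    resolution through the mapping tables is elided here.)
        seeds = [r[0] for r in conn.execute(
            "SELECT docid FROM Main_FullText WHERE Content MATCH ?", (query,))]
        # 2. Convert the entries to a graph by linking them together.
        graph = {item: item_links(conn, item) for item in seeds}
        # 3. If there is room left, grow a second generation of nodes.
        if len(graph) < max_nodes:
            for item in list(graph):
                for linked in graph[item]:
                    graph.setdefault(linked, item_links(conn, linked))
        # 4. Compute HITS and return the best results first.
        auth, _hub = hits(graph)
        return sorted(graph, key=lambda n: auth.get(n, 0.0), reverse=True)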

For performance reasons, the original database is reconstructed into a separate SQLite database, which allows the engine to easily identify the full text results and link the nodes together.

The “main_xxx” tables are global tables used by the engine:

  • Main_Domain and Main_DomainMetadata describe the contents and the relationships of the different domains (i.e. the searchable data collections).
  • Main_Master contains the unique ItemIDs of all the domains. This table allows finding which domain is associated with a given ItemID. It is also a good place to store static scores.
  • Main_FullText contains all the searchable strings of all the domains. With SQLite, each full-text entry can only be associated with a single value, its “DocID”, so an extra table is needed to map an entry back to its domain. A sketch of these tables follows the list.
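
A hedged sketch of what these global tables could look like (only the table names and the DocID constraint come from the description above; the column layout is an assumption):

    import sqlite3

    conn = sqlite3.connect("search.db")   # the rebuilt, engine-side database
    conn.executescript("""
        -- One row per domain (i.e. per searchable data collection).
        CREATE TABLE Main_Domain         (DomainID INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE Main_DomainMetadata (DomainID INTEGER, Key TEXT, Value TEXT);
        -- Every ItemID of every domain; also a good home for static scores.
        CREATE TABLE Main_Master (ItemID INTEGER PRIMARY KEY,
                                  DomainID INTEGER,
                                  StaticScore REAL DEFAULT 0);
        -- All searchable strings. FTS4 gives each row a single DocID,
        -- hence the extra mapping table back to the owning domain.
        CREATE VIRTUAL TABLE Main_FullText USING fts4(Content);
        CREATE TABLE Main_DocDomain (DocID INTEGER PRIMARY KEY,
                                     DomainID INTEGER);
    """)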

Each domain is contained in its own table. Naturally, searchable text is stored as an identifier pointing back to the full-text table. Depending on the metadata, external IDs (like MovieID and NameID) are either ItemIDs or resolved through secondary indexes.
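
For a movie domain, for example, that could translate to something like this (all names are assumptions):

    conn.executescript("""
        -- One table per domain; TitleDoc points back into Main_FullText.
        CREATE TABLE Domain_Movie (ItemID   INTEGER PRIMARY KEY,
                                   TitleDoc INTEGER,
                                   Year     INTEGER);
        -- Link table relating two domains through their ItemIDs.
        CREATE TABLE Domain_Cast  (MovieID INTEGER,   -- ItemID of a movie
                                   NameID  INTEGER);  -- ItemID of a person
        CREATE INDEX Cast_ByName ON Domain_Cast(NameID);  -- secondary index
    """)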

This structure allows implementing most of the database extraction directly in SQL. Using the metadata table, the engine can generate two SELECT queries per item: one to get all of its forward links and another to get all of its incoming links.
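
For the illustrative movie domain above, the two generated queries could boil down to the following. This is also one possible shape for the hypothetical item_links helper used in the pipeline sketch (the real engine would build the SQL from Main_DomainMetadata instead of hard-coding it):

    def item_links(conn, item_id):
        """Hypothetical helper: fetch an item's links, hard-coded here
        for the movie/cast example."""
        # Forward links: every person the item (as a movie) points to.
        forward = [r[0] for r in conn.execute(
            "SELECT NameID FROM Domain_Cast WHERE MovieID = ?", (item_id,))]
        # Incoming links: every movie pointing at the item (as a person).
        incoming = [r[0] for r in conn.execute(
            "SELECT MovieID FROM Domain_Cast WHERE NameID = ?", (item_id,))]
        return forward + incoming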

Optimization

The graph generation phases are by far the most expensive. Experience shows that you don’t actually need many generations in most cases.

In this experiment, I ran the same search, “Kwai”, twice: once with only one generation and a 500-node cap (the text search returns around 400 nodes), and a second time letting the engine add more nodes with a second generation. The two runs return equivalent results, but the first one runs in a fraction of the time of the second:

FAQ

    Why not run the engine on the original relational database structure? That would avoid the extraction and transformation phase.
    The new database is optimized for graph construction and extension. If the engine used a classic layout, it would need complex queries to identify the search results and to fetch the node links. That said, if you trade flexibility for data storage, I’m sure it is possible to use these techniques directly on an existing schema.

    Why HITS? Why not PageRank?
    PageRank is a static scoring algorithm: it is designed to run not on a dynamic subgraph but on a whole collection. It could still be a nice addition for improving the result order when the Authority score is too small to be meaningful.
Attachment: DataSearch.zip (1.67 MB)