What allowed Google to triumph over AltaVista, the first comprehensive Internet search engine, was its ability to sort results by relevance to the user. Google exploited the fact that popular pages are linked to more often than less relevant ones to compute a static score, the famous PageRank, and display pages in that order. The rest is history: Google is a multi-billion-dollar business and AltaVista is gone.
The techniques used by web search engines are also relevant when implementing search over everyday databases. In a relational database, the data are described in tables that relate to each other. This means that, just as web search was built on page links, it is possible to rank data by the way the records link to each other.
From here it is easy to create a full-text search table using the excellent FTS4 extension. However, like the ancient AltaVista, the results are not sorted and not always relevant:
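A minimal sketch of the problem, using Python's built-in `sqlite3` module (the table and column names here are illustrative, not the engine's actual schema):

```python
import sqlite3

# Requires an SQLite build with the FTS4 extension enabled
# (the default on most platforms).
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE movies USING fts4(Title)")
db.executemany("INSERT INTO movies(Title) VALUES (?)", [
    ("M.A.N. Matrix Adjusted Normal",),
    ("The Matrix",),
    ("The Matrix Reloaded",),
])

# FTS4 MATCH finds the rows but returns them in storage (rowid) order,
# with no relevance ranking.
for (title,) in db.execute("SELECT Title FROM movies WHERE Title MATCH 'matrix'"):
    print(title)
```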
The results are simply returned in database order, making “M.A.N. Matrix Adjusted Normal” from 1992 appear before “The Matrix” from 1999. Obviously, this is not very useful.
However, it is possible to take advantage of the relational information in the database to build a graph of all the results, linking them together. For example, with “Matrix”, we get:
From this graph, it is possible to compute a relevance score based on the different links. The most famous algorithm for this is PageRank (from Larry Page, co-founder and now CEO of Google). However, PageRank is designed to score a static set (the snapshots of the Internet that early Google crawled at regular intervals). In our case, computing PageRank on the generated graph would be slow and not very relevant.
The HITS algorithm, on the other hand, was designed to run on a dynamic, query-time graph. For each node, it computes two scores: an “authority” score measuring how much a node is linked to, and a “hub” score measuring how much a node links to others.
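The iteration itself is short. Here is a minimal sketch of HITS in Python; the node names and link structure are made up for illustration:

```python
def hits(links, iterations=20):
    """links: dict mapping each node to the list of nodes it points to."""
    nodes = set(links) | {n for targets in links.values() for n in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the nodes linking to you.
        auth = {n: sum(hub[m] for m in links if n in links[m]) for n in nodes}
        # Hub: sum of the authority scores of the nodes you link to.
        hub = {n: sum(auth[m] for m in links.get(n, [])) for n in nodes}
        # Normalize so the scores stay bounded.
        norm_a = sum(auth.values()) or 1.0
        norm_h = sum(hub.values()) or 1.0
        auth = {n: s / norm_a for n, s in auth.items()}
        hub = {n: s / norm_h for n, s in hub.items()}
    return auth, hub

# “The Matrix” is linked to by more related items, so it ends up
# with the highest authority score.
links = {
    "Keanu Reeves": ["The Matrix", "The Matrix Reloaded"],
    "The Matrix Reloaded": ["The Matrix"],
    "M.A.N. Matrix Adjusted Normal": [],
}
auth, hub = hits(links)
print(max(auth, key=auth.get))  # → The Matrix
```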
If we compute HITS scores on the “Matrix” result set, we suddenly get a much more relevant order:
The process works really well, as it even handles more elusive queries:
The search engine processes a text query in four sequential steps:
For performance reasons, the original database is reconstructed into a separate SQLite database, which allows the engine to easily identify the full-text results and link the nodes together.
The “main_xxx” tables are global tables used by the engine:
Each domain is contained in its own table. Obviously, searchable text is stored as an identifier pointing back to the full-text table. Depending on the metadata, external IDs (like MovieID and NameID) are either ItemIDs or use secondary indexes.
This structure allows implementing most of the database extraction directly in SQL. Using the metadata table, the engine can create two SELECT queries: one to get all the forward links of an item, and another to get all of its incoming links.
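The two queries can be sketched as follows. The schema here (an `acted_in` table with `NameID` and `MovieID` columns) is invented for illustration; the real engine derives the equivalent queries from its metadata table:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE acted_in (NameID INTEGER, MovieID INTEGER);
    INSERT INTO acted_in VALUES (1, 100), (1, 101), (2, 100);
""")

def forward_links(name_id):
    # All the items this node points to.
    return [m for (m,) in db.execute(
        "SELECT MovieID FROM acted_in WHERE NameID = ?", (name_id,))]

def incoming_links(movie_id):
    # All the items pointing to this node.
    return [n for (n,) in db.execute(
        "SELECT NameID FROM acted_in WHERE MovieID = ?", (movie_id,))]

print(forward_links(1))    # → [100, 101]
print(incoming_links(100)) # → [1, 2]
```

Walking these two queries outward from the full-text hits is what grows the graph, one generation at a time.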
The graph-generation phases are by far the most expensive. Experience shows that in most cases you don’t actually need many generations.
In this experiment, I ran the same search, “Kwai”, twice: once with only one generation and a maximum of 500 nodes (the text search returns around 400 nodes), and a second time allowing the engine to issue more nodes with a second generation. The two runs return equivalent results, yet the first completed in a fraction of the time of the second: