CMPH, CHD and Dr. Amjad Daoud

Two years ago I was working on a project where a minimal perfect hash looked like a great way to improve performance. For this, I ported the CHD algorithm from the CMPH library from C++ to C#. I later decided to share it on this web site, on this minimal web page.

I received the following email from Dr. Amjad Daoud asking me to mention that the CHD algorithm is very close, if not identical, to his own Algorithm II from his 1992 CACM paper and his thesis, published 17 years before the CHD paper by F. C. Botelho, D. Belazzougui and M. Dietzfelbinger.

(I removed the email addresses)

From Dr. Amjad Daoud

Dear Laurent Dupuis,

I am writing to you after reading on your web site that you implemented the CHD algorithm approach, while it is in fact a C# implementation of my CACM 1992 Algorithm II and an algorithm in my thesis; please see attached.

It seems that many have quoted you as well. Please note that our original Algorithm II in CACM 1992 implemented minimal perfect hashing by first mapping the keys, then ordering buckets according to their size, and fitting them into the hash table using a parameter, usually randomised using a small random jump table.

Later in 1992, I developed a much more compact algorithm that used faster ordinary hash functions.

Recently I was surprised to read the following in the "Hash, displace, and compress" paper (ESA 2009), page 13:

"Our implementation needs to find an injective mapping x → (g(x), f 1(x), f 2 (x)) for all x ∈ S ⊆ U, where g : U → [r], f 1 : U → [m], and f 2 : U → [m].
Function g will map each key to one of the r buckets. Then, for each bucket B i , 0 ≤ i < r, we will assign a pair of displacements (d 0 , d 1 ) so that each key x ∈ B i is placed in an empty bin given by (f 1 (x) + d 0 f 2 (x) + d 1 ) mod m."
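
To make the quoted scheme concrete, here is a minimal C# sketch of the lookup side as I understand it: g selects the bucket, the bucket's displacement pair (d0, d1) is looked up, and the bin is (f1(x) + d0*f2(x) + d1) mod m. The hash functions G, F1 and F2 below are simple stand-ins of my own, not the ones from the paper or from CMPH.

using System;

// Minimal sketch of the lookup side of the quoted scheme. G, F1 and F2 are
// placeholder hash functions (FNV-style, seeded differently); only the
// placement formula (f1(x) + d0*f2(x) + d1) mod m follows the quote.
static class HashDisplaceSketch
{
    // Cheap seeded string hash used as a stand-in for g, f1 and f2.
    static uint Seeded(string key, uint seed)
    {
        uint h = seed;
        foreach (char c in key)
            h = (h ^ (uint)c) * 16777619u;   // FNV-1a style mixing
        return h;
    }

    static int G(string key, int r)  => (int)(Seeded(key, 0x9747b28c) % (uint)r);  // g : U -> [r]
    static int F1(string key, int m) => (int)(Seeded(key, 0x85ebca6b) % (uint)m);  // f1 : U -> [m]
    static int F2(string key, int m) => (int)(Seeded(key, 0xc2b2ae35) % (uint)m);  // f2 : U -> [m]

    // Bin a key lands in, given the displacement pair (d0, d1) that the
    // construction phase assigned to the key's bucket.
    public static int Bin(string key, int r, int m, (int d0, int d1)[] displacements)
    {
        var (d0, d1) = displacements[G(key, r)];
        long value = F1(key, m) + (long)d0 * F2(key, m) + d1;
        return (int)(((value % m) + m) % m);
    }
}

The construction phase (not shown) is the part that searches for a (d0, d1) pair per bucket so that all keys of the bucket fall into distinct empty bins.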

My thesis, page 96:
"The class of functions searched is:
h(k) = (h0(k) + h1(k) + 7 * (h2(k) − 1) * g(h1(k))) (mod n)"

This is equivalent to h(k) = f1(x) + 7*f2(x)*d - 6*d when h1(k) corresponds to your x, and f1(x), f2(x) correspond to h0(k) and h2(k) respectively. (h0, h1, h2) must be injective, so your "d0" corresponds to my "7*d" and your "d1" corresponds to my "-6*d".
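
For reference, the quoted class of functions from the thesis can be evaluated in C# as follows; h0, h1, h2 are caller-supplied hash functions and the g table is assumed to have been filled in by the search step described further down. This is only an illustration of the formula, not the thesis code, and the modular indexing of g is my own safety guard.

using System;

// Illustration of the class of functions quoted from the thesis:
//   h(k) = (h0(k) + h1(k) + 7 * (h2(k) - 1) * g(h1(k))) mod n
// h0, h1, h2 are caller-supplied hash functions; g is assumed to have been
// filled in by the search step. Not the thesis implementation.
static class ThesisClassSketch
{
    public static long H(string key, long n, long[] g,
                         Func<string, long> h0, Func<string, long> h1, Func<string, long> h2)
    {
        long i = ((h1(key) % g.Length) + g.Length) % g.Length;  // index g via h1(k), kept in range
        long value = h0(key) + h1(key) + 7L * (h2(key) - 1) * g[i];
        return ((value % n) + n) % n;                           // reduce into [0, n)
    }
}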

Kindly find attached Prof. Martin's explanation.

Also, I posted relevant information regarding my algorithm(s) at http://iswsa.acm.org/mphf/index.html. I would appreciate it if you cite this page as well.

Sincerely,
Dr. Amjad Daoud

---------- Forwarded message ----------
From: Martin Dietzfelbinger
Date: Wed, 20 Feb 2013 22:40:43 +0100
Subject: Re: "Hash, displace, and compress" paper
To: Amjad Daoud
Cc: Martin Dietzfelbinger

Dear Dr Daoud,

please excuse the long delay in my answer.

I must say that only your comments, and the subsequent re-reading of the CACM paper (Edward A. Fox, Lenwood S. Heath, Qi Fan Chen and Amjad M. Daoud, FHCD), made me recognize that Algorithm 2 of that paper does indeed bear some resemblance to our Algorithm 1.

The work on our "Hash-Displace-Compress" paper started out with Algorithm 1, found independently by my esteemed coauthor Djamal Belazzougui. When working out the details and finding references and related work in the literature, our main reference was "Rasmus Pagh: Hash and Displace: Efficient Evaluation of Minimal Perfect Hash Functions. WADS 1999, p. 49-54", which uses an approach close to our Algorithm 1. (It uses 2(1+epsilon)n buckets A[i], sorts them in decreasing order of magnitude, and "places" them in a table; see appendix A of Pagh's paper. We use (1+epsilon)n buckets and place them in decreasing order of magnitude.

The compression part is totally new, and original.

Pagh's approach in turn uses ideas from "Robert Endre Tarjan and Andrew Chi Chih Yao. Storing a sparse table. Communications of the ACM, 22(11):606-611, 1979". Already there, "buckets" or "blocks" or "patterns" were treated in order of decreasing size. They called their approach "first fit decreasing". So the sorting motif was definitely not new with us, or with any work in the 1980s.)

It is very unfortunate, and I am deeply sorry that we overlooked citing your FHCD paper for its Algorithm 2. This was definitely unintentional. We cited Pagh's paper extensively, so there was no claim whatsoever that the splitting, sorting and placing approach was something never seen before.

In the enclosed document, I have compared Algorithm 1 from the "Hash-Displace-Compress" paper with Algorithm 2 from your FHCD paper in some detail. I hope you will find this comparison interesting. In summary, it argues that there are quite deep differences between the two algorithms in spite of similarities at the surface, in particular regarding the asymptotic behaviour.

So now I am convinced that Algorithm 1 from "Hash-Displace-Compress" cannot be called "a copy of" Algorithm 2 from FHCD, because of differences in the structure, the behaviour, and the properties that can be proven about them.

For the future I would like to suggest the following, if this finds your consent. If and when we write a journal version of our paper, which we hope will happen in the near future, we shall cite your paper, your thesis, and the survey paper by Czech, Havas, and Majewski, which also presents and discusses Algorithm 2 from FHCD. We will also give a brief account of Algorithm 2 and its behaviour in your own experiments.

This will include the observation that it works well for key sets that are not too large (nowadays we are talking billions instead of millions, of course). It will also be stated that every algorithm that uses buckets of size cn/log n for a constant c and range n = m cannot have expected linear construction time in the asymptotic sense, meaning for very large n. (This is so no matter whether the buckets are sorted by size or not. A sketch of a proof is in the enclosed document.)

All these remarks about the asymptotic behaviour in no way deny that Algorithm 2 from FHCD can be very useful in practice, if one lets the factor c grow with n. I really congratulate you on finding this helpful heuristic method!

Your letter kindled my interest in Algorithm 2 from FHCD, and I shall try to have a student investigate the difference in behaviour of the different algorithms we are discussing, by experiments.

I will inform you on the outcome.

I hope that this, and the enclosed document, settles the questions you raised in your email.

I would be very grateful if you could inform your coauthors of FHCD of this exchange of emails.

Again, I deeply regret that we failed to cite the papers that you mentioned.

Sincerely yours
Professor Martin Dietzfelbinger

--
Univ.-Prof. Dr. Martin Dietzfelbinger
Ilmenau University of Technology
Faculty of Computer Science and Automation
Complexity Theory and Efficient Algorithms Group
http://www.tu-ilmenau.de/en/ktea/people/univ-prof-dr-martin-dietzfelbinger/

On 11.02.2013 17:01, Amjad Daoud wrote:
Dear Prof. Martin,

Page 82 of my thesis in 1993 (algorithm developed in late 1987; using a NeXT 68030 machine with 32 MB of RAM to compute an MPHF (4 bits/key) for 3.8 million keys, reading the key set from 15 tapes; 12 minutes to read and map, 12 minutes to order, 12 minutes to search; total = 36 minutes):

"Figure 4.3 gives the algorithm for the Searching step. At the beginning of the Searching step, a set of 20 small primes is chosen (or fewer if n is quite small) that do not divide n. Each time (3) is executed, one of the primes q is chosen at random to be s1 and is used as an increment to obtain the remaining s_j. Thus, the random probe sequence is
0, q, 2q, 3q, ..., (n-1)q

In (2), we start fitting patterns according to their cardinality in descending order. Entries are popped from the stack of the current pattern size being processed.
In (4), locations in the hash table T are calculated and checked if assigned previously (5).
If the probes are successful (6), we assign the keys to the hash table locations.
If any of these probes were assigned, we pick another sj value and repeat the process.
If we fail to fit a pattern within n trials, the Searching step flags an error.
In case of an error in the Searching step, the hash table is reinitialized and the Searching step is repeated with a different random probe sequence.
In (8), we assign patterns of size 1."
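
Reading the quoted description, the Searching step can be sketched in C# roughly as follows: patterns (keys sharing the same h1 value) are fitted into the table in descending order of cardinality, and each pattern is tried with probe values 0, q, 2q, ..., (n-1)q for a randomly chosen small prime q that does not divide n. All names here are mine, the slot formula simply reuses the class of functions quoted earlier, and this is only an illustration, not the algorithm of Figure 4.3.

using System;
using System.Collections.Generic;
using System.Linq;

// Rough sketch of the Searching step described above. Patterns are groups of
// keys that share the same h1 value; they are fitted in descending order of
// size (so size-1 patterns naturally come last), trying candidate g values
// from the probe sequence 0, q, 2q, ..., (n-1)q for a small prime q chosen at
// random from primes that do not divide n. Illustration only.
static class SearchingStepSketch
{
    public static long[] FitPatterns(
        IEnumerable<string> keys, long n,
        Func<string, long> h0, Func<string, long> h1, Func<string, long> h2,
        int[] smallPrimesNotDividingN, Random rng)
    {
        // Group keys into patterns by h1(k) and order by decreasing cardinality.
        var patterns = keys.GroupBy(k => ((h1(k) % n) + n) % n)
                           .OrderByDescending(p => p.Count())
                           .ToList();

        var g = new long[n];          // g table, indexed by h1(k) mod n
        var occupied = new bool[n];   // occupancy of the hash table T

        foreach (var pattern in patterns)
        {
            long q = smallPrimesNotDividingN[rng.Next(smallPrimesNotDividingN.Length)];
            bool placed = false;

            for (long j = 0; j < n && !placed; j++)   // probe values 0, q, 2q, ..., (n-1)q
            {
                long candidate = (j * q) % n;
                var slots = pattern.Select(k => Slot(k, candidate, n, h0, h1, h2)).ToList();

                // Accept only if the slots are distinct and all currently free.
                if (slots.Distinct().Count() == slots.Count && slots.All(s => !occupied[s]))
                {
                    foreach (long s in slots) occupied[s] = true;
                    g[pattern.Key] = candidate;
                    placed = true;
                }
            }

            // Failing to fit a pattern within n trials flags an error; the caller
            // would reinitialize the table and retry with a different probe sequence.
            if (!placed)
                throw new InvalidOperationException("Searching step failed; retry with new random probes.");
        }
        return g;
    }

    // Slot computed with the class of functions quoted earlier, using the
    // candidate value in place of g(h1(k)).
    static long Slot(string key, long candidate, long n,
                     Func<string, long> h0, Func<string, long> h1, Func<string, long> h2)
    {
        long value = h0(key) + h1(key) + 7L * (h2(key) - 1) * candidate;
        return ((value % n) + n) % n;
    }
}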

I hope that you can see the "infinite sequence of independent hash functions" in the class of functions searched, h(k) = h0(k) + g(i) + h2(k)*g(i) where i = h1(k), and that the g table is used to derive an infinite sequence of independent hash functions.

Attached is the detailed and full algorithm; page 82.

Sincerely,
Amjad Daoud

On Mon, Feb 11, 2013 at 4:35 PM, Martin Dietzfelbinger wrote:

Dear Dr Daoud,

thank you for your e-mail and your interest in our work.

I looked at your paper in CACM, 35(1):105-121 again, especially Algorithm 2.
I do not see why it should be identical to our Algorithm 1 (which is not outlined but given in full).
If you do not agree, could you please explain in more detail why you believe our algorithm is "a copy of an algorithm" from your thesis?
(In your thesis, I do not see an infinite sequence of independent hash functions, but only functions h_1, h_2 and g.)

Kind regards
Martin Dietzfelbinger

On 07.02.2013 17:54, Amjad Daoud wrote:

Dear Prof. Martin,

I am contacting you regarding your paper "Hash, displace, and compress", coauthored with Djamal Belazzougui and Fabiano C. Botelho. Algorithm 1, which you outlined, is a copy of an algorithm I developed in my thesis ... described as Algorithm II in the following paper:

"Edward A. Fox, Lenwood S. Heath, Qi Fan Chen and Amjad M.Daoud, Practical minimal perfect hash functions for large databases, CACM,
35(1):105-121"

The paper cites "Hash and displace: Efficient evaluation of minimal perfect hash functions", Workshop on Algorithms and Data Structures (WADS'99), pages 49–54, 1999, which cites our CACM paper but presents an incorrect version, in which buckets are not sorted according to their cardinality in descending order.

I hope that you will give us credit for at least 7 years of work improving Sager's algorithm from O(n^5) to a linear algorithm.

Sincerely,

Amjad