By Ivan Skytte Jørgensen
Copy of blog post I made for Privacore/Findx on 2017-12-04
We have finally gotten around to implementing search widening, starting with Danish.
We are currently working on Danish-specific search widening, meaning that if you search for a word we may include other words such as:
We found a Danish lexicon (in the linguistic sense) covering approximately 50,000 Danish words. Each entry contains detailed information about the word class (part of speech), alternative spellings, inflections, etc. This gives us a good starting point for generating search widenings. Later on we may use it for analysing what the user’s query is about.
The lexicon we found was STO (sprogteknologisk orddatabase) from Center for Sprogteknologi (CST). It is not meant for human consumption but instead for machine use.
The intent is to allow the user to specify whether search widening is allowed and, if so, by how much. The really nerdy users should also be able to specify each type of widening. The search widening is not active in production yet.
Additionally, if you enclose a word in quotation marks "" then Findx will not try to widen that word.
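The quotation-mark rule can be illustrated with a small sketch. This is not Findx's actual query parser; the function name split_query and the (term, widenable) pair format are assumptions made for the example.

```python
import re

def split_query(query):
    """Return (term, widenable) pairs for a query string.

    Hypothetical helper: words inside double quotes are marked as
    not widenable, everything else may be widened.
    """
    terms = []
    # Match either a double-quoted run or a run of non-space characters.
    for match in re.finditer(r'"([^"]*)"|(\S+)', query):
        quoted, bare = match.group(1), match.group(2)
        if quoted is not None:
            for word in quoted.split():
                terms.append((word, False))
        else:
            terms.append((bare, True))
    return terms
```

With this sketch, the query cirklen "Aabenraa" yields one widenable term (cirklen) and one term that is left exactly as typed (Aabenraa).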
We are toying with the idea that when a search returns no matches, or suspiciously few, Findx automatically shows the user the option to widen the search parameters a bit.
The STO gives us alternate spellings for some words, e.g. "cirklen" and "cirkelen". So when you search for "cirklen" Findx knows that "cirkelen" is an alternate spelling and searches for that too.
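A minimal sketch of such a lookup, assuming the alternate spellings have already been extracted from the lexicon into groups of equivalent forms. The table below is hypothetical and holds only the one pair mentioned above; the STO file format is not modelled here.

```python
# Hypothetical in-memory table; real data would be extracted from the lexicon.
SPELLING_GROUPS = [
    {"cirklen", "cirkelen"},
]

# Index every word to the other spellings in its group.
ALTERNATES = {}
for group in SPELLING_GROUPS:
    for word in group:
        ALTERNATES[word] = sorted(group - {word})

def alternate_spellings(word):
    """Return the known alternate spellings of word (empty list if none)."""
    return ALTERNATES.get(word.lower(), [])
```

Because each group is indexed from every member, the lookup works in both directions without storing the pair twice.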
We also hardcoded the rule that some proper nouns (cities, towns, places, persons, …) may use the old written double-a form instead of "å". An example is the southern Danish city Aabenraa. So if you search for "Åbenrå" Findx will automatically look for "Aabenraa" too (and vice versa).
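The double-a rule could look roughly like the sketch below. It is an illustration under assumptions: the function name aa_variants is made up, the input is assumed to be NFC-normalised, and the check that the word really is a proper noun (which the rule above requires) is left to the caller.

```python
def aa_variants(word):
    """Return spellings of word with "å" and "aa" swapped.

    Assumes the caller has already decided that word is a proper noun.
    All-caps "AA" is not handled in this sketch.
    """
    variants = set()
    # New spelling -> old spelling: Åbenrå -> Aabenraa
    old_style = word.replace("Å", "Aa").replace("å", "aa")
    # Old spelling -> new spelling: Aabenraa -> Åbenrå
    new_style = word.replace("Aa", "Å").replace("aa", "å")
    for candidate in (old_style, new_style):
        if candidate != word:
            variants.add(candidate)
    return sorted(variants)
```

Applied blindly to every word this would produce nonsense for ordinary words containing "aa", which is presumably why the rule is restricted to some proper nouns.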
We also hardcoded rules for stripping and adding the acute accent (accent aigu). E.g. the French first name René is often, but not always, written with an accent aigu on the last e. So if you search for "rene" or "bangs alle" Findx knows to also search for "rené" and "bangs allé" respectively, and vice versa.
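As a rough sketch of the accent rule: stripping can be done with Unicode decomposition, and adding is simplified here to "put an acute accent on a final e", which covers the two examples above. The real rules are not spelled out in this post, so the function names and the final-e simplification are assumptions.

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent

def strip_acute(word):
    """Remove acute accents only; other diacritics (e.g. the ring in å) are kept."""
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace(ACUTE, ""))

def add_final_acute(word):
    """Assumed simplification: accent a trailing plain 'e' (rene -> rené)."""
    if word.endswith("e"):
        return word[:-1] + "\u00e9"
    return word

def accent_variants(word):
    """Return the accent-stripped and accent-added forms that differ from word."""
    candidates = {strip_acute(word), add_final_acute(word)}
    candidates.discard(word)
    return sorted(candidates)
```

For a multi-word query such as "bangs alle" the function would be applied per word, so only "alle" gets the variant "allé".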
We also support changing nouns, eg:
Initial tests revealed that changing singular to plural is mostly ok, but plural to singular doesn’t work so well. So the first widening will have a higher weight than the second widening.
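The asymmetric weighting can be sketched as follows. The weight values, the Widening type and the tiny noun table are all hypothetical; the only thing taken from the text above is that singular-to-plural gets a higher weight than plural-to-singular.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Widening:
    term: str
    weight: float

# Hypothetical weights: only their relative order comes from the text.
SINGULAR_TO_PLURAL_WEIGHT = 0.8
PLURAL_TO_SINGULAR_WEIGHT = 0.4

# Hypothetical mini-table; real inflections would come from the lexicon.
PLURALS = {"hus": "huse", "bil": "biler"}
SINGULARS = {plural: singular for singular, plural in PLURALS.items()}

def number_widenings(word):
    """Return weighted singular/plural widenings for a noun."""
    widenings = []
    if word in PLURALS:
        widenings.append(Widening(PLURALS[word], SINGULAR_TO_PLURAL_WEIGHT))
    if word in SINGULARS:
        widenings.append(Widening(SINGULARS[word], PLURAL_TO_SINGULAR_WEIGHT))
    return widenings
```

A ranking function could then multiply a document's score for a widened term by its weight, so that matches on the original word still rank first.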
Currently we only change one past tense into the other. In Danish the imperfect past ("he ate") and the perfect past ("he has eaten") are mostly interchangeable, especially in colloquial form.