By Ivan Skytte Jørgensen
Copy of blog post I made for Privacore/Findx on 2018-04-17
We have been extending some of the indexing and tokenization (what we treat as words) over the past few months. This post sums up the changes so far.
We decompose stylistic ligatures (ĳ/ﬄ/…) into the component letters.
We decompose language-specific ligatures in English (œ and æ) and French (œ).
This means that if a web page has written “encyclopædia”, the “æ” is decomposed and you can find it when searching for either “encyclopedia” or “encyclopaedia”.
More details at indexing ligatures and digraphs.
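To make the decomposition concrete, here is a minimal Python sketch. It is not our actual indexer code: the tables, the function name `ligature_variants`, and the choice of “ae”/“e” and “oe”/“e” as English expansions are assumptions made for illustration, and only lowercase ligatures are handled.

```python
from itertools import product

# Hypothetical table of stylistic ligatures (always decomposed).
STYLISTIC_LIGATURES = {
    "\uFB00": "ff",
    "\uFB01": "fi",
    "\uFB02": "fl",
    "\uFB03": "ffi",
    "\uFB04": "ffl",
    "\u0133": "ij",
}

# Hypothetical per-language expansions (assumed, not the real tables).
LANGUAGE_LIGATURES = {
    "en": {"æ": ["ae", "e"], "œ": ["oe", "e"]},
    "fr": {"œ": ["oe"]},
}


def ligature_variants(word, lang):
    """Return the set of forms under which `word` could be indexed."""
    # Stylistic ligatures are replaced unconditionally.
    word = "".join(STYLISTIC_LIGATURES.get(ch, ch) for ch in word)
    table = LANGUAGE_LIGATURES.get(lang, {})
    # For each character keep the original plus any language-specific expansions.
    options = [[ch] + table.get(ch, []) for ch in word]
    return {"".join(combo) for combo in product(*options)}
```

With these assumed tables, `ligature_variants("encyclopædia", "en")` yields the original spelling plus “encyclopaedia” and “encyclopedia”, so either query would match.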
For words with diacritics we index the word both as-is and with the non-native/optional diacritics removed, depending on language.
This means that “Noël” would be indexed as both “Noël” and “Noel” if the text is German, but only as “Noël” if the text is French.
This requires knowledge of the language, the official orthography, the actual orthography, the common keyboard layouts used, and the regular habits of people using that language. Because of this we have only implemented it for Danish, Norwegian, Swedish and German so far.
We don’t do that for English yet, partially due to Québec, which muddies the waters wrt. which diacritics neighbouring English writers care about.
For all other languages we keep our hands off. But for Danish/Norwegian/Swedish/German you should be able to find useful pages for “buche de noel” even though the pages may have written it using French orthography, “Bûche de Noël”.
More details at indexing diacritics (accents) and language differences.
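The idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `NATIVE_LETTERS` table and the function names are made up for this example, and the real rules also depend on orthography and keyboard habits that a plain lookup table does not capture.

```python
import unicodedata

# Hypothetical table: letters whose diacritics are native to the language
# and must therefore be kept. Languages not listed are left untouched.
NATIVE_LETTERS = {
    "da": set("æøå"),
    "no": set("æøå"),
    "sv": set("åäö"),
    "de": set("äöüß"),
}


def strip_marks(ch):
    """Remove combining marks from a single character (é -> e, û -> u)."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def diacritic_variants(word, lang):
    """Return the as-is form plus the form without non-native diacritics."""
    word = unicodedata.normalize("NFC", word)
    native = NATIVE_LETTERS.get(lang)
    if native is None:
        # Unsupported language: keep our hands off.
        return {word}
    stripped = "".join(
        ch if ch.lower() in native else strip_marks(ch) for ch in word
    )
    return {word, stripped}
```

Under these assumptions a German page containing “Noël” gets both “Noël” and “Noel”, while “Köln” stays as the single form “Köln” because “ö” is native to German.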
We detect <some word> <something that could look like an apostrophe> <the letter s> and index that as a single token. For languages that aren’t English or Dutch we also index <word>+s, because German/Norwegian/… don’t use an apostrophe for the possessive s.
More details at indexing possessive apostrophe s.
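A minimal Python sketch of that rule follows. The set of apostrophe look-alikes, the function name, and the decision to normalise every look-alike to a plain ASCII apostrophe in the token are assumptions for this example, not a description of the real tokenizer.

```python
import re

# Assumed set of characters that "could look like an apostrophe":
# ASCII apostrophe, right single quote, modifier apostrophe, backtick, acute.
APOSTROPHE_LIKE = "'\u2019\u02BC\u0060\u00B4"
POSSESSIVE = re.compile(r"^(\w+)[" + re.escape(APOSTROPHE_LIKE) + r"]s$")


def possessive_tokens(token, lang):
    """Return the tokens to index for a possible possessive-s form."""
    match = POSSESSIVE.match(token)
    if not match:
        return [token]
    word = match.group(1)
    # Index word + apostrophe + s as one token (apostrophe normalised: assumption).
    tokens = [word + "'s"]
    if lang not in ("en", "nl"):
        # Other languages normally write the possessive without an apostrophe.
        tokens.append(word + "s")
    return tokens
```

So a German page that writes “Peter’s” would be findable both as “Peter's” and as “Peters”, while an English page only gets the apostrophe form.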
We index both H₂O and m². The support is not complete for the HTML tags sub/sup with content longer than 1 character. But it’s a start. We don’t treat “m^2” as m-superscript-2. Maybe we should.
More details at indexing superscripts and subscripts.
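One way to picture the single-character case is to fold a one-digit sub/sup tag into the corresponding Unicode character, so that markup and plain-text spellings end up as the same token. The Python sketch below does exactly that; it is an assumed approach for illustration (the function name and the regex are hypothetical) and it only covers digits.

```python
import re

# Unicode subscript digits ₀..₉ and superscript digits ⁰..⁹.
SUBSCRIPTS = str.maketrans(
    "0123456789",
    "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089",
)
SUPERSCRIPTS = str.maketrans(
    "0123456789",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
)

# Only tags with exactly one digit of content are handled.
SINGLE_DIGIT_TAG = re.compile(r"<(sub|sup)>([0-9])</\1>")


def fold_sub_sup(html):
    """Replace <sub>d</sub> / <sup>d</sup> with the Unicode sub/superscript digit."""
    def replace(match):
        table = SUBSCRIPTS if match.group(1) == "sub" else SUPERSCRIPTS
        return match.group(2).translate(table)
    return SINGLE_DIGIT_TAG.sub(replace, html)
```

Tags with longer content, such as a two-digit exponent, are left untouched by this sketch, which mirrors the limitation mentioned above.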
A web site may write its phone number as “+45 75 66 10 02”, but when a user searches for it (reverse lookup) they would typically enter it as “75661002”. Since we don’t support searches for “something possibly with white-space and punctuation sprinkled in between”, that is a problem. Another problem is that most countries have an entity controlling the numbering plan and number allocation, and it typically also recommends a format, which almost without fail turns out to be the least used format in the real world. German telephone numbers, for example, are 10-11 digits long, and it is recommended to write them in one of 4 formats.
So what were the first two formats I saw when I checked photos from German streets? 040/999.999-999 and 49(0)40-999999-999.
We recognize and index telephone numbers from Denmark, Norway, Sweden and Germany. The other countries will be added as I have time to implement some of the tricky recognition. Don’t get me started on Italian telephone numbers…
There must be some nerds / telephone number spotters / tele-numerologists who maintain the Wikipedia articles on telephone number systems – all the European numbering systems are documented in good detail.
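For the simplest case, a Danish number, the normalisation can be sketched as below. This is a deliberately naive illustration: it assumes 8-digit national numbers with an optional “+45” or “0045” prefix, the function name and the choice of emitted tokens are hypothetical, and the tricky part (recognising a phone number in running text, or handling the German formats) is not attempted.

```python
import re


def danish_phone_tokens(text):
    """Return index tokens for a string that looks like a Danish phone number."""
    # Drop white-space and the punctuation commonly sprinkled into numbers.
    compact = re.sub(r"[\s().\-/]", "", text)
    match = re.fullmatch(r"(?:\+45|0045)?([0-9]{8})", compact)
    if not match:
        return []
    national = match.group(1)
    # Index both the bare national number and the international form (assumption).
    return [national, "+45" + national]
```

With this, “+45 75 66 10 02” and “75 66 10 02” both produce the token “75661002”, which is what a user doing a reverse lookup would type.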
We detect several hyphen-looking Unicode codepoints (hyphen-minus, soft hyphen, real hyphen, …) and index the words slightly differently. E.g. the text Newcastle-Upon-Tyne is indexed as a single entity too.
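A rough Python sketch of the hyphen handling: the list of codepoints, the function name, and the treatment of the soft hyphen as an invisible break hint that is simply removed are assumptions for this example.

```python
SOFT_HYPHEN = "\u00AD"
# Hyphen-minus, hyphen, non-breaking hyphen (assumed list, not exhaustive).
HYPHENS = ("\u002D", "\u2010", "\u2011")


def hyphen_tokens(text):
    """Return the parts of a hyphenated word plus the joined single entity."""
    # A soft hyphen only marks a possible line break, so drop it (assumption).
    text = text.replace(SOFT_HYPHEN, "")
    # Fold all hyphen look-alikes into the ASCII hyphen-minus.
    for hyphen in HYPHENS[1:]:
        text = text.replace(hyphen, "-")
    parts = [part for part in text.split("-") if part]
    if len(parts) < 2:
        return parts
    # Index the individual words and the whole hyphenated form.
    return parts + ["-".join(parts)]
```

For “Newcastle-Upon-Tyne” this gives the three words and the full hyphenated form, regardless of which hyphen codepoint the page used.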
We treat some abbreviations that use a slash differently wrt. bigrams. Example abbreviations: km/h, m/s², A/S.
We also rewrite ampersand to the language’s equivalent of “and” in Danish/English/German.
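The ampersand rewrite is simple enough to show in full. The sketch below assumes the text has already been split into tokens; the table and function name are hypothetical.

```python
# The word for "and" in the three languages mentioned above.
AND_WORD = {"da": "og", "en": "and", "de": "und"}


def rewrite_ampersand(tokens, lang):
    """Replace a stand-alone '&' token with the language's word for 'and'."""
    word = AND_WORD.get(lang)
    if word is None:
        # Unsupported language: leave the tokens as they are.
        return list(tokens)
    return [word if token == "&" else token for token in tokens]
```

This way a search for “Simon and Garfunkel” can match an English page that wrote “Simon & Garfunkel”.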
We recognize some special words such as “C++” or “F#”.
Some additional work is pending (beyond extending the above cases to all the supported languages and countries):
We need to activate this feature. But when that is done and indexes have been re-generated for affected pages it should be easier for you to find Bûche de Noël, German phone numbers, greengrocer’s apostrophe, kerfuﬄes, and much more.