By Ivan Skytte Jørgensen
Copy of blog post I made for Privacore/Findx on 2018-02-12
What challenges arise when we encounter words with diacritics?
As a starting point we index documents as they are written and we search for words as you write them. This gives you exact matches as you would expect. But in some cases that would leave out relevant results. This post series will explore some cases where what you search for may not be how it is written. A follow-up post will describe what we do about it.
Diacritics are those little marks that can be added to a letter to modify it. Most people have heard about the acute accent (e.g. "é") and the grave accent ("è"). If you have studied French then you also know about the circumflex (ô), diaeresis (ë) and cedilla (ça). If you have studied Spanish then you know about the tilde (ñ). But wait! There are more! Many, many more. Such as the caron (Š), macron (ā), breve (ă), hook (ả), and several ancient ones. The diacritics change the way a letter is pronounced.
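To make the idea of a mark added to a letter concrete, here is a minimal Python sketch using the standard `unicodedata` module. It decomposes a character into its base letter and the Unicode names of its combining marks. The helper name `describe_marks` is made up for this illustration and is not part of our search engine.

```python
import unicodedata

def describe_marks(ch):
    """Return (base letter, [Unicode names of the combining marks]) for one character."""
    # NFD splits a precomposed character into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", ch)
    base = decomposed[0]
    marks = [unicodedata.name(c) for c in decomposed[1:] if unicodedata.combining(c)]
    return base, marks

for ch in "éèôëçñšāăả":
    print(ch, describe_marks(ch))
```

For example, "é" comes out as the base letter "e" plus COMBINING ACUTE ACCENT, and "š" as "s" plus COMBINING CARON.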
The French car maker Citroën writes its name natively with a diaeresis over the e because that is how it is pronounced. Leaving out the diaeresis would make it a different word. When French users search for Citroën they mostly write it with the diaeresis because they know that is how it is spelled. When non-French users search for Citroën they mostly don’t use the diaeresis. If the official Citroën website uses the diaeresis consistently, then an exact search for the unaccented spelling would not match it. For commonly known brands and names users still find the website because there may be widespread links to it whose link text omits the diaeresis. There can be numerous reasons why the non-French users don’t use the diaeresis, for example:
The story about Škoda is similar to that of Citroën. Their websites consistently use the caron over the s. However, the problem is a little worse because some users are aware that something is going on with the s, but they don’t know exactly what it is, and even if they did they would have no way to produce it with their keyboard (the caron is only available on some Slavic and Baltic keyboard layouts, and a few more). So what do users do? They type a plain s.
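The mismatch described for Citroën and Škoda can be reproduced with a toy exact-match index. This is only a sketch: the function names, the sample documents, and the whitespace tokenization with lowercasing are all assumptions made for the illustration, not a description of how our index is built.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased token to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, word):
    """Exact lookup: the query word must equal an indexed token."""
    return index.get(word.lower(), set())

# Hypothetical documents that spell the brand names with their diacritics.
docs = {1: "Citroën C3 official site", 2: "Škoda Octavia official site"}
index = build_index(docs)

print(search(index, "Citroën"))  # {1}
print(search(index, "Citroen"))  # set()
print(search(index, "Skoda"))    # set()
```

The user who types a plain e or a plain s gets nothing back, even though the relevant document is in the index.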
Only Icelanders or the very observant reader would notice that the third i letter actually has an acute accent instead of a dot. Granted, it is an Old Norse word but you cannot be sure that it has been written like that in all texts, or that the user is using the same orthography as the documents he is looking for.
When searching for a good ragù recipe I discovered that something odd was going on with the grave accent over the u. Sometimes it was there. Sometimes it was missing. And sometimes it had degraded into an apostrophe after the u. Digging deeper I found out what might be the cause. None of the Italian keyboard layouts (yes, all three of them) give access to uppercase vowels with diacritics. So if you are Italian and want to write "È una buona idea" you can’t. You have to omit the grave accent or write it as an apostrophe (or similar) after the capital E. That problem appears to sometimes carry over to lowercase for some of the rarer accents, such as u-with-grave-accent, which I saw written as u-with-apostrophe-after ("ragu’ alla bolognese").
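One way to picture the apostrophe substitution is a small function that, given a token ending in a vowel plus an apostrophe, lists the accented spellings it might stand for. This is a hedged sketch only: `accent_candidates` and the set of apostrophe-like characters are assumptions for illustration, and the rule over-generates, since Italian also has legitimate words ending in an apostrophe, such as "po'".

```python
import unicodedata

APOSTROPHES = ("'", "\u2019", "`")   # assumed stand-ins for a missing accent
VOWELS = "aeiouAEIOU"
GRAVE, ACUTE = "\u0300", "\u0301"    # combining grave and acute accents

def accent_candidates(token):
    """Return spellings the token may stand for if its trailing apostrophe replaces an accent."""
    if len(token) >= 2 and token[-1] in APOSTROPHES and token[-2] in VOWELS:
        stem = token[:-1]
        # NFC recombines the vowel and the combining mark into one character.
        return [unicodedata.normalize("NFC", stem + mark) for mark in (GRAVE, ACUTE)]
    return [token]

print(accent_candidates("ragu\u2019"))  # ['ragù', 'ragú']
print(accent_candidates("E'"))          # ['È', 'É']
print(accent_candidates("idea"))        # ['idea']
```

A search engine could try such candidates alongside the literal token, but whether that is a good idea depends on how many false matches it introduces.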
Italian is not the only language where the official orthography has a mismatch with the common keyboard layout.
A coworker searched for bûche de Noël but didn’t find any good recipes because he typed "buche de noel". The circumflex and diaeresis are not used in Danish so that is understandable. It seems that most recipes for bûche de Noël are written with the diacritics, presumably because it is a very French thing and people writing about it have a French cookbook and know how to write the diacritics.
In English the diaeresis is optional but can be used for clarifying the pronunciation. This is typically seen in names that have two vowels that would normally be pronounced as a diphthong. The magazine The New Yorker uses it in many words, e.g. coöperation.
How you spell/type a word may not be how other people do it. For diacritics this is particularly widespread when crossing borders or locales. And diacritics cannot just be ignored because they often change the meaning of words.
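To see why simply ignoring diacritics is not a safe fix, consider a naive folding function that strips every combining mark. The sketch below is an illustration of that naive approach, not what we do; `strip_marks` is a name made up here. It rescues the "buche de noel" query, but it also collapses the French words "pêche" and "péché", which mean different things, into the same string.

```python
import unicodedata

def strip_marks(text):
    """Naively remove all combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_marks("bûche de Noël"))                  # buche de Noel
print(strip_marks("Škoda"))                          # Skoda
print(strip_marks("pêche") == strip_marks("péché"))  # True
```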