By Ivan Skytte Jørgensen
Copy of blog post I made for Privacore/Findx on 2018-07-05
We support multiple languages, but how (and which are they)? Are Czech results useful to Slovaks? Can Italians read Spanish?
The Internet is large. We did consider crawling the entire Internet but quickly decided against it due to space limitations and because it would require at least one person per language to verify result quality, spam, inaccurate classification, etc. The search engine we run on is also currently limited to 64 languages, which is something we will fix sometime in the future.
So we limit our web crawler to the languages spoken in Europe. But even then we had to leave some out, e.g. Frisian and Sami, as well as the distinction between Nynorsk/Bokmål, Romanian/Moldavian, and Italian/Sardinian/Sicilian.
Some documents specify which language they are in. Some don’t. Some lie. For instance, it is not uncommon to see an HTML document that declares <html lang="en"> while the actual content is in something different, e.g. Norwegian, so what the documents claim cannot be trusted entirely. The same applies to any Content-Language HTTP header. We do use them as a hint though, together with the inherent hint from ccTLDs (e.g. .de). But we mainly use some 3rd-party libraries to detect the document language.
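The way declared hints and detector output can be combined is sketched below in Python. This is only an illustration under stated assumptions: the function name guess_document_language, the HINT_BONUS value, and the tiny ccTLD table are all hypothetical, and the detector result is passed in as a plain dict rather than calling any particular library.

```python
# Hypothetical sketch: declared hints only nudge the detector's scores.
HINT_BONUS = 0.1  # assumed value, small so that a lying hint cannot win alone

# Tiny illustrative subset of ccTLD -> language, not a complete table
CCTLD_LANG = {"de": "de", "no": "no", "dk": "da"}


def guess_document_language(detected, html_lang=None, content_language=None, host=""):
    """Pick a language from detector scores plus weak declared hints.

    detected: dict of language code -> confidence from some detector library.
    html_lang / content_language: declared values, possibly wrong or missing.
    host: the document's host name, used only for its ccTLD.
    """
    scores = dict(detected)
    tld = host.rsplit(".", 1)[-1].lower()
    hints = [html_lang, content_language, CCTLD_LANG.get(tld)]
    for hint in hints:
        if not hint:
            continue
        # Keep only the primary subtag: "nl-NL" -> "nl"
        lang = hint.strip().lower().split("-")[0]
        scores[lang] = scores.get(lang, 0.0) + HINT_BONUS
    if not scores:
        return None
    return max(scores, key=scores.get)


# A page that claims English but is detected as Norwegian on a .no host
print(guess_document_language({"no": 0.9, "en": 0.05}, html_lang="en", host="example.no"))
```

With these assumed numbers the false lang="en" claim is outvoted by the detector, while two agreeing hints can still settle a close call between similar languages.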
Detecting which language the user’s query is in can sometimes be easy, as in "rindfleisch", which is only used in one language. Sometimes it is difficult, as in "computer", which is used in many languages. Web browsers can tell us which languages the user allegedly understands (in the Accept-Language HTTP request header) but that cannot be trusted entirely because some browsers lie by including extra languages the user has not chosen.
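For readers unfamiliar with the Accept-Language header, here is a minimal Python sketch of turning it into a prioritized list of languages. The function name parse_accept_language is an assumption made for this example, and the handling of malformed values is one possible choice, not a description of what Findx does.

```python
def parse_accept_language(header):
    """Return (language, q) pairs, most preferred first.

    Region subtags are dropped ("fr-BE" -> "fr") and the wildcard "*" is ignored.
    """
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        q = 1.0  # the quality value defaults to 1 when absent
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: treat a malformed q-value as unwanted
        lang = tag.strip().lower().split("-")[0]
        if lang and lang != "*" and q > 0:
            prefs.append((lang, q))
    # sorted() is stable, so header order is kept among equal q-values
    prefs.sort(key=lambda pair: pair[1], reverse=True)
    # Collapse duplicates such as "fr-BE" and "fr", keeping the highest q
    best = {}
    for lang, q in prefs:
        best.setdefault(lang, q)
    return list(best.items())


print(parse_accept_language("fr-BE,fr;q=0.9,en;q=0.7,it;q=0.3"))
```

The resulting order is only a hint, for the reason given above: the browser may list languages the user never chose.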
So instead of trying hard to detect exactly which language a query is in (and failing for multi-lingual words) we try instead to detect which language results would be useful to the user.
We take into account:
For instance, if we have:
then we adjust the language weights accordingly. The TLD indicates French or Dutch so the weights for those languages are adjusted up. The Geo-IP indicates French so that is adjusted further up. The Geo-IP also indicates (weakly) that the user has been taught English as a foreign language so the English weight is also adjusted up. The Accept-Language indicates French, English, and Italian so those weights are adjusted up but not equally – the languages are usually prioritized in the header. Additionally we consider language intelligibility / sprachbunds / dialect continuums in the Accept-Language header. French and English have no obvious friends, but Italian is approximately 30% intelligible with Spanish so the weight for Spanish is adjusted up a bit. The query indicates German so the weight for that is adjusted up. The query letters include "k" and "w" so the weights for Italian and Finnish are adjusted down. The resulting language weights / usefulness probability could be:
These weights are then factored in when we rank the results. So the presumably French-speaking Belgian searching for "kraftwerk" would primarily get results about the German band in French, but also in English and Dutch if the results were good enough.
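The adjustments described above can be sketched roughly as follows. This is not the code from query_language.py or from the production ranker: every table, factor, and name here (language_weights, rank, the "BE/Wallonia" region key, the 1.5 and 0.5 multipliers) is a made-up illustration chosen only to reproduce the direction of each adjustment.

```python
# All tables and numeric factors are hypothetical illustrations.
LANGS = ["fr", "nl", "en", "it", "es", "de", "fi"]
TLD_LANGS = {"be": ["fr", "nl"]}                      # languages hinted by a ccTLD
GEOIP_LANGS = {"BE/Wallonia": {"fr": 2.0, "en": 1.2}}  # strong local, weak school language
INTELLIGIBLE = {"it": {"es": 0.3}}                    # Italian readers partly follow Spanish
RARE_LETTERS = {"it": "kw", "fi": "w"}                # letters that count against a language


def language_weights(tld, geoip_region, accept_langs, query_lang, query):
    """Return normalized usefulness weights per language (sums to 1)."""
    weights = {lang: 1.0 for lang in LANGS}
    for lang in TLD_LANGS.get(tld, []):
        weights[lang] *= 1.5
    for lang, factor in GEOIP_LANGS.get(geoip_region, {}).items():
        weights[lang] *= factor
    for position, lang in enumerate(accept_langs):
        if lang not in weights:
            continue
        boost = 1.0 + 1.0 / (position + 1)  # earlier header entries count more
        weights[lang] *= boost
        # Pass a share of the boost on to mutually intelligible languages
        for friend, share in INTELLIGIBLE.get(lang, {}).items():
            weights[friend] *= 1.0 + (boost - 1.0) * share
    if query_lang in weights:
        weights[query_lang] *= 1.5
    for lang, letters in RARE_LETTERS.items():
        if any(ch in letters for ch in query.lower()):
            weights[lang] *= 0.5
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


def rank(results, weights):
    """Order (url, lang, score) tuples by score scaled with the language weight."""
    return sorted(results, key=lambda r: r[2] * weights.get(r[1], 0.0), reverse=True)


w = language_weights("be", "BE/Wallonia", ["fr", "en", "it"], "de", "kraftwerk")
for lang, weight in sorted(w.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang}: {weight:.2f}")
```

Multiplying factors and normalizing at the end is just one simple way to combine independent hints; an additive or log-odds formulation would work equally well for the purpose of this illustration.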
You can override this mechanism in the Findx frontend by going to findx→settings→search result language. We plan on giving full access to tweaking the language weights (we’re working on it), but for now we do all the above behind the scenes unless you explicitly override it.
We are not the only search engine that doesn’t treat multilingualism as a freak of nature – DuckDuckGo has multiple interface languages, Bing allows you to select multiple result languages, and Qwant supports Corsican and Breton in its interface.
But as far as we know no other search engine considers language intelligibility or non-official country languages when ranking results. They usually either pick 1-2 languages or none at all. They don’t consider that if a user is doing an obviously Czech query then perhaps Slovak results might be useful too; or that Maltese and Swedes are more likely to understand English than Poles (source: CEMFI:do you speak English?); or that the first foreign language taught in school seems to stick better in Slovenia than in Ireland (source: Eurostat:foreign language skills, Eurostat:database).
Other sources we used when constructing the current weights and interaction:
If you are curious and know the programming language Python then you can play around with the prototype I used for testing out ideas: query_language.py (alternative link). The prototype is close to the final algorithm we use. The differences are:
./query_language.py dk da,en,de kanelsnegl
./query_language.py dk da,en,de ålegilde
./query_language.py de de,en currywurst
./query_language.py de de,en,da currywurst
./query_language.py de de,en Maßstab
./query_language.py us en What Is the Airspeed Velocity of an Unladen Swallow
When users don’t specify any language hints on findx.com/findx.de/… then we should most of the time do the right thing for them. For monolingual countries/users it shouldn’t matter. For multilingual countries you should sometimes see a difference from other approaches that only consider a single language useful.
Lots of tweaking, and possibly a more fine-grained estimation on the query words, because some queries use words that are exclusive to one language according to dictionaries and CLD2/CLD3, while the phrase as a whole is international. E.g. "kerry blue terrier" is clearly English, except it isn’t because it is the name of a dog breed. We intend to do something better automatically but for now you have to select the result language yourself.
There is also some language intelligibility that is missing/unused, namely Finnish/Estonian and Latvian/Lithuanian.