By Ivan Skytte Jørgensen
Copy of blog post I made for Privacore/Findx on 2017-07-29
Just a short note: We currently don’t distinguish between Bokmål and Nynorsk.
We inherited a code base (https://github.com/privacore/open-source-search-engine) which had set aside 6 bits (64 values) for language ids. We recently discovered that it only had one code for Norwegian. But Norwegian comes in two versions: Bokmål and Nynorsk, which are similar but do have differences. Webmasters are often not careful to insert the correct lang= tag in their web pages, so we use a library for language detection (CLD2). Regarding its Norwegian detection it says "… are detected separately (but not robustly)". So we currently treat those two variants as the same.
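As a rough illustration of what "treating the two variants as the same" means, here is a minimal sketch in C++. It is not the actual code from the repository: the enum values, the names LangId, langIdFromIsoCode and fitsInSixBits are all made up for this example. It only assumes that a detected or declared language arrives as an ISO 639-1 code ("no", "nb" or "nn" for Norwegian) and that ids must fit in a 6-bit field.

```cpp
#include <cstdint>
#include <string>

// Hypothetical language ids; the real values in the code base are not shown here.
enum LangId : std::uint8_t {
    kLangUnknown   = 0,
    kLangEnglish   = 1,
    kLangNorwegian = 2,
    kLangIdLimit   = 64  // a 6-bit field can hold ids 0..63
};

// Map an ISO 639-1 code to a language id.
// "no" (Norwegian), "nb" (Bokmål) and "nn" (Nynorsk) are folded into one id.
inline LangId langIdFromIsoCode(const std::string &code) {
    if (code == "en") return kLangEnglish;
    if (code == "no" || code == "nb" || code == "nn") return kLangNorwegian;
    return kLangUnknown;
}

// True if the id can be stored in the 6-bit language field.
inline bool fitsInSixBits(LangId id) {
    return id < kLangIdLimit;
}
```

With a mapping like this, a page tagged lang="nn" and a page detected as Bokmål end up with the same id, so they cannot be told apart later when filtering search results.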
I asked my Norwegian friend if this would be a major problem. His answer was that it was acceptable to lump the search results together, but it would be nice to be able to filter/distinguish them.
So we have put this problem on the todo-list, but we will not solve it until we reorganize our database. And even then it is not going to be perfect.