By Ivan Skytte Jørgensen

Copy of blog post I made for Privacore/Findx on 2017-07-29

Norwegian — Bokmål or Nynorsk

Just a short note: We currently don’t distinguish between Bokmål and Nynorsk.

We inherited a code base (https://github.com/privacore/open-source-search-engine) which had set aside 6 bits (64 values) for language ids. We recently discovered that it only had one code for Norwegian. But Norwegian comes in two written versions, Bokmål and Nynorsk, which are similar but do have differences. Webmasters often aren't careful to set the correct lang= attribute in their web pages, so we use a library for language detection (CLD2). Regarding its Norwegian detection it says "… are detected separately (but not robustly)". So we currently treat those two variants as the same.
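To make the idea of treating both variants as the same concrete, here is a minimal sketch in Python of how detected language codes could be folded into a single 6-bit id. It is not taken from the actual code base: the function name lang_id_from_code, the constants, and the id values are all hypothetical and chosen only for illustration. The only facts assumed are the ISO 639-1 codes "no" (Norwegian), "nb" (Bokmål) and "nn" (Nynorsk).

```python
# Hypothetical sketch: these names and id values are illustrative only
# and do not come from the search engine's real language table.
LANG_ID_BITS = 6
MAX_LANG_IDS = 1 << LANG_ID_BITS  # 6 bits give 64 possible values

LANG_UNKNOWN = 0
LANG_ENGLISH = 1
LANG_NORWEGIAN = 2  # one shared id for both written variants

# ISO 639-1 codes: "no" (Norwegian), "nb" (Bokmål), "nn" (Nynorsk).
_CODE_TO_ID = {
    "en": LANG_ENGLISH,
    "no": LANG_NORWEGIAN,
    "nb": LANG_NORWEGIAN,
    "nn": LANG_NORWEGIAN,
}


def lang_id_from_code(code: str) -> int:
    """Map a language code to a 6-bit id, folding the Norwegian variants."""
    # Keep only the primary subtag, so "nn-NO" is handled like "nn".
    primary = code.strip().lower().split("-")[0]
    lang_id = _CODE_TO_ID.get(primary, LANG_UNKNOWN)
    # Every id must fit in the 6 bits reserved for it.
    assert 0 <= lang_id < MAX_LANG_IDS
    return lang_id
```

With a mapping like this, a page tagged or detected as "nb" and one detected as "nn" end up with the same id, so a search restricted to Norwegian returns both.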

I asked a Norwegian friend whether this would be a major problem. His answer was that it was acceptable to lump the search results together, but that it would be nice to be able to filter or distinguish them.

So we have put this problem on the to-do list, but we will only solve it when we reorganize our database. And it is not going to be perfect.