By Ivan Skytte Jørgensen
Copy of blog post I made for Privacore/Findx on 2018-03-05
The apostrophe is normally used for the English possessive-s, but there are anomalies.
In English the apostrophe is used for the possessive s, such as
What does that have to do with search engines? Two things.
Our search engine (like most other search engines) indexes words, not strings. So when we index “John’s cat” we split the text into tokens:
and index that. In addition we also index word pairs/bigrams:
For texts such as “my red car”, “the little mermaid” or “quick guide to handling cats” this works really well for improving search precision and result quality. It doesn’t work so well for the possessive-apostrophe-s case. So we ignore the apostrophe and also index:
And search precision goes up because the tokens “john” and “s” are strongly connected and we consider them a single word.
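The indexing steps described above can be sketched roughly as follows. This is a minimal illustration, not our production tokenizer: the function names, the token regex, and the exact form of the extra joined term (“johns”) are assumptions made for the example.

```python
import re

# Assumption: a token is a maximal run of letters/digits; anything else separates tokens.
TOKEN_RE = re.compile(r"[^\W_]+")

# Assumption: only U+0027 and U+2019 are treated as apostrophes in this sketch.
POSSESSIVE_RE = re.compile(r"([^\W_]+)['\u2019]s\b", re.IGNORECASE)


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]


def bigrams(tokens):
    """Return adjacent word pairs as space-joined strings."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def joined_possessives(text):
    """Return extra terms where the apostrophe is ignored and word + s are fused."""
    return [(m.group(1) + "s").lower() for m in POSSESSIVE_RE.finditer(text)]


text = "John\u2019s cat"
tokens = tokenize(text)
print(tokens)                    # word tokens
print(bigrams(tokens))           # word pairs
print(joined_possessives(text))  # extra term with the apostrophe ignored
```

The point of the last function is that the word and its “s” end up indexed as one unit in addition to the plain tokens and bigrams.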
This is not limited to English. The …s suffix is used in other Common-Germanic languages (Dutch/German/Norwegian/…) but the orthography in those languages does not use the apostrophe (e.g. they just use plain “Svends kat”). However, the possessive apostrophe is seen in some company names, signs, and informal text. Whether that is unofficial orthography, anglophone influence or bad sign makers – we don’t judge. We just deal with it.
The blotch between the main noun and the possessive-s is normally an apostrophe. But our crawler also encounters other signs used in place of the apostrophe:
Sometimes it looks like the result of people unfamiliar with the apostrophe or a too-helpful word processing program. Sometimes it is a mystery how a particular not-apostrophe codepoint got involved. But the meaning is still clear. I looked through the relevant blocks in Unicode and selected the codepoints that, from a distance, visually look like an apostrophe.
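A sketch of how such look-alike codepoints could be folded into a plain apostrophe before tokenizing. The candidate set below is an illustrative assumption picked for this example, not the actual list we selected, and the function name is hypothetical. The replacement is deliberately narrow: it only fires when the mark sits between a letter or digit and a word-final “s”, so backticks and primes elsewhere are left alone.

```python
import re

# Illustrative candidates only (an assumption, not the list used in production).
APOSTROPHE_LIKE = (
    "\u0027"  # APOSTROPHE
    "\u0060"  # GRAVE ACCENT
    "\u00b4"  # ACUTE ACCENT
    "\u2018"  # LEFT SINGLE QUOTATION MARK
    "\u2019"  # RIGHT SINGLE QUOTATION MARK
    "\u02bc"  # MODIFIER LETTER APOSTROPHE
    "\u02b9"  # MODIFIER LETTER PRIME
    "\u2032"  # PRIME
    "\uff07"  # FULLWIDTH APOSTROPHE
)

# Match a candidate mark only between a letter/digit and a word-final "s".
POSSESSIVE_MARK_RE = re.compile(
    r"(?<=[^\W_])[" + re.escape(APOSTROPHE_LIKE) + r"](?=s\b)",
    re.IGNORECASE,
)


def normalize_possessive_mark(text):
    """Replace apostrophe look-alikes in the word+mark+s position with U+0027."""
    return POSSESSIVE_MARK_RE.sub("'", text)


for sample in ("John\u2019s cat", "John`s cat", "John\u00b4s cat", "`code`"):
    print(normalize_possessive_mark(sample))
```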
In English and in a few other languages the apostrophe is also used for showing contractions, e.g.
where the “’s” is a contraction of “has” or “is”. Distinguishing between possessive and contraction would require NLP and we don’t do that. When we get something similar to STO for English we could make more precise bigrams.
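To illustrate why this needs more than pattern matching: a purely surface-level rule sees exactly the same thing in both cases. The sketch below uses a hypothetical helper and made-up sample sentences; it is not how our indexer is implemented.

```python
import re

# Assumption: only U+0027 and U+2019 count as apostrophes in this sketch.
S_SUFFIX_RE = re.compile(r"\b([^\W_]+)['\u2019]s\b", re.IGNORECASE)


def s_suffix_hosts(text):
    """Return the lowercase words that carry an apostrophe-s suffix."""
    return [m.group(1).lower() for m in S_SUFFIX_RE.finditer(text)]


possessive = s_suffix_hosts("John's cat")         # possessive
contraction = s_suffix_hosts("John's been away")  # contraction of "has"

# Both calls return the same host word, so the pattern alone cannot tell
# the two readings apart.
print(possessive == contraction)
```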