Do What I Mean! Second-Try Suggestions for Wrong-Keyboard and Transliterated Search

What The Keyboard?

People who regularly use two different writing systems—¹like the Cyrillic alphabet for Russian and the Latin alphabet for English—will sometimes accidentally type with the wrong one, especially on a laptop or desktop. They might type Dbrbgtlbz when they meant Википедия (“Wikipedia”), or type Цшлшзувшф (approximately “Tsshlshzuvshf”) when they meant to type Wikipedia. Obviously, this is a recoverable error like any other typo: delete the mistake, switch to the right keyboard, and try again.

Notes From the DWIM Dim Past

However, this kind of thing happens often enough on Russian and Hebrew Wikipedia that, many years ago, the users there created gadgets to ameliorate the problem; the Hebrew original was given the hackerishly humorous name DWIM, an initialism for “do what I mean”. Every time the UI would show as-you-type suggestions, the gadget would check if there were fewer than 10 suggestions (the maximum that can be shown), and if so it would map back and forth between Cyrillic and Latin (or Hebrew and Latin, depending on the wiki), get additional suggestions for the transformed query, if any, and append them to the original suggestion list.
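The gadget logic described above can be sketched roughly as follows. This is a minimal Python illustration, not the gadgets' own JavaScript; `fetch` (returns title suggestions for a query) and `remap` (swaps a query between keyboard layouts) are hypothetical stand-ins for whatever the gadgets actually called.

```python
from typing import Callable, List

MAX_SUGGESTIONS = 10  # the maximum number of suggestions that can be shown


def second_try_suggestions(
    query: str,
    fetch: Callable[[str], List[str]],  # hypothetical: query -> suggested titles
    remap: Callable[[str], str],        # hypothetical: wrong-keyboard remapping
) -> List[str]:
    """Return the normal suggestions, topped up with second-try suggestions."""
    suggestions = list(fetch(query))[:MAX_SUGGESTIONS]
    if len(suggestions) >= MAX_SUGGESTIONS:
        return suggestions  # no room left for second-try suggestions

    remapped = remap(query)
    if remapped == query:
        return suggestions  # nothing to retry

    # Append second-try suggestions after the originals, skipping duplicates.
    for title in fetch(remapped):
        if len(suggestions) >= MAX_SUGGESTIONS:
            break
        if title not in suggestions:
            suggestions.append(title)
    return suggestions
```

Appending rather than interleaving means the second-try results never displace a normal suggestion.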

The DWIM gadgets worked well for a long time, but eventually changes to various on-wiki UI libraries had the side effect of eliminating the JavaScript hooks they needed to function, and DWIM was no more.

I’d been tracking “wrong-keyboard” queries for a long time—they can be up to 1%² of fulltext queries on Russian Wikipedia—and the Search Team had been thinking about adding “wrong-keyboard Russian” to our cross-wiki language-detection tools.³ Unfortunately, integrating wrong-keyboard detection into cross-wiki language detection requires additional data workflows that we never managed to work on.

After DWIM had been offline for a while and didn’t seem likely to come back, we decided to recreate it on the search backend for as-you-type suggestions. We also realized the same infrastructure could potentially be used for transliterated searches, too!

Transliteration / transliteratsia / Lipyantaran

For various reasons—that surely vary by language, country, device, or even by individual—a lot of people don’t (or maybe even can’t) type in their native script when searching Wikipedia in their language. There has been a long-standing task to allow searchers on Georgian Wikipedia (a.k.a. ვიკიპედია) to search by typing in Latin or Cyrillic—not in English or Russian, to be clear, but in Georgian, transliterated into the Latin or Cyrillic alphabets. I also filed a task a while back to support searching in Romanized⁴ Hindi after discovering that part of the reason that so many searches on Hindi Wikipedia (a.k.a. विकिपीडिया) get no results is that many seem to be in Hindi, but written in the Latin alphabet.

The DWIM second-try model of transforming a query and adding additional results seemed like it could be a good fit for transliteration, too. There were lots of fun languagey complications—as there always are—but overall it did indeed work a lot of the time.

So, you can now type transliteratsia in the Georgian Wikipedia search box and get a suggestion for ტრანსლიტერაცია (“transliteration”), or Lipyantaran in the Hindi Wikipedia search box and get a suggestion for लिप्यंतरण (also “transliteration”).

By the Numbers

We originally enabled DWIM-like wrong-keyboard second-try remapping⁵ for Hebrew- and Russian-language wikis, then added Latin and Cyrillic transliteration remapping for Georgian, and finally Latin transliteration remapping for Hindi.

The usage patterns for as-you-type suggestions vary by language and wiki—searchers on some wikis more often ignore the suggestions and go to fulltext search results or the default page, while searchers on others more often click on suggestions.

To keep things simple we’re going to just look at two stats for second-try suggestions:

As a percentage of all “clicks” in the search box at the top of the wiki page. These include hitting return after typing or clicking the “Search” button, clicking “Search for pages containing [your query]” after the suggestions, or clicking on a suggestion.
As a percentage of suggestion clicks, which is limited to just the presented as-you-type suggestions, whether the normal suggestions or the second-try suggestions.

Keep in mind that not all queries get second-try suggestions. There could be too many regular suggestions to have room for second-try suggestions, or there might be no second-try suggestions for a given query. Also, very small projects are excluded from the list because of data sparsity over the sample period.

Without further ado:

Wiki	Second-Try Type	% suggestion clicks	% all clicks
hewiki	DWIM	3.4%	1.8%
hewiktionary	DWIM	1.4%	0.6%
hewikisource	DWIM	2.2%	0.5%

ruwiki	DWIM	3.8%	2.6%
ruwiktionary	DWIM	1.3%	0.7%
ruwikivoyage	DWIM	4.8%	1.8%

kawiki	Translit	4.1%	1.6%

hiwiki	Translit	60.1%	16.3%
hiwiktionary	Translit	10.3%	5.0%

For Hebrew (he) and Russian (ru), the Wikipedia and Wiktionary results are surprisingly similar. Wiktionary sees less usage, probably because there are more titles in Latin script (because there are lots of English, French, Spanish, etc., words in Russian Wiktionary) that can match a Latin query.

Georgian (ka) Wikipedia is broadly similar to Hebrew and Russian Wikipedias, too. Over 1.5% of all clicks in the top-of-page search box and more than 3% of clicks on suggestions is an awesome result.

Then there’s Hindi (hi), which was implemented and evaluated last, but which has the biggest effect! 16% of all clicks in the top-of-page search box and more than half of clicks on any suggestions are from the transliterated suggestions. I feel like this is a huge improvement for searchers while still being indicative of an underlying problem with Hindi/Devanagari input.

Fun Languagey Complications

Hebrew and Russian

The Hebrew DWIM gadget—which I think is the original, at least on-wiki—had a nice character-by-character map that swapped ש and a, נ and b, ב and c, etc. Hebrew doesn’t have an upper/lowercase distinction, so when DWIM was adapted to Russian, they normalized the case before doing their mapping.

Well, that works most of the time, but because Russian has so many more letters in its alphabet, some of them map to punctuation on the US QWERTY keyboard. And while as-you-type search suggestions will mostly⁶ ignore case, as with щ/Щ or o/O, they do not “uppercase” ; (semicolon) to : (colon), so ж and Ж (the same key on the Russian keyboard) behaved differently from each other. When I noticed this, the Russian Wikipedia gadget did get updated to handle upper- and lowercase better.

When I started working on the search-internal DWIM mappings, I realized that this particular complication was even more complicated. Russian keyboard ю maps to QWERTY keyboard . (period), but Russian . maps to QWERTY / (slash) and Russian / maps to QWERTY | (pipe). There are a few other such chains. Fortunately, the way the mapping was set up ignored this ambiguity in favor of mapping toward Cyrillic, which was the more common case. A fun side effect is that the all-punctuation string :','[' on the US QWERTY keyboard maps to Жэбэхэ on the Russian keyboard.

I discovered that Hebrew had similar problems—like Hebrew ת mapping to QWERTY , (comma), Hebrew , mapping to QWERTY ' (apostrophe), and Hebrew ' mapping to QWERTY w.

My solution to this problem has been to break the query into words and if there is any Hebrew or Russian present, map from that script to Latin, but map from Latin to the “host” script of the wiki by default.
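As a rough illustration of that per-word rule, here is a Python sketch for Russian. It assumes the standard Russian and US QWERTY layouts; the tables and function names are made up for this example and are not the search backend's actual code.

```python
import re

# Characters produced by the same physical keys on the US QWERTY layout and
# the standard Russian layout: unshifted keys first, then shifted keys.
LATIN = "`qwertyuiop[]asdfghjkl;'zxcvbnm,./" + '~QWERTYUIOP{}ASDFGHJKL:"ZXCVBNM<>?|'
CYRILLIC = "ёйцукенгшщзхъфывапролджэячсмитьбю." + "ЁЙЦУКЕНГШЩЗХЪФЫВАПРОЛДЖЭЯЧСМИТЬБЮ,/"

TO_CYRILLIC = str.maketrans(LATIN, CYRILLIC)
TO_LATIN = str.maketrans(CYRILLIC, LATIN)
HAS_CYRILLIC = re.compile("[а-яА-ЯёЁ]")


def remap_word(word: str) -> str:
    """Pick the mapping direction per word, based on the script it contains."""
    if HAS_CYRILLIC.search(word):
        return word.translate(TO_LATIN)   # Cyrillic present: map toward Latin
    return word.translate(TO_CYRILLIC)    # default: map toward the host script


def remap_query(query: str) -> str:
    return " ".join(remap_word(word) for word in query.split())


print(remap_query("Dbrbgtlbz"))   # Википедия
print(remap_query("Цшлшзувшф"))   # Wikipedia
print(remap_query(":','['"))      # Жэбэхэ
```

Choosing the direction per word is what resolves the chains: a lone . has no Cyrillic letters around it, so it becomes ю, while a . inside a Cyrillic word maps to /.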

Phew, not too hard! And the ticket for Georgian transliteration promised that it would be “fairly easy”.⁷

Georgian

There are several ways to Romanize Georgian. Some are official. Some are academic. Some are even unambiguous and lossless. But none of those are popular. Instead, we have the dreaded “unofficial system” that’s ambiguous and inconsistent, but easy to type and probably not particularly difficult if you actually speak Georgian.

A big problem with transliteration can occur when the language/script you are coming from has more sounds/letters than the language/script you are going to.⁸ For example, Georgian has the letter თ, which sounds like /tʰ/, and ტ, which sounds like /tʼ/. English doesn’t make the distinction between those two sounds!

Fortunately, unofficial transliterations of Georgian do sometimes use uppercase and lowercase letters differently—like using t for ტ and T for თ—or use digraphs and trigraphs (ts, dz, sh, ch, tch). Of course, sometimes the uppercase Latin characters in a query on Georgian Wikipedia are there because the query is properly capitalized English, adding to the confusion.

So, if we are careful about case,⁹ we can map the trigraphs and digraphs, then map the distinct upper- and lowercase letters, then lowercase everything and map the leftovers, then hope we get a match. Hey, that actually works pretty well!
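The staged approach can be sketched like this in Python. The tables below are a small illustrative subset based on the common Georgian QWERTY keyboard conventions. They are assumptions for this example, not the mapping actually deployed.

```python
import re

# Stage 1 table: trigraphs and digraphs (illustrative subset).
MULTIGRAPHS = {"tch": "ჭ", "sh": "შ", "ch": "ჩ", "ts": "ც", "dz": "ძ"}
# Accept only sensible casings: "sh", "Sh", "SH", but not "sH".
VARIANTS = {
    form: georgian
    for latin, georgian in MULTIGRAPHS.items()
    for form in (latin, latin.capitalize(), latin.upper())
}
MULTIGRAPH_RE = re.compile("|".join(sorted(VARIANTS, key=len, reverse=True)))

# Stage 2 table: letters whose upper- and lowercase forms mean different things.
CASED = str.maketrans({
    "T": "თ", "t": "ტ", "S": "შ", "s": "ს", "C": "ჩ", "c": "ც",
    "Z": "ძ", "z": "ზ", "W": "ჭ", "w": "წ", "R": "ღ", "r": "რ",
    "J": "ჟ", "j": "ჯ",
})

# Stage 3 table: everything else, after lowercasing.
LEFTOVERS = str.maketrans({
    "a": "ა", "b": "ბ", "g": "გ", "d": "დ", "e": "ე", "v": "ვ", "i": "ი",
    "k": "კ", "l": "ლ", "m": "მ", "n": "ნ", "o": "ო", "p": "პ", "u": "უ",
    "f": "ფ", "q": "ქ", "y": "ყ", "x": "ხ", "h": "ჰ",
})


def latin_to_georgian(text: str) -> str:
    text = MULTIGRAPH_RE.sub(lambda m: VARIANTS[m.group(0)], text)  # stage 1
    text = text.translate(CASED)                                    # stage 2
    return text.lower().translate(LEFTOVERS)                        # stage 3


print(latin_to_georgian("transliteratsia"))  # ტრანსლიტერაცია
```

The order matters: if single letters were mapped first, ts would become two letters before the digraph rule could see it.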

Whew! Now what about Georgian transliterated into Cyrillic? [….twenty minutes later….] Oh how I long for the days of the unofficial system!

I could only find one list of systematic mappings from Georgian to Russian Cyrillic, on Russian Wikipedia, from an academic paper from 1972! It’s clearly designed to turn Georgian names into something a Russian speaker can pronounce reasonably well, not to losslessly preserve the original Georgian spelling or pronunciation.

I checked to see if there were any Cyrillic queries on Georgian Wikipedia. There aren’t a lot, but there are some. Some are clearly Russian and some are obviously names, but it was clear that making a decent effort at handling Cyrillicized¹⁰ input would probably help someone.

I ended up ignoring case, handling a few digraphs, and leaning heavily on context clues to choose between ambiguous options.

And by context clues, I mean something like: if you were untransliterating a phrase back into English and you had a letter that could be either an f or a v, you’d be smart to choose f if the next letter was l, because fl is fairly common and vl is very rare. You would get “Flad the Impaler” and “Flassic Pickles” wrong, but you’d get most things right. At the beginning or end of the word, between certain letters, before or after other letters—all can help you pick. Fortunately, only т vs თ & ტ and п vs პ & ფ needed that level of careful disambiguation. The results are nowhere near 100% accurate, but they are so much better than nothing.
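To make the idea concrete, here is a toy Python version of the English f/v analogy, where # stands for a letter that could be either f or v. The rule list is invented for illustration (in particular, the fallback to v is an arbitrary choice) and has nothing to do with the real Georgian rules.

```python
import re

# Ordered rewrite rules: the first rule that can resolve a "#" wins.
RULES = [
    (re.compile(r"#(?=l)"), "f"),  # before l, prefer f: "fl" is common, "vl" is rare
    (re.compile(r"#"), "v"),       # toy fallback, an arbitrary default for this sketch
]


def disambiguate(text: str) -> str:
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


print(disambiguate("#lag"))     # flag
print(disambiguate("#lassic"))  # flassic (wrong, as predicted above)
```

Rules keyed on word boundaries or on the preceding letter would follow the same pattern, using lookbehind and anchors instead of the lookahead shown here.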

Hindi

The challenges that Hindi offers are similar to those of Georgian, but dialed up to eleven. Like Georgian, Hindi has a lot more sounds than the Latin alphabet has letters. There are also a number of potentially lossless academic and official transliteration systems for Devanagari, though they have even more orthographic variety than Georgian transliteration systems. Here’s a semi-random sample of transliterations for single characters, most of which no one is ever going to just type into a search bar: ī R^i .ll r̥̄ lṛī ṉ ṁ ḥ m̐.

The “unofficial system”—shudder!—used for Romanization is even more difficult for Hindi, possibly in part because so many Hindi speakers also speak and write English, and sometimes freely mix the two. Rather than stick to a single Romanization scheme, some Hindi speakers seem to use their greater familiarity with English to transliterate in ways that follow awful English spelling conventions. And if an English word has been borrowed into Hindi it might be Romanized phonetically or just written in English. (As a native English speaker, I was usually able to get a good sense of the Hindi pronunciation from the transliterations—so the system works fine for humans. It’s just hard on computers.)

There are machine learning models designed to handle transliterated Hindi, and some LLMs can probably do a good job, but those are too computationally expensive for as-you-type suggestions, where we need to generate a suggestion on every keystroke. So, I set out to make something much more lightweight.

Despite my goal of a lightweight system, the idea of context-dependent disambiguation of a couple of characters—as used for Cyrillicized Georgian above—exploded into a full-on set of rewrite rules for Romanized Hindi.

Fortunately, I was able to find an open-source, human-generated Romanized Hindi transliteration data set, which made it much easier to evaluate my rewrite rules as I was working on them. Looking at the data also verified the complexity and ambiguity. The mappings are many-to-many, meaning that not only are there often multiple ways to Romanize a given Hindi word, some Romanizations can map to multiple Hindi words. So without an understanding of context, it might be impossible to accurately transliterate Romanized Hindi back to Devanagari.

There are two saving graces with Hindi. First, statistics. While there are sometimes multiple Hindi words that a given transliteration could map to, one is often statistically much more likely than the other. Second, the multiple words are at least somewhat similar. It’s not like the equivalent of Fred and Maynard having the same transliteration; more like Fred and fried. This is where the ability of as-you-type suggestions to handle typos comes in. fried smit is close enough to Fred Smith to match as a suggestion.

I also ended up taking a moderate number of the most frequent transliterated words in my sample of Hindi queries, about 1400 of them, and just hard-coding their most likely Hindi equivalents. This also insulated common words from tweaks to the rewrite rules that might improve a lot of individual words, but get worse on a few really common words. It might also be faster to just look up really common words instead of running them through all the rewrite rules.
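The lookup-first arrangement might look something like this Python sketch. The single table entry reuses the example from earlier in this post purely for illustration (whether it is among the real hard-coded words is not stated), and `rewrite` is a hypothetical stand-in for the full set of rewrite rules.

```python
from typing import Callable, Dict


def make_transliterator(
    common_words: Dict[str, str],    # hard-coded Romanized -> Devanagari entries
    rewrite: Callable[[str], str],   # hypothetical: the rule-based fallback
) -> Callable[[str], str]:
    def transliterate(query: str) -> str:
        out = []
        for word in query.split():
            key = word.lower()
            if key in common_words:
                out.append(common_words[key])  # frequent word: direct lookup
            else:
                out.append(rewrite(word))      # everything else: rewrite rules
        return " ".join(out)

    return transliterate


# Illustrative table with one entry; the real list has about 1400.
COMMON = {"lipyantaran": "लिप्यंतरण"}
# Placeholder fallback that just marks words the table did not cover.
transliterate = make_transliterator(COMMON, lambda word: "<" + word + ">")
print(transliterate("Lipyantaran xyz"))  # लिप्यंतरण <xyz>
```

Because the table is consulted first, changing the rewrite rules cannot alter the output for any word in the table.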

As I said before, the results are nowhere near 100% accurate, but they are so much better than nothing.

A Latin transliteration (middle) of the name of the author’s adopted home state that works for both Hindi (above) and Georgian (below), with corresponding characters colored the same, using colors from the relevant country and state flags—including the ridiculously awesome and awesomely ridiculous flag of Maryland. Note that the full name of Maryland in Georgian transliterates as merilendi.

Future Fulltext Directions

So far, we have replicated (and hopefully somewhat improved) the original DWIM wrong-keyboard functionality for Hebrew and Russian, and expanded the scope to more complex mappings for transliteration in Georgian and Hindi.

However, like the original DWIM gadget, we’re currently still limited to as-you-type suggestions. Submitting a wrong-keyboard query to the fulltext search will typically get few if any results because the query often looks like gibberish. Submitting a transliterated query to the fulltext search is more likely to match something, since the mapping is phonetic, but it’s not a good way to search.

One direction for expanding second-try searching is to consider showing “did you mean” suggestions for fulltext queries. It’s not just a matter of turning it on, though—as always, there are complications.

Because as-you-type suggestions are limited to matching titles, those suggestions can handle one or two typos per query, which can compensate for minor errors in the DWIM or transliteration mapping. Plus, the searcher never sees the exact remapped query used to generate the suggestions; as long as the suggestions are good, it doesn’t matter. And if the remapped query is a complete disaster,¹¹ it generates no suggestions, and the searcher doesn’t see anything amiss.

For “did you mean” suggestions, showing suggestions with ridiculous typos would only further damage their spotty reputation. We also can’t currently filter “did you mean” suggestions that get zero results. I really don’t want to show a typo-riddled suggestion that gets no results! Part of the problem, though, is that we have to search for the remapped suggestion (or at least get the number of results from the search index) to see if it is plausible, and that can be expensive. There is also the issue of prioritizing the sources of possible “did you mean” suggestions—we currently get suggestions from an internal feature of our underlying search engine, from an internal dataset of queries that look like typo corrections, and now plausibly from second-try remapping. We would need to investigate to decide whether a hard-coded prioritization is sufficient, and which order is best—which may vary across languages!—or come up with a very efficient way to rank individual options. That’s not insurmountable, but it isn’t trivial either.

Whether second-try “did you mean” suggestions are useful will probably vary by language and maybe by wiki, since the accuracy and ambiguity of the mappings vary so much between transliteration and wrong-keyboard mappings, and possibly even more so between languages.

More Languages—Help Us Help You!

Another direction for expanding the usefulness of second-try searching is to apply it to other scripts (for transliteration mappings) or keyboards (for wrong-keyboard mappings), which could allow for immediate improvements to as-you-type suggestions on-wiki, and possible future expansion into “did you mean” suggestions.

If you know of a language community that could use this kind of support, let me know in any of the usual ways, or open a Phab ticket and tag Discovery-Search.

And if you love nerding out about this kind of thing, you can check out the Second-Try Searching section of my Notes page, which has recent links to a lot more details on second-try searching, DWIM for Russian and Hebrew, and transliteration for Georgian and Hindi.

__________

¹ Despite it often being treated as a dead giveaway that AI wrote something, human typography nerds—like me!—can overuse em dashes, too! (Here’s a wiki mailing list message from me from 2015 with about 20 artisanal em dashes in it—I typed every one by hand—to prove I’ve pretty much always been this way.) I’m not gonna let AI take credit for my writing—or steal the joy of em dashes from me!

² Discovering that around 1% of Russian Wikipedia fulltext queries appear to be gibberish but actually aren’t was quite surprising! In search, a 1% improvement is often a pretty big deal. For reasonably well-supported languages, it can be hard to make really big improvements to more than 1% of queries if you have a decent stemmer (for inflected languages) and tokenizer (for spaceless languages like Chinese and Thai). Woe unto the heavily inflected spaceless languages.

³ For example, if you search on English Wikipedia for labdarúgó-világbajnokság, mistrovství světa ve fotbale, or 國際足協世界盃 (“football world cup”) they will be recognized as Hungarian, Czech, or Chinese and cross-language results from Hungarian, Czech, or Chinese Wikipedia will be shown. Interestingly, I couldn’t use Wereldkampioenschap voetbal, ワールドカップ, Copa do Mundo, Världsmästerskapet i fotboll, 월드컵, Dünya Kupası, or Giải vô địch bóng đá thế giới as examples because there is so much on English Wikipedia either about the World Cup or in those languages that those queries got too many results to be eligible for cross-wiki results.

⁴ It’s kind of goofy that English often uses the word “Romanization” for transliterating into the Latin Alphabet, but it’s far from the goofiest thing about English. (“However, unlike with most languages, there are multiple ways to spell every [English] phoneme, and most letters also represent multiple pronunciations depending on their position in a word and the context.” And how!)

⁵ People often talk about how German compounds can be so ridiculously long, but really they aren’t that different from long English noun phrases. German just writes them without spaces. I’m not sure which is better. It’s probably easier to parse the elements in English because the words are broken up, but scoping might be easier in German because pieces that strongly go together are attached to each other. It’s a Kopfzerbrechen to be sure.

⁶ There is some exact match logic in there that will re-rank precise matches higher, so that typing Cinemas in the search box on English Wikipedia gives the top suggestion Movie theater (with redirect from Cinemas), but searching for CinemaS (or wild case CiNeMaS or anything that is not an exact match) will give the top suggestion CinemaScope.

⁷ It was not easy. It was not peasy. It was not lemon squeezy. Yet, I didn’t know how easy I had it!

⁸ To be fair, the Latin alphabet doesn’t really have enough letters for all the sounds of English, which is a big contributing factor as to why English spelling is so goofy.

⁹ That is, if we assume sh is being used as a digraph, we would expect it to be uppercased as Sh or SH, but not sH. Things like sH happen more often than I expected—not a lot, but more than I expected—because words sometimes get run together. We can’t catch every corner case, but it’s fun to try.

¹⁰ Cyrillicized is a fun word. Cyrillicization is even better.

¹¹ In fact, this effective filtering of very poor queries is part of how second-try as-you-type suggestions are able to function. For reasons of speed, we have to issue the original query and the remapped query at the same time. So if you search for Википедия (“Wikipedia”) on Russian Wikipedia, there’s a simultaneous second-try search for Dbrbgtlbz, which doesn’t match any titles, so no additional suggestions are generated. By the way, if you do a fulltext search for Dbrbgtlbz on Russian Wikipedia, there currently are two results, both about other software that helps correct wrong-keyboard mistakes and use Википедия/Dbrbgtlbz as an example.
