Considering Romanization Schemes for Non-Japanese Languages on Danbooru

Given Danbooru's traditional focus, a lot has been discussed on-site with regard to the romanization scheme used for Japanese, as the vast majority of content on Danbooru necessitates it. As of late though, that discussion seems to have largely reached its conclusion with the 2020 decision (forum #168056, topic #17011) to stick with the Hepburn system used by the Ministry of Foreign Affairs of Japan for passports, and to normalize sticking with official romanizations when sensible.

However, the past decade has seen an increasing number of Chinese and, to a lesser extent, Korean properties appearing on-site. Having a brief style guide and tips on how to handle either (and otherwise just linking to the relevant Wikipedia pages) on Danbooru's howto:romanize page would be useful, especially as there are some unresolved questions when it comes to handling names of copyrights and characters, per topic #23510. Having anything written down beyond "use Google Translate, lol" would be an improvement.

To some extent, this applies to languages that use Cyrillic, like Russian, as well, though trying to establish a style guide for those is difficult, mainly because of inconsistent standards, differing letter usage across languages, and baffling official romanizations. For example, current Russian passport standards render the Cyrillic letter 'ё', pronounced 'yo', as Latin 'e', giving us Mikhail Gorbachev instead of Gorbachov, despite it still being pronounced like the latter. At that point it starts resembling the diacritic question for other languages and what standards we have for rendering those, that is, whether we represent them, à la Pozyomka [Позёмка/Pozëmka] and Bluecher [Blücher], or drop them, à la Senkishi Brunnhilde [Senkishi Brünnhilde], Nurburgring [Nürburgring], FC Bayern Munchen [FC Bayern München] and Kraina Grzybow [Kraina Grzybów].

If we can get some written-down guidelines going for non-Japanese languages, I think that'd save us a lot of headaches going forward.

Updated 17 days ago

evazion

17 days ago

We should just use Wikipedia's policy, which is more or less the same as our own: use the most commonly accepted or recognized spelling of the name in English, if one exists, otherwise use the most commonly accepted romanization system for the language in question (see [1], [2], [3]).

The main difference between Wikipedia and us is that Wikipedia allows diacritical marks in names, while we don't, and that they mainly deal with real names for real people, whereas we mainly deal with made-up names for fictional characters, which can have all kinds of weirdness, especially when they're made up by non-native speakers.

Originally the main goal of our romanization system was reversibility, that is, you should be able to unambiguously reverse the romanization to get back to the original language. I think this is basically hopeless and not the most important concern for most of our users. In real life, languages are messy and there are always exceptions to the rules or things that are spelled differently for historical reasons. Instead the goal should be for tag names to first be recognizable, and second be consistent according to some system.

For Chinese, the most common romanization system is pinyin with the diacritical marks dropped ([4]). So for example, Beijing is 北京 in Chinese, which is Běijīng in pinyin, which becomes Beijing. On the other hand, Chiang Kai-shek is Chiang Kai-shek because that's what he's conventionally called in English, even though in pinyin his name would be Jiǎng Jièshí.
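To make that concrete, here's a minimal Python sketch of the "conventional English name first, otherwise pinyin without diacritics" idea. The `CONVENTIONAL` table and the function names are made up for illustration and aren't anything Danbooru actually uses. It assumes the input is already pinyin with tone marks.

```python
import unicodedata

# Hypothetical override table: names with an established English spelling.
# Keys are NFC-normalized so lookups don't depend on how the input was typed.
CONVENTIONAL = {
    unicodedata.normalize("NFC", "Jiǎng Jièshí"): "Chiang Kai-shek",
}

def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop every combining mark (tone marks etc.)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def romanize_pinyin(pinyin: str) -> str:
    """Use the conventional English name if one is listed, else plain pinyin."""
    key = unicodedata.normalize("NFC", pinyin)
    return CONVENTIONAL.get(key, strip_diacritics(pinyin))

print(romanize_pinyin("Běijīng"))        # Beijing
print(romanize_pinyin("Jiǎng Jièshí"))   # Chiang Kai-shek
print(strip_diacritics("Jiǎng Jièshí"))  # Jiang Jieshi
```

One caveat with this approach: it also strips the dots from pinyin ü, so a syllable like nǚ collapses to "nu", which is a case someone would have to decide on separately.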

For South Korea, the official standard is the Revised Romanization system, which appears to be straightforward, except apparently personal names often use ad hoc spellings instead of the official system ([5]). For example, 정 is variously Jung, Jeong, or Chung. I don't know what to do here.

For Cyrillic, the standard appears to be the BGN/PCGN system, but there are apparently many variants depending on the language in question ([6]), and Wikipedia uses their own modified version for Russian ([7]). I don't know enough to form an opinion here.

For German, umlauts are usually transliterated by adding an 'e' after the vowel. So for example, ä becomes ae, ö becomes oe, and ü becomes ue. But sometimes the umlaut is simply dropped or the word is spelled differently, so for example, über becomes uber in English, while Nürnberg becomes Nuremberg. There doesn't appear to be any consistent standard here.
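As a rough illustration of the two competing conventions, here's a short Python sketch. The function names are hypothetical, and it only covers the three umlaut vowels mentioned above.

```python
import unicodedata

# Convention 1: replace the umlaut with the base vowel plus 'e'.
APPEND_E = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})

# Convention 2: just drop the umlaut and keep the base vowel.
DROP = str.maketrans("äöüÄÖÜ", "aouAOU")

def transliterate_umlauts(text: str) -> str:
    # Normalize to NFC first so each umlaut is a single precomposed character.
    return unicodedata.normalize("NFC", text).translate(APPEND_E)

def drop_umlauts(text: str) -> str:
    return unicodedata.normalize("NFC", text).translate(DROP)

for name in ("Blücher", "Nürburgring", "München"):
    print(name, "->", transliterate_umlauts(name), "/", drop_umlauts(name))
# Blücher -> Bluecher / Blucher
# Nürburgring -> Nuerburgring / Nurburgring
# München -> Muenchen / Munchen
```

Neither function can produce something like Nuremberg, since that's a separate English name rather than a transliteration, so it would have to be handled as an exception.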

The common theme here seems to be "it's complicated" and "follow the rules, except when you don't". I don't know that we can come up with anything better than that.

Updated 17 days ago

Damian0358

16 days ago

evazion said:
We should just use Wikipedia's policy, which is more or less the same as our own: use the most commonly accepted or recognized spelling of the name in English, if one exists, otherwise use the most commonly accepted romanization system for the language in question.
The main difference between Wikipedia and us is that Wikipedia allows diacritical marks in names, while we don't, and that they mainly deal with real names for real people, whereas we mainly deal with made-up names for fictional characters, which can have all kinds of weirdness, especially when they're made up by non-native speakers.
Originally the main goal of our romanization system was reversibility, that is, you should be able to unambiguously reverse the romanization to get back to the original language. I think this is basically hopeless and not the most important concern for most of our users. In real life, languages are messy and there are always exceptions to the rules or things that are spelled differently for historical reasons. Instead the goal should be for tag names to first be recognizable, and second be consistent according to some system.

Given I had mentioned just linking the relevant Wikipedia pages on howto:romanize, I agree with everything said in principle, and as you say, it basically aligns with the approach that has been established since 2020. I don't think that the difference between real names and fictional names is as relevant to point out though, especially as real names can have their own weirdness even in their own languages. I'm thinking here, for instance, of the traditional Chinese characters that still exist in Japan exclusively in names, the jinmeiyou kanji, which can then be combined with an unorthodox kanji reading (kirakira or not). The weird fucking names we see in anime didn't just pop out of thin air.

For Chinese, the most common romanization system is pinyin with the diacritical marks dropped. So for example, Beijing is 北京 in Chinese, which is Běijīng in pinyin, which becomes Beijing. On the other hand, Chiang Kai-shek is Chiang Kai-shek because that's what he's conventionally called in English, even though in pinyin his name would be Jiǎng Jièshí.

Yeah, with Chiang Kai-shek and similar, you're not just dealing with what he's conventionally called in English, but also with whichever romanization standard was in use when those names entered the public record. Chiang Kai-shek isn't even fully Wade-Giles, which would be Chiang Chieh-shih, but partially Cantonese, where Kai-shek stems from. Had the PRC not pushed for standardized pinyin from the 70s onward, English would still be calling Beijing Peking, which also isn't fully Wade-Giles, as that'd be Peiching.

So that's a perfect example of the "recognizability first, consistency second" policy we have so far. I would imagine that in most cases just using pinyin would be fine for stuff that hasn't received an official English name yet. The only thing I could add is providing links to Wiktionary or resources such as the Chinese Text Project, so folks can double-check how to render a character or copyright's name in Latin characters (without diacritics) if the normal pinyin by itself doesn't seem to fit the way the name is pronounced, especially if it turns out the common pronunciation in Chinese is explicitly non-Mandarin, like Cantonese.

For South Korea, the official standard is the Revised Romanization system, which appears to be straightforward, except apparently personal names often use ad hoc spellings instead of the official system. For example, 정 is variously Jung, Jeong, or Chung. I don't know what to do here.

Given that here we're dealing mostly with different spellings as opposed to different pronunciations, as you'd run into with Japanese or Chinese, unless a character is explicitly given a different spelling, the most sensible practice would just be to go with the most common spelling per Wikipedia; in 정's case, that'd be Jung, as that can always be updated later if it turns out that the given character's name is actually romanized as Chyung. What do you mean that name has over 15 different ad hoc spellings in Latin?!

For Cyrillic, the standard appears to be the BGN/PCGN system, but there are apparently many variants depending on the language in question, and Wikipedia uses their own modified version for Russian. I don't know enough to form an opinion here.

Yeah, the issue with Cyrillic is that, despite it being tailor-made for Slavic languages, divergences in letter usage still developed. As someone who uses Cyrillic, I personally think that, for most Cyrillic letters, you can establish a fairly consistent romanized form, but then you run into the same issues I raised above, similar to the Chiang Kai-shek case. Mikhail Gorbachev is set in stone as the way his name is rendered in English, and the spelling is supported by the Russian government based on its own standards, even if it isn't intuitive given how it is actually pronounced. That said, this hasn't stopped us from straying from conventional romanization and going with Pozyomka in forum #229904, even though, per usual practices with diacritics, one would assume that Pozemka would lead Pozyomka in search results by a mile (actually searching, though, seems to show that community interest in properly presenting her name, and the very fact that the diacritic was included in the game in the first place, has Pozyomka narrowly winning out).

But to bring up a letter example, take Е. In the Slavic languages, what Е represents varies subtly: for Russian and Belarusian, it represents "ye", while for Serbo-Croatian, Macedonian, Bulgarian, Rusyn and Ukrainian, it just means "e". Russian and Belarusian "e" is rendered as Э. Meanwhile, Serbo-Croatian, Bulgarian and Macedonian dropped "ye" as a single letter (rendering it with the individual letters instead, йе/је), whereas Ukrainian and Rusyn continue to have it, but because Е is already used for "e", they use Є for "ye". So you cannot consistently render Е as just "e", as you'd potentially drop the "y" sound for Russian and Belarusian. Of course, most letters don't have this issue, and for some romanizations there's easy overlap with how we do it with Japanese (the usage of ' for Ь, for example), but it does make one want to steer away from concrete Cyrillic standardization - unless we just declare one standard the only valid one, at which point bias can seep in (like how Serbo-Croatian Cyrillic is the only Slavic Cyrillic script with a 1:1 standardized conversion to Latin characters). And the Е example doesn't even touch the non-Slavic languages written in Cyrillic, like Mongolian, Tajik, Kazakh, Kyrgyz, Uzbek, etc.
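To show why a single Cyrillic-wide table doesn't work, here's a toy Python sketch of just the Е/Э/Є letters, keyed by language. The table is a simplification of what I described above and not any official standard (real systems have more nuanced rules), and the names `E_LETTERS` and `romanize_e` are made up for the example.

```python
# Cyrillic letters are written as escapes to avoid Latin lookalikes.
YE_AS_E = {
    "\u0415": "Ye", "\u0435": "ye",  # Е, е
    "\u042d": "E", "\u044d": "e",    # Э, э
}
E_AS_E_WITH_YE = {
    "\u0415": "E", "\u0435": "e",    # Е, е
    "\u0404": "Ye", "\u0454": "ye",  # Є, є
}
E_ONLY = {
    "\u0415": "E", "\u0435": "e",    # Е, е
}

# Hypothetical per-language table covering only the letters discussed.
E_LETTERS = {
    "russian": YE_AS_E,
    "belarusian": YE_AS_E,
    "ukrainian": E_AS_E_WITH_YE,
    "rusyn": E_AS_E_WITH_YE,
    "serbo-croatian": E_ONLY,
    "macedonian": E_ONLY,
    "bulgarian": E_ONLY,
}

def romanize_e(text: str, language: str) -> str:
    """Romanize only the Е-family letters; everything else passes through."""
    return text.translate(str.maketrans(E_LETTERS[language]))

print(romanize_e("\u0435", "russian"))    # ye
print(romanize_e("\u0435", "ukrainian"))  # e
print(romanize_e("\u0454", "ukrainian"))  # ye
```

The same input letter gives a different result depending on the language argument, which is the whole problem: you need to know the source language before you can pick a mapping.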

For German, umlauts are usually transliterated by adding an 'e' after the vowel. So for example, ä becomes ae, ö becomes oe, and ü becomes ue. But sometimes the umlaut is simply dropped or the word is spelled differently, so for example, über becomes uber in English, while Nürnberg becomes Nuremberg. There doesn't appear to be any consistent standard here.

If nothing else, given that we already have cases of transliterated umlauts, it is at least worth seeing what sort of standard there is here. The impression I've gotten, for instance, is that users who are fans of military stuff seem to be way more willing to transliterate the umlaut. Azur Lane and their corresponding ship tags all have their umlauts transliterated, and even German military leaders like Goering are rendered with transliteration, even when it is more common in English to drop the umlaut for names... or at least that's the impression I get. Compare Google's search results between Goring and Goering and the difference is striking.

In turn, outside of tags related to military stuff, there seems to be a greater preference for just dropping the umlaut, as if the policy decision taken in topic #7587 has stood this entire time. It makes the decision taken for Azur Lane's chartags in forum #190494 stick out like a sore thumb. I imagine some German users find the dropping frustrating though, even if that seems to be more common in English. Plus, for other languages that use diacritics other than umlauts, there doesn't appear to be an alternate way to transcribe those letters except rendering them without the diacritic and letting the reader figure out whether it was there or not. The example of Kraina Grzybow, for Polish, is a case of that: stuff like ó, ł, etc. have no transcribing alts, and you just have them go "bald" if you can't use the diacritic, something that seems to have been entrenched in the 90s and 2000s with SMS. Serbo-Croatian both has and doesn't have transcribing alts, if we want to get wacky with it.

On 'uber' and 'Nuremberg' though, those are slightly different cases. The former is a loanword, and English typically doesn't borrow the diacritics when it adopts foreign words. The latter was standardized back when the city's Latin name was Nuremberga, and oftentimes, unless forced to, the names of cities and even countries in other languages do not change to reflect how they are referred to today (examples in other languages would be the earlier-mentioned Peking/Beijing, but also Prague, Praha in Czech, reflecting a time when Czech hadn't yet turned its Gs into Hs). Look at Cologne, for example: its name in English reflects the influence French had on foreign city names in the language (as it'd be Köln, or Koeln, otherwise; the same thing applies to Florence in Italy, as that's Firenze in Italian). That is all to say, these don't count as romanization at this point. For cities, recognizability matters first and foremost, and the same goes for loanwords.

The common theme here seems to be "it's complicated" and "follow the rules, except when you don't". I don't know that we can come up with anything better than that.

The best approach is just to link to Wikipedia, for the most part, highlighting the common means of romanization and warning against potential pitfalls (variation in pronunciation, unusual cases à la jinmeiyou kanji and kirakira names, running into foreign names in other languages, etc.). Anything is better than saying nothing and having folks just rely on Google Translate.

Updated 16 days ago

tamuraakemi

16 days ago

i think the most important/useful place to have a guideline provided is for artist names to have something better than xinjinjumins and stack names, as a lot of artists do not provide a real romanized name and very rarely is there any sort of existing "fan consensus"

viliml

16 days ago

I remember there was an artist who complained that danbooru attempted to romanize their kanji display name because they don't recognize any pronunciation of it, so the tag had to be changed to their handle name, or something like that.