One thing that has always struck me as surprising is the large number of homophones – words with the same pronunciation – in the Japanese language. Those new to Japanese typically discover this by looking at the number of dictionary entries for a given pronunciation, or number of kanji conversions when hitting spacebar while typing.
In the spoken world, pitch accent (discussed here) helps to distinguish these, but there are many regional variations and some words can be pronounced with more than one pitch pattern. In the written world, things are made manageable by kanji, which differentiate the meaning, like 橋 (bridge) vs 箸 (chopsticks), which are both pronounced as “hashi”. In this example the pitch accent is different, but for non-natives it can be very hard to hear that difference. Some people have also claimed that one of the reasons subtitles are so common on certain types of Japanese TV is that they help differentiate the meaning of words which have homophones, though I don’t think this is the only reason.
One thing that always caught my interest was exactly how frequent these homophones really are. Are they actually more frequent than in English, or is it just that I can naturally differentiate those in my native language so I don’t notice them as much?
I stumbled upon Jim Breen’s EDICT, which is a freely available file containing around 170,000 entries of Japanese words with their readings and meanings. This gave me the idea that with a little scripting this file could be data mined for exactly what I was looking for – the frequency of homophones in Japanese.
In my analysis, I generated both a histogram of the number of homophones and a list of the pronunciations with the highest number of homophones.
Histogram of homophone frequency (excerpt)
Words with highest number of homophones (excerpt)
Note: In both graphs, I have only shown a few data points. If anyone is interested in the rest of the details, let me know and I can try to prepare a more detailed report.
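For anyone who wants to try something similar, here is a minimal Python sketch of this kind of counting. It is not the script I actually used. It assumes each EDICT line looks like "KANJI [READING] /gloss/.../" or, for kana-only words, "READING /gloss/.../", and the function names are made up for illustration. Whether a word "has N homophones" is counted here as N other distinct headwords sharing its reading, which is one reasonable definition among several. Any header line in the file is not treated specially.

```python
import re
from collections import Counter, defaultdict

# Assumed line shapes: "KANJI [READING] /gloss/.../" or "READING /gloss/.../"
ENTRY_RE = re.compile(r"^(\S+)\s+(?:\[([^\]]+)\]\s+)?/")


def group_by_reading(lines):
    """Map each reading to the set of distinct headwords pronounced that way."""
    groups = defaultdict(set)
    for line in lines:
        match = ENTRY_RE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed shapes
        headword, reading = match.group(1), match.group(2)
        # Kana-only entries have no bracketed reading; the headword is the reading.
        groups[reading or headword].add(headword)
    return groups


def homophone_histogram(groups):
    """Count words by how many *other* words share their reading."""
    hist = Counter()
    for words in groups.values():
        hist[len(words) - 1] += len(words)
    return hist


def top_readings(groups, n=10):
    """Return the n readings shared by the most headwords, largest first."""
    ranked = sorted(groups.items(), key=lambda item: -len(item[1]))
    return [(reading, len(words)) for reading, words in ranked[:n]]


sample = [
    "橋 [はし] /(n) bridge/",
    "箸 [はし] /(n) chopsticks/",
    "猫 [ねこ] /(n) cat/",
    "これ /(pn) this/",
]
groups = group_by_reading(sample)
print(homophone_histogram(groups))
print(top_readings(groups, 1))
```

To run this on the real file, you would pass an open file object instead of the sample list, for example open("edict", encoding="euc-jp"). Both the path and the encoding are assumptions, so check them against the copy you download.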
From the first graph, we can see that roughly 94% of all words in Japanese do not have a homophone, which is significantly less than I expected. Around 3% (~6000) of words have one homophone and 1% (~2000) have two. The curve drops off quickly, and there are only a total of 55 words (0.03% of total) which have 10 homophones.
Though there are only a few words in Japanese which have a large number of homophones, if you look at the second graph above you’ll see that most of those are short and contain sounds which are commonly used in Japanese. “こう” is the winner for the most homophones, with a whopping 45! “かん” and “しょう” have 38 and 31, respectively.
The good news, which this data doesn’t really capture, is that many of the homophones are not frequent in Japanese, especially in everyday conversation. More advanced topics like science, government, and academic subjects tend to use more of these homophones, and if you include older words which are no longer in common use you get more still. More homophones tend to be used in Japanese writing (on advanced topics), but fortunately the kanji there help make up for it.
Although this little exercise did temporarily quench my research thirst about homophones in Japanese, there is still much more work to be done in this area. Firstly, a similar sort of analysis needs to be done for English, and only then can we compare these two languages on (semi) equal ground. To get even more meaningful results, word frequency in the modern language would need to be factored in, filtering out those homophones which are never or almost never used in practice. This might be possible using some targeted Google searches (hopefully in a programmatic way with available public APIs), but that would take a good bit of time.
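As a rough idea of what the frequency filtering step could look like, here is a small Python sketch. It assumes you already have a mapping from each reading to the set of words pronounced that way, plus some table of how often each word appears in modern text. Where that table would come from is exactly the open question, so the counts below are invented purely for illustration, and the function name and threshold are hypothetical too.

```python
def drop_rare_words(groups, frequency, min_count=1):
    """Keep only words seen at least min_count times; drop readings left empty."""
    kept = {}
    for reading, words in groups.items():
        # Words missing from the frequency table are treated as never used.
        common = {w for w in words if frequency.get(w, 0) >= min_count}
        if common:
            kept[reading] = common
    return kept


# Invented example data: the counts are placeholders, not real measurements.
groups = {"はし": {"橋", "箸", "嘴"}, "こう": {"斯う"}}
frequency = {"橋": 120, "箸": 45}

filtered = drop_rare_words(groups, frequency, min_count=10)
print(filtered)
```

After filtering, the same homophone counting could be rerun on the smaller set to see how many homophones survive among words people actually use.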