ChatGPT is available in a number of languages, but Japanese researchers say the popular AI chatbot has a poor grasp of the intricacies of Japanese language and culture.
Japanese tech giants including NEC, Fujitsu, and SoftBank are building LLMs trained purely on Japanese datasets to overcome this challenge.
Keisuke Sakaguchi, a natural language processing researcher at Tohoku University in Japan, explained that “Current public LLMs, such as GPT, excel in English, but often fall short in Japanese due to differences in the alphabet system, limited data, and other factors.”
Why does ChatGPT have such a tough time responding in Japanese?
Lost in translation
The main reason ChatGPT struggles with Japanese is that the majority of its training data was English. And as a writing system, English is far less complicated than Japanese.
English words are built from combinations of just 26 letters. Japanese uses 48 basic characters, plus 2,136 regularly used kanji, or Chinese characters, and most of those kanji have multiple pronunciations.
A further 50,000 or so kanji are technically part of the Japanese language but are very rarely used.
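One way to see the practical cost of that character inventory is to run the two languages through GPT’s tokenizer. The sketch below uses OpenAI’s open-source tiktoken library (the sample sentences are mine, chosen for illustration): Japanese text typically splits into many more tokens per character than English, so the same context window holds less Japanese, and the model learned less per byte of Japanese training data.

```python
# A minimal sketch using OpenAI's open-source tiktoken tokenizer.
# Exact token counts depend on the tokenizer version.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5 and GPT-4

english = "Thank you very much for your help."
japanese = "ご協力いただき誠にありがとうございます。"  # roughly the same meaning

for text in (english, japanese):
    tokens = enc.encode(text)
    print(f"{len(tokens):>2} tokens for {len(text):>2} characters: {text}")
```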
When a Japanese person uses ChatGPT, the prompt is translated into English, ChatGPT generates an output in English, and that output is then translated back into Japanese. It’s not surprising, then, that the response often sounds a little off to a Japanese reader.
Sakaguchi explained that during this translation process ChatGPT “sometimes generates extremely rare characters that most people have never seen before, and weird unknown words result.”
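Sakaguchi’s observation can even be checked mechanically: almost all everyday Japanese text falls within a handful of Unicode blocks, so characters outside them are good candidates for the rarities he describes. Below is a rough heuristic sketch; the ranges and function name are my own, not anything from the article or from Sakaguchi’s group.

```python
import re

# Unicode ranges covering almost all everyday Japanese text.
# Characters outside these blocks are candidates for the
# "extremely rare characters" Sakaguchi describes.
COMMON_JAPANESE = re.compile(
    r"[\u0020-\u007E"   # ASCII letters, digits, punctuation
    r"\u3000-\u303F"    # CJK symbols and punctuation
    r"\u3040-\u309F"    # hiragana
    r"\u30A0-\u30FF"    # katakana
    r"\u4E00-\u9FFF"    # basic CJK ideograph block (common kanji)
    r"\uFF01-\uFF60]"   # full-width forms
)

def flag_rare_characters(text: str) -> list[str]:
    """Return characters that fall outside the common Japanese ranges."""
    return [ch for ch in text if not COMMON_JAPANESE.match(ch)]

# chr(0x20000) is an obscure ideograph from the CJK Extension B block.
print(flag_rare_characters("今日はいい天気ですね" + chr(0x20000)))  # -> ['𠀀']
```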
ChatGPT bias and Japanese culture
Because ChatGPT was trained mainly on English data, there is an implicit Western cultural bias in the way it responds. Culture shapes the way we speak, and things that are considered polite or acceptable in English may not be appropriate in Japanese culture.
If you use ChatGPT to write a job application or an investment pitch, the output is going to sound jarring to a Japanese reader because it will miss many of the standard expressions of politeness (keigo).
Some smaller Japanese LLMs already exist, but they’re a long way off the performance of even GPT-3.5, not to mention GPT-4.
RIKEN, Tohoku University, Fujitsu, and the Tokyo Institute of Technology are working to change that. Their joint project is using Japan’s Fugaku supercomputer to train an LLM almost exclusively on Japanese-language data.
At 30 billion parameters it’s still a lot smaller than models like GPT-3.5, but it will be open source and far better aligned with the language and culture of Japan.
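For a rough sense of that size gap, the weights alone dictate a model’s memory footprint: at 16-bit precision each parameter occupies two bytes. The back-of-envelope calculation below is my own; since OpenAI has not published GPT-3.5’s parameter count, GPT-3’s documented 175 billion is used as the point of comparison.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory needed just to hold the weights (fp16 = 2 bytes per parameter)."""
    # billions of parameters x bytes per parameter = gigabytes
    return params_billions * bytes_per_param

print(weight_memory_gb(30))   # the Fugaku project's 30B model: ~60 GB of weights
print(weight_memory_gb(175))  # GPT-3's published 175B: ~350 GB of weights
```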
Japan lags some distance behind the US and China in AI development. To achieve its AI ambitions, it will need to overcome a number of industry challenges.
According to Japan’s Ministry of Economy, Trade and Industry, the country will face a deficit of 789,000 software engineers by 2030. And the lack of advanced domestic computing platforms means its homegrown AI models will remain heavily reliant on the government-owned Fugaku supercomputer.
Earlier this year, Sam Altman said OpenAI planned to open an office in Japan, and told Japanese Prime Minister Fumio Kishida that OpenAI hopes to “build something great for Japanese people, make the models better for Japanese language and Japanese culture.”
With a tech-hungry population of over 120 million people, Japan presents an appealing, if complicated, market for AI developers.