The problem: transliterating Balinese
In December 2021, a friend introduced me to the Balinese WikiPustaka (Wikisource) community, which has been relentlessly photographing, transcribing, and uploading Balinese palm-leaf manuscripts (lontar) to Wikisource (example).
While the manuscripts are written in the Balinese script, most Balinese today use the Latin alphabet when reading or writing the Balinese language. Text in the Balinese script can be automatically transliterated into the Latin alphabet (using Benny Lin's transliterator). However, words in the Balinese script are not separated by spaces, so the transliterated output comes out with no spaces between words. To produce legible Latin-script Balinese, we need to guess where the spaces belong after transliteration.
Our solution: a space fixer + a spell checker
So we embarked on a small project: teaching a model to guess where spaces go in Balinese text. Using text from Balinese Wikipedia, we trained SymSpell and put together a first version of our web app, then called the "Balinese space fixer":
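The core idea behind this kind of word segmentation is simple: given a word-frequency dictionary built from a corpus, pick the split of the unspaced string whose words are, together, the most probable. Below is a minimal pure-Python sketch of that idea (not the symspellpy API, which folds spelling correction into segmentation as well); the vocabulary and counts are toy values for illustration, using a couple of common Balinese words:

```python
import math

# Toy unigram counts; in the real app these come from Balinese Wikipedia text.
# Illustrative Balinese words: titiang "I", sampun "already".
COUNTS = {"titiang": 120, "sampun": 80, "sane": 100, "puniki": 90}
TOTAL = sum(COUNTS.values())

def word_prob(word: str) -> float:
    """Log-probability of a word under the unigram model; heavy penalty for unknowns."""
    count = COUNTS.get(word)
    if count is None:
        return -30.0  # unknown-word penalty
    return math.log(count / TOTAL)

def segment(text: str, max_word_len: int = 12) -> list:
    """Split an unspaced string into the most probable word sequence (Viterbi-style DP)."""
    n = len(text)
    best = [(-math.inf, 0)] * (n + 1)  # for each prefix: (score, previous split point)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            score = best[start][0] + word_prob(text[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Backtrack from the full string to recover the segmentation.
    words, pos = [], n
    while pos > 0:
        start = best[pos][1]
        words.append(text[start:pos])
        pos = start
    return list(reversed(words))

print(segment("titiangsampun"))  # → ['titiang', 'sampun']
```

In practice SymSpell also tolerates spelling errors during segmentation, so a slightly misspelled run of glued-together words still splits correctly.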
Later on, we expanded this tool to also make spelling corrections: it doesn't just add missing spaces, but also fixes the spelling of individual words. To let the app improve itself over time, we added a way for users to suggest a better spelling.
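The spelling-correction side rests on SymSpell's key trick: instead of generating every possible edit of a query, precompute the delete-only variants of each dictionary word, then match the query's own delete variants against that index. Here is a hedged sketch of that lookup with a toy dictionary (the real SymSpell additionally verifies candidates with Damerau-Levenshtein distance before ranking them):

```python
# Toy dictionary with frequencies; the real app builds this from Balinese
# Wikipedia text plus user-suggested corrections.
COUNTS = {"titiang": 120, "sampun": 80, "puniki": 90}

def deletes(word: str, max_distance: int = 2) -> set:
    """All strings reachable from `word` by deleting up to max_distance characters."""
    results = {word}
    frontier = {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results

# Precompute once: map each delete variant back to the dictionary words producing it.
index = {}
for word in COUNTS:
    for variant in deletes(word):
        index.setdefault(variant, set()).add(word)

def correct(term: str):
    """Return the most frequent dictionary word sharing a delete variant with `term`,
    or None if nothing matches. (SymSpell proper re-checks true edit distance here.)"""
    candidates = set()
    for variant in deletes(term):
        candidates |= index.get(variant, set())
    if not candidates:
        return None
    return max(candidates, key=lambda w: COUNTS[w])

print(correct("titiyang"))  # → titiang
```

Because the index is built once, per-word lookups stay fast even with a large Wikipedia-derived dictionary, which is what makes this approach practical for correcting whole pages of transliterated text.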
The tool can be accessed at https://balinese-spell.netlify.app/, with its source code living at https://github.com/raphaelmerx/ban-spellcheck.
To assist the WikiPustaka community, which has so far been transcribing lontar images by hand, we would love to build a quality HTR (handwritten text recognition) engine for Balinese manuscripts. I'm looking into using eScriptorium and kraken for this.
In the longer term, a translation engine between Indonesian and Balinese would be great. While I'm trying to keep the WikiPustaka community's hopes low on this front, I'm actively following research on very low-resource machine translation, especially the work coming out of the Masakhane (NLP for African languages) community.