Back Transliteration of Romanized Assamese Social Media Texts (Corpus, Analysis and Models)
Abstract
Natural Language Processing (NLP) research has largely focused on resource-rich languages, leaving low-resource languages such as Assamese underrepresented. Assamese, spoken by millions in northeast India, faces challenges due to its linguistic diversity and lack of standardized resources. This thesis tackles back-transliteration of Romanized Assamese, which is common on social media platforms such as Facebook, YouTube, and Twitter (X), where informal, noisy, and code-mixed content complicates processing. Transliteration converts text from one script to another while preserving pronunciation; back-transliteration reverses this process, here restoring Romanized Assamese to its native script. Both tasks are increasingly relevant in multilingual contexts such as India. Assamese poses particular difficulties because of inconsistent Romanization, phonetic variation, and orthographic diversity. This work presents a detailed analysis of grapheme-level and phoneme-level variations and introduces a new dataset of 60,312 sentence pairs and 65,614 word pairs collected from social media. Several families of transliteration models (statistical, neural, transformer-based, and LLM-based) are benchmarked, with a focus on comparing word-level and sentence-level performance. The results highlight the importance of phonetic and contextual factors for transliteration accuracy. The thesis also demonstrates that back-transliteration improves downstream tasks such as sentiment analysis, offering tools and insights for advancing NLP in low-resource languages.
Description
Supervisors: Singh, Sanasam Ranbir and Sarmah, Priyankoo
Creative Commons license
Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by-nc-sa/4.0/

