Development of Natural Language Processing Tools and Resources for Assamese Text

Author: Pathak, Dhrubajyoti
Date: 2024-07-01
Roll No.: 166155103
URI: https://gyan.iitg.ac.in/handle/123456789/2652
Supervisors: Nandi, Sukumar and Sarmah, Priyankoo

Natural Language Processing (NLP) is a discipline of computer science concerned with the interaction between computers and humans in natural language. The field has expanded significantly over the last decade and now plays an important role in numerous sectors, such as the IT industry, education, healthcare, banking, the stock market, and entertainment. Over the years, NLP researchers have developed a wide range of tools and resources to support tasks such as text classification, sentiment analysis, machine translation, language understanding, question answering, and speech recognition. The underlying models require large amounts of data and computing power, and they achieve state-of-the-art performance in many NLP areas for high-resource languages. However, these models do not cover low-resource languages such as Assamese. Because language resources are limited, such languages receive less attention in NLP research than resource-rich languages. Although previous studies have produced tools and resources for various tasks, the existing resources, such as raw text corpora and annotated datasets, are not large enough for deep learning models. Hence, these resources need to be both improved and enlarged. This dissertation presents four contributions to language tools and resources for low-resource languages, with an emphasis on Assamese.

The first contribution focuses on identifying and modeling reduplication in Assamese text. Reduplication is a productive morphological process widely used in the Assamese language.
Addressing reduplication plays a vital role in the efficiency of POS taggers, sentiment analysis, and other downstream NLP tasks. The second contribution proposes a deep learning (DL)-based Assamese word embedding model. A word embedding model is a crucial component of DL-based downstream sequence modeling tasks such as POS tagging and NER. Training a DL model requires a large text corpus, but the available Assamese text corpus is not sufficient for a word embedding model. Therefore, we also prepared an Assamese text corpus to train the word embedding model. The third contribution proposes an Assamese POS tagger built with an ensemble approach, which combines two deep learning-based POS taggers and one rule-based POS tagger. The fourth contribution comprises two parts. The first is the creation of a NER dataset for the Assamese language; prior to this dataset, there was no publicly available NER dataset for Assamese. The second is the development of an Assamese Named Entity Recognition model. In summary, this dissertation contributes to research on Assamese text corpora, word embedding models, POS tagging, and NER models.

Language: en
Keywords: Language Resource; NLP Tools; POS Tagging; NER Tagging; Reduplication; Assamese Language; Word Embedding
Type: Thesis