Development of Natural Language Processing Tools and Resources for Assamese Text

Pathak, Dhrubajyoti

dc.contributor.author	Pathak, Dhrubajyoti
dc.date.accessioned	2024-07-01T11:40:59Z
dc.date.available	2024-07-01T11:40:59Z
dc.date.issued	2024
dc.description	Supervisors: Nandi, Sukumar and Sarmah, Priyankoo	en_US
dc.description.abstract	Natural Language Processing (NLP) is a discipline of computer science concerned with the interaction between computers and humans in natural language. The field has expanded significantly over the last decade and now plays an important role in numerous sectors, such as the IT industry, education, health care, banking, the stock market, and entertainment. Over the years, NLP researchers have developed a wide range of tools and resources to support tasks such as text classification, sentiment analysis, machine translation, language understanding, question answering, and speech recognition. The underlying models require large amounts of data and computing power, and they achieve state-of-the-art performance in many NLP areas for high-resource languages. However, these models do not cover low-resource languages such as Assamese: limited language resources mean that such languages receive far less attention in NLP research than resource-rich ones. Although previous studies have produced various tools and resources for various tasks, existing resources, such as raw text corpora and annotated datasets, are not sufficient in size for deep learning models. Hence, it is necessary to enlarge these resources as well. This dissertation presents four contributions to language tools and resources for low-resource languages, with an emphasis on Assamese. The first contribution focuses on identifying and modeling reduplication in Assamese text. Reduplication is a productive morphological process widely used in the Assamese language. Addressing reduplication plays a vital role in the efficiency of POS tagging, sentiment analysis, and other downstream NLP tasks. The second contribution proposes a deep learning (DL)-based Assamese word embedding model. 
A word embedding model is a crucial component in DL-based downstream sequence modeling tasks such as POS tagging and NER. Training a DL model requires a large text corpus, but the available Assamese text corpus is not sufficient for a word embedding model. Therefore, we also prepared an Assamese text corpus to train the word embedding model. In the third contribution, an Assamese POS tagger is proposed using an ensemble approach that combines two deep learning-based POS taggers and one rule-based POS tagger. The fourth contribution of this dissertation comprises two parts. The first is the creation of a NER dataset for the Assamese language; prior to this dataset, there was no publicly available NER dataset for Assamese. The second is the development of an Assamese Named Entity Recognition model. In summary, this dissertation contributes to research on Assamese text corpora, word embedding models, POS tagging, and NER.	en_US
dc.identifier.other	ROLL NO.166155103
dc.identifier.uri	https://gyan.iitg.ac.in/handle/123456789/2652
dc.language.iso	en	en_US
dc.relation.ispartofseries	TH-3389;
dc.subject	Language Resource	en_US
dc.subject	NLP Tools	en_US
dc.subject	POS Tagging	en_US
dc.subject	NER Tagging	en_US
dc.subject	Reduplication	en_US
dc.subject	Assamese Language	en_US
dc.subject	Word Embedding	en_US
dc.title	Development of Natural Language Processing Tools and Resources for Assamese Text	en_US
dc.type	Thesis	en_US

Files

Original bundle

Name: Abstract-TH-3389_166155103.pdf
Size: 63.57 KB
Format: Adobe Portable Document Format
Description: ABSTRACT

Name: TH-3389_166155103.pdf
Size: 1.6 MB
Format: Adobe Portable Document Format
Description: THESIS

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed to upon submission
Description:

Collections

PhD Theses (Linguistic Science and Technology)