PhD Theses (Linguistic Science and Technology)

Recent Submissions

  • Item
    Development of Natural Language Processing Tools and Resources for Assamese Text
    (2024) Pathak, Dhrubajyoti
    Natural Language Processing (NLP) is a discipline of computer science concerned with the interaction of computers and humans in natural language. The field of NLP has expanded significantly over the last decade and now plays an important role in numerous sectors, such as the IT industry, education, health care, banking, the stock market, and entertainment. Over the years, researchers in NLP have developed a wide range of tools and resources to support various NLP tasks such as text classification, sentiment analysis, machine translation, language understanding, question answering, and speech recognition. The underlying models require large datasets and substantial computing power. They achieve state-of-the-art performance in many NLP areas for high-resource languages. On the other hand, these models do not cover low-resource languages such as Assamese. This is due to limited language resources, which in turn means that such languages receive less attention in NLP research than resource-rich languages. Although previous studies have produced various types of tools and resources for various tasks, existing resources, such as raw text corpora and annotated datasets, are not sufficient in size for deep learning models. Hence, it is necessary to enhance these resources and increase their size as well. This dissertation presents four contributions to language tools and resources specifically focused on low-resource languages, with an emphasis on the Assamese language. The first contribution focuses on identifying and modeling reduplication in Assamese text. Reduplication is a productive morphological process widely used in the Assamese language. Addressing reduplication plays a vital role in the efficiency of POS taggers, sentiment analysis, and other downstream NLP tasks. A deep learning (DL)-based Assamese word embedding model is proposed in the second contribution. 
A word embedding model is a crucial component in DL-based downstream sequence modeling tasks such as POS tagging and NER. A large text corpus is required to train a DL model. However, the available Assamese text corpus is not sufficient for training a word embedding model. Therefore, we also prepared an Assamese text corpus to train the word embedding model. In the third contribution, an Assamese POS tagger is proposed using an ensemble approach. Two deep learning-based POS taggers and one rule-based POS tagger are used in the ensemble process. The fourth contribution of this dissertation comprises two parts. The first is the creation of a NER dataset for the Assamese language. Prior to this dataset, there was no publicly available NER dataset for Assamese. The second is the development of an Assamese Named Entity Recognition model. In summary, this dissertation makes significant contributions to research on Assamese text corpora, word embedding models, POS tagging, and NER.
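    The abstract above mentions an ensemble of two deep learning-based POS taggers and one rule-based POS tagger but does not say how their outputs are combined. The following is only a minimal sketch of one common option, per-token majority voting, and is not taken from the dissertation. The function name `ensemble_tags`, the tie-breaking rule (prefer the earliest tagger in the list), and the tag labels in the example are all illustrative assumptions.

```python
from collections import Counter


def ensemble_tags(predictions):
    """Combine tag sequences from several taggers by per-token majority vote.

    predictions: list of tag sequences, one per tagger, all of equal length.
    Assumption: taggers are listed from most to least trusted, so a tie is
    resolved in favour of the earliest tagger whose tag is among the tied tags.
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all taggers must tag the same number of tokens")

    combined = []
    for token_tags in zip(*predictions):
        counts = Counter(token_tags)
        best = max(counts.values())
        tied = {tag for tag, count in counts.items() if count == best}
        # With a clear majority, `tied` holds one tag; otherwise fall back
        # to the first tagger (in list order) that proposed a tied tag.
        combined.append(next(tag for tag in token_tags if tag in tied))
    return combined


# Hypothetical outputs for a three-token sentence (tag labels are placeholders).
dl_tagger_a = ["NN", "VB", "JJ"]
dl_tagger_b = ["NN", "NN", "RB"]
rule_tagger = ["PRP", "VB", "NN"]

result = ensemble_tags([dl_tagger_a, dl_tagger_b, rule_tagger])
print(result)
```

    In this sketch the first two tokens are decided by a two-to-one majority, while the third token has no majority and falls back to the first tagger in the list.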
  • Item
    (A) Structure-preserving Document Conversion System for Manipuri Documents in Bengali Script to Meetei Script
    (2023) Thiyam, Jennil
    Manipuri, or Meeteilon, is one of the resource-poor languages of India and the lingua franca of the Indian state of Manipur. Though Meetei Mayek (the Manipuri script) is known to have been used for writing Manipuri documents since the early 6th century AD, it was banned and replaced with the Bengali script by the then king of Manipur in the 18th century. Since the late 1970s, the Government of Manipur has made efforts to reintroduce Meetei Mayek, and the script was included in Unicode in 2009. Meetei Mayek is progressively replacing the Bengali script in schools, colleges, offices, and other places. During the more than 300 years in which the Bengali script served as the writing script for Manipuri, a huge volume of Manipuri documents was created in the Bengali script. Almost all Manipuri literary materials are in the Bengali script, and the population of Manipur is broadly divided into two groups: those literate in the Bengali script and those literate in Meetei Mayek. After a few decades, the majority of the Manipuri population will not be able to read or write the Bengali script, creating a huge gap in access to literary materials. Therefore, there is an urgent need to develop an effective system to convert Manipuri documents written in the Bengali script to Meetei Mayek and thereby bridge the script divide. Motivated by this concern, this thesis focuses on the following three research problems associated with the development of an automatic document conversion system (DCS) for converting Manipuri documents in the Bengali script to Meetei Mayek.
  • Item
    Automatic Taxonomy Expansion in IndoWordNet (With special reference to Assamese WordNet)
    (2023) Phukon, Bornali
    The task of automatic taxonomy expansion plays a significant role in natural language processing (NLP), as it helps to overcome the issue of low coverage in taxonomies. By effectively performing this task, various NLP applications like information retrieval, text classification, and natural language understanding can achieve better accuracy and efficacy. While numerous studies have explored the challenges of automatic taxonomy expansion, the methods and techniques used in these studies may be less effective for taxonomies like WordNet due to their unique structure and organization. WordNet is a widely used lexical taxonomy of concepts in a language that comprises not only a hierarchical organization of concepts but also information regarding other semantic relations among the concepts, such as synonymy, meronymy, and troponymy, which distinguishes it from other taxonomies. The creation of WordNets typically involves manual methods; however, a substantial number of WordNets, such as those included in IndoWordNet, are currently generated through the expansion approach. Despite its widespread usage, creating a WordNet is challenging, with two significant problems being limited coverage and missing relations. The manual creation process can result in limited coverage, while the expansion approach may result in missing relations between concepts and words. While previous studies have sought to address the issue of limited coverage, the problem of missing relations has yet to receive adequate attention. Furthermore, while automatic taxonomy expansion approaches have been proposed to resolve the issue of limited coverage, their effectiveness for WordNet expansion remains in question. The primary reason is that the expansion of WordNet requires not only inserting new concepts (attach operation) but also extending existing ones (merge operation). 
However, most existing studies on taxonomy expansion focus only on the attach operation. Furthermore, WordNet taxonomies, especially those in Indian languages, tend to have a multi-root structure. This makes it more challenging to use traditional methods for expanding WordNet taxonomies, as these methods are not designed to handle a multi-root structure, which may limit their usefulness. In light of these challenges, this thesis aims to address the problem of automatic taxonomy expansion by tackling the challenges in WordNet, especially in IndoWordNet. The objective is to develop a solution that can be extended to other taxonomies beyond WordNet. This thesis first studies the problem of missing synonymy relations in the WordNet taxonomy, taking the Assamese WordNet as a case study, and investigates the effectiveness of link prediction methods. As WordNets can be visualized as a network of unique words connected by synonymy relations, link prediction in complex network analysis is an effective way of predicting missing relations in a network. Hence, link prediction methods were used in the current work to predict the missing synonyms in the Assamese WordNet, and they proved effective. It is also observed that, for discovering missing relations in the Assamese WordNet, simple local proximity-based methods might be more effective than global methods and complex supervised models using network embedding. Second, a novel multi-task learning-based deep learning method known as Taxonomy Expansion with Attach and Merge (TEAM) is proposed, which performs both the merge and attach operations. To the best of our knowledge, this is the first study that integrates both the merge and attach operations in a single model. The proposed models have been evaluated on three separate WordNet taxonomies, viz., Assamese, Bangla, and Hindi. 
Across the various experimental setups, TEAM outperforms its state-of-the-art counterparts on the attach operation and also provides highly encouraging performance on the merge operation. Third, as TEAM considers only local context, it faces challenges when applied to multi-root taxonomies. To address this limitation of TEAM, this thesis proposes another approach, LG-TEAM, which combines both the local and global context of the taxonomy in an integrated attach-merge expansion environment, providing a more robust solution to the problem of taxonomy expansion. Extensive experiments on English, Assamese, Bengali, and Hindi WordNets demonstrate both the effectiveness and the efficiency of LG-TEAM for automatic taxonomy expansion.
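    The abstract above describes predicting missing synonymy relations with local proximity-based link prediction but does not name the specific measures used. The sketch below illustrates the general idea with the Jaccard coefficient over shared neighbours, which is one standard local proximity score and is an assumption here, not the thesis's stated method. The function names and the placeholder words `w1` to `w5` are hypothetical.

```python
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map from (word, word) synonymy pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def jaccard(graph, u, v):
    """Local proximity score: shared neighbours divided by all neighbours."""
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0


def rank_missing_links(graph, top_k=3):
    """Score every unlinked word pair and return the top_k candidates."""
    candidates = []
    for u, v in combinations(sorted(graph), 2):
        if v in graph[u]:
            continue  # already linked, nothing to predict
        score = jaccard(graph, u, v)
        if score > 0:
            candidates.append((score, u, v))
    # Highest score first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    return candidates[:top_k]


# Toy synonymy network with placeholder words (not real WordNet data).
edges = [("w1", "w2"), ("w1", "w3"), ("w2", "w3"),
         ("w2", "w4"), ("w3", "w4"), ("w4", "w5")]
graph = build_graph(edges)

for score, u, v in rank_missing_links(graph):
    print(f"{u} - {v}: {score:.3f}")
```

    In this toy network the pair `w1` and `w4` ranks first because the two words share two of their three distinct neighbours, which is the kind of purely local evidence such methods rely on.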