Cross-Lingual Embedding between Non-Isomorphic Language Pairs (Special Focus on English and Manipuri)

dc.contributor.author: Naorem, Deepen
dc.date.accessioned: 2025-12-01T05:32:49Z
dc.date.issued: 2025
dc.description: Supervisor: Singh, Sanasam Ranbir and Sarmah, Priyankoo
dc.description.abstract: Research in Natural Language Processing (NLP) has primarily focused on resource-rich languages like English, leaving low-resource languages underrepresented and contributing to a phenomenon known as the digital divide. This disparity limits the development of NLP tools for low-resource languages such as Manipuri, a morphologically rich Tibeto-Burman language. Transfer learning, which leverages resource-rich languages, has emerged as a solution to this challenge, with cross-lingual embeddings playing a pivotal role in aligning lexical units between languages. This thesis first presents a comprehensive empirical evaluation of cross-lingual embeddings between English and Manipuri, a distant language pair, on bilingual dictionary induction (BDI) using state-of-the-art supervised and unsupervised approaches. The findings show that the non-isomorphic nature of the language pair degrades cross-lingual embedding quality, which makes dictionary pair selection crucial, and that the morphological richness of the target language further affects BDI performance. The thesis then proposes two novel approaches to address the challenges posed by structural and morphological disparities in distant language pairs. First, a ridge regression-based orthogonal mapping method is introduced that incorporates graph centrality for improved dictionary alignment; it outperforms conventional orthogonal mapping techniques, particularly for structurally distant languages like English-Manipuri. Second, a contrastive learning-based method is developed to leverage the morphological richness of Manipuri. Experimental results across several language pairs show significant improvements over baseline methods in BDI, machine translation, and cross-lingual sentence retrieval tasks.
Furthermore, given the rapid advancement of Large Language Models (LLMs), this thesis evaluates the performance of unsupervised, supervised, and few-shot prompting approaches using LLMs for BDI across distant language pairs. The findings reveal that few-shot prompting, which relies on only a handful of examples, consistently outperforms the unsupervised and supervised methods, demonstrating robustness against overfitting and cost-effectiveness for low-resource languages. These results suggest that few-shot prompting is a powerful alternative for multilingual BDI tasks, with future work focusing on prompt optimization.
dc.identifier.other: ROLL NO.186155102
dc.identifier.uri: https://gyan.iitg.ac.in/handle/123456789/3038
dc.language.iso: en
dc.relation.ispartofseries: TH-3776
dc.rights: https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.uri: https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.title: Cross-Lingual Embedding between Non-Isomorphic Language Pairs (Special Focus on English and Manipuri)
dc.type: Thesis

Files

Original bundle

Name: Abstarct-TH-3776_186155102.pdf
Size: 73.69 KB
Format: Adobe Portable Document Format
Description: ABSTRACT

Name: TH-3776_186155102.pdf
Size: 14.77 MB
Format: Adobe Portable Document Format
Description: THESIS

License bundle

Name: license.txt
Size: 227 B
Format: Item-specific license agreed to upon submission
Description: