(An) Online Semi Automated Part of Speech Tagging Technique Applied To Assamese

Developing annotated, tagged corpora for a language with limited electronic resources can be very demanding. Although Assamese is spoken as a first language by about 15 million people in the Indian state of Assam, the development of electronic resources for the language has lagged behind that of other Indian languages, and little work has been done on part-of-speech (POS) tagging for Assamese. To fill this gap, we have designed a POS tagger for Assamese. Our approach combines several methods to obtain good results. Further, we amortise the manual intervention over the period of tagging rather than doing all the manual work at the beginning. This allows the tagging system to be started quickly, but it also means that what we have is a semi-automatic tagger and not an automatic one. In stages other than the beginning, our method requires intervention only from native speakers, which makes the system amenable to being run largely by native speakers, with a few experts for moderation. This will enable our system to create very large tagged corpora in the language. We first create a knowledge base using a number of methods. This knowledge base is then used to automatically tag sentences in the language. The tagging uses a combination of stemming, the application of a few grammatical rules, and a bigram tagger. The tagged sentences are then shown to non-expert native speakers for verification and correction. Before the actual tagging process was started, the knowledge base was tuned by examining the results on small data sets, using experts instead of native speakers. The design of a user-friendly interface plays an important role in reducing the time native speakers spend on their examination.
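The tag, verify, and update loop described in the abstract can be illustrated with a minimal Python sketch. It is not the implementation from the thesis: the seed sentences are English stand-ins rather than Assamese, the one-entry suffix list, the scoring formula, and all function names are assumptions made for illustration, and the grammatical rules are omitted entirely.

```python
from collections import Counter, defaultdict

# Toy hand-tagged seed data (English stand-ins, NOT Assamese; illustrative only).
SEED = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
SUFFIXES = ("s",)  # hypothetical suffix list for the toy stemmer


def update_knowledge_base(lexicon, transitions, sentence):
    """Fold one verified (word, tag) sentence into the knowledge base."""
    prev = "<s>"  # sentence-start marker
    for word, tag in sentence:
        lexicon[word][tag] += 1
        transitions[prev][tag] += 1
        prev = tag


def build_knowledge_base(tagged_sentences):
    """Create word/tag counts and tag-bigram counts from seed sentences."""
    lexicon = defaultdict(Counter)      # word -> counts of tags seen for it
    transitions = defaultdict(Counter)  # previous tag -> counts of next tags
    for sentence in tagged_sentences:
        update_knowledge_base(lexicon, transitions, sentence)
    return lexicon, transitions


def stem(word):
    """Strip the first matching suffix (a stand-in for a real stemmer)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def tag_sentence(words, lexicon, transitions):
    """Greedy left-to-right tagging: lexicon, then stem, then bigram fallback."""
    tagged, prev = [], "<s>"
    for word in words:
        candidates = lexicon.get(word) or lexicon.get(stem(word))
        if candidates:
            # Assumed score: lexical count weighted by the tag-bigram count.
            tag = max(candidates,
                      key=lambda t: candidates[t] * (1 + transitions[prev][t]))
        elif transitions[prev]:
            # Unknown word: take the tag that most often follows the previous tag.
            tag = transitions[prev].most_common(1)[0][0]
        else:
            tag = "UNK"
        tagged.append((word, tag))
        prev = tag
    return tagged


lexicon, transitions = build_knowledge_base(SEED)
proposal = tag_sentence(["the", "bird", "runs"], lexicon, transitions)
# A native speaker would verify or correct `proposal` here; the checked
# sentence is then folded back in, so "bird" is known the next time.
update_knowledge_base(lexicon, transitions, proposal)
```

Folding each verified sentence back into the counts is what spreads the manual effort over the whole tagging period instead of concentrating it in an initial annotation phase.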
Supervisor: Gautam Barua