
Part of Speech tagging of Assamese

Part of Speech (POS) tagging is the process of labeling each word and punctuation character in a text with an appropriate POS tag. Two problems stand out. Many words that occur in natural language texts are not listed in any catalog or lexicon, and a large percentage of words are ambiguous with respect to lexical category.

The challenges of our work on POS tagging for Assamese, an Indo-European language, are compounded by the fact that very little prior computational linguistic work exists for the language, though it is one of the officially recognized (scheduled) languages of India and is spoken by over 30 million people. Assamese is a morphologically rich, inflectional language with free word order. Although POS-tagged corpora for some Indian languages such as Hindi, Bengali, and Telugu have become available lately, a POS-tagged corpus for Assamese was unavailable until we started creating one for the work presented here. Another problem was that no clearly defined POS tagset for Assamese was available to us. As part of the work reported in this paper, we have developed a tagset of 172 tags. Using this tagset, we have manually tagged a corpus of about 10,000 Assamese words.

In Assamese, secondary forms of words are formed through three processes: affixation, derivation and compounding. Affixes play a very important role in word formation. Affixes are used in the formation of relational nouns and pronouns, and in the inflection of verbs with respect to number, person, tense, aspect and mood. 

We have obtained an average tagging accuracy of 87% using a training corpus of just 10,000 words. Our main achievement is the creation of the Assamese tagset, which was not available before we started this project. We have implemented an existing method for POS tagging, but our work is for a new language for which neither an annotated corpus nor a pre-defined tagset was available.

We are currently working on developing a smaller, more compact tagset. We propose the following additional work to improve performance. First, the size of the manually tagged part of the corpus will have to be increased. Second, a suitable procedure for handling unknown proper nouns will have to be developed. Third, we believe that tagging accuracy will increase if the system is extended to trigrams or higher-order n-grams using a larger training corpus.
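This post does not name the tagging method, but the plan to move "to trigrams" suggests that the current system conditions each tag on the previous one. As an illustration only, here is a minimal sketch of a bigram hidden Markov model tagger with Viterbi decoding. The choice of an HMM, the add-one smoothing, the function names, and the tiny English toy corpus with its three tags are all assumptions made for this example. None of them come from the report or from the 172-tag Assamese tagset.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start marker


def train_bigram_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of emitted words
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
    return transitions, emissions, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability, so unseen events are never zero."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, transitions, emissions, vocab):
    """Return the most probable tag sequence under the bigram model."""
    if not words:
        return []
    tags = list(emissions)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot shared by all unknown words
    # best[i][t] = (log score of the best path ending in tag t at i, previous tag)
    best = [{}]
    for t in tags:
        score = (log_prob(transitions[START], t, n_tags)
                 + log_prob(emissions[t], words[0], n_words))
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            emit = log_prob(emissions[t], words[i], n_words)
            best[i][t] = max(
                (best[i - 1][p][0] + log_prob(transitions[p], t, n_tags) + emit, p)
                for p in tags
            )
    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]


# Toy English corpus, purely illustrative.
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
transitions, emissions, vocab = train_bigram_hmm(toy_corpus)
print(viterbi(["the", "cat", "barks"], transitions, emissions, vocab))
```

In a sketch like this, a trigram extension would condition each transition on the two preceding tags instead of one. That multiplies the number of parameters to estimate, which is one reason a larger training corpus would be needed.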

For more details, see the report "Part of speech tagger for Assamese text".



