
Posts

Showing posts with the label Stemming

Stemming of Assamese

Stemming is the process of automatically extracting the base form of a given word in a language. Assamese is a morphologically rich Indo-Aryan language with relatively free word order, spoken in the north-eastern part of India and written in the Assamese-Bengali script. As it is among the less computationally studied languages, our aim is to extract the stem from a given word. We adopt a suffix-stripping approach along with a rule engine that generates all possible suffix sequences.

Algorithm-I
1. Read a line from the corpus file.
2. Extract the words (from this point on called tokens) from the line and clean each token, that is, remove any punctuation mark attached to it.
3. Look up the manually generated suffix list, matching from the end of the token. If a suffix matches, strip it to extract the stem and exit.
4. Go to step 1 until the end of the corpus.

Algorithm-II
1. Read a line from the corpus file.
2. Extract the words (from this point on called tokens) from the line, cl
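The steps of Algorithm-I can be sketched in Python roughly as follows. This is only a minimal illustration, not the implementation behind the post: the suffix list here is a tiny placeholder (the manually built list is not shown in the excerpt), the function names `clean`, `stem`, and `stem_corpus` are hypothetical, and trying longer suffixes first is an assumption about how ties between overlapping suffixes might be resolved.

```python
import string

# Placeholder suffix list for illustration only; the post's actual
# hand-built list is not shown in the excerpt.
SUFFIXES = ["বোৰ", "খন", "ৰ", "ত"]

# ASCII punctuation plus the danda used as a sentence terminator.
PUNCTUATION = string.punctuation + "।"


def clean(token):
    """Remove punctuation attached to either end of a token."""
    return token.strip(PUNCTUATION)


def stem(token, suffixes=SUFFIXES):
    """Strip the first matching suffix from the end of the token.

    Longer suffixes are tried first (an assumption), and a suffix is
    never stripped if that would leave an empty stem.
    """
    for suffix in sorted(suffixes, key=len, reverse=True):
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token  # no suffix matched; return the token unchanged


def stem_corpus(lines):
    """Yield (token, stem) pairs for every word in an iterable of lines."""
    for line in lines:
        for raw in line.split():
            token = clean(raw)
            if token:
                yield token, stem(token)
```

For a corpus stored on disk, `stem_corpus(open(path, encoding="utf-8"))` would walk the file line by line, mirroring the "go to step 1 until the end of the corpus" loop. A single-pass lookup like this strips only one suffix per token, which may be why the post also mentions a rule engine for generating suffix sequences.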

Suffix based Noun and Verb Identifier for Assamese