Stemming is the process of automatically extracting the base form of a
given word in a language. Assamese is a morphologically rich Indo-Aryan
language with relatively free word order, spoken in the north-eastern
part of India and written in the Assamese-Bengali script. Because it is
among the less computationally studied languages, our aim is to extract
the stem from a given word. We adopt a suffix-stripping approach together
with a rule engine that generates all possible suffix sequences.
Algorithm-I
- Read a line from the corpus file.
- Extract the words from the line (from this point on, each word is called a token) and clean each token by removing any punctuation mark attached to it.
- Match the end of the token against the manually compiled suffix list. If a suffix matches, strip it to extract the stem and stop processing this token.
- Repeat from the first step until the end of the corpus is reached.
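
The steps of Algorithm-I can be sketched in Python roughly as follows. This is a minimal illustration, not the original implementation: the three-entry `SUFFIXES` list, the helper names `clean`, `strip_suffix`, and `stem_corpus`, and the choice to try longer suffixes first are all assumptions made for the example. The real suffix list is compiled by hand and is much larger.

```python
import string

# Tiny illustrative suffix list (assumption; the real list is hand-built).
SUFFIXES = ["বোৰ", "ৰ", "ত"]

# ASCII punctuation plus the danda used as a sentence terminator.
PUNCTUATION = string.punctuation + "।"


def clean(token):
    """Remove punctuation marks attached to either end of the token."""
    return token.strip(PUNCTUATION)


def strip_suffix(token, suffixes=SUFFIXES):
    """Return the token without its first matching suffix, if any."""
    # Assumption: longer suffixes are tried before shorter ones.
    for suffix in sorted(suffixes, key=len, reverse=True):
        # Never strip the whole token away.
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token


def stem_corpus(lines):
    """Yield (token, stem) pairs for every word in an iterable of lines."""
    for line in lines:
        for raw in line.split():
            token = clean(raw)
            if token:
                yield token, strip_suffix(token)
```

Because `stem_corpus` accepts any iterable of lines, an open corpus file can be passed to it directly, which corresponds to reading the corpus one line at a time.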
 
Algorithm-II 
- Read a line from the corpus file.
- Extract the words from the line (from this point on, each word is called a token) and clean each token by removing any punctuation mark attached to it.
- Look the token up in the dictionary. If a dictionary entry matches the token, mark the token as a root word and stop; otherwise, continue with the next step.
- Match the end of the token against the manually compiled suffix list. If a suffix matches, strip it to extract the stem and stop processing this token.
- Repeat from the first step until the end of the corpus is reached.
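
A minimal Python sketch of the per-token part of Algorithm-II is given below, to show why the dictionary check comes before suffix stripping. The names `ROOT_WORDS`, `SUFFIXES`, and `stem`, the toy entries in both collections, and the longest-suffix-first ordering are assumptions made for this example, not the original implementation.

```python
import string

# Toy data for illustration only (assumptions, not the real resources).
SUFFIXES = ["বোৰ", "ৰ", "ত"]
ROOT_WORDS = {"ঘৰ", "মানুহ"}

# ASCII punctuation plus the danda used as a sentence terminator.
PUNCTUATION = string.punctuation + "।"


def stem(token, roots=ROOT_WORDS, suffixes=SUFFIXES):
    """Return the stem of one token using the dictionary-first strategy."""
    token = token.strip(PUNCTUATION)

    # Dictionary check: a known root word is returned unchanged.
    if token in roots:
        return token

    # Otherwise fall back to suffix stripping (longest suffix first).
    for suffix in sorted(suffixes, key=len, reverse=True):
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token
```

With this toy data, a root word that happens to end in a listed suffix is protected by the dictionary: `stem("ঘৰ")` leaves the word intact, whereas calling `stem("ঘৰ", roots=set())` would strip the final character as if it were a suffix. The size and coverage of the root-word list therefore directly affect how often such over-stemming is avoided.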
 
We obtained 82% accuracy with the suffix-stripping approach after adding
a root-word list of approximately 20,000 entries.
For more information, see the paper.