Stemming is the process of automatically extracting the base form of a
given word of a language. Assamese is a morphologically rich Indo-Aryan
language with relatively free word order, spoken in the north-eastern
part of India and written in the Assamese-Bengali script. As it is among the
less computationally studied languages, our aim is to extract the stem from a
given word. We adopt the suffix stripping approach along with a rule
engine that generates all the possible suffix sequences.
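The post does not describe how the rule engine works, so the following is only a minimal sketch of one way such suffix sequences could be generated. It assumes that suffixes fall into ordered, optional slots (for example a plural marker followed by a case marker). The function name `suffix_sequences` and the romanized suffixes are hypothetical placeholders, not the actual rules or suffix list.

```python
from itertools import product

def suffix_sequences(slots):
    """Build every non-empty suffix sequence from ordered, optional slots.

    Assumption: each slot is a list of alternative suffixes, and a token
    takes at most one suffix from each slot, in the given order.
    """
    # An empty string in each slot means "this slot is skipped".
    options = [[""] + list(slot) for slot in slots]
    sequences = {"".join(combo) for combo in product(*options)}
    sequences.discard("")  # drop the sequence where every slot was skipped
    # Longest sequences first, then alphabetical, for a stable order.
    return sorted(sequences, key=lambda s: (-len(s), s))

# Hypothetical romanized placeholders, not real entries from the suffix list.
slots = [["bor"], ["ok", "ot"]]
print(suffix_sequences(slots))
```

Listing the longest sequences first is a design choice made here so that a stripper can try compound suffixes before their shorter components.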
Algorithm-I
- Read a line from the corpus file.
- Extract the words from the line (from this point on, each word is called a token) and clean each token, that is, remove any punctuation mark attached to it.
- Look up the manually compiled suffix list, matching from the end of the token. If a suffix matches, strip it to extract the stem and exit.
- Return to the first step and repeat until the end of the corpus.
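The steps of Algorithm-I can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation from the paper: the suffixes are hypothetical romanized placeholders, trying the longest suffix first is an assumption (the post does not say how competing matches are resolved), and the names `clean`, `strip_suffix` and `stem_corpus` are invented for this example.

```python
import string

# Hypothetical romanized placeholders; the real list is compiled manually.
SUFFIXES = ["bor", "ok", "ot", "or"]

def clean(token):
    """Remove punctuation attached to either end of a token."""
    # The danda is added because it is not part of string.punctuation.
    return token.strip(string.punctuation + "।")

def strip_suffix(token, suffixes=SUFFIXES):
    """Return the token with the first matching suffix removed."""
    # Assumption: longer suffixes are tried before shorter ones.
    for suffix in sorted(suffixes, key=len, reverse=True):
        # Require a non-empty remainder so a token is never reduced to nothing.
        if len(token) > len(suffix) and token.endswith(suffix):
            return token[: -len(suffix)]
    return token

def stem_corpus(lines):
    """Yield (token, stem) pairs for every token in an iterable of lines."""
    for line in lines:
        for raw in line.split():
            token = clean(raw)
            if token:
                yield token, strip_suffix(token)

for token, stem in stem_corpus(["manuhbor, ghorot."]):
    print(token, "->", stem)
```

Passing an open file object as `lines` would process a corpus file line by line, as in the first step of the algorithm.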
Algorithm-II
- Read a line from the corpus file.
- Extract the words from the line (from this point on, each word is called a token) and clean each token, that is, remove any punctuation mark attached to it.
- Check the dictionary. If a dictionary entry matches the token, mark the token as a root word and exit; otherwise, execute the next step.
- Look up the manually compiled suffix list, matching from the end of the token. If a suffix matches, strip it to extract the stem and exit.
- Return to the first step and repeat until the end of the corpus.
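The dictionary check that distinguishes Algorithm-II can be sketched as below. Again this is only an illustration: the root words and suffixes are hypothetical romanized placeholders, the function name `stem` is invented here, and the longest-suffix-first order is an assumption rather than something stated in the post.

```python
def stem(token, root_words, suffixes):
    """Return the stem of a cleaned token, consulting a root-word list first."""
    # Dictionary step: a token that is already a known root is left untouched.
    if token in root_words:
        return token
    # Suffix step: assumption that longer suffixes are tried first.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if len(token) > len(suffix) and token.endswith(suffix):
            return token[: -len(suffix)]
    return token

# Hypothetical placeholders, not entries from the actual root-word list.
root_words = {"kor", "manuh"}
suffixes = ["bor", "or"]

print(stem("kor", root_words, suffixes))       # protected by the dictionary
print(stem("kor", set(), suffixes))            # stripped without the dictionary
print(stem("manuhbor", root_words, suffixes))  # falls through to suffix stripping
```

The second call shows the kind of over-stemming that the dictionary lookup is meant to prevent: a root that happens to end in a suffix-like string is cut when no root-word list is available.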
We obtained 82% accuracy with the suffix stripping approach after adding a
root-word list of approximately 20,000 entries.
For more information, see the paper.