Title: Handling Real-Word Errors of Hindi Language using N-gram and Confusion Set
Abstract: The two major classes of typographic error in any language are non-word errors and real-word errors. Researchers have studied the former extensively, but the latter has received far less attention. This paper proposes a technique for identifying real-word errors in Hindi that combines bigram, trigram and confusion set (CS) methods. For each Hindi test word, the left bigram pairs it with the immediately preceding word, the right bigram pairs it with the immediately following word, and the trigram spans the preceding word, the test word and the following word. A confusion set of the words most easily confused with the test word is built using Levenshtein edit distance. A composite score is then calculated for every member of the CS from the bigram and trigram probabilities, and this score is used to prepare the suggestion list for the erroneous word. The method was evaluated on a Hindi text file of 2000 words, where it achieved precision of 0.70 to 0.75, recall of 0.80 to 0.85 and F-score of 0.70 to 0.80. Future work could improve these results by considering the whole sentence at once.
Publication Year: 2019
Publication Date: 2019-02-01
Language: en
Type: article
Indexed In: ['crossref']
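Illustrative sketch: The pipeline described in the abstract (confusion set from Levenshtein edit distance, then a composite score from left-bigram, right-bigram and trigram evidence) can be sketched roughly as follows. This is not the authors' implementation. The abstract does not give the composite formula, so the unweighted sum used here is an assumption, as are the edit-distance threshold of 1, the `<s>` and `</s>` padding tokens, the four-sentence toy corpus, and every function name. Edit distance is computed over Unicode code points, which is a simplification for Devanagari.

```python
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance over Unicode code points (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def build_counts(sentences):
    """Count unigrams, bigrams, trigrams and (left, right) contexts."""
    uni, bi, tri, ctx = Counter(), Counter(), Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]  # assumed boundary padding
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
        for l, w, r in zip(toks, toks[1:], toks[2:]):
            tri[(l, w, r)] += 1
            ctx[(l, r)] += 1
    return uni, bi, tri, ctx

def confusion_set(word, vocab, max_dist=1):
    """Vocabulary words within max_dist edits of word (assumed threshold)."""
    return {v for v in vocab if edit_distance(word, v) <= max_dist}

def suggest(tokens, i, counts, max_dist=1):
    """Rank confusion-set members for tokens[i] by an assumed composite score."""
    uni, bi, tri, ctx = counts
    padded = ["<s>"] + tokens + ["</s>"]
    l, w, r = padded[i], padded[i + 1], padded[i + 2]
    vocab = set(uni) - {"<s>", "</s>"}
    candidates = confusion_set(w, vocab, max_dist) | {w}
    scored = []
    for c in candidates:
        left = bi[(l, c)] / uni[l] if uni[l] else 0.0      # left bigram
        right = bi[(c, r)] / uni[r] if uni[r] else 0.0     # right bigram
        trigram = tri[(l, c, r)] / ctx[(l, r)] if ctx[(l, r)] else 0.0
        scored.append((c, left + right + trigram))         # assumed: plain sum
    return sorted(scored, key=lambda x: (-x[1], x[0]))

# Toy corpus, made up for illustration only.
corpus = [
    "मैं घर जा रहा हूँ",
    "मैं स्कूल जा रहा हूँ",
    "घर में पानी है",
    "स्कूल में बच्चे हैं",
]
counts = build_counts(corpus)

# "में" (in) is a valid word but wrong here; "मैं" (I) is intended.
suggestions = suggest("में घर जा रहा हूँ".split(), 0, counts)
print(suggestions)
```

On this toy corpus the candidate "मैं" ranks first because it is the only confusion-set member attested in the sentence-initial context before "घर". A real system would need smoothing for unseen n-grams and a far larger corpus.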
Access and Citation
Cited By Count: 4