Title: N-gram Based Two-Step Algorithm for Word Segmentation
Abstract: This paper describes an n-gram based reinforcement approach to the closed track of word segmentation in the third Chinese word segmentation bakeoff. Character unigram, bigram, and trigram features are extracted from the training corpus and their frequencies are counted. We investigate a step-by-step methodology that uses these n-gram statistics. In the first step, relatively definite segmentations are fixed using a tight threshold value. In the second step, the remaining tags are decided by considering the left or right space tags already fixed in the first step. Both the definite and the loose segmentation rely only on bigram and trigram statistics. To overcome the data sparseness of the bigram data, unigram statistics are used for smoothing.
Publication Year: 2006
Publication Date: 2006-07-01
Language: en
Type: article
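The two-step idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`train`, `boundary_prob`, `segment`), the threshold values, the interpolation weight `lam`, and the neighbour penalty in step 2 are all assumptions chosen for illustration, and trigram statistics are omitted for brevity. The sketch estimates the probability of a space between two adjacent characters from bigram counts, smooths it with a unigram estimate, fixes confident decisions first, and then resolves the rest with a looser threshold that looks at the already fixed neighbouring tags.

```python
from collections import Counter


def train(sentences):
    """Count boundary statistics from space-segmented training sentences."""
    pair_total, pair_space = Counter(), Counter()
    left_total, left_space = Counter(), Counter()
    for sent in sentences:
        chars, tags = [], []  # tags[i] == 1 if a word boundary follows chars[i]
        for word in sent.split():
            for j, ch in enumerate(word):
                chars.append(ch)
                tags.append(1 if j == len(word) - 1 else 0)
        for i in range(len(chars) - 1):
            pair = (chars[i], chars[i + 1])
            pair_total[pair] += 1
            pair_space[pair] += tags[i]
            left_total[chars[i]] += 1
            left_space[chars[i]] += tags[i]
    return pair_total, pair_space, left_total, left_space


def boundary_prob(model, a, b, lam=0.8):
    """Estimate P(space between a and b); unigram statistics smooth sparse bigrams.

    The linear interpolation with weight `lam` is an assumed smoothing scheme.
    """
    pair_total, pair_space, left_total, left_space = model
    uni = left_space[a] / left_total[a] if left_total[a] else 0.5
    if pair_total[(a, b)] == 0:
        return uni  # unseen bigram: fall back to the unigram estimate
    bi = pair_space[(a, b)] / pair_total[(a, b)]
    return lam * bi + (1 - lam) * uni


def segment(model, text, tight=0.9, loose=0.5, penalty=0.1):
    """Two-step segmentation: fix confident tags, then decide the rest."""
    if not text:
        return []
    probs = [boundary_prob(model, text[i], text[i + 1]) for i in range(len(text) - 1)]

    # Step 1: fix only the relatively definite decisions (tight threshold).
    fixed = [1 if p >= tight else 0 if p <= 1 - tight else None for p in probs]

    # Step 2: decide the remaining tags with a loose threshold, taking the
    # left/right tags fixed in step 1 into account. The penalty rule below
    # (discouraging a cut next to a fixed cut) is a hypothetical placeholder.
    tags = list(fixed)
    for i, p in enumerate(probs):
        if fixed[i] is not None:
            continue
        left = fixed[i - 1] if i > 0 else None
        right = fixed[i + 1] if i + 1 < len(fixed) else None
        threshold = loose + (penalty if 1 in (left, right) else 0.0)
        tags[i] = 1 if p >= threshold else 0

    # Rebuild words from the boundary tags.
    words, current = [], text[0]
    for i, tag in enumerate(tags):
        if tag == 1:
            words.append(current)
            current = text[i + 1]
        else:
            current += text[i + 1]
    words.append(current)
    return words
```

For example, `segment(train(["ab cd", "ab ef", "cd ab"]), "abcd")` uses the toy corpus to place one boundary between `b` and `c`. In a real setting the model would be trained on the bakeoff training corpus, and the thresholds would need tuning on held-out data.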