Baidu word segmentation method
Chinese is a very complex language, which makes it hard for a computer to understand. In Chinese word segmentation, two major problems have not yet been fully solved.
1. Ambiguity recognition
Ambiguity means that the same sentence can be segmented in two or more ways. Take the phrase "surface" (表面的): because "表面" and "面的" are both words, the phrase can be split into "表面" + "的" or into "表" + "面的". This is called cross ambiguity. Cross ambiguity of this kind is very common; the "Korean drama" example given earlier was in fact an error caused by it. "Korean drama" can be split into "Korean" + "TV series" or into "Korean TV" + "drama". Without human knowledge to draw on, it is difficult for a computer to know which segmentation is correct.
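As a rough illustration of cross ambiguity, the sketch below enumerates every way a string can be split into dictionary words. It assumes the "surface" example is the Chinese string 表面的; the function name `all_segmentations` and the four-entry dictionary are hypothetical and chosen only for this example. It does not describe how Baidu's segmenter works.

```python
def all_segmentations(text, dictionary):
    """Return every way to split `text` into words found in `dictionary`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            # Keep this prefix and segment the remainder recursively.
            for tail in all_segmentations(text[end:], dictionary):
                results.append([head] + tail)
    return results


# Hypothetical toy dictionary (an assumption, not a real lexicon).
toy_dictionary = {"表", "表面", "面的", "的"}

for words in all_segmentations("表面的", toy_dictionary):
    print(" + ".join(words))
```

With this toy dictionary the function finds exactly two candidate splits, and nothing in the dictionary alone says which one is right.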
Compared with combinatorial ambiguity, cross ambiguity is relatively easy to handle; combinatorial ambiguity has to be judged from the whole sentence. For example, in the sentence "the door handle is broken" (这个门把手坏了), "handle" (把手) is a word, but in the sentence "please take your hand away" (请把手拿开), the same two characters do not form a word. Likewise, in the sentence "the general appointed a lieutenant general" (将军任命了一名中将), "lieutenant general" (中将) is a word, but in the sentence "output will increase twofold within three years" (产量三年中将增长两倍), it is no longer a word. How is a computer to identify these cases?
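One way to see why cross ambiguity is the easier case is to compare forward and backward maximum matching, a common dictionary-based technique that is used here only for illustration and is not claimed to be Baidu's method. The sketch assumes the examples are the Chinese strings 表面的 and 请把手拿开; the function names and the toy dictionary are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation, longest dictionary word first."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len=4):
    """Greedy right-to-left segmentation, longest dictionary word first."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()  # words were collected from the end of the text
    return words


# Hypothetical toy dictionary (an assumption, not a real lexicon).
toy_dictionary = {"表面", "面的", "把手", "拿开"}

for sentence in ["表面的", "请把手拿开"]:
    fwd = forward_max_match(sentence, toy_dictionary)
    bwd = backward_max_match(sentence, toy_dictionary)
    print(sentence, fwd, bwd, "disagree" if fwd != bwd else "agree")
```

For 表面的 the two passes disagree, which at least flags the cross ambiguity. For 请把手拿开 both passes agree on treating 把手 as a word, so the combinatorial ambiguity goes unnoticed; detecting it needs information from the whole sentence.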
Even if a computer could resolve both cross ambiguity and combinatorial ambiguity, one more problem would remain: true ambiguity. True ambiguity means that even a person given the sentence cannot tell which strings are words and which are not. For example, "legend landers" can be cut into "legendary" + "landers", or into "legend" + "Sifu" + "lander". With no other sentences as context, probably no one can say which candidate should count as a word here.
2. New word recognition
New words are known in the jargon as unknown words: words that are not included in the dictionary but can genuinely be called words. The most typical case is personal names. A person easily understands that in the sentence "Wang Junhu went to Guangzhou", "Wang Junhu" (王军虎) is a word because it is a person's name, but this is difficult for a computer to recognize. If "Wang Junhu" were added to the dictionary as a word, there are so many names in the world, with new ones appearing all the time, that collecting them would itself be a huge project. Even if that work could be completed, problems would remain. For example, in the sentence "Wang Jun dignified and strong" (王军虎头虎脑的), where the same three characters appear in a row but the name is only "Wang Jun", can "Wang Junhu" still count as a word?
Besides personal names, organization names, place names, product names, brand names, abbreviations, and elliptical forms are all very difficult to handle, and these are exactly the words people use most often. For a search engine, therefore, new word recognition in the segmentation system is very important. At present, the accuracy of new word recognition has been >