Chinese is a unique and magical language. When we apply natural language processing to Chinese data, we encounter difficulties that many other languages do not have, such as word segmentation. There are no spaces between Chinese words. So how can a computer know how to split a sentence like this one: "Married and not married youth must practice family planning"?
This is the so-called problem of segmentation ambiguity. Many language models now offer fairly elegant ways to resolve it. However, in the field of Chinese word segmentation there is another kind of word that still confuses us: unknown words, such as “给力”.
Over the last decade, work in Chinese word segmentation has concentrated on overcoming this difficulty.
So let’s look at some interesting ways to solve this problem. To extract words from a text, our first question is: what kind of text fragment counts as a word? The first criterion that comes to mind is that the fragment appears often enough. However, high frequency alone is not enough: a frequent fragment may not be a word at all, but a phrase made up of several words. "the movie" appears 389 times in the status updates of all users at renren.com, while "cinema" appears only 175 times. Even so, we are more inclined to treat "cinema" as a word, because its two parts, "movie" and "courtyard", are bound together more tightly.
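The frequency criterion above can be sketched as a simple substring count. This is a minimal illustration, not the method used on the renren.com data: the function name `count_fragments`, the `max_len` parameter, and the toy string are all assumptions made up for the example.

```python
from collections import Counter


def count_fragments(text, max_len=3):
    """Count every substring of length 1..max_len in text."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Toy string invented for illustration (not the renren.com data).
# It contains 电影 (movie), 电影院 (cinema) and 的电影 ("the movie").
toy_text = "看电影去电影院的电影院"
counts = count_fragments(toy_text)

for fragment in ("电影", "电影院", "的电影"):
    print(fragment, counts[fragment])
```

Because Chinese has no spaces, every substring up to some length is a candidate, which is why frequency alone cannot tell a real word from a phrase that simply occurs often.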
To show that the internal cohesion of "cinema" really is high, we can do a calculation: if "movie" and "courtyard" appeared independently in the text, the probability that they would just happen to land next to each other would be much smaller. From 24 million characters of data, we find that the probability of “cinema” is more than 300 times the predicted value, where the predicted value is the probability of “movie” multiplied by the probability of “courtyard”. By the same method, the probability of “the movie” is only 8.5 times its predicted value. The results show that "cinema" is a meaningful combination of its two components, whereas "the movie" looks more like "the" and "movie" landing side by side only occasionally.
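The comparison described above, observed probability divided by the probability predicted under independence, can be sketched as follows. The function name `cohesion_ratio` and all counts in the example are hypothetical; the article's own counts for the individual parts are not given here, so the numbers below are invented purely to show the arithmetic.

```python
def cohesion_ratio(count_whole, count_left, count_right, total_chars):
    """Ratio of a fragment's observed probability to the probability
    predicted if its two parts occurred independently."""
    p_whole = count_whole / total_chars
    p_left = count_left / total_chars
    p_right = count_right / total_chars
    return p_whole / (p_left * p_right)


# Hypothetical counts, invented for illustration only.
# A fragment whose parts are rare on their own scores high...
tight = cohesion_ratio(count_whole=10, count_left=20, count_right=50,
                       total_chars=1000)
# ...while a fragment built from very common parts scores low.
loose = cohesion_ratio(count_whole=10, count_left=200, count_right=100,
                       total_chars=1000)

print(tight, loose)
```

A ratio far above 1 suggests the parts occur together much more often than chance would predict, which is the sense in which "cinema" is more cohesive than "the movie".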