Hidden Markov Models (HMM), Maximum Entropy Markov Models (MEMM), and Conditional Random Fields (CRF) are the three most commonly used and most fundamental sequence labeling models.
Chronologically, HMM came first, MEMM second, and CRF last. The main ideas of the three are as follows:
1) The HMM model directly models the transition probabilities and the emission probabilities, estimating them from co-occurrence counts. HMM is a typical probabilistic directed graphical model and uses the standard probability computation of such graphs, with one restriction: in a general directed graph a node may have multiple predecessor nodes, whereas in an HMM each node has exactly one.
2) The MEMM model treats the transition and emission probabilities jointly, and what it estimates is a conditional probability. However, MEMM tends to fall into local optima, because it only performs local (per-step) normalization.
3) The CRF model estimates a global probability: normalization takes the global distribution of the data into account rather than being done locally, which solves the label bias problem of MEMM.
Consider the following example:
For a labeling task, take the sentence "I love Beijing Tiananmen" (我爱北京天安门) and label it character by character as "s s b e b c e", where s marks a single-character word, b the beginning of a word, c a continuation, and e an end.
- For HMM, the probability of this labeling is P = P(s transitions to s) * P('I' is emitted by s) * P(s transitions to b) * P('love' is emitted by s) * … Training requires estimating the state transition probability matrix and the emission matrix; a minimal sketch of this computation follows this list.
- For MEMM, the probability of this labeling is P = P(s transitions to s | 'I' is emitted by s) * P('I' is emitted by s) * P(s transitions to b | 'love' is emitted by s) * P('love' is emitted by s) * … Training requires estimating the conditional state transition probability matrix and the emission matrix. Compared with HMM, the state transition matrix becomes a conditional state transition matrix.
- For CRF, the probability of this labeling is P = F(s transitions to s, 'I' is emitted by s) …, where F is a function whose normalization is computed over the whole sequence, unlike the locally normalized probabilities of MEMM. The so-called local normalization of MEMM, as I understand it, means that each factor is a probability conditioned on its context, and under that condition the probabilities of all alternatives at that step must sum to 1; normalization happens step by step rather than over the whole sequence. CRF, the last of the three, has achieved dominant performance on many tasks, so for a new application CRF is usually the right choice.
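To make the HMM computation above concrete, here is a minimal sketch in Python. The parameters (random Dirichlet draws) are hypothetical stand-ins for matrices a real model would estimate by counting on a labeled corpus; only the scoring formula itself follows the text.

```python
import numpy as np

states = ["s", "b", "c", "e"]        # single / begin / continue / end
chars = list("我爱北京天安门")         # "I love Beijing Tiananmen"
labels = list("ssbebce")             # the gold labeling from the example
s2i = {t: i for i, t in enumerate(states)}

# Hypothetical parameters; a trained HMM would count these from data.
rng = np.random.default_rng(0)
start = rng.dirichlet(np.ones(len(states)))                    # P(s1)
trans = rng.dirichlet(np.ones(len(states)), size=len(states))  # P(s'|s)
emit = {t: dict(zip(chars, rng.dirichlet(np.ones(len(chars)))))  # P(x|s)
        for t in states}

# HMM scores the labeling as a product of transition and emission terms:
# P = P(s1) * P(x1|s1) * P(s2|s1) * P(x2|s2) * ...
p = start[s2i[labels[0]]] * emit[labels[0]][chars[0]]
for t in range(1, len(chars)):
    p *= trans[s2i[labels[t - 1]], s2i[labels[t]]]  # transition term
    p *= emit[labels[t]][chars[t]]                  # emission term
print(p)
```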
Essentially, CRF has the following three advantages:
1) Compared with HMM: CRF does not make HMM's strict independence assumptions, so it can accommodate arbitrary context information and allows flexible feature design (as in maximum entropy models).
2) Compared with MEMM: because CRF computes the conditional probability of the globally optimal output sequence rather than normalizing step by step, it overcomes the label bias (label-bias) shortcoming of the maximum entropy Markov model.
3) CRF computes the joint probability distribution of the entire label sequence given the observation sequence to be labeled, instead of defining the distribution of the next state given only the current state. Everything has two sides: these advantages mean CRF has more parameters to train, so compared with MEMM and HMM it has the disadvantages of high training cost and high complexity.
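The global, whole-sequence normalization behind advantages 2) and 3) can be made visible in a few lines. The sketch below uses hypothetical feature weights and brute-force enumeration of all label sequences (a real implementation would use the forward algorithm) purely to show where the partition function Z(x) comes from:

```python
import itertools
import math

# Minimal linear-chain CRF scorer with hand-picked toy weights.
# score(y|x) = sum_t [w_emit(y_t, x_t) + w_trans(y_{t-1}, y_t)]
# P(y|x) = exp(score) / Z(x), where Z(x) sums over ALL label sequences --
# this single global normalization is what MEMM lacks.
states = ["s", "b", "e"]
x = list("北京")                       # a tiny observation sequence

def w_emit(y, c):                      # hypothetical emission features
    return 1.0 if (y, c) in {("b", "北"), ("e", "京")} else 0.0

def w_trans(y_prev, y):                # hypothetical transition features
    return 1.0 if (y_prev, y) == ("b", "e") else 0.0

def score(ys):
    s = w_emit(ys[0], x[0])
    for t in range(1, len(x)):
        s += w_trans(ys[t - 1], ys[t]) + w_emit(ys[t], x[t])
    return s

# Global partition function: one normalization over every labeling of x.
Z = sum(math.exp(score(ys)) for ys in itertools.product(states, repeat=len(x)))
print(math.exp(score(("b", "e"))) / Z)  # P(y = "b e" | x)
```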
CRF vs. dictionary-based statistical word segmentation
- Dictionary-based word segmentation relies heavily on the dictionary and rule base, so its ability to recognize ambiguous words and out-of-vocabulary words is limited; its advantages are speed and efficiency.
- CRF represents a newer generation of machine-learning-based word segmentation. The basic idea is to tag each Chinese character according to its position in a word (word formation). It considers not only the frequency of characters and words but also their context, so it has stronger learning ability and recognizes ambiguous words and out-of-vocabulary words well. Its drawbacks are a long training period, heavy computation at run time, and lower throughput than dictionary-based segmentation.
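As a sketch of the character-tagging setup just described, the snippet below trains a linear-chain CRF on the example sentence using the third-party sklearn-crfsuite package (an assumed choice; any linear-chain CRF toolkit works similarly). The features are deliberately minimal: the current character plus its neighbors, i.e. exactly the context information mentioned above.

```python
import sklearn_crfsuite  # assumed dependency: pip install sklearn-crfsuite

def char_features(sent, i):
    # Context features: the current character and its neighbors.
    feats = {"char": sent[i]}
    if i > 0:
        feats["prev_char"] = sent[i - 1]
    if i + 1 < len(sent):
        feats["next_char"] = sent[i + 1]
    return feats

# One toy training sentence: 我爱北京天安门 -> s s b e b c e
sents = [list("我爱北京天安门")]
tags = [list("ssbebce")]

X = [[char_features(s, i) for i in range(len(s))] for s in sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, tags)
print(crf.predict(X))  # should reproduce the toy labeling
```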
Putting the three together, a summary:
- HMM -> MEMM: The HMM model makes two assumptions: first, that the output observations are strictly independent of one another; second, that during state transitions the current state depends only on the previous state. In reality, sequence labeling depends not only on the single current word but also on factors such as the length of the observation sequence and the word's context. MEMM removes HMM's output independence assumption: where HMM is limited to dependencies between an observation and its state, MEMM introduces custom feature functions, which can express not only dependencies between observations but also complex dependencies between the current observation and multiple surrounding states.
- MEMM -> CRF:
- CRF not only removes HMM's output independence assumption but also solves MEMM's label bias problem. MEMM easily falls into local optima because it normalizes only locally; CRF estimates a global probability and, when normalizing, considers the global distribution of the data rather than just a local slice. This eliminates the label bias of MEMM and makes decoding yield the globally optimal label sequence (see the sketch after this list).
- HMM and MEMM are directed graphs, so they model the influence between x and y but do not consider x as a whole (strictly speaking, this criticism applies mainly to HMM). CRF is an undirected graph and has no such dependency, which overcomes this problem.
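To close, here is a tiny numeric contrast of where normalization happens, with hypothetical scores. MEMM runs a softmax at every step over the next state (local normalization), which is the root of label bias; CRF runs a single softmax over whole label sequences (global normalization).

```python
import math

def softmax(scores):
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: v / z for k, v in exps.items()}

# MEMM: one local softmax per step, P(y_t | y_{t-1}, x_t).
# A state with few outgoing options passes its probability mass on almost
# regardless of the observation -- the source of label bias.
memm_step = softmax({"b": 2.0, "e": 0.5, "s": 0.1})
print(memm_step)  # sums to 1 at this single step

# CRF: one global softmax over entire candidate label sequences.
crf_seqs = softmax({"s s": 1.2, "s b": 2.5, "b e": 3.1, "b s": 0.4})
print(crf_seqs)   # sums to 1 over whole sequences
```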