Python 3.6
For binarization, I followed the steps on the textbook. In order to save some space, there is no new grammar created, and all modifications are on the original grammar:
- Rules with right-hand side longer than 2 are normalized. For example, S -> Aux NP VP is written as S-> X1 VP and X1 -> Aux NP. The probabilities are re-calculated so that the sum of probabilities starting with S (or other tags) is 1. If there are more than 3 tags on the right-hand side, there is a while loop to deal with this situation.
- Rules with single non-terminal on the right (Unit productions) are eliminated by connecting the original left-hand side with all the non-unit production rules that the original right-hand side leads to. Besides, the probabilities are also re-calculated by multiplying the probability of the unit production to the probability of every child rule.
- There is no stemming or lemmatizing on the vocabulary.
- As a result, due to the limited vocabulary of the lexicon, if the input sentence contains unseen words, the tree won't be generated