<< natural_language_processing
A mapping \(h\) from input data \(x\) (drawn from instance space \(\mathcal X\)) to a label (or labels) \(y\) from some enumerable output space \(\mathcal Y\).
Hand-coded rules.
pros: high accuracy
cons: low recall; rules are hard to construct and maintain.
Supervised learning. Use a training set of \(m\) hand-labeled documents \((d_1, c_1), \ldots, (d_m, c_m)\).
graph LR;
tp[Text Preprocessing]
tr[Text Representation]
fs[Feature Selection]
model[Model Construction]
tp --> tr --> fs --> model
To make a document understandable to a computer, we represent it as a vector in a vector space, where the distance between vectors should reflect semantic distance.
One classical way to vectorize a document is the bag of words, which is simply the counts of its words:
bag_of_word = {
    "word": 5,
    "hello": 2,
    # ...
}
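As a minimal sketch (assuming naive lowercasing and whitespace tokenization, which the notes do not specify), such a count vector can be built with `collections.Counter`:

```python
from collections import Counter

def bag_of_words(document: str) -> Counter:
    """Naive whitespace tokenization, then count each word's occurrences."""
    return Counter(document.lower().split())

print(bag_of_words("hello world hello hello NLP world"))
# Counter({'hello': 3, 'world': 2, 'nlp': 1})
```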
Problems with bag of words:
The bag-of-words vocabulary is usually too large to use directly, so we should perform feature selection based on statistical correlation indicators between each feature and the class, such as frequency, mutual information, information gain, \(\chi^2\), etc. A sketch of one such selector follows.
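As an illustrative sketch (assuming scikit-learn is available and using a toy corpus with invented labels; the notes do not prescribe a library), \(\chi^2\)-based feature selection over bag-of-words counts might look like:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy corpus: 1 = spam, 0 = ham (labels invented for illustration).
docs = [
    "cheap meds cheap offer",
    "limited offer buy now",
    "meeting rescheduled to friday",
    "lunch meeting at noon",
]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # bag-of-words count matrix

# Keep the k features with the highest chi-squared score against the label.
selector = SelectKBest(chi2, k=4).fit(counts, labels)
kept = vectorizer.get_feature_names_out()[selector.get_support()]
print(kept)
```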
Conventional machine learning models for text classification:
Recall the Naïve Bayes algorithm, which simply applies Bayes' rule to compute the inverse probability of each label and picks the maximum as the prediction: \[ \hat y = \arg\max_{y\in \mathcal Y} P(Y=y\mid X=x) \] where the inverse probability is computed as \[ P(Y=y\mid X=x) = \cfrac{P(X=x, Y=y)}{P(X=x)} = \cfrac{P(Y=y)P(X=x\mid Y=y)}{\displaystyle\sum_{y'\in \mathcal Y}P(Y=y')P(X=x\mid Y=y')} \] and the terms \(P(Y=y)\), \(P(X=x\mid Y=y)\) are computed (learned) from the training set.
In the context of document classification, where documents are vectorized by bag of words, the parameters are interpreted as the class prior \(P(Y=c)\) and the per-class word probabilities \(P(w\mid Y=c)\); under the naive conditional-independence assumption, \(P(X=x\mid Y=c) \approx \prod_{w\in x} P(w\mid Y=c)\).
We implement Naive Bayes in text_classification_naive_bayes.py.
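That file's contents are not shown here; as a rough, self-contained sketch of multinomial Naive Bayes over bag-of-words counts (the Laplace smoothing and whitespace tokenization are my assumptions), one might write:

```python
import math
from collections import Counter

class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over bag-of-words counts (add-one smoothing assumed)."""

    def fit(self, documents, labels):
        self.classes = set(labels)
        # Class priors P(Y=c) estimated from label frequencies.
        self.priors = {c: labels.count(c) / len(labels) for c in self.classes}
        self.word_counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for doc, c in zip(documents, labels):
            tokens = doc.lower().split()
            self.word_counts[c].update(tokens)
            self.vocab.update(tokens)
        self.totals = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def predict(self, document):
        tokens = document.lower().split()
        V = len(self.vocab)
        scores = {}
        for c in self.classes:
            # log P(Y=c) + sum over words of log P(w | Y=c), with add-one smoothing.
            score = math.log(self.priors[c])
            for w in tokens:
                score += math.log((self.word_counts[c][w] + 1) / (self.totals[c] + V))
            scores[c] = score
        return max(scores, key=scores.get)

clf = NaiveBayesTextClassifier().fit(
    ["buy cheap meds now", "limited offer buy now",
     "meeting at noon", "lunch at noon tomorrow"],
    ["spam", "spam", "ham", "ham"],
)
print(clf.predict("cheap offer now"))  # -> "spam"
```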