<< natural_language_processing
The following basic steps are included in natural language processing:
```mermaid
graph TB;
    ss[Sentence Segmentation]
    wt[Word Tokenization]
    ss --> wt
```
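A minimal sketch of these two steps chained together, using NLTK (introduced below) on a made-up sentence; it assumes the punkt tokenizer models have already been downloaded via `nltk.download('punkt')`:

```python
import nltk.tokenize as tk

text = "Hello world. NLP is fun!"         # made-up sample text
for sentence in tk.sent_tokenize(text):   # sentence segmentation
    print(tk.word_tokenize(sentence))     # word tokenization of each sentence
# ['Hello', 'world', '.']
# ['NLP', 'is', 'fun', '!']
```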
Tokenization is the process of converting a character sequence into tokens; during this process, the lexical analyzer classifies each token.
Let \(N\) denote the number of tokens and \(V\) the vocabulary (the set of types), so \(|V|\) is the vocabulary size. The relation \(|V| > O(\sqrt{N})\) has been proved by ==TODO==.
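For example, the sketch below (the sample sentence and the naive whitespace split are only for illustration) shows the distinction between tokens and types:

```python
tokens = "to be or not to be".split()   # naive whitespace tokenization
N = len(tokens)                         # number of tokens: 6
V = set(tokens)                         # vocabulary, the set of types: {'to', 'be', 'or', 'not'}
print(N, len(V))                        # 6 4
```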
```python
import nltk.tokenize as tk

doc = "Are you curious about tokenization? Let's see how it works! We need to analyze a couple of sentences, with punctuations to see it in action."
sent_tokens = tk.sent_tokenize(doc)   # split the document into sentences
word_tokens = tk.word_tokenize(doc)   # split the document into words
```
The `tk.sent_tokenize` method splits the document into sentences, and `tk.word_tokenize` splits it into words. In particular, the `WordPunctTokenizer` class with its `tokenize` method is provided to treat each punctuation mark as a separate word.
```python
tokenizer = tk.WordPunctTokenizer()
tokens = tokenizer.tokenize(doc)
```
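For instance, the two tokenizers treat the contraction "Let's" differently (the outputs shown are what a typical NLTK installation produces):

```python
import nltk.tokenize as tk

print(tk.word_tokenize("Let's see how it works!"))
# ['Let', "'s", 'see', 'how', 'it', 'works', '!']
print(tk.WordPunctTokenizer().tokenize("Let's see how it works!"))
# ['Let', "'", 's', 'see', 'how', 'it', 'works', '!']
```

`word_tokenize` keeps the clitic `'s` as one token, while `WordPunctTokenizer` splits on every run of punctuation characters.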
Stemming removes prefixes and suffixes to obtain the stem of a word, while lemmatisation converts an inflected form of a word back into its base (dictionary) form. Sometimes ==TODO== is required.
```mermaid
stateDiagram-v2
    state stemming {
        plays --> play
        played --> play
        playing --> play
    }
    state lemmatisation {
        is --> be
        are --> be
        been --> be
    }
```
```python
import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb

words = ['plays', 'played', 'playing']                # example words from the diagram above
for word in words:
    stem_pt = pt.PorterStemmer().stem(word)           # Porter stemmer
    stem_lc = lc.LancasterStemmer().stem(word)        # Lancaster stemmer
    stem_sb = sb.SnowballStemmer('english').stem(word)  # Snowball stemmer needs a language
```
```python
import nltk.stem as ns

words = ['is', 'are', 'been']   # example words from the diagram above
lemmatizer = ns.WordNetLemmatizer()
for word in words:
    n_lemma = lemmatizer.lemmatize(word, pos='n')   # treat the word as a noun
    v_lemma = lemmatizer.lemmatize(word, pos='v')   # treat the word as a verb
```
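To contrast the two approaches on a word from the diagram (outputs are from a typical NLTK installation; the WordNet corpus must be downloaded for the lemmatizer):

```python
import nltk.stem.porter as pt
import nltk.stem as ns

print(pt.PorterStemmer().stem('been'))                    # 'been' (no affix to strip)
print(ns.WordNetLemmatizer().lemmatize('been', pos='v'))  # 'be'   (mapped to the dictionary form)
```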
Putting these steps together, the following example counts the 100 most common stems and lemmas in a document (the file path and the regular expression are left as TODOs, and the POS-tagging and cleanup steps are assumptions):

```python
import re
from collections import Counter
import nltk
import nltk.tokenize as tk
import nltk.stem.porter as pt
import nltk.stem as ns

with open() as file:                   # TODO: supply the file path
    content = file.read()

pattern = re.compile()                 # TODO: supply the regular expression
content = pattern.sub(' ', content)    # assumed cleanup step: replace matches with spaces

tokens = tk.word_tokenize(content)
tokens_pos = nltk.pos_tag(tokens)      # assumed: tag each token (needs the 'averaged_perceptron_tagger' data)

lemmatizer = ns.WordNetLemmatizer()
result = []
for token, postag in tokens_pos:
    result.append(pt.PorterStemmer().stem(token))
    # WordNetLemmatizer only accepts 'n', 'v', 'a', 'r' as pos, so map the Penn Treebank tag first
    wn_pos = {'J': 'a', 'V': 'v', 'R': 'r'}.get(postag[0], 'n')
    result.append(lemmatizer.lemmatize(token, pos=wn_pos))

Counter(result).most_common(100)       # the 100 most frequent items as (item, count) pairs
```