Gensim (Generate Similar) is an open-source natural language processing library for unsupervised topic modelling, built on academic models and modern statistical machine learning.
Gensim is built around four core concepts: document, corpus, vector, and model.
gensim.models.Word2Vec
gensim.models.Word2Vec trains a word2vec model on the training text.
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
model = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)
Here, sentences must be a list of sentences, each of which is itself a list of words.
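To make the expected input shape concrete, here is a minimal sketch that turns raw strings into a list of token lists. The raw_sentences data and the tokenize helper are made up for illustration; real text usually needs a more careful tokenizer (gensim.utils.simple_preprocess is one option).

```python
# Hypothetical raw text; in practice this would come from your own corpus.
raw_sentences = [
    "Human machine interface for lab computer applications",
    "A survey of user opinion of computer system response time",
]

def tokenize(text):
    """Naive tokenizer (an assumption for this sketch): lowercase, split on whitespace."""
    return text.lower().split()

# A list of sentences, each a list of words: the shape Word2Vec expects.
sentences = [tokenize(s) for s in raw_sentences]
```

The resulting sentences list can be passed as the sentences argument in place of common_texts.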
After training, the model can be saved to a file and loaded again later:
model.save("word2vec.model")
model = Word2Vec.load("word2vec.model")
Training is streamed and incremental, so you can continue training the model later with new sentences:
model.train([['hello', 'world']], total_examples=1, epochs=1)
gensim.models.Word2Vec
The trained word vectors are stored in a KeyedVectors instance, available as model.wv:
computer_vec = model.wv['computer']
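A common use of these vectors is measuring how close two words are with cosine similarity, which is what KeyedVectors.similarity (for example model.wv.similarity('computer', 'human')) computes. The sketch below shows the arithmetic in plain Python on made-up 3-dimensional vectors, not real model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors standing in for word embeddings.
vec_a = [1.0, 0.0, 0.0]
vec_b = [0.0, 1.0, 0.0]  # orthogonal to vec_a
vec_c = [2.0, 0.0, 0.0]  # same direction as vec_a, different length

sim_orthogonal = cosine_similarity(vec_a, vec_b)
sim_parallel = cosine_similarity(vec_a, vec_c)
```

Cosine similarity ignores vector length, so vec_a and vec_c count as identical in direction even though their magnitudes differ.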
We can store the trained word vectors in a separate file and discard the model if it’s no longer needed:
from gensim.models import KeyedVectors
word_vectors = model.wv
word_vectors.save("wordvectors")
wv = KeyedVectors.load("wordvectors", mmap='r')