Gensim ("Generate Similar") is an open-source natural language processing library for unsupervised topic modelling, built on academic models and modern statistical machine learning.
Four core concepts of Gensim
Word2Vec: apply the word2vec model to training text.
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
model = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)
Here, sentences must be an iterable (for example, a list) of sentences, each of which is itself a list of words.
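To make the expected input shape concrete, here is a minimal sketch that turns raw strings into a list of token lists. The variable raw_sentences and the naive whitespace tokenization are assumptions for illustration only; real text usually needs a proper tokenizer.

```python
# Hypothetical raw input: one string per sentence.
raw_sentences = [
    "Human machine interface for lab computer applications",
    "A survey of user opinion of computer system response time",
]

# Naive tokenization (an assumption): lowercase, then split on whitespace.
# The result is a list of sentences, each a list of words.
sentences = [text.lower().split() for text in raw_sentences]
```

A list built this way has the same shape as common_texts and can be passed as the sentences argument.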
After training, the model can be saved to a file and loaded again later:
model.save("word2vec.model")
model = Word2Vec.load("word2vec.model")
Training is streamed and incremental, so you can continue training the model later on new sentences:
model.train([['hello', 'world']], total_examples=1, epochs=1)
The trained word vectors are stored in a KeyedVectors instance, available as model.wv:
computer_vec = model.wv['computer']
We can store the trained word vectors in a separate file and discard the full model if it's no longer needed:
from gensim.models import KeyedVectors
word_vectors = model.wv
word_vectors.save("wordvectors")
wv = KeyedVectors.load("wordvectors", mmap='r')