Answers:
Yes, you can use a pre-trained model. The best-known one was trained on the Google News dataset; you can find it here.
Pre-trained word and phrase vectors: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
You can then use gensim to load the vectors into a model, in either text or binary format, as shown below.
>>> from gensim.models import Word2Vec
>>> model = Word2Vec.load_word2vec_format('/tmp/vectors.txt', binary=False)  # C text format
>>> model = Word2Vec.load_word2vec_format('/tmp/vectors.bin', binary=True)  # C binary format
(In gensim 1.0 and later this method moved to KeyedVectors.load_word2vec_format.)
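For context, the "C text format" mentioned above is just a header line giving the vocabulary size and vector dimensionality, followed by one word per line with its vector components. A minimal stand-alone parser (no gensim required, using a tiny in-memory sample rather than the real multi-gigabyte file) might look like this:

```python
def load_word2vec_text(lines):
    """Parse the word2vec C text format: a 'vocab_size dim' header line,
    then 'word v1 v2 ... vdim' per line. Returns {word: [floats]}."""
    vocab_size, dim = (int(x) for x in lines[0].split())
    vectors = {}
    for line in lines[1:1 + vocab_size]:
        parts = line.rstrip().split(' ')
        word, vec = parts[0], [float(x) for x in parts[1:]]
        assert len(vec) == dim, "vector length does not match header"
        vectors[word] = vec
    return vectors

# Tiny in-memory stand-in for a real vectors.txt file
sample = ["2 3", "king 0.1 0.2 0.3", "queen 0.4 0.5 0.6"]
vecs = load_word2vec_text(sample)
```

In practice you would read the lines from the downloaded file; gensim's loader does the same parsing (plus memory-mapped binary support) for you.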
Here is another pre-built model, trained on the English Wikipedia:
Source: https://github.com/idio/wiki2vec/
Using the pre-built model:
Get Python 2.7
Install gensim: pip install gensim
Uncompress the downloaded model: tar -xvf model.tar.gz
Load the model in gensim:
from gensim.models import Word2Vec
model = Word2Vec.load("path/to/word2vec/en.model")
model.similarity('woman', 'man')
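model.similarity('woman', 'man') returns the cosine similarity between the two word vectors. As a sketch of what that metric computes, here is a pure-Python re-implementation on hypothetical toy vectors (real word2vec vectors are typically 300-dimensional):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d vectors standing in for real word embeddings
woman = [0.2, 0.7, 0.1]
man = [0.25, 0.6, 0.15]
print(cosine_similarity(woman, man))  # close to 1.0 for similar words
```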
You can also use Stanford NLP's GloVe vectors.
This is an excellent compilation of pre-trained word2vec models.
Some additional pre-trained models:
More information on gensim and the code is available at: https://radimrehurek.com/gensim/models/word2vec.html
A similar question has been asked on Quora.
model = Word2Vec.load(fname) # you can continue training with the loaded model!
Distributed representations (GloVe) trained on a large corpus are available directly from the Stanford NLP group. You can use these word embeddings directly in your application (instead of using one-hot encoded vectors and then training a network to learn the embeddings). If your task is not too specialized, starting from this set of embeddings works well in practice.
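GloVe ships as plain text files with one word per line followed by its vector components (note: no header line, unlike the word2vec C text format). A minimal sketch of loading such a file and using the dense vectors in place of one-hot encodings, with toy 4-dimensional stand-ins for the real 50/100/300-dimensional files:

```python
def load_glove(lines):
    """GloVe text format: 'word v1 v2 ... vd' per line, no header."""
    return {parts[0]: [float(x) for x in parts[1:]]
            for parts in (line.rstrip().split(' ') for line in lines)}

def embed_sentence(sentence, vectors, dim):
    """Look up each token's dense vector; unknown words map to a zero
    vector. This replaces the one-hot-then-train-an-embedding approach."""
    zero = [0.0] * dim
    return [vectors.get(tok, zero) for tok in sentence.lower().split()]

# Toy stand-in for a real GloVe file such as glove.6B.300d.txt
sample = ["the 0.1 0.2 0.3 0.4", "cat 0.5 0.6 0.7 0.8"]
vectors = load_glove(sample)
embedded = embed_sentence("The cat sat", vectors, dim=4)
```

With the real files you would stream the lines from disk; gensim can also convert GloVe files to word2vec format for use with its own loaders.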
from gensim.models import Word2Vec
# Word2Vec is full model which is trainable but takes larger memory
from gensim.models import KeyedVectors
# KeyedVectors is reduced vector model which is NOT trainable but takes less memory
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True) #load pretrained google w2v
sen1 = 'w1 w2 w3'
sen2 = 'word1 word2 word3'
sentences = [sen1.split(), sen2.split()]

model_2 = Word2Vec(size=300, min_count=1)  # initiate a full, trainable model
model_2.build_vocab(sentences)  # add words from the new training dataset
total_examples = model_2.corpus_count  # number of sentences; read it now, before the vocab update below changes it
# add the words from the pretrained Google dataset to the vocabulary
model_2.build_vocab([list(model.vocab.keys())], update=True)
# copy the pretrained weights for overlapping words, and unlock them (lockf=1.0) so training can update them
model_2.intersect_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, lockf=1.0)
# continue training the pretrained w2v on the new dataset
model_2.train(sentences, total_examples=total_examples, epochs=model_2.iter)