How do I load a FastText pretrained model with Gensim?

21

I am trying to load a fastText pretrained model from the fastText models listed here. I am using wiki.simple.en.

from gensim.models.keyedvectors import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format('wiki.simple.bin', binary=True)

However, it shows the following error:

Traceback (most recent call last):
  File "nltk_check.py", line 28, in <module>
    word_vectors = KeyedVectors.load_word2vec_format('wiki.simple.bin', binary=True)
  File "P:\major_project\venv\lib\sitepackages\gensim\models\keyedvectors.py",line 206, in load_word2vec_format
     header = utils.to_unicode(fin.readline(), encoding=encoding)
  File "P:\major_project\venv\lib\site-packages\gensim\utils.py", line 235, in any2unicode
    return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte

Question 1: How do I load a fastText model with Gensim?

Question 2: Also, after loading the model, I want to find the similarity between two words:

 model.find_similarity('teacher', 'teaches')
 # Something like this
 Output : 0.99

How do I do this?

nlp gensim

— Sabbiu Shah

17

Here is the link to the methods available for the fastText implementation in gensim: fasttext.py

from gensim.models.wrappers import FastText

model = FastText.load_fasttext_format('wiki.simple')

print(model.most_similar('teacher'))
# Output = [('headteacher', 0.8075869083404541), ('schoolteacher', 0.7955552339553833), ('teachers', 0.733420729637146), ('teaches', 0.6839243173599243), ('meacher', 0.6825737357139587), ('teach', 0.6285147070884705), ('taught', 0.6244685649871826), ('teaching', 0.6199781894683838), ('schoolmaster', 0.6037642955780029), ('lessons', 0.5812176465988159)]

print(model.similarity('teacher', 'teaches'))
# Output = 0.683924396754

— Sabbiu Shah

I got DeprecationWarning: Call to deprecated `load_fasttext_format` (use load_facebook_vectors. So I am using from gensim.models.fasttext import load_facebook_model instead.

— Hrushikesh Dhumal

8

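For .bin, use load_fasttext_format() (this typically contains the full model with parameters, ngrams, etc.).

For .vec, use load_word2vec_format (this contains ONLY word vectors: no ngrams, and you can't update the model).

Note: If you run into memory issues or cannot load the .bin model, check the pyfasttext model for the same.

Credits: Ivan Menshikh (Gensim maintainer)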

— Akash Kandpal

1

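"For .bin ... you can continue training after loading." This is not the case, as the documentation states: "Due to limitations in the FastText API, you cannot continue training with a model loaded this way." radimrehurek.com/gensim/models/...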

— Shevchenko Drozdyuk

This is no longer the case: DeprecationWarning: Deprecated. Use gensim.models.KeyedVectors.load_word2vec_format instead.

— mickythump

2

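The FastText binary format (which is what it looks like you are trying to load) is not compatible with Gensim's word2vec format. The former contains additional information about subword units, which word2vec does not make use of.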
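The mismatch can be checked before loading. As the traceback in the question shows, load_word2vec_format starts by decoding a header line as text, while fastText .bin files begin with a binary magic number. The sketch below assumes the constant 793712314 (FASTTEXT_FILEFORMAT_MAGIC_INT32 in the fastText source), whose first little-endian byte is 0xba, the byte reported in the error. Very old .bin files may not carry this marker, and the helper name looks_like_fasttext_bin is hypothetical.

```python
import struct

# Magic number written at the start of fastText .bin files (assumed from the
# fastText source). In hex it is 0x2F4F16BA; stored little-endian, so the
# first byte on disk is 0xBA.
FASTTEXT_MAGIC = 793712314


def looks_like_fasttext_bin(path):
    """Return True if the file starts with the fastText binary magic number."""
    with open(path, "rb") as fin:
        head = fin.read(4)
    if len(head) < 4:
        return False
    (value,) = struct.unpack("<i", head)
    return value == FASTTEXT_MAGIC
```

A file that passes this check needs a fastText-aware loader. A .vec file, which starts with a plain-text header, does not pass it and can go through load_word2vec_format.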

There is some discussion of this issue (and a workaround) on the FastText GitHub page. In short, you will have to load the text format (available at https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md).

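Once you have loaded the text format, you can use Gensim to save it in binary format, which will greatly reduce the model size and speed up future loading.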

https://github.com/facebookresearch/fastText/issues/171#issuecomment-294295302

— Fred