This is more of a general NLP question. What is the proper input for training word embeddings, i.e. Word2Vec? Should all sentences belonging to an article be separate documents in the corpus, or should each article be one document in that corpus? Below is just an example using Python and gensim.
Corpus split by sentence:
SentenceCorpus = [["first", "sentence", "of", "the", "first", "article."],
                  ["second", "sentence", "of", "the", "first", "article."],
                  ["first", "sentence", "of", "the", "second", "article."],
                  ["second", "sentence", "of", "the", "second", "article."]]
Corpus split by article:
ArticleCorpus = [["first", "sentence", "of", "the", "first", "article.",
                  "second", "sentence", "of", "the", "first", "article."],
                 ["first", "sentence", "of", "the", "second", "article.",
                  "second", "sentence", "of", "the", "second", "article."]]
Training Word2Vec in Python:
from gensim.models import Word2Vec

# Train a Word2Vec model, here on the article-level corpus.
wikiWord2Vec = Word2Vec(ArticleCorpus)
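(For comparison, training on the sentence-split corpus would look exactly the same; the vector_size, window, and min_count values below are just assumed settings spelled out for clarity, using gensim 4.x parameter names.)

from gensim.models import Word2Vec

# Same call, but now each training "document" is a single sentence.
sentenceWord2Vec = Word2Vec(SentenceCorpus, vector_size=100, window=5, min_count=1)

# Looking up a learned vector works the same in either setup.
vector = sentenceWord2Vec.wv["sentence"]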