How do I compute the similarity between two text documents?

207

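I am working on an NLP project, in any programming language (though Python would be my preference).

I want to take two documents and determine how similar they are.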

nlp

— 赖恩·伯恩

1

A similar question here, stackoverflow.com/questions/101569/…, with some nice answers

292

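The common way of doing this is to transform the documents into TF-IDF vectors and then compute the cosine similarity between them. Any textbook on information retrieval (IR) covers this. See especially Introduction to Information Retrieval, which is free and available online.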
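
To make the idea concrete, here is a minimal pure-Python sketch of TF-IDF weighting followed by cosine similarity. The helper names (tokenize, tfidf_vectors, cosine) are made up for this illustration, the tokenizer is deliberately naive, and the smoothed IDF formula is one common choice assumed here rather than the definition any particular library is guaranteed to use.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive tokenizer for illustration: lowercase and split on whitespace
    return text.lower().split()

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption of this sketch): ln((1 + n) / (1 + df)) + 1
        vectors.append({t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = ["the cat sat on the mat",
        "the cat lay on the rug",
        "stock prices fell sharply today"]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # the two cat sentences share several terms
sim_02 = cosine(vecs[0], vecs[2])  # no terms in common
print(round(sim_01, 3), sim_02)
```

Documents that share no terms score exactly 0, and a document compared with itself scores 1 up to rounding; everything else falls in between.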

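Computing pairwise similarities

TF-IDF (and similar text transformations) are implemented in the Python packages Gensim and scikit-learn. In the latter package, computing cosine similarities is as easy as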

from sklearn.feature_extraction.text import TfidfVectorizer

# text_files is assumed to be a list of paths to plain-text files
documents = [open(f).read() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

or, if the documents are plain strings,

>>> corpus = ["I'd like an apple", 
...           "An apple a day keeps the doctor away", 
...           "Never compare an apple to an orange", 
...           "I prefer scikit-learn to Orange", 
...           "The scikit-learn docs are Orange and Blue"]                                                                                                                                                                                                   
>>> vect = TfidfVectorizer(min_df=1, stop_words="english")                                                                                                                                                                                                   
>>> tfidf = vect.fit_transform(corpus)                                                                                                                                                                                                                       
>>> pairwise_similarity = tfidf * tfidf.T

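though Gensim may have more options for this kind of task.

See also this question.

[Disclaimer: I was involved in the scikit-learn TF-IDF implementation.]

Interpreting the results

From above, pairwise_similarity is a square SciPy sparse matrix, with the number of rows and columns equal to the number of documents in the corpus.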

>>> pairwise_similarity                                                                                                                                                                                                                                      
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

You can convert the sparse matrix to a NumPy array via .toarray() or .A:

>>> pairwise_similarity.toarray()                                                                                                                                                                                                                            
array([[1.        , 0.17668795, 0.27056873, 0.        , 0.        ],
       [0.17668795, 1.        , 0.15439436, 0.        , 0.        ],
       [0.27056873, 0.15439436, 1.        , 0.19635649, 0.16815247],
       [0.        , 0.        , 0.19635649, 1.        , 0.54499756],
       [0.        , 0.        , 0.16815247, 0.54499756, 1.        ]])

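Say we want to find the document most similar to the final document, "The scikit-learn docs are Orange and Blue". This document has index 4 in corpus. You can find the index of the most similar document by taking the argmax of that row, but first you need to mask the 1's, which represent the similarity of each document to itself. You can do the latter through np.fill_diagonal(), and the former through np.nanargmax():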

>>> import numpy as np     

>>> arr = pairwise_similarity.toarray()     
>>> np.fill_diagonal(arr, np.nan)                                                                                                                                                                                                                            

>>> input_doc = "The scikit-learn docs are Orange and Blue"                                                                                                                                                                                                  
>>> input_idx = corpus.index(input_doc)                                                                                                                                                                                                                      
>>> input_idx                                                                                                                                                                                                                                                
4

>>> result_idx = np.nanargmax(arr[input_idx])                                                                                                                                                                                                                
>>> corpus[result_idx]                                                                                                                                                                                                                                       
'I prefer scikit-learn to Orange'

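Note: the purpose of using a sparse matrix is to save a substantial amount of space when the corpus and vocabulary are large. Instead of converting to a NumPy array, you could do: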

>>> n, _ = pairwise_similarity.shape                                                                                                                                                                                                                         
>>> pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
>>> pairwise_similarity[input_idx].argmax()                                                                                                                                                                                                                  
3

— Fred Foo

1

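@larsmans Could you explain the array, if possible? How should I read it? Is the similarity between the first two columns, i.e. the first two sentences?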

— 加分号2012年

1

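@Null-Hypothesis: at position (i, j), you find the similarity score between document i and document j. So position (0, 2) holds the similarity between the first document and the third (using zero-based indexing), which is the same value you find at (2, 0), because cosine similarity is commutative.

— Fred Foo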
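
The reading rule from this comment can be checked on the 5x5 matrix printed in the answer above. The sketch below copies those values into a plain nested list; most_similar is a hypothetical helper written for this illustration, not part of scikit-learn or SciPy.

```python
# Similarity matrix copied from the answer's printed output
sim = [
    [1.0,        0.17668795, 0.27056873, 0.0,        0.0],
    [0.17668795, 1.0,        0.15439436, 0.0,        0.0],
    [0.27056873, 0.15439436, 1.0,        0.19635649, 0.16815247],
    [0.0,        0.0,        0.19635649, 1.0,        0.54499756],
    [0.0,        0.0,        0.16815247, 0.54499756, 1.0],
]

def most_similar(matrix, i):
    """Index of the document most similar to document i, ignoring i itself."""
    candidates = [j for j in range(len(matrix)) if j != i]
    return max(candidates, key=lambda j: matrix[i][j])

# Entry (i, j) equals entry (j, i): the matrix is symmetric
is_symmetric = all(sim[i][j] == sim[j][i]
                   for i in range(len(sim)) for j in range(len(sim)))

print(sim[0][2], sim[2][0])   # the same score read from both positions
print(is_symmetric)
print(most_similar(sim, 4))   # nearest neighbour of the last document
```

Row i lists the scores of document i against every document, so skipping the diagonal entry and taking the largest remaining value gives the nearest neighbour.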

1

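If I were to average all the values outside the diagonal of 1's, would that be a sound way of getting a single score for how similar the four documents are to each other? If not, is there a better way of determining the overall similarity between multiple documents?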

— user301752

2

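@user301752: you could take the element-wise mean of the tf-idf vectors (as k-means would do) with X.mean(axis=0), then compute the average/max/median (∗) Euclidean distance from that mean. (∗) Pick whichever you prefer.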

— Fred Foo
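
The centroid idea from this comment can be sketched without any libraries. The rows of X below are made-up stand-ins for tf-idf vectors, and centroid and euclidean are helper names invented for this example; with a real matrix, X.mean(axis=0) would replace centroid.

```python
import math
from statistics import mean, median

# Toy stand-ins for tf-idf rows; real vectors would come from a vectorizer
X = [
    [0.0, 0.6, 0.8],
    [0.0, 0.8, 0.6],
    [1.0, 0.0, 0.0],
]

def centroid(rows):
    """Element-wise mean of the rows (what X.mean(axis=0) computes)."""
    return [mean(col) for col in zip(*rows)]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

center = centroid(X)
distances = [euclidean(row, center) for row in X]

# Any of these can serve as a single "spread" score for the collection
print("mean distance:  ", mean(distances))
print("max distance:   ", max(distances))
print("median distance:", median(distances))
```

A small value means the documents sit close to their common centre; the third row, which shares no dimension with the other two, ends up farthest away.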

1

@curious: I updated the example code to the current scikit-learn API; you might want to try the new code.

— Fred Foo, 2013

87

Identical to @larsman, but with some preprocessing

import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt') # if necessary...


stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    """Remove punctuation, lowercase, tokenize, and stem."""
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

def cosine_sim(text1, text2):
    tfidf = vectorizer.fit_transform([text1, text2])
    # rows are L2-normalized, so the product holds cosine similarities;
    # entry [0, 1] compares text1 with text2
    return (tfidf * tfidf.T).toarray()[0, 1]


print(cosine_sim('a little bird', 'a little bird'))
print(cosine_sim('a little bird', 'a little bird chirps'))
print(cosine_sim('a little bird', 'a big dog barks'))

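— Renaud

@Renaud, really nice and clear answer! I have two questions: I) what is the [0,1] applied after tfidf * tfidf.T? II) Is the inverse document frequency computed from all the articles or just from the two (assuming you have more than two)?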

— Economist_Ayahuasca

2

@AndresAzqueta [0,1] is the position of the similarity in the matrix, since two text inputs create a 2x2 symmetric matrix.

— PhilipBergström

1

@Renaud, thanks for the complete code. For anyone who hits an error asking for nltk.download(), you can simply run nltk.download('punkt'). You don't need to download everything.

— 1man

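@Renaud There is a more fundamental point I don't get: which text strings should be passed to fit, and which to transform?

— John Strood, 2018

@JohnStrood I don't understand your question, sorry. Could you rephrase it?

— Renaud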

45

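This is an old question, but I found this can be done easily with spaCy. Once the document is read, a simple API, similarity, can be used to find the cosine similarity between the document vectors.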

import spacy
# 'en' is the spaCy 2.x shortcut; spaCy 3+ needs a full package name such as 'en_core_web_sm'
nlp = spacy.load('en')
doc1 = nlp(u'Hello hi there!')
doc2 = nlp(u'Hello hi there!')
doc3 = nlp(u'Hey whatsup?')

print(doc1.similarity(doc2)) # 0.999999954642
print(doc2.similarity(doc3)) # 0.699032527716
print(doc1.similarity(doc3)) # 0.699032527716

— Koustuv Sinha

2

I wonder why the similarity between doc1 and doc2 is 0.999999954642 and not 1.0

— JordanBelf

4

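@JordanBelf Floating-point numbers drift a little in most languages, since they cannot have unlimited precision in a digital representation. For example, floating-point operations on, or producing, irrational numbers always carry tiny rounding errors, which then multiply. It is the downside of a representation that is so flexible in terms of scale.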

— scipilot'7
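
The rounding effect described in this comment is easy to reproduce in plain Python. The cosine helper below is written for this illustration only; it is not spaCy's implementation, and the tolerance passed to math.isclose is an arbitrary choice.

```python
import math

# The classic example: 0.1 and 0.2 have no exact binary representation
total = 0.1 + 0.2
print(total == 0.3)               # False
print(math.isclose(total, 0.3))   # True

def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# The cosine of a vector with itself is mathematically 1, but the computed
# value may differ in the last digits, so compare with a tolerance
score = cosine([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])
print(math.isclose(score, 1.0, rel_tol=1e-9))  # True
```

When a similarity score has to be tested for "equal to 1", a tolerance-based comparison is therefore safer than ==.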

2

What distance function does the similarity method use in this case?

— ikel

If you have trouble finding "en", run the following: pip install spacy && python -m spacy download en

— Cybernetic

1

@Cybernetic Take a look at how spaCy's .similarity method is computed

— 瓦尔特

17

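Generally, the cosine similarity between two documents is used as a similarity measure of documents. In Java, you can use Lucene (if your collection is pretty large) or LingPipe to do this. The basic concept is to count the terms in every document and calculate the dot product of the term vectors. The libraries do provide several improvements over this general approach, e.g. using inverse document frequencies and calculating tf-idf vectors. If you are looking to do something complex, LingPipe also provides methods to calculate LSA similarity between documents, which gives better results than cosine similarity. For Python, you can use NLTK.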

— Pulkit Goyal
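
The basic concept in this answer, counting terms and taking the dot product of the term vectors, fits in a few lines of Python. This is a sketch with raw counts only (no IDF weighting); term_counts, dot and cosine_similarity are names chosen for this example, and the regular expression is a simplistic tokenizer assumed for illustration.

```python
import math
import re
from collections import Counter

def term_counts(text):
    """Bag of words: lowercase word tokens mapped to their counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def dot(u, v):
    # A Counter returns 0 for terms it does not contain
    return sum(count * v[term] for term, count in u.items())

def cosine_similarity(text1, text2):
    u, v = term_counts(text1), term_counts(text2)
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

print(cosine_similarity("a little bird", "a little bird chirps"))
print(cosine_similarity("a little bird", "a big dog barks"))
```

Dividing the dot product by both vector lengths is what keeps long documents from scoring higher merely because they contain more words.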

4

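Note that there is no such thing as "LSA similarity". LSA is a method for reducing the dimensionality of a vector space (either to speed things up or to model topics rather than terms). The same similarity metrics that are used with BOW and tf-idf can be used with LSA (cosine similarity, Euclidean similarity, BM25, etc.).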

— Witiko

16

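If you are looking for something very accurate, you need a better tool than tf-idf. The Universal Sentence Encoder is one of the most accurate for finding the similarity between any two pieces of text. Google provides pretrained models that you can use in your own application without training anything from scratch. First, you have to install tensorflow and tensorflow-hub: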

    pip install tensorflow
    pip install tensorflow_hub

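The code below lets you convert any text to a fixed-length vector representation; you can then use the dot product to find the similarity between them.

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf  # this example uses the TensorFlow 1.x API (tf.placeholder, tf.Session)
import tensorflow_hub as hub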
module_url = "https://tfhub.dev/google/universal-sentence-encoder/1?tf-hub-format=compressed"

# Import the Universal Sentence Encoder's TF Hub module
embed = hub.Module(module_url)

# sample text
messages = [
# Smartphones
"My phone is not good.",
"Your cellphone looks great.",

# Weather
"Will it snow tomorrow?",
"Recently a lot of hurricanes have hit the US",

# Food and health
"An apple a day, keeps the doctors away",
"Eating strawberries is healthy",
]

similarity_input_placeholder = tf.placeholder(tf.string, shape=(None))
similarity_message_encodings = embed(similarity_input_placeholder)
with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    message_embeddings_ = session.run(similarity_message_encodings, feed_dict={similarity_input_placeholder: messages})

    corr = np.inner(message_embeddings_, message_embeddings_)
    print(corr)
    # heatmap() is the plotting helper given after this block; define it before running this code
    heatmap(messages, messages, corr)

And the plotting code:

def heatmap(x_labels, y_labels, values):
    fig, ax = plt.subplots()
    im = ax.imshow(values)

    # We want to show all ticks...
    ax.set_xticks(np.arange(len(x_labels)))
    ax.set_yticks(np.arange(len(y_labels)))
    # ... and label them with the respective list entries
    ax.set_xticklabels(x_labels)
    ax.set_yticklabels(y_labels)

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right", fontsize=10,
         rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    for i in range(len(y_labels)):
        for j in range(len(x_labels)):
            text = ax.text(j, i, "%.2f"%values[i, j],
                           ha="center", va="center", color="w",
                           fontsize=6)

    fig.tight_layout()
    plt.show()

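The result would be: [heatmap of the pairwise similarities]

As you can see, the highest similarity is between each text and itself, and then between texts that are close in meaning.

IMPORTANT: the first time you run the code it will be slow, because it needs to download the model. If you want to prevent it from downloading the model again and use the local copy instead, you have to create a folder for the cache, add it to the environment variable, and then, after the first run, use that path:

import os

tf_hub_cache_dir = "universal_encoder_cached/"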
os.environ["TFHUB_CACHE_DIR"] = tf_hub_cache_dir

# pointing to the folder inside cache dir, it will be unique on your system
module_url = tf_hub_cache_dir+"/d8fbeb5c580e50f975ef73e80bebba9654228449/"
embed = hub.Module(module_url)

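— Rohola Zandie

Hi, thanks for this example, which encouraged me to try TF. Where should the object "np" come from?

— Open Food Broker

1

UPD OK, I've installed numpy, matplotlib, and also the system Tk Python bindings for plotting, and it works!

— Open Food Broker

1

Just in case (sorry for the lack of line breaks): import tensorflow as tf; import tensorflow_hub as hub; import matplotlib.pyplot as plt; import numpy as np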

— dinnouti

5

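Here's a little app to get you started...

import difflib as dl

# 'file' and 'file1' are the paths of the two documents to compare
with open('file') as fa, open('file1') as fb:
    a = fa.read()
    b = fb.read()

sim = dl.get_close_matches

s = 0
wa = a.split()
wb = b.split()

# count the words of a that have at least one close match in b
for i in wa:
    if sim(i, wb):
        s += 1

n = float(s) / float(len(wa))
print('%d%% similarity' % int(n * 100))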

— Ben

4

If you are going to use it on a large number of documents, difflib is very slow.

— Phyo Arkar Lwin, 2012

2

You might want to try this online service for cosine document similarity: http://www.scurtu.it/documentSimilarity.html

import json
import urllib.parse
import urllib.request

API_URL = "http://www.scurtu.it/apis/documentSimilarity"
inputDict = {}
inputDict['doc1'] = 'Document with some text'
inputDict['doc2'] = 'Other document with some text'
# POST data must be bytes in Python 3
params = urllib.parse.urlencode(inputDict).encode('utf-8')
f = urllib.request.urlopen(API_URL, params)
response = f.read()
responseObject = json.loads(response)
print(responseObject)

— Ekaterina Gorchinsky

Does the API use difflib's SequenceMatcher? If so, a simple function in Python would do the job: from difflib import SequenceMatcher; def isStringSimilar(a, b): ratio = SequenceMatcher(None, a, b).ratio(); return ratio

— Rudresh Ajgaonkar
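
Laid out as runnable code, the function from this comment might look as follows. The threshold variant is_string_similar and its default of 0.8 are additions made up for this example; whether the online service works this way is not known.

```python
from difflib import SequenceMatcher

def string_similarity(a, b):
    """Ratio in [0, 1]: 2 * matched characters / total length of both strings."""
    return SequenceMatcher(None, a, b).ratio()

def is_string_similar(a, b, threshold=0.8):
    # threshold is an arbitrary cut-off chosen for this illustration
    return string_similarity(a, b) >= threshold

print(string_similarity("a little bird", "a little bird"))  # 1.0
print(string_similarity("abcd", "bcde"))                    # 0.75
print(is_string_similar("abcd", "bcde"))                    # False
```

SequenceMatcher compares character sequences, so it measures surface overlap rather than meaning, and it becomes slow on long inputs.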

2

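If you are more interested in measuring the semantic similarity of two pieces of text, I suggest taking a look at this GitLab project. You can run it as a server, and there is also a pre-built model that you can easily use to measure the similarity of two pieces of text. Even though it is mostly trained for measuring the similarity of two sentences, you can still use it in your case. It is written in Java, but you can run it as a RESTful service.

Another option is DKPro Similarity, a library with various algorithms for measuring the similarity of texts. However, it is also written in Java.

Code example: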

// this similarity measure is defined in the dkpro.similarity.algorithms.lexical-asl package
// you need to add that to your .pom to make that example work
// there are some examples that should work out of the box in dkpro.similarity.example-gpl 
TextSimilarityMeasure measure = new WordNGramJaccardMeasure(3);    // Use word trigrams

String[] tokens1 = "This is a short example text .".split(" ");   
String[] tokens2 = "A short example text could look like that .".split(" ");

double score = measure.getSimilarity(tokens1, tokens2);

System.out.println("Similarity: " + score);

— 穆罕默德·阿里
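
For readers who prefer Python, the measure used in the Java snippet (Jaccard overlap of word trigrams) can be sketched by hand. This is not the DKPro implementation: word_ngrams and ngram_jaccard are names invented here, the comparison is case-sensitive, and the library may normalize tokens differently, so its score need not match.

```python
def word_ngrams(tokens, n):
    """Set of all runs of n consecutive tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_jaccard(tokens1, tokens2, n=3):
    """Jaccard similarity: shared n-grams divided by all distinct n-grams."""
    a, b = word_ngrams(tokens1, n), word_ngrams(tokens2, n)
    if not a and not b:
        return 0.0  # convention chosen for this sketch when both texts are too short
    return len(a & b) / len(a | b)

tokens1 = "This is a short example text .".split(" ")
tokens2 = "A short example text could look like that .".split(" ")

score = ngram_jaccard(tokens1, tokens2)
print("Similarity:", score)
```

With these inputs only the trigram ("short", "example", "text") is shared among 11 distinct trigrams, because "a" and "A" are treated as different tokens.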

2

To find sentence similarity with a very small dataset and still get high accuracy, you can use the Python package below, which uses pre-trained BERT models:

pip install similar-sentences

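— Shankar Ganesh Jayaraman

I just tried it, but it gives the similarity of each sentence to one main sentence. Is there any way to build all of the sentence.txt training data as one class and get a confidence score for how well it matches all the examples?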

— Guru Teja

1

Yes, you can. Try .batch_predict(BatchFile, NumberOfPrediction), which writes its output to Results.xls with the columns ['Sentence', 'Suggestion', 'Score']

— Shankar Ganesh Jayaraman

1

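For syntactic similarity, there are three simple ways of detecting similarity:

Word2Vec
GloVe
Tfidf or CountVectorizer

For semantic similarity, one can use BERT embeddings and try different word pooling strategies to get a document embedding, then apply cosine similarity on the document embeddings.

A more advanced approach is to use BERTScore to get the similarity.

Research paper link: https://arxiv.org/abs/1904.09675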

— 肖里娅·乌帕尔
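
The pooling step mentioned in this answer can be illustrated without loading a model. In the sketch below the token vectors are made-up 4-dimensional numbers standing in for real BERT outputs, and mean_pool, max_pool and cosine are helper names chosen for this example.

```python
import math

def mean_pool(token_vectors):
    """Average the token vectors dimension by dimension."""
    n = len(token_vectors)
    return [sum(col) / n for col in zip(*token_vectors)]

def max_pool(token_vectors):
    """Keep the largest value seen in each dimension."""
    return [max(col) for col in zip(*token_vectors)]

def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up "token embeddings"; a real model would produce hundreds of dimensions
doc_a_tokens = [[0.2, 0.1, 0.9, 0.0], [0.4, 0.3, 0.7, 0.1]]
doc_b_tokens = [[0.3, 0.2, 0.8, 0.0], [0.1, 0.0, 0.9, 0.2], [0.5, 0.4, 0.6, 0.1]]

for name, pool in (("mean", mean_pool), ("max", max_pool)):
    sim = cosine(pool(doc_a_tokens), pool(doc_b_tokens))
    print(name, "pooling:", round(sim, 4))
```

Pooling turns a variable number of token vectors into one fixed-length document vector, which is what makes two documents of different lengths comparable with cosine similarity.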