如果您想对Levenshtein和Difflib的相似性进行快速的视觉比较,我计算了约230万本书的书名:
import codecs, difflib, Levenshtein, distance
with codecs.open("titles.tsv","r","utf-8") as f:
title_list = f.read().split("\n")[:-1]
for row in title_list:
sr = row.lower().split("\t")
diffl = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
lev = Levenshtein.ratio(sr[3], sr[4])
sor = 1 - distance.sorensen(sr[3], sr[4])
jac = 1 - distance.jaccard(sr[3], sr[4])
print diffl, lev, sor, jac
然后,我用R绘制结果:
出于好奇,我还比较了Difflib,Levenshtein,Sørensen和Jaccard相似度值:
library(ggplot2)
require(GGally)
difflib <- read.table("similarity_measures.txt", sep = " ")
colnames(difflib) <- c("difflib", "levenshtein", "sorensen", "jaccard")
ggpairs(difflib)
结果:
Difflib / Levenshtein的相似性确实很有趣。
2018编辑:如果您要识别相似的字符串,还可以查看minhashing-这里有一个很棒的概述。Minhashing在线性时间内在大型文本集合中发现相似之处非常了不起。我的实验室在这里组装了一个应用程序,该应用程序使用minhashing检测和可视化文本重用:https://github.com/YaleDHLab/intertext