Similar String Comparison in Java


111

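I want to compare several strings with each other and find the ones that are most similar. Is there a library, method, or best practice that tells me which strings are more similar to a given string than others? For example:

  • "The quick fox jumped" -> "The fox jumped"
  • "The quick fox jumped" -> "The fox"

This comparison would report that the first pair is more similar than the second.

I guess I need a method such as: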

double similarityIndex(String s1, String s2)

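Does something like this exist somewhere?

EDIT: Why am I doing this? I am writing a script that compares the output of an MS Project file with the output of a legacy system that handles tasks. Because the legacy system has a very limited field width, descriptions are abbreviated when values are added. I want a semi-automated way to find which entries in MS Project are similar to the entries in the legacy system, so that I can get the generated keys. It has drawbacks, since the matches still have to be checked manually, but it would save a lot of work.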

Answers:


82

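Yes, there are many well-documented algorithms, such as:

  • Cosine similarity
  • Jaccard similarity
  • Dice's coefficient
  • Matching similarity
  • Overlap similarity

A good summary ("Sam's String Metrics") can be found here (the original link is dead, so it points to the Internet Archive).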
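To make one of the measures listed above concrete, here is a minimal sketch of Jaccard similarity computed over character bigrams. The class name `JaccardSketch`, the choice of bigrams as the unit of comparison, and the lowercasing step are assumptions made for illustration; they do not come from any of the libraries mentioned in this thread.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class; the name and the use of bigrams are assumptions.
final class JaccardSketch {

    // Collects the distinct 2-character substrings of s (lowercased).
    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        String t = s.toLowerCase();
        for (int i = 0; i + 2 <= t.length(); i++) {
            grams.add(t.substring(i, i + 2));
        }
        return grams;
    }

    // Jaccard similarity: |A ∩ B| / |A ∪ B|, a value between 0 and 1.
    static double jaccard(String s1, String s2) {
        Set<String> a = bigrams(s1);
        Set<String> b = bigrams(s2);
        if (a.isEmpty() && b.isEmpty()) {
            // Both strings are shorter than 2 characters, so there are no
            // bigrams to compare; fall back to a plain equality check.
            return s1.equalsIgnoreCase(s2) ? 1.0 : 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return intersection.size() / (double) union.size();
    }

    public static void main(String[] args) {
        System.out.println(jaccard("The quick fox jumped", "The fox jumped"));
        System.out.println(jaccard("The quick fox jumped", "The fox"));
    }
}
```

With the two pairs from the question, the first pair scores higher than the second, which matches the ordering the asker expects. Set-based measures like this ignore character order beyond the bigram level, so they may behave differently from edit-distance measures on reordered text.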

Also check these projects:


18
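+1 The Simmetrics website no longer seems to be active. However, I found the code on sourceforge: sourceforge.net/projects/simmetrics Thanks for the pointer.
Michael Merchant

7
The "You can check this" link is broken.
Kiril 2014

1
That is why Michael Merchant posted the correct link above.
emilyk 2014

2
The Simmetrics jar on sourceforge is a bit outdated; github.com/mpkorstanje/simmetrics is the updated GitHub page with Maven artifacts.
tom91136 2015

To add to @MichaelMerchant's comment, the project is also available on GitHub. It is not very active there either, but it is somewhat more recent than the sourceforge version.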
Ghurdyl '18

163

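A common way to express the similarity between two strings as a value from 0% to 100%, used in many libraries, is to measure how much (in %) of the longer string would have to change to turn it into the shorter one: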

/**
 * Calculates the similarity (a number within 0 and 1) between two strings.
 */
public static double similarity(String s1, String s2) {
  String longer = s1, shorter = s2;
  if (s1.length() < s2.length()) { // longer should always have greater length
    longer = s2; shorter = s1;
  }
  int longerLength = longer.length();
  if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
  return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// you can use StringUtils.getLevenshteinDistance() as the editDistance() function
// full copy-paste working code is below


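Computing the editDistance()

The editDistance() function above is expected to calculate the edit distance between the two strings. There are several implementations of this step, and each may suit a particular scenario better. The most common is the Levenshtein distance algorithm, which we will use in the example below (for very large strings, other algorithms are likely to perform better).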

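Here are two options for calculating the edit distance:

  • Use the Levenshtein distance implementation from Apache Commons Text (LevenshteinDistance.apply, shown commented out in the working example).
  • Implement it yourself; the working example includes one implementation.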


Working example:

See the online demo here.

public class StringSimilarity {

  /**
   * Calculates the similarity (a number within 0 and 1) between two strings.
   */
  public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) { // longer should always have greater length
      longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
    /* // If you have Apache Commons Text, you can use it to calculate the edit distance:
    LevenshteinDistance levenshteinDistance = new LevenshteinDistance();
    return (longerLength - levenshteinDistance.apply(longer, shorter)) / (double) longerLength; */
    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;

  }

  // Example implementation of the Levenshtein Edit Distance
  // See http://rosettacode.org/wiki/Levenshtein_distance#Java
  public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
      int lastValue = i;
      for (int j = 0; j <= s2.length(); j++) {
        if (i == 0)
          costs[j] = j;
        else {
          if (j > 0) {
            int newValue = costs[j - 1];
            if (s1.charAt(i - 1) != s2.charAt(j - 1))
              newValue = Math.min(Math.min(newValue, lastValue),
                  costs[j]) + 1;
            costs[j - 1] = lastValue;
            lastValue = newValue;
          }
        }
      }
      if (i > 0)
        costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
  }

  public static void printSimilarity(String s, String t) {
    System.out.println(String.format(
      "%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t));
  }

  public static void main(String[] args) {
    printSimilarity("", "");
    printSimilarity("1234567890", "1");
    printSimilarity("1234567890", "123");
    printSimilarity("1234567890", "1234567");
    printSimilarity("1234567890", "1234567890");
    printSimilarity("1234567890", "1234567980");
    printSimilarity("47/2010", "472010");
    printSimilarity("47/2010", "472011");
    printSimilarity("47/2010", "AB.CDEF");
    printSimilarity("47/2010", "4B.CDEFG");
    printSimilarity("47/2010", "AB.CDEFG");
    printSimilarity("The quick fox jumped", "The fox jumped");
    printSimilarity("The quick fox jumped", "The fox");
    printSimilarity("kitten", "sitting");
  }

}

Output:

1.000 is the similarity between "" and ""
0.100 is the similarity between "1234567890" and "1"
0.300 is the similarity between "1234567890" and "123"
0.700 is the similarity between "1234567890" and "1234567"
1.000 is the similarity between "1234567890" and "1234567890"
0.800 is the similarity between "1234567890" and "1234567980"
0.857 is the similarity between "47/2010" and "472010"
0.714 is the similarity between "47/2010" and "472011"
0.000 is the similarity between "47/2010" and "AB.CDEF"
0.125 is the similarity between "47/2010" and "4B.CDEFG"
0.000 is the similarity between "47/2010" and "AB.CDEFG"
0.700 is the similarity between "The quick fox jumped" and "The fox jumped"
0.350 is the similarity between "The quick fox jumped" and "The fox"
0.571 is the similarity between "kitten" and "sitting"
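The question asks for the most similar string among several candidates, so one possible way to apply the similarity score is to scan a list and keep the best match. The sketch below is self-contained and uses a plain two-row Levenshtein implementation; the class `ClosestMatchSketch` and the method `mostSimilar` are hypothetical names introduced here, and unlike the example above this version compares case-sensitively.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; class and method names are illustrative only.
final class ClosestMatchSketch {

    // Two-row Levenshtein distance (case-sensitive).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Same normalization as the answer above: 1.0 means identical.
    static double similarity(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        return longer == 0 ? 1.0 : (longer - levenshtein(a, b)) / (double) longer;
    }

    // Returns the candidate most similar to target, or null for an empty list.
    static String mostSimilar(String target, List<String> candidates) {
        String best = null;
        double bestScore = -1.0;
        for (String candidate : candidates) {
            double score = similarity(target, candidate);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("The fox", "The fox jumped", "A lazy dog");
        // prints "The fox jumped"
        System.out.println(mostSimilar("The quick fox jumped", candidates));
    }
}
```

For the semi-automated matching described in the question, it would probably be useful to return the score along with the match so that low-scoring pairs can be flagged for manual review.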

11
The Levenshtein distance method is available in org.apache.commons.lang3.StringUtils.
Cleankod 2014

@Cleankod It is now part of commons-text: commons.apache.org/proper/commons-text/javadocs/api-release/org/…
Luiz

15

I translated the Levenshtein distance algorithm into JavaScript:

String.prototype.LevenshteinDistance = function (s2) {
    var array = new Array(this.length + 1);
    for (var i = 0; i < this.length + 1; i++)
        array[i] = new Array(s2.length + 1);

    for (var i = 0; i < this.length + 1; i++)
        array[i][0] = i;
    for (var j = 0; j < s2.length + 1; j++)
        array[0][j] = j;

    for (var i = 1; i < this.length + 1; i++) {
        for (var j = 1; j < s2.length + 1; j++) {
            if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1];
            else {
                array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1);
                array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1);
            }
        }
    }
    return array[this.length][s2.length];
};

11

You could use the Levenshtein distance to calculate the difference between two strings. http://en.wikipedia.org/wiki/Levenshtein_distance


2
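Levenshtein works well for a few strings, but it does not scale to comparisons between a large number of strings.
spender

I have used Levenshtein in Java with some success. I have not run comparisons over huge lists, so there may be a performance hit. It is also a bit simplistic and could use some tweaking to raise the threshold for shorter words (3 or 4 characters, say), which tend to look more similar than they should (it is only 3 edits from cat to dog). Note that the edit distances suggested below are pretty much the same thing: Levenshtein is one particular implementation of edit distance.
Rhubarb 2009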
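One way to read the tweak suggested in the comment above is to make the number of tolerated edits depend on string length, so that short words need a near-exact match. The following is a sketch under that assumption; the class `ThresholdSketch`, the method names, and the specific cut-offs (1 edit for up to 4 characters, otherwise one edit per 4 characters with a minimum of 2) are invented for illustration and would need tuning against real data.

```java
// Hypothetical sketch of a length-aware match threshold.
final class ThresholdSketch {

    // Two-row Levenshtein distance (case-sensitive).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed policy: strings of up to 4 characters tolerate 1 edit;
    // longer strings tolerate one edit per 4 characters, but at least 2.
    static int maxEdits(int longerLength) {
        return longerLength <= 4 ? 1 : Math.max(2, longerLength / 4);
    }

    // True when the edit distance stays within the length-dependent limit.
    static boolean isSimilar(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        return levenshtein(a, b) <= maxEdits(longer);
    }

    public static void main(String[] args) {
        System.out.println(isSimilar("cat", "dog")); // 3 edits, limit is 1
        System.out.println(isSimilar("cat", "cut")); // 1 edit, limit is 1
    }
}
```

Under this policy "cat" and "dog" are rejected even though they are only 3 edits apart, while a single-character typo in a short word is still accepted.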

The following article shows how to combine Levenshtein with an efficient SQL query: literatejava.com/sql/fuzzy-string-search-sql
Thomas W

10

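There are indeed a lot of string similarity measures out there:

  • Levenshtein edit distance;
  • Damerau-Levenshtein distance;
  • Jaro-Winkler similarity;
  • Longest Common Subsequence edit distance;
  • Q-Gram (Ukkonen);
  • n-Gram distance (Kondrak);
  • Jaccard index;
  • Sorensen-Dice coefficient;
  • Cosine similarity;
  • ...

You can find explanations and Java implementations of these here: https://github.com/tdebatty/java-string-similarity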
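To illustrate how one of these measures differs from plain Levenshtein, here is a sketch of the optimal string alignment distance, a restricted variant of Damerau-Levenshtein that counts swapping two adjacent characters as a single edit. This is not code from the linked repository; the class name `OsaDistanceSketch` is an assumption, and the full Damerau-Levenshtein distance can give smaller values in some cases than this restricted variant.

```java
// Hypothetical sketch: optimal string alignment (restricted Damerau-Levenshtein).
final class OsaDistanceSketch {

    // Counts insertions, deletions, substitutions, and swaps of two
    // adjacent characters, each as one edit.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all of a's first i characters
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all of b's first j characters
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
                // Adjacent transposition: "ab" versus "ba".
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // Plain Levenshtein needs 2 edits for this pair; the swap counts as 1 here.
        System.out.println(distance("1234567890", "1234567980")); // prints 1
    }
}
```

For data with frequent typing transpositions, such as hand-entered task descriptions, a measure like this may rank near-misses more intuitively than plain Levenshtein.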





3

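If your strings turn into documents, this sounds like a plagiarism finder to me. Searching for that term may turn up something useful.

"Programming Collective Intelligence" has a chapter on determining whether two documents are similar. The code is written in Python, but it is clean and easy to port.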
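For document-sized inputs, word-level measures are often more practical than character-level edit distance. The sketch below is not the book's code; it swaps in cosine similarity over word counts, one of the measures named in other answers, as an example of the general idea. The class `CosineSketch` and the tokenization rule (lowercase, split on anything that is not a letter or digit) are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: cosine similarity between word-count vectors.
final class CosineSketch {

    // Counts lowercase word occurrences; the splitting rule is an assumption.
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("[^\\p{L}\\p{Nd}]+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Euclidean length of a word-count vector.
    static double norm(Map<String, Integer> vector) {
        double sum = 0.0;
        for (int count : vector.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    // Cosine of the angle between the two vectors: 1.0 for the same word
    // distribution, 0.0 when no words are shared or a document is empty.
    static double cosine(String doc1, String doc2) {
        Map<String, Integer> a = wordCounts(doc1);
        Map<String, Integer> b = wordCounts(doc2);
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            dot += entry.getValue() * (double) b.getOrDefault(entry.getKey(), 0);
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    public static void main(String[] args) {
        System.out.println(cosine("The quick fox jumped", "The fox jumped"));
        System.out.println(cosine("The quick fox jumped", "The fox"));
    }
}
```

Because this treats each document as a bag of words, word order is ignored entirely, which is usually acceptable for long texts but can be too coarse for short strings.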


3

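Thanks to the first answerer. I think computeEditDistance(s1, s2) was being calculated twice. Since that call is time-consuming, I decided to improve the performance of the code. So: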

public class LevenshteinDistance {

public static int computeEditDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
        int lastValue = i;
        for (int j = 0; j <= s2.length(); j++) {
            if (i == 0) {
                costs[j] = j;
            } else {
                if (j > 0) {
                    int newValue = costs[j - 1];
                    if (s1.charAt(i - 1) != s2.charAt(j - 1)) {
                        newValue = Math.min(Math.min(newValue, lastValue),
                                costs[j]) + 1;
                    }
                    costs[j - 1] = lastValue;
                    lastValue = newValue;
                }
            }
        }
        if (i > 0) {
            costs[s2.length()] = lastValue;
        }
    }
    return costs[s2.length()];
}

public static void printDistance(String s1, String s2) {
    double similarityOfStrings = 0.0;
    int editDistance = 0;
    if (s1.length() < s2.length()) { // s1 should always be bigger
        String swap = s1;
        s1 = s2;
        s2 = swap;
    }
    int bigLen = s1.length();
    editDistance = computeEditDistance(s1, s2);
    if (bigLen == 0) {
        similarityOfStrings = 1.0; /* both strings are zero length */
    } else {
        similarityOfStrings = (bigLen - editDistance) / (double) bigLen;
    }
    //////////////////////////
    //System.out.println(s1 + "-->" + s2 + ": " +
      //      editDistance + " (" + similarityOfStrings + ")");
    System.out.println(editDistance + " (" + similarityOfStrings + ")");
}

public static void main(String[] args) {
    printDistance("", "");
    printDistance("1234567890", "1");
    printDistance("1234567890", "12");
    printDistance("1234567890", "123");
    printDistance("1234567890", "1234");
    printDistance("1234567890", "12345");
    printDistance("1234567890", "123456");
    printDistance("1234567890", "1234567");
    printDistance("1234567890", "12345678");
    printDistance("1234567890", "123456789");
    printDistance("1234567890", "1234567890");
    printDistance("1234567890", "1234567980");

    printDistance("47/2010", "472010");
    printDistance("47/2010", "472011");

    printDistance("47/2010", "AB.CDEF");
    printDistance("47/2010", "4B.CDEFG");
    printDistance("47/2010", "AB.CDEFG");

    printDistance("The quick fox jumped", "The fox jumped");
    printDistance("The quick fox jumped", "The fox");
    printDistance("The quick fox jumped",
            "The quick fox jumped off the balcany");
    printDistance("kitten", "sitting");
    printDistance("rosettacode", "raisethysword");
    printDistance(new StringBuilder("rosettacode").reverse().toString(),
            new StringBuilder("raisethysword").reverse().toString());
    for (int i = 1; i < args.length; i += 2) {
        printDistance(args[i - 1], args[i]);
    }


 }
}

Licensed under cc by-sa 3.0 with attribution required.