变长字符串的一种更好的相似度排序算法

152

我正在寻找一种字符串相似性算法，该算法在可变长度字符串上比通常建议的字符串（levenshtein距离，soundex等）产生更好的结果。

例如，

给定字符串A：“ Robert”，

然后字符串B：“艾米·罗伯逊”

会比这更好

字符串C：“理查德”

同样，最好该算法应该是语言不可知的（也可以使用英语以外的其他语言）。

— 马扎高
source

在.NET相似：stackoverflow.com/questions/83777/...

— 西罗桑蒂利郝海东冠状病六四事件法轮功

另请

— 参阅

155

Catalysoft的Simon White写了一篇关于非常聪明的算法的文章，该算法比较了相邻字符对，对于我的目的而言确实非常有效：

http://www.catalysoft.com/articles/StrikeAMatch.html

Simon具有该算法的Java版本，下面我编写了它的PL / Ruby版本（摘自Mark Wong-VanHaren在相关论坛条目注释中完成的纯红宝石版本），以便可以在PostgreSQL查询中使用它：

CREATE FUNCTION string_similarity(str1 varchar, str2 varchar)
RETURNS float8 AS '

str1.downcase! 
pairs1 = (0..str1.length-2).collect {|i| str1[i,2]}.reject {
  |pair| pair.include? " "}
str2.downcase! 
pairs2 = (0..str2.length-2).collect {|i| str2[i,2]}.reject {
  |pair| pair.include? " "}
union = pairs1.size + pairs2.size 
intersection = 0 
pairs1.each do |p1| 
  0.upto(pairs2.size-1) do |i| 
    if p1 == pairs2[i] 
      intersection += 1 
      pairs2.slice!(i) 
      break 
    end 
  end 
end 
(2.0 * intersection) / union

' LANGUAGE 'plruby';

奇迹般有效！

— 马扎高
source

32

您找到答案并在4分钟内写下所有内容？令人印象深刻！

— Matt J

28

经过一些研究和实施，我准备了答案。我把它放在这里的好处是，无论有谁来寻求使用替代算法的实用答案，因为相关问题中的大多数答案似乎都围绕着levenshtein或soundex。

— marzagao

18

正是我一直在寻找的东西。你愿意嫁给我吗？

— BlackTea 2010年

6

@JasonSundram是正确的-实际上，这是字符级双字母组合上众所周知的Dice系数，正如作者在“附录”（页面底部）中所写。

— Fred Foo 2012年

4

比较具有单个孤立字母作为差异的字符串时，这将返回“分数” 1（100％匹配），例如以下示例：string_similarity("vitamin B", "vitamin C") #=> 1，是否有一种简单的方法来防止这种行为？

— MrYoshiji

77

marzagao的答案很好。我将其转换为C＃，所以我想将其发布在这里：

粘贴链接

/// <summary>
/// This class implements string comparison algorithm
/// based on character pair similarity
/// Source: http://www.catalysoft.com/articles/StrikeAMatch.html
/// </summary>
public class SimilarityTool
{
    /// <summary>
    /// Compares the two strings based on letter pair matches
    /// </summary>
    /// <param name="str1"></param>
    /// <param name="str2"></param>
    /// <returns>The percentage match from 0.0 to 1.0 where 1.0 is 100%</returns>
    public double CompareStrings(string str1, string str2)
    {
        List<string> pairs1 = WordLetterPairs(str1.ToUpper());
        List<string> pairs2 = WordLetterPairs(str2.ToUpper());

        int intersection = 0;
        int union = pairs1.Count + pairs2.Count;

        for (int i = 0; i < pairs1.Count; i++)
        {
            for (int j = 0; j < pairs2.Count; j++)
            {
                if (pairs1[i] == pairs2[j])
                {
                    intersection++;
                    pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success

                    break;
                }
            }
        }

        return (2.0 * intersection) / union;
    }

    /// <summary>
    /// Gets all letter pairs for each
    /// individual word in the string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private List<string> WordLetterPairs(string str)
    {
        List<string> AllPairs = new List<string>();

        // Tokenize the string and put the tokens/words into an array
        string[] Words = Regex.Split(str, @"\s");

        // For each word
        for (int w = 0; w < Words.Length; w++)
        {
            if (!string.IsNullOrEmpty(Words[w]))
            {
                // Find the pairs of characters
                String[] PairsInWord = LetterPairs(Words[w]);

                for (int p = 0; p < PairsInWord.Length; p++)
                {
                    AllPairs.Add(PairsInWord[p]);
                }
            }
        }

        return AllPairs;
    }

    /// <summary>
    /// Generates an array containing every 
    /// two consecutive letters in the input string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private string[] LetterPairs(string str)
    {
        int numPairs = str.Length - 1;

        string[] pairs = new string[numPairs];

        for (int i = 0; i < numPairs; i++)
        {
            pairs[i] = str.Substring(i, 2);
        }

        return pairs;
    }
}

— 迈克尔·拉沃伊
source

2

如果可以的话，+ 100，您就救了我一个辛苦的工作伴侣！干杯。

— vvohra87

1

非常好！我唯一的建议就是将其扩展。

— Levitikon

+1！很好用，对Java也做了些微修改。而且它确实比莱文施泰因返回更好的响应。

— Xyene 2012年

1

我在下面添加了将其转换为扩展方法的版本。感谢您的原始版本和出色的翻译。

— Frank Rundatz

@Michael La Voie谢谢，这真是太好了！尽管存在一些问题(2.0 * intersection) / union-比较两个空字符串时得到Double.NaN。

— Vojtěch多纳尔

41

这是marzagao答案的另一个版本，这个版本是用Python编写的：

def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    s = string.lower()
    return [s[i:i+2] for i in list(range(len(s) - 1))]

def string_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union  = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                break
    return (2.0 * hit_count) / union

if __name__ == "__main__":
    """
    Run a test using the example taken from:
    http://www.catalysoft.com/articles/StrikeAMatch.html
    """
    w1 = 'Healed'
    words = ['Heard', 'Healthy', 'Help', 'Herded', 'Sealed', 'Sold']

    for w2 in words:
        print('Healed --- ' + w2)
        print(string_similarity(w1, w2))
        print()

— 约翰·拉特里奇
source

2

当一个单词中有重复的ngram时，string_likeity存在一个小错误，导致相同字符串的得分> 1。在“ hit_count + = 1”之后添加“ break”即可解决该问题。

— jbaiter 2013年

1

@jbaiter：好收获。我更改它以反映您的更改。

— John Rutledge

3

在西蒙·怀特（Simon White）的文章中，他说：“请注意，只要找到匹配项，该字符对就会从第二个数组列表中删除，以防止我们多次对同一个字符对进行匹配。（否则，'GGGGG'将获得完美匹配反对“ GG”。）”我将更改此陈述，以表示它会提供比完美匹配更高的匹配。不考虑这一点，似乎还导致该算法不是可传递的（相似性（x，y）= / =相似性（y，x））。在hit_count + = 1行之后添加pairs2.remove（y）解决了该问题。

— NinjaMeTimbers

17

这是Simon White建议的StrikeAMatch算法的PHP实现。优点（如链接中所述）是：

词汇相似性的真实体现 -差异很小的字符串应被视为相似。特别是，明显的子字符串重叠应指向字符串之间的高度相似性。
单词顺序更改的稳健性 -包含相同单词但顺序不同的两个字符串应被视为相似。另一方面，如果一个字符串只是另一个字符串中包含的随机字符，则（通常）应将其识别为不相似。
语言独立性 -算法不仅应以英语运行，还应以多种不同语言运行。

<?php
/**
 * LetterPairSimilarity algorithm implementation in PHP
 * @author Igal Alkon
 * @link http://www.catalysoft.com/articles/StrikeAMatch.html
 */
class LetterPairSimilarity
{
    /**
     * @param $str
     * @return mixed
     */
    private function wordLetterPairs($str)
    {
        $allPairs = array();

        // Tokenize the string and put the tokens/words into an array

        $words = explode(' ', $str);

        // For each word
        for ($w = 0; $w < count($words); $w++)
        {
            // Find the pairs of characters
            $pairsInWord = $this->letterPairs($words[$w]);

            for ($p = 0; $p < count($pairsInWord); $p++)
            {
                $allPairs[] = $pairsInWord[$p];
            }
        }

        return $allPairs;
    }

    /**
     * @param $str
     * @return array
     */
    private function letterPairs($str)
    {
        $numPairs = mb_strlen($str)-1;
        $pairs = array();

        for ($i = 0; $i < $numPairs; $i++)
        {
            $pairs[$i] = mb_substr($str,$i,2);
        }

        return $pairs;
    }

    /**
     * @param $str1
     * @param $str2
     * @return float
     */
    public function compareStrings($str1, $str2)
    {
        $pairs1 = $this->wordLetterPairs(strtoupper($str1));
        $pairs2 = $this->wordLetterPairs(strtoupper($str2));

        $intersection = 0;

        $union = count($pairs1) + count($pairs2);

        for ($i=0; $i < count($pairs1); $i++)
        {
            $pair1 = $pairs1[$i];

            $pairs2 = array_values($pairs2);
            for($j = 0; $j < count($pairs2); $j++)
            {
                $pair2 = $pairs2[$j];
                if ($pair1 === $pair2)
                {
                    $intersection++;
                    unset($pairs2[$j]);
                    break;
                }
            }
        }

        return (2.0*$intersection)/$union;
    }
}

— 伊加尔·阿尔孔（Igal Alkon）
source

17

John Rutledge的答案的简短版本：

def get_bigrams(string):
    '''
    Takes a string and returns a list of bigrams
    '''
    s = string.lower()
    return {s[i:i+2] for i in xrange(len(s) - 1)}

def string_similarity(str1, str2):
    '''
    Perform bigram comparison between two strings
    and return a percentage match in decimal form
    '''
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    return (2.0 * len(pairs1 & pairs2)) / (len(pairs1) + len(pairs2))

— 量子
source

即使intersection变量是行浪费。

— Chibueze Opata

14

这次讨论真的很有帮助，谢谢。我将算法转换为VBA以与Excel一起使用，并编写了几个版本的工作表函数，一个版本用于简单比较一对字符串，另一个版本用于比较一个字符串与字符串范围/数组。strSimLookup版本返回最后的最佳匹配作为字符串，数组索引或相似性度量。

这种实现产生的结果与西蒙·怀特（Simon White）网站上的亚马逊示例中列出的结果相同，但低得分比赛中有一些次要的例外。不知道差异在哪里蔓延，可能是VBA的Split函数，但是我没有对其进行研究，因为它可以很好地实现我的目的。

'Implements functions to rate how similar two strings are on
'a scale of 0.0 (completely dissimilar) to 1.0 (exactly similar)
'Source:   http://www.catalysoft.com/articles/StrikeAMatch.html
'Author: Bob Chatham, bob.chatham at gmail.com
'9/12/2010

Option Explicit

Public Function stringSimilarity(str1 As String, str2 As String) As Variant
'Simple version of the algorithm that computes the similiarity metric
'between two strings.
'NOTE: This verision is not efficient to use if you're comparing one string
'with a range of other values as it will needlessly calculate the pairs for the
'first string over an over again; use the array-optimized version for this case.

    Dim sPairs1 As Collection
    Dim sPairs2 As Collection

    Set sPairs1 = New Collection
    Set sPairs2 = New Collection

    WordLetterPairs str1, sPairs1
    WordLetterPairs str2, sPairs2

    stringSimilarity = SimilarityMetric(sPairs1, sPairs2)

    Set sPairs1 = Nothing
    Set sPairs2 = Nothing

End Function

Public Function strSimA(str1 As Variant, rRng As Range) As Variant
'Return an array of string similarity indexes for str1 vs every string in input range rRng
    Dim sPairs1 As Collection
    Dim sPairs2 As Collection
    Dim arrOut As Variant
    Dim l As Long, j As Long

    Set sPairs1 = New Collection

    WordLetterPairs CStr(str1), sPairs1

    l = rRng.Count
    ReDim arrOut(1 To l)
    For j = 1 To l
        Set sPairs2 = New Collection
        WordLetterPairs CStr(rRng(j)), sPairs2
        arrOut(j) = SimilarityMetric(sPairs1, sPairs2)
        Set sPairs2 = Nothing
    Next j

    strSimA = Application.Transpose(arrOut)

End Function

Public Function strSimLookup(str1 As Variant, rRng As Range, Optional returnType) As Variant
'Return either the best match or the index of the best match
'depending on returnTYype parameter) between str1 and strings in rRng)
' returnType = 0 or omitted: returns the best matching string
' returnType = 1           : returns the index of the best matching string
' returnType = 2           : returns the similarity metric

    Dim sPairs1 As Collection
    Dim sPairs2 As Collection
    Dim metric, bestMetric As Double
    Dim i, iBest As Long
    Const RETURN_STRING As Integer = 0
    Const RETURN_INDEX As Integer = 1
    Const RETURN_METRIC As Integer = 2

    If IsMissing(returnType) Then returnType = RETURN_STRING

    Set sPairs1 = New Collection

    WordLetterPairs CStr(str1), sPairs1

    bestMetric = -1
    iBest = -1

    For i = 1 To rRng.Count
        Set sPairs2 = New Collection
        WordLetterPairs CStr(rRng(i)), sPairs2
        metric = SimilarityMetric(sPairs1, sPairs2)
        If metric > bestMetric Then
            bestMetric = metric
            iBest = i
        End If
        Set sPairs2 = Nothing
    Next i

    If iBest = -1 Then
        strSimLookup = CVErr(xlErrValue)
        Exit Function
    End If

    Select Case returnType
    Case RETURN_STRING
        strSimLookup = CStr(rRng(iBest))
    Case RETURN_INDEX
        strSimLookup = iBest
    Case Else
        strSimLookup = bestMetric
    End Select

End Function

Public Function strSim(str1 As String, str2 As String) As Variant
    Dim ilen, iLen1, ilen2 As Integer

    iLen1 = Len(str1)
    ilen2 = Len(str2)

    If iLen1 >= ilen2 Then ilen = ilen2 Else ilen = iLen1

    strSim = stringSimilarity(Left(str1, ilen), Left(str2, ilen))

End Function

Sub WordLetterPairs(str As String, pairColl As Collection)
'Tokenize str into words, then add all letter pairs to pairColl

    Dim Words() As String
    Dim word, nPairs, pair As Integer

    Words = Split(str)

    If UBound(Words) < 0 Then
        Set pairColl = Nothing
        Exit Sub
    End If

    For word = 0 To UBound(Words)
        nPairs = Len(Words(word)) - 1
        If nPairs > 0 Then
            For pair = 1 To nPairs
                pairColl.Add Mid(Words(word), pair, 2)
            Next pair
        End If
    Next word

End Sub

Private Function SimilarityMetric(sPairs1 As Collection, sPairs2 As Collection) As Variant
'Helper function to calculate similarity metric given two collections of letter pairs.
'This function is designed to allow the pair collections to be set up separately as needed.
'NOTE: sPairs2 collection will be altered as pairs are removed; copy the collection
'if this is not the desired behavior.
'Also assumes that collections will be deallocated somewhere else

    Dim Intersect As Double
    Dim Union As Double
    Dim i, j As Long

    If sPairs1.Count = 0 Or sPairs2.Count = 0 Then
        SimilarityMetric = CVErr(xlErrNA)
        Exit Function
    End If

    Union = sPairs1.Count + sPairs2.Count
    Intersect = 0

    For i = 1 To sPairs1.Count
        For j = 1 To sPairs2.Count
            If StrComp(sPairs1(i), sPairs2(j)) = 0 Then
                Intersect = Intersect + 1
                sPairs2.Remove j
                Exit For
            End If
        Next j
    Next i

    SimilarityMetric = (2 * Intersect) / Union

End Function

— Bchatham
source

@bchatham这看起来非常有用，但是我是VBA的新手，并且对代码有些挑战。您是否可以发布一个使用您的贡献的Excel文件？出于我的目的，我希望使用它来匹配Excel中单个列中的大约1000个条目的相似名字（此处摘录：dropbox.com/s/ofdliln9zxgi882/first-names-excerpt.xlsx）。然后，我将这些匹配项用作人物搜索中的同义词。（参见softwarerecs.stackexchange.com/questions/38227/...）

— bjornte

12

抱歉，答案不是作者发明的。这是Digital Equipment Corporation最初提出的众所周知的算法，通常称为重叠。

http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-TN-1997-015.pdf

10

我将西蒙·怀特的算法翻译成PL / pgSQL。这是我的贡献。

<!-- language: lang-sql -->

create or replace function spt1.letterpairs(in p_str varchar) 
returns varchar  as 
$$
declare

    v_numpairs integer := length(p_str)-1;
    v_pairs varchar[];

begin

    for i in 1 .. v_numpairs loop
        v_pairs[i] := substr(p_str, i, 2);
    end loop;

    return v_pairs;

end;
$$ language 'plpgsql';

--===================================================================

create or replace function spt1.wordletterpairs(in p_str varchar) 
returns varchar as
$$
declare
    v_allpairs varchar[];
    v_words varchar[];
    v_pairsinword varchar[];
begin
    v_words := regexp_split_to_array(p_str, '[[:space:]]');

    for i in 1 .. array_length(v_words, 1) loop
        v_pairsinword := spt1.letterpairs(v_words[i]);

        if v_pairsinword is not null then
            for j in 1 .. array_length(v_pairsinword, 1) loop
                v_allpairs := v_allpairs || v_pairsinword[j];
            end loop;
        end if;

    end loop;


    return v_allpairs;
end;
$$ language 'plpgsql';

--===================================================================

create or replace function spt1.arrayintersect(ANYARRAY, ANYARRAY)
returns anyarray as 
$$
    select array(select unnest($1) intersect select unnest($2))
$$ language 'sql';

--===================================================================

create or replace function spt1.comparestrings(in p_str1 varchar, in p_str2 varchar)
returns float as
$$
declare
    v_pairs1 varchar[];
    v_pairs2 varchar[];
    v_intersection integer;
    v_union integer;
begin
    v_pairs1 := wordletterpairs(upper(p_str1));
    v_pairs2 := wordletterpairs(upper(p_str2));
    v_union := array_length(v_pairs1, 1) + array_length(v_pairs2, 1); 

    v_intersection := array_length(arrayintersect(v_pairs1, v_pairs2), 1);

    return (2.0 * v_intersection / v_union);
end;
$$ language 'plpgsql';

— Fabiolimace
source

在不支持plruby的PostgreSQL上运行！谢谢！

— hostnik

谢谢！您将如何在Oracle SQL中执行此操作？

— olovholm 2013年

该端口不正确。精确的字符串不返回1

— 布兰登Wigfield

9

字符串相似性度量标准包含字符串比较中使用的许多不同度量标准的概述（维基百科也有概述）。这些度量标准中的许多是在库模拟中实现的。

度量的另一个示例（未包含在给定的概述中）例如是压缩距离（试图逼近Kolmogorov的复杂度），它可以用于比您所呈现的文本更长的文本。

您还可以考虑研究自然语言处理的更广泛的主题。这些 R软件包可以帮助您快速入门（或至少提供一些想法）。

最后编辑-在SO上搜索与此主题有关的其他问题，有很多相关问题。

— 匿名
source

9

该算法的更快的PHP版本：

/**
 *
 * @param $str
 * @return mixed
 */
private static function wordLetterPairs ($str)
{
    $allPairs = array();

    // Tokenize the string and put the tokens/words into an array

    $words = explode(' ', $str);

    // For each word
    for ($w = 0; $w < count($words); $w ++) {
        // Find the pairs of characters
        $pairsInWord = self::letterPairs($words[$w]);

        for ($p = 0; $p < count($pairsInWord); $p ++) {
            $allPairs[$pairsInWord[$p]] = $pairsInWord[$p];
        }
    }

    return array_values($allPairs);
}

/**
 *
 * @param $str
 * @return array
 */
private static function letterPairs ($str)
{
    $numPairs = mb_strlen($str) - 1;
    $pairs = array();

    for ($i = 0; $i < $numPairs; $i ++) {
        $pairs[$i] = mb_substr($str, $i, 2);
    }

    return $pairs;
}

/**
 *
 * @param $str1
 * @param $str2
 * @return float
 */
public static function compareStrings ($str1, $str2)
{
    $pairs1 = self::wordLetterPairs(mb_strtolower($str1));
    $pairs2 = self::wordLetterPairs(mb_strtolower($str2));


    $union = count($pairs1) + count($pairs2);

    $intersection = count(array_intersect($pairs1, $pairs2));

    return (2.0 * $intersection) / $union;
}

对于我拥有的数据（约2300次比较），使用Igal Alkon解决方案的运行时间为0.58秒，而使用我的运行时间为0.35 秒。

— 安迪
source

9

美丽的Scala中的一个版本：

  def pairDistance(s1: String, s2: String): Double = {

    def strToPairs(s: String, acc: List[String]): List[String] = {
      if (s.size < 2) acc
      else strToPairs(s.drop(1),
        if (s.take(2).contains(" ")) acc else acc ::: List(s.take(2)))
    }

    val lst1 = strToPairs(s1.toUpperCase, List())
    val lst2 = strToPairs(s2.toUpperCase, List())

    (2.0 * lst2.intersect(lst1).size) / (lst1.size + lst2.size)

  }

— 不知所措
source

6

这是R版本：

get_bigrams <- function(str)
{
  lstr = tolower(str)
  bigramlst = list()
  for(i in 1:(nchar(str)-1))
  {
    bigramlst[[i]] = substr(str, i, i+1)
  }
  return(bigramlst)
}

str_similarity <- function(str1, str2)
{
   pairs1 = get_bigrams(str1)
   pairs2 = get_bigrams(str2)
   unionlen  = length(pairs1) + length(pairs2)
   hit_count = 0
   for(x in 1:length(pairs1)){
        for(y in 1:length(pairs2)){
            if (pairs1[[x]] == pairs2[[y]])
                hit_count = hit_count + 1
        }
   }
   return ((2.0 * hit_count) / unionlen)
}

— 雷纳
source

对于大数据，此算法更好，但速度很慢。我的意思是，如果必须将10000个单词与15000个其他单词进行比较，那太慢了。我们可以提高速度方面的表现吗？

— indra_patil 2014年

6

受这些算法的启发，在C99中发布marzagao的答案

double dice_match(const char *string1, const char *string2) {

    //check fast cases
    if (((string1 != NULL) && (string1[0] == '\0')) || 
        ((string2 != NULL) && (string2[0] == '\0'))) {
        return 0;
    }
    if (string1 == string2) {
        return 1;
    }

    size_t strlen1 = strlen(string1);
    size_t strlen2 = strlen(string2);
    if (strlen1 < 2 || strlen2 < 2) {
        return 0;
    }

    size_t length1 = strlen1 - 1;
    size_t length2 = strlen2 - 1;

    double matches = 0;
    int i = 0, j = 0;

    //get bigrams and compare
    while (i < length1 && j < length2) {
        char a[3] = {string1[i], string1[i + 1], '\0'};
        char b[3] = {string2[j], string2[j + 1], '\0'};
        int cmp = strcmpi(a, b);
        if (cmp == 0) {
            matches += 2;
        }
        i++;
        j++;
    }

    return matches / (length1 + length2);
}

根据原始文章进行的一些测试：

#include <stdio.h>

void article_test1() {
    char *string1 = "FRANCE";
    char *string2 = "FRENCH";
    printf("====%s====\n", __func__);
    printf("%2.f%% == 40%%\n", dice_match(string1, string2) * 100);
}


void article_test2() {
    printf("====%s====\n", __func__);
    char *string = "Healed";
    char *ss[] = {"Heard", "Healthy", "Help",
                  "Herded", "Sealed", "Sold"};
    int correct[] = {44, 55, 25, 40, 80, 0};
    for (int i = 0; i < 6; ++i) {
        printf("%2.f%% == %d%%\n", dice_match(string, ss[i]) * 100, correct[i]);
    }
}

void multicase_test() {
    char *string1 = "FRaNcE";
    char *string2 = "fREnCh";
    printf("====%s====\n", __func__);
    printf("%2.f%% == 40%%\n", dice_match(string1, string2) * 100);

}

void gg_test() {
    char *string1 = "GG";
    char *string2 = "GGGGG";
    printf("====%s====\n", __func__);
    printf("%2.f%% != 100%%\n", dice_match(string1, string2) * 100);
}


int main() {
    article_test1();
    article_test2();
    multicase_test();
    gg_test();

    return 0;
}

— 剧本
source

5

根据Michael La Voie出色的C＃版本的构建，根据要求使其成为扩展方法，这就是我想到的。这样做的主要好处是可以按百分比匹配对通用列表进行排序。例如，假设您的对象中有一个名为“ City”的字符串字段。用户搜索“ Chester”，并且您要按匹配的降序返回结果。例如，您希望切斯特的字面匹配出现在罗切斯特之前。为此，向您的对象添加两个新属性：

    public string SearchText { get; set; }
    public double PercentMatch
    {
        get
        {
            return City.ToUpper().PercentMatchTo(this.SearchText.ToUpper());
        }
    }

然后在每个对象上，将SearchText设置为用户搜索的内容。然后，您可以使用以下内容轻松对其进行排序：

    zipcodes = zipcodes.OrderByDescending(x => x.PercentMatch);

这是对其进行一些细微的修改，使其成为扩展方法：

    /// <summary>
    /// This class implements string comparison algorithm
    /// based on character pair similarity
    /// Source: http://www.catalysoft.com/articles/StrikeAMatch.html
    /// </summary>
    public static double PercentMatchTo(this string str1, string str2)
    {
        List<string> pairs1 = WordLetterPairs(str1.ToUpper());
        List<string> pairs2 = WordLetterPairs(str2.ToUpper());

        int intersection = 0;
        int union = pairs1.Count + pairs2.Count;

        for (int i = 0; i < pairs1.Count; i++)
        {
            for (int j = 0; j < pairs2.Count; j++)
            {
                if (pairs1[i] == pairs2[j])
                {
                    intersection++;
                    pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success

                    break;
                }
            }
        }

        return (2.0 * intersection) / union;
    }

    /// <summary>
    /// Gets all letter pairs for each
    /// individual word in the string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private static List<string> WordLetterPairs(string str)
    {
        List<string> AllPairs = new List<string>();

        // Tokenize the string and put the tokens/words into an array
        string[] Words = Regex.Split(str, @"\s");

        // For each word
        for (int w = 0; w < Words.Length; w++)
        {
            if (!string.IsNullOrEmpty(Words[w]))
            {
                // Find the pairs of characters
                String[] PairsInWord = LetterPairs(Words[w]);

                for (int p = 0; p < PairsInWord.Length; p++)
                {
                    AllPairs.Add(PairsInWord[p]);
                }
            }
        }

        return AllPairs;
    }

    /// <summary>
    /// Generates an array containing every 
    /// two consecutive letters in the input string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private static  string[] LetterPairs(string str)
    {
        int numPairs = str.Length - 1;

        string[] pairs = new string[numPairs];

        for (int i = 0; i < numPairs; i++)
        {
            pairs[i] = str.Substring(i, 2);
        }

        return pairs;
    }

— 弗兰克·朗达兹
source

我想你会过得更好使用布尔isCaseSensitive用默认值false -即使这是真的实行更干净

— 乔丹

5

我的JavaScript实现采用一个字符串或字符串数组，以及一个可选的下限（默认下限为0.5）。如果将字符串传递给它，它将根据字符串的相似性分数是否大于或等于下限来返回true或false。如果将字符串数组传递给它，它将返回相似度分数大于或等于下限的那些字符串的数组，并按分数排序。

例子：

'Healed'.fuzzy('Sealed');      // returns true
'Healed'.fuzzy('Help');        // returns false
'Healed'.fuzzy('Help', 0.25);  // returns true

'Healed'.fuzzy(['Sold', 'Herded', 'Heard', 'Help', 'Sealed', 'Healthy']);
// returns ["Sealed", "Healthy"]

'Healed'.fuzzy(['Sold', 'Herded', 'Heard', 'Help', 'Sealed', 'Healthy'], 0);
// returns ["Sealed", "Healthy", "Heard", "Herded", "Help", "Sold"]

这里是：

(function(){
  var default_floor = 0.5;

  function pairs(str){
    var pairs = []
      , length = str.length - 1
      , pair;
    str = str.toLowerCase();
    for(var i = 0; i < length; i++){
      pair = str.substr(i, 2);
      if(!/\s/.test(pair)){
        pairs.push(pair);
      }
    }
    return pairs;
  }

  function similarity(pairs1, pairs2){
    var union = pairs1.length + pairs2.length
      , hits = 0;

    for(var i = 0; i < pairs1.length; i++){
      for(var j = 0; j < pairs2.length; j++){
        if(pairs1[i] == pairs2[j]){
          pairs2.splice(j--, 1);
          hits++;
          break;
        }
      }
    }
    return 2*hits/union || 0;
  }

  String.prototype.fuzzy = function(strings, floor){
    var str1 = this
      , pairs1 = pairs(this);

    floor = typeof floor == 'number' ? floor : default_floor;

    if(typeof(strings) == 'string'){
      return str1.length > 1 && strings.length > 1 && similarity(pairs1, pairs(strings)) >= floor || str1.toLowerCase() == strings.toLowerCase();
    }else if(strings instanceof Array){
      var scores = {};

      strings.map(function(str2){
        scores[str2] = str1.length > 1 ? similarity(pairs1, pairs(str2)) : 1*(str1.toLowerCase() == str2.toLowerCase());
      });

      return strings.filter(function(str){
        return scores[str] >= floor;
      }).sort(function(a, b){
        return scores[b] - scores[a];
      });
    }
  };
})();

— Gruppler
source

1

错误/错别字！for(var j = 0; j < pairs1.length; j++){应该是for(var j = 0; j < pairs2.length; j++){

— Searle '18

3

Dice系数算法（Simon White / marzagao的答案）在Ruby中以amatch gem中的pair_distance_like方法实现

https://github.com/flori/amatch

该gem还包含许多近似匹配和字符串比较算法的实现：Levenshtein编辑距离，Sellers编辑距离，汉明距离，最长公共子序列长度，最长公共子字符串长度，对距离度量，Jaro-Winkler度量。

— 01
source

2

Haskell版本-可以建议修改内容，因为我没有做太多Haskell。

import Data.Char
import Data.List

-- Convert a string into words, then get the pairs of words from that phrase
wordLetterPairs :: String -> [String]
wordLetterPairs s1 = concat $ map pairs $ words s1

-- Converts a String into a list of letter pairs.
pairs :: String -> [String]
pairs [] = []
pairs (x:[]) = []
pairs (x:ys) = [x, head ys]:(pairs ys)

-- Calculates the match rating for two strings
matchRating :: String -> String -> Double
matchRating s1 s2 = (numberOfMatches * 2) / totalLength
  where pairsS1 = wordLetterPairs $ map toLower s1
        pairsS2 = wordLetterPairs $ map toLower s2
        numberOfMatches = fromIntegral $ length $ pairsS1 `intersect` pairsS2
        totalLength = fromIntegral $ length pairsS1 + length pairsS2

— 马修·帕尔默
source

2

Clojure：

(require '[clojure.set :refer [intersection]])

(defn bigrams [s]
  (->> (split s #"\s+")
       (mapcat #(partition 2 1 %))
       (set)))

(defn string-similarity [a b]
  (let [a-pairs (bigrams a)
        b-pairs (bigrams b)
        total-count (+ (count a-pairs) (count b-pairs))
        match-count (count (intersection a-pairs b-pairs))
        similarity (/ (* 2 match-count) total-count)]
    similarity))

— 肖恩·勒布朗
source

1

Levenshtein距离除以第一个琴弦的长度（或者除以我的两个琴弦的最小/最大/平均长度）会怎样？到目前为止，这对我有用。

— 特万
source

但是，引用此主题的另一篇文章，它返回的内容通常是“不稳定的”。它把“回声”与“狗”非常相似。

— Xyene

@Nox：此回复的“除以第一个字符串的长度”部分很重要。而且，对于错别字和换位错误，甚至是常见的共轭（例如，考虑比较“游泳”和“游泳”），这比广受好评的Dice算法的效果要好。

— Logan Pickup

1

大家好，我在javascript中进行了尝试，但是我是新手，有人知道更快的方法吗？

function get_bigrams(string) {
    // Takes a string and returns a list of bigrams
    var s = string.toLowerCase();
    var v = new Array(s.length-1);
    for (i = 0; i< v.length; i++){
        v[i] =s.slice(i,i+2);
    }
    return v;
}

function string_similarity(str1, str2){
    /*
    Perform bigram comparison between two strings
    and return a percentage match in decimal form
    */
    var pairs1 = get_bigrams(str1);
    var pairs2 = get_bigrams(str2);
    var union = pairs1.length + pairs2.length;
    var hit_count = 0;
    for (x in pairs1){
        for (y in pairs2){
            if (pairs1[x] == pairs2[y]){
                hit_count++;
            }
        }
    }
    return ((2.0 * hit_count) / union);
}


var w1 = 'Healed';
var word =['Heard','Healthy','Help','Herded','Sealed','Sold']
for (w2 in word){
    console.log('Healed --- ' + word[w2])
    console.log(string_similarity(w1,word[w2]));
}

— 用户名
source

此实现不正确。bigram函数针对长度为0的输入而中断。string_likeity方法在第二个循环内未正确中断，这可能导致多次计数对，从而导致返回值超过100％。而且，您也忘记了声明x和y，并且您不应该使用for..in..循环（请for(..;..;..)改用）来循环。

— 罗布W

1

这是基于Sørensen–Dice索引（marzagao的答案）的相似性的另一个版本，该版本用C ++ 11编写：

/*
 * Similarity based in Sørensen–Dice index.
 *
 * Returns the Similarity between _str1 and _str2.
 */
double similarity_sorensen_dice(const std::string& _str1, const std::string& _str2) {
    // Base case: if some string is empty.
    if (_str1.empty() || _str2.empty()) {
        return 1.0;
    }

    auto str1 = upper_string(_str1);
    auto str2 = upper_string(_str2);

    // Base case: if the strings are equals.
    if (str1 == str2) {
        return 0.0;
    }

    // Base case: if some string does not have bigrams.
    if (str1.size() < 2 || str2.size() < 2) {
        return 1.0;
    }

    // Extract bigrams from str1
    auto num_pairs1 = str1.size() - 1;
    std::unordered_set<std::string> str1_bigrams;
    str1_bigrams.reserve(num_pairs1);
    for (unsigned i = 0; i < num_pairs1; ++i) {
        str1_bigrams.insert(str1.substr(i, 2));
    }

    // Extract bigrams from str2
    auto num_pairs2 = str2.size() - 1;
    std::unordered_set<std::string> str2_bigrams;
    str2_bigrams.reserve(num_pairs2);
    for (unsigned int i = 0; i < num_pairs2; ++i) {
        str2_bigrams.insert(str2.substr(i, 2));
    }

    // Find the intersection between the two sets.
    int intersection = 0;
    if (str1_bigrams.size() < str2_bigrams.size()) {
        const auto it_e = str2_bigrams.end();
        for (const auto& bigram : str1_bigrams) {
            intersection += str2_bigrams.find(bigram) != it_e;
        }
    } else {
        const auto it_e = str1_bigrams.end();
        for (const auto& bigram : str2_bigrams) {
            intersection += str1_bigrams.find(bigram) != it_e;
        }
    }

    // Returns similarity coefficient.
    return (2.0 * intersection) / (num_pairs1 + num_pairs2);
}

— chema989
source

1

我一直在寻找@marzagao的答案所指出的算法的纯红宝石实现。不幸的是，@ marzagao指示的链接已断开。在@ s01ipsist答案中，他指出了红宝石宝石匹配项，其中实现不是纯红宝石。所以，我的searchd一点，发现宝石fuzzy_match其中有纯Ruby实现（尽管这种宝石用amatch在）这里。我希望这会对像我这样的人有所帮助。

— 工程师哈桑努沙曼·苏蒙（Hasanuzzaman Sumon）
source