9

两个字符串之间的编辑（或Levenshtein）距离是将一个字符串转换为另一个字符串所需的最小单字符插入，删除和替换次数。如果两个字符串的长度均为n，则众所周知可以通过动态编程在O（n ^ 2）时间内完成。以下Python代码针对两个字符串s1和进行此计算s2。

def edit_distance(s1, s2):
    l1 = len(s1)
    l2 = len(s2)

    matrix = [range(l1 + 1)] * (l2 + 1)
    for zz in range(l2 + 1):
      matrix[zz] = range(zz,zz + l1 + 1)
    for zz in range(0,l2):
      for sz in range(0,l1):
        if s1[sz] == s2[zz]:
          matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
        else:
          matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
    return matrix[l2][l1]

在此任务中，您必须尽可能地计算出编辑距离，但要严格地限制内存。您的代码允许定义一个包含1000个32位整数的数组，这将是您在计算中使用的唯一临时存储。所有变量和数据结构都将包含在此数组中。特别是，对于长度为1000的字符串，您将无法实现上述算法，因为它将需要您至少存储1,000,000个数字。如果您的语言没有自然的32位整数（例如Python），则只需要确保您在数组中存储的数字不要大于2 ^ 32-1。

您可以使用所选的任何标准库读取数据，而不必担心该部分中的内存限制。为了使代码的主要部分公平竞争，您只能使用功能上与C编程语言相同的操作，而不能使用任何外部库。

需要特别清楚的是，用于存储输入数据或您的语言的解释器，JVM等使用的内存不计入您的限制，并且您可能不会在磁盘上写入任何内容。您必须假定输入数据在内存中时是只读的，因此您不能重复使用该数据以获得更多的工作空间。

我必须执行什么？

您的代码应以以下格式读入文件。它将有三行。第一行是真实的编辑距离。第二个是字符串1，第三个是字符串2。我将在https://bpaste.net/show/6905001d52e8上使用示例数据对其进行测试，其中字符串的长度为10,000，但不应专门用于此数据。它应该输出两个字符串之间可以找到的最小编辑距离。

您还需要证明您的编辑距离实际上来自一组有效的编辑。您的代码应具有一个开关，该开关将其转换为可能使用更多内存（尽可能多）的模式，并输出给出您的编辑距离的编辑操作。

得分了

您的分数将是(optimal edit distance/divided by the edit distance you find) * 100。首先，请注意，只需计算两个字符串之间的不匹配数即可获得分数。

您可以使用自己喜欢的任何语言，这些语言可以免费获得，并且可以在Linux中轻松安装。

抢七

如果出现抢七的情况，我将在Linux机器上运行您的代码，以最快的速度获胜。

code-challenge string optimization

将for(int i=0;i<=5;i++)被允许的，因为它存储数据i？

— Beta Decay

2

@BetaDecay是的，尽管要更严格地遵循规则，您将执行类似的操作，{ uint32_t foo[1000]; for (foo[0] = 0; foo[0] < 5; ++foo[0]) printf("%d ", foo[0]); } 这是假设将调用32位整数数组foo。

在文件中具有真正的编辑距离有什么意义？该程序实际上应该阅读吗？还是（看起来更明智）是让您了解该计划的成功程度？

— feersum

@feersum确实如此。它就在那里，您可以轻松查看自己的得分。

bpaste.net/show/6905001d52e8给了我404页面！

— sergiol

4

C ++，得分92.35

估计算法：该算法找到两个字符串不同的第一位，然后尝试所有可能的N个操作排列（插入，删除，替换-匹配的字符被跳过而不消耗任何操作）。它根据该组操作与两个字符串成功匹配的距离，再加上使字符串长度收敛的数量，对每个可能的操作组进行评分。在确定N个操作的最高得分集合之后，应用集合中的第一个操作，找到下一个不匹配项，然后重复该过程，直到到达字符串的末尾。

该程序尝试从1到10的所有N值，然后选择给出最佳结果的水平。由于计分方法考虑了字符串长度，因此N = 10通常是最好的。较高的N值可能会更好，但是会花费更多的时间。

内存使用情况：由于该程序是完全迭代的，因此只需要很少的内存。仅使用19个变量来跟踪程序状态。这些由#defines设置以用作全局变量。

用法：该程序的使用方式与feersum相同：假定第一个参数为文件，并且任何其他参数指示应显示编辑内容。程序总是打印估计的编辑距离和分数。

验证输出：将验证输出格式化为三行：

11011111100101100111100110100 110 0 0000   0 01101
R I          IR     R        D   D D    DDD D     D
01 1111110010 0001110001101000110101000011101011010

最上面一行是目标字符串，中间一行是操作，最下面一行是正在编辑的字符串。操作行中的空格表示字符匹配。“ R”表示编辑字符串在该位置的字符已替换为目标字符串的字符。“ I”表示编辑字符串在该位置插入了目标字符串的字符。“ D”表示编辑字符串在该位置的字符已删除。当编辑和目标字符串中的另一个插入或删除字符时，它们将插入空格，以便它们对齐。

#include <stdio.h>
#include <stdlib.h>
#include <string>
#include <math.h>
#include <fstream>

int memory[1000];
#define first (*(const char **)&memory[0])
#define second (*(const char **)&memory[1])
#define block_ia memory[2]
#define block_ib memory[3]
#define block_n memory[4]
#define block_op memory[5]
#define block_o memory[6]
#define block_x memory[7]
#define n memory[8]
#define opmax memory[9]
#define best_op memory[10]
#define best_score memory[11]
#define score memory[12]
#define best_counter memory[13]
#define la memory[14]
#define lb memory[15]
#define best memory[16]
#define bestn memory[17]
#define total memory[18]

// verification variables
char printline1[0xffff]={};
char *p1=printline1;
char printline2[0xffff]={};
char *p2=printline2;
char printline3[0xffff]={};
char *p3=printline3;


// determine how many characters match after a set of operations
int block(){
    block_ia=0;
    block_ib=0;
    for ( block_x=0;block_x<block_n;block_x++){
        block_o = block_op%3;
        block_op /= 3;
        if ( block_o == 0 ){ // replace
            block_ia++;
            block_ib++;
        } else if ( block_o == 1 ){ // delete
            block_ib++;
        } else { // insert
            if ( first[block_ia] ){ 
                block_ia++;
            }
        }
        while ( first[block_ia] && first[block_ia]==second[block_ib] ){ // find next mismatch
            block_ia++;
            block_ib++;
        }
        if ( first[block_ia]==0 ){
            return block_x;
        }
    }
    return block_n;
}

// find the highest-scoring set of N operations for the current string position
void bestblock(){
    best_op=0;
    best_score=0;
    la = strlen(first);
    lb = strlen(second);
    block_n = n;
    for(best_counter=0;best_counter<opmax;best_counter++){
        block_op=best_counter;
        score = n-block();
        score += block_ia-abs((la-block_ia)-(lb-block_ib));
        if ( score > best_score ){
            best_score = score;
            best_op = best_counter;
        }
    }
}

// prepare edit confirmation record
void printedit(const char * a, const char * b, int o){
    o%=3;
    if ( o == 0 ){ // replace
        *p1 = *a;
        if ( *b ){
            *p2 = 'R';
            *p3 = *b;
            b++;
        } else {
            *p2 = 'I';
            *p3 = ' ';
        }
        a++;
    } else if ( o == 1 ){ // delete
        *p1 = ' ';
        *p2 = 'D';
        *p3 = *b;
        b++;
    } else { // insert
        *p1 = *a;
        *p2 = 'I';
        *p3 = ' ';
        a++;
    }
    p1++;
    p2++;
    p3++;
    while ( *a && *a==*b ){
        *p1 = *a;
        *p2 = ' ';
        *p3 = *b;
        p1++;
        p2++;
        p3++;
        a++;
        b++;
    }
}


int main(int argc, char * argv[]){

    if ( argc < 2 ){
        printf("No file name specified\n");
        return 0;
    }

    std::ifstream file(argv[1]);
    std::string line0,line1,line2;
    std::getline(file,line0);
    std::getline(file,line1);
    std::getline(file,line2);

    // begin estimating Levenshtein distance
    best = 0;
    bestn = 0;
    for ( n=1;n<=10;n++){ // n is the number of operations that can be in a test set
        opmax = (int)pow(3.0,n);
        first = line1.c_str();
        second = line2.c_str();
        while ( *first && *first == *second ){
            first++;
            second++;
        }
        total=0;
        while ( *first && *second ){
            bestblock();
            block_n=1;
            block_op=best_op;
            block();
            total ++;
            first += block_ia;
            second += block_ib;
        }
        // when one string is exhausted, all following ops must be insert or delete
        while(*second){
            total++;
            second++;
        }
        while(*first){
            total++;
            first++;
        }
        if ( !best || total < best ){
            best = total;
            bestn = n;
        }
    }
    // done estimating Levenshtein distance

    // dump info to prove the edit distance actually comes from a valid set of edits
    if ( argc >= 3 ){
        p1 = printline1;
        p2 = printline2;
        p3 = printline3;
        n = bestn;
        opmax = (int)pow(3.0,n);
        first = line1.c_str();
        second = line2.c_str();
        while ( *first && *first == *second ){
            *p1 = *first;
            *p2 = ' ';
            *p3 = *second;
            p1++;
            p2++;
            p3++;
            first++;
            second++;
        }
        while ( *first && *second){
            bestblock();
            block_n=1;
            block_op=best_op;
            block();
            printedit(first,second,best_op);
            first += block_ia;
            second += block_ib;
        }
        while(*second){
            *p1=' ';
            *p2='D';
            *p3=*second;
            p1++;
            p2++;
            p3++;
            second++;
        }
        while(*first){
            *p1=*first;
            *p2='I';
            *p3=' ';
            p1++;
            p2++;
            p3++;
            first++;
        }

        p1 = printline1;
        p2 = printline2;
        p3 = printline3;
        int ins=0;
        int del=0;
        int rep=0;
        while ( *p1 ){
            int a;
            for ( a=0;a<79&&p1[a];a++)
                printf("%c",p1[a]);
            printf("\n");
            p1+=a;
            for ( a=0;a<79&&p2[a];a++){
                ins += ( p2[a] == 'I' );
                del += ( p2[a] == 'D' );
                rep += ( p2[a] == 'R' );
                printf("%c",p2[a]);
            }
            printf("\n");
            p2+=a;
            for ( a=0;a<79&&p3[a];a++)
                printf("%c",p3[a]);
            printf("\n\n");
            p3+=a;
        }
        printf("Best N=%d\n",bestn);
        printf("Inserted = %d, Deleted = %d, Replaced=%d, Total = %d\nLength(line1)=%d, Length(Line2)+ins-del=%d\n",ins,del,rep,ins+del+rep,line1.length(),line2.length()+ins-del);
    }

    printf("%d, Score = %0.2f\n",best,2886*100.0/best);
    system("pause");
    return 0;
}

— 拉格萨洛特爵士
source

7

C ++ 75.0

该程序旨在使用任意文本字符串。它们的长度可以不同，只要不超过13824个字符即可。它使用1897个16位整数，相当于949个32位整数。最初我是用C编写的，但是后来意识到没有读取行的功能。

第一个命令行参数应该是文件名。如果存在第二个参数，则会打印编辑摘要。文件中的第一行将被忽略，而第二和第三行是字符串。

该算法是常规算法的双重阻止版本。它执行的操作次数基本相同，但是准确度当然要低得多，因为如果将公共子序列拆分到块的边缘，则会损失很多潜在的节省。

#include <cstring>
#include <inttypes.h>
#include <iostream>
#include <fstream>

#define M 24
#define MAXLEN (M*M*M)
#define SETMIN(V, X) if( (X) < (V) ) { (V) = (X); }
#define MIN(X, Y) ( (X) < (Y) ? (X) : (Y) )

char A[MAXLEN+1], B[MAXLEN+1];
uint16_t d0[M+1][M+1], d1[M+1][M+1], d2[M+1][M+1];

int main(int argc, char**argv)
{

    if(argc < 2)
        return 1;

    std::ifstream fi(argv[1]);

    std::string Astr, Bstr;
    for(int i = 3; i--;)
        getline(fi, i?Bstr:Astr);
    if(!fi.good()) {
        printf("Error reading file");
        return 5;
    }
    if(Astr.length() > MAXLEN || Bstr.length() > MAXLEN) {
        printf("String too long");
        return 7;
    }

    strcpy(A, Astr.c_str());
    strcpy(B, Bstr.c_str());

    uint16_t lA = Astr.length(), lB = Bstr.length();
    if(!lA || !lB) {
        printf("%d\n", lA|lB);
        return 0;
    }
    uint16_t nbA2, nbB2, bA2, bB2, nbA1, nbB1, bA1, bB1, nbA0, nbB0, bA0, bB0; //block, number of blocks
    uint16_t iA2, iB2, iA1, iB1, jA2, jB2, jA1, jB1; //start, end indices of block

    nbA2 = MIN(M, lA);
    nbB2 = MIN(M, lB);
    for(bA2 = 0; bA2 <= nbA2; bA2++) {
        iA2 = lA * (bA2-1)/nbA2,  jA2 = lA * bA2/nbA2;
        for(bB2 = 0; bB2 <= nbB2; bB2++) {
            if(!(bA2|bB2)) {
                d2[0][0] = 0;
                continue;
            }
            iB2 = lB * (bB2-1)/nbB2,  jB2 = lB * bB2/nbB2;
            d2[bA2][bB2] = ~0;
            if(bB2)
                SETMIN(d2[bA2][bB2], d2[bA2][bB2-1] + (jB2-iB2));
            if(bA2)
                SETMIN(d2[bA2][bB2], d2[bA2-1][bB2] + (jA2-iA2));

            if(bA2 && bB2) {
                nbA1 = MIN(M, jA2-iA2);
                nbB1 = MIN(M, jB2-iB2);
                for(bA1 = 0; bA1 <= nbA1; bA1++) {
                    iA1 = iA2 + (jA2-iA2) * (bA1-1)/nbA1, jA1 = iA2 + (jA2-iA2) * bA1/nbA1;
                    for(bB1 = 0; bB1 <= nbB1; bB1++) {
                        if(!(bA1|bB1)) {
                            d1[0][0] = 0;
                            continue;
                        }
                        iB1 = iB2 + (jB2-iB2) * (bB1-1)/nbB1, jB1 = iB2 + (jB2-iB2) * bB1/nbB1;
                        d1[bA1][bB1] = ~0;
                        if(bB1)
                            SETMIN(d1[bA1][bB1], d1[bA1][bB1-1] + (jB1-iB1));
                        if(bA1)
                            SETMIN(d1[bA1][bB1], d1[bA1-1][bB1] + (jA1-iA1));

                        if(bA1 && bB1) {
                            nbA0 = jA1-iA1;
                            nbB0 = jB1-iB1;
                            for(bA0 = 0; bA0 <= nbA0; bA0++) {
                                for(bB0 = 0; bB0 <= nbB0; bB0++) {
                                    if(!(bA0|bB0)) {
                                        d0[0][0] = 0;
                                        continue;
                                    }
                                    d0[bA0][bB0] = ~0;
                                    if(bB0)
                                        SETMIN(d0[bA0][bB0], d0[bA0][bB0-1] + 1);
                                    if(bA0)
                                        SETMIN(d0[bA0][bB0], d0[bA0-1][bB0] + 1);
                                    if(bA0 && bB0)
                                        SETMIN(d0[bA0][bB0], d0[bA0-1][bB0-1] + (A[iA1 + nbA0 - 1] != B[iB1 + nbB0 - 1]));
                                }
                            }
                            SETMIN(d1[bA1][bB1], d1[bA1-1][bB1-1] + d0[nbA0][nbB0]);
                        }
                    }
                }

                SETMIN(d2[bA2][bB2], d2[bA2-1][bB2-1] + d1[nbA1][nbB1]);
            }
        }
    }
    printf("%d\n", d2[nbA2][nbB2]);

    if(argc == 2)
        return 0;

    int changecost, total = 0;
    for(bA2 = nbA2, bB2 = nbB2; bA2||bB2; ) {
        iA2 = lA * (bA2-1)/nbA2,  jA2 = lA * bA2/nbA2;
        iB2 = lB * (bB2-1)/nbB2,  jB2 = lB * bB2/nbB2;
        if(bB2 && d2[bA2][bB2-1] + (jB2-iB2) == d2[bA2][bB2]) {
            total += changecost = (jB2-iB2);
            char tmp = B[jB2];
            B[jB2] = 0;
            printf("%d %d deleted {%s}\n", changecost, total, B + iB2);
            B[jB2] = tmp;
            --bB2;
        } else if(bA2 && d2[bA2-1][bB2] + (jA2-iA2) == d2[bA2][bB2]) {
            total += changecost = (jA2-iA2);
            char tmp = B[jA2];
            A[jA2] = 0;
            printf("%d %d inserted {%s}\n", changecost, total, A + iA2);
            A[jA2] = tmp;
            --bA2;
        } else {
            total += changecost = d2[bA2][bB2] - d2[bA2-1][bB2-1];
            char tmpa = A[jA2], tmpb = B[jB2];
            B[jB2] = A[jA2] = 0;
            printf("%d %d changed {%s} to {%s}\n", changecost, total, B + iB2, A + iA2);
            A[jA2] = tmpa, B[jB2] = tmpb;
            --bA2, --bB2;
        }
    }


    return 0;
}

— 费瑟姆
source

感谢您成为第一个回答者！你的分数是多少？

@Lembik好吧，假设仅基于一个示例，我已经计算出分数。

— feersum

这很棒。您认为有可能获得更高的分数吗？

3

Python，100

我确实设法在分配的内存限制中完美地计算了编辑距离。令人遗憾的是，该条目违反了挑战的两个规则，即使不是实质性的，也应以文字表达。

首先，我实际上没有将数据存储在1000个32位整数中。对于10000个字符的字符串，我的程序创建了两个仅包含+1、0或-1的10000个元素的数组。如果每个三进制数为1.585位，则可以将那些20000 Trits打包为31700位，剩下的300位足够我剩下的7个16位整数。

其次，我尚未实现显示编辑内容所需的模式。或者，我已经实现了一种打印完整编辑矩阵的模式。绝对有可能从该矩阵计算编辑路径，但是我现在没有时间实现它。

#!/usr/bin/env python

import sys

# algorithm originally from
# https://en.wikipedia.org/wiki/Levenshtein_distance#Iterative_with_two_matrix_rows

print_rows = False
if len(sys.argv) > 2:
    print_rows = True

def LevenshteinDistance(s, t):
    # degenerate cases
    if s == t:
        return 0
    if len(s) == 0:
        return len(t)
    if len(t) == 0:
        return len(s)

    # create two work vectors of integer distance deltas

    # these lists will only ever contain +1, 0, or -1
    # so they COULD be packed into 1.585 bits each
    # 15850 bits per list, 31700 bits total, leaving 300 bits for all the other variables

    # d0 is the previous row
    # initialized to 0111111... which represents 0123456...
    d0 = [1 for i in range(len(t)+1)]
    d0[0] = 0        
    if print_rows:
        row = ""
        for i in range(len(t)+1):
            row += str(i) + ", "
        print row

    # d1 is the row being calculated
    d1 = [0 for i in range(len(t)+1)]

    for i in range(len(s)-1):
        # cummulative values of cells north, west, and northwest of the current cell
        left = i+1
        upleft = i
        up = i+d0[0]
        if print_rows:
            row = str(left) + ", "
        for j in range(len(t)):
            left += d1[j]
            up += d0[j+1]
            upleft += d0[j]
            cost = 0 if (s[i] == t[j]) else 1
            d1[j + 1] = min(left + 1, up + 1, upleft + cost) - left
            if print_rows:
                row += str(left+d1[j+1]) + ", "

        if print_rows:
            print row

        for c in range(len(d0)):
            d0[c] = d1[c]

    return left+d1[j+1]

with open(sys.argv[1]) as f:
    lines = f.readlines()

perfect = lines[0]
string1 = lines[1]
string2 = lines[2]
distance = LevenshteinDistance(string1,string2)
print "edit distance: " + str(distance)
print "score: " + str(int(perfect)*100/distance) + "%"

输入示例：

2
101100
011010

详细输出示例：

0, 1, 2, 3, 4, 5, 6,
1, 1, 1, 2, 3, 4, 5,
2, 1, 2, 2, 2, 3, 4,
3, 2, 1, 2, 3, 2, 3,
4, 3, 2, 1, 2, 3, 3,
5, 4, 3, 2, 1, 2, 3,
6, 5, 4, 3, 2, 2, 2,
edit distance: 2
score: 100%

— 斯帕
source

受限内存优化

C ++，得分92.35

C ++ 75.0

Python，100