Find the one-off match

The challenge is to write code that solves the following problem.

Given two strings A and B, your code should output the start and end indices of a substring of A with the following properties.

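The substring of A should also match some substring of B, allowing at most one substitution of a single character in the string.
There should be no longer substring of A that satisfies the first property.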

For example:

A = xxxappleyyyyyyy

B = zapllezzz

The substring apple, with indices 4 8 (indexing from 1), would be a valid output.
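For checking a candidate answer on short inputs, a brute-force test of the first property can be sketched as below. This is only an illustration: the function name `matches_with_one_substitution` is made up here, indices are taken as 1-based and inclusive as in the example, and the approach is far too slow for the million-character test strings.

```cpp
#include <cstddef>
#include <string>

// Returns true if A[start..end] (1-based, inclusive) matches some equally
// long substring of B with at most one substituted character.
// Hypothetical helper for small inputs only; not part of any answer below.
bool matches_with_one_substitution(const std::string& a, const std::string& b,
                                   std::size_t start, std::size_t end) {
    if (start < 1 || end < start || end > a.size()) return false;
    const std::size_t len = end - start + 1;
    if (len > b.size()) return false;
    // Try every alignment of the window against B.
    for (std::size_t j = 0; j + len <= b.size(); ++j) {
        int mismatches = 0;
        for (std::size_t k = 0; k < len && mismatches <= 1; ++k) {
            if (a[start - 1 + k] != b[j + k]) ++mismatches;
        }
        if (mismatches <= 1) return true;
    }
    return false;
}
```

With A = xxxappleyyyyyyy and B = zapllezzz, the window 4 8 passes this check, while the longer windows 4 9 (appley) and 3 8 (xapple) each need two substitutions and fail.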

Score

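The score of your answer will be the sum of the length of your code in bytes and the time in seconds it takes on my computer when run on strings A and B of length 1 million each.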

Testing and input

I will run your code on two strings of length 1 million taken from the strings in http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/

The input will be on standard input and will simply be two strings separated by a newline.

Languages and libraries

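You can use any language that has a freely available compiler/interpreter/etc. for Linux, and any libraries that are also open source and freely available for Linux.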

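My machine

The timings will be run on my machine. This is a standard Ubuntu install on an AMD FX-8350 Eight-Core Processor. This also means I need to be able to run your code. As a consequence, please only use easily available free software and include full instructions on how to compile and run your code.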

code-challenge fastest-code

— 伊萨格

You need a more absolute definition of the score. Run time on your computer doesn't sound like a good scoring method.

— mbomb007

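@mbomb007 It is the only sensible way to measure the speed of code, and it is the one always used in fastest-code competitions on PPCG! People usually post the score from their own computer in their answer and wait for the OP to produce a definitive score. It is 100% unambiguous, at least.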

@mbomb007 That is a very widely used scoring method for fastest code.

— Optimizer

if(hash(str1 == test1 && str2 == test2)) print("100,150") else .. - any thoughts?

— 约翰·德沃夏克

@FryAmTheEggman In the very unlikely event of a tie, the first answer wins. appley needs two substitutions to match apllez. Maybe you missed that it is apll in B and not appl?

Answers:

C++, time: O(n^2), extra space: O(1)

It takes 0.2 seconds to finish the 15K data on my computer.

To compile it, use:

g++ -std=c++11 -O3 code.cpp -o code

To run it, use:

./code < INPUT_FILE_THAT_CONTAINS_TWO_LINES_SEPARATED_BY_A_LINE_BREAK

Explanation

The idea is simple: for strings s1 and s2, we try offsetting s2 by i, like this:

s1: abcabcabc
s2: bcabcab

When the offset is 3:

s1: abcabcabc
s2:    bcabcab

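Then, for each offset i, we perform a dynamic programming scan over s1[i:] and s2. For each j, let f[j, 0] be the maximum length d such that s1[j - d:j] == s2[j - i - d:j - i]. Similarly, let f[j, 1] be the maximum length d such that the strings s1[j - d:j] and s2[j - i - d:j - i] differ in at most 1 character.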

So if s1[j] == s2[j - i], we have:

f[j, 0] = f[j - 1, 0] + 1  // concat solution in f[j - 1, 0] and s1[j]
f[j, 1] = f[j - 1, 1] + 1  // concat solution in f[j - 1, 1] and s1[j]

Otherwise:

f[j, 0] = 0  // the only choice is empty string
f[j, 1] = f[j - 1, 0] + 1  // concat solution in f[j - 1, 0] and s1[j] (or s2[j - i])

And:

f[-1, 0] = f[-1, 1] = 0

Since we only need f[j - 1, :] to compute f[j, :], only O(1) extra space is used.

Finally, the maximum length is:

max(f[j, 1] for all valid j and all i)
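To follow the recurrence by hand, it can help to record both values at every position of a single offset. The sketch below does that. The function name `scan_offset` and the returned trace are illustrative additions, not part of the answer's program, which keeps only the running maximum.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// For one offset i (s2 shifted right by i relative to s1), return the pair
// (f[j, 0], f[j, 1]) for every overlapping position j, in increasing j.
// Illustrative helper; the name and return type are assumptions.
std::vector<std::pair<int, int>> scan_offset(const std::string& s1,
                                             const std::string& s2, int i) {
    std::vector<std::pair<int, int>> trace;
    const int n1 = static_cast<int>(s1.size());
    const int n2 = static_cast<int>(s2.size());
    int f0 = 0, f1 = 0;  // f[-1, 0] = f[-1, 1] = 0
    for (int j = std::max(i, 0), j_end = std::min(n1, i + n2); j < j_end; ++j) {
        if (s1[j] == s2[j - i]) {
            ++f0;         // exact run grows
            ++f1;         // one-mismatch run grows
        } else {
            f1 = f0 + 1;  // spend the single substitution on this position
            f0 = 0;       // exact run is broken
        }
        trace.emplace_back(f0, f1);
    }
    return trace;
}
```

For s1 = xxxappleyyyyyyy, s2 = zapllezzz and offset 2, the trace runs over j = 2..10 and f[j, 1] peaks at 5 when j = 7, which corresponds to the 1-based indices 4 8 from the question.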

Code

#include <algorithm>  // std::max, std::min
#include <string>
#include <cassert>
#include <iostream>

using namespace std;

int main() {
    string s1, s2;
    getline(cin, s1);
    getline(cin, s2);
    int n1, n2;
    n1 = s1.size();
    n2 = s2.size();
    int max_len = 0;
    int max_end = -1;
    for(int i = 1 - n2; i < n1; i++) {
        int f0, f1;
        int max_len2 = 0;
        int max_end2 = -1;
        f0 = f1 = 0;
        for(int j = max(i, 0), j_end = min(n1, i + n2); j < j_end; j++) {
            if(s1[j] == s2[j - i]) {
                f0 += 1;
                f1 += 1;
            } else {
                f1 = f0 + 1;
                f0 = 0;
            }
            if(f1 > max_len2) {
                max_len2 = f1;
                max_end2 = j + 1;
            }
        }
        if(max_len2 > max_len) {
            max_len = max_len2;
            max_end = max_end2;
        }
    }
    assert(max_end != -1);
    // cout << max_len << endl;
    cout << max_end - max_len + 1 << " " << max_end << endl;
}

— 射线

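Sorry, I've been looking at the code and I can't find how it takes into account the possibility of matching the strings except for one character, as in the example "apple" and "aplle". Could you explain?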

— rorlork, 2015

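@rcrmn That's what the dynamic programming part is doing. To understand it, it helps to compute f[j, 0] and f[j, 1] by hand for some simple cases. The previous code had some bugs, so I updated the post.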

— 2015

Thanks a lot for this. Do you think there might be an O(n log n) solution as well?

C++

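I tried thinking of a good algorithm to do this, but I'm a bit distracted today and couldn't think of anything that would work well. This runs in O(n^3) time, so it's very slow. The other option I thought of could have been theoretically faster but would have taken O(n^2) space, and with inputs of 1M that would have been even worse.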

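It's shameful: it takes 190 seconds for inputs of 15K. I'll try to improve it. Edit: added multiprocessing. It now takes 37 seconds for 15K input on 8 threads.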

#include <string>
#include <vector>
#include <sstream>
#include <chrono>
#include <thread>
#include <atomic>
#undef cin
#undef cout
#include <iostream>

using namespace std;

typedef pair<int, int> range;

int main(int argc, char ** argv)
{
    string a = "xxxappleyyyyyyy";
    string b = "zapllezzz";

    getline(cin, a);
    getline(cin, b);

    range longestA;
    range longestB;

    using namespace std::chrono;

    high_resolution_clock::time_point t1 = high_resolution_clock::now();

    unsigned cores = thread::hardware_concurrency(); cores = cores > 0 ? cores : 1;

    cout << "Processing on " << cores << " cores." << endl;

    atomic<int> processedCount(0);

    vector<thread> threads;

    range* longestAs = new range[cores];
    range* longestBs = new range[cores];
    for (int t = 0; t < cores; ++t)
    {
        threads.push_back(thread([&processedCount, cores, t, &a, &b, &longestBs, &longestAs]()
        {
            int la = a.length();
            int l = la / cores + (t==cores-1? la % cores : 0);
            int lb = b.length();
            int aS = t*(la/cores);

            for (int i = aS; i < aS + l; ++i)
            {
                int count = processedCount.fetch_add(1);
                if ((count+1) * 100 / la > count * 100 / la)
                {
                    cout << (count+1) * 100 / la << "%" << endl;
                }
                for (int j = 0; j < lb; ++j)
                {
                    range currentB = make_pair(j, j);
                    bool letterChanged = false;
                    for (int k = 0; k + j < lb && k + i < la; ++k)
                    {
                        if (a[i + k] == b[j + k])
                        {
                            currentB = make_pair(j, j + k);
                        }
                        else if (!letterChanged)
                        {
                            letterChanged = true;
                            currentB = make_pair(j, j + k);
                        }
                        else
                        {
                            break;
                        }
                    }
                    if (currentB.second - currentB.first > longestBs[t].second - longestBs[t].first)
                    {
                        longestBs[t] = currentB;
                        longestAs[t] = make_pair(i, i + currentB.second - currentB.first);
                    }
                }
            }
        }));
    }

    longestA = make_pair(0,0);
    for(int t = 0; t < cores; ++t)
    {
        threads[t].join();

        if (longestAs[t].second - longestAs[t].first > longestA.second - longestA.first)
        {
            longestA = longestAs[t];
            longestB = longestBs[t];
        }
    }

    high_resolution_clock::time_point t2 = high_resolution_clock::now();

    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);

    cout << "First substring at range (" << longestA.first << ", " << longestA.second << "):" << endl;
    cout << a.substr(longestA.first, longestA.second - longestA.first + 1) << endl;
    cout << "Second substring at range (" << longestB.first << ", " << longestB.second << "):" << endl;
    cout << b.substr(longestB.first, longestB.second - longestB.first + 1) << endl;
    cout << "It took me " << time_span.count() << " seconds for input lengths " << a.length() << " and " << b.length() <<"." << endl;

    char c;
    cin >> c;
    return 0;
}

— 罗洛克

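I'm really sorry that this is such a bad solution. I've been searching for an algorithm that does this in better time, but I've found nothing so far...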

— rorlork, 2015

Well, the complexity of the required task should be around O(n^4) to O(n^5), so long run times are to be expected

— hoffmale

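I believe it should be more like O(n^3) in the worst case, at least with my algorithm. Anyhow, I'm sure something can be done to improve it, like some sort of tree search, but I'm not sure how it would be implemented.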

— rorlork, 2015

Oh yes, O(n^3) it is... I had a different approach in mind that would have taken O(n^4), but that one is kind of useless by now xD

— hoffmale

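You could save a small amount of time if you change the condition in the two outer for loops from i < a.length() to i < a.length() - (longestA.second - longestA.first) (same for j and b.length()), since you won't need to process any matches shorter than your current longest one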

— hoffmale
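The loop-bound suggestion in the comment above can be sketched on a single-threaded version of the brute-force search. This is an illustration under assumptions, not the answer's code: the function name `longest_one_off_pruned` is made up, and it tracks a start position and a length rather than the inclusive ranges used in the answer.

```cpp
#include <string>
#include <utility>

// Brute-force search with tightened outer loop bounds. Returns
// {start, length} (0-based start) of the longest window of `a` that matches
// a window of `b` with at most one substituted character.
// Hypothetical single-threaded sketch of the suggestion above.
std::pair<int, int> longest_one_off_pruned(const std::string& a,
                                           const std::string& b) {
    const int la = static_cast<int>(a.size());
    const int lb = static_cast<int>(b.size());
    int best_start = 0, best_len = 0;
    // A start position is only worth trying if more than best_len
    // characters remain after it in both strings.
    for (int i = 0; i + best_len < la; ++i) {
        for (int j = 0; j + best_len < lb; ++j) {
            int k = 0;
            bool changed = false;  // has the one substitution been used?
            while (i + k < la && j + k < lb) {
                if (a[i + k] != b[j + k]) {
                    if (changed) break;
                    changed = true;
                }
                ++k;
            }
            if (k > best_len) {
                best_len = k;
                best_start = i;
            }
        }
    }
    return {best_start, best_len};
}
```

On the question's example this returns start 3 and length 5, i.e. indices 4 8 when counted from 1. The bound only trims work near the ends of the strings, so the worst case stays cubic.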

R

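It seems I overcomplicated the problem with my previous solution. This one is about 50% quicker than the previous one (23 seconds on 15k strings) and pretty simple.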

rm(list=ls(all=TRUE))
a="xxxappleyyyyyyy"
b="zapllezzz"
s=proc.time()
matchLen=1
matchIndex=1
indexA = 1
repeat {    
    i = 0
    repeat {
        srch = substring(a,indexA,indexA+matchLen+i)
        if (agrepl(srch,b,max.distance=list(insertions=0,deletions=0,substitutions=1)))
            i = i + 1
        else {
            if (i > 0) {
                matchLen = matchLen + i - 1
                matchIndex = indexA
            }
            break
        }
    }
    indexA=indexA+1
    if (indexA + matchLen > nchar(a)) break
}
c(matchIndex, matchLen + matchIndex)
print (substring(a,matchIndex, matchLen + matchIndex))
print(proc.time()-s)

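This will never be a contender because of the language, but I did have a bit of fun doing it.
I'm not sure of its complexity, but on a pair of strings of about 15k characters it takes 43 seconds using a single thread. The largest part of that is the sorting of the arrays. I tried some other libraries, but without significant improvement.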

a="xxxappleyyyyyyy"
b="zapllezzz"
s=proc.time()
N=nchar
S=substring
U=unlist
V=strsplit
A=N(a)
B=N(b)
a=S(a,1:A)
b=S(b,1:B)
a=sort(a,method="quick")
b=sort(b,method="quick")
print(proc.time()-s)
C=D=1
E=X=Y=I=0
repeat{
    if(N(a[C])>E && N(b[D])>E){
        for(i in E:min(N(a[C]),N(b[D]))){
            if (sum(U(V(S(a[C],1,i),''))==U(V(S(b[D],1,i),'')))>i-2){
                F=i
            } else break
        }
        if (F>E) {
            X=A-N(a[C])+1
            Y=X+F-1
            E=F
        }
        if (a[C]<b[D])
            C=C+1
            else
            D=D+1
    } else
        if(S(a[C],1,1)<S(b[D],1,1))C=C+1 else D=D+1
    if(C>A||D>B)break
}
c(X,Y)
print(proc.time()-s)

Method:

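Create a suffix array for each string
Order the suffix arrays
Step through each of the arrays in a staggered kind of way, comparing the beginning of each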
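The first two steps of the method can be sketched in C++ as follows. This is a naive construction for illustration only: the function name `naive_suffix_array` is an assumption, it stores start positions rather than copies of the suffixes as the R code does, and the staggered comparison step is not shown.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix array: the start positions of all suffixes of s, sorted
// lexicographically by the suffix that begins at each position.
// Each comparison may scan many characters, so this is a teaching sketch,
// not a construction suited to million-character inputs.
std::vector<int> naive_suffix_array(const std::string& s) {
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);  // 0, 1, ..., n - 1
    std::sort(sa.begin(), sa.end(), [&s](int x, int y) {
        // Compare the suffix starting at x with the suffix starting at y.
        return s.compare(x, std::string::npos, s, y, std::string::npos) < 0;
    });
    return sa;
}
```

For the string banana this gives the order 5, 3, 1, 0, 4, 2 (a, ana, anana, banana, na, nana).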

— 米奇

Of course, the easiest solution in R is to use Bioconductor.

— archaephyrryx

@archaephyrryx A Bioconductor solution would be fun.

It would be... but my quick read of the docs went way over my head. Maybe if I understood the terms :-)

— MickyT, 2015

I deleted my first comment. You can of course use whatever open source library you like for this challenge.