You do the math

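This challenge is partly an algorithms challenge, involves some math, and is partly simply a fastest-code challenge.

For some positive integer n, consider a uniformly random string of 1s and 0s of length n and call it A. Now also consider a second uniformly chosen random string of length n whose values are -1, 0, or 1 and call it B_pre. Now let B be B_pre+B_pre, that is, B_pre concatenated with itself.

Now consider the inner product of A and B[j,...,j+n-1] and call it Z_j, indexing from 1.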

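Task

The output should be a list of n+1 fractions. The ith term in the output should be the exact probability that all of the first i terms Z_j with j <= i equal 0.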
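To make the task statement concrete, here is a minimal brute-force sketch in Python. It is not one of the submitted solutions: the function name `zero_run_counts` is made up for illustration, and the approach only works for very small n because it visits all 6^n pairs of A and B_pre.

```python
from itertools import product

def zero_run_counts(n):
    """counts[i] = number of (A, B_pre) pairs whose first i+1 inner products are all 0."""
    counts = [0] * (n + 1)
    for a in product((0, 1), repeat=n):
        for b_pre in product((-1, 0, 1), repeat=n):
            b = b_pre + b_pre  # B is B_pre concatenated with itself
            for j in range(n + 1):  # windows B[j .. j+n-1], 0-indexed here
                if sum(a[k] * b[j + k] for k in range(n)) != 0:
                    break
                counts[j] += 1
    return counts

if __name__ == "__main__":
    n = 3
    total = 6 ** n  # number of equally likely (A, B_pre) pairs
    for i, c in enumerate(zero_run_counts(n), start=1):
        print(f"{i}: {c}/{total}")
```

The fractions are printed unreduced, in the same style as the test outputs in the question.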

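Score

Your score is the largest n for which your code gives the correct output in under 10 minutes on my machine.

Tie breaker

If two answers have the same score, the one submitted first wins.

In the (very, very unlikely) event that someone finds a method to get unlimited scores, the first valid proof of such a solution will be accepted.

Hint

Do not try to solve this problem mathematically; it is too hard. In my view the best way is to go back to the basic definitions of probability from high school and find smart ways to get the code to do an exhaustive enumeration of the possibilities.

Languages and libraries

You can use any language that has a freely available compiler/interpreter/etc. for Linux, and any libraries that are also freely available for Linux.

My machine. The timings will be run on my machine. This is a standard Ubuntu install on an AMD FX-8350 Eight-Core Processor. This also means I need to be able to run your code. As a consequence, only use easily available free software and please include full instructions on how to compile and run your code.

Some test outputs. Consider just the first output for each n, that is, the one for i=1. For n from 1 to 13 they should be: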

 1: 4/6
 2: 18/36
 3: 88/216
 4: 454/1296
 5: 2424/7776
 6: 13236/46656
 7: 73392/279936
 8: 411462/1679616
 9: 2325976/10077696
10: 13233628/60466176
11: 75682512/362797056
12: 434662684/2176782336
13: 2505229744/13060694016

You can also find a general formula for i=1 at http://oeis.org/A081671.

Leaderboard (split by language)

n = 15. Python + parallel python + pypy in 1min49s by Jakube
n = 17. C++ in 3min37s by Keith Randall
n = 16. C++ in 2min38s by kuroi neko

— Martin Ender

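@Knerd How can I say no? I will try to work out how to run your code on Linux, but any help is much appreciated.

OK, sorry for deleting the comments. For everyone who did not read them: the question was whether F# or C# is allowed :)

— Knerd, 2014

Another question: do you perhaps have an example of a valid input and output?

— Knerd, 2014

What is your graphics card? This looks like a job for a GPU.

— Michael M.

@Knerd I added a table of probabilities to the question instead. I hope it is helpful.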

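Answers:

C++, n = 18 in 9 minutes on 8 threads

(Let me know if it runs in under 10 minutes on your machine.)

I take advantage of several forms of symmetry in the B array. Those are cyclic (shift by one position), reversal (reverse the order of the elements), and sign (take the negative of each element). First I compute the list of Bs we need to try and their weights. Then each B is run through a fast routine (using bit-count instructions) for all 2^n values of A.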
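As a rough illustration of the weighting step, the following Python sketch groups all B_pre vectors into classes under the three symmetries and records each class size as its weight. It is not the C++ program below, which works on base-3 integers; the name `symmetry_classes` and the tuple representation are assumptions made for readability.

```python
from itertools import product

def symmetry_classes(n):
    """Map one representative B_pre per class to the class size (its weight)."""
    seen = set()
    weights = {}
    for b in product((-1, 0, 1), repeat=n):
        if b in seen:
            continue
        orbit = set()
        for v in (b, b[::-1]):                 # identity and reversal
            for s in (1, -1):                  # identity and sign flip
                signed = tuple(s * x for x in v)
                for r in range(n):             # all cyclic shifts
                    orbit.add(signed[r:] + signed[:r])
        seen |= orbit
        weights[b] = len(orbit)
    return weights

if __name__ == "__main__":
    w = symmetry_classes(5)
    # The weights must add up to 3^n, since every B_pre lies in exactly one class.
    print(len(w), "classes,", sum(w.values()), "vectors in total")
```

Only one vector per class then has to be run against every A, with its contribution multiplied by the weight.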

Here is the result for n == 18:

> time ./a.out 18
 1: 16547996212044 / 101559956668416
 2:  3120508430672 / 101559956668416
 3:   620923097438 / 101559956668416
 4:   129930911672 / 101559956668416
 5:    28197139994 / 101559956668416
 6:     6609438092 / 101559956668416
 7:     1873841888 / 101559956668416
 8:      813806426 / 101559956668416
 9:      569051084 / 101559956668416
10:      510821156 / 101559956668416
11:      496652384 / 101559956668416
12:      493092812 / 101559956668416
13:      492186008 / 101559956668416
14:      491947940 / 101559956668416
15:      491889008 / 101559956668416
16:      449710584 / 101559956668416
17:      418254922 / 101559956668416
18:      409373626 / 101559956668416

real    8m55.854s
user    67m58.336s
sys 0m5.607s

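Compile the program below with g++ --std=c++11 -O3 -mpopcnt dot.cc (on Linux you will likely also need -pthread, as noted in the comments below).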

#include <stdio.h>
#include <stdlib.h>
#include <vector>
#include <thread>
#include <mutex>
#include <chrono>

using namespace std;

typedef long long word;

word n;

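void inner(word bpos, word bneg, word w, word *cnt) {
    word maxi = n-1;
    // try every A, encoded as the n low bits of a
    for(word a = (1<<n)-1; a >= 0; a--) {
        word m = a;
        // slide A along the doubled B; stop at the first non-zero inner product
        for(word i = maxi; i >= 0; i--, m <<= 1) {
            // popcountll, not popcount: m is 64-bit and grows to 2n-1 bits,
            // which the unsigned int version would truncate beyond 32 bits
            if(__builtin_popcountll(m&bpos) != __builtin_popcountll(m&bneg))
                break;
            cnt[i]+=w;
        }
    }
}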

word pow(word n, word e) {
    word r = 1;
    for(word i = 0; i < e; i++) r *= n;
    return r;
}

typedef struct {
    word b;
    word weight;
} Bentry;

mutex block;
Bentry *bqueue;
word bhead;
word btail;
word done = -1;

word maxb;

// compute -1*b
word bneg(word b) {
    word w = 1;
    for(word i = 0; i < n; i++, w *= 3) {
        word d = b / w % 3;
        if(d == 1)
            b += w;
        if(d == 2)
            b -= w;
    }
    return b;
}

// rotate b one position
word brot(word b) {
    b *= 3;
    b += b / maxb;
    b %= maxb;
    return b;
}

// reverse b
word brev(word b) {
    word r = 0;
    for(word i = 0; i < n; i++) {
        r *= 3;
        r += b % 3;
        b /= 3;
    }
    return r;
}

// individual thread's work routine
void work(word *cnt) {
    while(true) {
        // get a queue entry to work on
        block.lock();
        if(btail == done) {
            block.unlock();
            return;
        }
        if(bhead == btail) {
            block.unlock();
            this_thread::sleep_for(chrono::microseconds(10));
            continue;
        }
        word i = btail++;
        block.unlock();

        // thread now owns bqueue[i], work on it
        word b = bqueue[i].b;
        word w = 1;
        word bpos = 0;
        word bneg = 0;
        for(word j = 0; j < n; j++, b /= 3) {
            word d = b % 3;
            if(d == 1)
                bpos |= 1 << j;
            if(d == 2)
                bneg |= 1 << j;
        }
        bpos |= bpos << n;
        bneg |= bneg << n;
        inner(bpos, bneg, bqueue[i].weight, cnt);
    }
}

int main(int argc, char *argv[]) {
    n = atoi(argv[1]);

    // allocate work queue
    maxb = pow(3, n);
    bqueue = (Bentry*)(malloc(maxb*sizeof(Bentry)));

    // start worker threads
    word procs = thread::hardware_concurrency();
    vector<thread> threads;
    vector<word*> counts;
    for(word p = 0; p < procs; p++) {
        word *cnt = (word*)calloc(64+n*sizeof(word), 1);
        threads.push_back(thread(work, cnt));
        counts.push_back(cnt);
    }

    // figure out which Bs we actually want to test, and with which weights
    bool *bmark = (bool*)calloc(maxb, 1);
    for(word i = 0; i < maxb; i++) {
        if(bmark[i]) continue;
        word b = i;
        word w = 0;
        for(word j = 0; j < 2; j++) {
            for(word k = 0; k < 2; k++) {
                for(word l = 0; l < n; l++) {
                    if(!bmark[b]) {
                        bmark[b] = true;
                        w++;
                    }
                    b = brot(b);
                }
                b = bneg(b);
            }
            b = brev(b);
        }
        bqueue[bhead].b = i;
        bqueue[bhead].weight = w;
        block.lock();
        bhead++;
        block.unlock();
    }
    block.lock();
    done = bhead;
    block.unlock();

    // add up results from threads
    word *cnt = (word*)calloc(n,sizeof(word));
    for(word p = 0; p < procs; p++) {
        threads[p].join();
        for(int i = 0; i < n; i++) cnt[i] += counts[p][i];
    }
    for(word i = 0; i < n; i++)
        printf("%2lld: %14lld / %14lld\n", i+1, cnt[n-1-i], maxb<<n);
    return 0;
}

— Keith Randall

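Great, that relieves me from doing more work on my own pet monster...

Thanks for this. You have the current winning entry. We must remember -pthread again. I got to n=17 on my machine.

Oops, you should have got the full bounty. Sorry, I missed the deadline.

@Lembik: No problem.

— Keith Randall, 2014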

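Python 2 using pypy and pp: n = 15 in 3 minutes

Also just a simple brute force. Interestingly, I get nearly the same speed as kuroi neko with C++. My code can reach n = 12 in about 5 minutes. And I only run it on one virtual core.

Edit: Reduce search space by a factor of `n`

I noticed that a cycled vector A* of A produces the same counts (and therefore the same probabilities) as the original vector A when I iterate over B. For example, the vector (1, 1, 0, 1, 0, 0) has the same probabilities as each of the vectors (1, 0, 1, 0, 0, 1), (0, 1, 0, 0, 1, 1), (1, 0, 0, 1, 1, 0), (0, 0, 1, 1, 0, 1) and (0, 1, 1, 0, 1, 0) when choosing a random B. Therefore I don't have to iterate over each of these 6 vectors, but only over one of them, replacing count[i] += 1 with count[i] += cycle_number.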
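The claim is easy to check numerically for the example vector. The sketch below is only a sanity check, not part of the submitted program; `counts_for_A` and `rotations` are hypothetical helper names. It counts, for a fixed A and all 3^n choices of B_pre, how many give a zero run of each length, and compares the result across all rotations of A.

```python
from itertools import product

def counts_for_A(a):
    """counts[j] = number of B_pre for which the first j+1 inner products with A are 0."""
    n = len(a)
    counts = [0] * n
    for b_pre in product((-1, 0, 1), repeat=n):
        b = b_pre + b_pre
        for j in range(n):
            if sum(a[k] * b[j + k] for k in range(n)) != 0:
                break
            counts[j] += 1
    return counts

def rotations(a):
    """All cyclic shifts of the tuple a (duplicates included for periodic vectors)."""
    return [a[r:] + a[:r] for r in range(len(a))]

if __name__ == "__main__":
    a = (1, 1, 0, 1, 0, 0)
    reference = counts_for_A(a)
    print(reference)
    print(all(counts_for_A(r) == reference for r in rotations(a)))
```

Since the six rotations give identical counts, one representative with a weight of 6 is enough.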

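This reduces the complexity from Theta(6^n) to Theta(6^n / n). Therefore for n = 13 it is about 13 times as fast as my previous version. It calculates n = 13 in about 2 minutes 20 seconds. For n = 14 it is still a little too slow: it takes about 13 minutes.

Edit 2: Multi-core programming

I was not really happy with that improvement, so I decided to also try executing my program on multiple cores. On my 2+2 cores I can now calculate n = 14 in about 7 minutes. Only a factor of 2 improvement.

The code is available in this github repo: Link. The multi-core programming makes it a little bit ugly.

Edit 3: Reducing search space for `A` vectors and `B` vectors

I noticed the same mirror symmetry for the vectors A as kuroi neko did. I am still not sure why this works (and whether it works for every n).

The reduction of the search space for B vectors is a bit cleverer. I replaced the generation of the vectors (itertools.product) with my own function. Basically, I start with an empty list and put it on a stack. Until the stack is empty, I remove a list; if it is shorter than n, I generate 3 other lists (by appending -1, 0, 1) and push them onto the stack. If the list has length n, I can evaluate the sums.

Now that I generate the vectors myself, I can filter them depending on whether I can still reach sum = 0 or not. For example, if my vector A is (1, 1, 1, 0, 0) and my vector B looks like (1, 1, ?, ?, ?), I know that I cannot fill the ? with values so that A*B = 0. So I don't have to iterate over all 27 (3^3) vectors B of the form (1, 1, ?, ?, ?).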
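A stripped-down sketch of that stack-based generator with pruning might look as follows. It only filters on the first inner product (A*B = 0) and is written for Python 3, so it is simpler than the actual code further down; the function name is hypothetical.

```python
def b_vectors_with_zero_product(a):
    """Yield every B in {-1, 0, 1}^n with A.B == 0, dropping dead prefixes early."""
    n = len(a)
    # ones_left[i] = number of 1s in a[i:], i.e. the most the partial sum can still change by
    ones_left = [sum(a[i:]) for i in range(n)] + [0]
    stack = [((), 0)]  # (prefix of B, partial inner product of that prefix with A)
    while stack:
        prefix, partial = stack.pop()
        index = len(prefix)
        if index == n:
            yield prefix  # partial is necessarily 0 here because ones_left[n] == 0
            continue
        for v in (-1, 0, 1):
            new_partial = partial + v * a[index]
            # keep the prefix only if the remaining 1s of A can still cancel the sum
            if abs(new_partial) <= ones_left[index + 1]:
                stack.append((prefix + (v,), new_partial))

if __name__ == "__main__":
    found = list(b_vectors_with_zero_product((1, 1, 1, 0, 0)))
    print(len(found), "of", 3 ** 5, "vectors survive")
```

For the example A = (1, 1, 1, 0, 0), the prefix (1, 1) is discarded as soon as it is built, because only one 1 of A is left to cancel a partial sum of 2.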

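We can improve on this if we skip the computation for i = 1. As noted in the question, the values for i = 1 are the sequence A081671. There are many ways to calculate those. I chose the simple recurrence: a(n) = (4*(2*n-1)*a(n-1) - 12*(n-1)*a(n-2)) / n. Since we can calculate i = 1 in basically no time, we can filter out more vectors B. For example, take A = (0, 1, 0, 1, 1) and B = (1, -1, ?, ?, ?). We can ignore the vectors where the first ? = 1, because A * cycled(B) > 0 for all of them. I hope you can follow. It is probably not the best example.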
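The recurrence can be sketched in a few lines. The starting values a(0) = 1 and a(1) = 4 are an assumption taken from the start of the sequence (the question's table gives 4 for n = 1), and this is one possible definition of a helper like `oeis_A081671`, not necessarily the one in the repository.

```python
def oeis_A081671(n):
    """a(n) = (4*(2n-1)*a(n-1) - 12*(n-1)*a(n-2)) / n, assuming a(0) = 1 and a(1) = 4."""
    if n == 0:
        return 1
    prev, cur = 1, 4
    for k in range(2, n + 1):
        # the division is exact because every term of the sequence is an integer
        prev, cur = cur, (4 * (2 * k - 1) * cur - 12 * (k - 1) * prev) // k
    return cur

if __name__ == "__main__":
    for n in range(1, 9):
        print(f"{n}: {oeis_A081671(n)}/{6 ** n}")
```

The printed numerators can be compared directly with the i = 1 column in the question.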

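With this, I can calculate n = 15 in 6 minutes.

Edit 4:

Quickly implemented kuroi neko's great idea, namely that B and -B produce the same results. Speedup x2. The implementation is only a quick hack, though. n = 15 in 3 minutes.

Code:

For the complete code visit Github. The following code only represents the main features. I left out imports, multi-core programming, printing the results, ...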

count = [0] * n
count[0] = oeis_A081671(n)

#generating all important vector A
visited = set(); todo = dict()
for A in product((0, 1), repeat=n):
    if A not in visited:
        # generate all vectors, which have the same probability
        # mirrored and cycled vectors
        same_probability_set = set()
        for i in range(n):
            tmp = [A[(i+j) % n] for j in range(n)]
            same_probability_set.add(tuple(tmp))
            same_probability_set.add(tuple(tmp[::-1]))
        visited.update(same_probability_set)
        todo[A] = len(same_probability_set)

# for each vector A, create all possible vectors B
stack = []
for A, cycled_count in todo.iteritems():  # todo maps each representative A to its class size
    ones = [sum(A[i:]) for i in range(n)] + [0]
    # + [0], so that later ones[n] doesn't throw an exception
    stack.append(([0] * n, 0, 0, 0, False))

    while stack:
        B, index, sum1, sum2, used_negative = stack.pop()
        if index < n:
            # fill vector B[index] in all possible ways,
            # so that it's still possible to reach 0.
            if used_negative:
                for v in (-1, 0, 1):
                    sum1_new = sum1 + v * A[index]
                    sum2_new = sum2 + v * A[index - 1 if index else n - 1]
                    if abs(sum1_new) <= ones[index+1]:
                        if abs(sum2_new) <= ones[index] - A[n-1]:
                            C = B[:]
                            C[index] = v
                            stack.append((C, index + 1, sum1_new, sum2_new, True))
            else:
                for v in (0, 1):
                    sum1_new = sum1 + v * A[index]
                    sum2_new = sum2 + v * A[index - 1 if index else n - 1]
                    if abs(sum1_new) <= ones[index+1]:
                        if abs(sum2_new) <= ones[index] - A[n-1]:
                            C = B[:]
                            C[index] = v
                            stack.append((C, index + 1, sum1_new, sum2_new, v == 1))
        else:
            # B is complete, calculate the sums
            count[1] += cycled_count  # we know that the sum = 0 for i = 1
            for i in range(2, n):
                sum_prod = 0
                for j in range(n-i):
                    sum_prod += A[j] * B[i+j]
                for j in range(i):
                    sum_prod += A[n-i+j] * B[j]
                if sum_prod:
                    break
                else:
                    if used_negative:
                        count[i] += 2*cycled_count
                    else:
                        count[i] += cycled_count

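Usage:

You have to install pypy (for Python 2!!!). The parallel python module is not ported to Python 3. Then you have to install the parallel python module pp-1.6.4.zip. Extract it, cd into the folder and call pypy setup.py install.

Then you can call my program with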

pypy you-do-the-math.py 15

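It will automatically determine the number of CPUs. There may be some error messages after the program finishes; just ignore them. n = 16 should be possible on your machine.

Output: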

Calculation for n = 15 took 2:50 minutes

 1  83940771168 / 470184984576  17.85%
 2  17379109692 / 470184984576   3.70%
 3   3805906050 / 470184984576   0.81%
 4    887959110 / 470184984576   0.19%
 5    223260870 / 470184984576   0.05%
 6     67664580 / 470184984576   0.01%
 7     30019950 / 470184984576   0.01%
 8     20720730 / 470184984576   0.00%
 9     18352740 / 470184984576   0.00%
10     17730480 / 470184984576   0.00%
11     17566920 / 470184984576   0.00%
12     17521470 / 470184984576   0.00%
13     17510280 / 470184984576   0.00%
14     17507100 / 470184984576   0.00%
15     17506680 / 470184984576   0.00%

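Notes and ideas:

I have an i7-4600m processor with 2 cores and 4 threads. It does not matter whether I use 2 or 4 threads. The CPU usage is 50% with 2 threads and 100% with 4 threads, but it still takes the same amount of time. I don't know why. I checked that each thread only gets half the amount of data when there are 4 threads, checked the results, ...
I use a lot of lists. Python is not very efficient at storing them and I have to copy lots of lists, so I thought of using an integer instead. I could use the bits 00 (for 0) and 11 (for 1) in the vector A, and the bits 10 (for -1), 00 (for 0) and 01 (for 1) in the vector B. For the product of A and B, I would only have to calculate A & B and count the 01 and 10 blocks. Cycling can be done by shifting the vector and using masks, ... I actually implemented all this; you can find it in some of my older commits on Github. But it turned out to be slower than with lists. I guess pypy really optimizes list operations.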
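For readers who want to see the integer encoding spelled out, here is a minimal sketch of the idea in Python 3. It is a reconstruction from the description above, not the code from the older commits; the helper names `pack_a`, `pack_b` and `inner_product` are invented for this example.

```python
def pack_a(a):
    """Encode each 0/1 entry of A as the 2-bit block 00 or 11."""
    word = 0
    for i, x in enumerate(a):
        if x:
            word |= 0b11 << (2 * i)
    return word

def pack_b(b):
    """Encode each entry of B as 10 (for -1), 00 (for 0) or 01 (for +1)."""
    code = {-1: 0b10, 0: 0b00, 1: 0b01}
    word = 0
    for i, x in enumerate(b):
        word |= code[x] << (2 * i)
    return word

def inner_product(packed_a, packed_b, n):
    """Inner product = (number of 01 blocks) - (number of 10 blocks) in A & B."""
    both = packed_a & packed_b
    low_mask = int("01" * n, 2)                      # low bit of every 2-bit block
    plus = bin(both & low_mask).count("1")           # surviving 01 blocks
    minus = bin(both & (low_mask << 1)).count("1")   # surviving 10 blocks
    return plus - minus

if __name__ == "__main__":
    a, b = (0, 1, 0, 1, 1), (1, -1, 0, 1, -1)
    print(inner_product(pack_a(a), pack_b(b), len(a)))
```

The 11 block acts as a pass-through mask for the matching B entry, while 00 blanks it out, which is why a single AND is enough.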

— Jakube

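On my PC, the n = 12 run takes 7:25 while my C++ piece of junk took about 1:23, which makes it about 5 times faster. With only two true cores, my CPU gains something like a factor of 2.5 compared with a single-threaded application, so a true 8-core CPU should run something like 3 times faster, and that is not counting the basic single-core speed improvement over my ageing i3-2100. Whether going through all these C++ hoops to tackle an exponentially growing computation time is worth the effort is debatable, though.

I'm getting a feeling of codegolf.stackexchange.com/questions/41021/... Would the de Bruijn sequence be useful?

— kennytm, 2014

About multithreading, you could squeeze a bit more out of your 2+2 cores by locking each thread on one of them. The x2 gain is due to the scheduler shifting your threads around each time a matchstick is moved in the system. With core locking, you would probably get a x2.5 gain instead. No idea whether Python allows setting the processor affinity, though.

Thanks, I will look into it. But I am pretty much a newbie in multithreading.

— Jakube, 2014

nbviewer.ipython.org/gist/minrk/5500077 has some mention of this, albeit using a different tool for parallelism.

Woolly Bully - C++ - way too slow

Well, since better programmers took on the C++ implementation, I am calling it quits for this one.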

#include <cstdlib>
#include <cmath>
#include <ctime>   // time()
#include <vector>
#include <bitset>
#include <thread>  // thread::hardware_concurrency()
#include <future>
#include <iostream>
#include <iomanip>

using namespace std;

/*
6^^n events will be generated, so the absolute max
that can be counted by a b bits integer is
E(b*log(2)/log(6)), i.e. n=24 for a 64 bits counter

To enumerate 3 possible values of a size n vector we need
E(n*log(3)/log(2))+1 bits, i.e. 39 bits
*/
typedef unsigned long long Counter; // counts up to 6^^24

typedef unsigned long long Benumerator; // 39 bits
typedef unsigned long      Aenumerator; // 24 bits

#define log2_over_log6 0.3869

#define A_LENGTH ((size_t)(8*sizeof(Counter)*log2_over_log6))
#define B_LENGTH (2*A_LENGTH)

typedef bitset<B_LENGTH> vectorB;

typedef vector<Counter> OccurenceCounters;

// -----------------------------------------------------------------
// multithreading junk for CPUs detection and allocation
// -----------------------------------------------------------------
int number_of_CPUs(void)
{
    int res = thread::hardware_concurrency();
    return res == 0 ? 8 : res;
}

#ifdef __linux__
#include <sched.h>
void lock_on_CPU(int cpu)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);
}
#elif defined (_WIN32)
#include <Windows.h>
#define lock_on_CPU(cpu) SetThreadAffinityMask(GetCurrentThread(), 1 << cpu)
#else
// #warning is not really standard, so this might still cause compiler errors on some platforms. Sorry about that.
#warning "Thread processor affinity settings not supported. Performances might be improved by providing a suitable alternative for your platform"
#define lock_on_CPU(cpu)
#endif

// -----------------------------------------------------------------
// B values generator
// -----------------------------------------------------------------
struct Bvalue {
    vectorB p1;
    vectorB m1;
};

struct Bgenerator {
    int n;                 // A length
    Aenumerator stop;      // computation limit
    Aenumerator zeroes;    // current zeroes pattern
    Aenumerator plusminus; // current +1/-1 pattern
    Aenumerator pm_limit;  // upper bound of +1/-1 pattern

    Bgenerator(int n, Aenumerator start=0, Aenumerator stop=0) : n(n), stop(stop)
    {
        // initialize generator so that first call to next() will generate first value
        zeroes    = start - 1;
        plusminus = -1;
        pm_limit  = 0;
    }

    // compute current B value
    Bvalue value(void)
    {
        Bvalue res;
        Aenumerator pm = plusminus;
        Aenumerator position = 1;
        int i_pm = 0;
        for (int i = 0; i != n; i++)
        {
            if (zeroes & position)
            {
                if (i_pm == 0)  res.p1 |= position; // first non-zero value fixed to +1
                else         
                {
                    if (pm & 1) res.m1 |= position; // next non-zero values
                    else        res.p1 |= position;
                    pm >>= 1;
                }
                i_pm++;
            }
            position <<= 1;
        }
        res.p1 |= (res.p1 << n); // concatenate 2 Bpre instances
        res.m1 |= (res.m1 << n);
        return res;
    }

    // next value
    bool next(void)
    {
        if (++plusminus == pm_limit)
        {
            if (++zeroes == stop) return false;
            plusminus = 0;
            pm_limit = (1 << vectorB(zeroes).count()) >> 1;
        }
        return true;
    }

    // calibration: produces ranges that will yield the approximate same number of B values
    vector<Aenumerator> calibrate(int segments)
    {
        // setup generator for the whole B range
        zeroes = 0;
        stop = 1 << n;
        plusminus = -1;
        pm_limit = 0;

        // divide range into (nearly) equal chunks
        Aenumerator chunk_size = ((Aenumerator)pow (3,n)-1) / 2 / segments;

        // generate bounds for zeroes values
        vector<Aenumerator> res(segments + 1);
        int bound = 0;
        res[bound] = 1;
        Aenumerator count = 0;
        while (next()) if (++count % chunk_size == 0) res[++bound] = zeroes;
        res[bound] = stop;
        return res;
    }
};

// -----------------------------------------------------------------
// equiprobable A values merging
// -----------------------------------------------------------------
static char A_weight[1 << A_LENGTH];
struct Agroup {
    vectorB value;
    int     count;
    Agroup(Aenumerator a = 0, int length = 0) : value(a), count(length) {}
};
static vector<Agroup> A_groups;

Aenumerator reverse(Aenumerator n) // this works on N-1 bits for a N bits word
{
    Aenumerator res = 0;
    if (n != 0) // must have at least one bit set for the rest to work
    {
        // construct left-padded reverse value
        for (int i = 0; i != 8 * sizeof(n)-1; i++)
        {
            res |= (n & 1);
            res <<= 1;
            n >>= 1;
        }

        // shift right to eliminate trailing zeroes
        while (!(res & 1)) res >>= 1;
    }
    return res;
}

void generate_A_groups(int n)
{
    static bitset<1 << A_LENGTH> lookup(0);
    Aenumerator limit_A = (Aenumerator)pow(2, n);
    Aenumerator overflow = 1 << n;
    for (char & w : A_weight) w = 0;

    // gather rotation cycles
    for (Aenumerator a = 0; a != limit_A; a++)
    {
        Aenumerator rotated = a;
        int cycle_length = 0;
        for (int i = 0; i != n; i++)
        {
            // check for new cycles
            if (!lookup[rotated])
            {
                cycle_length++;
                lookup[rotated] = 1;
            }

            // rotate current value
            rotated <<= 1;
            if (rotated & overflow) rotated |= 1;
            rotated &= (overflow - 1);
        }

        // store new cycle
        if (cycle_length > 0) A_weight[a] = cycle_length;
    }

    // merge symmetric groups
    for (Aenumerator a = 0; a != limit_A; a++)
    {
        // skip already grouped values
        if (A_weight[a] == 0) continue;

        // regroup a symmetric pair
        Aenumerator r = reverse(a);
        if (r != a)
        {
            A_weight[a] += A_weight[r];
            A_weight[r] = 0;
        }  
    }

    // generate groups
    for (Aenumerator a = 0; a != limit_A; a++)
    {
        if (A_weight[a] != 0) A_groups.push_back(Agroup(a, A_weight[a]));
    }
}

// -----------------------------------------------------------------
// worker thread
// -----------------------------------------------------------------
OccurenceCounters solve(int n, int index, Aenumerator Bstart, Aenumerator Bstop)
{
    OccurenceCounters consecutive_zero_Z(n, 0);  // counts occurrences of the first i terms of Z being 0

    // lock on assigned CPU
    lock_on_CPU(index);

    // enumerate B vectors
    Bgenerator Bgen(n, Bstart, Bstop);
    while (Bgen.next())
    {
        // get next B value
        Bvalue B = Bgen.value();

        // enumerate A vector groups
        for (const auto & group : A_groups)
        {
            // count consecutive occurrences of inner product equal to zero
            vectorB sliding_A(group.value);
            for (int i = 0; i != n; i++)
            {
                if ((sliding_A & B.p1).count() != (sliding_A & B.m1).count()) break;
                consecutive_zero_Z[i] += group.count;
                sliding_A <<= 1;
            }
        }
    }
    return consecutive_zero_Z;
}

// -----------------------------------------------------------------
// main
// -----------------------------------------------------------------
#define die(msg) { cout << msg << endl; exit (-1); }

int main(int argc, char * argv[])
{
    int n = argc == 2 ? atoi(argv[1]) : 16; // arbitrary value for debugging
    if (n < 1 || n > 24) die("vectors of length between 1 and 24 is all I can (try to) compute, guv");

    auto begin = time(NULL);

    // one worker thread per CPU
    int num_workers = number_of_CPUs();

    // regroup equiprobable A values
    generate_A_groups(n);

    // compute B generation ranges for proper load balancing
    vector<Aenumerator> ranges = Bgenerator(n).calibrate(num_workers);

    // set workers to work
    vector<future<OccurenceCounters>> workers(num_workers);
    for (int i = 0; i != num_workers; i++)
    {
        workers[i] = async(
            launch::async, // without this parameter, C++ will decide whether execution shall be sequential or asynchronous (isn't C++ fun?).
            solve, n, i, ranges[i], ranges[i+1]); 
    }

    // collect results
    OccurenceCounters result(n + 1, 0);
    for (auto& worker : workers)
    {
        OccurenceCounters partial = worker.get();
        for (size_t i = 0; i != partial.size(); i++) result[i] += partial[i]*2; // each result counts for a symmetric B pair
    }
    for (Counter & res : result) res += (Counter)1 << n; // add null B vector contribution
    result[n] = result[n - 1];                           // the last two probabilities are equal by construction

    auto duration = time(NULL) - begin;

    // output
    cout << "done in " << duration / 60 << ":" << setw(2) << setfill('0') << duration % 60 << setfill(' ')
        << " by " << num_workers << " worker thread" << ((num_workers > 1) ? "s" : "") << endl;
    Counter events = (Counter)pow(6, n);
    int width = (int)log10(events) + 2;
    cout.precision(5);
    for (int i = 0; i <= n; i++) cout << setw(2) << i << setw(width) << result[i] << " / " << events << " " << fixed << (float)result[i] / events << endl;

    return 0;
}

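Building the executable

It is a standalone C++11 source that compiles without warnings and runs smoothly on:

Win7 & MSVC2013
Win7 & MinGW g++ 4.7
Ubuntu & g++ 4.8 (in a VirtualBox VM with 2 CPUs allocated)

If you compile with g++, use: g++ -O3 -pthread -std=c++11
Forgetting about -pthread will produce a nice and friendly core dump.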

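Optimizations

The last Z term is equal to the first (Bpre x A in both cases), so the last two results are always equal, which dispenses with computing the last Z value. The gain is negligible, but coding it costs nothing, so you might as well use it.

As Jakube discovered, all cyclic rotations of a given A vector produce the same probabilities. You can compute these with a single instance of A and multiply the result by the number of its distinct rotations. Rotation groups can easily be pre-computed in a negligible amount of time, so this is a huge net speed gain. Since an n-length vector has up to n distinct rotations, the complexity drops from roughly O(6^n) to O(6^n / n), basically going one step further for the same computation time.

It appears that pairs of symmetric patterns also generate the same probabilities, for instance 100101 and 101001. I have no mathematical proof of that, but intuitively, when presented with all possible B patterns, each symmetric A value will be convolved with the corresponding symmetric B value for the same global result. This allows regrouping some more A vectors, for an approximate 30% reduction in the number of A groups.

WRONG: For some semi-mysterious reason, all patterns with only one or two bits set produce the same result. This does not represent that many distinct groups, but still they can be merged at virtually no cost.

The vectors B and -B (B with all components multiplied by -1) produce the same probabilities (for instance [1, 0, -1, 1] and [-1, 0, 1, -1]). Except for the null vector (all components equal to 0), B and -B form a pair of distinct vectors. This allows cutting the number of B values in half by considering only one vector of each pair and multiplying its contribution by 2, then adding the known global contribution of the null B to each probability only once.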

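How it works

The number of B values is huge (3^n), so pre-computing them would require indecent amounts of memory, which would slow down the computation and eventually exhaust the available RAM.
Unfortunately, I could not find a simple way to enumerate the half set of optimized B values, so I resorted to coding a dedicated generator.

The mighty B generator was a lot of fun to code, though languages that support yield mechanisms would have allowed programming it in a much more elegant way.
In a nutshell, it considers the "skeleton" of a Bpre vector as a binary vector where 1s represent actual -1 or +1 values.
Among all these potential +1/-1 values, the first one is fixed to +1 (thus selecting one of the two possible B / -B vectors), and all remaining possible +1/-1 combinations are enumerated.
Lastly, a simple calibration system makes sure each worker thread processes a range of values of approximately the same size.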
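In a language with generators, the skeleton idea can be sketched quite compactly. The following Python version is only an illustration of the enumeration order described above, not a port of the C++ `Bgenerator` (it has no start/stop ranges or calibration), and the name `half_b_vectors` is hypothetical.

```python
def half_b_vectors(n):
    """Yield exactly one vector from every {B, -B} pair (the null vector is skipped)."""
    for skeleton in range(1, 1 << n):        # bit i set means B[i] is non-zero
        positions = [i for i in range(n) if (skeleton >> i) & 1]
        free = positions[1:]                  # the first non-zero entry is fixed to +1
        for signs in range(1 << len(free)):   # every sign choice for the remaining entries
            b = [0] * n
            b[positions[0]] = 1
            for k, i in enumerate(free):
                b[i] = -1 if (signs >> k) & 1 else 1
            yield tuple(b)

if __name__ == "__main__":
    vectors = list(half_b_vectors(4))
    print(len(vectors), "vectors, expected", (3 ** 4 - 1) // 2)
```

Each yielded vector stands for itself and its negative, so its contribution is counted twice, and the null vector is added separately.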

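A values are heavily filtered to regroup them into equiprobable chunks.
This is done in a pre-computing phase that examines all possible values by brute force.
This part has a negligible O(2^n) execution time and does not need to be optimized (the code being already unreadable enough as it is!).

To evaluate the inner product (which only needs to be tested against zero), the -1 and 1 components of B are regrouped into binary vectors.
The inner product is null when (and only when) there is an equal number of +1s and -1s among the B values corresponding to non-zero A values.
This can be computed with simple masking and bit-count operations, helped by std::bitset, which produces reasonably efficient bit-count code without having to resort to ugly intrinsic instructions.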
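The masking trick can be shown in isolation. This Python sketch mirrors the structure of the inner loop (two masks for the +1 and -1 positions of the doubled B, and A shifted left by one per step), but the function names and the use of plain integers instead of std::bitset are assumptions made for the example.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")

def zero_run_length(a_bits, b_pre, n):
    """Number of leading windows j = 0..n-1 whose inner product with A is 0.

    a_bits has bit k set when A[k] == 1; b_pre is a sequence over {-1, 0, 1}.
    """
    p1 = sum(1 << i for i, x in enumerate(b_pre) if x == 1)    # +1 positions
    m1 = sum(1 << i for i, x in enumerate(b_pre) if x == -1)   # -1 positions
    p1 |= p1 << n   # B is B_pre concatenated with itself
    m1 |= m1 << n
    run = 0
    sliding_a = a_bits
    for _ in range(n):
        # zero inner product <=> as many +1s as -1s under the 1s of A
        if popcount(sliding_a & p1) != popcount(sliding_a & m1):
            break
        run += 1
        sliding_a <<= 1
    return run

if __name__ == "__main__":
    print(zero_run_length(0b011, (1, -1, 0), 3))
```

Because only equality with zero matters, the two bit counts never have to be subtracted or multiplied.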

Work is divided equally among cores, with forced CPU affinity (every little bit helps, or so they say).

Example result

C:\Dev\PHP\_StackOverflow\C++\VectorCrunch>release\VectorCrunch.exe 16
done in 8:19 by 4 worker threads
 0  487610895942 / 2821109907456 0.17284
 1   97652126058 / 2821109907456 0.03461
 2   20659337010 / 2821109907456 0.00732
 3    4631534490 / 2821109907456 0.00164
 4    1099762394 / 2821109907456 0.00039
 5     302001914 / 2821109907456 0.00011
 6     115084858 / 2821109907456 0.00004
 7      70235786 / 2821109907456 0.00002
 8      59121706 / 2821109907456 0.00002
 9      56384426 / 2821109907456 0.00002
10      55686922 / 2821109907456 0.00002
11      55508202 / 2821109907456 0.00002
12      55461994 / 2821109907456 0.00002
13      55451146 / 2821109907456 0.00002
14      55449098 / 2821109907456 0.00002
15      55449002 / 2821109907456 0.00002
16      55449002 / 2821109907456 0.00002

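Performance

Multithreading should work perfectly on this, though only "true" cores will fully contribute to computation speed. My CPU has only 2 cores for 4 CPUs, and the gain over a single-threaded version is "only" about 3.5.

Compilers

An initial problem with multithreading led me to believe the GNU compilers were performing worse than Microsoft's.

After more thorough testing, it appears g++ wins the day yet again, producing approximately 30% faster code (the same ratio I noticed on two other computation-heavy projects).

Notably, the std::bitset library is implemented with dedicated bit-count instructions by g++ 4.8, while MSVC 2013 uses only loops of conventional bit shifts.

As one could have expected, compiling in 32 or 64 bits makes no difference.

Further refinements

I noticed a few A groups producing the same probabilities after all reduction operations, but I could not identify a pattern that would allow regrouping them.

Here are the pairs I noticed for n = 11: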

  10001011 and 10001101
 100101011 and 100110101
 100101111 and 100111101
 100110111 and 100111011
 101001011 and 101001101
 101011011 and 101101011
 101100111 and 110100111
1010110111 and 1010111011
1011011111 and 1011111011
1011101111 and 1011110111

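I think the last two probabilities should always be the same. This is because the (n+1)th inner product is actually the same as the first one.

What I meant was that the first n inner products are zero if and only if the first n+1 are. The very last inner product does not provide any new information, as you have already computed it before. So the number of strings that give n zero products is exactly the same as the number that give n+1 zero products.

Out of interest, what exactly were you computing instead?

Thanks for the update, but I don't understand the line "0 2160009216 2176782336". What exactly are you counting in this case? The probability that the first inner product is zero is far less than that.

Could you give some advice on how to compile and run this? I tried g++ -Wall -std=c++11 kuroineko.cpp -o kuroineko and ./kuroineko 12, but it gives terminate called after throwing an instance of 'std::system_error' what(): Unknown error -1 Aborted (core dumped)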