149

I have this code, written in Python/NumPy:

from __future__ import division
import numpy as np
import itertools

n = 6
iters = 1000
firstzero = 0
bothzero = 0
""" The next line iterates over arrays of length n+1 which contain only -1s and 1s """
for S in itertools.product([-1, 1], repeat=n+1):
    """For i from 0 to iters -1 """
    for i in xrange(iters):
        """ Choose a random array of length n.
            Prob 1/4 of being -1, prob 1/4 of being 1 and prob 1/2 of being 0. """
        F = np.random.choice(np.array([-1, 0, 0, 1], dtype=np.int8), size=n)
        """The next loop just makes sure that F is not all zeros."""
        while np.all(F == 0):
            F = np.random.choice(np.array([-1, 0, 0, 1], dtype=np.int8), size=n)
        """np.convolve(F, S, 'valid') computes two inner products between
        F and the two successive windows of S of length n."""
        FS = np.convolve(F, S, 'valid')
        if FS[0] == 0:
            firstzero += 1
        if np.all(FS == 0):
            bothzero += 1

print("firstzero: %i" % firstzero)
print("bothzero: %i" % bothzero)

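It counts how often the convolution of two random arrays, one of which is one element longer than the other and each drawn from a particular probability distribution, has a 0 at the first position, and how often it has a 0 at both positions.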
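To make the counted quantity concrete, here is a minimal pure-Python sketch of what np.convolve(F, S, 'valid') returns when S is one element longer than F. It is not part of the original code, and the name valid_convolve is made up for illustration. The result is two inner products of the reversed F with the two length-n windows of S; the reversal comes from the definition of convolution and does not affect the statistics, because the elements of F are drawn independently from the same distribution.

```python
def valid_convolve(F, S):
    """Two 'valid' outputs of convolving F (length n) with S (length n + 1)."""
    n = len(F)
    assert len(S) == n + 1
    rev = F[::-1]  # convolution flips one of the operands
    first = sum(f * s for f, s in zip(rev, S[:n]))
    second = sum(f * s for f, s in zip(rev, S[1:]))
    return [first, second]


F = [1, 0, -1]
S = [1, -1, 1, 1]
print(valid_convolve(F, S))  # prints [0, 2]
```

In the terms of the question, this particular pair would increment firstzero but not bothzero.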

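I had a bet with a friend who says Python is a terrible language for writing code that needs to be fast. It takes 9 s on my computer. He says it could be made 100 times faster if it were written in a "proper language".

The challenge is to see whether this code can indeed be made 100 times faster in any language of your choice. I will test your code, and the fastest one a week from now will win. If anyone gets below 0.09 s, they automatically win and I lose.

Status

Python. 30x speedup by Alistair Buxton! Although not the fastest solution, it is in fact my favourite.
Octave. 100x speedup by @Thethos.
Rust. 500x speedup by @dbaupp.
C++. 570x speedup by Guy Sirton.
C. 727x speedup by @ace.
C++. Unbelievably fast by @Stefan.

The fastest solutions are now too fast to time sensibly. I have therefore increased n to 10 and set iters = 100000 to compare the best ones. Under this measure the fastest are:

C. 7.5 s by @ace.
C++. 1 s by @Stefan.

My machine: the timings will be run on my machine. This is a standard Ubuntu install on an AMD FX-8350 eight-core processor. This also means I need to be able to run your code.

Follow-up posted: as this competition made it rather too easy to get a 100x speedup, I have posted a follow-up for those who want to exercise their speed-guru expertise. See "How slow is Python really? (Part II)".

fastest-code

— Community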

61

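C++ bit magic

0.84 ms with a simple RNG, 1.67 ms with C++11 std::knuth_b

0.16 ms with a slight algorithmic modification (see the edit below)

The Python implementation runs in 7.97 seconds on my rig. So this is 9488 to 4772 times faster, depending on which RNG you choose.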

#include <iostream>
#include <bitset>
#include <random>
#include <chrono>
#include <stdint.h>
#include <cassert>
#include <tuple>
#include <utility>
#include <vector>

#if 0
// C++11 random
std::random_device rd;
std::knuth_b gen(rd());

uint32_t genRandom()
{
    return gen();
}
#else
// bad, fast, random.

uint32_t genRandom()
{
    static uint32_t seed = std::random_device()();
    auto oldSeed = seed;
    seed = seed*1664525UL + 1013904223UL; // numerical recipes, 32 bit
    return oldSeed;
}
#endif

#ifdef _MSC_VER
uint32_t popcnt( uint32_t x ){ return _mm_popcnt_u32(x); }
#else
uint32_t popcnt( uint32_t x ){ return __builtin_popcount(x); }
#endif



std::pair<unsigned, unsigned> convolve()
{
    const uint32_t n = 6;
    const uint32_t iters = 1000;
    unsigned firstZero = 0;
    unsigned bothZero = 0;

    uint32_t S = (1 << (n+1));
    // generate all possible N+1 bit strings
    // 1 = +1
    // 0 = -1
    while ( S-- )
    {
        uint32_t s1 = S % ( 1 << n );
        uint32_t s2 = (S >> 1) % ( 1 << n );
        uint32_t fmask = (1 << n) -1; fmask |= fmask << 16;
        static_assert( n < 16, "packing of F fails when n > 16.");


        for( unsigned i = 0; i < iters; i++ )
        {
            // generate random bit mess
            uint32_t F;
            do {
                F = genRandom() & fmask;
            } while ( 0 == ((F % (1 << n)) ^ (F >> 16 )) );

            // Assume F is an array with interleaved elements such that F[0] || F[16] is one element
            // here MSB(F) & ~LSB(F) returns 1 for all elements that are positive
            // and  ~MSB(F) & LSB(F) returns 1 for all elements that are negative
            // this results in the distribution ( -1, 0, 0, 1 )
            // to ease calculations we generate r = LSB(F) and l = MSB(F)

            uint32_t r = F % ( 1 << n );
            // modulo is required because the behaviour of the leftmost bit is implementation defined
            uint32_t l = ( F >> 16 ) % ( 1 << n );

            uint32_t posBits = l & ~r;
            uint32_t negBits = ~l & r;
            assert( (posBits & negBits) == 0 );

            // calculate which bits in the expression S * F evaluate to +1
            unsigned firstPosBits = ((s1 & posBits) | (~s1 & negBits));
            // idem for -1
            unsigned firstNegBits = ((~s1 & posBits) | (s1 & negBits));

            if ( popcnt( firstPosBits ) == popcnt( firstNegBits ) )
            {
                firstZero++;

                unsigned secondPosBits = ((s2 & posBits) | (~s2 & negBits));
                unsigned secondNegBits = ((~s2 & posBits) | (s2 & negBits));

                if ( popcnt( secondPosBits ) == popcnt( secondNegBits ) )
                {
                    bothZero++;
                }
            }
        }
    }

    return std::make_pair(firstZero, bothZero);
}

int main()
{
    typedef std::chrono::high_resolution_clock clock;
    int rounds = 1000;
    std::vector< std::pair<unsigned, unsigned> > out(rounds);

    // warm-up rounds to get the CPU up to speed
    for( int i = 0; i < 10000; i++ )
    {
        convolve();
    }


    auto start = clock::now();

    for( int i = 0; i < rounds; i++ )
    {
        out[i] = convolve();
    }

    auto end = clock::now();
    double seconds = std::chrono::duration_cast< std::chrono::microseconds >( end - start ).count() / 1000000.0;

#if 0
    for( auto pair : out )
        std::cout << pair.first << ", " << pair.second << std::endl;
#endif

    std::cout << seconds/rounds*1000 << " msec/round" << std::endl;

    return 0;
}

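Compile in 64-bit mode for the extra registers. With the simple random number generator, the loops in convolve() run without any memory access; all variables are stored in registers.

How it works: rather than storing S and F as arrays in memory, they are stored as bits in a uint32_t.
For S, the n least significant bits are used, where a set bit denotes +1 and an unset bit denotes -1.
F requires at least 2 bits per element to create the distribution [-1, 0, 0, 1]. This is done by generating random bits and examining the 16 least significant bits (called r) and the 16 most significant bits (called l). Where l & ~r is set we take F to be +1, where ~l & r is set we take F to be -1, and otherwise F is 0. This generates the distribution we're looking for.

Now we have S, posBits with a set bit at every position where F == 1, and negBits with a set bit at every position where F == -1.

We can show that F * S (where * denotes element-wise multiplication) evaluates to +1 under the condition (S & posBits) | (~S & negBits). Similar logic gives all the positions where F * S evaluates to -1. Finally, sum(F * S) evaluates to 0 if and only if the result contains equal numbers of -1s and +1s, which is very easy to calculate by comparing the number of +1 bits with the number of -1 bits.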
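The bit trick described above can be checked against a plain inner product. The following is only a sketch in Python rather than the answer's C++; the helper names pack, popcount and dot_is_zero_bits are made up for this illustration. It packs small arrays into bit masks the same way (set bit = +1 for S, posBits/negBits for F) and confirms that comparing the two popcounts agrees with sum(F * S) == 0 for every combination at n = 4.

```python
from itertools import product


def popcount(x):
    return bin(x).count("1")


def pack(F, S):
    """Pack arrays into bit masks: bit i describes element i."""
    pos = sum(1 << i for i, f in enumerate(F) if f == 1)
    neg = sum(1 << i for i, f in enumerate(F) if f == -1)
    s = sum(1 << i for i, v in enumerate(S) if v == 1)  # set = +1, clear = -1
    return pos, neg, s


def dot_is_zero_bits(pos, neg, s, n):
    full = (1 << n) - 1                      # Python ints are unbounded: mask ~s
    plus = (s & pos) | (~s & full & neg)     # positions where F*S == +1
    minus = (~s & full & pos) | (s & neg)    # positions where F*S == -1
    return popcount(plus) == popcount(minus)


n = 4
mismatches = 0
for F in product([-1, 0, 1], repeat=n):
    for S in product([-1, 1], repeat=n):
        direct = sum(f * v for f, v in zip(F, S)) == 0
        if dot_is_zero_bits(*pack(F, S), n) != direct:
            mismatches += 1
print(mismatches)  # prints 0
```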

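This implementation uses 32-bit ints, and the maximum n accepted is 16. The implementation can be scaled to 31 bits by modifying the random generation code, and to 63 bits by using uint64_t instead of uint32_t.

Edit

The following convolve function: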

std::pair<unsigned, unsigned> convolve()
{
    const uint32_t n = 6;
    const uint32_t iters = 1000;
    unsigned firstZero = 0;
    unsigned bothZero = 0;
    uint32_t fmask = (1 << n) -1; fmask |= fmask << 16;
    static_assert( n < 16, "packing of F fails when n > 16.");


    for( unsigned i = 0; i < iters; i++ )
    {
        // generate random bit mess
        uint32_t F;
        do {
            F = genRandom() & fmask;
        } while ( 0 == ((F % (1 << n)) ^ (F >> 16 )) );

        // Assume F is an array with interleaved elements such that F[0] || F[16] is one element
        // here MSB(F) & ~LSB(F) returns 1 for all elements that are positive
        // and  ~MSB(F) & LSB(F) returns 1 for all elements that are negative
        // this results in the distribution ( -1, 0, 0, 1 )
        // to ease calculations we generate r = LSB(F) and l = MSB(F)

        uint32_t r = F % ( 1 << n );
        // modulo is required because the behaviour of the leftmost bit is implementation defined
        uint32_t l = ( F >> 16 ) % ( 1 << n );

        uint32_t posBits = l & ~r;
        uint32_t negBits = ~l & r;
        assert( (posBits & negBits) == 0 );

        uint32_t mask = posBits | negBits;
        uint32_t totalBits = popcnt( mask );
        // if the amount of -1 and +1's is uneven, sum(S*F) cannot possibly evaluate to 0
        if ( totalBits & 1 )
            continue;

        uint32_t adjF = posBits & ~negBits;
        uint32_t desiredBits = totalBits / 2;

        uint32_t S = (1 << (n+1));
        // generate all possible N+1 bit strings
        // 1 = +1
        // 0 = -1
        while ( S-- )
        {
            // calculate which bits in the expression S * F evaluate to +1
            auto firstBits = (S & mask) ^ adjF;
            auto secondBits = (S & ( mask << 1 ) ) ^ ( adjF << 1 );

            bool a = desiredBits == popcnt( firstBits );
            bool b = desiredBits == popcnt( secondBits );
            firstZero += a;
            bothZero += a & b;
        }
    }

    return std::make_pair(firstZero, bothZero);
}

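cuts the runtime to 0.160-0.161 ms. Manual loop unrolling (not shown above) brings that to 0.150 ms. The less trivial n = 10, iters = 100000 case runs in under 250 ms. I'm sure I could get it under 50 ms by leveraging additional cores, but that's too easy.

This is done by making the inner loop branch-free and swapping the F and S loops.
If bothZero is not required, I can cut the run time down to 0.02 ms by sparsely looping over all possible S arrays.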
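As a rough cross-check of the swapped-loop version, here is a Python sketch (not the author's code; count_zeros_for_F and brute_force are names invented here). For one fixed F it loops over every S and uses the same mask/XOR/popcount test as the edited convolve(): after the XOR, a set bit marks a position where the product is -1, so the sum is zero exactly when half of the nonzero positions are set. The brute-force function computes the same counts with ordinary inner products.

```python
from itertools import product


def popcount(x):
    return bin(x).count("1")


def count_zeros_for_F(F):
    """For one fixed F, count the S in {-1,1}^(n+1) giving first / both zeros."""
    n = len(F)
    pos = sum(1 << i for i, f in enumerate(F) if f == 1)
    neg = sum(1 << i for i, f in enumerate(F) if f == -1)
    mask = pos | neg
    total = popcount(mask)
    if total & 1:            # an odd number of nonzero terms never sums to 0
        return 0, 0
    desired = total // 2
    first_zero = both_zero = 0
    for s in range(1 << (n + 1)):
        # set bits mark positions where the product F*S is -1
        a = popcount((s & mask) ^ pos) == desired
        b = popcount((s & (mask << 1)) ^ (pos << 1)) == desired
        first_zero += a
        both_zero += a & b
    return first_zero, both_zero


def brute_force(F):
    n = len(F)
    first_zero = both_zero = 0
    for S in product([-1, 1], repeat=n + 1):
        d0 = sum(f * s for f, s in zip(F, S[:n]))
        d1 = sum(f * s for f, s in zip(F, S[1:]))
        first_zero += d0 == 0
        both_zero += d0 == 0 and d1 == 0
    return first_zero, both_zero


F = [1, 0, -1, 1, 0, -1]
print(count_zeros_for_F(F) == brute_force(F))  # prints True
```

Because the test for an odd number of nonzero elements happens once per F instead of once per (F, S) pair, the inner loop needs no branches at all.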

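— Stefan

3

Could you provide a gcc-friendly version, and also tell me what your command line would be? I am not sure I can test it currently.

I know nothing about this, but Google tells me that __builtin_popcount might be a replacement for _mm_popcnt_u32().

3

Code updated; it now uses an #ifdef switch to select the correct popcnt command. It compiles with -std=c++0x -mpopcnt -O2 and takes 1.01 ms to run in 32-bit mode (I don't have a 64-bit GCC version at hand).

— Stefan

Could you make it print the output? I am not sure it is actually doing anything at the moment :)

7

You are clearly a wizard. +1

— BurntPizza 2014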

76

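Python 2.7 + NumPy 1.8.1: 10.242 s

Fortran 90+: 0.029 s 0.003 s 0.022 s 0.010 s

Well, damn, you lost the bet! There's not a drop of parallelization in here either, just straight Fortran 90+.

EDIT: I've taken Guy Sirton's algorithm for permuting the array S (good find :D). I apparently also had the -g -traceback compiler flags active, which were slowing this code down to about 0.017 s. Currently, I am compiling this as

ifort -fast -o convolve convolve_random_arrays.f90

For those who don't have ifort, you can use

gfortran -O3 -ffast-math -o convolve convolve_random_arrays.f90

EDIT 2: The decrease in run time was because I was doing something wrong previously and got an incorrect answer. Doing it the right way is apparently slower. I still can't believe that C++ is faster than mine, so I'm probably going to spend some time this week trying to tweak this to speed it up.

EDIT 3: By simply changing the RNG section to one based on BSD's RNG (as suggested by Sampo Smolander) and eliminating the constant divide by m1, I cut the run time to the same as the C++ answer by Guy Sirton. Using static arrays (as suggested by Sharpie) drops the run time to under the C++ run time! Yay Fortran! :D

EDIT 4: Apparently this does not compile (with gfortran) and run correctly (incorrect values) because the integers are overstepping their limits. I've made corrections to ensure it works, but this requires either ifort 11+ or gfortran 4.7+ (or another compiler that provides iso_fortran_env and the F2008 int64 kind).

Here's the code: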

program convolve_random_arrays
   use iso_fortran_env
   implicit none
   integer(int64), parameter :: a1 = 1103515245
   integer(int64), parameter :: c1 = 12345
   integer(int64), parameter :: m1 = 2147483648_int64
   real, parameter ::    mi = 4.656612873e-10 ! 1/m1
   integer, parameter :: n = 6
   integer :: p, pmax, iters, i, nil(0:1), seed
   !integer, allocatable ::  F(:), S(:), FS(:)
   integer :: F(n), S(n+1), FS(2)

   !n = 6
   !allocate(F(n), S(n+1), FS(2))
   iters = 1000
   nil = 0

   !call init_random_seed()

   S = -1
   pmax = 2**(n+1)
   do p=1,pmax
      do i=1,iters
         F = rand_int_array(n)
         if(all(F==0)) then
            do while(all(F==0))
               F = rand_int_array(n)
            enddo
         endif

         FS = convolve(F,S)

         if(FS(1) == 0) then
            nil(0) = nil(0) + 1
            if(FS(2) == 0) nil(1) = nil(1) + 1
         endif

      enddo
      call permute(S)
   enddo

   print *,"first zero:",nil(0)
   print *," both zero:",nil(1)

 contains
   pure function convolve(x, h) result(y)
!x is the signal array
!h is the noise/impulse array
      integer, dimension(:), intent(in) :: x, h
      integer, dimension(abs(size(x)-size(h))+1) :: y
      integer:: i, j, r
      ! x has n elements and h has n+1, so both windows must be n long
      y(1) = dot_product(x,h(1:n  ))
      y(2) = dot_product(x,h(2:n+1))
   end function convolve

   pure subroutine permute(x)
      integer, intent(inout) :: x(:)
      integer :: i

      do i=1,size(x)
         if(x(i)==-1) then
            x(i) = 1
            return
         endif
         x(i) = -1
      enddo
   end subroutine permute

   function rand_int_array(i) result(x)
     integer, intent(in) :: i
     integer :: x(i), j
     real :: y
     do j=1,i
        y = bsd_rng()
        if(y <= 0.25) then
           x(j) = -1
        else if (y >= 0.75) then
           x(j) = +1
        else
           x(j) = 0
        endif
     enddo
   end function rand_int_array

   function bsd_rng() result(x)
      real :: x
      integer(int64) :: b = 3141592653_int64
      b = mod(a1*b + c1, m1)
      x = real(b)*mi
   end function bsd_rng
end program convolve_random_arrays

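I suppose the question now is whether you will stop using slow-as-molasses Python and switch to fast-as-electrons-can-move Fortran ;).

— Kyle Kanos

1

Wouldn't the case statement be faster than a generator function anyway? Unless you're expecting some kind of branch-prediction/cache-line/etc. speedup?

— OrangeDog 2014

17

Speeds should be compared on the same machine. What run time did you get for the OP's code?

— nbubis

3

The C++ answer implements its own very lightweight random number generator. Your answer uses the default one that comes with the compiler, which could be slower?

— Sampo Smolander 2014

3

Also, the C++ example appears to be using statically allocated arrays. Try using fixed-length arrays that are set at compile time and see if it saves any time.

— Sharpie 2014

1

@KyleKanos @Lembik The problem is that an integer literal in Fortran does not implicitly take the int64 kind, so the numbers are int32 before any conversion is made. The code should be: integer(int64) :: b = 3141592653_int64, and likewise for all int64 values. This is part of the Fortran standard and is what a programmer should expect in a type-declared programming language. (Note that default settings can of course override this.)

— 2014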
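To see why the kind suffix matters, the arithmetic can be sketched in Python, whose integers never overflow. Two of the constants in the Fortran program do not fit into a signed 32-bit integer. What a compiler does with an out-of-range default-kind literal is up to the compiler (gfortran, for one, rejects it), so the wrap-around shown by the hypothetical helper wrap_int32 is only an illustration of what a two's-complement 32-bit store would keep, not a claim about any particular compiler.

```python
INT32_MAX = 2**31 - 1


def wrap_int32(x):
    """Reduce x to the value a two's-complement 32-bit integer would hold."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x > INT32_MAX else x


seed = 3141592653
a1, c1, m1 = 1103515245, 12345, 2147483648

print(seed > INT32_MAX, m1 > INT32_MAX)  # prints True True
print(wrap_int32(seed))                  # prints -1153374643
next_state = (a1 * seed + c1) % m1       # one LCG step with wide-enough integers
print(0 <= next_state < m1)              # prints True
```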

69

Python 2.7 - ~~0.882s~~ 0.283s

(OP's original: 6.404s)

Edit: added Steven Rumbalski's optimization of precomputing the F values. With this optimization CPython beats PyPy's 0.365s.

import itertools
import operator
import random

n=6
iters = 1000
firstzero = 0
bothzero = 0

choicesF = filter(any, itertools.product([-1, 0, 0, 1], repeat=n))

for S in itertools.product([-1,1], repeat = n+1):
    for i in xrange(iters):
        F = random.choice(choicesF)
        if not sum(map(operator.mul, F, S[:-1])):
            firstzero += 1
            if not sum(map(operator.mul, F, S[1:])):
                bothzero += 1

print "firstzero", firstzero
print "bothzero", bothzero

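OP's original code uses such tiny arrays that there is no benefit to using NumPy, as this pure Python implementation demonstrates. But see also this numpy implementation, which is three times faster again than my code.

I also optimize by skipping the rest of the convolution if the first result isn't zero.

— Alistair Buxton

11

With PyPy this runs in about 0.5 seconds.

— Alistair Buxton 2014

2

You get a much more convincing speedup if you set n = 10. I get 19 s versus 4.6 s for CPython versus PyPy.

3

Another optimization would be to precompute the possibilities for F, because there are only 4032 of them. Define choicesF = filter(any, itertools.product([-1, 0, 0, 1], repeat=n)) outside of the loops. Then, in the inner loop, define F = random.choice(choicesF). I get a 3x speedup with this approach.

— Steven Rumbalski 2014

3

How about compiling this in Cython? And then adding a few tactful static types?

— Thane Brimhall

2

Put everything in a function and call it at the end. That makes all the names local, which also makes the optimization proposed by @riffraff work. Also, move the creation of range(iters) out of the loop. Altogether, I get a speedup of about 7% over your answer.

— WolframH 2014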
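The suggestions in these comments can be combined into one function. The following is a sketch in Python 3, not the answer's Python 2 code; the function name count_zeros and its rng parameter are my own additions (the latter so that a seeded generator can be passed in). Local variable lookups are cheaper than global ones in CPython, the admissible F tuples are built once, and the range object is created outside the loops.

```python
import itertools
import random


def count_zeros(n=6, iters=1000, rng=random):
    # Precompute every admissible F once (4032 tuples for n = 6).
    choices_f = [F for F in itertools.product([-1, 0, 0, 1], repeat=n) if any(F)]
    firstzero = bothzero = 0
    trials = range(iters)  # created once, outside the loops
    for S in itertools.product([-1, 1], repeat=n + 1):
        head, tail = S[:-1], S[1:]
        for _ in trials:
            F = rng.choice(choices_f)
            if not sum(f * s for f, s in zip(F, head)):
                firstzero += 1
                # only look at the second window when the first sum is zero
                if not sum(f * s for f, s in zip(F, tail)):
                    bothzero += 1
    return firstzero, bothzero


if __name__ == "__main__":
    print("firstzero %d\nbothzero %d" % count_zeros())
```

Passing rng=random.Random(0) makes a run repeatable, which helps when comparing against another implementation.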

44

Rust: 0.011s

Original Python: 8.3

A direct translation of the original Python.

extern crate rand;

use rand::Rng;

static N: uint = 6;
static ITERS: uint = 1000;

fn convolve<T: Num>(into: &mut [T], a: &[T], b: &[T]) {
    // we want `a` to be the longest array
    if a.len() < b.len() {
        convolve(into, b, a);
        return
    }

    assert_eq!(into.len(), a.len() - b.len() + 1);

    for (n,place) in into.mut_iter().enumerate() {
        for (x, y) in a.slice_from(n).iter().zip(b.iter()) {
            *place = *place + *x * *y
        }
    }
}

fn main() {
    let mut first_zero = 0;
    let mut both_zero = 0;
    let mut rng = rand::XorShiftRng::new().unwrap();

    for s in PlusMinus::new() {
        for _ in range(0, ITERS) {
            let mut f = [0, .. N];
            while f.iter().all(|x| *x == 0) {
                for p in f.mut_iter() {
                    match rng.gen::<u32>() % 4 {
                        0 => *p = -1,
                        1 | 2 => *p = 0,
                        _ => *p = 1
                    }
                }
            }

            let mut fs = [0, .. 2];
            convolve(fs, s, f);

            if fs[0] == 0 { first_zero += 1 }
            if fs.iter().all(|&x| x == 0) { both_zero += 1 }
        }
    }

    println!("{}\n{}", first_zero, both_zero);
}



/// An iterator over [+-]1 arrays of the appropriate length
struct PlusMinus {
    done: bool,
    current: [i32, .. N + 1]
}
impl PlusMinus {
    fn new() -> PlusMinus {
        PlusMinus { done: false, current: [-1, .. N + 1] }
    }
}

impl Iterator<[i32, .. N + 1]> for PlusMinus {
    fn next(&mut self) -> Option<[i32, .. N+1]> {
        if self.done {
            return None
        }

        let ret = self.current;

        // a binary "adder", that just adds one to a bit vector (where
        // -1 is the zero, and 1 is the one).
        for (i, place) in self.current.mut_iter().enumerate() {
            *place = -*place;
            if *place == 1 {
                break
            } else if i == N {
                // we've wrapped, so we want to stop after this one
                self.done = true
            }
        }

        Some(ret)
    }
}

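Compile with --opt-level=3
My Rust compiler is a recent nightly (rustc 0.11-pre-nightly (eea4909 2014-04-24 23:41:15 -0700), to be precise).

— huon

I got it to compile using the nightly version of Rust. However, I think the code is wrong. The output should be something close to firstzero 27215 bothzero 12086. Instead it gives 27367 6481.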
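One way to judge whether an implementation's output is plausible is to compute the exact expected counts by enumeration instead of sampling. This is a sketch, not something from the thread, and expected_counts is a made-up name. It relies on the fact that rejecting all-zero arrays leaves every nonzero tuple over [-1, 0, 0, 1] equally likely. For n = 6 and iters = 1000 the first value comes out as 128000 * 860 / 4032, roughly 27302, which is in line with the 27215 quoted above.

```python
from fractions import Fraction
from itertools import product


def expected_counts(n, iters):
    """Exact expected (firstzero, bothzero) for the OP's experiment."""
    # After rejection sampling, each nonzero tuple is equally likely.
    Fs = [F for F in product([-1, 0, 0, 1], repeat=n) if any(F)]
    first = both = 0
    for S in product([-1, 1], repeat=n + 1):
        for F in Fs:
            if sum(f * s for f, s in zip(F, S[:n])) == 0:
                first += 1
                if sum(f * s for f, s in zip(F, S[1:])) == 0:
                    both += 1
    scale = Fraction(iters, len(Fs))  # iters draws per S, uniform over Fs
    return first * scale, both * scale


first, both = expected_counts(6, 1000)
print(float(first), float(both))
```

A single run will scatter around these values, so a result that is off by a factor of two, as in the comment above, points to a bug rather than bad luck.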

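@Lembik, whoops, I got my a's and b's mixed up in convolve; fixed (it doesn't change the run time noticeably).

— huon 2014

4

This is a very nice demonstration of Rust's speed.

39

C++ (VS 2012) - ~~0.026 s~~ 0.015 s

Python 2.7.6 / NumPy 1.8.1 - 12 s

Speedup ~x800.

The gap would be a lot smaller if the arrays to convolve were very large.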

#include <vector>
#include <iostream>
#include <ctime>

using namespace std;

static unsigned int seed = 35;

int my_random()
{
   seed = seed*1664525UL + 1013904223UL; // numerical recipes, 32 bit

   switch((seed>>30) & 3)
   {
   case 0: return 0;
   case 1: return -1;
   case 2: return 1;
   case 3: return 0;
   }
   return 0;
}

bool allzero(const vector<int>& T)
{
   for(auto x : T)
   {
      if(x!=0)
      {
         return false;
      }
   }
   return true;
}

void convolve(vector<int>& out, const vector<int>& v1, const vector<int>& v2)
{
   for(size_t i = 0; i<out.size(); ++i)
   {
      int result = 0;
      for(size_t j = 0; j<v2.size(); ++j)
      {
         result += v1[i+j]*v2[j];
      }
      out[i] = result;
   }
}

void advance(vector<int>& v)
{
   for(auto &x : v)
   {
      if(x==-1)
      {
         x = 1;
         return;
      }
      x = -1;
   }
}

void convolve_random_arrays(void)
{
   const size_t n = 6;
   const int two_to_n_plus_one = 128;
   const int iters = 1000;
   int bothzero = 0;
   int firstzero = 0;

   vector<int> S(n+1);
   vector<int> F(n);
   vector<int> FS(2);

   time_t current_time;
   time(&current_time);
   seed = current_time;

   for(auto &x : S)
   {
      x = -1;
   }
   for(int i=0; i<two_to_n_plus_one; ++i)
   {
      for(int j=0; j<iters; ++j)
      {
         do
         {
            for(auto &x : F)
            {
               x = my_random();
            }
         } while(allzero(F));
         convolve(FS, S, F);
         if(FS[0] == 0)
         {
            firstzero++;
            if(FS[1] == 0)
            {
               bothzero++;
            }
         }
      }
      advance(S);
   }
   cout << firstzero << endl; // This output can slow things down
   cout << bothzero << endl; // comment out for timing the algorithm
}

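Some notes:

The random function is called inside the loop, so I went for a very lightweight linear congruential generator (but generously looked at the MSBs).
This is really just the starting point for an optimized solution.
Didn't take that long to write...
I iterate through all the values of S, taking S[0] to be the "least significant" digit.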
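Two of these notes are easy to try out in isolation. Below is a Python sketch of them (my transcription for illustration, not the answer's C++; lcg_ternary is a made-up name). advance() works like a binary counter in which -1 plays the role of 0 and S[0] is the least significant digit, so repeated calls visit every ±1 array exactly once. lcg_ternary() performs one step of the linear congruential generator and maps the top two bits of the state to -1, 0 or 1, since the low bits of such a generator are the least random.

```python
def advance(v):
    """In-place binary-counter step over {-1, 1}; v[0] is least significant."""
    for i, x in enumerate(v):
        if x == -1:
            v[i] = 1
            return
        v[i] = -1  # carry: reset this digit and move on to the next one


def lcg_ternary(seed):
    """One LCG step, then map the two most significant bits to -1/0/1."""
    seed = (seed * 1664525 + 1013904223) & 0xFFFFFFFF  # emulate 32-bit wrap
    value = (0, -1, 1, 0)[seed >> 30]
    return seed, value


n = 2
S = [-1] * (n + 1)
seen = []
for _ in range(2 ** (n + 1)):
    seen.append(tuple(S))
    advance(S)
print(len(set(seen)))  # prints 8
print(seen[:3])        # prints [(-1, -1, -1), (1, -1, -1), (-1, 1, -1)]
```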

Add this main function for a self-contained example:

int main(int argc, char** argv)
{
  for(int i=0; i<1000; ++i) // run 1000 times for stop-watch
  {
      convolve_random_arrays();
  }
}

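— Guy Sirton

1

Indeed. The tiny size of the arrays in OP's code means that using numpy is actually an order of magnitude slower than straight Python.

— Alistair Buxton 2014

2

Now x800 is what I am talking about!

Very nice! I've increased the speed of my code thanks to your advance function, so my code is now faster than yours :P (but very good competition!)

— Kyle Kanos

1

@lembik Yes, as Mat says. You need C++11 support and a main function. Let me know if you need more help to get this to run...

— Guy Sirton 2014

2

I just tested this and could shave off another 20% by using plain arrays instead of std::vector.

— PlasmaHH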

21

C

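This takes 0.015 s on my machine, with OP's original code taking about 7.7 s. I tried to optimize by generating the random array and convolving in the same loop, but it doesn't seem to make much of a difference.

The first array is generated by taking an integer, writing it out in binary, and changing every 1 to -1 and every 0 to 1. The rest should be very straightforward.

Edit: instead of having n as an int, n is now a macro-defined constant, so we can use int arr[n]; instead of malloc.

Edit2: Instead of the built-in rand() function, this now implements an xorshift PRNG. Also, a lot of conditional statements were removed when generating the random array.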
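The two branch-free mappings used in this answer can be sketched in Python (an illustration only; signs_from_int and ternary_c_style are names I made up). One point to watch when porting: C's % truncates toward zero, so (1 - r) % 2 can yield -1, whereas Python's % is floored and never does. The sketch therefore uses math.fmod to mimic the C behaviour.

```python
import math


def signs_from_int(i, length):
    """Bit j of i becomes element j: a 1 bit maps to -1, a 0 bit maps to +1."""
    return [((i >> j) & 1) * (-2) + 1 for j in range(length)]


def ternary_c_style(r):
    """Mimic C's (1 - (r & 3)) % 2, where % truncates toward zero."""
    return int(math.fmod(1 - (r & 3), 2))


print(signs_from_int(0b0101, 4))                # prints [-1, 1, -1, 1]
print([ternary_c_style(r) for r in range(4)])   # prints [1, 0, -1, 0]
print([(1 - (r & 3)) % 2 for r in range(4)])    # prints [1, 0, 1, 0]
```

The middle line shows the intended distribution: one value in four is 1, one in four is -1 and two in four are 0.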

Compile instructions:

gcc -O3 -march=native -fwhole-program -fstrict-aliasing -ftree-vectorize -Wall ./test.c -o ./test

Code:

#include <stdio.h>
#include <time.h>

#define n (6)
#define iters (1000)
unsigned int x,y=34353,z=57768,w=1564; //PRNG seeds

/* xorshift PRNG
 * Taken from https://en.wikipedia.org/wiki/Xorshift#Example_implementation
 * Used under CC-By-SA */
int myRand() {
    unsigned int t;
    t = x ^ (x << 11);
    x = y; y = z; z = w;
    return w = w ^ (w >> 19) ^ t ^ (t >> 8);
}

int main() {
    int firstzero=0, bothzero=0;
    int arr[n+1];
    unsigned int i, j;
    x=(int)time(NULL);

    for(i=0; i< 1<<(n+1) ; i++) {
        unsigned int tmp=i;
        for(j=0; j<n+1; j++) {
            arr[j]=(tmp&1)*(-2)+1;
            tmp>>=1;
        }
        for(j=0; j<iters; j++) {
            int randArr[n];
            unsigned int k, flag=0;
            int first=0, second=0;
            do {
                for(k=0; k<n; k++) {
                    randArr[k]=(1-(myRand()&3))%2;
                    flag+=(randArr[k]&1);
                    first+=arr[k]*randArr[k];
                    second+=arr[k+1]*randArr[k];
                }
            } while(!flag);
            firstzero+=(!first);
            bothzero+=(!first&&!second);
        }
    }
    printf("firstzero %d\nbothzero %d\n", firstzero, bothzero);
    return 0;
}

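— ace_HongKongIndependence

1

I tested this. It is very fast (try n = 10) and gives correct-looking output. Thank you.

This implementation doesn't follow the original, because if the random vector is all zeros only the last element gets regenerated. In the original, the entire vector would be. You need to enclose that loop in do{}while(!flag) or something to that effect. I don't expect it to change the run time much (it may even make it faster).

— Guy Sirton

@Guy Sirton Notice that before the continue; statement I assign -1 to k, so k will loop from 0 again.

— ace_HongKongIndependence 2014

1

@ace Ah! You're right. I was scanning it too fast and it looked like that was -= rather than =- :-) A while loop would be more readable.

— Guy Sirton 2014

17

J

I don't expect to beat any compiled language, and something tells me it would take a miraculous machine to get below 0.09 s with this, but I'd like to submit this J anyway, because it's pretty slick.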

NB. constants
num =: 6
iters =: 1000

NB. convolve
NB. take the multiplication table                */
NB. then sum along the NE-SW diagonals           +//.
NB. and keep the longest ones                    #~ [: (= >./) #/.
NB. operate on rows of higher dimensional lists  " 1
conv =: (+//. #~ [: (= >./) #/.) @: (*/) " 1

NB. main program
S  =: > , { (num+1) # < _1 1                NB. all {-1,1}^(num+1)
F  =: (3&= - 0&=) (iters , num) ?@$ 4       NB. iters random arrays of length num
FS =: ,/ S conv/ F                          NB. make a convolution table
FB =: +/ ({. , *./)"1 ] 0 = FS              NB. first and both zero
('first zero ',:'both zero ') ,. ":"0 FB    NB. output results

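This takes about 0.5 s on a laptop from the previous decade, only about 20x as fast as the Python in the answer. Most of the time is spent in conv, because we wrote it lazily (we compute the entire convolution) and in full generality.

Since we know things about S and F, we can speed things up with optimizations specific to this program. The best I've been able to come up with is conv =: ((num, num+1) { +//.)@:(*/)"1, which selects from the diagonal sums exactly the two numbers that correspond to the longest diagonals of the convolution. This approximately halves the time.

— algorithmshark

6

J is always worth submitting, man :)

— Vitaly Dyatlov 2014

17

Perl - 9.3x faster, 830% improvement

On my ancient netbook, the OP's code takes 53 seconds to run; Alistair Buxton's version takes about 6.5 seconds; and the following Perl version takes about 5.7 seconds.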

use v5.10;
use strict;
use warnings;

use Algorithm::Combinatorics qw( variations_with_repetition );
use List::Util qw( any sum );
use List::MoreUtils qw( pairwise );

my $n         = 6;
my $iters     = 1000;
my $firstzero = 0;
my $bothzero  = 0;

my $variations = variations_with_repetition([-1, 1], $n+1);
while (my $S = $variations->next)
{
  for my $i (1 .. $iters)
  {
    my @F;
    until (@F and any { $_ } @F)
    {
      @F = map +((-1,0,0,1)[rand 4]), 1..$n;
    }

    # The pairwise function doesn't accept array slices,
    # so need to copy into a temp array @S0
    my @S0 = @$S[0..$n-1];

    unless (sum pairwise { $a * $b } @F, @S0)
    {
      $firstzero++;
      my @S1 = @$S[1..$n];  # copy again :-(
      $bothzero++ unless sum pairwise { $a * $b } @F, @S1;
    }
  }
}

say "firstzero ", $firstzero;
say "bothzero ", $bothzero;

— tobyink

12

Python 2.7 - numpy 1.8.1 with mkl bindings - 0.086s

(OP's original: 6.404s) (Buxton's pure Python: 0.270s)

import numpy as np
import itertools

n=6
iters = 1000

#Pack all of the Ses into a single array
S = np.array( list(itertools.product([-1,1], repeat=n+1)) )

# Create a whole array of test arrays, oversample a bit to ensure we 
# have at least (iters) of them
F = np.random.rand(int(iters*1.1),n)
F = ( F < 0.25 )*-1 + ( F > 0.75 )*1
goodrows = (np.abs(F).sum(1)!=0)
assert goodrows.sum() > iters, "Got very unlucky"
# get 1000 cases that aren't all zero
F = F[goodrows][:iters]

# Do the convolution explicitly for the two 
# slots, but on all of the Ses and Fes at the 
# same time
firstzeros = (F[:,None,:]*S[None,:,:-1]).sum(-1)==0
secondzeros = (F[:,None,:]*S[None,:,1:]).sum(-1)==0

firstzero_count = firstzeros.sum()
bothzero_count = (firstzeros * secondzeros).sum()
print "firstzero", firstzero_count
print "bothzero", bothzero_count

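As Buxton pointed out, OP's original code uses such tiny arrays that there is no benefit to using NumPy. This implementation leverages numpy by doing all of the F and S cases at once in an array-oriented fashion. Combined with the mkl bindings for Python, this leads to a very fast implementation.

Note also that just loading the libraries and starting the interpreter takes 0.076 s, so the actual computation takes about 0.01 seconds, similar to the C++ solution.

— alemi

What are mkl bindings and how do I get them on Ubuntu?

Running python -c "import numpy; numpy.show_config()" will tell you whether your version of numpy is compiled against BLAS/ATLAS/MKL etc. ATLAS is a free accelerated math package that numpy can be linked against; Intel MKL you usually have to pay for (unless you are an academic), and it can be linked to numpy/scipy.

— alemi 2014

For an easy way, use the Anaconda Python distribution and its accelerate package. Or use the Enthought distribution.

— alemi 2014

If you're on Windows, just download numpy from here. These are pre-compiled numpy installers linked against MKL.

— Fake Name

9

MATLAB 0.024s

Computer 1

Original code: ~3.3 s
Alistair Buxton's code: ~0.51 s
Alistair Buxton's new code: ~0.25 s
Matlab code: ~0.024 s (Matlab already running)

Computer 2

Original code: ~6.66 s
Alistair Buxton's code: ~0.64 s
Alistair Buxton's new code:
Matlab: ~0.07 s (Matlab already running)
Octave: ~0.07 s

I decided to give the oh-so-slow Matlab a try. If you know how, you can get rid of most of the loops (in Matlab), which makes it quite fast. The memory requirements are higher than for the looped solutions, but this will not be an issue unless you have very large arrays...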

function call_convolve_random_arrays
tic
convolve_random_arrays
toc
end

function convolve_random_arrays

n = 6;
iters = 1000;
firstzero = 0;
bothzero = 0;

rnd = [-1, 0, 0, 1];

S = -1 *ones(1, n + 1);

IDX1 = 1:n;
IDX2 = IDX1 + 1;

for i = 1:2^(n + 1)
    F = rnd(randi(4, [iters, n]));
    sel = ~any(F,2);
    while any(sel)
        F(sel, :) = rnd(randi(4, [sum(sel), n]));
        sel = ~any(F,2);
    end

    sum1 = F * S(IDX1)';
    sel = sum1 == 0;
    firstzero = firstzero + sum(sel);

    sum2 = F(sel, :) * S(IDX2)';
    sel = sum2 == 0;
    bothzero = bothzero + sum(sel);

    S = permute(S); 
end

fprintf('firstzero %i \nbothzero %i \n', firstzero, bothzero);

end

function x = permute(x)

for i=1:length(x)
    if(x(i)==-1)
        x(i) = 1;
            return
    end
        x(i) = -1;
end

end

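Here is what I do:

use Kyle Kanos's function to permute through S
calculate all n * iters random numbers at once
map 1 to 4 onto [-1 0 0 1]
use matrix multiplication (the elementwise sum(F * S(1:5)) is equal to the matrix multiplication F * S(1:5)')
for bothzero: only calculate the members that fulfil the first condition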
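The last two steps can be sketched without any array library. This is plain Python for illustration only (matvec is a made-up helper and the tiny F and S are invented test values); the real gain in Matlab or NumPy comes from doing the same thing with optimized matrix routines. Each row of F is one random array, a single matrix-vector product yields all first sums at once, and only the rows whose first sum is zero are multiplied with the second window.

```python
def matvec(rows, vec):
    """Matrix-vector product: one inner product per row."""
    return [sum(a * b for a, b in zip(row, vec)) for row in rows]


F = [[1, 0, -1],
     [1, 1, 1],
     [-1, -1, 0]]
S = [1, -1, 1, 1]

sum1 = matvec(F, S[:3])                          # first window, all rows at once
sel = [row for row, v in zip(F, sum1) if v == 0]
sum2 = matvec(sel, S[1:])                        # second window, selected rows only

firstzero = len(sel)
bothzero = sum(v == 0 for v in sum2)
print(sum1, sum2)            # prints [0, 1, 0] [-2, 0]
print(firstzero, bothzero)   # prints 2 1
```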

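I assume you don't have Matlab, which is too bad, as I really would have liked to see how it compares...

(The function can be slower the first time you run it.)

— mathause

Well, I have Octave, if you can make it work for that...?

I can give it a try, although I have never worked with Octave.

— mathause

OK, I can run it as is in Octave if I put the code in a file called call_convolve_random_arrays.m and then call it from Octave.

— mathause

Does it need some more code to actually get it to do anything? When I do "octave call_convolve_random_arrays.m" it doesn't output anything. See bpaste.net/show/JPtLOCeI3aP3wc3F3aGf

Sorry, try opening Octave and running it from there. It should display firstzero, bothzero and the execution time.

— mathause

7

Julia: 0.30 s

OP's Python: 21.36 s (Core 2 Duo)

71x speedup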

function countconv()
    n = 6
    iters = 1000
    firstzero = 0
    bothzero = 0
    cprod= Iterators.product(fill([-1,1], n+1)...)
    F=Array(Float64,n);
    P=[-1. 0. 0. 1.]

    for S in cprod
        Sm=[S...]
        for i = 1:iters
            F=P[rand(1:4,n)]
            while all(F==0)
                F=P[rand(1:4,n)]
            end
            if  dot(reverse!(F),Sm[1:end-1]) == 0
                firstzero += 1
                if dot(F,Sm[2:end]) == 0
                    bothzero += 1
                end
            end
        end
    end
    return firstzero,bothzero
end

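I made some modifications to Arman's Julia answer. First of all, I wrapped it in a function, as global variables make life hard for Julia's type inference and JIT: a global variable can change its type at any time, so it must be checked on every operation. Then I got rid of the anonymous functions and array comprehensions. They are not really necessary, and they are still pretty slow. Julia is faster with lower-level abstractions right now.

There are many more ways to make it faster, but this does a decent job.

— user20768

Are you measuring the time in the REPL or running the whole file from the command line?

— Aditya 2014

Both from the REPL.

— user20768 2014

6

OK, I'm posting this just because I feel Java needs to be represented here. I'm terrible with other languages and I confess that I don't understand the problem exactly, so I'll need some help to fix this code. I stole most of the code from ace's C example and then borrowed some snippets from others. I hope that isn't a faux pas...

One thing I'd like to point out is that languages that optimize at run time need to be run several or many times to get up to full speed. I think it's justified to take the fully optimized speed (or at least the average speed), because most things you want to run fast will be run many times.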
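The same idea applies to any runtime with a JIT, for example the PyPy runs mentioned elsewhere in this thread. The following is a minimal timing sketch in Python; time_runs and workload are hypothetical names and the workload is just a stand-in. It reports the first call separately from the steady state reached after some warm-up calls, and it uses time.perf_counter, which is a monotonic high-resolution clock.

```python
import time


def time_runs(fn, warmup=10, rounds=100):
    """Return (first_run_seconds, mean_seconds_after_warmup) for calling fn()."""
    start = time.perf_counter()
    fn()
    first = time.perf_counter() - start
    for _ in range(warmup):        # let a JIT (if any) and the caches settle
        fn()
    start = time.perf_counter()
    for _ in range(rounds):
        fn()
    mean = (time.perf_counter() - start) / rounds
    return first, mean


def workload():
    return sum(i * i for i in range(10_000))


first, mean = time_runs(workload)
print("first run: %.3f ms, steady state: %.3f ms" % (first * 1e3, mean * 1e3))
```

Under CPython, which has no JIT, the two numbers will usually be close; the split matters most for runtimes that compile hot code on the fly.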

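The code still needs to be fixed, but I ran it anyway to see what times I would get.

Here are the results on an Intel(R) Xeon(R) CPU E3-1270 V2 @ 3.50GHz, running it 1000 times:

server:/tmp# time java8 -cp . Tester
firstzero 40000
bothzero 20000

first run time: 41 ms
last run time: 4 ms

real 0m5.014s
user 0m4.664s
sys 0m0.268s

Here is my crappy code: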

public class Tester 
{
    public static void main( String[] args )
    {
        long firstRunTime = 0;
        long lastRunTime = 0;
        String testResults = null;
        for( int i=0 ; i<1000 ; i++ )
        {
            long timer = System.currentTimeMillis();
            testResults = new Tester().runtest();
            lastRunTime = System.currentTimeMillis() - timer;
            if( i ==0 )
            {
                firstRunTime = lastRunTime;
            }
        }
        System.err.println( testResults );
        System.err.println( "first run time: " + firstRunTime + " ms" );
        System.err.println( "last run time: " + lastRunTime + " ms" );
    }

    private int x,y=34353,z=57768,w=1564; 

    public String runtest()
    {
        int n = 6;
        int iters = 1000;
        //#define iters (1000)
        //PRNG seeds

        /* xorshift PRNG
         * Taken from https://en.wikipedia.org/wiki/Xorshift#Example_implementation
         * Used under CC-By-SA */

            int firstzero=0, bothzero=0;
            int[] arr = new int[n+1];
            int i=0, j=0;
            x=(int)(System.currentTimeMillis()/1000l);

            for(i=0; i< 1<<(n+1) ; i++) {
                int tmp=i;
                for(j=0; j<n+1; j++) {
                    arr[j]=(tmp&1)*(-2)+1;
                    tmp>>=1;
                }
                for(j=0; j<iters; j++) {
                    int[] randArr = new int[n];
                    int k=0;
                    long flag = 0;
                    int first=0, second=0;
                    do {
                        for(k=0; k<n; k++) {
                            randArr[k]=(1-(myRand()&3))%2;
                            flag+=(randArr[k]&1);
                            first+=arr[k]*randArr[k];
                            second+=arr[k+1]*randArr[k];
                        }
                    } while(allzero(randArr));
                    if( first == 0 )
                    {
                        firstzero+=1;
                        if( second == 0 )
                        {
                            bothzero++;
                        }
                    }
                }
            }
         return ( "firstzero " + firstzero + "\nbothzero " + bothzero + "\n" );
    }

    private boolean allzero(int[] arr)
    {
       for(int x : arr)
       {
          if(x!=0)
          {
             return false;
          }
       }
       return true;
    }

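    public int myRand() 
    {
        // xorshift128 step: the new word must be stored back into w,
        // otherwise the output becomes constant after a few calls.
        int t = x ^ (x << 11);
        x = y; y = z; z = w;
        w = w ^ (w >>> 19) ^ t ^ (t >>> 8);
        return w;
    }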
}

I tried to run the Python code after upgrading Python and installing python-numpy, but I got:

server:/tmp# python tester.py
Traceback (most recent call last):
  File "peepee.py", line 15, in <module>
    F = np.random.choice(np.array([-1,0,0,1], dtype=np.int8), size = n)
AttributeError: 'module' object has no attribute 'choice'

— Chris Seline

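Some notes: never use currentTimeMillis for benchmarking (use the nano version in System), and 1k runs may not be enough to get the JIT involved (the defaults are 1.5k for the client and 10k for the server compiler), although myRand is called often enough that it will be JITed, which should cause some functions further up the call stack to be compiled as well, so it may work here. Last but not least, the weak PRNG is cheating, but so are the C++ solution and others, so I guess that is not too unfair.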

— Voo

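On Windows you need to avoid currentTimeMillis, but on Linux you don't need nano time for anything except very fine-grained measurements, and the call to get nano time is much more expensive than the millis one. So I very much disagree that you should never use it.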

— Chris Seline

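So you are writing Java code for one particular OS and JVM implementation? Actually, I'm not sure which OS you are using, because I just checked my HotSpot dev tree and Linux uses gettimeofday(&time, NULL) for milliSeconds, which is not monotonic and gives no accuracy guarantees (so on some platforms/kernels it has exactly the same problems as the Windows implementation of currentTimeMillis: either that one is fine too, or neither is). nanoTime, on the other hand, uses clock_gettime(CLOCK_MONOTONIC, &tp), which is clearly also the right thing to use for benchmarking on Linux.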

— Voo
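
The same concern applies when timing the Python entries in this thread. A minimal sketch, assuming Python 3.7+ (for time.perf_counter_ns); the helper name bench and the toy workload are made up for this illustration and are not part of any answer. It uses a monotonic high-resolution clock and reports the first and the best of several runs, so one-off warm-up costs show up separately.

```python
import time

def bench(fn, runs=5):
    """Time fn() `runs` times with a monotonic clock.

    Returns (first_ns, best_ns): the first run includes any one-off
    warm-up cost (imports, caches, JIT on runtimes that have one).
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()   # monotonic, unaffected by clock changes
        fn()
        samples.append(time.perf_counter_ns() - start)
    return samples[0], min(samples)

def toy_workload():
    # Hypothetical stand-in for the real simulation.
    return sum(i * i for i in range(10_000))

if __name__ == "__main__":
    first_ns, best_ns = bench(toy_workload)
    print("first run: %.3f ms" % (first_ns / 1e6))
    print("best run:  %.3f ms" % (best_ns / 1e6))
```

Reporting both numbers mirrors the "first run time" / "last run time" output of the Java program above.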

It has never caused me a problem for as long as I have been writing Java on any Linux distro or kernel.

— Chris Seline

6

Golang: about 45x faster than Python on my machine, using the Golang code below:

package main

import (
"fmt"
"time"
)

const (
n     = 6
iters = 1000
)

var (
x, y, z, w = 34353, 34353, 57768, 1564 //PRNG seeds
)

/* xorshift PRNG
 * Taken from https://en.wikipedia.org/wiki/Xorshift#Example_implementation
 * Used under CC-By-SA */
func myRand() int {
var t uint
t = uint(x ^ (x << 11))
x, y, z = y, z, w
w = int(uint(w^w>>19) ^ t ^ (t >> 8))
return w
}

func main() {
var firstzero, bothzero int
var arr [n + 1]int
var i, j int
x = int(time.Now().Unix())

for i = 0; i < 1<<(n+1); i = i + 1 {
    tmp := i
    for j = 0; j < n+1; j = j + 1 {
        arr[j] = (tmp&1)*(-2) + 1
        tmp >>= 1
    }
    for j = 0; j < iters; j = j + 1 {
        var randArr [n]int
        var flag uint
        var k, first, second int
        for {
            for k = 0; k < n; k = k + 1 {
                randArr[k] = (1 - (myRand() & 3)) % 2
                flag += uint(randArr[k] & 1)
                first += arr[k] * randArr[k]
                second += arr[k+1] * randArr[k]
            }
            if flag != 0 {
                break
            }
        }
        if first == 0 {
            firstzero += 1
            if second == 0 {
                bothzero += 1
            }
        }
    }
}
fmt.Println("firstzero", firstzero, "bothzero", bothzero)
}

and the following Python code, copied from above:

import itertools
import operator
import random

n=6
iters = 1000
firstzero = 0
bothzero = 0

choicesF = filter(any, itertools.product([-1, 0, 0, 1], repeat=n))

for S in itertools.product([-1,1], repeat = n+1):
    for i in xrange(iters):
        F = random.choice(choicesF)
        if not sum(map(operator.mul, F, S[:-1])):
            firstzero += 1
            if not sum(map(operator.mul, F, S[1:])):
                bothzero += 1

print "firstzero", firstzero
print "bothzero", bothzero

and the timings below:

$time python test.py
firstzero 27349
bothzero 12125

real    0m0.477s
user    0m0.461s
sys 0m0.014s

$time ./hf
firstzero 27253 bothzero 12142

real    0m0.011s
user    0m0.008s
sys 0m0.002s
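
The Python listing above is Python 2 (xrange, print statements, filter returning a list). A hedged sketch of the same approach for Python 3 follows; the run function and its seed parameter are additions made here for repeatability, and the timings above were not measured with this version.

```python
import itertools
import random

def run(n=6, iters=1000, seed=None):
    """Count how often the first, and both, sliding inner products are zero."""
    rng = random.Random(seed)
    # Every F in {-1, 0, 0, 1}^n except the all-zero one. Listing 0 twice
    # keeps the 1/4, 1/2, 1/4 weighting of the original question.
    choices_f = [f for f in itertools.product([-1, 0, 0, 1], repeat=n) if any(f)]
    firstzero = bothzero = 0
    for s in itertools.product([-1, 1], repeat=n + 1):
        for _ in range(iters):
            f = rng.choice(choices_f)
            if sum(a * b for a, b in zip(f, s[:-1])) == 0:
                firstzero += 1
                if sum(a * b for a, b in zip(f, s[1:])) == 0:
                    bothzero += 1
    return firstzero, bothzero

if __name__ == "__main__":
    first, both = run()
    print("firstzero", first)
    print("bothzero", both)
```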

— 伦尼

1

Have you considered using "github.com/yanatan16/itertools"? Also, would you say this would work well across multiple goroutines?

— ymg, 2014

5

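C# 0.135s

C# based on Alistair Buxton's plain Python: 0.278s
Parallel C#: 0.135s
Python from the question: 5.907s
Alistair's plain Python: 0.853s

I'm not actually sure this implementation is correct: its output differs from the others, as the results at the bottom show.

There are certainly more optimal algorithms. I simply decided to use an algorithm very similar to the Python one.

Single-threaded C#: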

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace ConvolvingArrays
{
    static class Program
    {
        static void Main(string[] args)
        {
            int n=6;
            int iters = 1000;
            int firstzero = 0;
            int bothzero = 0;

            int[] arraySeed = new int[] {-1, 1};
            int[] randomSource = new int[] {-1, 0, 0, 1};
            Random rand = new Random();

            foreach (var S in Enumerable.Repeat(arraySeed, n+1).CartesianProduct())
            {
                for (int i = 0; i < iters; i++)
                {
                    var F = Enumerable.Range(0, n).Select(_ => randomSource[rand.Next(randomSource.Length)]);
                    while (!F.Any(f => f != 0))
                    {
                        F = Enumerable.Range(0, n).Select(_ => randomSource[rand.Next(randomSource.Length)]);
                    }
                    if (Enumerable.Zip(F, S.Take(n), (f, s) => f * s).Sum() == 0)
                    {
                        firstzero++;
                        if (Enumerable.Zip(F, S.Skip(1), (f, s) => f * s).Sum() == 0)
                        {
                            bothzero++;
                        }
                    }
                }
            }

            Console.WriteLine("firstzero {0}", firstzero);
            Console.WriteLine("bothzero {0}", bothzero);
        }

        // itertools.product?
        // http://ericlippert.com/2010/06/28/computing-a-cartesian-product-with-linq/
        static IEnumerable<IEnumerable<T>> CartesianProduct<T>
            (this IEnumerable<IEnumerable<T>> sequences)
        {
            IEnumerable<IEnumerable<T>> emptyProduct =
              new[] { Enumerable.Empty<T>() };
            return sequences.Aggregate(
              emptyProduct,
              (accumulator, sequence) =>
                from accseq in accumulator
                from item in sequence
                select accseq.Concat(new[] { item }));
        }
    }
}

Parallel C#:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace ConvolvingArrays
{
    static class Program
    {
        static void Main(string[] args)
        {
            int n=6;
            int iters = 1000;
            int firstzero = 0;
            int bothzero = 0;

            int[] arraySeed = new int[] {-1, 1};
            int[] randomSource = new int[] {-1, 0, 0, 1};

            ConcurrentBag<int[]> results = new ConcurrentBag<int[]>();

            // The next line iterates over arrays of length n+1 which contain only -1s and 1s
            Parallel.ForEach(Enumerable.Repeat(arraySeed, n + 1).CartesianProduct(), (S) =>
            {
                int fz = 0;
                int bz = 0;
                ThreadSafeRandom rand = new ThreadSafeRandom();
                for (int i = 0; i < iters; i++)
                {
                    var F = Enumerable.Range(0, n).Select(_ => randomSource[rand.Next(randomSource.Length)]);
                    while (!F.Any(f => f != 0))
                    {
                        F = Enumerable.Range(0, n).Select(_ => randomSource[rand.Next(randomSource.Length)]);
                    }
                    if (Enumerable.Zip(F, S.Take(n), (f, s) => f * s).Sum() == 0)
                    {
                        fz++;
                        if (Enumerable.Zip(F, S.Skip(1), (f, s) => f * s).Sum() == 0)
                        {
                            bz++;
                        }
                    }
                }

                results.Add(new int[] { fz, bz });
            });

            foreach (int[] res in results)
            {
                firstzero += res[0];
                bothzero += res[1];
            }

            Console.WriteLine("firstzero {0}", firstzero);
            Console.WriteLine("bothzero {0}", bothzero);
        }

        // itertools.product?
        // http://ericlippert.com/2010/06/28/computing-a-cartesian-product-with-linq/
        static IEnumerable<IEnumerable<T>> CartesianProduct<T>
            (this IEnumerable<IEnumerable<T>> sequences)
        {
            IEnumerable<IEnumerable<T>> emptyProduct =
              new[] { Enumerable.Empty<T>() };
            return sequences.Aggregate(
              emptyProduct,
              (accumulator, sequence) =>
                from accseq in accumulator
                from item in sequence
                select accseq.Concat(new[] { item }));
        }
    }

    // http://stackoverflow.com/a/11109361/1030702
    public class ThreadSafeRandom
    {
        private static readonly Random _global = new Random();
        [ThreadStatic]
        private static Random _local;

        public ThreadSafeRandom()
        {
            if (_local == null)
            {
                int seed;
                lock (_global)
                {
                    seed = _global.Next();
                }
                _local = new Random(seed);
            }
        }
        public int Next()
        {
            return _local.Next();
        }
        public int Next(int maxValue)
        {
            return _local.Next(maxValue);
        }
    }
}

Test output:

Windows (.NET)

C# was much faster on Windows, probably because .NET is faster than Mono.

User and sys timings do not seem to work (git bash was used for timing).

$ time /c/Python27/python.exe numpypython.py
firstzero 27413
bothzero 12073

real    0m5.907s
user    0m0.000s
sys     0m0.000s
$ time /c/Python27/python.exe plainpython.py
firstzero 26983
bothzero 12033

real    0m0.853s
user    0m0.000s
sys     0m0.000s
$ time ConvolvingArrays.exe
firstzero 28526
bothzero 6453

real    0m0.278s
user    0m0.000s
sys     0m0.000s
$ time ConvolvingArraysParallel.exe
firstzero 28857
bothzero 6485

real    0m0.135s
user    0m0.000s
sys     0m0.000s

Linux (Mono)

bob@phoebe:~/convolvingarrays$ time python program.py
firstzero 27059
bothzero 12131

real    0m11.932s
user    0m11.912s
sys     0m0.012s
bob@phoebe:~/convolvingarrays$ mcs -optimize+ -debug- program.cs
bob@phoebe:~/convolvingarrays$ time mono program.exe
firstzero 28982
bothzero 6512

real    0m1.360s
user    0m1.532s
sys     0m0.872s
bob@phoebe:~/convolvingarrays$ mcs -optimize+ -debug- parallelprogram.cs
bob@phoebe:~/convolvingarrays$ time mono parallelprogram.exe
firstzero 28857
bothzero 6496

real    0m0.851s
user    0m2.708s
sys     0m3.028s

— Bob

1

I don't think your code is correct, as you say. The outputs are not right.
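
One way to judge whether an entry's output is "right" is to compute the expected counts exactly rather than compare two random runs. The sketch below is an illustration added here, not taken from any answer; expected_counts is a hypothetical helper. It enumerates every S and every non-zero F with its sampling weight. For n = 6 and iters = 1000 the expected firstzero is 128000 * 860/4032, roughly 27302, which agrees with the Python, Go, PHP and Rust outputs in this thread.

```python
import itertools

def expected_counts(n=6, iters=1000):
    """Exact expected (firstzero, bothzero) for the original experiment."""
    # product over [-1, 0, 0, 1] yields each F once per way of drawing it,
    # so plain counting reproduces the 1/4, 1/2, 1/4 weights.
    fs = [f for f in itertools.product([-1, 0, 0, 1], repeat=n) if any(f)]
    first_hits = both_hits = 0
    for s in itertools.product([-1, 1], repeat=n + 1):
        for f in fs:
            if sum(a * b for a, b in zip(f, s)) == 0:          # window S[0:n]
                first_hits += 1
                if sum(a * b for a, b in zip(f, s[1:])) == 0:  # window S[1:n+1]
                    both_hits += 1
    # Each S is paired with `iters` independent draws of F.
    scale = iters / len(fs)
    return first_hits * scale, both_hits * scale

if __name__ == "__main__":
    first, both = expected_counts()
    print("expected firstzero %.1f" % first)
    print("expected bothzero  %.1f" % both)
```

The enumeration takes on the order of a second in CPython, since it covers 128 * 4032 pairs.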

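@Lembik Yup. I'd appreciate it if someone could tell me where it goes wrong, though. I can't figure it out (having only a minimal understanding of what it is supposed to do doesn't help).

— Bob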

It would be interesting to see how this does with .NET Native blogs.msdn.com/b/dotnet/archive/2014/04/02/...

— Rick Minerich

@Lembik I've just gone over all of it, and as far as I can tell it should be identical to the other Python solution... now I'm really confused.

— Bob
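
One plausible explanation, offered as an assumption rather than something confirmed in this thread: in the C# listings F is a deferred LINQ query (Select without ToArray), so every enumeration (the Any check and each Zip) draws fresh random values, and the two inner products are computed against different arrays. That would give firstzero near 128000 * 924/4096 (about 28,875) and bothzero near 128000 * (924/4096)^2 (about 6,514), close to the C# outputs above. The Python sketch below imitates the effect with a hypothetical LazyDraw class that re-draws on every iteration; materializing it once with tuple(...), the analogue of ToArray(), restores the usual ratio.

```python
import itertools
import random

class LazyDraw:
    """Iterable that draws new random values every time it is iterated,
    imitating a deferred LINQ Select (illustrative only)."""
    def __init__(self, n, rng):
        self.n, self.rng = n, rng
    def __iter__(self):
        return (self.rng.choice((-1, 0, 0, 1)) for _ in range(self.n))

def simulate(materialize, n=6, iters=100, seed=0):
    rng = random.Random(seed)
    first = both = 0
    for s in itertools.product([-1, 1], repeat=n + 1):
        for _ in range(iters):
            f = LazyDraw(n, rng)
            if materialize:
                f = tuple(f)            # draw once, like ToArray()
            while not any(f):           # lazy case: this check uses one draw...
                f = LazyDraw(n, rng)
                if materialize:
                    f = tuple(f)
            # ...and each zip below sees yet another draw.
            if sum(a * b for a, b in zip(f, s[:-1])) == 0:
                first += 1
                if sum(a * b for a, b in zip(f, s[1:])) == 0:
                    both += 1
    return first, both

if __name__ == "__main__":
    print("lazy        ", simulate(materialize=False))
    print("materialized", simulate(materialize=True))
```

If this is indeed the cause, appending .ToArray() to both Select calls that build F in the C# code should be enough to fix it.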

4

Haskell: ~2000x speedup per core

Compile with "ghc -O3 -funbox-strict-fields -threaded -fllvm", and run with "+RTS -Nk", where k is the number of cores on your machine.

import Control.Parallel.Strategies
import Data.Bits
import Data.List
import Data.Word
import System.Random

n = 6 :: Int
iters = 1000 :: Int

data G = G !Word !Word !Word !Word deriving (Eq, Show)

gen :: G -> (Word, G)
gen (G x y z w) = let t  = x `xor` (x `shiftL` 11)
                      w' = w `xor` (w `shiftR` 19) `xor` t `xor` (t `shiftR` 8)
                  in (w', G y z w w')  

mask :: Word -> Word
mask = (.&.) $ (2 ^ n) - 1

gen_nonzero :: G -> (Word, G)
gen_nonzero g = let (x, g') = gen g 
                    a = mask x
                in if a == 0 then gen_nonzero g' else (a, g')


data F = F {zeros  :: !Word, 
            posneg :: !Word} deriving (Eq, Show)

gen_f :: G -> (F, G)       
gen_f g = let (a, g')  = gen_nonzero g
              (b, g'') = gen g'
          in  (F a $ mask b, g'')

inner :: Word -> F -> Int
inner s (F zs pn) = let s' = complement $ s `xor` pn
                        ones = s' .&. zs
                        negs = (complement s') .&. zs
                    in popCount ones - popCount negs

specialised_convolve :: Word -> F -> (Int, Int)
specialised_convolve s f@(F zs pn) = (inner s f', inner s f) 
    where f' = F (zs `shiftL` 1) (pn `shiftL` 1)

ss :: [Word]
ss = [0..2 ^ (n + 1) - 1]

main_loop :: [G] -> (Int, Int)
main_loop gs = foldl1' (\(fz, bz) (fz', bz') -> (fz + fz', bz + bz')) . parMap rdeepseq helper $ zip ss gs
    where helper (s, g) = go 0 (0, 0) g
                where go k u@(fz, bz) g = if k == iters 
                                              then u 
                                              else let (f, g') = gen_f g
                                                       v = case specialised_convolve s f
                                                               of (0, 0) -> (fz + 1, bz + 1)
                                                                  (0, _) -> (fz + 1, bz)
                                                                  _      -> (fz, bz)
                                                   in go (k + 1) v g'

seed :: IO G                                        
seed = do std_g <- newStdGen
          let [x, y, z, w] = map fromIntegral $ take 4 (randoms std_g :: [Int])
          return $ G x y z w

main :: IO ()
main = (sequence $ map (const seed) ss) >>= print . main_loop

— 用户名
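
The Haskell entry packs F into two machine words, a mask of non-zero positions and a sign mask, and replaces each inner product with population counts. A Python sketch of that idea follows; the function names, the bit convention (set bit = -1) and the shift direction are assumptions made for this illustration, not a transcription of the Haskell code.

```python
def popcount(x):
    return bin(x).count("1")

def pack_signs(values):
    """Bit i is set where values[i] == -1."""
    return sum(1 << i for i, v in enumerate(values) if v == -1)

def pack_nonzero(values):
    """Bit i is set where values[i] != 0."""
    return sum(1 << i for i, v in enumerate(values) if v != 0)

def inner_bits(s_bits, nonzero, f_signs):
    """Dot product of a +/-1 vector S with a {-1, 0, 1} vector F."""
    differ = (s_bits ^ f_signs) & nonzero   # positions contributing -1
    # Non-zero positions whose signs agree contribute +1 each.
    return popcount(nonzero) - 2 * popcount(differ)

def both_windows(s_bits, nonzero, f_signs):
    """Inner products of F with S[0:n] and S[1:n+1]; S is packed in n+1 bits."""
    return (inner_bits(s_bits, nonzero, f_signs),
            inner_bits(s_bits >> 1, nonzero, f_signs))
```

With this layout a random F is just two random n-bit words (retrying while the non-zero mask is 0), which is why no per-element loop is needed.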

2

So with 4 cores it's over 9000?! There's no way that can be right.

— Cees Timmerman

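Amdahl's law states that the speedup from parallelization is not linear in the number of parallel processing units; instead, they give only diminishing returns.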

— xaedes
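
For reference, Amdahl's law can be written as speedup = 1 / ((1 - p) + p / k), where p is the fraction of the run that can be parallelized and k the number of processing units (p and k are just the symbols chosen here). A small sketch:

```python
def amdahl_speedup(parallel_fraction, units):
    """Upper bound on speedup when only part of the work runs in parallel."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / units)

if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        print(k, "units:", round(amdahl_speedup(0.95, k), 2))
```

When p is close to 1, as it is for independent simulation runs, the curve stays nearly linear for small k, which fits the observation in the next comment.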

@xaedes The speedup seems essentially linear for low numbers of cores

— user1502040

3

Ruby

Ruby (2.1.0) 0.277s
Ruby (2.1.1) 0.281s
Python (Alistair Buxton) 0.330s
Python (alemi) 0.097s

n = 6
iters = 1000
first_zero = 0
both_zero = 0

choices = [-1, 0, 0, 1].repeated_permutation(n).select{|v| [0] != v.uniq}

def convolve(v1, v2)
  [0, 1].map do |i|
    r = 0
    v2.size.times do |j|
      r += v1[i+j] * v2[j]
    end
    r
  end
end

[-1, 1].repeated_permutation(n+1) do |s|
  iters.times do
    f = choices.sample
    fs = convolve s, f
    if 0 == fs[0]
      first_zero += 1
      if 0 == fs[1]
        both_zero += 1
      end
    end
  end
end

puts 'firstzero %i' % first_zero
puts 'bothzero %i' % both_zero

— 房东

3

The thread wouldn't be complete without PHP

6.6x faster

PHP v5.5.9 - 0.646 sec (first version: 1.223 sec);

vs

Python v2.7.6 - 8.072 sec

<?php

$n = 6;
$iters = 1000;
$firstzero = 0;
$bothzero = 0;

$x=time();
$y=34353;
$z=57768;
$w=1564; //PRNG seeds

function myRand() {
    global $x;
    global $y;
    global $z;
    global $w;
    $t = $x ^ ($x << 11);
    $x = $y; $y = $z; $z = $w;
    return $w = $w ^ ($w >> 19) ^ $t ^ ($t >> 8);
}

function array_cartesian() {
    $_ = func_get_args();
    if (count($_) == 0)
        return array();
    $a = array_shift($_);
    if (count($_) == 0)
        $c = array(array());
    else
        $c = call_user_func_array(__FUNCTION__, $_);
    $r = array();
    foreach($a as $v)
        foreach($c as $p)
            $r[] = array_merge(array($v), $p);
    return $r;
}

function rand_array($a, $n)
{
    $r = array();
    for($i = 0; $i < $n; $i++)
        $r[] = $a[myRand()%count($a)];
    return $r;
}

function convolve($a, $b)
{
    // slows down
    /*if(count($a) < count($b))
        return convolve($b,$a);*/
    $result = array();
    $w = count($a) - count($b) + 1;
    for($i = 0; $i < $w; $i++){
        $r = 0;
        for($k = 0; $k < count($b); $k++)
            $r += $b[$k] * $a[$i + $k];
        $result[] = $r;
    }
    return $result;
}

$cross = call_user_func_array('array_cartesian',array_fill(0,$n+1,array(-1,1)));

foreach($cross as $S)
    for($i = 0; $i < $iters; $i++){
        while(true)
        {
            $F = rand_array(array(-1,0,0,1), $n);
            if(in_array(-1, $F) || in_array(1, $F))
                break;
        }
        $FS = convolve($S, $F);
        if(0==$FS[0]) $firstzero += 1;
        if(0==$FS[0] && 0==$FS[1]) $bothzero += 1;
    }

echo "firstzero $firstzero\n";
echo "bothzero $bothzero\n";

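Used a custom random generator (stolen from the C answer); PHP's own one sucks and the numbers don't match
The convolve function is simplified a bit to be faster
Checking for an all-zero array is also heavily optimized (see the $F and $FS checks).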
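
For reference, a sketch of the xorshift128 generator from the Wikipedia page cited in several answers, written in Python with explicit 32-bit masking (Python integers do not wrap). The class name and the to_f_entry helper are made up for this illustration; the default seeds are the ones from Marsaglia's example rather than the seeds used in the answers. The C idiom (1 - (r & 3)) % 2 does not carry over directly, because % in Python never returns a negative result for a positive modulus, so a lookup table is used instead.

```python
MASK32 = 0xFFFFFFFF

class XorShift128:
    """Marsaglia's xorshift128 with 32-bit state words."""
    def __init__(self, x=123456789, y=362436069, z=521288629, w=88675123):
        self.x, self.y, self.z, self.w = x, y, z, w

    def next(self):
        t = (self.x ^ (self.x << 11)) & MASK32
        self.x, self.y, self.z = self.y, self.z, self.w
        # The new word is stored back into w; it is also the output.
        self.w = (self.w ^ (self.w >> 19) ^ t ^ (t >> 8)) & MASK32
        return self.w

def to_f_entry(r):
    """Map the two low bits of r to 1, 0, -1, 0 (probabilities 1/4, 1/2, 1/4)."""
    return (1, 0, -1, 0)[r & 3]
```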

Outputs:

$ time python num.py 
firstzero 27050
bothzero 11990

real    0m8.072s
user    0m8.037s
sys 0m0.024s
$ time php num.php
firstzero 27407
bothzero 12216

real    0m1.223s
user    0m1.210s
sys 0m0.012s

Edit. The second version of the script runs in just 0.646 sec:

<?php

$n = 6;
$iters = 1000;
$firstzero = 0;
$bothzero = 0;

$x=time();
$y=34353;
$z=57768;
$w=1564; //PRNG seeds

function myRand() {
    global $x;
    global $y;
    global $z;
    global $w;
    $t = $x ^ ($x << 11);
    $x = $y; $y = $z; $z = $w;
    return $w = $w ^ ($w >> 19) ^ $t ^ ($t >> 8);
}

function array_cartesian() {
    $_ = func_get_args();
    if (count($_) == 0)
        return array();
    $a = array_shift($_);
    if (count($_) == 0)
        $c = array(array());
    else
        $c = call_user_func_array(__FUNCTION__, $_);
    $r = array();
    foreach($a as $v)
        foreach($c as $p)
            $r[] = array_merge(array($v), $p);
    return $r;
}

function convolve($a, $b)
{
    // slows down
    /*if(count($a) < count($b))
        return convolve($b,$a);*/
    $result = array();
    $w = count($a) - count($b) + 1;
    for($i = 0; $i < $w; $i++){
        $r = 0;
        for($k = 0; $k < count($b); $k++)
            $r += $b[$k] * $a[$i + $k];
        $result[] = $r;
    }
    return $result;
}

$cross = call_user_func_array('array_cartesian',array_fill(0,$n+1,array(-1,1)));

$choices = call_user_func_array('array_cartesian',array_fill(0,$n,array(-1,0,0,1)));

foreach($cross as $S)
    for($i = 0; $i < $iters; $i++){
        while(true)
        {
            $F = $choices[myRand()%count($choices)];
            if(in_array(-1, $F) || in_array(1, $F))
                break;
        }
        $FS = convolve($S, $F);
        if(0==$FS[0]){
            $firstzero += 1;
            if(0==$FS[1])
                $bothzero += 1;
        }
    }

echo "firstzero $firstzero\n";
echo "bothzero $bothzero\n";

— Vitaly Dyatlov

3

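F# solution

Runtime is 0.030s when compiled to x86 on the CLR, Core i7 4 (8) @ 3.4 GHz

I have no idea whether the code is correct.

Functional optimization (inline fold) -> 0.026s
Building via a Console Project -> 0.022s
Added a better algorithm for generating the permutation arrays -> 0.018s
Mono for Windows -> 0.089s
Running Alistair's Python script -> 0.259s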

let inline ffoldi n f state =
    let mutable state = state
    for i = 0 to n - 1 do
        state <- f state i
    state

let product values n =
    let p = Array.length values
    Array.init (pown p n) (fun i ->
        (Array.zeroCreate n, i)
        |> ffoldi n (fun (result, i') j ->
            result.[j] <- values.[i' % p]
            result, i' / p
        )
        |> fst
    )

let convolute signals filter =
    let m = Array.length signals
    let n = Array.length filter
    let len = max m n - min m n + 1

    Array.init len (fun offset ->
        ffoldi n (fun acc i ->
            acc + filter.[i] * signals.[m - 1 - offset - i]
        ) 0
    )

let n = 6
let iters = 1000

let next =
    let arrays =
        product [|-1; 0; 0; 1|] n
        |> Array.filter (Array.forall ((=) 0) >> not)
    let rnd = System.Random()
    fun () -> arrays.[rnd.Next arrays.Length]

let signals = product [|-1; 1|] (n + 1)

let firstzero, bothzero =
    ffoldi signals.Length (fun (firstzero, bothzero) i ->
        let s = signals.[i]
        ffoldi iters (fun (first, both) _ ->
            let f = next()
            match convolute s f with
            | [|0; 0|] -> first + 1, both + 1
            | [|0; _|] -> first + 1, both
            | _ -> first, both
        ) (firstzero, bothzero)
    ) (0, 0)

printfn "firstzero %i" firstzero
printfn "bothzero %i" bothzero

— David Grenier
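
The product function in the F# entry builds the i-th array by reading i as an n-digit number in base p (p being the number of values), instead of nesting loops. A Python sketch of the same indexing trick; the name product_by_index is an assumption made for this illustration.

```python
def product_by_index(values, n):
    """All length-n tuples over `values`, generated from their index."""
    p = len(values)
    result = []
    for i in range(p ** n):
        digits = []
        for _ in range(n):
            i, d = divmod(i, p)       # peel off the lowest base-p digit
            digits.append(values[d])
        result.append(tuple(digits))
    return result

if __name__ == "__main__":
    print(product_by_index([-1, 1], 2))
```

The tuples come out in a different order than itertools.product (the first position varies fastest), which does not matter when they are only sampled or iterated exhaustively.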

2

Q, 0.296 s

n:6; iter:1000  /parametrization (constants)
c:n#0           /auxiliar constant (sequence 0 0.. 0 (n))
A:B:();         /A and B accumulates results of inner product (firstresult, secondresult)

/S=sequence with all arrays of length n+1 with values -1 and 1
S:+(2**m)#/:{,/x#/:-1 1}'m:|n(2*)\1 

f:{do[iter; F:c; while[F~c; F:n?-1 0 0 1]; A,:+/F*-1_x; B,:+/F*1_x];} /hard work
f'S               /map(S,f)
N:~A; +/'(N;N&~B) / ~A is not A (or A=0) ->bitmap.  +/ is sum (population over a bitmap)
                  / +/'(N;N&~B) = count firstResult=0, count firstResult=0 and secondResult=0

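Q is a collection-oriented language (kx.com)

The code was rewritten to use idiomatic Q, but without other clever optimizations

Scripting languages optimize programmer time, not execution time

Q is not the best tool for this problem

First coding attempt = not a winner, but a reasonable time (approx. 30x speedup)

Quite competitive among interpreters
Stop, and choose another problem

Notes:

The program uses the default seed (repeatable runs). To choose another seed for the random generator, use \S seed
The result is given as a pair of two ints, so there is a final i suffix on the second value: 27421 12133i -> read as (27421, 12133)
Interpreter start-up time is not counted. \t sentence measures the time consumed by that sentence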

— J. Sendra

Very interesting, thank you.

1

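Julia: 6.929 s (12.149 s including JIT compilation)

Despite their claims to speed, the initial JIT compilation time holds us back!

Note that the following Julia code is effectively a direct translation of the original Python code (no optimizations made), as a demonstration that you can easily transfer your programming experience to a faster language ;)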

require("Iterators")

n = 6
iters = 1000
firstzero = 0
bothzero = 0

for S in Iterators.product(fill([-1,1], n+1)...)
    for i = 1:iters
        F = [[-1 0 0 1][rand(1:4)] for _ = 1:n]
        while all((x) -> round(x,8) == 0, F)
            F = [[-1 0 0 1][rand(1:4)] for _ = 1:n]
        end
        FS = conv(F, [S...])
        if round(FS[1],8) == 0
            firstzero += 1
        end
        if all((x) -> round(x,8) == 0, FS)
            bothzero += 1
        end
    end
end

println("firstzero ", firstzero)
println("bothzero ", bothzero)

Edit

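Running with n = 8 takes 32.935 s. Considering that the complexity of this algorithm is O(2^n), we have 4 * (12.149 - C) = (32.935 - C), where C is a constant representing the JIT compilation time. Solving for C gives C = 5.2203, which suggests that the actual execution time for n = 6 is 6.929 s.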
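
The estimate can be reproduced with a few lines of arithmetic. The sketch below assumes, as the answer does, that the run time is a fixed overhead C plus work proportional to 2^n; the function name is made up for this illustration.

```python
def split_overhead(t_small, t_large, n_small, n_large):
    """Solve ratio * (t_small - C) = t_large - C for the fixed overhead C."""
    ratio = 2 ** (n_large - n_small)       # O(2^n) work: n = 6 -> 8 is 4x
    overhead = (ratio * t_small - t_large) / (ratio - 1)
    return overhead, t_small - overhead

if __name__ == "__main__":
    c, run = split_overhead(12.149, 32.935, 6, 8)
    print("overhead C: %.4f s, execution at n = 6: %.3f s" % (c, run))
```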

— 灵活的琼脂

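How about increasing n to 8 to see whether Julia comes into its own then?

This ignores many of the performance tips here: julia.readthedocs.org/en/latest/manual/performance-tips. See also the other Julia entry, which does much better. Thanks for the submission, though.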

— 值得

0

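Rust, 6.6 ms, 1950x speedup

A fairly direct translation of Alistair Buxton's code to Rust. I considered using multiple cores with rayon (fearless concurrency!), but this did not improve performance, probably because it is already very fast.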

extern crate itertools;
extern crate rand;
extern crate time;

use itertools::Itertools;
use rand::{prelude::*, prng::XorShiftRng};
use std::iter;
use time::precise_time_ns;

fn main() {
    let start = precise_time_ns();

    let n = 6;
    let iters = 1000;
    let mut first_zero = 0;
    let mut both_zero = 0;
    let choices_f: Vec<Vec<i8>> = iter::repeat([-1, 0, 0, 1].iter().cloned())
        .take(n)
        .multi_cartesian_product()
        .filter(|i| i.iter().any(|&x| x != 0))
        .collect();
    // xorshift RNG is faster than default algorithm designed for security
    // rather than performance.
    let mut rng = XorShiftRng::from_entropy(); 
    for s in iter::repeat(&[-1, 1]).take(n + 1).multi_cartesian_product() {
        for _ in 0..iters {
            let f = rng.choose(&choices_f).unwrap();
            if f.iter()
                .zip(&s[..s.len() - 1])
                .map(|(a, &b)| a * b)
                .sum::<i8>() == 0
            {
                first_zero += 1;
                if f.iter().zip(&s[1..]).map(|(a, &b)| a * b).sum::<i8>() == 0 {
                    both_zero += 1;
                }
            }
        }
    }
    println!("first_zero = {}\nboth_zero = {}", first_zero, both_zero);

    println!("runtime {} ns", precise_time_ns() - start);
}

And Cargo.toml, since I use external dependencies:

[package]
name = "how_slow_is_python"
version = "0.1.0"

[dependencies]
itertools = "0.7.8"
rand = "0.5.3"
time = "0.1.40"

Speed comparison:

$ time python2 py.py
firstzero: 27478
bothzero: 12246
12.80user 0.02system 0:12.90elapsed 99%CPU (0avgtext+0avgdata 23328maxresident)k
0inputs+0outputs (0major+3544minor)pagefaults 0swaps
$ time target/release/how_slow_is_python
first_zero = 27359
both_zero = 12162
runtime 6625608 ns
0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 2784maxresident)k
0inputs+0outputs (0major+189minor)pagefaults 0swaps

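6625608 ns is about 6.6 ms, which means a 1950x speedup. Many optimizations are possible here, but I was going for readability rather than performance. One possible optimization would be to store the choices in arrays instead of vectors, since they always have n elements. It is also possible to use an RNG other than XorShift: XorShift is faster than the default HC-128 CSPRNG, but slower than the most naive PRNG algorithms.

— Konrad Borowski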

How slow is Python really? (Or how fast is your language?)
