18

Given an array of integers like

[1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]

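I need to mask elements that repeat more than N times. To clarify: the primary goal is to retrieve the boolean mask array, to use it later on for binning calculations.

I came up with a rather complicated solution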

import numpy as np

bins = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5])

N = 3
splits = np.split(bins, np.where(np.diff(bins) != 0)[0]+1)
mask = []
for s in splits:
    if s.shape[0] <= N:
        mask.append(np.ones(s.shape[0]).astype(np.bool_))
    else:
        mask.append(np.append(np.ones(N), np.zeros(s.shape[0]-N)).astype(np.bool_)) 

mask = np.concatenate(mask)

giving e.g.

bins[mask]
Out[90]: array([1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5])

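Is there a better way to do this?

EDIT #2

Thanks a lot for the answers! Here's a slim version of MSeifert's benchmark plot. Thanks for pointing me to simple_benchmark. Showing only the 4 fastest options:

Conclusion

The idea proposed by Florian H, modified by Paul Panzer, seems to be a great way of solving this problem, as it is pretty straightforward and numpy-only. If you're fine with using numba, however, MSeifert's solution outperforms the others.

I chose to accept MSeifert's answer as the solution because it is the more general answer: it correctly handles arbitrary arrays with (non-unique) blocks of consecutive repeating elements. In case numba is a no-go, Divakar's answer is also worth a look!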

— MrFuppes

1

Is the input guaranteed to be sorted?

— user2357112 supports Monica

1

In my specific case, yes. In general, it would be good to consider the case of an unsorted input (and non-unique blocks of repeated elements).

— MrFuppes
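Since several of the approaches in this thread rely on the input being sorted, or at least on every value appearing in only one block, a quick check can help decide which one is safe to use. This is only a sketch; the helper names `is_sorted` and `has_unique_blocks` are made up for illustration and do not come from any answer here.

```python
import numpy as np

def is_sorted(a):
    # True if the values never decrease from one element to the next.
    a = np.asarray(a)
    return bool(np.all(a[1:] >= a[:-1]))

def has_unique_blocks(a):
    # True if every value occurs in exactly one block of consecutive repeats.
    a = np.asarray(a)
    if a.size == 0:
        return True
    # First element of each block of consecutive equal values.
    block_values = a[np.r_[True, a[1:] != a[:-1]]]
    return bool(np.unique(block_values).size == block_values.size)

bins = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5])
print(is_sorted(bins), has_unique_blocks(bins))
print(has_unique_blocks([1, 1, 1, 1, 2, 2, 1, 1, 2, 2]))
```

If both checks pass, the simpler shifted-comparison style of mask should be applicable; otherwise a run-based approach is probably the safer choice.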

4

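I want to present a solution using numba which should be fairly easy to understand. I assume that you want to "mask" consecutive repeating items: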

import numpy as np
import numba as nb

@nb.njit
def mask_more_n(arr, n):
    mask = np.ones(arr.shape, np.bool_)

    current = arr[0]
    count = 0
    for idx, item in enumerate(arr):
        if item == current:
            count += 1
        else:
            current = item
            count = 1
        mask[idx] = count <= n
    return mask

For example:

>>> bins = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5])
>>> bins[mask_more_n(bins, 3)]
array([1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5])
>>> bins[mask_more_n(bins, 2)]
array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])

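Performance:

Using simple_benchmark, although I haven't included all approaches. The plot uses a log-log scale:

It seems like the numba solution cannot beat the solution from Paul Panzer, which seems to be a bit faster for large arrays (and doesn't require an additional dependency).

However, both seem to outperform the other solutions, although they return a mask instead of the "filtered" array.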

import numpy as np
import numba as nb
from simple_benchmark import BenchmarkBuilder, MultiArgument

b = BenchmarkBuilder()

bins = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5])

@nb.njit
def mask_more_n(arr, n):
    mask = np.ones(arr.shape, np.bool_)

    current = arr[0]
    count = 0
    for idx, item in enumerate(arr):
        if item == current:
            count += 1
        else:
            current = item
            count = 1
        mask[idx] = count <= n
    return mask

@b.add_function(warmups=True)
def MSeifert(arr, n):
    return mask_more_n(arr, n)

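# binary_dilation is importable from scipy.ndimage directly; the
# scipy.ndimage.morphology namespace is deprecated.
from scipy.ndimage import binary_dilation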

@b.add_function()
def Divakar_1(a, N):
    k = np.ones(N,dtype=bool)
    m = np.r_[True,a[:-1]!=a[1:]]
    return a[binary_dilation(m,k,origin=-(N//2))]

@b.add_function()
def Divakar_2(a, N):
    k = np.ones(N,dtype=bool)
    return a[binary_dilation(np.ediff1d(a,to_begin=a[0])!=0,k,origin=-(N//2))]

@b.add_function()
def Divakar_3(a, N):
    m = np.r_[True,a[:-1]!=a[1:],True]
    idx = np.flatnonzero(m)
    c = np.diff(idx)
    return np.repeat(a[idx[:-1]],np.minimum(c,N))

from skimage.util import view_as_windows

@b.add_function()
def Divakar_4(a, N):
    m = np.r_[True,a[:-1]!=a[1:]]
    w = view_as_windows(m,N)
    idx = np.flatnonzero(m)
    v = idx<len(w)
    w[idx[v]] = 1
    if v.all()==0:
        m[idx[v.argmin()]:] = 1
    return a[m]

@b.add_function()
def Divakar_5(a, N):
    m = np.r_[True,a[:-1]!=a[1:]]
    w = view_as_windows(m,N)
    last_idx = len(a)-m[::-1].argmax()-1
    w[m[:-N+1]] = 1
    m[last_idx:last_idx+N] = 1
    return a[m]

@b.add_function()
def PaulPanzer(a,N):
    mask = np.empty(a.size,bool)
    mask[:N] = True
    np.not_equal(a[N:],a[:-N],out=mask[N:])
    return mask

import random

@b.add_arguments('array size')
def argument_provider():
    for exp in range(2, 20):
        size = 2**exp
        yield size, MultiArgument([np.array([random.randint(0, 5) for _ in range(size)]), 3])

r = b.run()
import matplotlib.pyplot as plt

plt.figure(figsize=[10, 8])
r.plot()

— MSeifert

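"It seems like the numba solution cannot beat the solution from Paul Panzer": arguably it is faster for a decent range of sizes. And it is more powerful. I couldn't make mine (well, @FlorianH's) work for non-unique block values without making it much slower. Interestingly, even when replicating Florian's method with pythran (which typically performs similarly to numba), I couldn't match the numpy implementation for large arrays. pythran doesn't like the out argument (or perhaps the functional form of the operator), so I couldn't save that copy. By the way, I quite like simple_benchmark.

— Paul Panzer

Great hint, using simple_benchmark! Thanks for that, and thanks of course for the answer. Since I'm using numba for other things as well, I'm inclined to use it here too and make this the solution. Between a rock and a hard place there...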

— MrFuppes

7

Disclaimer: this is just a sounder implementation of @FlorianH's idea:

def f(a,N):
    mask = np.empty(a.size,bool)
    mask[:N] = True
    np.not_equal(a[N:],a[:-N],out=mask[N:])
    return mask

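For larger arrays this makes a huge difference:

from timeit import timeit

a = np.arange(1000).repeat(np.random.randint(0,10,1000))
N = 3

# timeit returns the total seconds for 1000 calls; *1000 gives microseconds per call
print(timeit(lambda:f(a,N),number=1000)*1000,"us")
# 5.443050000394578 us

# compare to the list-based version, run on the same array
print(timeit(lambda:[True for _ in range(N)] + list(a[:-N] != a[N:]),number=1000)*1000,"us")
# 76.18969900067896 us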

— Paul Panzer

I think it doesn't work for arbitrary arrays: for example with [1,1,1,1,2,2,1,1,2,2].

— MSeifert
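To make the objection concrete, here is a small check of the shifted-comparison mask on that array with N = 3. The function body mirrors the answer above; the name `shift_mask` and the hand-written `expected` mask (each run of consecutive equal values capped at 3 elements) are just for illustration.

```python
import numpy as np

def shift_mask(a, n):
    # Keep element i when it differs from the element n positions earlier.
    mask = np.empty(a.size, bool)
    mask[:n] = True
    np.not_equal(a[n:], a[:-n], out=mask[n:])
    return mask

a = np.array([1, 1, 1, 1, 2, 2, 1, 1, 2, 2])
got = shift_mask(a, 3)

# Mask one would want if every run of consecutive equal values is capped at 3.
expected = np.array([True, True, True, False, True, True, True, True, True, True])

mismatch = np.flatnonzero(got != expected)
print(mismatch)  # [6 8]
```

Positions 6 and 8 are dropped although their runs only have two elements, because the value three positions back happens to be equal even though it belongs to an earlier block.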

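@MSeifert From OP's example I assumed this kind of thing can't happen, but you are correct that OP's actual code can handle your example. Well, only OP can tell, I suppose.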

— Paul Panzer

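As I replied to user2357112's comment, in my specific case the input is sorted and the blocks of consecutive repeating elements are unique. However, from a more general perspective, it could be very useful if one could handle arbitrary arrays.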

— MrFuppes
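For the general case without numba, one possible NumPy-only sketch is to compute each element's position inside its run of consecutive equal values and keep positions below N. This is not taken from any of the answers, the name `mask_runs` is hypothetical, and it has not been benchmarked against the other options.

```python
import numpy as np

def mask_runs(arr, n):
    arr = np.asarray(arr)
    if arr.size == 0:
        return np.ones(0, dtype=bool)
    # True at the first element of every run of consecutive equal values.
    starts = np.r_[True, arr[1:] != arr[:-1]]
    # For each element, the index at which its run starts.
    run_start = np.flatnonzero(starts)[np.cumsum(starts) - 1]
    # Keep an element if its offset inside its run is below n.
    return np.arange(arr.size) - run_start < n

bins = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5])
print(bins[mask_runs(bins, 3)])

unsorted = np.array([1, 1, 1, 1, 2, 2, 1, 1, 2, 2])
print(mask_runs(unsorted, 3))
```

It allocates a few temporary arrays of the input's size, so it is likely slower than the shifted comparison on sorted input, but it does not depend on sortedness or on block values being unique.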

4

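Approach #1: Here's a vectorized way -

# binary_dilation is importable from scipy.ndimage directly; the
# scipy.ndimage.morphology namespace is deprecated.
from scipy.ndimage import binary_dilation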

def keep_N_per_group(a, N):
    k = np.ones(N,dtype=bool)
    m = np.r_[True,a[:-1]!=a[1:]]
    return a[binary_dilation(m,k,origin=-(N//2))]

Sample run -

In [42]: a
Out[42]: array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5])

In [43]: keep_N_per_group(a, N=3)
Out[43]: array([1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5])

Approach #2: A bit more compact version -

def keep_N_per_group_v2(a, N):
    k = np.ones(N,dtype=bool)
    return a[binary_dilation(np.ediff1d(a,to_begin=a[0])!=0,k,origin=-(N//2))]

Approach #3: Using the grouped counts and np.repeat (won't give us the mask though) -

def keep_N_per_group_v3(a, N):
    m = np.r_[True,a[:-1]!=a[1:],True]
    idx = np.flatnonzero(m)
    c = np.diff(idx)
    return np.repeat(a[idx[:-1]],np.minimum(c,N))

Approach #4: With a view-based method -

from skimage.util import view_as_windows

def keep_N_per_group_v4(a, N):
    m = np.r_[True,a[:-1]!=a[1:]]
    w = view_as_windows(m,N)
    idx = np.flatnonzero(m)
    v = idx<len(w)
    w[idx[v]] = 1
    if v.all()==0:
        m[idx[v.argmin()]:] = 1
    return a[m]

Approach #5: With a view-based method without indices from flatnonzero -

def keep_N_per_group_v5(a, N):
    m = np.r_[True,a[:-1]!=a[1:]]
    w = view_as_windows(m,N)
    last_idx = len(a)-m[::-1].argmax()-1
    w[m[:-N+1]] = 1
    m[last_idx:last_idx+N] = 1
    return a[m]

— Divakar

2

You could do this with indexing. For any N the code would be:

N = 3
bins = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5,6])

mask = [True for _ in range(N)] + list(bins[:-N] != bins[N:])
bins[mask]

Output:

array([1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6])

— Florian H

Really like that for its simplicity! It should be pretty performant as well; I will check with some timeit runs.

— MrFuppes

1

A much nicer way would be to use numpy's unique() function. You will get the unique entries in your array and also the count of how often they appear:

bins = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5])
N = 3

unique, index,count = np.unique(bins, return_index=True, return_counts=True)
mask = np.full(bins.shape, True, dtype=bool)
for i,c in zip(index,count):
    if c>N:
        mask[i+N:i+c] = False

bins[mask]

Output:

array([1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5])

— Simon Fink

1

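You could use a while loop which checks whether the array element N positions back is equal to the current one. Note that this solution assumes the array is ordered.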

import numpy as np

bins = [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
N = 3
counter = N

while counter < len(bins):
    drop_condition = (bins[counter] == bins[counter - N])
    if drop_condition:
        bins = np.delete(bins, counter)
    else:
        # move on to next element
        counter += 1

— 独轮车

You might want to change len(question) to len(bins)

— Florian H

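Sorry if my question was unclear. I'm not looking to remove elements; I just need a mask that I can use later on (e.g. masking a dependent variable to get an equal number of samples per bin).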

— MrFuppes
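As an illustration of that use case, the sketch below builds a mask with the shifted comparison (which assumes sorted input, as in the question) and applies it to both `bins` and a dependent variable. The array `y` is made-up data purely for the example.

```python
import numpy as np

bins = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5])
# Hypothetical dependent variable with one sample per entry of bins.
y = np.arange(bins.size) * 0.5

n = 3
mask = np.ones(bins.size, dtype=bool)
# Drop an element when the element n positions earlier has the same value.
mask[n:] = bins[n:] != bins[:-n]

bins_kept, y_kept = bins[mask], y[mask]

# Samples remaining per bin: at most n each.
values, counts = np.unique(bins_kept, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
```

Note that bins with fewer than n samples stay as they are (bin 1 has only two here), so the counts are capped at n rather than strictly equal.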

0

You could use groupby to group common elements and trim any group that is longer than N.

import numpy as np
from itertools import groupby, chain

def ifElse(condition, exec1, exec2):

    if condition : return exec1 
    else         : return exec2


def solve(bins, N = None):

    xss = groupby(bins)
    xss = map(lambda xs : list(xs[1]), xss)
    xss = map(lambda xs : ifElse(len(xs) > N, xs[:N], xs), xss)
    xs  = chain.from_iterable(xss)
    return list(xs)

bins = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5])
solve(bins, N = 3)

— 杨锡俊
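Since the question asks for a mask rather than the filtered values, the same groupby idea can be adapted to emit booleans instead. This is only a sketch; `groupby_mask` is a hypothetical name, and a pure-Python loop like this will probably be slow for large arrays.

```python
from itertools import groupby

import numpy as np

def groupby_mask(values, n):
    mask = []
    # groupby yields one group per run of consecutive equal values.
    for _, group in groupby(values):
        length = sum(1 for _ in group)
        kept = min(length, n)
        # First `kept` elements of the run stay, the rest are masked out.
        mask.extend([True] * kept + [False] * (length - kept))
    return np.array(mask, dtype=bool)

bins = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5])
mask = groupby_mask(bins, 3)
print(bins[mask])
```

Because groupby works on consecutive runs, this variant also handles unsorted input with repeated block values.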

0

Solution

You could use numpy.unique. The variable final_mask can be used to extract the target elements from the array bins.

import numpy as np

bins = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5])
repeat_max = 3

unique, counts = np.unique(bins, return_counts=True)
mod_counts = np.array([x if x<=repeat_max else repeat_max for x in counts])
mask = np.arange(bins.size)
#final_values = np.hstack([bins[bins==value][:count] for value, count in zip(unique, mod_counts)])
final_mask = np.hstack([mask[bins==value][:count] for value, count in zip(unique, mod_counts)])
bins[final_mask]

Output:

array([1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5])

— CypherX

That would require an additional step to get a mask of the same shape as bins, right?

— MrFuppes

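True: only if you're interested in getting the mask first. If you want the final_values directly, you could uncomment the only commented line in the solution, and in that case you could discard three lines: mask = ..., final_mask = ... and bins[final_mask].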

— CypherX
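For reference, the additional step mentioned above could look roughly like this: `final_mask` from the answer is an array of integer positions, so a boolean mask of the same shape as `bins` can be obtained by setting those positions to True. The name `bool_mask` is just for illustration.

```python
import numpy as np

bins = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5])
repeat_max = 3

# Index-based selection, following the approach in the answer above.
unique, counts = np.unique(bins, return_counts=True)
positions = np.arange(bins.size)
final_mask = np.hstack([positions[bins == value][:min(count, repeat_max)]
                        for value, count in zip(unique, counts)])

# Extra step: turn the integer positions into a boolean mask shaped like bins.
bool_mask = np.zeros(bins.shape, dtype=bool)
bool_mask[final_mask] = True

print(bool_mask.shape == bins.shape, bins[bool_mask])
```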

numpy 1D array: mask elements that repeat more than n times
