Find unique rows in numpy.array

199

I need to find the unique rows in a numpy.array.

For example:

>>> a # I have
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])
>>> new_a # I want to get to
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 1, 0]])

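I know that I can create a set and loop over the array, but I am looking for an efficient pure numpy solution. I believe there is a way to set the data type to void and then just use numpy.unique, but I couldn't figure out how to make it work.

— Akavall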
source

11

pandas has a dataframe.drop_duplicates() method. See stackoverflow.com/questions/12322779/pandas-unique-dataframe and pandas.pydata.org/pandas-docs/dev/generated/...

— codeape

Thanks, but I can't use pandas.

— Akavall 2013

2

Removing ... in each row of a numpy array

— Andy Hayden

1

@Andy Hayden, despite the title, that is not a duplicate of this question. codeape's link is a duplicate, though.

— Wai Yip Tung

5

This feature natively ...

— Eric

114

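As of NumPy 1.13, you can simply pass the axis argument to select unique values along any axis of an N-dimensional array. To get the unique rows, you can do: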

unique_rows = np.unique(original_array, axis=0)

— aiwabdn
source

12

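Be careful with this function. np.unique(list_cor, axis=0) gives you the array with duplicate rows removed; it does not filter the array down to the rows that occur only once in the original array. For an example, see here.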
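A minimal sketch of that distinction, using the array from the question. The `return_counts` filter is an illustration added here, not part of the original comment:

```python
import numpy as np

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

# Duplicate rows removed: every distinct row appears once, in sorted order.
deduplicated = np.unique(a, axis=0)

# Rows that occur exactly once in the original array.
rows, counts = np.unique(a, axis=0, return_counts=True)
occurring_once = rows[counts == 1]

print(deduplicated)
print(occurring_once)
```

Here only the last row occurs a single time, so `occurring_once` has one row while `deduplicated` has three.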

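— Brad Solomon

Note that if you want unique rows regardless of the order of the values within each row, you can first sort the original array in place along the columns: original_array.sort(axis=1)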
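A short sketch of this idea. It assumes you would rather keep the original array untouched, so it uses `np.sort` (which returns a copy) instead of the in-place `sort` method; the sample data is made up for illustration:

```python
import numpy as np

original_array = np.array([[1, 2, 3],
                           [3, 2, 1],
                           [4, 5, 6]])

# np.sort returns a sorted copy, so original_array itself is left untouched
# (original_array.sort(axis=1) would sort it in place instead).
sorted_within_rows = np.sort(original_array, axis=1)
unique_ignoring_order = np.unique(sorted_within_rows, axis=0)

print(unique_ignoring_order)
```

The rows [1, 2, 3] and [3, 2, 1] collapse into one because they hold the same values.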

— mangecoeur

139

Yet another possible solution

np.vstack({tuple(row) for row in a})

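— Greg von Winckel
source

20

+1 This is clear, short and pythonic. Unless speed is a real issue, this type of solution should, IMO, be preferred over the complex, higher-voted answers to this question.

— Bill Cheatham 2014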

3

Excellent! The curly braces or the set() function do the trick.

— Tian He

2

@Greg von Winckel Can you suggest something?

— Laschet Jain

Yes, but not in a single command: x = []; [x.append(tuple(r)) for r in a if tuple(r) not in x]; a_unique = array(x);
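A hedged alternative sketch of the same order-preserving idea. Instead of the linear `tuple(r) not in x` scan, it relies on `dict.fromkeys`, which keeps the first occurrence of each key in insertion order on Python 3.7+; this swap is an assumption of this example, not the commenter's code:

```python
import numpy as np

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

# Each row becomes a hashable tuple; dict keys keep the first occurrence
# of every row in the order in which the rows appear in `a`.
a_unique = np.array(list(dict.fromkeys(map(tuple, a))))

print(a_unique)
```

The result lists the rows in the order of their first appearance, which matches new_a from the question.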

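— Greg von Winckel

1

To avoid a FutureWarning, convert the set to a list like this: np.vstack(list({tuple(row) for row in AIPbiased[i, :, :]})) FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.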

— leermeester

111

Another option to the use of structured arrays is using a view of a void type that joins the whole row into a single item:

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

b = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, idx = np.unique(b, return_index=True)

unique_a = a[idx]

>>> unique_a
array([[0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])

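EDIT Added np.ascontiguousarray following @seberg's recommendation. This will slow the method down if the array is not already contiguous.

EDIT The above can be slightly sped up, perhaps at the cost of clarity, by doing: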

unique_a = np.unique(b).view(a.dtype).reshape(-1, a.shape[1])

Also, at least on my system, performance-wise it is on par with, or even better than, the lexsort method:

a = np.random.randint(2, size=(10000, 6))

%timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
100 loops, best of 3: 3.17 ms per loop

%timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
100 loops, best of 3: 5.93 ms per loop

a = np.random.randint(2, size=(10000, 100))

%timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
10 loops, best of 3: 29.9 ms per loop

%timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
10 loops, best of 3: 116 ms per loop

— Jaime
source

3

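Thanks a lot. This is the answer I was looking for. Can you explain what is going on in this step: b = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))?

— Akavall

3

@Akavall It is creating a view of your data with an np.void data type whose size is the number of bytes in a full row. It's similar to what you get if you have an array of np.uint8s and view it as np.uint16s, which combines every two columns into a single one, but more flexible.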
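A small sketch of that analogy. The names `small`, `as_uint16` and `row_bytes` are illustrative and not taken from the original comment:

```python
import numpy as np

# Two uint8 columns viewed as one uint16 column: each pair of bytes becomes one item.
small = np.array([[1, 0],
                  [2, 0]], dtype=np.uint8)
as_uint16 = small.view(np.uint16)
print(as_uint16.shape)   # (2, 1)

# The same idea with np.void: one opaque item per row, sized to hold the whole row.
a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0]])
row_bytes = a.dtype.itemsize * a.shape[1]
b = np.ascontiguousarray(a).view(np.dtype((np.void, row_bytes)))
print(b.shape, b.dtype.itemsize == row_bytes)
```

In both cases the number of rows stays the same and the last axis shrinks to a single item per row.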

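— 2013

3

@Jaime, can you add an np.ascontiguousarray or similar to be generally safe (I know it is a bit more restrictive than necessary, but...). The rows must be contiguous for the view to work as expected.

— seberg 2013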

2

@ConstantineEvans It is a recent addition: in numpy 1.6, trying to run np.unique on an array of np.void returns an error related to mergesort not being implemented for that type. It works fine in 1.7.

— Jaime

9

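It's worth noting that if this method is used for floating point numbers, there is a catch: -0. will not compare as equal to +0., whereas an element-by-element comparison gives -0. == +0. (as specified by the IEEE float standard). See stackoverflow.com/questions/26782038/...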
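A hedged sketch of this catch. The helper name `unique_rows_void` is made up for the example, and adding 0.0 to normalise the sign of zero is one possible workaround (it is also what another answer on this page does), not the only one:

```python
import numpy as np

def unique_rows_void(arr):
    # Collapse each row into one np.void item and compare the raw bytes.
    arr = np.ascontiguousarray(arr)
    row_view = arr.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[1])))
    _, idx = np.unique(row_view, return_index=True)
    return arr[np.sort(idx)]

a = np.array([[0.0, 1.0],
              [-0.0, 1.0]])

print(np.array_equal(a[0], a[1]))      # element-wise, the two rows are equal
print(len(unique_rows_void(a)))        # byte-wise, they are two different rows
print(len(unique_rows_void(a + 0.0)))  # -0.0 + 0.0 is +0.0, so one row remains
```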

— tom10

29

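If you want to avoid the memory expense of converting to a series of tuples or another similar data structure, you can exploit numpy's structured arrays.

The trick is to view your original array as a structured array where each item corresponds to a row of the original array. This doesn't make a copy, and is quite efficient.

As a quick example: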

import numpy as np

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]])

ncols = data.shape[1]
dtype = data.dtype.descr * ncols
struct = data.view(dtype)

uniq = np.unique(struct)
uniq = uniq.view(data.dtype).reshape(-1, ncols)
print(uniq)

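To understand what's going on, have a look at the intermediate results.

Once we view things as a structured array, each element in the array is a row of the original array. (Basically, it's a data structure similar to a list of tuples.)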

In [71]: struct
Out[71]:
array([[(1, 1, 1, 0, 0, 0)],
       [(0, 1, 1, 1, 0, 0)],
       [(0, 1, 1, 1, 0, 0)],
       [(1, 1, 1, 0, 0, 0)],
       [(1, 1, 1, 1, 1, 0)]],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])

In [72]: struct[0]
Out[72]:
array([(1, 1, 1, 0, 0, 0)],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])

Once we run numpy.unique, we get a structured array back:

In [73]: np.unique(struct)
Out[73]:
array([(0, 1, 1, 1, 0, 0), (1, 1, 1, 0, 0, 0), (1, 1, 1, 1, 1, 0)],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])

We then need to view it as a "normal" array (_ stores the result of the last calculation in ipython, which is why you're seeing _.view...):

In [74]: _.view(data.dtype)
Out[74]: array([0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0])

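And then reshape it back into a 2D array (-1 is a placeholder that tells numpy to calculate the correct number of rows, given the number of columns):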

In [75]: _.reshape(-1, ncols)
Out[75]:
array([[0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])

Obviously, if you wanted to be more concise, you could write it as:

import numpy as np

def unique_rows(data):
    uniq = np.unique(data.view(data.dtype.descr * data.shape[1]))
    return uniq.view(data.dtype).reshape(-1, data.shape[1])

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]])
print(unique_rows(data))

Which results in:

[[0 1 1 1 0 0]
 [1 1 1 0 0 0]
 [1 1 1 1 1 0]]
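A sketch of the same function for input that is not contiguous, for example a column slice. It assumes that wrapping the input in `np.ascontiguousarray` is acceptable, which copies the data only when needed; the name `unique_rows_struct` and the sample array are illustrative:

```python
import numpy as np

def unique_rows_struct(data):
    # Assumption: copy to a contiguous buffer when necessary so that the
    # structured view below is valid.
    data = np.ascontiguousarray(data)
    ncols = data.shape[1]
    struct = data.view(data.dtype.descr * ncols)
    uniq = np.unique(struct)
    return uniq.view(data.dtype).reshape(-1, ncols)

wide = np.array([[1, 9, 1, 9],
                 [0, 9, 1, 9],
                 [1, 9, 1, 9]])
every_other_column = wide[:, ::2]   # a non-contiguous view of columns 0 and 2
result = unique_rows_struct(every_other_column)
print(result)
```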

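— Joe Kington
source

This actually seems very slow, almost as slow as using tuples. Sorting a structured array like this is slow, apparently.

— cge

3

@cge - Try it with larger arrays. Yes, sorting a numpy array is slower than sorting a list. Speed isn't the main consideration in most cases, though. It's memory usage. A list of tuples will use vastly more memory than this solution. Even if you have enough memory, with a reasonably large array, converting it to a list of tuples has a greater overhead than the speed advantage.

— Joe Kington 2013

@cge - Ah, I didn't notice you were using lexsort. I thought you were referring to using a list of tuples. Yes, lexsort is probably the better option in this case. I'd forgotten about it, and jumped to an overly complex solution.

— Joe Kington 2013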

20

np.unique, when I run it on np.random.random(100).reshape(10,10), returns all the unique individual elements, but you want the unique rows, so first you need to put them into tuples:

array = #your numpy array of lists
new_array = [tuple(row) for row in array]
uniques = np.unique(new_array)

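That is the only way I see of changing the types to do what you want, and I am not sure whether the list iteration to change to tuples is okay with your "not looping through".

— Ryan Saxe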
source

5

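+1 This is clear, short and pythonic. Unless speed is a real issue, this type of solution should, IMO, be preferred over the complex, higher-voted answers to this question.

— Bill Cheatham 2014

I prefer this over the accepted solution. Speed isn't an issue for me because I only have perhaps < 100 rows per invocation. This describes precisely how performing unique over rows is done.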

— rayryeng

4

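This actually doesn't work for my data; uniques contains unique elements. Potentially I misunderstand the expected shape of array - could you be more precise here?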

— FooBar

@ryan-saxe I like that this is pythonic, but it is not a good solution because the rows returned in uniques are sorted (and therefore different from the rows in array). B = np.array([[1,2],[2,1]]); A = np.unique([tuple(row) for row in B]); print(A) = array([[1, 2],[1, 2]])

— jmlarson '16

16

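np.unique works by sorting a flattened array, then looking at whether each item is equal to the previous one. This can be done manually without flattening: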

ind = np.lexsort(a.T)
a[ind[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]]

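This method does not use tuples, and should be much faster and simpler than the other methods given here.

NOTE: A previous version of this did not have the ind right after a[, which meant that the wrong indices were used. Also, Joe Kington makes a good point that this does make a variety of intermediate copies. The following method makes fewer, by making a sorted copy and then using views of it: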

b = a[np.lexsort(a.T)]
b[np.concatenate(([True], np.any(b[1:] != b[:-1],axis=1)))]

This is faster and uses less memory.
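The sorted-copy variant can be wrapped in a small function, sketched below. The name `unique_rows_lexsort` is illustrative. Note that np.lexsort treats the last key as the primary one, so the rows come back in a different order than from np.unique(a, axis=0), although the set of rows is the same:

```python
import numpy as np

def unique_rows_lexsort(a):
    # With a.T as keys, the last column is the primary sort key.
    b = a[np.lexsort(a.T)]
    # Keep the first row and every row that differs from its predecessor.
    keep = np.concatenate(([True], np.any(b[1:] != b[:-1], axis=1)))
    return b[keep]

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

result = unique_rows_lexsort(a)
print(result)
```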

Also, if you want to find unique rows in an ndarray regardless of how many dimensions the array has, the following will work:

b = a[np.lexsort(a.reshape((a.shape[0], -1)).T)]
b[np.concatenate(([True], np.any(b[1:]!=b[:-1],axis=tuple(range(1,a.ndim)))))]

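An interesting remaining issue is sorting/unique along an arbitrary axis of an arbitrary-dimension array, which would be more difficult.

Edit:

To demonstrate the speed differences, I ran a few tests in ipython of the three different methods described in the answers. With your exact a, there isn't too much of a difference, though this version is a bit faster: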

In [87]: %timeit unique(a.view(dtype)).view('<i8')
10000 loops, best of 3: 48.4 us per loop

In [88]: %timeit ind = np.lexsort(a.T); a[np.concatenate(([True], np.any(a[ind[1:]]!= a[ind[:-1]], axis=1)))]
10000 loops, best of 3: 37.6 us per loop

In [89]: %timeit b = [tuple(row) for row in a]; np.unique(b)
10000 loops, best of 3: 41.6 us per loop

With a larger a, however, this version ends up being much, much faster:

In [96]: a = np.random.randint(0,2,size=(10000,6))

In [97]: %timeit unique(a.view(dtype)).view('<i8')
10 loops, best of 3: 24.4 ms per loop

In [98]: %timeit b = [tuple(row) for row in a]; np.unique(b)
10 loops, best of 3: 28.2 ms per loop

In [99]: %timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!= a[ind[:-1]],axis=1)))]
100 loops, best of 3: 3.25 ms per loop

— cge
source

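Very nice! On a side note, it does make several intermediate copies, though. (e.g. a[ind[1:]] is a copy, etc.) On the other hand, your solution is generally 2-3x faster than mine up until you run out of RAM.

— Joe Kington 2013

Good point. As it turns out, my attempt to take out intermediate copies by using just the indexes made my method use more memory and end up slower than just making a sorted copy of the array, as a_sorted[1:] isn't a copy of a_sorted.

— cge

What is dtype in your timings? I think you got that one wrong. On my system, calling np.unique as described in my answer is slightly faster than using either of your two flavors of np.lexsort. And it is about 5x faster if the array to find uniques in has shape (10000, 100). Even if you decide to reimplement what np.unique does to trim some (minor) execution time, collapsing every row into a single object makes for faster comparisons than having to call np.any on the comparison of the columns, especially for higher column counts.

— 2013

@cge: you probably meant 'np.any' instead of the standard 'any', which doesn't take keyword arguments.

— M. Toya 2013

@Jaime - I believe dtype is just a.dtype, i.e. the data type of the data being viewed, as was done by Joe Kington in his answer. If there are many columns, another (imperfect!) way to keep things fast using lexsort is to only sort on a few columns. This is data-specific, as one needs to know which columns provide enough variance to sort perfectly. E.g. with a.shape = (60000, 500), sort on the first 3 columns: ind = np.lexsort((a[:, 2], a[:, 1], a[:, 0])). The time savings are fairly substantial, but the disclaimer again: it might not catch all cases; it depends on the data.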

— n1k31t4

9

Here is another variation of @Greg's pythonic answer

np.vstack(set(map(tuple, a)))

— divenex
source

9

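I compared the suggested alternatives for speed and found that, surprisingly, the void view unique solution is even a bit faster than numpy's native unique with the axis argument. If you're looking for speed, you'll want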

numpy.unique(
    a.view(numpy.dtype((numpy.void, a.dtype.itemsize*a.shape[1])))
    ).view(a.dtype).reshape(-1, a.shape[1])

Code to reproduce the plot:

import numpy
import perfplot


def unique_void_view(a):
    return numpy.unique(
        a.view(numpy.dtype((numpy.void, a.dtype.itemsize*a.shape[1])))
        ).view(a.dtype).reshape(-1, a.shape[1])


def lexsort(a):
    ind = numpy.lexsort(a.T)
    return a[ind[
        numpy.concatenate((
            [True], numpy.any(a[ind[1:]] != a[ind[:-1]], axis=1)
            ))
        ]]


def vstack(a):
    return numpy.vstack({tuple(row) for row in a})


def unique_axis(a):
    return numpy.unique(a, axis=0)


perfplot.show(
    setup=lambda n: numpy.random.randint(2, size=(n, 20)),
    kernels=[unique_void_view, lexsort, vstack, unique_axis],
    n_range=[2**k for k in range(15)],
    logx=True,
    logy=True,
    xlabel='len(a)',
    equality_check=None
    )

— Nico Schlömer
source

1

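Very nice answer, one minor point: vstack_dict never uses a dict; the curly braces are a set comprehension, so its behavior is almost identical to vstack_set. Since the vstack_dict performance line is missing from the graph, it looks like it is just being covered by the vstack_set performance line, since they are so similar!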

— Akavall '17

Thanks for the reply. I've improved the plot to include only one vstack variant.

— Nico Schlömer '17

8

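I don't like any of these answers because none of them handles floating-point arrays in a linear algebra or vector space sense, where two rows being "equal" means "within some 𝜀". The one answer that has a tolerance threshold, https://stackoverflow.com/a/26867764/500207, takes the threshold to be both element-wise and in decimal precision, which works for some cases but isn't as mathematically general as a true vector distance.

Here's my version:

import numpy as np
from scipy.spatial.distance import squareform, pdist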

def uniqueRows(arr, thresh=0.0, metric='euclidean'):
    "Returns subset of rows that are unique, in terms of Euclidean distance"
    distances = squareform(pdist(arr, metric=metric))
    idxset = {tuple(np.nonzero(v)[0]) for v in distances <= thresh}
    return arr[[x[0] for x in idxset]]

# With this, unique columns are super-easy:
def uniqueColumns(arr, *args, **kwargs):
    return uniqueRows(arr.T, *args, **kwargs)

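The public-domain function above uses scipy.spatial.distance.pdist to find the Euclidean (customizable) distance between each pair of rows. It then compares each distance to a threshold thresh to find the rows that are within thresh of each other, and returns just one row from each thresh-cluster.

As hinted, the distance metric needn't be Euclidean: pdist can compute sundry distances including cityblock (Manhattan norm) and cosine (the angle between vectors).

If thresh=0 (the default), then rows have to be bit-exact to be considered "unique". Other good values for thresh use scaled machine precision, i.e. thresh=np.spacing(1)*1e3.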
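For readers without SciPy, the same idea can be sketched with NumPy alone. This is a simplified variant under stated assumptions, not the function above: it only supports Euclidean distance, it builds the full pairwise distance matrix in memory, and it keeps the lowest-index row of each neighborhood. The name `unique_rows_within` is hypothetical:

```python
import numpy as np

def unique_rows_within(arr, thresh=0.0):
    # Pairwise Euclidean distances via broadcasting (O(n^2) memory).
    diff = arr[:, None, :] - arr[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    close = dist <= thresh
    # For every row, the lowest index among the rows within `thresh` of it.
    first_neighbor = close.argmax(axis=1)
    return arr[np.unique(first_neighbor)]

points = np.array([[0.0, 0.0],
                   [1e-12, 0.0],
                   [1.0, 1.0]])
result = unique_rows_within(points, thresh=1e-9)
print(result)
```

The first two rows lie within 1e-9 of each other, so only the first of them is kept alongside the third row.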

— Ahmed Fasih
source

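Best answer. Thanks. It is the most (mathematically) generalized answer written so far. It considers a matrix as a set of data points or samples in N-dimensional space and finds collections of same or similar points (similarity being defined by either Euclidean distance or any other method). These points can be overlapping data points or very close neighbors. At the end, each collection of same or similar points is replaced by any one of the points (in the above answer, by the first point) belonging to the same set. This helps in reducing redundancy in a point cloud.

— Sanchit

@Sanchit aha, that's a good point: instead of picking the "first" point (actually it could be effectively random, since it depends on how Python stores the points in a set) as the representative of each thresh-sized neighborhood, the function could allow the user to specify how to pick that point, e.g., use the "median" or the point closest to the centroid, etc.

— Ahmed Fasih

Sure. No doubt. I just mentioned the first point since that is what your program is doing, which is completely fine.

— Sanchit

Just a correction: I wrongly said above that the row picked for each thresh-cluster would be random because of the unordered nature of set. Of course that was a brainfart on my part: the set stores tuples of the indexes that are in the thresh-neighborhood, so findRows does in fact return, for each thresh-cluster, the first row in it.

— Ahmed Fasih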

3

Why not use drop_duplicates from pandas:

>>> timeit pd.DataFrame(image.reshape(-1,3)).drop_duplicates().values
1 loops, best of 3: 3.08 s per loop

>>> timeit np.vstack({tuple(r) for r in image.reshape(-1,3)})
1 loops, best of 3: 51 s per loop

— kalu
source

I actually love this answer. Sure, it doesn't use numpy directly, but to me it's the one that's easiest to understand while being fast.

— noctilux

3

The numpy_indexed package (disclaimer: I am its author) wraps the solution posted by Jaime in a nice and tested interface, plus many more features:

import numpy_indexed as npi
new_a = npi.unique(a)  # unique elements over axis=0 (rows) by default

— Eelco Hoogendoorn
source

1

np.unique works given a list of tuples:

>>> np.unique([(1, 1), (2, 2), (3, 3), (4, 4), (2, 2)])
Out[9]: 
array([[1, 1],
       [2, 2],
       [3, 3],
       [4, 4]])

With a list of lists, it raises a TypeError: unhashable type: 'list'

— 编码
source

Doesn't seem to work on mine. Each tuple is two strings instead of two float numbers.

— mjp

Doesn't work; it returns a list of elements, not tuples.

— Mohanad Kaleia

1

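Based on the answers on this page, I have written a function that replicates the capability of MATLAB's unique(input,'rows') function, with the additional feature of accepting a tolerance for checking uniqueness. It also returns the indices such that c = data[ia,:] and data = c[ic,:]. Please report if you see any discrepancies or errors.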

def unique_rows(data, prec=5):
    import numpy as np
    d_r = np.fix(data * 10 ** prec) / 10 ** prec + 0.0
    b = np.ascontiguousarray(d_r).view(np.dtype((np.void, d_r.dtype.itemsize * d_r.shape[1])))
    _, ia = np.unique(b, return_index=True)
    _, ic = np.unique(b, return_inverse=True)
    return np.unique(b).view(d_r.dtype).reshape(-1, d_r.shape[1]), ia, ic
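On NumPy 1.13 or later, a similar result can be sketched with the axis argument of np.unique. This is an assumption-laden variant rather than a drop-in replacement: it rounds with np.round where the function above truncates with np.fix, and the index relations hold for the rounded data, not the raw input. The name `unique_rows_rounded` is hypothetical:

```python
import numpy as np

def unique_rows_rounded(data, decimals=5):
    # Round to `decimals` places; adding 0.0 turns any -0.0 into +0.0.
    rounded = np.round(data, decimals) + 0.0
    uniq, ia, ic = np.unique(rounded, axis=0,
                             return_index=True, return_inverse=True)
    # Flatten the inverse so it can index rows directly on any NumPy version.
    return uniq, ia, ic.reshape(-1)

data = np.array([[0.1234561, 2.0],
                 [0.1234564, 2.0],
                 [3.0, 4.0]])
c, ia, ic = unique_rows_rounded(data)
print(c)
print(ia, ic)
```

The first two rows agree to five decimal places, so `c` has two rows, with `c == rounded[ia]` and `rounded == c[ic]`.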

— Arash_D_B
source

1

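Beyond @Jaime's excellent answer, another way to collapse a row is to use a.strides[0] (assuming a is C-contiguous), which is equal to a.dtype.itemsize*a.shape[1]. Furthermore, void(n) is a shortcut for dtype((void,n)). We finally arrive at this shortest version:

a[unique(a.view(void(a.strides[0])),1)[1]]

This yields: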

[[0 1 1 1 0 0]
 [1 1 1 0 0 0]
 [1 1 1 1 1 0]]

— BM
source

0

For general purposes such as 3D or higher-dimensional nested arrays, try this:

import numpy as np

def unique_nested_arrays(ar):
    origin_shape = ar.shape
    origin_dtype = ar.dtype
    ar = ar.reshape(origin_shape[0], np.prod(origin_shape[1:]))
    ar = np.ascontiguousarray(ar)
    unique_ar = np.unique(ar.view([('', origin_dtype)]*np.prod(origin_shape[1:])))
    return unique_ar.view(origin_dtype).reshape((unique_ar.shape[0], ) + origin_shape[1:])

which handles your 2D dataset:

a = np.array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])
unique_nested_arrays(a)

gives:

array([[0, 1, 1, 1, 0, 0],
   [1, 1, 1, 0, 0, 0],
   [1, 1, 1, 1, 1, 0]])

But it also handles 3D arrays like:

b = np.array([[[1, 1, 1], [0, 1, 1]],
              [[0, 1, 1], [1, 1, 1]],
              [[1, 1, 1], [0, 1, 1]],
              [[1, 1, 1], [1, 1, 1]]])
unique_nested_arrays(b)

gives:

array([[[0, 1, 1], [1, 1, 1]],
   [[1, 1, 1], [0, 1, 1]],
   [[1, 1, 1], [1, 1, 1]]])
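On NumPy 1.13 or later, the axis argument described in another answer on this page also covers this case. A minimal sketch with the same 3D array:

```python
import numpy as np

b = np.array([[[1, 1, 1], [0, 1, 1]],
              [[0, 1, 1], [1, 1, 1]],
              [[1, 1, 1], [0, 1, 1]],
              [[1, 1, 1], [1, 1, 1]]])

# Each b[i] is treated as one item; duplicates along axis 0 are dropped.
unique_b = np.unique(b, axis=0)
print(unique_b)
```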

— 塔拉
source

Using unique's return_index, as Jaime does, should make that last return line simpler. Just index the original ar on the right axis.

— hpaulj

0

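None of these answers worked for me. I'm assuming that's because my unique rows contained strings and not numbers. However, this answer from another thread did work:

Source: https://stackoverflow.com/a/38461043/5402386

You can use the .count() and .index() list methods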

coor = np.array([[10, 10], [12, 9], [10, 5], [12, 9]])
coor_tuple = [tuple(x) for x in coor]
unique_coor = sorted(set(coor_tuple), key=lambda x: coor_tuple.index(x))
unique_count = [coor_tuple.count(x) for x in unique_coor]
unique_index = [coor_tuple.index(x) for x in unique_coor]
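For numeric data, a sketch of the same three results using np.unique with axis=0 (NumPy 1.13+). Reordering by first occurrence with np.argsort is an addition of this example, made so that the output matches the list-based version above:

```python
import numpy as np

coor = np.array([[10, 10], [12, 9], [10, 5], [12, 9]])

uniq, first_index, counts = np.unique(coor, axis=0,
                                      return_index=True, return_counts=True)

# np.unique returns sorted rows; reorder them by where each first appeared.
order = np.argsort(first_index)
unique_coor = uniq[order]
unique_index = first_index[order]
unique_count = counts[order]

print(unique_coor)
print(unique_count, unique_index)
```

This avoids the repeated list scans of .count() and .index(), which grow quadratically with the number of rows.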

— mjp
source

0

We can actually turn an m x n numeric numpy array into an m x 1 numpy string array. Please try the following function; it provides count, inverse_idx, etc., just like numpy.unique:

import numpy as np

def uniqueRow(a):
    #This function turn m x n numpy array into m x 1 numpy array storing 
    #string, and so the np.unique can be used

    #Input: an m x n numpy array (a)
    #Output unique m' x n numpy array (unique), inverse_indx, and counts 

    # Convert every value to its string form (np.str was removed from NumPy)
    b = a.astype(str)

    # Join the columns of each row with '-' so that each row becomes one string
    s2 = b[:, :1]
    for i in range(1, a.shape[1]):
        s2 = np.char.add(np.char.add(s2, '-'), b[:, i:i + 1])

    s3, idx, inv_, c = np.unique(s2,return_index = True,  return_inverse = True, return_counts = True)

    return a[idx], inv_, c

Example:

A = np.array([[3.17, 9.502, 3.291],
              [9.984, 2.773, 6.852],
              [1.172, 8.885, 4.258],
              [9.73, 7.518, 3.227],
              [8.113, 9.563, 9.117],
              [9.984, 2.773, 6.852],
              [9.73, 7.518, 3.227]])

B, inv_, c = uniqueRow(A)

Results:

B:
[[ 1.172  8.885  4.258]
[ 3.17   9.502  3.291]
[ 8.113  9.563  9.117]
[ 9.73   7.518  3.227]
[ 9.984  2.773  6.852]]

inv_:
[3 4 1 0 2 4 0]

c:
[2 1 1 1 2]

— 陈定安
source

-1

Let's get the entire numpy matrix as a list, then drop duplicates from this list, and finally turn our unique list back into a numpy matrix:

matrix_as_list=data.tolist() 
matrix_as_list:
[[1, 1, 1, 0, 0, 0], [0, 1, 1, 1, 0, 0], [0, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]]

uniq_list=list()
uniq_list.append(matrix_as_list[0])

[uniq_list.append(item) for item in matrix_as_list if item not in uniq_list]

unique_matrix=np.array(uniq_list)
unique_matrix:
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 1, 0]])

— Mahdi Ghelichi
source

-3

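The most straightforward solution is to make each row a single item by turning it into a string. Each row can then be compared as a whole for uniqueness using numpy. This solution is generalizable: you just need to reshape and transpose your array for other combinations. Here is the solution for the problem provided.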

import numpy as np

original = np.array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])

uniques, index = np.unique([str(i) for i in original], return_index=True)
cleaned = original[index]
print(cleaned)

This gives:

 array([[0, 1, 1, 1, 0, 0],
        [1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 0]])

Send my Nobel Prize by mail

— Dave Pena
source

Very inefficient and error-prone, e.g. with different print options. The other options are clearly preferable.
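One way this can go wrong, sketched under the print options that NumPy uses by default (set explicitly here so the example does not depend on the environment): long rows are abbreviated when converted to strings, so two different rows can produce the same string.

```python
import numpy as np

# Two rows of 2000 elements that differ only in the middle.
row_a = np.zeros(2000, dtype=int)
row_b = row_a.copy()
row_b[1000] = 1

# threshold=1000 and edgeitems=3 are NumPy's defaults: arrays with more than
# 1000 elements are printed as their first and last 3 items with "..." between.
with np.printoptions(threshold=1000, edgeitems=3):
    same_string = str(row_a) == str(row_b)

print(same_string)                    # the string forms collide
print(np.array_equal(row_a, row_b))   # although the rows differ
```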

— Michael

-3

import numpy as np
original = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 1, 1, 1, 0, 0],
                     [0, 1, 1, 1, 0, 0],
                     [1, 1, 1, 0, 0, 0],
                     [1, 1, 1, 1, 1, 0]])
# view each row as a single structured item (like a tuple) and return the indices of the unique rows
_, unique_index = np.unique(original.view(original.dtype.descr * original.shape[1]),
                            return_index=True)
# get unique set
print(original[unique_index])

— YoungLearnsToCoding
source