How to measure "sortedness"

34

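I'm wondering whether there is a standard way to measure the "sortedness" of an array. Would an array with the median number of possible inversions be considered maximally unsorted? By that I mean it is basically as far as possible from being either sorted or reverse sorted.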

— Robert S. Barnes
31

No, it depends on your application. Measures of sortedness are often referred to as measures of disorder, which are functions from $N^{<N}$ to $\mathbb{R}$, where $N^{<N}$ is the collection of all finite sequences of distinct nonnegative integers. The survey by Estivill-Castro and Wood [1] lists and discusses 11 different measures of disorder in the context of adaptive sorting algorithms.

The number of inversions might work for some cases, but is sometimes insufficient. An example given in [1] is the sequence

$\langle \lfloor n/2 \rfloor + 1, \lfloor n/2 \rfloor + 2, \ldots, n, 1, \ldots, \lfloor n/2 \rfloor \rangle$

that has a quadratic number of inversions, but only consists of two ascending runs. It is nearly sorted, but this is not captured by inversions.
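As a rough illustration of that gap, the following sketch builds the sequence above and computes both quantities. The function names (`rotatedSequence`, `countInversions`, `countRuns`) are made up for this example and do not come from the survey; the inversion count is deliberately the naive quadratic version.

```cpp
#include <cstddef>
#include <vector>

// Builds <floor(n/2)+1, ..., n, 1, ..., floor(n/2)>.
std::vector<int> rotatedSequence(int n) {
    std::vector<int> seq;
    for (int v = n / 2 + 1; v <= n; ++v) seq.push_back(v);
    for (int v = 1; v <= n / 2; ++v) seq.push_back(v);
    return seq;
}

// Counts pairs (i, j) with i < j and seq[i] > seq[j].
// Quadratic on purpose, for clarity rather than speed.
long long countInversions(const std::vector<int>& seq) {
    long long inversions = 0;
    for (std::size_t i = 0; i < seq.size(); ++i)
        for (std::size_t j = i + 1; j < seq.size(); ++j)
            if (seq[i] > seq[j]) ++inversions;
    return inversions;
}

// Counts maximal ascending runs: one run, plus one more for every descent.
std::size_t countRuns(const std::vector<int>& seq) {
    if (seq.empty()) return 0;
    std::size_t runs = 1;
    for (std::size_t i = 1; i < seq.size(); ++i)
        if (seq[i - 1] > seq[i]) ++runs;
    return runs;
}
```

For $n = 10$ this gives 25 inversions (every element of the first half is inverted with every element of the second half) but only 2 runs.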

[1] Estivill-Castro, Vladimir, and Derick Wood. "A survey of adaptive sorting algorithms." ACM Computing Surveys (CSUR) 24.4 (1992): 441-476.

— Juho
2

The context is trying to understand why quicksort performs relatively poorly on random permutations of n elements where the number of inversions is close to the median.

— Robert S. Barnes

1

Great example, that's exactly the info I was looking for.

— Robert S. Barnes

1

Estivill-Castro and Wood is THE reference for this for sure.

— Pedro Dusso

10

Mannila [1] axiomatizes presortedness (with a focus on comparison-based algorithms) as follows (paraphrasing).

Let $\Sigma$ be a totally ordered set. Then a mapping $m$ from $\Sigma^{\star}$ (the sequences of distinct elements from $\Sigma$) to the naturals is a measure of presortedness if it satisfies the conditions below.

If $X \in \Sigma^{\star}$ is sorted then $m(X) = 0$ .

If $X,Y \in \Sigma^{\star}$ with $X = x_1 \dots x_n$, $Y = y_1 \dots y_n$ and $x_i < x_j \iff y_i < y_j$ for all $i,j \in [1..n]$, then $m(X) = m(Y)$.

If $X$ is a subsequence of $Y \in \Sigma^{\star}$ , then $m(X) \leq m(Y)$ .

If $x_i < y_j$ for all $i \in [1..|X|]$ and $j \in [1..|Y|]$ for some $X,Y \in \Sigma^{\star}$ , then $m(X \cdot Y) \leq m(X) + m(Y)$ .

$m(a \cdot X) \leq |X| + m(X)$ for all $X \in \Sigma^{\star}$ and $a \in \Sigma \setminus X$.

Examples of such measures are the

number of inversions,
number of swaps,
the number of elements that are not left-to-right maxima, and
the length of a longest increasing subsequence (subtracted from the input length).
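The last two measures in this list are straightforward to compute. The sketch below is only an illustration: the function names are invented here, and the longest increasing subsequence is found with the usual patience-sorting technique, which is one common choice rather than anything prescribed by Mannila's paper.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Number of elements that are NOT left-to-right maxima.
// An element is a left-to-right maximum if it exceeds everything before it.
std::size_t nonLeftToRightMaxima(const std::vector<int>& seq) {
    std::size_t maxima = 0;
    int best = 0;
    for (std::size_t i = 0; i < seq.size(); ++i) {
        if (i == 0 || seq[i] > best) {
            best = seq[i];
            ++maxima;
        }
    }
    return seq.size() - maxima;
}

// Input length minus the length of a longest increasing subsequence,
// computed in O(n log n).
std::size_t lengthMinusLis(const std::vector<int>& seq) {
    // tails[k] = smallest possible last element of an increasing
    // subsequence of length k + 1 seen so far.
    std::vector<int> tails;
    for (int x : seq) {
        auto it = std::lower_bound(tails.begin(), tails.end(), x);
        if (it == tails.end()) tails.push_back(x);
        else *it = x;
    }
    return seq.size() - tails.size();
}
```

Both functions return 0 on a sorted sequence, as the first condition requires. On [5,1,2,3,4] they disagree sharply: four elements are not left-to-right maxima, yet removing a single element leaves a sorted sequence.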

Note that random distributions based on these measures have been defined, i.e. distributions that make a sequence more or less likely depending on how sorted it is. These are called Ewens-like distributions [2, Ch. 4-5; 3, Example 12; 4], a special case of which is the so-called Mallows distribution. The weights are parametric in a constant $\theta > 0$ and fulfill

$\qquad\displaystyle \operatorname{Pr}(X) = \frac{\theta^{\,m(X)}}{\sum_{Y \in \Sigma^{\star} \cap \Sigma^{|X|}} \theta^{\,m(Y)}}$ .

Note how $\theta = 1$ defines the uniform distribution (for all $m$ ).

Since it is possible to sample permutations w.r.t. these measures efficiently, this body of work can be useful in practice when benchmarking sorting algorithms.

[1] Measures of Presortedness and Optimal Sorting Algorithms by H. Mannila (1985)
[2] Logarithmic combinatorial structures: a probabilistic approach by R. Arratia, A.D. Barbour and S. Tavaré (2003)
[3] On adding a list of numbers (and other one-dependent determinantal processes) by A. Borodin, P. Diaconis and J. Fulman (2010)
[4] Ewens-like distributions and Analysis of Algorithms by N. Auger et al. (2016)

— Raphael
3

I have my own definition of "sortedness" of a sequence.

Given any sequence [a,b,c,…] we compare it with the sorted sequence containing the same elements, count number of matches and divide it by the number of elements in the sequence.

For example, given sequence [5,1,2,3,4] we proceed as follows:

1) sort the sequence: [1,2,3,4,5]

2) compare the sorted sequence with the original by moving it one position at a time and counting the maximal number of matches:

        [5,1,2,3,4]
[1,2,3,4,5]                            one match

        [5,1,2,3,4]
  [1,2,3,4,5]                          no matches

        [5,1,2,3,4]
    [1,2,3,4,5]                        no matches

        [5,1,2,3,4]
      [1,2,3,4,5]                      no matches

        [5,1,2,3,4]
        [1,2,3,4,5]                    no matches

        [5,1,2,3,4]
          [1,2,3,4,5]                  4 matches

        [5,1,2,3,4]
            [1,2,3,4,5]                no matches

                ...

         [5,1,2,3,4]
                 [1,2,3,4,5]            no matches

3) The maximal number of matches is 4, we can calculate the "sortedness" as 4/5 = 0.8.

Sortedness of a sorted sequence would be 1, and sortedness of a sequence with elements placed in reversed order would be 1/n.
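The procedure above can be sketched as follows. The function names are invented for this example, and treating an empty sequence as fully sorted is an assumption that the definition itself does not cover.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Largest number of position-wise matches between `seq` and its sorted copy,
// taken over all relative shifts of the two sequences.
std::size_t maxShiftedMatches(const std::vector<int>& seq) {
    std::vector<int> sorted(seq);
    std::sort(sorted.begin(), sorted.end());
    const long long n = static_cast<long long>(seq.size());
    std::size_t best = 0;
    for (long long shift = -(n - 1); shift <= n - 1; ++shift) {
        std::size_t matches = 0;
        for (long long i = 0; i < n; ++i) {
            const long long j = i - shift;  // index in the sorted copy aligned with seq[i]
            if (j >= 0 && j < n && seq[i] == sorted[j]) ++matches;
        }
        best = std::max(best, matches);
    }
    return best;
}

// Fraction of elements that match at the best shift.
double sortedness(const std::vector<int>& seq) {
    if (seq.empty()) return 1.0;  // assumption: an empty sequence counts as sorted
    return static_cast<double>(maxShiftedMatches(seq)) / seq.size();
}
```

For [5,1,2,3,4] the best shift yields 4 matches, so the sortedness is 0.8. Trying all shifts makes this quadratic in the sequence length.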

The idea behind this definition is to estimate the minimal amount of work needed to convert any sequence into the sorted sequence. In the example above we need to move just one element, the 5 (there are many ways, but moving the 5 is the most efficient). If the elements were in reversed order, we would need to move 4 elements. And if the sequence were already sorted, no work would be needed.

I hope my definition makes sense.

— Andrushenko Alexander
Nice idea. A similar definition is Exc, the third definition of disorder in the paper mentioned in Juho's answer. Exc is the number of operations required to rearrange a sequence into sorted order.

— Apass.Jack

Well, maybe. I just applied my understanding of entropy and disorder to the sequence of elements :-)

— Andrushenko Alexander

-2

If you need something quick and dirty (summation signs scare me) I wrote a super easy disorder function in C++ for a Class named Array which generates int arrays filled with randomly generated numbers:

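#include <cstdlib>   // std::abs
#include <iostream>  // std::cout

// Assumes `array` holds a permutation of 1..arraySize.
void Array::disorder() {
    double disorderValue = 0;
    for (int n = 0; n < this->arraySize; n++) {
        // Distance between the value and what a sorted array would hold here.
        disorderValue += std::abs((n + 1) - array[n]);
    }
    // A reversed array reaches the maximum total, floor(arraySize^2 / 2).
    const double maxDisorder = (static_cast<long long>(this->arraySize) * this->arraySize) / 2;
    std::cout << "Disorder Value = " << (maxDisorder > 0 ? disorderValue / maxDisorder : 0.0) << "\n" << std::endl;
}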

The function simply compares the value in each element to the index of the element + 1, so that an array in reverse order has a disorder value of 1 and a sorted array has a disorder value of 0. Not sophisticated, but it works.

Michael

— Michael Sneberger
This is not a programming site. It would have sufficed to define the disorder notion, and to mention that it can be computed in linear time.

— Yuval Filmus