Simulating a probability of 1 in 2^N with fewer than N random bits

Say I need to simulate the following discrete distribution:

$P(X = k) = \begin{cases} \frac{1}{2^N}, & \text{if $k = 1$} \\ 1 - \frac{1}{2^N}, & \text{if $k = 0$} \end{cases}$

The most obvious way is to draw $N$ random bits and check whether they are all equal to $0$ (or $1$). However, information theory says

$\begin{align} S & = - \sum_{i} P_i \log{P_i} \\ & = - \frac{1}{2^N} \log{\frac{1}{2^N}} - \left(1 - \frac{1}{2^N}\right) \log{\left(1 - \frac{1}{2^N}\right)} \\ & = \frac{1}{2^N} \log{2^N} + \left(1 - \frac{1}{2^N}\right) \log{\frac{2^N}{2^N - 1}} \\ & \rightarrow 0 \end{align}$

So the minimum number of random bits required actually decreases as $N$ grows. How is this possible?

Please assume that we are running on a computer where bits are your only source of randomness, so you can't just toss a biased coin.
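For concreteness, here is a small sketch that evaluates the entropy formula above numerically. The function name `entropy_bits` is only illustrative, and it assumes the logarithm is taken base 2 so that the result is in bits.

```python
import math

def entropy_bits(N):
    """Entropy (in bits) of a variable that is 1 with probability 2^-N, else 0."""
    p = 2.0 ** -N
    # -p*log2(p) equals p*N exactly; log1p keeps the second term accurate for tiny p.
    return p * N - (1 - p) * math.log1p(-p) / math.log(2)

for N in (1, 3, 5, 10, 20):
    print(N, entropy_bits(N))
```

The printed values shrink toward 0 as $N$ grows, which is the behaviour the question is asking about.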

— nalzok

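If you are looking for keywords to dig deeper, this is closely related to coding theory and Kolmogorov complexity. The technique of counting runs of the same repeated bit, which D.W. mentions below, is very useful.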

— Brian Gordon

Answers:

Wow, great question! Let me try to explain the resolution. It'll take three distinct steps.

The first thing to note is that the entropy is focused more on the average number of bits needed per draw, not the maximum number of bits needed.

With your sampling procedure, the maximum number of random bits needed per draw is $N$ bits, but the average number of bits needed is 2 bits (the average of a geometric distribution with $p=1/2$ ) -- this is because there is a $1/2$ probability that you only need 1 bit (if the first bit turns out to be 1), a $1/4$ probability that you only need 2 bits (if the first two bits turn out to be 01), a $1/8$ probability that you only need 3 bits (if the first three bits turn out to be 001), and so on.
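A minimal sketch of this early-stopping procedure, assuming the convention used above (the draw is 1 only when all $N$ bits are 0). The names `draw_with_early_stop` and `expected_bits` are made up for illustration.

```python
from fractions import Fraction

def draw_with_early_stop(N, bits):
    """Return (x, bits_used); x is 1 only if the first N bits are all 0."""
    bits = iter(bits)
    for used in range(1, N + 1):
        if next(bits) == 1:
            return 0, used  # a 1 appeared, so the outcome is already settled
    return 1, N

def expected_bits(N):
    """Exact average number of bits consumed per draw."""
    # k bits are used with probability 2^-k for k < N;
    # the remaining probability 2^-(N-1) uses all N bits.
    return sum(Fraction(k, 2**k) for k in range(1, N)) + Fraction(N, 2**(N - 1))

print(draw_with_early_stop(3, [0, 1, 1]))  # stops after the second bit
print(expected_bits(10))
```

Because the procedure is cut off at $N$ bits, the exact average works out to $2 - 2^{1-N}$, slightly below the 2 bits of an untruncated geometric distribution.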

The second thing to note is that the entropy doesn't really capture the average number of bits needed for a single draw. Instead, the entropy captures the amortized number of bits needed to sample $m$ i.i.d. draws from this distribution. Suppose we need $f(m)$ bits to sample $m$ draws; then the entropy is the limit of $f(m)/m$ as $m \to \infty$ .

The third thing to note is that, with this distribution, you can sample $m$ i.i.d. draws with fewer bits than needed to repeatedly sample one draw. Suppose you naively decided to draw one sample (takes 2 random bits on average), then draw another sample (using 2 more random bits on average), and so on, until you've repeated this $m$ times. That would require about $2m$ random bits on average.

But it turns out there's a way to sample from $m$ draws using fewer than $2m$ bits. It's hard to believe, but it's true!

Let me give you the intuition. Suppose you wrote down the result of sampling $m$ draws, where $m$ is really large. Then the result could be specified as an $m$-bit string. This $m$-bit string will be mostly 0's, with a few 1's in it: in particular, on average it will have about $m/2^N$ 1's (could be more or less than that, but if $m$ is sufficiently large, usually the number will be close to that). The lengths of the gaps between the 1's are random, but will typically be somewhere vaguely in the vicinity of $2^N$ (could easily be half that or twice that or even more, but of that order of magnitude). Of course, instead of writing down the entire $m$-bit string, we could write it down more succinctly by writing down a list of the lengths of the gaps -- that carries all the same information, in a more compressed format. How much more succinct? Well, we'll usually need about $N$ bits to represent the length of each gap; and there will be about $m/2^N$ gaps; so we'll need in total about $mN/2^N$ bits (could be a bit more, could be a bit less, but if $m$ is sufficiently large, it'll usually be close to that). That's a lot shorter than an $m$-bit string.

And if there is a way to write down the string this succinctly, it is perhaps not too surprising that there is a way to generate it with a number of random bits comparable to that length. In particular, you randomly generate the length of each gap; this is sampling from a geometric distribution with $p=1/2^N$, which can be done with roughly $\sim N$ random bits on average (not $2^N$). You'll need about $m/2^N$ i.i.d. draws from this geometric distribution, so you'll need in total roughly $\sim Nm/2^N$ random bits. (It could be a small constant factor larger, but not too much larger.) And notice that this is much smaller than $2m$ bits.

So, we can sample $m$ i.i.d. draws from your distribution, using just $f(m) \sim Nm/2^N$ random bits (roughly). Recall that the entropy is $\lim_{m \to \infty} f(m)/m$ . So this means that you should expect the entropy to be (roughly) $N/2^N$ . That's off by a little bit, because the above calculation was sketchy and crude -- but hopefully it gives you some intuition for why the entropy is what it is, and why everything is consistent and reasonable.
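To make the "list of gap lengths" representation concrete, here is a rough sketch of the conversion in both directions. The helper names `to_gaps` and `from_gaps` are hypothetical, and the trailing run of 0's is kept separately so that the round trip is exact.

```python
def to_gaps(draws):
    """Split a 0/1 sequence into the run of 0's before each 1, plus the trailing 0's."""
    gaps, run = [], 0
    for x in draws:
        if x == 1:
            gaps.append(run)
            run = 0
        else:
            run += 1
    return gaps, run

def from_gaps(gaps, trailing):
    """Rebuild the 0/1 sequence from the gap lengths and the trailing run of 0's."""
    out = []
    for g in gaps:
        out.extend([0] * g)
        out.append(1)
    out.extend([0] * trailing)
    return out

draws = [0, 0, 1, 0, 0, 0, 1, 0]
gaps, trailing = to_gaps(draws)
print(gaps, trailing)
print(from_gaps(gaps, trailing) == draws)
```

Generating the sequence then amounts to sampling each gap length and expanding it with `from_gaps`, which is where the saving in random bits comes from.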

— D.W.

Wow, great answer! But could you elaborate on why sampling from a geometric distribution with $p=\frac{1}{2^N}$ takes $N$ bits on average? I know such a random variable would have a mean of $2^N$, so it takes on average $N$ bits to store, but I suppose this doesn't mean you can generate one with $N$ bits.

— nalzok

@nalzok, A fair question! Could you perhaps ask that as a separate question? I can see how to do it, but it's a bit messy to type up right now. If you ask, perhaps someone will get to answering quicker than I can. The approach I'm thinking of is similar to arithmetic coding. Define $q_i = \Pr[X\le i]$ (where $X$ is the geometric r.v.), then generate a random number $r$ in the interval $[0,1)$, and find $i$ such that $q_i \le r < q_{i+1}$. If you write down the bits of the binary expansion of $r$ one at a time, usually after writing down $N+O(1)$ bits of $r$, $i$ will be fully determined.
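One possible way to sketch this idea in code, under some assumptions of mine: the sampled quantity is the gap $G$ (the number of 0 draws before the next 1), so $\Pr[G \le g] = 1 - (1-p)^{g+1}$, and exact `Fraction` arithmetic is used so there are no rounding issues. The name `sample_gap` is illustrative, and this is practical only for small $N$.

```python
from fractions import Fraction

def sample_gap(N, bits):
    """Reveal r in [0,1) one bit at a time until its CDF bucket is determined.

    Returns (g, bits_used), where g is the number of 0 draws before the next 1.
    """
    bits = iter(bits)
    p = Fraction(1, 2**N)
    lo, width = Fraction(0), Fraction(1)  # r is known to lie in [lo, lo + width)
    g, tail = 0, 1 - p                    # tail = (1 - p)^(g + 1) = P(G > g)
    bits_used = 0
    while True:
        # Move g up to the bucket containing lo, i.e. until lo < P(G <= g).
        while lo >= 1 - tail:
            g += 1
            tail *= 1 - p
        # If the whole interval lies inside bucket g, the answer is determined.
        if lo + width <= 1 - tail:
            return g, bits_used
        # Otherwise reveal one more bit of r and halve the interval.
        width /= 2
        if next(bits) == 1:
            lo += width
        bits_used += 1

print(sample_gap(3, [0, 0, 0]))  # r < 1/8, so the gap is 0 after three bits
```

In real use `bits` would be an endless stream of fair random bits; the fixed lists here are only to keep the example deterministic.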

— D.W.

So you're basically using the inverse CDF method to convert a uniformly distributed random variable to an arbitrary distribution, combined with an idea similar to binary search? I'll need to analyze the quantile function of a geometric distribution to be sure, but this hint is enough. Thanks!

— nalzok

@nalzok, ahh, yes, that's a nicer way to think about it -- lovely. Thank you for suggesting that. Yup, that's what I had in mind.

— D.W.

You can think about this backwards: consider the problem of binary encoding instead of generation. Suppose that you have a source that emits symbols $X\in \{A,B\}$ with $p(A)=2^{-N}$, $p(B)=1-2^{-N}$. For example, if $N=3$, we get $H(X)\approx 0.54356$. So (Shannon tells us) there is a uniquely decodable binary encoding $X \to Y$, where $Y \in \{0,1\}$ (data bits), such that we need, on average, about $0.54356$ data bits for each original symbol $X$.

(In case you are wondering how such an encoding can exist, given that we have only two source symbols and it seems that we cannot do better than the trivial encoding, $A\to 0$, $B\to 1$, with one bit per symbol: to approach the Shannon bound we need to take "extensions" of the source, that is, to code sequences of inputs as a whole. See in particular arithmetic encoding.)
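As a rough illustration of coding extensions of the source, the sketch below computes the average length of a Huffman code over blocks of $k$ symbols. Note that this uses Huffman coding rather than the arithmetic encoding mentioned above, simply because it is shorter to write down; the function name `huffman_bits_per_symbol` is made up for this example.

```python
import heapq
from fractions import Fraction
from math import comb

def huffman_bits_per_symbol(N, k):
    """Average Huffman code length per source symbol when coding blocks of k symbols."""
    p = Fraction(1, 2**N)  # probability of symbol A
    probs = []
    for j in range(k + 1):  # j = number of A's in the block
        probs.extend([p**j * (1 - p)**(k - j)] * comb(k, j))
    heapq.heapify(probs)
    total = Fraction(0)
    # The expected code length equals the sum of all merged node weights.
    while len(probs) > 1:
        merged = heapq.heappop(probs) + heapq.heappop(probs)
        total += merged
        heapq.heappush(probs, merged)
    return total / k

for k in (1, 2, 4, 8):
    print(k, float(huffman_bits_per_symbol(3, k)))
```

With $N=3$, a block length of 1 gives exactly one bit per symbol (the trivial encoding), and longer blocks bring the rate below one bit per symbol, though never below $H(X)\approx 0.54356$.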

Once the above is clear, assume we have an invertible mapping $X^n \to Y^n$, and notice that in the Shannon limit $Y^n$ must have maximum entropy (1 bit of information per bit of data), i.e., $Y^n$ has the statistics of a fair coin. Then we have a generation scheme at hand: draw $n$ random bits (here $n$ has no relation to $N$) with a fair coin, interpret them as the output $Y^n$ of the encoder, and decode $X^n$ from it. In this way, $X^n$ will have the desired probability distribution, and we need (on average) $H(X)<1$ coin flips to generate each value of $X$.

— leonbloy