anlak: Probability of k collisions

While I was reading about hash tables I stumbled upon this deceptively simple question. I have rephrased it to make it easier to understand.

Say we have $m$ buckets. We select a random bucket and put a ball in it, we repeat this (select-put) $n$ times. In the end what is the probability of having at least one bucket with $k$ balls?

My initial attempts to solve it failed, and I posted the question on math.stackexchange. An approximation for large values of $m$ and $n$ was posted as an answer but I was interested in exact results. Once I realized I had no way of finding someone else's solution, I was able to solve it myself.

Simulation

First, to check if my solution is correct, I have written a small program to simulate the select-put procedure million times and report the experimental probability for different values of $m$,$n$ and $k$. If you are interested, you can find the java code at the end of my post.

Solution

First we will start with writing the probability of 1 box having $k$ balls. Then using inclusion-exclusion principle we will generalize. Let's write the probability of putting first $k$ balls into the first box and remaining balls to the other boxes: $$ \left(\frac1{m}\right)^k \left(1-\frac1{m}\right)^{n-k} $$ We have $\binom{n}{k}$ ways of putting $k$ balls into a particular box (e.g. first box), and we have $m$ such boxes. So the probability becomes: $$ m \binom{n}{k} \left(\frac1{m}\right)^k \left(1-\frac1{m}\right)^{n-k} $$ But the expression above suffers from double counting, i.e. it counts configurations that contains multiple boxes with $k$ balls more than once. So we need to subtract the value for 2 boxes having $k$ balls, add the value for 3 boxes having $k$ balls, etc... Let's write the probability of putting first $k$ balls into the 1st box, next $k$ balls into the 2nd box, and remaining balls to the other boxes. $$ \left(\frac1{m}\right)^k \left(\frac1{m}\right)^k \left(1-\frac1{m}\right)^{n-2k} $$ We have $\binom{n}{k}\binom{n-k}{k}$ ways of putting $2k$ balls into two particular boxes (e.g. first and second box), and we have $\binom{m}{2}$ such boxes. So the probability becomes: $$ \binom{m}{2}\binom{n}{k}\binom{n-k}{k}\left(\frac1{m}\right)^k \left(\frac1{m}\right)^k \left(1-\frac1{m}\right)^{n-2k} $$ At this point you should be able to see the pattern that is emerging.

Generalization

If we put together and generalize what we have done so far we get the following expression as probability of having a box with $k$ balls: $$ \sum^{\lfloor \frac{n}{k} \rfloor}_{i=1} (-1)^{i+1} \binom{m}{i} \left[ \prod^{i-1}_{j=0} \binom{n-jk}{k} \right] \left( \frac1{m} \right)^{ik} \left( \frac{m-i}{m} \right)^{n-ik} $$ The powers of $-1$ at the beginning enables adding values for odd number of balls and subtracting values for even number of balls, i.e. inclusion-exclusion principle.

The code

  private static double calculateKcollisions(int m, int n, int k) {
    double probability = 0;

    int sign = 1;

    for (int i = 1; n - i * k >= 0 && i <= m; i++) {
      double p = binomial(m, i) * pow(1.0 / m, i * k);
      p *= pow((double) (m - i) / m, n - i * k);

      for (int j = 0; j < i; j++) {
        p *= binomial(n - j * k, k);
      }

      probability += sign * p;
      sign *= -1;
    }

    return probability;
  }

Simulation code

  private static final Random RNG = new Random();

  public static void main(String[] args) {
    int sampleCount = 1000000;

    for (int experimentNo = 0; experimentNo < 10; experimentNo++) {
      int m = RNG.nextInt(7) + 3;
      int n = RNG.nextInt(m) + RNG.nextInt(10) + 1;
      int k = RNG.nextInt(n + 1);
      
      System.out.println(format("m:%d  n:%d  k:%d", m, n, k));

      int successCount = 0;
      for (int sampleNo = 0; sampleNo < sampleCount; sampleNo++) {

        int[] boxes = new int[m];

        for (int i = 0; i < n; i++)
          boxes[RNG.nextInt(m)]++;

        for (int i = 0; i < m; i++) {
          if (boxes[i] == k) {
            successCount++;
            break;
          }
        }
      }

      double experimental = (double) successCount / sampleCount;

      double calculated = calculateKcollisions(m, n, k);

      System.out.println(format("xperimental: %f  calculated: %f", experimental, calculated));
      System.out.println(format("diff: %.2f\n", abs(experimental - calculated)));
      System.out.println();
    }
  }

Probability of k collisions

Simulation

Solution

Generalization

The code

Simulation code

No comments:

Post a Comment