### Probability of k collisions

, Friday, 30 January 2015
While I was reading about hash tables I stumbled upon this deceptively simple question. I have rephrased it to make it easier to understand.
Say we have $m$ buckets. We select a random bucket and put a ball in it, we repeat this (select-put) $n$ times. In the end what is the probability of having at least one bucket with $k$ balls?
My initial attempts to solve it failed, and I posted the question on math.stackexchange. An approximation for large values of $m$ and $n$ was posted as an answer but I was interested in exact results. Once I realized I had no way of finding someone else's solution, I was able to solve it myself.

## Simulation

First, to check if my solution is correct, I have written a small program to simulate the select-put procedure million times and report the experimental probability for different values of $m$,$n$ and $k$. If you are interested, you can find the java code at the end of my post.

## Solution

First we will start with writing the probability of 1 box having $k$ balls. Then using inclusion-exclusion principle we will generalize. Let's write the probability of putting first $k$ balls into the first box and remaining balls to the other boxes: $$\left(\frac1{m}\right)^k \left(1-\frac1{m}\right)^{n-k}$$ We have $\binom{n}{k}$ ways of putting $k$ balls into a particular box (e.g. first box), and we have $m$ such boxes. So the probability becomes: $$m \binom{n}{k} \left(\frac1{m}\right)^k \left(1-\frac1{m}\right)^{n-k}$$ But the expression above suffers from double counting, i.e. it counts configurations that contains multiple boxes with $k$ balls more than once. So we need to subtract the value for 2 boxes having $k$ balls, add the value for 3 boxes having $k$ balls, etc... Let's write the probability of putting first $k$ balls into the 1st box, next $k$ balls into the 2nd box, and remaining balls to the other boxes. $$\left(\frac1{m}\right)^k \left(\frac1{m}\right)^k \left(1-\frac1{m}\right)^{n-2k}$$ We have $\binom{n}{k}\binom{n-k}{k}$ ways of putting $2k$ balls into two particular boxes (e.g. first and second box), and we have $\binom{m}{2}$ such boxes. So the probability becomes: $$\binom{m}{2}\binom{n}{k}\binom{n-k}{k}\left(\frac1{m}\right)^k \left(\frac1{m}\right)^k \left(1-\frac1{m}\right)^{n-2k}$$ At this point you should be able to see the pattern that is emerging.

## Generalization

If we put together and generalize what we have done so far we get the following expression as probability of having a box with $k$ balls: $$\sum^{\lfloor \frac{n}{k} \rfloor}_{i=1} (-1)^{i+1} \binom{m}{i} \left[ \prod^{i-1}_{j=0} \binom{n-jk}{k} \right] \left( \frac1{m} \right)^{ik} \left( \frac{m-i}{m} \right)^{n-ik}$$ The powers of $-1$ at the beginning enables adding values for odd number of balls and subtracting values for even number of balls, i.e. inclusion-exclusion principle.

## The code

  private static double calculateKcollisions(int m, int n, int k) {
double probability = 0;

int sign = 1;

for (int i = 1; n - i * k >= 0 && i <= m; i++) {
double p = binomial(m, i) * pow(1.0 / m, i * k);
p *= pow((double) (m - i) / m, n - i * k);

for (int j = 0; j < i; j++) {
p *= binomial(n - j * k, k);
}

probability += sign * p;
sign *= -1;
}

return probability;
}



## Simulation code

  private static final Random RNG = new Random();

public static void main(String[] args) {
int sampleCount = 1000000;

for (int experimentNo = 0; experimentNo < 10; experimentNo++) {
int m = RNG.nextInt(7) + 3;
int n = RNG.nextInt(m) + RNG.nextInt(10) + 1;
int k = RNG.nextInt(n + 1);

System.out.println(format("m:%d  n:%d  k:%d", m, n, k));

int successCount = 0;
for (int sampleNo = 0; sampleNo < sampleCount; sampleNo++) {

int[] boxes = new int[m];

for (int i = 0; i < n; i++)
boxes[RNG.nextInt(m)]++;

for (int i = 0; i < m; i++) {
if (boxes[i] == k) {
successCount++;
break;
}
}
}

double experimental = (double) successCount / sampleCount;

double calculated = calculateKcollisions(m, n, k);

System.out.println(format("xperimental: %f  calculated: %f", experimental, calculated));
System.out.println(format("diff: %.2f\n", abs(experimental - calculated)));
System.out.println();
}
}


#### 1 comment:

1. Excellent blog thanks for sharing the valuable information..it becomes easy to read and easily understand the information.
Useful article which was very helpful. also interesting and contains good information.