The frenzy surrounding the Aussie Animal cards promotion got me thinking. How many randomly collected cards would you need to collect before you had the whole set? More specifically, if you were randomly given n cards what's the probability of having the full set. I'm going to use the 108 card Aussie Animal set as an example.
All the files associated with this post can be found here.
When I initially came up with the idea for this post I thought it would be easy. Turns out I was wrong. A couple of quick simulations showed my understanding of the problem wasn't right, but after a little research I was back on track. It did take me 2 days though.
The problem is mathematically the same as a well known problem called The Coupon Collector's Problem. In this context the problem can be restated in the following way:
"Suppose that there are k different coupons, equally likely, from which coupons are being collected with replacement. What's the probability that after n sample trials the complete set of k coupons is collected."
The formula to calculate this is deceptively simple.
There are k^n different ways to select the coupons. The next step is to figure out how many of those result in a complete set being collected. I'm still getting my head around this, but it comes down to something called Stirling numbers of the second kind that equal the number of ways to partition a set of n objects into k non-empty subsets. This suits the problem. The collected coupons need to partitioned into k sets and none of these sets can be empty, i.e. the coupon hasn't been collected. This number is represented as S(n,k). This number has to multiplied by k!, as there are this many ways to arrange the sets.
This is where things got hard, I tried GNU Octave to calculate the results but as there was no built in function I tried to roll my own. I used an explicit formula for S(n,k) but that required numbers like 1000! to be calculated. A number this large is just too big for Octave to handle. I came up with a way to do all the calculations using the log of the number, but there was a precision problem for low values of n. Although with more time I could have got it working, I scrapped it and moved to Maxima, a computer algebra system that uses arbitrary precision (should have started there). A couple of lines of code later I had my result. The data was imported into Libre Calc to graph the solution.
Back to the Aussie Animal cards. I've chosen a couple of data points to illustrate the how hard it can be to collect the set.
n = 200, P = 8.99E-11
n = 400, P = 0.0620
n = 600, P = 0.662
n = 800, P = 0.938
n = 1000, P = 0.990
After collecting 600 cards you still only have a 66.2% chance that you've collected a set.
The expected value of how many cards you need to collect before you have a set is given by the formula nH(n), where H(n) is the harmonic number of n. In our case this is equivalent to 108*H(108) = 568.5 To double check this, the Probability Density Function of the above data was generated to manually calculate the Expected number. I get 557. Given it was done in a spreadsheet with low precision data only up to n=1000, I'm pretty happy they agree that much.
This shows that without trading cards with other people it can be difficult to collect an entire set. It assumes that all cards are equally likely and independent. This may not be the case but it's not unreasonable to assume it is.
I'm glad I tackled this problem, learning about Stirling numbers made it worthwhile. I now have another mathematical tool under my belt. Would've been nice to have heard about them in statistics class, but there's only so much you can cover.