Tuesday, March 18, 2014

Probability of Collecting Multiple Full Sets

Well, it's that time again.  One of the large supermarket chains has released yet another set of collector cards.  This time there are 42 cards to collect compared to the 108 cards of the last campaign, and I thought it'd be interesting to explore some the statistics of cards collecting.  In my last post on this topic, Probability of Collecting a Full Set, I looked at the expected number of cards you need to collect on average before you have a full set of cards.  I suggest reading that article for some background on the problem before preceding any further.

All caught up?  Good.  The question I'm posing this time is how many cards you need to collect on average before you have multiple complete sets?  The inspiration for this question came from hearing about families with multiple children trying to collect a set of cards for each child.  Of course this once again assumes that the trading of cards to other people is disallowed.

Like my first article about this, I initially thought that I had a handle on the problem and the answer would be easy, and once again the maths kicked my butt.  I scraped through by the skin of my teeth last time, but not this time.  I originally thought that the probability of collecting two sets of cards could be found by finding the probability of collecting one set of cards, multiplied by the probability of collecting further sets from the left over cards.  It seems logical, but after running a few quick simulations it was obvious I was wrong.  The problem lies in the fact the the left over cards are not an even distribution.

So, where to now?  Google.  It took some time to find but I eventually found that the problem was called the "double dixie cup problem" and was only solved  by Newman and Shepp in 1960.  I found a good run-down of the problem at http://www.brynmawr.edu/math/people/anmyers/PAPERS/SIGEST_Coupons.pdf that gets me close to solving it analytically, but I just didn't have the time to fully absorb it.  So I listened to that little engineer buried deep inside and decided to cheat by using a Monte Carlo simulation.  Near enough is good enough.  I suppose it's not technically cheating but it does feel a little dirty.

The files associated with this post are located here.  They're a bit rough, but they were only used for working.

Octave was good enough for the job.  It was simple to write a script that repeatedly drew cards and counted how many full sets were drawn.  My simulation was for randomly drawing 1 to 1000 cards and seeing how many full sets of 42 cards were collected.  Each of these simulations was run 10000 times.  The results are shown below.
cumulative distribution function
The Probability of collecting at least n full sets of 42 cards

The familiar shape from the previous article can be seen here.  The noise from the Monte Carlo simulation can also be seen on the graphs.  The results match what I expect.  For example if you were to collect 200 cards, it would be highly unlikely you would have 3 sets (0%), there's a slim chance you'd have 2 sets (10%), but you'd almost certainly have one complete set (90%).  But what is the expected value for a certain number of sets?   This is where things get a little dodgy.  Because of the noise in the simulation it's possible to have negative probabilities in the probability mass function.  (yes I know that it should be a stem plot, but I find it easier to see what's going on this way)  The main point is that a negative probability doesn't make much sense.  The plots below should be a series of smooth humps.  If the Monte Carlo simulation was run for 10 times longer we would get close to approximating a smooth plot.  Even thought the data doesn't make sense it should still be able to be used to approximate the expected value.
The expected number of cards needed to collect 1,2,3,4,5 sets was calculated.

1 set  - 182 cards - 182 cards per set
2 sets - 264 cards - 132 cards per set
3 sets - 337 cards - 112 cards per set
4 sets - 404 cards - 101 cards per set
5 sets - 468 cards -  93 cards per set

As the number of sets to collect increases, I suspect that the number of cards per set would approach a limit of 42 cards per set.

As a sanity check we can analytically calculate the expected number of cards to collect for one set using the equation from the previous post. nH(n), where H represents the harmonic number of n.  42 H(42) = 181.7 this agrees with the result from the Monte Carlo simulation.

So what's the point of all this?  Although it may seem a daunting task to collect a set of cards (you need to collect 182 on average) when you pool resources with other people the number of cards you need to collect decreases.  This can be accomplished within a family or by gathering a group of people to trade with.

No comments:

Post a Comment