
Sunday, September 15, 2019

Self Normalising Price Markdowns For Short Life Products

When selling products that have a relatively short shelf life, it becomes necessary to reduce the price on items as they approach their use by date so they sell and aren't thrown out.  This is done for several reasons.

  • Sending plastic packaging and food that's safe to eat to landfill is a waste of resources, and in 2019 we can do a lot better than that
  • If a business can reduce its waste, it can save money on garbage collection
  • It's an acknowledgement to customers that the product isn't as flexible as one with a longer use by date
  • It adds a little excitement for customers when they can buy a product they'd normally skip because it's too expensive

In my experience, supermarkets perform markdowns in an incremental way to make sure that stock isn't wasted and to get as much money for the product as possible.  This was usually accomplished by reducing the price by 20% on the second last day of sale, 30% on the morning of the last day of sale, 40% at lunch time on the last day of sale, and then stepping through 50%, 60%, 70%, 80%, and 90% until the close of the store.  In essence, it's a Dutch auction.  This works well but has problems.  People buy the most popular items first, leaving an assortment of unappealing products towards the end.  Because fixed percentages are used across the board, you might be reducing some items more than you need to and some less than you need to.  It's also reasonably labour intensive.

If you track markdowns and plot a graph of the most common ones you end up with a graph like the one below, which I'll use to demonstrate a point.
[Figure: Markdowns shouldn't waste time or money]

In the graph above you can see that the average markdown is around 50%.  Traditionally you'd start with a 20% markdown, but the graph shows that in this case almost no one will buy the product at that discount.  This is the "Waste of Time" section of the graph.  Most likely you'll have to come back and do another markdown on the product.  Conversely, if you go straight to an 80% markdown you've wasted money, because you most likely could have sold the product for more.  This is the "Waste of Money" section.  Ideally most of the markdowns should be in the 30% to 70% range, the "Sweet Spot".  But how do you calculate the markdown percentages?

I'm going to demonstrate a method that doesn't necessarily generate optimum markdown percentages, but does generate percentages so that popular and unpopular products sell at roughly the same rate with a reasonable chance of selling on the first markdown.

Let's start with some Lamb Hearts, and to keep things simple, from most to least recent the markdown percentages are 98%, 66%, and 6%.  This process uses a method called Kernel Density Estimation to generate a smooth curve of markdown probabilities.  This means that at each of those percentages you place a predefined shape, called a kernel, in this case a Gaussian distribution.
[Figure: Initial placement of kernels]

You can see the three kernels plotted above.  You may notice that because the grey and blue curves are truncated at the sides of the graph, the area under them is smaller.  This means that they'll have less of an influence on the result.  This can be corrected with a simple scaling process though.
[Figure: Compensation for truncated kernels]
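To make the kernel placement concrete, here's a minimal numpy sketch of the idea.  The 0 to 100 grid and the kernel width of 10 percentage points are my assumptions, not values taken from the post, so treat it as an illustration rather than the exact calculation.

```python
import numpy as np

def place_kernels(markdowns, bandwidth=10.0):
    """Place a Gaussian kernel at each historical markdown percentage and
    rescale each one so its area over the 0-100 range is 1, compensating
    for kernels truncated at the edges of the graph."""
    grid = np.linspace(0, 100, 1001)
    dx = grid[1] - grid[0]
    kernels = []
    for m in markdowns:
        k = np.exp(-0.5 * ((grid - m) / bandwidth) ** 2)
        k /= k.sum() * dx          # unit area on the grid, even if truncated
        kernels.append(k)
    return grid, kernels

# Lamb hearts example, most recent markdown first
grid, kernels = place_kernels([98, 66, 6])
```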

The areas under the graphs above are now all the same, but is this what we want?  Most recent markdowns should have a greater impact on the process than ones that were performed long ago.  To correct this, each peak is weighted by 3 for the most recent, 2 for the next most recent, and 1 for the oldest markdown.
[Figure: Weighting the kernels for recency]

Adding each of these graphs gives the final markdown distribution in yellow.
[Figure: Summation of the kernels]

Now we perform a process called integration to get the light blue line.  For those of you unfamiliar with integration, it basically records the area under a graph.  For example, at the 20 point on the bottom axis, the blue line records the area under the yellow line up to that point.
[Figure: Integration of the distribution to produce the final curve]

The next step is to introduce the concept of a markdown level which ranges from 0 to 10.  0 corresponds to a 0% markdown and 10 corresponds to a 100% markdown, but in between, things are very different.  As a demonstration we'll calculate the markdown percentages for markdown levels 2, 5, and 8.  The process is quite simple.  Find the levels on the left hand side of the graph and project them across to the blue curve and then down to the bottom of the graph.  This gives percentages of 53, 80, and 94.
[Figure: Calculating markdown percentages from the graph]
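Pulling the whole process together, here's a sketch of how the level to percentage mapping might be coded.  The kernel width of 10 percentage points and the grid resolution are my assumptions; with those settings the lamb hearts history lands close to the 53, 80, and 94 in the worked example, but this isn't the exact calculation behind the graphs.

```python
import numpy as np

def markdown_percentage(history, level, bandwidth=10.0):
    """Map a markdown level (0-10) to a markdown percentage for one product.

    history -- past markdown percentages, most recent first
    level   -- markdown level, 0 (no markdown) to 10 (100% markdown)
    """
    grid = np.linspace(0, 100, 1001)
    dx = grid[1] - grid[0]
    density = np.zeros_like(grid)
    n = len(history)
    for i, m in enumerate(history):                 # i = 0 is the most recent markdown
        k = np.exp(-0.5 * ((grid - m) / bandwidth) ** 2)
        k /= k.sum() * dx                           # compensate for truncation at the edges
        density += (n - i) * k                      # recency weight: newest n, oldest 1
    cumulative = np.cumsum(density) * dx            # the "integration" step
    cumulative *= 10 / cumulative[-1]               # scale so the curve runs from 0 to 10
    return float(np.interp(level, cumulative, grid))  # project level across, then down

for level in (2, 5, 8):
    print(level, round(markdown_percentage([98, 66, 6], level)))   # lamb hearts
```

Running the same function on the chicken breast history gives much smaller percentages at every level, which is the whole point: the markdown adapts to each product's own history.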

For another demonstration let's try something popular like chicken breast that has recent markdowns of 30%, 10%, and 20%.  This leads to percentages of 10, 22, and 33.
[Figure: Calculating markdown percentages from the graph]

Like I said above, this isn't designed to generate an optimum markdown; it's designed to create a markdown specific to each product so that they'll all sell at similar rates.  In the demonstration above the situation may warrant a level 2 markdown.  This means that a product that's hard to sell like lamb hearts gets a 53% markdown, while a popular item like chicken breast only gets a 10% markdown.

So a strategy may be to use a level 2 markdown on the second last day of sale, and then taper throughout the trading day from level 3 to 10 on the last day.

The demonstrations above have been quite simple and don't give a full picture of how powerful this method can be.  Let's put together a complicated situation with fake data to test it.

A new flavour of sausages is introduced and we are unsure how they'll perform.  The calculation initially uses an assumed prior that isn't mentioned above, but it quickly becomes obvious that they are a good seller and don't require large markdowns, with the first level 2 markdown being in the region of 15%.  After 500 recorded markdowns a new, similar, cheaper line is introduced and the sales of the original line suffer.  Before long the system calculates that the first markdown at level 2 needs to be about 55%.
[Figure: Animation of the markdown calculation process]

The animation above is generated from 500 random markdowns with an average of 20% and then 500 random markdowns with an average of 80%.  The calculations take into account the last 100 markdowns, which are weighted from 1 to 100.
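Something like the animation can be imitated with fake data, reusing the markdown_percentage sketch from earlier.  The normally distributed noise and its spread are my own stand-ins for "random markdowns with an average of 20%/80%".

```python
import numpy as np

rng = np.random.default_rng(0)

# 500 fake markdowns averaging ~20%, then 500 averaging ~80%
history = np.concatenate([
    np.clip(rng.normal(20, 10, 500), 0, 100),
    np.clip(rng.normal(80, 10, 500), 0, 100),
])

# Recalculate the level 2 markdown as markdowns accumulate, using only the
# last 100 of them (most recent first, so they're weighted 100 down to 1)
for t in (100, 300, 500, 600, 800, 1000):
    window = history[max(0, t - 100):t][::-1]
    print(t, round(markdown_percentage(window, 2)))
```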

There are a couple of important points to consider.  Firstly, the process uses a lot of statistical methods and equations but isn't really grounded in theory.  It uses the properties of the methods to create a stable system that fits the requirements that are specified.  The next thing to remember is that this process will follow the markdowns.  If you start the markdown process at level 8 without even trying something less than a level 5, the system will generate larger and larger markdowns.  A stabilising term can be added to prevent this.

It's not a perfect method, but it's a lot better than some of the optimised markdown systems I've seen.  I actually trialled this process a while ago in my store (I have no deep access to the computers and had to do it manually) and it's fascinating to see it converge on higher and lower markdowns based on the saleability of the product.

Saturday, November 21, 2015

Adding a Sales Profile to a Discrete Event Simulation

If you've been following along, you'll know I've been putting together a discrete event simulator for inventory management.  The goal is to better integrate knowledge of sales history, stock levels, and ordering patterns to make quantitative decisions about what to order and when.  So far I've been using a scenario involving a 24 hour business with a constant sales rate, but that's not realistic.  Therefore the next step is to add a way to generate a customer flow based on a sales profile.

The way I've gone about doing this is to start with a graph of a sales profile.  Let's slightly change the scenario I've been using.  We're now looking at a profile of sales at a fruit shop.  The fruit shop only opens between 8am and 9pm on weekdays, 8am to 5pm on Saturdays, and 9am to 6pm on Sundays.  Weekday sales are $3000 a day, and weekend sales are $7500 a day.  This is a total of $30000 per week.  When graphed against seconds passed, the cumulative sales for the week are shown below.
[Figure: Sales profile]
In the previous simulations I was modelling the customer flow as a Poisson process, where item sales per second were used to generate random times between customer arrivals.  This makes it hard to create a varying sales pattern based on customer flow.  It's easily changed though; you just need to interpret the problem differently.  If, for example, 150 watermelons are sold per week, that means that on average $200 is spent in the store as a whole between each watermelon purchase.  If we assume that the customer flow of people buying watermelons is similar to that of people buying fruit in general, which in the absence of better data is a reasonable assumption, a sales pattern can now be generated.
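As a sketch, those watermelon purchases can be generated as a Poisson process in "dollar space": exponentially distributed gaps averaging $200 of store-wide sales between purchases.  The seed and array size here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

weekly_sales = 30000.0                       # store-wide sales for the week
items_per_week = 150                         # watermelons sold per week
mean_gap = weekly_sales / items_per_week     # $200 of store sales between watermelon sales

# Poisson process in dollar space: exponential gaps, accumulated over the week
gaps = rng.exponential(mean_gap, size=400)
purchase_dollars = np.cumsum(gaps)
purchase_dollars = purchase_dollars[purchase_dollars < weekly_sales]
```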

If instead of saying "a customer will buy a watermelon at x seconds into the week" we now say "a customer will buy a watermelon at x dollars into the week", it makes sense to reverse the situation and use the sales as our independent variable.  The way to do that is indicated in the image above.  The evenly spaced horizontal black lines indicate sales increments of $1000.  When projected across to the sales profile and then down to the time axis, a customer flow is generated.  You can see that there won't be any sales when the store is closed as the line can't intersect with these time periods.  Days like Saturday and Sunday that have a high rate of sales intersect a lot of lines and project them into a small time period.

I've written a custom class to perform this interpolation process on a piecewise function that's passed to it.  It allows a dollar value to be converted to a time stamp and vice versa.  The supplied interp function in numpy is almost adequate, but some extra constraints are placed on the data.  Firstly, the data supplied needs to have unique timestamps in increasing order; it doesn't make sense to have two different sales values for a single point in time.  The other requirement is that the sales data needs to be non-decreasing; in this case it's fine for two data points to be the same, it just means no sales have occurred in that period.

I also wanted to make sure that when converting from dollars to time, predictable and consistent results were produced.  This may not seem obvious, but if you were to convert the dollar value of 3000 to a timestamp in the above example there are multiple answers.  At any time between closing on the first day and opening on the second day the cumulative sales value is $3000.  I decided that the earliest time a particular dollar value is reached would be returned.

It's simply a matter of creating a piecewise function as shown below.  It can be as detailed as you want.  I've only indicated sales at opening and closing times and linearly interpolated between the data points, but you could supply hourly or minute by minute sales data if you wanted.

[Code: specification of a sales profile]
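The actual class is in the linked repository; below is a minimal sketch of the same idea.  The function names, the choice of Monday midnight as the start of the week, and the use of searchsorted for the earliest-time rule are my assumptions, not details from the repo.

```python
import numpy as np

DAY = 86400   # seconds

def trading_day(day, open_h, close_h, start_sales, day_sales):
    """Two (seconds, cumulative sales) points for one trading day."""
    return [(day * DAY + open_h * 3600, start_sales),
            (day * DAY + close_h * 3600, start_sales + day_sales)]

# Piecewise cumulative sales profile for the fruit shop, week starting Monday 00:00
points, total = [(0, 0)], 0
for day in range(5):                                # Mon-Fri, 8am-9pm, $3000/day
    points += trading_day(day, 8, 21, total, 3000)
    total += 3000
points += trading_day(5, 8, 17, total, 7500)        # Saturday, 8am-5pm, $7500
total += 7500
points += trading_day(6, 9, 18, total, 7500)        # Sunday, 9am-6pm, $7500
total += 7500
points.append((7 * DAY, total))                     # $30000 by the end of the week

times = np.array([p[0] for p in points], dtype=float)
sales = np.array([p[1] for p in points], dtype=float)

def time_to_dollars(t):
    """Cumulative store sales at t seconds into the week."""
    return np.interp(t, times, sales)

def dollars_to_time(d):
    """Earliest time at which cumulative sales reach d dollars."""
    j = np.searchsorted(sales, d, side='left')
    if j == 0:
        return times[0]
    return times[j - 1] + (d - sales[j - 1]) * (times[j] - times[j - 1]) / (sales[j] - sales[j - 1])

# Back-project evenly spaced dollar increments (the horizontal lines in the
# graph above) onto the time axis to get a customer flow
increments = np.arange(1000, 30001, 1000)
arrival_times = [dollars_to_time(d) for d in increments]
```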

Using this function, a customer pattern is generated in terms of dollars and then back projected to timestamps.  These values are then used in the simulation.  I haven't changed the ordering requirements from the previous scenario.  You have to order 24 items at a time, the deliveries happen at 1 am, and after the delivery driver has left, you have to have 50 or more items.  So what does the simulation look like now?
[Figure: Stock on hand simulation with sales profile]
You can see the plateaus where there are no sales because the store is closed.  The deliveries at 1 am also appear as interruptions to the plateaus.  Also visible are the slow weekday sales of $3000 over 13 hours.  This can be compared to the weekend sales of $7500 over 9 hours.  This increased rate of sales, approximately 3.6x on a weekend, is visible as well.

I'm not happy with my code at the moment so I'll take some time to refactor it and change some of the terminology to generalise some parts.  My main concern at the moment is to create good libraries; the test.py program is sort of a test bed for these.
https://github.com/GrantTrebbin/DiscreteEventSimulator/commit/eb24b7c44298c428170db3490f4d57e4062de7af
Get the code!

Tuesday, November 10, 2015

Monte Carlo Simulation and Discrete Event Simulation

I've been doing some more work on my discrete event simulator, or to be more accurate, the supporting functions that help to analyse the results.  In my previous post I described a basic stock management situation that can be modelled.  I'll continue to use that model here as it's simple to understand and helps illustrate the features I've added.

To bring you up to speed, the simulation models a 24 hour, 7 day a week business that gets deliveries at 1 am and sells items all day.  It's not realistic but it's a starting point.  The main problem with it is the discrete random nature of the events.  This can be seen in the week long simulation below from the last post.  It's possible to identify trends and events, but it offers no insight into how likely these events are to occur.

[Figure: Original One Week Simulation]

Getting data about the likelihood of events occurring can be accomplished by running the simulation for a longer time period and then comparing smaller sections of the simulation.  In my case I ran the simulation for 200 weeks, broke the data up into weeks, and then performed analysis on those multiple trials, just like a Monte Carlo simulation.  These results can be seen below.

The graph illustrates the exact same model, but instead of one trial, the model is essentially run 200 times.  This allows percentiles for the simulation to be generated, in this case 0, 10, 50, 90, and 100.  To put this into more understandable terms, in the graph below, 0% of the simulations predicted stock levels less than the dark blue line, 10% predicted levels less than the green line, 50% predicted levels less than the red line, 90% predicted levels less than the light blue line, and 100% of the simulations predicted levels less than the purple line.

[Figure: A Monte Carlo analysis of a discrete event simulation]
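The percentile step itself is a one-liner once the data is on a regular grid.  The weekly_stock array here is random stand-in data; the real one would come from resampling the 200 simulated weeks onto the 5 minute timebase.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in data: one row per simulated week, one column per 5 minute sample
weeks, samples_per_week = 200, 7 * 24 * 12
weekly_stock = rng.integers(0, 80, size=(weeks, samples_per_week))

# Percentile bands across the 200 trials, one value per 5 minute slot
bands = {p: np.percentile(weekly_stock, p, axis=0) for p in (0, 10, 50, 90, 100)}
```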
This kind of analysis allows sensible, measured conversations about stock levels to take place.  All too often predictions are made that don't allow nuance and can cause problems.  For example, instead of saying "I reckon we'll run out of stock by 1pm" we can now say "There's a 40% chance we'll run out by 1 pm".  Everyone can take that on board and take appropriate action; it may be that a 40% chance is acceptable as there may be some other more urgent matter to take care of.

I'm anthropomorphising the conversation, but in reality I see this as a way for computers to compare and prioritise tasks.

You may have noticed a problem with what I've described.  The original data from the discrete event simulator isn't a regularly sampled data set.  A customer may come in 10 minutes after the last one and then you won't see another for an hour.  To perform the Monte Carlo simulation the data needs to be resampled with a regular timebase.  In my case I've chosen to sample the data every 5 minutes.

This is done with a kind of zero order hold.  The process to determine the value at a specific time is straightforward: if an event occurs at that time, the new data point will be the value after the event.  If there is no event, the data point will be the value after the most recent previous event.  If there is no previous event, the first one is used.

The image below illustrates the initial data set of (1, 1), (5, 2), (10, 3), (11, 4), (12, 5), (20, 6).  These are marked with blue crosses.  The red line shows how this data is resampled for different points.  The black bars show the result of resampling the data at -10, 1, 2, 10 and 30.  The results are 1, 1, 1, 3, and 6 respectively.  I've chosen to resample the data at five arbitrary points here, but if performed at regular intervals the information is ready to be fed to the Monte Carlo analyser.
[Figure: Resampling a data set]
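Here's a short sketch of that zero order hold using the example data above; numpy's searchsorted does most of the work.  The function name is mine, not from the repository.

```python
import numpy as np

# Event data from the example: (time, value) pairs
event_times  = np.array([1, 5, 10, 11, 12, 20])
event_values = np.array([1, 2,  3,  4,  5,  6])

def resample(sample_times):
    """Zero order hold: take the value of the most recent event at or before
    each sample time; before the first event, fall back to the first value."""
    idx = np.searchsorted(event_times, sample_times, side='right') - 1
    return event_values[np.clip(idx, 0, None)]

print(resample([-10, 1, 2, 10, 30]))    # [1 1 1 3 6], matching the example
```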
To give a usage example I've folded the entire simulation into a period of one day.  That means there are 1400 days, sampled every 5 minutes, simulated and compared in the image below.  If I had concerns about the stock levels at 8:50 pm (sample 250), I could look at this and say that I'm very confident that there will be stock; in fact there should be about 30 to 50 items available.

[Figure: The simulation folded into a single day]
In the simulation below I've increased the sales rate so a different question can be asked.  If customers complain about out of stocks you can look at this and say that supplies will run out somewhere between 7:30 am and 3:50 pm (samples 90 and 190), but most likely at 10:50 am (sample 130).  This then allows you to take action, such as increasing the order size.

[Figure: Increased rate of purchase indicates likely stock exhaustion points]
It's all a bit simplistic at the moment, but it's getting there.  What I'm really aiming to show is the complex interaction of seemingly unrelated variables.
https://github.com/GrantTrebbin/DiscreteEventSimulator
Get the code.

Saturday, June 20, 2015

An Analysis of the Triple M Brisbane Playlist

When listening to the radio in the car I do exactly what I would at home: channel surf.  Generally I listen to Triple J but also switch between Triple M, 4ZZZ, and 612 ABC for something different.  For a couple of months there was a period when every time I switched over to Triple M it seemed that they were playing The Spin Doctors.  Now, any normal person would pass that off as mere coincidence and leave it there, but if you've read this blog before you're probably aware that I'm not normal.  I started to wonder exactly what type of songs were played on the station.  Did they play certain music at different times of the day to cater to a different demographic?  More importantly, how often do they actually play The Spin Doctors?  To answer this I needed to log what songs were played, and unsurprisingly I chose a solution that required electronics: logging the RDS messages the station broadcasts.  The method used was described in a previous post, Using a Raspberry Pi to Log Songs Played on a Radio Station.  They do list recently played songs in their on-line streaming player, but I couldn't figure out a way to easily log them.  Besides, electronics is more fun.

I've since discovered a similar analysis was done by Daniel Nitsche in 2014.

The logger was left to run for two weeks, generating a file with 273902 entries, on average logging what was being played every 4-5 seconds.  Writing software to turn that into something useful was, in my opinion, going to take longer than editing the file by hand, as there were intelligent decisions to make along the way.  There were small errors that needed correcting as well.  For example, when the daytime presenters played a snippet of the songs to be aired in the next hour, that was also broadcast on the RDS message system.  So it would appear they played the song and then played it again within the hour.  That had to be corrected.  I'm not saying this couldn't be automated, but for a one off, it's not worth it.

After editing, the data showed that there were 2805 songs played over the two week period, about 8 songs an hour.  That sounds about right.  There's probably the occasional error in my data set, but I believe it's accurate enough to show any trends.  There may also be some errors in the release year I've listed for the songs.  Initially I tried to use an on-line music database to get the release date for each song, but that was way too slow and error prone.  In the end I Googled each of the 868 songs to find out when they were released (it doesn't take as long as you think, I averaged about 6 a minute).  Then comes the problem of when a song was actually released.  Some songs are released on an album but their listed release date is when the single comes out.  I chose when they were first publicly available.

So what can we actually learn from this data?  I started with the histogram below to see the distribution of songs played.  The oldest song played was "Sympathy for the Devil" by the Rolling Stones from 1968.  Then there's a bit of a peak in 1971 before the bulk of songs from about 1980 to 2000.  1991 seems to be a massive year for music and unsurprisingly aligns closely with the active rock format that Triple M subscribes to.  Billboard has a go at explaining 1991.  The bulk of the songs played from 1991 are from the following classic albums.

U2 - Achtung Baby
Red Hot Chili Peppers - Blood Sugar Sex Magik
R.E.M. - Out Of Time
Pearl Jam - Ten
Nirvana - Nevermind
Metallica - Metallica
Guns N' Roses - Use Your Illusion I
Guns N' Roses - Use Your Illusion II

The next thing to notice is that there aren't many songs from the 2000's, but they do play a lot of recent music from the last 5 years.

[Figure: Yearly Distribution of Songs Played on Triple M Brisbane]
The next plot shows how the release years of songs are distributed over the two week period.  Although you can see the information from the histogram reflected here, there isn't too much else to see.
[Figure: Release Year of Music Played on Triple M Brisbane]
The last graph was a bit boring, but by taking the data and showing it differently we can learn some interesting things in the one below.  By plotting all the weekday data on one daily graph, patterns begin to emerge.  For instance, between the hours of 5 and 9 am it appears that there is a blackout on pre-1980 music.  You can also see that songs are less frequently played between 6 and 9 am when the breakfast crew are doing their thing.

When the drive show comes on from 4 to 6 pm you can once again see the drop off in the amount of music played to make way for the presenter.  Between 6 and 7 pm there is a sports show that plays relatively little music as well.  Then between 7 and 9:30 pm the focus seems to be on 80's music; this then changes to 90's music until about midnight.  I assume they have data that says people who like 80's music are in bed by 9:30 pm.
[Figure: Weekday Daily Distribution of Songs Played on Triple M Brisbane]
The data for the weekend shows similar patterns.  After lunch, the sports coverage starts and the music stops.  There's also a relatively dense block of new music between about 6:30 and 8:30 at night; this turns out to be a show on Sunday that plays a lot of new music.
[Figure: Weekend Daily Distribution of Songs Played on Triple M Brisbane]
I also did up a Poincaré plot to see if any other patterns emerged, but the only thing obvious is the predominance of 1991.  If you've never seen one of these before, it's a graph that plots data point n against data point n+1.  They're very useful for finding patterns in binary files as well.
[Figure: Poincaré plot - Release Year of Songs Played on Triple M Brisbane]
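If you want to make one of these yourself it's only a couple of lines with matplotlib.  The years list here is placeholder data; the real one comes from the logged playlist.

```python
import matplotlib.pyplot as plt

# Release year of each song in the order it was played (placeholder data)
years = [1991, 1987, 2014, 1991, 1979, 1991, 2001, 1991, 1995, 1991]

# Poincaré plot: each consecutive pair (year n, year n+1) becomes one point
plt.scatter(years[:-1], years[1:], alpha=0.3)
plt.xlabel("Release year of song n")
plt.ylabel("Release year of song n+1")
plt.show()
```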
So to wrap things up I looked at what the most popular artists were.


Artist                   Times Played
Red Hot Chili Peppers    76
INXS                     69
Foo Fighters             66
U2                       65
Pearl Jam                61
R.E.M.                   56
Midnight Oil             55
AC/DC                    44
Guns N' Roses            40
Powderfinger             39


I also looked at the most popular songs.


Song                   Artist                 Times Played
Georgia                Vance Joy              26
Hold Back The River    James Bay              19
Congregation           Foo Fighters           17
Blame It On Me         George Ezra            17
Every Breaking Wave    U2                     17
Crystals               Of Monsters and Men    16
Someone New            Hozier                 15
Believe                Mumford & Sons         15
Two Princes            Spin Doctors           14


Ha!!!!!  Proof I'm not insane*.  Two Princes by Spin Doctors is the 9th most played song on Triple M.  Coincidentally released in 1991.

*This blog post probably doesn't help my case.  Hmmm :-/

You can find my original logs here.

https://drive.google.com/file/d/0B5Hb04O3hlQST2pKU0dHOTQ5SlE/view?usp=sharing


Tuesday, March 18, 2014

Probability of Collecting Multiple Full Sets

Well, it's that time again.  One of the large supermarket chains has released yet another set of collector cards.  This time there are 42 cards to collect compared to the 108 cards of the last campaign, and I thought it'd be interesting to explore some of the statistics of card collecting.  In my last post on this topic, Probability of Collecting a Full Set, I looked at the expected number of cards you need to collect on average before you have a full set of cards.  I suggest reading that article for some background on the problem before proceeding any further.

All caught up?  Good.  The question I'm posing this time is: how many cards do you need to collect on average before you have multiple complete sets?  The inspiration for this question came from hearing about families with multiple children trying to collect a set of cards for each child.  Of course, this once again assumes that trading cards with other people is disallowed.

Like my first article about this, I initially thought that I had a handle on the problem and the answer would be easy, and once again the maths kicked my butt.  I scraped through by the skin of my teeth last time, but not this time.  I originally thought that the probability of collecting two sets of cards could be found by taking the probability of collecting one set of cards and multiplying it by the probability of collecting further sets from the left over cards.  It seems logical, but after running a few quick simulations it was obvious I was wrong.  The problem lies in the fact that the left over cards are not evenly distributed.

So, where to now?  Google.  It took some time to find, but I eventually found that the problem is called the "double dixie cup problem" and was only solved by Newman and Shepp in 1960.  I found a good run-down of the problem at http://www.brynmawr.edu/math/people/anmyers/PAPERS/SIGEST_Coupons.pdf that gets me close to solving it analytically, but I just didn't have the time to fully absorb it.  So I listened to that little engineer buried deep inside and decided to cheat by using a Monte Carlo simulation.  Near enough is good enough.  I suppose it's not technically cheating, but it does feel a little dirty.

The files associated with this post are located here.  They're a bit rough, but they were only used for working.

Octave was good enough for the job.  It was simple to write a script that repeatedly drew cards and counted how many full sets were drawn.  My simulation was for randomly drawing 1 to 1000 cards and seeing how many full sets of 42 cards were collected.  Each of these simulations was run 10000 times.  The results are shown below.
[Figure: The Probability of collecting at least n full sets of 42 cards]
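The original script was written in Octave; the sketch below is my Python take on the same idea, drawing up to 1000 random cards in each of 10000 trials and counting complete sets as the number of copies of the rarest card.  Expect it to take a little while to run, and expect the estimates to be as noisy as the graphs above.

```python
import numpy as np

rng = np.random.default_rng(3)
CARDS, TRIALS, MAX_DRAWS = 42, 10000, 1000

# sets_held[t, n] = complete sets held in trial t after n + 1 random cards
sets_held = np.zeros((TRIALS, MAX_DRAWS), dtype=int)
for t in range(TRIALS):
    draws = rng.integers(0, CARDS, size=MAX_DRAWS)
    one_hot = np.zeros((MAX_DRAWS, CARDS), dtype=int)
    one_hot[np.arange(MAX_DRAWS), draws] = 1
    # running count of each card; the rarest card's count is the number of full sets
    sets_held[t] = np.cumsum(one_hot, axis=0).min(axis=1)

# Estimated probability of holding at least s complete sets after 200 cards
for s in (1, 2, 3):
    print(s, (sets_held[:, 199] >= s).mean())
```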

The familiar shape from the previous article can be seen here.  The noise from the Monte Carlo simulation can also be seen on the graphs.  The results match what I expect.  For example, if you were to collect 200 cards, it would be highly unlikely you would have 3 sets (0%), there's a slim chance you'd have 2 sets (10%), but you'd almost certainly have one complete set (90%).  But what is the expected value for a certain number of sets?  This is where things get a little dodgy.  Because of the noise in the simulation it's possible to have negative probabilities in the probability mass function (yes, I know that it should be a stem plot, but I find it easier to see what's going on this way).  The main point is that a negative probability doesn't make much sense.  The plots below should be a series of smooth humps.  If the Monte Carlo simulation was run for 10 times longer we would get closer to approximating a smooth plot.  Even though the data doesn't quite make sense, it should still be able to be used to approximate the expected value.
The expected number of cards needed to collect 1, 2, 3, 4, and 5 sets was calculated.

1 set  - 182 cards - 182 cards per set
2 sets - 264 cards - 132 cards per set
3 sets - 337 cards - 112 cards per set
4 sets - 404 cards - 101 cards per set
5 sets - 468 cards -  93 cards per set

As the number of sets to collect increases, I suspect that the number of cards per set would approach a limit of 42 cards per set.

As a sanity check, we can analytically calculate the expected number of cards to collect for one set using the equation from the previous post, nH(n), where H(n) is the harmonic number of n.  42*H(42) = 181.7, which agrees with the result from the Monte Carlo simulation.
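That sanity check is only a few lines if you want to reproduce it; exact fractions avoid any rounding in the harmonic sum.

```python
from fractions import Fraction

def expected_cards(k):
    """Expected draws to complete a k card set: k * H(k)."""
    return float(k * sum(Fraction(1, i) for i in range(1, k + 1)))

print(round(expected_cards(42), 1))    # 181.7
```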

So what's the point of all this?  Although it may seem a daunting task to collect a set of cards (you need to collect 182 on average) when you pool resources with other people the number of cards you need to collect decreases.  This can be accomplished within a family or by gathering a group of people to trade with.

Tuesday, October 15, 2013

Probability of Collecting a Full Set

The frenzy surrounding the Aussie Animal cards promotion got me thinking.  How many randomly collected cards would you need before you had the whole set?  More specifically, if you were randomly given n cards, what's the probability of having the full set?  I'm going to use the 108 card Aussie Animal set as an example.

All the files associated with this post can be found here.

When I initially came up with the idea for this post I thought it would be easy.  Turns out I was wrong.  A couple of quick simulations showed my understanding of the problem wasn't right, but after a little research I was back on track.  It did take me 2 days though.

The problem is mathematically the same as a well known problem called The Coupon Collector's Problem.  In this context the problem can be restated in the following way:

"Suppose that there are k different coupons, equally likely, from which coupons are being collected with replacement.  What's the probability that after n sample trials the complete set of k coupons is collected."

The formula to calculate this is deceptively simple.

There are k^n different ways to select the coupons.  The next step is to figure out how many of those result in a complete set being collected.  I'm still getting my head around this, but it comes down to something called Stirling numbers of the second kind, which equal the number of ways to partition a set of n objects into k non-empty subsets.  This suits the problem.  The collected coupons need to be partitioned into k sets and none of these sets can be empty, i.e. the coupon hasn't been collected.  This number is represented as S(n,k).  This number has to be multiplied by k!, as there are this many ways to arrange the sets.
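Putting those pieces together, the probability of holding a complete set after n draws is P(n, k) = k! S(n,k) / k^n.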

This is where things got hard.  I tried GNU Octave to calculate the results, but as there was no built in function I tried to roll my own.  I used an explicit formula for S(n,k), but that required numbers like 1000! to be calculated, and a number this large is just too big for Octave to handle.  I came up with a way to do all the calculations using the log of the number, but there was a precision problem for low values of n.  Although with more time I could have got it working, I scrapped it and moved to Maxima, a computer algebra system that uses arbitrary precision (should have started there).  A couple of lines of code later I had my result.  The data was imported into Libre Calc to graph the solution.
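For what it's worth, the same number can also be computed exactly in Python, whose integers have arbitrary precision, so the 1000!-sized intermediate values aren't a problem.  This is a sketch of my own rather than the Maxima code used for the post; it builds S(n,k) with the standard recurrence and should reproduce figures like the ones listed below.

```python
from math import factorial

def stirling2(n, k):
    """S(n, k): ways to partition n labelled items into k non-empty subsets,
    via the recurrence S(m, j) = j*S(m-1, j) + S(m-1, j-1)."""
    row = [1] + [0] * k                    # S(0, 0), S(0, 1), ..., S(0, k)
    for _ in range(n):
        row = [0] + [j * row[j] + row[j - 1] for j in range(1, k + 1)]
    return row[k]

def p_full_set(n, k):
    """Probability that n random cards contain a full set of k cards."""
    return factorial(k) * stirling2(n, k) / k ** n

print(p_full_set(600, 108))    # roughly 0.66 for the Aussie Animal set
```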

Back to the Aussie Animal cards.  I've chosen a couple of data points to illustrate how hard it can be to collect the set.

n = 200,    P = 8.99E-11
n = 400,    P = 0.0620
n = 600,    P = 0.662
n = 800,    P = 0.938
n = 1000,    P = 0.990

After collecting 600 cards you still only have a 66.2% chance that you've collected a set.


The expected value of how many cards you need to collect before you have a set is given by the formula nH(n), where H(n) is the harmonic number of n.  In our case this is equivalent to 108*H(108) = 568.5.  To double check this, the probability mass function of the above data was generated to manually calculate the expected number.  I get 557.  Given it was done in a spreadsheet with low precision data only up to n=1000, I'm pretty happy they agree that closely.
This shows that without trading cards with other people it can be difficult to collect an entire set.  It assumes that all cards are equally likely and independent.  This may not be the case but it's not unreasonable to assume it is.

I'm glad I tackled this problem, learning about Stirling numbers made it worthwhile.  I now have another mathematical tool under my belt.  Would've been nice to have heard about them in statistics class, but there's only so much you can cover.