Dot Star Money: Toughest TopCoder Problems #1: CollectingBonuses

Time for a new series! I am very cynical of "competitive programming websites," that is to say I think they hardly showcase talent and instead showcase whose lucky enough to remember what specific bit of knowledge is required per-problem. What I'd like to do then, is take possibly the most well known competitive programming website (TopCoder), take a look at their "hardest problems," and deconstruct/solve them to both prove my point, and uncover the knowledge required behind them. I am in no way involved in the TopCoder community nor do I care to be, so everything you see here is original work.

To find what TopCoder would indicate are the worst of the worst, I've gone to their problem archive and sorted the problems by the lowest solve percentage. There are about two pages of 0% (so nobody got 'em), and then we start to see a trickle in of problems one to a handful of people were able to solve. Keep in mind, "solving" one of these problems means you were able to get it from solution to software in at most 75m.

Lets start with the very first problem on this list that had a successful submission...

CollectingBonuses

The Problem

I take it most are capable of reading the problem statement, but to summarize an equivalent form of the problem that is less verbose:

There is a set of \(n\) unique numbers. Let the following sampling process be defined: Choose a number from the set with each number's selection equally likely. Compute the expected value \(P\) of samplings such that from the set of all numbers sampled, there are \(k\) distinct values.

Like all competitive programming problems, this one is deceptively simple. At first glance it appears you could just cheese it and run a Monte-Carlo simulation, yet they stipulate the expected value be computed accurately to within a precision of \(1.0\times10^{-9}\), which would require more computation time than TopCoder allows. Okay so fair enough, we'll need to find a way to compute this value directly.

A Formula for the Expected Value

Anyone with a basic understanding of probability shouldn't have too much difficulty recovering a formula for the expected value. You might try solving this problem by thinking about how the expected value of samplings changes as we sample towards \(k\) unique sampled numbers. When \(k=1\), the expected value \(P\) is \(1\) as our set of sampled numbers has exactly \(k=1\) unique members: we certainly don't need to sample more than once to get \(1\) unique number.

What about when \(k=2\)? Well sample once to get our first unique number, so \(P=1\) so far. Now to determine how many samplings we expect to make to get a second unique number, we have to think about the probability of selecting a number that isn't the one we already selected. If we have \(n\) different numbers to choose from, then the probability of picking one we haven't already chosen is: \[\frac{n-1}{n}\]This can be inverted to give us an expected value of samples for picking a 2nd unique number so: \[\frac{n}{n-1}\]Therefore, \[P=1+\frac{n}{n-1}\]When \(k=2\).

When \(k=3\) we can follow very similar logic. We can just use the expected value from when \(k=2\) and add to it the expected value of samples for picking a 3rd unique number. Similar to last time, when a 3rd number is sampled, the probability of picking one that isn't one of the numbers we've already selected is: \[\frac{n-2}{n}\]Again then the expected value of samples we must take is: \[\frac{n}{n-2}\]This makes our expected value of samples for \(k=3\): \[P=1+\frac{n}{n-1}+\frac{n}{n-2}\]If a pattern wasn't immediately obvious before, its clear that we are encountering a sum of some kind.

We can consider the last term of our latest formula for \(P\) as the beginning of a summation, and attempt to generalize the formula above using sigma notation: \[P=n \sum_{i=n-k+1}^{n} \frac{1}{i}\]

Computing Large Values of \(P\)

Now anyone who is familiar with competitive programming knows that the easy bit above is far from the solution to the problem. The problem further requires that our function compute expected values for \(n\) and \(k\) as large as \(1.0\times10^{18}\). Obviously we can't just grind out the sum stated above as we firstly, well... we'd be there for a while... Secondly, if we were somehow able to add that many terms of the sum, the numerical inaccuracies would build up pretty severely by the time we got to \(1.0\times10^{18}\). So what do we do?

As it turns out, the previously mentioned sum is known as a Harmonic series. This mathematical form has been around for ages, and there is a well known relationship between a specific form of Harmonic series and what is known as the digamma function:\[\sum_{i=0}^{\infty}\frac{1}{i+x} =-\psi^{0}(x)\]Our strategy here will be to leverage some of the ways we can represent \(\psi^{0}(x)\) to compute values for our Harmonic series for \(P\). Let us first represent the series for \(P\) in terms of \(\psi^{0}(x)\). (For now, we won't greatly delve into the digamma function, just know that its relationship with Harmonic series lends itself to solving this problem.)

Our sum for the digamma function was infinite, yet we seek a sum that is finite. We can deal with this by representing the sum for \(P\) in two parts, one part that contains both our finite sum and the unwanted terms that stretch out to infinity, and then a second part with just the unwanted infinite terms. By subtracting the second part from the first, we can recover the finite sum for \(P\). In other words, by subtracting two infinite sums, we recover a finite sum. Lets start with the first one by pretending that our sum for \(P\) has upper limit \(\infty\) instead of \(k\). Since our sum counts up from \(n-k+1\), we can represent this variation of our sum for \(P\) as:\[\sum_{i=0}^{\infty} \frac{1}{i+(n-k+1)}=-\psi^{0}(n-k+1)\]Now we aren't interested in the terms that go past \(i=n\), so for our second sum which we will subtract from the first, we will use:\[\sum_{i=0}^{\infty} \frac{1}{i+(n+1)}=-\psi^{0}(n+1)\]Therefore, we can combine the two sums in terms of digamma and aesthetically rearrange to recover:\[P=n(\psi^{0}(n+1)-\psi^{0}(n-k+1))\]This leaves only the difficulty of computing values for digamma, which conveniently bears an approximation perfect for our application.

Computing Values for Digamma

Now it is entirely permissible for us to use summation formula for \(P\) when \(k\) is sufficiently small, but when \(k\) is large, as previously stated, we'll need to use the digamma formula for \(P\). The takeaway here is that we need only use the digamma formation when \(k\) is large, which opens for the possibility to replace \(\psi^{0}(x)\) with an asymptotic approximation assuming we have one and its a very good approximation. I imagine you see where this is going? As it turns out, we have one (sort of)!

Let us first examine the exponential of the expression inside the parenthesis in the digamma formation of \(P\):\[{\Large\mathrm{e}^{\psi{0}(n+1)-\psi{0}(n-k+1)}=\mathrm{e}^{\psi{0}(n+1)}\mathrm{e}^{-\psi{0}(n-k+1)}}\]This expression is useful considering there happens to be a very good approximation for \(\Large\mathrm{e}^{\psi{0}(x)}\), namely:\[{\Large\mathrm{e}^{\psi{0}(x)}}\approx x-\frac{1}{2}, \mbox{if }x>1\]Using this approximation, we can write our previously mentioned exponential:\[{\Large\mathrm{e}^{\psi{0}(n+1)}\mathrm{e}^{-\psi{0}(n-k+1)}}=(n+\frac{1}{2})(n-k+\frac{1}{2})\]Woah, now how is that for an improvement?

Of course, the exponential of the inner parenthesis in the formula for \(P\) isn't very useful, so we'll remove the exponentiation by taking the natural logarithm of the exponential form and bring \(n\) back in to recover an approximation for \(P\):\[P\approx\ln\left((n+\frac{1}{2})(n-k+\frac{1}{2})\right) n\]

Conclusion

Finding the expected value is now simply a matter of checking the size of \(k\), and at some point such that the approximation provides enough digits of precision (which should be well before the number of terms gets out of control), switch to the approximation of the series to compute \(P\) instead of using the series directly. I never got around to coding this solution as I solved the problem in about 40 minutes at 4:00am-ish, but if you can't already tell, the code should be a piece of ice cream cake preferably because that's my favorite kind of cake.

Monday, January 11, 2016

Toughest TopCoder Problems #1: CollectingBonuses