Comments on Pappus' plane - cricket stats: Highest scores
David Barry, updated 2016-03-12T08:52:28.363+01:00

2014-03-10T11:46:15.632+01:00
that's what I get for posting too late at night.
Posted by bernard kachoyan

2014-03-06T07:34:49.240+01:00
Filling in some of the algebra that I skipped yesterday:

My original post has

HS = -avg * ln(1 - 0.5^(1/N)).

Now, for N large, 1/N is small, and we can approximate 0.5^(1/N) by its Taylor series truncated after the linear term. I.e.,

0.5^(1/N) = exp((1/N)*ln(1/2))
~ 1 + (1/N)ln(1/2)
= 1 - ln(2) / N.

So,

-avg*ln(1 - 0.5^(1/N)) ~ -avg*ln(ln(2)/N)
= avg*[ln(N) - ln(ln(2))],

which is the median of the Gumbel distribution.
Posted by David Barry

2014-03-05T13:17:35.059+01:00
bernard kachoyan left an excellent comment on the wrong post:

"I think a more elegant way to look at this is to consider extreme value theory. The maximum of a set of N exponential random variables can be shown to approach a Gumbel distribution for N large. The expected value of that distribution is similarly given, for N large, by (G + Ln(N))/A where A is the average of the distribution and G is Euler’s constant (about 0.58).
This gives a relationship between the expected maximum and the average of the individual distribution which doesn’t rely on your arbitrary factor of ½. Note that your expression for the maximum is approximately (ln 2 + Ln(N))/A for large N. Since ln(2) = 0.69 this is close to the expression above, so perhaps there is a deeper result here that I haven’t noticed.
All this obviously begs the question of whether there is a correlation between the max/mean ratio and the number of innings that these formulations suggest, but that is the subject of another post."

My response:

This is really interesting - I didn't know anything about limiting distributions of maximum values and had never heard of the Gumbel distribution.

(Clerical note: you should be multiplying by the mean of the exponential distribution, not dividing.)

What I am effectively doing with my (not actually arbitrary!) 1/2 is asking for the median of the almost-Gumbel distribution, which here is avg*(-ln(ln(2)) + ln(N)). I haven't proved that my original expression is approximately equal to this for large N, but it is clearly true when I empirically plot one against the other (differences of "predicted" high scores are less than half a run). I guess you lost a log somewhere in your algebra!

Anyway, the Gumbel distribution is skewed to the right, with the mean larger than the median, so your suggested method of using the mean of the Gumbel distribution results in higher "predicted" highest scores.

median: avg*(-ln(ln(2)) + ln(N))
mean: avg*(gamma + ln(N)).

I prefer using the median - I like having a number here that'll have around half of all batsmen below it and half above. (Actually only 45% of the batsmen in the dataset are above the predicted-by-Gumbel-median highest score; 38% are above the predicted-by-Gumbel-mean highest score.)
But I can imagine people's tastes being different here.
Posted by David Barry

2014-03-03T02:02:41.115+01:00
To give a very weak defense of myself, I put quotation marks around "underachiever" the first time I wrote it. I meant underachiever in the field of scoring large scores.

Without having thought about it much, I think I'd be pretty agnostic about whether I prefer consistent/inconsistent batsmen. Gabe, is there a reason that you would prefer to have a consistent batsman? Does that somehow lead to more won/drawn games?
Posted by Martin Leslie

2014-03-03T01:21:32.290+01:00
Hi Gabe!

The under/over-achiever terminology was Martin's, not mine....

I did wonder about using RPI - I think it was you who first pointed out to me that it's a better predictor of the next innings than the average? But I'm too lazy to change my ways when all it does is squeeze a couple of R-squared points out of everything. (I feel that there has to be a better predictor than RPI, though again it wouldn't make large changes....)

On ducks and centuries: I've written on this topic before! e.g., here (http://pappubahry.blogspot.com.au/2008/03/partly-explaining-most-of-all-double.html), here (http://pappubahry.blogspot.com.au/2010/12/are-some-batsmen-nervous-starters.html; this one has the next-simplest hazard function that I would use to incorporate ducks properly), here (http://pappubahry.blogspot.com.au/2011/12/duck-to-century-ratios.html).
Posted by David Barry

2014-03-03T01:00:59.092+01:00
Nice to see you back, however fleetingly, David.

I did a practically identical exercise to this a few years ago, but never got around to writing it up. There was a companion piece about how you just need to know how many innings a bowler has bowled in and how many wickets he's taken overall to estimate, e.g., his 5WI count using the Poisson distribution. That never saw the light of day, either.

A few notes:

(1) A simple extension is to calculate the expected number of, e.g., 100s in a career of a given length and a given mean, which is

exp(-100/avg)*N

So, e.g., someone with Tendulkar's average should have scored 51.25 100s (he actually got 51). I get r2 = 0.946 for your 50+ dismissals dataset for this.

(1a) As an aside, you could, of course, try to predict the number of ducks you should expect using this method, but that draws attention to the inadequacy of the constant hazard assumption and consequent exponential approximation when it comes to the beginning of batsmen's innings (Tendulkar should have 6 ducks; he has 14; r2 = 0.616 across the dataset).

(2) I'm not convinced that using the batsman's average as the mean of your distribution does you any favours when you then compare your model with reality. That's because the average, by accounting for not-out innings, effectively estimates a batsman's expectation of runs in a world in which not-outs don't exist. So you end up modelling that world and, when you compare it to the actual record, the not-outs muck things up ever so slightly. I can think of some quite involved ways of getting around this problem, but that would take away the fun of the really simple model providing an impressively good approximation of reality. So a quick fix would be to use RPI rather than avg as your mean.
If you do that, you get a slightly improved fit (r2 = 0.753), though you move from systematically slightly overestimating the HS to systematically slightly underestimating it.

(3) Pedantically, I think I take issue with the suggestion that batsmen who have a lower HS than would be expected from their average are underachievers; a narrower range of achievement for a given average suggests a slightly more consistent career and, since innings-to-innings consistency is weakly a good thing, I think I say the lower the HS the better.
Posted by Gabe (http://www.deepbs.com)

2014-03-02T17:46:33.110+01:00
I think Bob Willis's absurdly high p-value is entirely because of being stranded!

If you look at his innings list on statsguru I think he has a high score of 56 runs between dismissals (for all-time records of this type see http://cricketarchive.com/Archive/Articles/1/1610.html although that's not very up to date) in 73 completed innings.

1-(1-exp(-56/11.5))^73 = 0.43, so his "high score" is about right.

I'm not willing to go through any more innings lists, so I think I'll leave this investigation here.
Posted by Martin Leslie

2014-03-02T04:11:15.271+01:00
I'm not sure it measures quite what you're after, but an idea would be to just calculate a sort of p-value: What is the probability of having a highest score of at least the batsman's actual highest score? Which is 1 - (1 - exp(-HS/avg))^N.

Wasim Akram comes out on top at 0.0017, followed by Gillespie 0.0020. At the other end, Bob Willis has an absurdly high p-value of 0.999992, and Corey Collymore 0.9993.
Probably they're hurt a bit by batting at number eleven and being left stranded so often.

(Everyone else is pretty sensible, though the numbers aren't a perfect fit: about 10% of the dataset are at p > 0.95).
Posted by David Barry

2014-03-01T16:32:30.134+01:00
I suppose I really should normalise the difference. Do you have any insight into what the correct thing to divide by is here so that we are finding the batsman who underachieves in high score for their skill level, not absolute underachievement?

If I calculate (estHS-HS)/HS the new list of underachievers is Willis, Collymore, Kasprowicz, Chatfield, Trueman, Giles, Oldfield, Kelly, Mackay, Prior. Also M Waugh is a bigger underachiever than S Waugh by this measure.
Posted by Martin Leslie

2014-03-01T16:22:42.815+01:00
As you say, it's kind of silly to predict an average based on a high score, but predicting a high score based on average could have some use to it.

Sorting the table by estimated high score minus actual high score we see the biggest "underachievers" are Bradman, Kallis, Chanderpaul, S Waugh, Sutcliffe, Border, M Waugh, Tendulkar, Amarnath, Prior.
The fact that Bradman is on top shows a flaw of the exponential model: there are other batsmen who get out and time constraints that make batting for long enough to score 475 difficult. Some of the other batsmen on the list have excuses too: they are all-rounders or bat at number 5 etc.
It does amuse me that S Waugh is seen as not going on with it from this data, whereas at the time it was M Waugh that was perceived as having this problem (maybe that was before Mark scored his 150?).

At the other end, the biggest overachiever? Wasim Akram.

I guess it would be better to compare the actual distributions of scores to the theoretical one based on average so you're not just using one datapoint to label a player as not-going-on-with-it, but I think this is interesting anyway.
Posted by Martin Leslie
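
For anyone who wants to play with the numbers, the formulas traded in this thread can be sketched in a few lines of Python. This is only a rough sketch under the thread's own assumption (every score is an independent exponential draw with mean avg, over N dismissals); the function names are made up for illustration and are not from the original post.

```python
import math

# Euler-Mascheroni constant (the "gamma" / "G" in the comments above).
EULER_GAMMA = 0.5772156649015329


def median_high_score(avg, n):
    """Median of the highest of n exponential scores with mean avg:
    HS = -avg * ln(1 - 0.5^(1/N))."""
    return -avg * math.log(1 - 0.5 ** (1 / n))


def gumbel_median(avg, n):
    """Large-N approximation of the above: avg * (ln(N) - ln(ln(2)))."""
    return avg * (math.log(n) - math.log(math.log(2)))


def gumbel_mean(avg, n):
    """Mean of the limiting Gumbel distribution: avg * (gamma + ln(N))."""
    return avg * (EULER_GAMMA + math.log(n))


def high_score_p_value(hs, avg, n):
    """Probability that the highest of n scores is at least hs:
    1 - (1 - exp(-HS/avg))^N."""
    return 1 - (1 - math.exp(-hs / avg)) ** n


def expected_count_at_least(threshold, avg, n):
    """Expected number of scores of at least `threshold` in n innings,
    e.g. threshold=100 for centuries: N * exp(-threshold/avg)."""
    return n * math.exp(-threshold / avg)


if __name__ == "__main__":
    # Figures quoted in the thread for Bob Willis: best of 56 runs
    # between dismissals, average 11.5, 73 completed innings.
    print(round(high_score_p_value(56, 11.5, 73), 2))

    # Hypothetical batsman (avg 50, 100 dismissals), not from the dataset.
    print(median_high_score(50, 100), gumbel_median(50, 100), gumbel_mean(50, 100))
```

With the Willis figures this gives about 0.43, the value quoted above. For the hypothetical avg = 50, N = 100 batsman the exact median and the Gumbel median differ by well under half a run, while the Gumbel mean sits noticeably higher, as described in the thread.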