Saturday, April 05, 2008
Slumps - Is there a problem or is he just unlucky?
I've had a bit of a think about "form slumps", and how we might go about seeing if they're just due to chance (sometimes a batsman will be dismissed for a string of low scores) or due to some genuine problem (a technique flaw, or a weakness found by opposition bowlers). This post becomes more technical than usual later on, so feel free to fall asleep after the first table. This may not be the best or fastest way of going about this problem, but it's the way I did go about it — sort of like stream-of-consciousness statistics. The important thing is that it works, and it doesn't require anything more than Excel.
The first thing we need to know is the distribution of individual innings scores. Actually we don't need that — we just need to know what the standard deviation is, relative to the mean. So, for each batsman with 50 Test innings or more and an average of at least 35, I calculated the coefficient of variation (the standard deviation divided by the mean). This is a measure of the consistency of a batsman. My own opinion is that consistency is over-rated — a batsman who goes 0, 100, 0, 100, etc. is just as useful as a batsman who goes 50, 50, 50, etc. But it is still interesting to see which batsmen are more consistent than others, so here's the top and bottom of the table. A higher coefficient of variation means less consistency.
(Technical note: for not-outs, I just added the batsman's average and considered it a regular innings. Not the best thing to do, but it's close enough, and we'll be doing far worse later on.)
name inns avg sd cv
MH Richardson 65 44.77 35.72 0.80
H Sutcliffe 84 60.73 48.52 0.80
JB Hobbs 102 56.95 48.53 0.85
AN Cook 51 43.47 37.83 0.87
PE Richardson 56 37.47 32.91 0.88
A Ranatunga 155 35.70 31.40 0.88
IR Redpath 120 43.46 38.62 0.89
NC O'Neill 69 45.56 40.72 0.89
KF Barrington 131 58.67 52.50 0.90
JB Stollmeyer 56 42.33 37.98 0.90
---
Ijaz Ahmed 92 37.67 46.07 1.22
DN Sardesai 55 39.24 48.54 1.24
V Sehwag 90 53.76 66.76 1.24
VT Trumper 89 39.05 48.57 1.24
Zaheer Abbas 124 44.80 56.86 1.27
Hanif Mohammad 97 43.99 55.81 1.27
DL Amiss 88 46.31 60.59 1.31
JA Rudolph 63 36.21 47.99 1.33
W Jaffer 54 35.68 48.09 1.35
MS Atapattu 156 39.02 52.81 1.35
There is only a very slight trend (R-squared = 0.022) showing higher averages associated with lower coefficients of variation.
The average coefficient of variation is about 1.05. From now on we'll ignore individual differences between batsmen and just assume that the scores of each batsman can be treated as random variables, coming from a distribution with mean µ (i.e., his average) and standard deviation 1.05µ.
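As a quick sketch of the calculation above (pure Python rather than Excel; the four-innings sequence is made up for illustration, not real data):

```python
def coeff_of_variation(scores):
    """Standard deviation divided by mean (population form)."""
    n = len(scores)
    mean = sum(scores) / n
    sd = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
    return sd / mean

# The 0, 100, 0, 100 batsman from the text: his mean is 50 and his
# coefficient of variation is exactly 1, close to the 1.05 average.
print(coeff_of_variation([0, 100, 0, 100]))  # 1.0
```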
Now, we're interested in slumps. So instead of considering individual innings, we'll be considering groups of innings. We don't know exactly what the distribution of individual innings is, but the distribution of means of groups of innings will be approximately normal, by the central limit theorem. In particular, the mean of a group of n innings will be approximately normally distributed with mean µ (the same as the career average) and standard deviation 1.05µ/sqrt(n).
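A quick simulation illustrates the sqrt(n) scaling. This is a sketch with made-up scores drawn from an exponential distribution (my choice for simplicity; its coefficient of variation is exactly 1 rather than 1.05):

```python
import random
from math import sqrt

random.seed(0)
mu, n, trials = 40.0, 25, 20000

# Mean score over each simulated block of n innings.
block_means = [
    sum(random.expovariate(1 / mu) for _ in range(n)) / n
    for _ in range(trials)
]

grand_mean = sum(block_means) / trials
sd = sqrt(sum((m - grand_mean) ** 2 for m in block_means) / trials)

# CLT prediction: the standard deviation of the block mean
# should be close to mu / sqrt(n) = 40 / 5 = 8.
print(round(grand_mean, 1), round(sd, 1))
```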
Note that while I said that µ was the career average, in order for things to work out properly, we'll actually use the career average apart from the slump in question.
So, to define how bad a slump of n innings is, we'll calculate a z-score. Let µc be the career average (excluding the slump) and µs be the average during the slump. Then define z = (µs - µc) * sqrt(n) / (1.05 * µc).
For example, suppose a batsman was averaging 45, and then had a slump where for 25 innings he averaged 30. The z-score here would be (30 - 45) * sqrt(25) / (1.05 * 45) = -1.59. I'll call this a z = -1.59 slump.
How rare is this sort of slump? We can look up the answer in a cumulative normal distribution table, or use Excel (or something fancier). In Excel, the relevant function is NORMSDIST(z); on computers set to French, it's called LOI.NORMALE.STANDARD.
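If you don't have Excel handy, the same number can be computed from the error function in Python's standard library (the standard normal CDF is exactly 0.5 * (1 + erf(z / sqrt(2)))):

```python
from math import erf, sqrt

def norm_cdf(z):
    # Standard normal CDF: what NORMSDIST / LOI.NORMALE.STANDARD returns.
    return 0.5 * (1 + erf(z / sqrt(2)))

# The example slump: 25 innings averaging 30 against a career average of 45.
z = (30 - 45) * sqrt(25) / (1.05 * 45)
print(round(z, 2), round(norm_cdf(z), 3))  # -1.59 0.056
```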
Anyway, plug in -1.59 and you get 0.056. So, that tells us that the probability that he'd have a slump that bad (or worse) right now is about 5.6%. Does the theory match reality? For each batsman, I considered non-overlapping blocks of 26 innings (26 and not 25 for reasons that were important to me when I was testing this but don't make any difference now). It's important that they don't overlap, because otherwise the probabilities will be dependent. Add up the number of slumps worse than a particular z, divide by the total number of blocks sampled, and you get a probability. Here are the results, for varying z's, with the observed probability and that derived from the normal distribution.
z obs normal
-3.0 0 0.001
-2.9 0 0.002
-2.8 0.000 0.003
-2.7 0.001 0.003
-2.6 0.001 0.005
-2.5 0.002 0.006
-2.4 0.004 0.008
-2.3 0.006 0.011
-2.2 0.008 0.014
-2.1 0.012 0.018
-2.0 0.018 0.023
-1.9 0.023 0.029
-1.8 0.032 0.036
-1.7 0.041 0.045
-1.6 0.053 0.055
-1.5 0.066 0.067
-1.4 0.080 0.081
-1.3 0.095 0.097
-1.2 0.111 0.115
-1.1 0.134 0.136
That's not too bad. The observed probabilities are low in the tail, but we'd be talking about really, really bad slumps out there, so that's not too important. For the z = -1.59 slump in the example above (something reasonably typical), it matches pretty closely. From now on, we'll just use the normal distribution instead of the empirically derived probabilities.
So the theory works well enough. But we can't stop here. The probability of a z = -1.59 slump right now is about 5.6%. But that's not what we're interested in. We'd like to know the probability that at some point during a batsman's career, he'll have a z = -1.59 slump. This is a different thing entirely! The probability of a batsman making a duck is about 6.5%, but you wouldn't call it a slump if he's just made a duck, because ducks are just going to happen sometimes.
So how do we find the probability, given a career of N innings, that there'll be a z = -1.59 slump in there somewhere? To answer this (perhaps not in the best way), let's review some basic probability.
Imagine you roll a die five times. What's the probability that you get at least one 6? You can't say: "The probability of a 6 on any roll is 1/6, so the probability of getting a 6 after five rolls is 5/6." That would be the expected number of 6's. What you need to do is find the probability that you won't get a 6 on a given roll — 5/6 — raise that to the fifth power (the probability that you won't get a 6 in five rolls), then subtract it from 1. The answer is about 0.598.
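The same complement trick, in code:

```python
# Probability of at least one 6 in five rolls:
# the complement of "no 6 on every one of the five rolls".
p_no_six_on_one_roll = 5 / 6
p_at_least_one = 1 - p_no_six_on_one_roll ** 5
print(round(p_at_least_one, 3))  # 0.598
```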
So, suppose a career is 80 innings long. A slump of 20 innings could start at innings 1, innings 2, ..., innings 61. So, do we take our probability of 0.056 from above, subtract it from 1, raise to the power of 61, and subtract from 1? No! Because when you take blocks of innings that overlap with each other, you're looking at dependent events. Two rolls of the die won't affect each other. But the average of innings 2 to 21 will be highly dependent on the average of innings 1 to 20 — after all, 19 of the innings are in both blocks.
So we can't just raise that probability to the 61st power. So what can we do? I don't know what the best way is, but I decided to numerically work out what power you should raise (1 - 0.056) to.
So, summary of the procedure so far:
1. Find z.
2. Get the associated probability p_now = NORMSDIST(z).
3. Raise (1 - p_now) to some power x, to be determined numerically.
4. Find 1 - (1 - p_now)^x. This is the probability of having a z-slump at some point during the career.
But, when I was working through this, I accidentally cut out the last step. Happily, I get much better fits this way (I did try following the above procedure afterwards), but the exponent that you get doesn't have the same nice interpretation as the one above.
So, actual procedure:
1. Find z.
2. Get the associated probability p_now = NORMSDIST(z).
3. Raise (1 - p_now) to some power x, to be determined numerically. This is the probability of having a z-slump at some point during the career.
So, time to work out x. The first thing we'll see is that the ratio of the career length to slump length is important — for a given ratio, it doesn't matter so much what the length of the slump is. So x will be the same for a 40-innings slump in a 120-innings career as for a 20-innings slump in a 60-innings career.
If N is the length of the career and n the length of the slump, then the ratio I actually worked with (just by historical accident) was (N + 1 - n) / n.
Let's plot x against z for 20-innings slumps in 79-innings careers:
It's a nice exponential fit. Repeating the procedure for other block sizes (always with a length ratio of 3), you get the following table (fit parameters here are the A and k in x = A * e^(k*z)):
slump length A k
20 0.0026 -4.97
26 0.0018 -5.05
30 0.0018 -5.08
40 0.0033 -4.90
The coefficients can be sensitive to how much of the tail you let in, but basically they don't change too much. You might be thinking, "Hey! There's almost a factor of 2 difference there!" We'll ignore that and see what happens later.
Now we'll hold the slump length constant (at 26) and vary the ratio. Resulting graph of the fit parameter A:
It's another exponential decay.
Resulting graph for fit parameter k:
It's linear.
You can see that I've gone up to a ratio of 5. Much past that and I start running into lack of data problems, though I could probably have kept going with more thought and patience.
So, now we have all we need. Full procedure for finding the probability that a batsman will have a particular z-slump of length n during his career of length N:
1. Calculate his average over the slump, µs, and his career average excluding the slump, µc.
2. Calculate z = (µs - µc) * sqrt(n) / (1.05 * µc).
3. Find p_now = NORMSDIST(z).
4. Find q = (N + 1 - n) / n.
5. Find A = 0.25 * e^(-1.4*q).
6. Find k = -(0.62*q + 2.96).
7. Find x = A * e^(k*z).
8. Find p = (1 - p_now)^x.
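The eight steps above can be sketched as a single function (Python rather than Excel; math.erf stands in for NORMSDIST):

```python
from math import erf, exp, sqrt

def norm_cdf(z):
    # Standard normal CDF (NORMSDIST equivalent).
    return 0.5 * (1 + erf(z / sqrt(2)))

def slump_probability(mu_s, mu_c, n, N):
    """Probability of a slump this bad somewhere in a career of N innings.

    mu_s: average during the n-innings slump.
    mu_c: career average excluding the slump.
    """
    z = (mu_s - mu_c) * sqrt(n) / (1.05 * mu_c)   # step 2
    p_now = norm_cdf(z)                           # step 3
    q = (N + 1 - n) / n                           # step 4
    A = 0.25 * exp(-1.4 * q)                      # step 5
    k = -(0.62 * q + 2.96)                        # step 6
    x = A * exp(k * z)                            # step 7
    return (1 - p_now) ** x                       # step 8
```

Plugging in the Andrew Strauss figures worked through later in the post (a 29-innings slump averaging 28.10 against a career average of 46.39, in an 83-innings career) gives p of roughly 0.19.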
So does it work? Let's have a look at the observed probabilities and the p calculated as above, for a 20-innings slump in a 79-innings career:
z obs p
-3.0 0 0.000
-2.9 0 0.000
-2.8 0 0.001
-2.7 0.008 0.003
-2.6 0.008 0.008
-2.5 0.039 0.018
-2.4 0.086 0.038
-2.3 0.117 0.071
-2.2 0.148 0.120
-2.1 0.211 0.186
-2.0 0.289 0.265
-1.9 0.336 0.354
-1.8 0.398 0.447
-1.7 0.531 0.538
-1.6 0.602 0.623
-1.5 0.695 0.699
-1.4 0.734 0.764
-1.3 0.813 0.818
-1.2 0.859 0.861
-1.1 0.969 0.896
Wrong way out in the tail, but pretty good for z greater than -2 or so.
What about a 15-innings slump in a 134-innings career? That's outside the regions we used to derive the fit parameters.
z obs p
-2.1 0.390 0.353
-2.0 0.542 0.548
-1.9 0.678 0.708
-1.8 0.780 0.821
-1.7 0.814 0.895
-1.6 0.848 0.940
-1.5 0.864 0.966
-1.4 0.966 0.981
-1.3 0.983 0.990
-1.2 0.983 0.994
Pretty good.
Now, if I were a selector who never actually watched the players and only looked at the scores they made, how would I use this in practice? Well, if a particular batsman was in a slump, I'd want to know how likely it is that that sort of slump would happen in a career as long as his. If it's more than 50%, I'd let him keep going. If it's less than 50%, I'd drop him.
A practical example: Andrew Strauss. Before his much-publicised recent slump, he averaged 46.39. During the slump, up to the second Test against New Zealand, he averaged 28.10. The slump lasted 29 innings; his career up to then had lasted 83 innings.
So:
1. µs = 28.10; µc = 46.39.
2. z = (28.10 - 46.39) * sqrt(29) / (1.05 * 46.39) = -2.02.
3. p_now = NORMSDIST(-2.02) = 0.0217.
4. q = (83 + 1 - 29) / 29 = 1.90.
5. A = 0.25 * e^(-1.4 * 1.90) = 0.0176.
6. k = -(0.62 * 1.90 + 2.96) = -4.14.
7. x = A * e^(k*z) = 0.0176 * e^(-4.14 * -2.02) = 0.0176 * e^(8.36) = 75.1.
8. p = (1 - 0.0217)^75.1 = 0.192.
So slumps as bad as Strauss's should only happen to about one player in five, in a career as long as his. So based on numbers alone, I would have dropped him. Of course, he hit 177 in his next Test.
Now I think that this has been a useful exercise, but I'm not sure how much use it has in practice. You don't pick cricket teams based purely on statistics — you have to watch the players as well. If (say) a batsman is regularly getting out LBW early in his innings, you don't want to let it keep happening until p reaches 0.5 before dropping him. You want to get in early, and either drop him or work on his technique.
Comments:
Wow, David! This seems like a hell of a post! I shall surely read it carefully when I can, since it might take some time to sink in for people who are new to your blog!
(And guys, if you are new, trust me - this is a great blog if you understand how to use it best!)
Thanks Arjwiz.
I'll just add a bit more on why I think that most of the time, you'd go with your gut on selection rather than this algorithm. Take Sachin Tendulkar. He recently went over a year without scoring a century. It was a z = -2.2 slump and p = 0.2. So only one in five players would go through a slump that bad.
But of course you'll keep picking Tendulkar, because you know he's a great batsman and he'll start scoring runs again. And that's what happened.
(Another technical point: the value of p can be quite sensitive to the career average. Tendulkar went from 57 before the slump to 25 in it, and you get p = 0.2. If he'd gone from 50 to 25, it would have been p = 0.8.
There's also a problem that at z = -2.2, we're looking at around the sort of slump where my numbers don't work so well.)
Salutations Professori! I'm going to take a long time digesting that.
David, the spin-off since my first visit to your blog has been that I went out and got a stats textbook to begin all over again.
Reading some online tuts as well.
I'm not going to pretend I have understood it all...but I'm getting the idea.
Thanks.