Sunday, September 07, 2008
Simulating one-day cricket and batting orders
I spent some of today reading through this paper (you'll probably need a university subscription to read it) by Swartz et al. It's called 'Optimal batting orders in one-day cricket', and it is useful because it gives a way of simulating one-day innings.
(The paper itself looks at the Indian batting order in the 2003 World Cup. The best lineup they came up with went Dravid, Tendulkar, Ganguly, Sehwag, Mongia, Y Singh, Khan, Kaif, H Singh, Agarkar, Srinath. Their second-best lineup swapped Dravid and Ganguly, and sent Kaif to 7. They reckon it would have done better, by about 6 runs on average, than their actual lineup for the World Cup final. Not many matches are won and lost by less than six runs, but I suppose you want to squeeze out every run you can. It's interesting that the simulations reckoned that Kaif was best left to come in and slog at the death. A full run of their simulations takes a long time — there are a lot of batting lineups to go through, even when you do clever tricks and make the search much smaller. But they say that it would be much quicker if you had only a limited number of options, such as during a match when you've lost a couple of wickets. Using a computer to find the optimal batting order based on the situation of the game meshes well with Rob Smyth's belief that batting orders in one-day cricket should be fluid.)
The way they do it is to work out 'baseline' characteristics for each batsman in the team. That is, they get the probability that a batsman will play a dot ball, score a single, a 2, a 3, a 4, a 6, or get out. But they don't just take their overall career numbers, they take into account the match situation when they batted.
So, given the number of wickets fallen w, the number of balls bowled b, and the Duckworth-Lewis percentage of resources used R(w,b), what they actually did was fit parameters to a loglinear model that looks like this (the subscript k denotes what happened on the ball, so k = 0 is a dot, etc.; the subscript j refers to the jth batsman):
log q_{jwbk} = μ_{jk} + α_k·(w/9) + β_k·(b/299) + θ_k·R(w,b)/100.
The μ_{jk}'s give the baseline probabilities for each type of ball-result (dot, single, etc.) for each batsman at the start of the innings. (Well, not directly probabilities — the probabilities p_{jwbk} are given by p_{jwbk} = q_{jwbk} / Σ_k q_{jwbk}.)
The other parameters (α, β, θ) describe how the probabilities change as the game situation changes. It is assumed that all batsmen change in the same way.
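To make the model concrete, here's a minimal sketch of how the per-ball probabilities come out of it. The function name and parameter layout are my own, not the paper's; each of mu, alpha, beta, theta is a list of seven numbers, one per outcome (dot, 1, 2, 3, 4, 6, out):

```python
import math

def outcome_probs(mu, alpha, beta, theta, w, b, R):
    """Probabilities of each ball outcome (dot, 1, 2, 3, 4, 6, out)
    for one batsman, given wickets fallen w, balls bowled b, and
    Duckworth-Lewis percentage resources used R."""
    # log q_k = mu_k + alpha_k*(w/9) + beta_k*(b/299) + theta_k*R/100
    q = [math.exp(mu[k] + alpha[k] * w / 9 + beta[k] * b / 299
                  + theta[k] * R / 100)
         for k in range(len(mu))]
    total = sum(q)
    # normalise so the seven outcome probabilities sum to 1
    return [qk / total for qk in q]
```

At the start of the innings (w = 0, b = 0, R = 0) the α, β, θ terms vanish and the probabilities depend only on the batsman's baseline μ's.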
So that's all well and good. You throw the parameters into the computer, generate a bunch of random numbers and you end up with 50 overs of simulated cricket. The results are pretty close to real cricket scores, at least at the team level. I tried running the same algorithm and one of the openers scored a double-century on the second run, so it's probably not perfect, but on average it seems to do a good job. (I'm getting a slightly higher result for the team average — 253 against 250 — than the authors of the study did. Some minor bug in my code somewhere, I guess.)
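The simulation loop itself is simple. Here's a stripped-down sketch of mine (not the paper's code): it ignores extras, run-outs and the like, and just draws ball outcomes, rotates the strike on odd runs and at the end of each over, and brings in the next batsman when a wicket falls. The parameter dicts are the same hypothetical layout as above:

```python
import math
import random

RUNS = [0, 1, 2, 3, 4, 6, None]  # None marks a dismissal

def simulate_innings(batsmen, resources, balls=300, wickets=10):
    """Simulate one 50-over innings. `batsmen` is a list of 11 dicts with
    keys 'mu', 'alpha', 'beta', 'theta' (each a list of 7 numbers);
    `resources(w, b)` returns the D/L percentage resources used."""
    striker, non_striker, next_in = 0, 1, 2
    score, w = 0, 0
    for b in range(balls):
        bat = batsmen[striker]
        # unnormalised weights q_k for the striker, given the game situation
        q = [math.exp(bat['mu'][k] + bat['alpha'][k] * w / 9
                      + bat['beta'][k] * b / 299
                      + bat['theta'][k] * resources(w, b) / 100)
             for k in range(7)]
        outcome = random.choices(RUNS, weights=q)[0]
        if outcome is None:                     # wicket falls
            w += 1
            if w == wickets:
                break
            striker, next_in = next_in, next_in + 1
        else:
            score += outcome
            if outcome % 2 == 1:                # odd runs swap the ends
                striker, non_striker = non_striker, striker
        if b % 6 == 5:                          # batsmen cross at over's end
            striker, non_striker = non_striker, striker
    return score, w
```

Run it a few thousand times and average the scores, and you get something that looks like a team's expected total.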
Anyway, I think that this paper could be useful for me. What I want to do is see how to properly assess batting average and strike rate. So what I hope to do is get the relevant parameters for batsmen as a whole in the 2000s. Unfortunately, I don't have ball-by-ball ODI data, so I'm going to have to estimate it somehow. I've asked S Rajesh for the overall numbers (i.e., total dot balls, singles, 2s, etc.), and hopefully with some fiddling I'll get the weird loglinear parameters to match them.
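One place to start the fiddling: at the start-of-innings baseline (w = 0, b = 0, R = 0) the model reduces to q_k = exp(μ_k), so setting μ_k to the log of each outcome's overall frequency reproduces those frequencies exactly at the baseline. It won't match the aggregates once the α, β, θ terms kick in over a real innings, but it's a sensible initial guess to iterate from. A sketch (my own, not from the paper):

```python
import math

def baseline_mu(freqs):
    """Given overall outcome counts or frequencies
    (dot, 1, 2, 3, 4, 6, out), return mu_k = log(f_k / total), so the
    model reproduces them at the start-of-innings baseline
    (w = 0, b = 0, R = 0)."""
    total = sum(freqs)
    return [math.log(f / total) for f in freqs]
```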
The α, β, and θ parameters I'll leave unchanged. Hopefully India's batsmen from 1998 to 2003 (the period that the study looked at) are representative of how batsmen generally change over the course of an innings.
Then, once I've got a good simulator of an average batting lineup against average bowling, I'll be able to vary the parameters of one of the batsmen, tweaking average and strike rate (indirectly — I'll be tweaking probability of dismissal on each ball, and probability of each type of scoring shot). Then you see what effect this has on the average team score. So it'll be like the post below, only accurate. Hopefully.
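To see how average and strike rate relate to those per-ball probabilities, here's a back-of-the-envelope sketch. It assumes constant probabilities across the innings (which the loglinear model explicitly doesn't — so this is just for intuition): average is expected runs per dismissal, strike rate is runs per 100 balls, and scaling the dismissal probability down (renormalising the rest) raises the average while barely moving the strike rate.

```python
def per_ball_stats(p):
    """p = (p_dot, p_1, p_2, p_3, p_4, p_6, p_out).
    Returns (batting average, strike rate) under constant probabilities."""
    runs = [0, 1, 2, 3, 4, 6, 0]
    runs_per_ball = sum(r * pk for r, pk in zip(runs, p))
    return runs_per_ball / p[6], 100 * runs_per_ball

def scale_dismissal(p, factor):
    """Scale the dismissal probability by `factor`, renormalising the
    six scoring outcomes so everything still sums to 1."""
    p_out = p[6] * factor
    scale = (1.0 - p_out) / (1.0 - p[6])
    return [pk * scale for pk in p[:6]] + [p_out]
```

So halving the dismissal probability roughly doubles the average, which is the kind of trade-off I want to quantify properly in terms of team runs.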