Sometimes, however, you may be interested in the median response, or something more general like the cutoff for the top 20 percent of the responses for a given set of predictors. To model such cutoffs, quantile regression excels, and the quantreg package makes it feel a lot like linear regression.
This spotlight includes a simple bivariate normal example, then moves on to a more complex applciation that could be used to track player progression in a video game (e.g. a future digitization of Kudzu).
First, we generate the bivarate normals, where y is the response to x with normal variation, and x is the predictor.
set.seed(12345) x = rnorm(1000) y = x + rnorm(1000)
Since the normal distribution is symmetric, the linear regression and quantile regression about the median should both approximate a slope of 1 and intercept of 0.
lm(y ~ x)$coef (Intercept) x -0.03220106 1.03914825 library(quantreg) rq(y ~ x)$coef (Intercept) x -0.02357192 0.98935700
Furthermore, for other quantile regressions of this dataset, we would expect the slope to remain near 1 and the intercept to approximate the univariate normal quantiles. We expect this because the random variation in y is known to be Norm(0,1).
### 25th and 75th percentiles rq(y ~ x, tau=c(.25,.75)) tau= 0.25 tau= 0.75 (Intercept) -0.729756 0.6756721 x 1.042261 1.0169105 qnorm(c(.25,.75)) -0.6744898 0.6744898 ### 70th, 90th, 95th, 97.5th, 99.5th percentiles rq(y ~ x, tau=c(.7,.9,.95,.975,.995)) tau= 0.700 tau= 0.900 tau= 0.950 tau= 0.975 tau= 0.995 (Intercept) 0.5099463 1.298006 1.545429 1.833121 2.5811008 x 1.0036638 1.043121 1.023059 1.010482 0.8197345 qnorm( c(.7,.9,.95,.975,.995)) 0.5244005 1.2815516 1.6448536 1.9599640 2.5758293
As you might expect. The model behaves unpredicably for extreme quantiles because useful data grows sparse. Finally, we can draw all these quantile regression lines.
mod = rq(y ~ x, tau=c(.25,.5,.7,.9,.95,.975,.995)) coefs = mod$coef plot(y ~ x) abline(h=0) abline(v=0) for(k in 1:ncol(coefs)) { abline(coefs[,k], col="Red", lwd=2) }
For this next example, we are interested in estimating how many players can reach an arbitrary score in a game after playing a certain amount of time. This sort of information is valueable when tuning the difficulty of "reach a score of Y" challenge in a game so that few players can reach the goal right away but most players can after some effort. Consider the best score a player reaches in a game as the response Y and time spent on a challenge as a predictor X.
One problem: scores in games do not always increase linearly or in a simple transformation of linear. To handle this, the quantreg package also includes nlrq(), a non-linear quantile regression function. As before, we generate some player data.
#### Make a database of simulated users for our game set.seed(12345) Nusers = 1000 ## An average of 20 minutes played, capped at 100 minutes minutes = rexp(Nusers, rate=1/20) minutes = pmin(minutes,100) #pmin = parallel min ## performance is a function of skill (gamma), luck (norm) and time score = rnorm(Nusers, mean=100,sd=100) + 10*minutes + 0.25*minutes^2 gamedata = data.frame(score,minutes)
Like the non-linear least squares function nls() in base R, nlrq() needs a formula with parameters on which to find an optimum fit. Starting values for the parameters are also needed. The formula can be user-defined function, such as the cubic polynomial function shown here:
### Polynomial degree 3 function polyd3 = function(x,a,b,c) { output = a + b*x + c*x^2 return(output) }
Now, using a flat line at zero our initial curve, we can build a model.
init = list(a=0,b=0,c=0) mod = nlrq(score ~ polyd3(minutes,a,b,c),data=gamedata,start=init,tau=0.5) coefs = summary(mod)$coef coefs Value Std. Error t value Pr(>|t|) a 103.7766248 7.381757743 14.05852 0 b 9.8370123 0.559544019 17.58041 0 c 0.2525046 0.007178738 35.17396 0 coefs[,1] a b c 103.7766248 9.8370123 0.2525046
If you look at the score formula we used in the simulation, the parameters of 100, 10, and 0.25 were all estimated fairly well. Also, the standard error estimates are reasonable, if not a bit conservative in this instance. The nlrq() function depends on summary() for most of the information, which is why the extra steps involving summary are shown here. We can use the formula function defined earlier to make predictions, and the delta method to get standard error of those predictions (delta method not shown).
### Median score after 15 minutes polyd3(x=15,coefs['a',1],coefs['b',1],coefs['c',1]) 308.1454 ### Median score after 100 minutes polyd3(x=100,coefs['a',1],coefs['b',1],coefs['c',1]) 3612.524
attach(gamedata) plot(score ~ minutes, xlab="Time Played (minutes)", ylab="Best Score", cex=0.6) #cex = Character EXpansion mod.10 = nlrq(score ~ polyd3(minutes,a,b,c),start=init,tau=0.1) coefs = summary(mod.10)$coef curve(polyd3(x,coefs['a',1],coefs['b',1],coefs['c',1]), from=0,to=100,col="Red",add=T,lwd=1) mod.50 = nlrq(score ~ polyd3(minutes,a,b,c),start=init,tau=0.5) coefs = summary(mod.50)$coef curve(polyd3(x,coefs['a',1],coefs['b',1],coefs['c',1]), from=0,to=100,col="Red",add=T,lwd=1) mod.90 = nlrq(score ~ polyd3(minutes,a,b,c),start=init,tau=0.9) coefs = summary(mod.90)$coef curve(polyd3(x,coefs['a',1],coefs['b',1],coefs['c',1]), from=0,to=100,col="Red",add=T,lwd=1)
No comments:
Post a Comment