I want to make a FUSS. That is, a Formula-Unspecified System Solver.
The FUSS would input an n * p matrix of X values, and a 1 * p vector of y values.
The FUSS would output a formula, such as y = log(a + bx1 + c * sqrt(x2)), where the parameters a,b, and c are chosen to fit y the best for this formula by some criterion such as sum of squares*. The formula would be chosen by a simulated annealing method that starts with a basic formula and mutates to find one that balances accuracy with simplicity.
Here's the mechanics as I've worked them out so far:
The R function nlm(f, p, ...) takes in a function f and initial parameters 'p', which get fed into the function. nlm then estimates the values for the variables specified in 'init' that produce the lowest output in 'f'. Usually, the output of 'f' is some value representing distance from some ideal value, so the output of nlm effectively estimates the best value of things for you.
The '...' represents the additional variables that are fed into nlm(f, p, ...) that are not changed by nlm when it's trying to optimize. So what would happen if those extra variables determined the formula to optimize 'p' on.
For example
FUSS_INNER = function(p, x, y, form) ## Assume p=3, x is 2*n { y_hat = rep(0,length(y)) y_hat = rep(0,length(y))
for(k in 1:length(y)) { ## Start with a polynomial of exponents y_hat[k] = p[1] + p[2]*x[1,k]^form[1] + p[3]*x[2,k]^form[2] ## Apply transformations if(form[3] == 1){ y_hat = log(y_hat)} if(form[4] == 1){ y_hat = sqrt(y_hat)} }
## compute lack-of-fit lack_of_fit = sum( (y - y_hat)^2) ## compute a complexity score based on the formula used
complexity = 1
## 1 point for each non trivial exponent (0 or 1)
complexity = complexity + length(which( !(form[1:2] %in% c(0,1)))) ## 1 point for each transformation applied complexity = complexity + length(which( form[3:4] == 1)) return(c(lack_of_fit,complexity)) }For the above example formula of y = log(a + bx1 + c * sqrt(x2)), this would be specified by the form vector of "0,1/2,1,0", and would have a complexity of 3.
With a bigger FUSS_INNER function, more possibilities for mutation are available. Also, there's a lot of housekeeping that I'm ignoring in this function because it's all very high level at this moment. Also, I would use apply() instead of for() to improve speed, but this is for demonstration purposes.
It's 2:30am in Montreal, and that's what the FUSS is about.
* maximum likelihood would be better, but let's start small.
No comments:
Post a Comment