I got the new Firefox (2.something) and WTF it still blows chunks. Pages take much longer to load & format than in IE and they pop all over during formatting in evil ways. I had to get "SetBrowser" and "FireTune" to even get it working right. I wish I could fucking use it because I want the better security and ad blocking and such, but I'm not sure I can handle the craptacularity. Also, you can't drag & drop URLs between IE and Firefox, which makes it impossible to use a hybrid browsing approach. Blah blah I know about initial paint delay.
12/31/2006
12/30/2006
12-30-06 - 2
12-30-06 - 1
12/29/2006
12-29-06 - 1
12/25/2006
12-25-06 - 1
I'm not around any really little kids this year, which sucks.
12/22/2006
12-22-06 - 3
12-22-06 - 2
12-22-06 - 1
Meh, most of my logic is flawed here. I'm a condescending ignorant ass.
12/21/2006
12-21-06 - 3
Say you have some function F(A,B) where A and B are vectors. You wish to solve F(A,B) = 0 (or perhaps minimize an error E(A,B), in which case take the derivative and set it to zero to get the F's). Do some algebra and substitutions until you have two separate functions, each depending only on the other variable :
A = G(B)
B = H(A)
Where G and H may be arbitrary nonlinear functions, so you can't just plug one into the other and solve. If B were the right answer then A = G(B) would give you the exact solution for A, and vice versa.
The hacky solution is basically just to set A from G, then B from H, and repeat. There are some simple hacks to make it better.
Imagine the solutions converging as a time series, so we have A_t and B_t. First of all, it's important to have a pretty good initial guess for A_0 and B_0 (actually just for one or the other) because this method doesn't work if you're very far from the right answer.
The next hacky trick is to take less than the full step for the first few steps. A full step would be :
A_t_1 = G(B_t)
B_t_1 = H(A_t)
( A_t_1 should be read as A_(t+1) ; it's A at the next time step)
Instead take a fractional step of size f (f in (0,1)) :
A_t_1 = (1-f) * A_t + f * G(B_t)
B_t_1 = (1-f) * B_t + f * H(A_t)
This avoids oscillations and helps you settle in. You can start with f = 0.5 and quickly step it up : 0.6, 0.7, etc. For some problems you might want "f" to max out at 0.99 or something rather than 1.0
The next hack which helps a lot is to use A_t_1 instead of A_t when solving for B_t_1 :
A_t_1 = G(B_t)
B_t_1 = H(A_t_1)
This sort of lets you take a more exact step. If you want you can swap the order of A & B on every other step, or do some sort of fancier thing like :
A_int = 0.5 * A_t + 0.5 * G(B_t)
B_t_1 = H(A_int)
A_t_1 = G(B_t_1)
which is sort of like a centered-time integrator or something, though I have no idea if that's actually better. Actually this whole thing sort of reminds me of implicit integrators.
Anyway, in practice this type of iteration converges incredibly fast. In problems where I've used Newton's method or Conjugate Gradient or something and it's taken like 500 iterations to converge, this iteration took 20 steps. Obviously it only works on certain types of problems and they have to have well behaved local minima, but when it works it works great.
To be a bit more concrete I'll give you a specific example. I've been examining the "weighted SVD", and one step of that is the approximation of a matrix with a rank-1 tensor (that's just an outer product of two vectors). The problem is :
Given matrix A_ij
and weights w_ij
Find vectors U_i and V_j such that the error
E = Sum_ij { w_ij * (A_ij - U_i*V_j)^2 }
is minimized
If you take derivatives of E with respect to U and V you can solve for two equations :
U_i = Sum_j { A_ij * w_ij * V_j } / Sum_j { w_ij * (V_j^2) } = H(V)
V_j = Sum_i { A_ij * w_ij * U_i } / Sum_i { w_ij * (U_i^2) } = G(U)
Directly solving these is impossible (I think), but they're exactly in the form I need for my hacky iteration, and in fact it works super well.
I guess the reason no one likes this approach is that on general problems it can very badly fail to converge.
Addendum : this approach is also great for equations in one variable by substitution. For example say you have some evil equation to solve like :
3 + 7*x + sin(x) + sin(x)^2 = 0
Just set y = sin(x) and pretend you have equations in two variables. Now :
x = (-3 - y - y*y)/7
y = sin(x)
Are the two equations, and you just use the alternate stepping approach. The half-step iteration I described here converges to within FLT_EPSILON in 5 steps !!
12-21-06 - 2
12-21-06 - 1
I hear a lot of women say that they admire or respect Condi Rice for her achievements and for the fact that she's a woman with a top role in the government. That sentiment disgusts me. I find her of the same cloth as Colin Powell, who is also despicable. Someone who is clearly intelligent and rational is even more accountable for their actions, and by making themselves totally subservient to a madman they have given up all their integrity to further their careers. They know that they're lying, they know they're trying to cover up gross misdeeds and incompetence, they know their actions are leading to the death of innocents; sometimes you can even see it on their faces in an interview where they take a tough question and they almost wince before they shake it off and spout the awkward lies again. Someone like Rumsfeld I almost can excuse because he's clearly just loony tunes or drunk on his own koolaid.
12/19/2006
12-19-06 - 1
First of all, generally the output should be "linear" in A's and B's. That is, sqrt(A*B) is good but A*B is not, because it gives you 2 powers of A or B. Think like a physicist and assume these things have units; the output should have the same units.
Secondly, we must be aware of scales. If A & B are scaled in exactly the same way so their numeric values have the same significance, then (A+B) is a good combiner. If they are not scaled the same, then any forms which add A & B are not okay. In that case you only really have one option :
sqrt(A*B) * G(A/B)
Often A & B have the same significance, which means the output should be symmetric under the swap A <-> B. In that case the G function has limited forms. I haven't thought about exactly what they can be, but it has to be things like (A/B) + (B/A). In fact if A & B aren't on a similar scale even that form is not okay.
If we assume A & B are on the same scale, then additive forms are okay and it opens up some other options. (A+B)/2 is obviously your first guess.
2AB / (A+B) is an interesting combiner. If A & B are in [0,1] then this takes (0,x) -> 0 , (.5,.5) -> .5 and (1,1) -> 1, and sort of penalizes when they're not equal. It takes (x,x) -> x , which is a nice property for any combiner that's trying to stand in for (A+B)/2.
sqrt(A^2 + B^2) is another, and then you can take simple multiples of these, which gives you forms like (A^2 + B^2)/(2AB).
Anyway the point is that there really aren't very many forms to choose from which satisfy the basic properties and you can easily try them all.
12/16/2006
12-16-06 - 3
Say you have two experts. They make predictions which have a Gaussian error. The expert provides both the prediction (the center of the Gaussian) and his best estimate of the accuracy of the prediction (the sdev of the Gaussian, which is the sqrt of the variance), call this P & S, so you have {P1,S1} and {P2,S2}.
We're going to make a combined prediction by guessing a weight for each expert (here W1 and W2). The combined prediction is then :
Pred = (W1 * P1 + W2 * P2) / (W1 + W2)
(or) Pred = P1 + (W2 / (W1+W2) ) * (P2 - P1)
Pred = P1 + F * (P2  P1)
F = 1 / (1 + W1/W2)
This latter form is more useful in practice because it allows you to apply arbitrary scaling to each term so you can run them through an lsqr or nnet or whatever. (assuming W1 and W2 never go to zero; if they do then just choose the other pred)
The ideal combination of experts in general is a contextdependent problem, but a general good way to weight them is via the entropy of the prediction. If you knew the instantaneous entropy of each prediction, H, the ideal weight would be :
W = e^( - Beta * H )
for some Beta which in general depends on the problem. In practice, Beta = 2 is usually very good.
The entropy of a Gaussian can be analytically integrated. The answer is :
H(Gauss) = (1/2) * ln(2pi) + ln(S)
Where S = the sdev (sigma), recall. Since we assumed our predictors were Gaussian we can use this. But we can simplify :
W = e^( - Beta * H ) = C * e^( - Beta * ln(S) ) = C * ( S^-Beta )
Where we put a bunch of constants in C, and they don't matter because it's an overall scaling of W which divides out.
F = 1 / (1 + W1/W2)
F = 1 / (1 + S1^-Beta / S2^-Beta )
F = 1 / ( 1 + (S2/S1)^Beta )
If we use V = S^2 (V is the "variance" of the Gaussian, or the MSE), and choose the common case of Beta=2, then we have a very simple expression :
F = 1 / ( 1 + V2/V1 )
or F = V1 / (V1 + V2)
and this is the same as W1 = V2 and W2 = V1
Which is in fact a very good way to combine Gaussian predictions.
12-16-06 - 2
12-16-06 - 1
12/15/2006
12-15-06 - 2
Related to my perpetual improved ranting method search, the new Google Blogger (now in beta) will have an SDK for code access, so I imagine it will be easy to write an app to throw raw text at it via Blogger GData.
12-15-06 - 1
12/13/2006
12-13-06 - 1
1. The FANN Neural Net library is actually pretty decent. It's missing a lot of functions you would want, but it's reasonably easy to write them yourself.
2. SVDLIBC is a nice little implementation of the old "Lanczos" SVD algorithm which is fast and good. It comes from the old FORTRAN math goodness. One amazing little routine I found in there is "Machar" which will deduce the properties of your FPU by doing a few simple math operations!
3. If you search google for things like "SVD learning unknown missing" you will find about 100 papers; this is a major field of research, a lot of it driven by computer vision.
4. Simon's spilled the beans on his Netflix approach. His description of the "GHA" (Generalized Hebbian Algorithm) is way way clearer than the paper. The Hebbian is actually a type of Neural Network, and I find this approach much easier to understand if you just think of it in terms of doing a gradient descent to optimize the SVD vectors. I also found that Gorrell has released source code for her own GHA Lab.
12/12/2006
12-12-06 - 1
12/10/2006
12-10-06 - 1
12/09/2006
12-09-06 - 1
12/06/2006
12-06-06 - 1
12/03/2006
12-03-06 - 1
A lot of people in Sports Betting use the motivation of the two teams as a basis for choosing the side to bet. The theory is that the team that needs a win is more likely to play to full effort and beat the spread. I've always been skeptical of these theories and thought I'd try to statistically examine how important this theory actually is.
I examined all the NFL games from 1999 to 2006. For each game I made a rough evaluation of which of the teams "needed a win". Only games in weeks 8-14 were examined, which is the heart of the playoff run. Teams that "need a win" were any teams with a record >= 50% wins and with a record worse than 8-2. Generally these are the teams on the bubble for their division or a wildcard spot. Note that this isn't a totally precise measure of "need a win" but it should be good enough to see some statistical correlation if any exists.
Here are the results. There are four categories :
cat 0 = both teams don't need a win
cat 1 = road team needs a win
cat 2 = home team needs a win
cat 3 = both teams need a win

The results are (home wins)-(home losses)-(pushes) ; ATS = Against The Spread

32 teams, 1992 games
cat 0 WLP : 827-578-0
cat 0 ATS : 678-683-44
cat 1 WLP : 84-125-0
cat 1 ATS : 103-100-6
cat 2 WLP : 138-65-0
cat 2 ATS : 94-102-7
cat 3 WLP : 105-69-1
cat 3 ATS : 91-76-8
In particular look at categories 1 and 2. The team that needs a win is 263-149 vs. teams that don't need a win. However, they are only 194-205 ATS !!
Now, is that WL record even significant? The fact is, most of the time you have a "need a win" vs. "don't need a win" matchup it's a pretty good team vs. a very bad team, so it's no surprise that they win a lot more.
For comparison I did the same study, but just grouped based on teams with a record >= 50% against teams with a record < 50% , eg. winning teams vs. losing teams.
32 teams, 1992 games
cat 0 WLP : 219-136-0
cat 0 ATS : 177-169-9
cat 1 WLP : 186-268-0
cat 1 ATS : 224-213-17
cat 2 WLP : 328-111-0
cat 2 ATS : 216-211-12
cat 3 WLP : 421-322-1
cat 3 ATS : 349-368-27
Winning teams are 596-297 against losing teams, which is actually a better ratio than the "need a win" vs. "don't need a win" matchups.
It's possible that a better measure of "need a win" would reveal a slight statistical significance, but the absence of any here indicates it is not strong at all.
12/02/2006
12-02-06 - 1
Also Radiohead Giglink Dump is full of great live shows. Check out Thom Yorke's solo at the Bridge School.
12/01/2006
12-01-06 - 2
I couldn't find this info easily anywhere on the web, so here you go. Lossless color transforms :
Color transform notes :

JPEG-LS (LOCO-I) uses :

  {RGB} -> { (R-G), G, (B-G) }

and the inverse is obvious. BTW this is very nice in that you can go 8-bit -> 8-bit using modulo arithmetic; many of the others require 9-bit YUV coefficients.

J2K (aka RCT aka CREW) uses :

  Y = (R + 2G + B)/4
  U = R-G
  V = B-G

  G = Y - (U+V)/4
  R = U + G
  B = V + G

(the divides are floored)

YCoCg is : (similar to "RCT" but decorrelates better)

lossy :

  Y  = (R + 2G + B)/4
  Co = (R-B)/2
  Cg = (2G - R - B)/4

  G = Y + Cg
  R = Y - Cg + Co
  B = Y - Cg - Co

lossless : (Co and Cg are effectively scaled up by 2)

  Co = R-B
  t  = B + (Co/2)
  Cg = G - t
  Y  = t + (Cg/2)

  s = Y - (Cg/2)
  G = Cg + s
  B = s - (Co/2)
  R = B + Co
In general with the lossless YUV type transforms, the chromatic coefficients are roughly 2X bigger than they should be in a perceptual space in order to have the bits to get back to RGB exactly.
Note that YUV was created for perceptual uniformity, NOT for decorrelation, so it's no surprise that you can beat it for decorrelation. In fact, doing the 3x3 KLT and sending the coefficients is totally worth it if you're doing compression. (actually YUV was created for some weird use in televisions)
12-01-06 - 1
NOTE ON COLOR CONVERSIONS :

int colors are in [0,255]
Int pel "x" is considered to represent the span {x-0.5, x+0.5}/255
float colors are in [0,1]
int 0 -> float 0
int 255 -> float 1

Note that this way is nice in that it keeps the maximums the same, but it's bad in that it wastes color space. That is, there's only 1/255 precision instead of 1/256 which is totally possible. Another way of seeing it is that the int range [0,255] is actually being mapped to the float range { -0.5/255 , 255.5/255 } , that is, {0,1} is not the full range. One nice thing about this though is that it gives you some slop in your float->int. As long as the float is in { -0.5/255 , 255.5/255 } you don't need to clamp.

/*
// more pure quantizer way :
// int value [i] represents the bucket (i,i+1)/256
inline int   ColorFTOI(float f) { return ftoi(f * 256.f); }
inline float ColorITOF(int i)   { return (i + 0.5f) * (1.f/256.f); }
*/

// way that maps 0 <-> 0 and 255 <-> 1.0
inline int   ColorFTOI(float f) { return ftoi(f * 255.f + 0.5f); }
inline float ColorITOF(int i)   { return i * (1.f/255.f); }
11/30/2006
11-30-06 - 2
New Orleans San Francisco -7 ; bet NO
Washington Atlanta -1.5 ; bet WA
Cleveland Kansas City -5.5 ; bet CL
Denver Seattle -3 ; bet Den
Philadelphia Carolina -3 ; bet Phi
The last two games are messed up by weird injuries/substitutions so I'm laying off them.
11-30-06 - 1
In sports betting, you can generally go big when you find an edge. (that's not completely true because books will identify you as a sharp and start giving you bad lines or setting limits, but I'll ignore that for now). The problem with sports is that you get very few gambling events and they're very high variance. If you're a good capper you win 55% or so of your bets. That's good EV, but in the NFL you only get maybe 30 bets a year. The variance on that is huge. In contrast, in poker you will often be getting 55% shots, if you play a lot you might get 10,000 a year, which means your profit is much more regular. Even if you go to something like the NBA where you can get your money on more games, it's still only a few 55%'s a day, whereas in poker you'll get maybe 100 a day.
The third option is the financial markets. There's tons and tons of opportunities to trade so you can smooth out your variance. The return generally doesn't decrease with investment (for my money scales; it does if you're getting into huge monies). So, that's the best of both worlds. The only problem with finance is A) high transaction costs, and B) the markets are much more efficient so the edges are smaller.
11/28/2006
11-28-06 - 2
I'm going to write down my guidelines. My computer system guesses the spread that it thinks is the "true spread" (TS). I want to decide whether to bet based on the positions of the TS and the VL (Vegas Line).
1. If the TS and VL are on opposite sides of zero, I like betting the Vegas dog to win. I especially like it if it means betting an underdog on the moneyline where I get +150 or some nice odds like that.
2. If the TS is close to zero and the VL is big, eg. if the TS is like 1 and VL is like 4, I again might bet the dog to win if my judgement of the intangibles is that they have a good shot. If not I drop down to the next rule :
3. If the TS/VL are on opposite sides of the 3 or 7 with at least 2 points of difference, bet the spread. eg. if the VL is a 9 and the TS is a 6, bet the spread. Teams will often win by exactly 3 or 7, so when you can bet across the 3 or 7 you're getting good value from the spread. On the other hand if the VL is a 6.5 and the TS is a 4, probably don't bet unless the intangibles really like the bet because you're not crossing a 3 or 7.
4. To make the intangibles a bit more concrete : A) Prefer betting home dogs. B) Prefer betting teams that "need a win" vs. teams that are either coasting to the playoffs or completely out of it. C) Beware of teams whose past performance is not a good indicator of their true quality, eg. if you judge they've gotten lucky in a lot of games. D) The computer isn't aware of news or injuries so mainly just avoid games where those are big factors.
Update : this week there seems to be a huge edge betting the Ravens to win vs. Cincinnati.
11-28-06 - 1
I had a lot of ideas for games, but didn't really like any of them. A few were sort of decent and obvious, like :
1. Make a game where your play creates a song. Basically different user actions or environmental objects create certain sounds or turn on different tracks. By playing the game correctly you make a cool song. The true "synesthesia" moment of brilliance would come if you could play either by just doing the visual gameplay things, OR purely by closing your eyes and trying to create a cool song, and if you create a cool song that also beats the shooter/platformer/whatever.
2. Make a game with sound input where doing the gameplay causes you to make sounds that are cool and/or feel like the play. The way I was thinking was similar to #1, for example you could make a "beat boxing platformer" where different beats make you jump or duck or whatever, then the level sends things at you and you wind up making a sound track by playing. Casey had a really cool idea for a variant on #2 but I won't out it here, hopefully he'll actually finish the game and it will be cool to play.
Anyway, maybe I shoulda done something like those, but they just felt like the same old game. Most video games feel like a series of rote/mindless tasks that are presented to the user in sequence. I'm sick of making those and playing those, and this was really just another of those in different garb.
So, I tried to play with some toys that would feel like different experiences, and mostly failed. On the plus side, now I know my arm is okay to type again, though I was irritating it quite a bit.
11/21/2006
11-21-06 - 1
In this theory, you could even make big money from those stocks that you get in spam emails. Assuming you're one of the first buyers, even if it's a junk stock people will send it up and you can ride the bubble. Certainly you can make money by buying a penny stock and then talking it up all over the internet, which is essentially what major banks have been doing for years.
11/20/2006
11-20-06 - 1
Chicago +3 against NE
NO +3 against Atlanta
bet both to win at about +150. Personally I hate betting Chicago against NE but this is what the computer likes.
11/19/2006
11/18/2006
11-18-06 - 5
11-18-06 - 4
The fact is, Democrat or Republican, the people who are elected are generally totally unqualified to run the government. They usually have little experience (especially in this age, when any record is a bad record). Even with experience, government is a huge complex problem and you need long-term specialists: the bureaucrats who run the agencies and really make government work. Those people have been devalued and cast aside, ignored, and have resigned in droves. This has already been detrimental to the functioning of government, but if it becomes a pattern it could be even worse.
11-18-06 - 3
Cleveland +4 vs. Pitt
San Francisco +4 vs. Seattle
This is using a system where I only bet when the computer-predicted spread is across the 7, 3, or 0 from the betting line.
11-18-06 - 2
Imagine your value(time) function was a sine function. The correct average rate of return is 0. However, depending on where you put your start and end, it could be largely positive or negative. In particular, with SF housing, if you looked at 1950 to 1998 you would see maybe a 2% annual gain. If you looked at 1950 to 2005 you would see a 5% annual gain. The problem is that the time span we're looking at is not very large compared to the magnitude of the noise. Taking the ending price minus the starting price is super sensitive to the short term noise.
A much better way is to best-fit a line through the prices and take the slope of the line. This is way less sensitive to the edge effects. There are probably even better statistical ways, but it seems nobody does this even though it's totally obvious and everyone should do this.
11-18-06 - 1
Watched "Miller's Crossing" last night. Good but rather pointless. Has that Coen brothers unrealistic staged feel; their dialogue is really awful in that everyone is a caricature and no one ever says things like that, but their dialogue is really awesome in that the lines are clever and perfect for their surreal noir world. Anyway, that's not my point. The Gabriel Byrne character made me think about the way smart bad ass guys always are in movies: they think of a plan and then go with it. What they don't do is think of several plans, weigh the advantages and disadvantages of each, compare those to inaction, think of how it affects other people, get some advice from others, and then make a decision. They just come up with something, make a decision, and then figure out how to make it work. I know it's movies and it's ridiculous, but I think it's a major problem in the way I run my life. The truth is you can make any decent plan work, so just pick one and do it.
11/17/2006
11-17-06 - 1
The housing cheerleaders say that spending is strong, job creation is strong, the US economy is flexible, and also that this isn't a housing bubble, it's a valid correction based on high demand for limited space. First of all, the valid pricing idea is nonsense. We haven't suddenly run out of land. Housing prices haven't changed significantly in the last 50 years and suddenly there's a huge shortage? Every time in the past that housing has shot up, it's crashed back down to almost zero change inflation-adjusted.
The reality is that consumer spending has been powered by housing (the saving rate has been negative, indicating people are taking loans against their houses). 33% of job creation since 2001 has been in housing or related fields, and perhaps even more due to the trickle-out effects. Even despite that, real spending power has decreased and we haven't actually cut unemployment. More and more people don't even show up in the unemployment numbers because they're not seeking employment. Furthermore, our government has been desperately doing everything it can to stimulate the economy: interest rates have been rock bottom and we severely cut taxes, both of which should have been huge short-term boosts, and they have only just kept us at a sustaining growth rate (our GNP has to grow several % a year just to match population growth). The government's been doing massive debt spending, which again should stimulate short-term growth and generally leads to contractions later on.
We're now in a position where we can't do much more for stimulation. If housing does start to lead us into recession, our vault of reserve measures is depleted; we can't keep lowering interest rates, spending on debt, and cutting taxes.
Anyway, as I've written before I'm still not sure how to play this. I do have a lot of money overseas already but maybe I should move even more.
11/16/2006
11-16-06 - 2
We also got some Muscat grapes at the farmer's market. They're quite amazing, very sweet, they taste just like the wine! Everyone was going crazy over all the persimmons, but I just don't understand it, I find them quite awful. I had a persimmon tree all those years at Alrita and didn't even bother to eat them off the tree.
My arm is getting much better, but I seem to have caught another cold !?!? WTF. Dan seems to get them at work and bring them home :( So I'm in awful shape again. Sigh.
In other news, I got a wicked Trojan by clicking around porn sites. WTF. It makes me angry how unbelievably broken IE and Windows are. It shouldn't be so easy to run malicious apps. Anyway, System Restore seems to have gotten rid of it pretty easily. All of my many AntiVirus apps didn't detect it. It was a DLL in windows\system that inserted itself in winlogon. If you edited the registry to take it out of winlogon it would put itself back in.
I put some money on some football games partly out of boredom. I couldn't resist betting the Colts at -1 even though that's the "public" (sucker) bet and all the sharps are on the Cowboys. I believe Romo is probably overrated, because we just haven't seen him tested yet and everyone thinks he's amazing, and Parcells has really made awful decisions of late. If the Cowboys just run the ball and keep it low scoring, they will win. If it's a shootout, the Colts win.
11-16-06 - 1
"Crooklyn" is a god awful movie. If you just keep playing clips of songs from the 70's will it make it fun to watch? Yeah, it sort of does, but the complete lack of nonsuperficial story and characters is too bad for all the cool style and "authenticity" (what?) to cover. Anyway, the first 2 custom made "Crooklyn" hip hop tracks are hot yo . Somebody find me the mp3's.
11/13/2006
11-13-06 - 2
11-13-06 - 1
11/12/2006
11-12-06 - 1
Similarly the Beit Hanun/Hanoon Accident/Massacre is a huge big deal in Israel and the Arab world but has hardly received any attention here.
11/10/2006
11-10-06 - 1
11/09/2006
11-09-06 - 4
For lossy coding, the literature is quite strong and J2K would be fine, but I guess it has patent problems cuz the design committee was full of cockmunchers where each company forced in bits that they had patented. For lossless image coding there really isn't even super strong literature. The "MRP" method seems okay, but it's incredibly obvious how to make a big gain in practice : use Volf switching or something like that to blend in a dictionary/text coder which will work well on binary/artificial images.
11-09-06 - 3
11-09-06 - 2
11-09-06 - 1
On a related note, I'm sickened by the stupid fucking American electorate. They vote D because the war is going badly. WTF, you morans. If the war happened to go well and the economy happened to swing up they would've voted R. You go all in with 23o preflop and happen to win against AA; most Americans look at that result, decide you made a good play, and would let you wager their money.
11/08/2006
11-08-06 - 3
11-08-06 - 2
Do a float wavelet transform with the lowest-entropy wavelet (D9/7, CDF 2/2, Haar, whatever). The image is now an LL band and a series of detail bands, HL1, HL2, ..., HH1, ...
Fit the linear relationship between bands : HLn ~= C * HL(n-1). Fit the best C over all n. Separate constants for HH and LH.
Use this linear relationship to add new LH, HL, HH bands with the scaled data. Perhaps also apply an additional user-specified scale. Do the inverse transform. The resulting image is 2X bigger in each dimension.
This will do things like zoom hard diagonal lines perfectly due to the wavelet zoom scale invariance property.
ps. this doesn't actually work, because band (n) to (n-1) correlation is very strong for coefficient magnitude, but not for signs, which are hard to predict.
11-08-06 - 1
11/07/2006
11-07-06 - 1
11/04/2006
11-05-06 - 1
Searching for an arbitrary f() is NP-hard. In fact, you can see that finding an optimal f() is equivalent to the Kolmogorov compressor. If we use an "MDL" error we think of f() as a coder; to reproduce the output y_i we have to transmit the specification of f() and also the residuals (y_i - f(x_i)).
For all the approaches we will be approximating f() as a combination of simple functions. You can of course get arbitrarily close to any f() by using a linear sum of enough functions. One thing I've come to realize is that all the ML techniques are very very similar. Almost all of them have a form like this :
f(X) = Sum[j to M] w_j * h( A_j * (X - L_j) )
Where h() is some function, "w" are weights, "A" are parameters, and "L" is a location in the same space as X. There are M terms. Possibly there are other transforms or linear combos or whatever, those issues are sort of minor (but messy).
Anyway, all the schemes come down to how you pick these terms and optimize them. Almost always you pick your basis functions h() in advance. This could be your Kernel, your fit function, whatever. Common choices are the sigmoid or radial gaussian. Obviously you could just put a delta function at each sample, but that would give you a large model that would not generalize well. So you sort of want the smallest/smoothest model that fits the training set. This is a very hacky problem with ML and you really just have to tweak and test by cross-training.
Neural Nets are a silly way of talking about basic linear algebra. You choose M and h() in advance and then optimize the parameters by greedy gradient descent. Obviously you try to choose the smallest M possible that gets a decent fit on the data, but you can only do this by trial & error with cross-training. Also the gradient descent can get stuck in local minima, and you get no guarantee of maximum margin or other ideas that the model will generalize well. The nice thing about neural nets is they're pretty simple and straightforward. Unfortunately NN's have been conflated with all this brain modeling and circuit diagram nonsense which makes it much harder to just see the linear algebra.
SVM's are actually extremely similar, but instead of greedy optimization, they set up a nice constrained optimization which solves for maximum margin. The L's are not free parameters but are instead points chosen from the training set (called the support vectors). Instead of controlling M directly, you set a parameter which controls the model size vs. closeness of fit tradeoff, and the SVM solves for the best M. The h() can only be functions that work as "kernels" but that's not too bad, it's a large set.
There are some nasty problems with SVM's. The parameter that controls model size has to be tweaked by cross-training, which is hacky and time consuming. For clean separable data SVM's do an awesome job of picking a minimal model (eg. for linearly separable data with a plain dot kernel they pick a single perfect plane), however for noisy or ill-fit data, M can get arbitrarily large (up to N, the training set size), and the time to train and test scales like M*N, which is very bad if M ~ N. Finally, h() must be chosen in advance, and many factors you might want to tweak are actually implicitly hidden inside the choice of h(), such as the relative scaling of parameters. Most simple Neural Nets are equivalent to SVM's, and the SVM will find better solutions if you can make it behave the way you want.
Another method is local regression. Here the L are chosen to be some regions, and then some sort of standard regression is done on each region, such as linear regression, or splines, or whatever. The regression is easy, so the whole problem is the choice of regions. The standard ways are K-nearest-neighbors, K-means, and Decision Trees. These are all pretty solid yet all very hacky in their own way. The big problem is that, just like kernels, they implicitly rely on a distance measure between points. It's crucial for the similarity of samples to be proportional to distance. With lots of different complicated parameters this is ill defined; you can do things like SVD or PCA on the parameters and then take distances in that space, but that may or may not be good depending on the data. In fact, if you knew the ideal distance metric you would have the solution to the whole problem. DT's cut up space to define the regions. With Decision Trees most people just build axis-aligned trees and try all possible splits. This is slow and also ignores the fact that you could get much better splits by trying all possible off-axis splits. Ideally you would use an SVM to pick the optimal split at each level.
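A minimal sketch of the standard axis-aligned split search (toy data invented for illustration; a real DT builder would recurse on the two sides):

```python
def best_axis_split(points, values):
    # try every (axis, threshold) pair, where thresholds are the
    # coordinates of the training points, and keep the split that
    # minimizes the summed squared error of the two sides
    def sse(vs):
        if not vs:
            return 0.0
        m = sum(vs) / len(vs)
        return sum((v - m) ** 2 for v in vs)
    best = None
    for axis in range(len(points[0])):
        for p in points:
            t = p[axis]
            left = [v for q, v in zip(points, values) if q[axis] < t]
            right = [v for q, v in zip(points, values) if q[axis] >= t]
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, axis, t)
    return best

# toy data that is perfectly split by the second axis:
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
vals = [1.0, 1.0, 5.0, 5.0]
err, axis, thresh = best_axis_split(pts, vals)
```

This is already O(dims * N^2) per node done naively, which is why trying general off-axis planes as well is rarely attempted.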
K-nearest-neighbors is pretty simple and solid; of all the methods it seems the least "tweaky" and requires the least a priori knowledge of the data, but in practice it is very sensitive to the choice of K and the choice of how you define "nearest", and you should probably use distance weighting, which it will also be sensitive to. As usual these things can only be tested brute force by trial with cross-training.
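A sketch of distance-weighted kNN regression, assuming Euclidean distance and inverse-square-distance weights (both of which are exactly the kinds of choices that have to be tuned by cross-training):

```python
def knn_predict(query, data, k=3, eps=1e-9):
    # data: list of (point, value) pairs. Find the k nearest points
    # and return a distance-weighted average of their values.
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    nearest = sorted(data, key=lambda pv: dist2(pv[0], query))[:k]
    weights = [1.0 / (dist2(p, query) + eps) for p, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)

# toy data: the distant outlier at 10.0 shouldn't pollute a nearby query
data = [([0.0], 1.0), ([1.0], 2.0), ([2.0], 3.0), ([10.0], 100.0)]
pred = knn_predict([0.1], data, k=2)
```

Swap the metric, K, or the weighting function and the answer changes, which is the whole "tweaky" problem in miniature.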
The choice of distance metric or pre-transformation of the parameters is important, as is possible elimination of useless parameters (usually after SVD). Unfortunately this is basically a hacky trial & error thing and can have huge effects on the model's performance. Also with Neural Nets, the exact pre-ops and post-ops you perform can make the problem either much easier or much harder to model, which strongly affects performance. Basically if you know a lot about your data (like that it responds basically linearly to parameters, or it's generated by independent sources on the parameters, etc.) you can fit it well, but if you don't you will do badly, and there aren't good ways to "learn to learn".
Fully Bayesian approaches are much more sound, but are computationally infeasible for anything but trivial models because they basically sum over all possible models & parameters weighted by the likelihood. Rather than guessing the params or optimizing them, you give them probabilities and sum over all possibilities. Check out the NIPS tutorial here. Basically you correctly consider the model+parameters to be unknown, and weight each candidate by the probability that it is the true model.
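The idea in miniature (toy models and made-up likelihoods, just to show the shape of the computation; the infeasibility comes from this sum running over astronomically many candidates in any real setting):

```python
def bayes_predict(x, models, priors, likelihoods):
    # fully Bayesian prediction: instead of picking one model, sum the
    # predictions of *all* candidate models weighted by their posterior
    # probability (prior * data likelihood, normalized)
    weights = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(weights)
    return sum((w / total) * m(x) for w, m in zip(weights, models))

# toy: two candidate models with equal priors; the data likes the second
models = [lambda x: 2 * x, lambda x: 3 * x]
priors = [0.5, 0.5]
likelihoods = [0.2, 0.8]   # hypothetical data likelihoods
pred = bayes_predict(1.0, models, priors, likelihoods)
```

With two models this is trivial; with a continuum of parameter settings it becomes an intractable integral, which is the whole problem.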
In general the more modern techniques like SVM don't perform significantly better on complex data than something like a good local regression. I should add that the truly modern techniques seem to be blend models, which are provably good as the number of blends gets large: things like AdaBoost with ensembles of decision trees, etc. It appears to me that these are computationally infeasible except on trivial problems.
Way too much of the literature compares their method to some garbage old method, like plain old kNN with no distance weighting or outlier discarding or local distance metric. The other big sin is to do comparisons in some really complex domain like image matching where a huge amount of performance comes from how you preprocess the data and take advantage of the specific nature of the problem.
11/03/2006
110406  1
Unfortunately even robust ML approaches like SVM are full of unsolved and unsolvable problems which you have to address through "tricks" and guesses and brute force optimization.
Anyway I've decided on my overall approach and believe I'm on the right track. Now I just need my arm to heal so I can type again. I'm not even trying to code lefty; it's too frustrating and requires too much chording.
One nice online resource is http://www.learningwithkernels.org/ ; it's a really good book and they put the best chapters online. Chris Burges and Radford Neal also have great tutorials online. I have yet to find really good Neural Net material, there's a ton of garbage on Nets and it's hard to sift through. The Wikipedia pages on all this stuff are good jumping off points, they have good intros then links out to more info.
11/02/2006
110306  3
110306  2
110306  1
10/28/2006
10/25/2006
102506  1
"supervised learning", via nnets or whatever, perhaps CART decision trees. find f() to minimize |y - f(X)|.
categorizing/clustering. related to the DT problem. how to find planes or rules to split a huge set of N-d points into related groups that minimize some overall cost function.
svm
10/24/2006
102406  1
10/20/2006
102006  2
102006  1
I now seem to be getting around 0.920, and there are still a million little areas for improvement. Every time I check off one "todo" I add five more as more questions are opened up (the vast majority of which are dead ends but need to be explored).
One thing's been troubling me lately. I've tried various schemes and thrown out the ones that weren't as good. Perhaps that's wrong though? They aren't as good over all the data, but maybe they're better on portions of the data, and if I could choose that portion of the data, they would help overall. This was part of the surprise of Volf's compression work: adding in predictors that are much worse overall can still be a huge win if they are selected on the portions they do well on. With CF it's intuitively obvious that you have these similar-item based predictors and similar-user based predictors. In general similar-item does better. However, on certain queries you might be able to guess that they will be more user-correlated than item-correlated, so you can use your similar-user predictor on that case and win.
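The idea can be sketched like this (everything here is a made-up toy; the interesting and hard part is of course the selector):

```python
def blend_predict(query, item_pred, user_pred, selector):
    # per-query, ask the selector which predictor is more trustworthy
    # and use that one; a predictor that is worse on average can still
    # help if the selector only invokes it where it's strong
    return item_pred(query) if selector(query) == "item" else user_pred(query)

# toy setup: the item predictor is better on average, the user predictor
# is only good on queries flagged "user_correlated" (hypothetical feature)
item_pred = lambda q: 3.0
user_pred = lambda q: 4.0 if q["user_correlated"] else 1.0
selector = lambda q: "user" if q["user_correlated"] else "item"

queries = [{"user_correlated": False, "truth": 3.0},
           {"user_correlated": True,  "truth": 4.0}]
errs = [abs(blend_predict(q, item_pred, user_pred, selector) - q["truth"])
        for q in queries]
```

Used alone, user_pred would be badly wrong on half these queries; gated by the selector, the blend is exact on both.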
10/19/2006
101906  4
101906  3
Consider two vectors A and B. The "MSE" of their difference is just |A-B|^2, the "rmse" is |A-B|. You can define a scale-invariant version of MSE as |A-B|^2/(ab) (where a = |A|, b = |B|). I'm not sure whether this scale invariance is a good thing or not; in practice it does change the relative weighting of errors, but people seem to like it because it's "elegant". Anyway, the Law of Cosines tells us that:
Cos(theta) = (a/b + b/a)/2 - |A-B|^2/(2ab)
The second term is familiar; the first term is this weird sort of ratio average. In the special case of unit vectors, a = b = 1, this reduces to
Cos(theta) = 1 - |A-B|^2/2
So for unit vectors a "cosine" similarity and an MSE similarity will produce identical orderings. For non-unit vectors, we have this ratio term. The ratio term is minimized when a = b, so it gets bigger for length differences. In a sense it cancels out the length difference contained in |A-B|. Consider two parallel vectors of different length: the MSE metric considers these very different, but cosine says they are the same.
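The Cos(theta) relation is easy to sanity-check numerically:

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cos_theta(A, B):
    # the usual definition: normalized dot product
    return sum(a * b for a, b in zip(A, B)) / (norm(A) * norm(B))

def cos_from_mse(A, B):
    # Law of Cosines form: cos(theta) = (a/b + b/a)/2 - |A-B|^2/(2ab)
    a, b = norm(A), norm(B)
    d2 = sum((x - y) ** 2 for x, y in zip(A, B))
    return (a / b + b / a) / 2 - d2 / (2 * a * b)

A = [0.3, 0.4, 1.2]
B = [1.0, -0.5, 0.7]
ok = abs(cos_theta(A, B) - cos_from_mse(A, B)) < 1e-12
```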
In practice with movie ratings there are some very weird things that happen here. The "A" vector is the vector of user A's ratings over N movies *subtracted by his average rating*. This subtraction is important, but it does a weird thing. It makes the vector of random direction (by definition of average), and generally makes it close to zero. If you look at a subset of movies that he rates close to average, his vector will be very close to zero and very random. The "B" vector is the same thing. If you now consider two vectors that are close to {0}, the MSE error between them is of course tiny, but the cosine between them is completely random. Obviously that's not a very sensible weighting. Similarly, consider the "scale invariant" version of MSE. You're dividing by the vector lengths. What that does in practice is make the error much larger for vectors near zero. (dividing by length does sort of work for Cosine because larger cosine = better, the opposite for mse)
Despite all this the "Pearson" correlation for useruser similarity does in fact seem to perform well.
101906  2
I was thinking that to have any chance of winning I need to get some teammates / helpers & use the power of the internet. I don't want to deal with signing people up & working out contracts where we'd have to negotiate shares & so on. I was thinking I'd do something like just give away my app, but make people agree that they won't do their own submission or something. Then I at least might attract some hobbyists who want to play with the problem and they might find some improvements or try some ideas that are beneficial.
101906  1
10/18/2006
101806  1
10/17/2006
101706  6
1. They all use this "cosine" similarity measure. The cosine similarity is basically the dot product of the normalized vectors of ratings. What that means is that two movies with rating vectors like {2,2,2} and {4,4,4} are considered identical by this measure. Now, that's an okay thing assuming you're compensating for movie average, since if you subtract the average off both they are both {0,0,0}. However, the standard way of making an item-based prediction does NOT subtract off the average! It's reported in the literature as
Pred = Sum{other items} Weight(this,other) * Rating(other item) / sum of weights;
If you were to correct for averages it would be :
Pred = Average(this item) + Sum{other items} Weight(this,other) * (Rating(other item) - Average(other item)) / sum of weights;
But that's not what they report.
2. The exception to this (not correcting for average) is the "Similarity Fusion" paper (sigir06_similarityfuson.pdf), which DOES correct for averages; however, they specifically claim that their method in the case of item-only weights reduces to the standard item-based predictor, which it in fact does NOT. It reduces to the 2nd formula above, the average-corrected item-based predictor, which is quite different.
It seems to me the cosine weighting is very bad if you don't correct for averages, and yet the standard in the literature is to use the cosine weighting and not correct for averages. WTF. The standard "Item-Based Collaborative Filtering" paper (www10_sarwar.pdf) does try using a linear slope+bias to shift ratings between items, but they find that it hurts performance with lots of neighbors or dense data. Weird.
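The two predictors side by side (toy weights and ratings, invented just to show how different the answers can be):

```python
def predict_plain(weights, ratings):
    # standard item-based predictor as reported in the literature:
    # weighted average of the raw ratings of similar items
    return sum(w * r for w, r in zip(weights, ratings)) / sum(weights)

def predict_avg_corrected(this_avg, weights, ratings, item_avgs):
    # average-corrected variant: weighted average of *offsets* from
    # each item's mean, added back onto this item's mean
    num = sum(w * (r - m) for w, r, m in zip(weights, ratings, item_avgs))
    return this_avg + num / sum(weights)

# toy example: two similar items with equal weights
w = [1.0, 1.0]
r = [4.0, 2.0]    # the user's ratings of the two similar items
m = [3.5, 1.5]    # those items' average ratings
p1 = predict_plain(w, r)                     # ignores all the averages
p2 = predict_avg_corrected(4.0, w, r, m)     # this item averages 4.0
```

Here the plain predictor says 3.0 while the corrected one says 4.5: the user rated both neighbors half a point above their averages, so a high-average query movie should be predicted high, which the plain form completely misses.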
101706  5
101706  4
101706  3
101706  2
Some of the current leaders have finally spoken up in the Netflix forum. They seem to all be using some sort of standard academic bayesian/neural net learning system. This could be hard to beat if they're good & have powerful server clusters to do the training, which they presumably do at universities.
101706  1
10/16/2006
101606  2
So, I've read through a bunch of the literature. It seems like I independently invented most of the same techniques. Some interesting tidbits: I'm currently using 2 similar movies & 2 similar users, while the accepted good # in the academic community seems to be 30 (!!). (Using more doesn't increase my run time at all since I'm looking at all of them anyway to pick the best 2.) Also, the way they're comparing vectors is generally a dot product, while I've been using a Euclidean distance. I'm not sure if that will make much of a difference. They also do fudging on the similarity to penalize small intersections and such; I have similar fudging and I've found that the details of those fudge factors are enormously important in practice. I can't tell from the literature if people have really investigated different similarity functions, or if everyone is just using the one that was published first and has become standard.
One cool approach is the "Content-Boosted Collaborative Filtering" paper, though it seems completely unusable on large data sets because the first step is turning the sparse matrix into a full matrix by filling in the gaps with content-based predictions. That is, it would fill in the full 17k x 480k matrix of ratings, which is 8 billion ratings, which aside from taking about 8 GB of storage (not bad) would take around 1,000,000 hours to compute.
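Checking the back-of-envelope numbers (the predictions-per-second rate is a pure guess, chosen to match the estimate above):

```python
movies = 17_000
users = 480_000
ratings = movies * users            # ~8.2 billion cells to fill
gigabytes = ratings / 2**30         # ~7.6 GB at one byte per rating

# assuming ~2 content-based predictions per second (a guess),
# filling the whole matrix takes on the order of a million CPU-hours:
hours = ratings / 2.0 / 3600
```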
101606  1
10/14/2006
101406  1
10/13/2006
101306  4
101306  3
Having the probe data contained in the training set is actually really annoying, because it means I can't fairly precompute very much based on the training set. For example, if you precompute the "most similar" other movie for each movie, you will be using the probe data as part of that, and that helps a *TON* which totally throws off testing on the probe.
There's one case maybe y'all can give me ideas on. It's surprisingly common to have a movie for which I can't find any good "similar" movies. These "red herring" movies wind up contributing a lot to the RMSE score because they tend to be predicted badly. Right now I'm just using the average rating of them as their prediction, but there must be something better.
101306  2
One thing I'm surprised at. 25% of movies fall into a category of "super predictable" cases. In these cases, I gave the same rating to the two most similar movies, and the most similar user gave the same rating to the query movie. The RMSE of these queries is still 0.85 ; I would have thought these cases would be more predictable.
Part of the problem with making a really amazing predictor is that the ratings are so discretized. If somebody felt like a "3.5" about a movie, they will sort of randomly shove it to a "3" or a "4" depending on how they feel, but the difference is a whole 1.0 which is big.
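The discretization floor is easy to see in a toy simulation: if the true feeling is a 3.5 and the user is forced to round to an integer, even a perfect predictor eats an RMSE of 0.5 on those ratings.

```python
import random

random.seed(1234)
# simulate users whose true feeling is 3.5 but who must pick 3 or 4:
true = [3.5] * 10000
observed = [t + random.choice([-0.5, 0.5]) for t in true]

# the best possible prediction is the true 3.5, yet the error floor
# from rounding alone is exactly 0.5 per rating:
mse = sum((o - t) ** 2 for o, t in zip(observed, true)) / len(true)
rmse = mse ** 0.5
```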
101306  1
BTW I find the whole "Prize" aspect sort of annoying because it means no one is talking about their approaches publicly. Of course, even in fields where there is no prize, people are still supersecretive, because they always think they have some brilliant idea which is going to make them famous in the academic world, or get them patents and make them rich, blah blah blah.
Oh, I still could use some volunteers to run the heavy process if you have fast machines that are idle.
10/12/2006
101206  1
10/10/2006
101006  2
101006  1
10/09/2006
100906  1
10/06/2006
100606  3
RMSE for guessing each movie gets its average rating : 1.051938
RMSE for Netflix' Cinematch : 0.9474
RMSE for simple onemovie predictor : 1.037155
So, that's not very encouraging. Obviously I knew it wasn't going to be close to Cinematch, but the improvement over just using the average for each movie is microscopic, and Cinematch is light years ahead.
I'm starting to think that beating Cinematch by 10% is all but impossible. It's a lot like data compression where the closer you get to the true entropy the harder it is to make advances. The Netflix Prize FAQ justifies their 10% criterion by saying the difference between the "average" predictor and Cinematch is about 10%, so they want 10% more. That fails to address two major factors : "average" is actually not an awful predictor at all, and the next 10% is way way harder than the first 10%.
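The arithmetic, using the RMSE numbers above:

```python
avg_rmse = 1.051938        # per-movie-average predictor (from above)
cinematch = 0.9474         # Netflix's Cinematch score
target = cinematch * 0.9   # the prize threshold: 10% better than Cinematch

# how much better Cinematch is than the "average" predictor:
improvement_avg_to_cine = (avg_rmse - cinematch) / avg_rmse   # ~9.9%

# so the prize asks for roughly the same relative gain again
# (down to ~0.853), on data already much closer to its entropy
```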
100606  2
100606  1
Anyone know anything about Memory mapping? Is there a reason I should memory map instead of just reading the whole file into big buffers?
Also, does anyone have a powerful machine that I can run a heavy process on for a long time? I estimate it will take about 200 hours of CPU time, 2 Gigs of RAM is necessary.
10/04/2006
100406  2
Postman. Apparently very hard to get. I like walking around
Parks worker. Similar situation, I like planting plants and such, being outside
Apprentice Carpenter. I've seen some ads for this; I have no skills but would like to learn. I could offer to work the first few weeks with no pay
Bike shop retail or repair. Again my bike repair skills are only passable not expert, but I'd like to learn.
Cook. I'd love to be a chef but that's a long hard road and being an entrylevel prep cook or something is pretty brutal
Other ?
100406  1
Anybody who knows Windows  what's the best way to hold an 800 MB data structure on a machine with 1 GB of RAM ? I'm assuming if I just hold it in RAM I'm going to page like crazy and get killed, cuz Windoze is Stoopid. I guess I could write my program in basic C and boot to command line Linux or something ridiculous like that but that's not fun.
10/03/2006
100306  2
At the moment the site seems to be a mess. The documentation sucks. The forums are full of people who don't know how to program asking nonsense questions.
100306  1
10/02/2006
100206  3
The new bill was tacked onto the "SAFE Ports" bill, primarily by Bill Frist. It will soon be signed and go into law. Basically what it does is make it illegal to provide online gambling, or transactions to online gambling. It does not make it illegal for an individual/consumer to partake of said online gambling. Offshore sites will continue to operate. It will be illegal for US banks or other funding instruments to provide transactions with them, so the hard part will be getting your money in & out of them, but of course there will be ways. Another problem is that major publicly traded gambling companies will be shutting off their US customers because of this. This is a bit of a funny thing: because they're foreign companies they aren't directly affected by this law, but they have to comply with it if they want their stock traded in US markets, and so on and complicated stuff I don't really understand. Anyway, it seems all the major publicly traded companies will soon be shutting off US customers. That includes PartyPoker/PartyGaming, PokerStars, 888 (BetonSports), etc. There are other major private companies which will continue to allow US customers, the biggest being Full Tilt, Ultimate Bet, etc. While the law will go into effect immediately, the bank transfer prohibition won't go into effect until the bank regulatory body draws up the detailed regulations, which will take a while.
100206  2
100206  1
10/01/2006
100106  1
9/29/2006
092906  1
I almost got a ticket today cuz I forgot to move my car for the street sweeping. I saw the little meter maid guys swoop into the street ahead of the sweeper, dropping tickets like bombs. I was playing poker in my pajamas and had just open raised my AK (fuck! AK = monies!), I scrambled to find my keys, then almost ran out without my house key (bad news, doors autolock, locked out, no fun), ran out and got to my car just as he was putting a ticket on it. Fortunately the guy was cool and took it off for me. Fucking street parking is a nightmare.
Some new developments mean the online poker ban might pass after all. I guess I'd have to get a job of some kind pretty quick because my finances aren't so hot these days, I need a regular income.
Late news  yep, the online gambling ban passed. It's not actually law yet, but it got tacked on the port security bill which is going to be passed for sure. On the plus side, I didn't actually make my big stock bet against the bill by buying Neteller or something. I'm sure Neteller stock is in a nosedive now.
9/27/2006
092706  1
9/26/2006
092606  2
1. When hitting your draw can complete a higher one-card straight. For example if you have something like 78 and the board is 9TQ. It looks like you have an open-ended straight draw, but hitting a J is not great because any one-card K then makes a higher straight. Your draw really sucks in this case.
2. When hitting your draw makes a lower one-card straight for someone else. For example if you have AT and the board is 9JQ. If you're up against another T and you hit a K, he will also have a straight, but you'll have the nuts and likely get paid off. In this case your draw is better than a normal straight draw because your implied odds are very good.
092606  1
In related news, for those who don't know, ESPN's coverage of WSOP is a ridiculous distortion of reality. It's just like Fox News or some other agendabased "news", they're trying to tell some story and they pick and choose bits of the truth and create characters and tell the story they want. Just for example  early on they showed Negreanu playing well. In reality he was playing a wild loose game, his chip stack was going way up and way down all the time; ESPN mainly only showed the hands he won which makes him look great, but he donked it up horribly in tons of pots. The other huge distortions are Jamie Gold related. Jamie was being a major ass and bluffing a LOT at the table and they mainly don't show it; he also didn't give any money to charity.
9/25/2006
092506  2
092506  1
9/24/2006
092406  3
ps. I've noticed a trend recently which I find quite sickening. There are more and more books being written which are basically onesentence concepts padded out to full book length. Nobody actually reads these books, the whole point is so that the author can go on talk shows or write newspaper articles and be known as the "author of such and such". Somebody might write a book called "How College Killed The Family" about how education reduces marriage rates; one paragraph of information padded with life stories and such garbage, then they do the talk show circuit where they just say their one sentence over and over, and are basically used as a springboard for the host to show their own feelings on the subject. I should write some of these books and just fill them with blank pages, that way I can be an "expert" and spout my nonsense.