Archive for May, 2009

R and data

May 26, 2009

My fellow bloggers John and Scott have posted recently about the free statistical programming language R.  How does it compare to an expensive language like SAS?

If you’ve done any statistical analysis, then you’ll know that getting and cleaning the data is a major step in any project.  SAS does a pretty good job at this, and will complain if the data is not in the format you think it is.  As for R, here’s an excerpt from the R FAQ:

7.10 How do I convert factors to numeric?

It may happen that when reading numeric data into R (usually, when reading in a file), they come in as factors. If f is such a factor object, you can use

as.numeric(as.character(f))

to get the numbers back. More efficient, but harder to remember, is

as.numeric(levels(f))[as.integer(f)]

In any case, do not call as.numeric() or their likes directly for the task at hand (as as.numeric() or unclass() give the internal codes).

As one of my favorite musicals says, “It ain’t no joke, that’s why it’s funny”.  Maybe when you do an uncommon operation like reading in a file, your numbers will be silently converted into factors / categorical variables.  Or maybe not.  Ha ha.   But certainly, don’t do anything silly like thinking as.numeric(f) would convert f into numbers you might want.  Ha ha ha.  Oh, and that “more efficient” way of doing things?  It quietly hands back nothing but NAs if f was actually numeric to start with.  Ha ha ha ha.  Stop, you’re killing me!  [or at least, my productivity].
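For the skeptical, here is a minimal sketch of the behavior the FAQ describes. The factor f and its values are made up for illustration and do not come from any real file. It compares as.numeric(f) with the two FAQ recipes, then tries the second recipe on a plain numeric vector, where levels() should return NULL.

```r
# Hypothetical data: numbers that have ended up stored as a factor.
f <- factor(c("100", "20", "3", "20"))

# Tempting but wrong: this returns the internal level codes.
codes <- as.numeric(f)

# The FAQ's first recipe: go through the character labels.
via_char <- as.numeric(as.character(f))    # 100 20 3 20

# The FAQ's second recipe: convert each level once, then index by code.
via_levels <- as.numeric(levels(f))[as.integer(f)]

print(codes)
print(via_char)
print(via_levels)

# The second recipe applied to something that was numeric all along.
# A plain numeric vector has no levels, so there is nothing to index into.
x <- c(1.5, 2.5)
on_numeric <- as.numeric(levels(x))[as.integer(x)]
print(on_numeric)
```

If you cannot be sure whether a column arrived as a factor or as numbers, the first recipe looks like the safer choice, since as.character() also accepts a numeric vector.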

To complete the joke, here’s an excerpt from the R manual:

In general, coercion from numeric to character and back again will not be exactly reversible, because of roundoff errors in the character representation.

That’s fair enough.  It’s not as if you have a good reason for doing this, except perhaps when you’re reading numbers in from a file.
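A small sketch of what the manual is warning about, assuming the default behavior of as.character() on doubles (about 15 significant digits). The value x is just a convenient example of a number whose full binary representation does not survive the trip through text.

```r
# 0.1 + 0.2 is stored as a double slightly above 0.3.
x <- 0.1 + 0.2

# Converting to character keeps only about 15 significant digits.
s <- as.character(x)
print(s)

# Converting back gives the double nearest to the printed text,
# which is not the same double we started with.
back <- as.numeric(s)
print(back == x)
print(x - back)
```

The difference is tiny, but it is enough to break an exact equality test or a join on a numeric key.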

Wolfram Alpha

May 18, 2009

New search engines make for easy posts for lazy bloggers – type in your favorite search term (often your own name), snark that you can’t find it, and hit the publish button.  This post might not be an exception.

Valleywag claims that Wolfram Alpha “excels at providing information people don’t care about”.  Harsh, but is it fair?  I tried it on something I do care about, finding the average rate for statistical consulting.  Unfortunately, “what is the average statistical consulting rate?” or even just “statistical consulting rate” gives nothing on Wolfram Alpha, while Google returns as its second hit a pdf from the American Statistical Association – a survey of statistical consulting rates.  Score one for Google.  Just for fun, I tried out cuil [remember them?], which gave similar results to Google, without the most useful one, but with a photo of college students being served lunch.

What’s going on here?  It seems to be an example of a general principle – it can be very hard to beat a simple idea, whether it be exponential smoothing, index funds, or the script for Friends.  Google’s basic idea is pretty simple – look at pages that get the most links from other pages.  Wolfram Alpha’s is more complicated and probably “smarter”.  Wolfram might object that this was not a fair test, and I should have asked for the integral of x^2 sin x, and that eventually they will get around to working out other quantitative questions such as average consulting rates, how much it costs to rent an apartment in Chicago or what the most fuel-efficient small car is.  In the meantime, it seems that Valleywag has a point, and the lesson to be learned is that useful beats smart any day of the week.