My fellow bloggers John and Scott have posted recently about the free statistical programming language R. How does it compare to an expensive language like SAS?
If you’ve done any statistical analysis, then you’ll know that getting and cleaning the data is a major step in any project. SAS does a pretty good job at this, and will complain if the data is not in the format you think it is. As for R, here’s an excerpt from the R FAQ:
7.10 How do I convert factors to numeric?
It may happen that when reading numeric data into R (usually, when reading in a file), they come in as factors. If f is such a factor object, you can use
as.numeric(as.character(f))
to get the numbers back. More efficient, but harder to remember, is
as.numeric(levels(f))[as.integer(f)]
In any case, do not call as.numeric() or their likes directly for the task at hand (as as.numeric() or unclass() give the internal codes).
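To make the FAQ's warning concrete, here is a minimal sketch. The factor f and its values are made up for illustration; they stand in for a numeric column that arrived as a factor.

```r
# Hypothetical data: numbers that were read in as a factor.
f <- factor(c("10", "2.5", "300", "2.5"))

# Calling as.numeric() directly returns the internal level codes.
codes <- as.numeric(f)

# Both FAQ idioms recover the original values.
via_character <- as.numeric(as.character(f))
via_levels    <- as.numeric(levels(f))[as.integer(f)]

print(codes)
print(via_character)
print(via_levels)
```

The codes are small integers indexing into levels(f), which is why they look plausible enough to go unnoticed in a real dataset.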
As one of my favorite musicals says, “It ain’t no joke, that’s why it’s funny”. Maybe when you do an uncommon operation like reading in a file, your numbers will be silently converted into factors / categorical variables. Or maybe not. Ha ha. But certainly, don’t do anything silly like thinking as.numeric(f) would convert f into numbers you might want. Ha ha ha. Oh, and that “more efficient” way of doing things? It breaks if f was actually numeric to start with: levels(f) is then NULL, so you typically get back nothing but NAs. Ha ha ha ha. Stop, you’re killing me! [or at least, my productivity].
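One way to avoid caring whether a column came in as a factor or as numbers is a small wrapper. This is only a sketch: to_numeric is a hypothetical helper name, not something base R provides.

```r
# Hypothetical helper (not part of base R): convert to numeric whether x is a
# factor, a character vector, or already numeric.
to_numeric <- function(x) {
  if (is.factor(x)) x <- as.character(x)
  as.numeric(x)
}

x <- c(10, 2.5, 300)

# The levels-based idiom on a plain numeric vector: levels(x) is NULL,
# so indexing the empty result yields only NA.
levels_idiom <- as.numeric(levels(x))[as.integer(x)]
print(levels_idiom)

print(to_numeric(x))          # numeric input passes through unchanged
print(to_numeric(factor(x)))  # factor input is converted via its labels
```

Entries that are not valid numbers still become NA (with a warning) inside to_numeric, so the helper does not hide dirty data.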
To complete the joke, here’s an excerpt from the R manual:
In general, coercion from numeric to character and back again will not be exactly reversible, because of roundoff errors in the character representation.
That’s fair enough. It’s not as if you have a good reason for doing this, except perhaps when you’re reading numbers in from a file.
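For what the manual's caveat looks like in practice, here is a small sketch using 1/3 as an arbitrary example value. The second half rests on an assumption: that printing 17 significant digits is enough for the text to identify the original double, which depends on the text being parsed accurately.

```r
x <- 1 / 3

# as.character() keeps about 15 significant digits, so the value that comes
# back from the text is not the value that went in.
text_default <- as.character(x)
round_trip_equal <- as.numeric(text_default) == x
print(text_default)
print(round_trip_equal)

# Assumption: 17 significant digits are enough to pin down a double.
text_17 <- sprintf("%.17g", x)
print(text_17)
print(as.numeric(text_17) == x)
```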
May 26, 2009 at 10:26 pm |
Does Matlab allow you to dump variable precision text? I don’t think it does. That would be hard. You can easily represent things in base 10 text which don’t map exactly to 4 or 8 byte floats. I mean, you probably won’t unless you’re truncating somewhere when you print, but I think it’s a reasonable warning to statisticians who might not be down with the latest flop radix baloney.
If you’re importing “square” data into R, read.table(), read.csv(), read.fwf() or read.delim() are going to work better than DIY. Data importation is a weakness, but it often works pretty handily this way. It’s way better than what you have to do in Lush, where regex is absurdly slow: you end up writing sed scripts to match to a C template. Fast, but awkward.
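As a rough illustration of the read.csv() route, the sketch below reads two made-up CSV snippets from strings instead of files. The column names and the stray "n/a" entry are invented for the example. stringsAsFactors = TRUE is passed explicitly because it stopped being the default in R 4.0.0.

```r
# Made-up CSV text standing in for files on disk.
clean_text <- "id,weight\n1,70.5\n2,82.1\n3,64.0\n"
dirty_text <- "id,weight\n1,70.5\n2,82.1\n3,n/a\n"

clean <- read.csv(text = clean_text, stringsAsFactors = TRUE)
dirty <- read.csv(text = dirty_text, stringsAsFactors = TRUE)

# A fully numeric column is read as numeric; one non-numeric entry is
# enough to turn the whole column into a factor.
print(sapply(clean, class))
print(sapply(dirty, class))
```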
May 27, 2009 at 8:09 am |
It’s like that joke about boats: the worst thing you can do with them is put them in the water. The worst thing you can do with R is give it data…
We make a point of doing as much of the data massaging and cleaning as we can outside of R (a point John conveniently ignored in his post), usually in Java. As Scott says, read.table() is probably the most reliable way to import data into R. The factor/numeric conversion issue generally only arises when the data is integer, so rounding error is not actually that much of an issue. Also, the conversion often occurs at the function level, not the import level (e.g. some functions seem to automatically convert integer y-data to levels, unless told otherwise).
That said, I have to agree: there are certain operations that are easy in any other data analysis language that are insanely difficult in R, and R has the worst “help” system it’s ever been my misfortune to use. But its data visualization capability is a thing of beauty.
November 18, 2009 at 7:31 am |
I’ve used R for eight years and never had numeric data from textfiles coerced to factors.
Certainly there are real-life quirks and nuisances with R, but this example isn’t a common one.
November 18, 2009 at 4:18 pm |
Rodney, thanks for writing. I haven’t used R much, but this example came up the first time I used it, debugging code written by someone else. Is this example uncommon? Maybe I was just unlucky. OTOH, it has made it into the R FAQ, so the F part means it can’t be too uncommon. I also see people’s search terms for this post indicating they’re trying to solve this problem. I’ll give you that I don’t think the numeric to character and back again problem is too much of an issue, as Nina says in her comment. But rest assured, the example is very much a real life one.
You do have a point though, that the example is uncommon enough for a solution to be hard to find on the Web. This is not really a point in R’s favor – it just makes the problem harder to solve when it does come up.
My point is that reading data correctly and easily should be job 0 of any statistical language. SAS has many faults, and is a much uglier language to program in than R (or just about anything), but it does seem to realize this basic fact.
March 14, 2011 at 11:40 am |
Granted, automatic conversion to factors is really annoying (though when it happened to me, it actually picked up errors in my dataset: I had replaced 00 with NA).
But it’s worth noting (in case someone finds this page) that the read functions have a colClasses argument which can be used to ensure that data is imported correctly.
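A minimal sketch of the colClasses approach follows, with made-up data and column names. With the column types declared up front, a non-numeric entry stops the import with an error instead of quietly changing the column's type.

```r
good_text <- "id,weight\n1,70.5\n2,82\n"
bad_text  <- "id,weight\n1,70.5\n2,oops\n"

# Declared types are honoured when the data matches them.
good <- read.csv(text = good_text, colClasses = c("integer", "numeric"))
print(sapply(good, class))

# A bad entry raises an error; tryCatch() captures it so the script continues.
result <- tryCatch(
  read.csv(text = bad_text, colClasses = c("integer", "numeric")),
  error = function(e) e
)
print(inherits(result, "error"))
print(conditionMessage(result))
```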
March 14, 2011 at 1:16 pm |
Thanks for the comment, disgruntledphd. Yes, definitely worth looking through the read.table documentation and colClasses, stringsAsFactors. As you say, conversion to factors is often a sign that there’s some error in your data, and data cleaning is a whole other post….
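If a column does come in as text or a factor, one way to track down the culprit is to list the entries that fail numeric conversion. This is a sketch on invented data; the "7O" entry (letter O instead of zero) is a made-up typo.

```r
csv_text <- "id,weight\n1,70.5\n2,82.1\n3,n/a\n4,7O\n"

# Read the column as plain text so nothing is converted behind our back.
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Entries that are present but do not parse as numbers are the suspects.
parsed <- suppressWarnings(as.numeric(df$weight))
bad <- df[is.na(parsed) & !is.na(df$weight), ]
print(bad)
```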
April 23, 2012 at 7:11 pm |
I had this issue importing tab-delimited text files into R that had been exported from Excel.
It turned out that #N/A (used by Excel) in numeric columns caused conversion from numeric to factor (even when there was no #N/A in the first five rows).
Fixed with the argument na.strings="#N/A" in the read.delim command.
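A small sketch of that fix on made-up tab-delimited text: the #N/A deliberately sits after the first five rows. Keeping "NA" in na.strings alongside "#N/A" is my own choice here, so that ordinary NA entries are still recognised.

```r
tsv_text <- "id\tvalue\n1\t3.2\n2\t4.8\n3\t5.1\n4\t2.2\n5\t9.9\n6\t#N/A\n"

# Without na.strings, the single #N/A turns the column into a factor
# (stringsAsFactors = TRUE reproduces the pre-4.0.0 default).
raw <- read.delim(text = tsv_text, stringsAsFactors = TRUE)
print(class(raw$value))

# Declaring #N/A as a missing-value marker keeps the column numeric.
fixed <- read.delim(text = tsv_text, na.strings = c("NA", "#N/A"))
print(class(fixed$value))
print(fixed$value)
```

Note that read.delim() does not treat # as a comment character by default, whereas read.table() does, so the same file read with read.table() would also need comment.char = "".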