R and data

My fellow bloggers John and Scott have posted recently about the free statistical programming language R.  How does it compare to an expensive language like SAS?

If you’ve done any statistical analysis, then you’ll know that getting and cleaning the data is a major step in any project.  SAS does a pretty good job at this, and will complain if the data is not in the format you think it is.  As for R, here’s an excerpt from the R FAQ:

7.10 How do I convert factors to numeric?

It may happen that when reading numeric data into R (usually, when reading in a file), they come in as factors. If f is such a factor object, you can use

as.numeric(as.character(f))

to get the numbers back. More efficient, but harder to remember, is

as.numeric(levels(f))[as.integer(f)]

In any case, do not call as.numeric() or their likes directly for the task at hand (as as.numeric() or unclass() give the internal codes).
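To make the FAQ entry concrete, here is a minimal sketch using a made-up factor `f` whose labels happen to look like numbers. It compares a direct `as.numeric()` call with the two recipes quoted above; the data are purely illustrative.

```r
# Hypothetical factor whose labels look like numbers
f <- factor(c("10", "20", "20", "5"))

# Direct call: returns the internal level codes, not the values
codes <- as.numeric(f)

# The FAQ's first recipe: go through the character labels
values <- as.numeric(as.character(f))

# The FAQ's "more efficient" recipe: convert the levels once, then index
fast <- as.numeric(levels(f))[as.integer(f)]

print(codes)
print(values)
print(identical(values, fast))
```

The levels are sorted as strings, so the codes bear no relation to the size of the numbers they stand for.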

As one of my favorite musicals says, “It ain’t no joke, that’s why it’s funny”. Maybe when you do an uncommon operation like reading in a file, your numbers will be silently converted into factors (categorical variables). Or maybe not. Ha ha. But certainly, don’t do anything silly like thinking as.numeric(f) would convert f into the numbers you might want. Ha ha ha. Oh, and that “more efficient” way of doing things? It breaks if f was actually numeric to start with: levels(f) is NULL, so you get back NAs instead of your numbers. Ha ha ha ha. Stop, you’re killing me! [Or at least, my productivity.]
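A quick sketch of that last failure mode, on a made-up numeric vector. As far as I can tell, base R does not raise an error here; the indexing idiom quietly returns NA for every element. The `to_numeric()` helper is a hypothetical guard, not something from the FAQ.

```r
# Already numeric, not a factor (hypothetical data)
x <- c(1.5, 2.5, 3.5)

# The character round trip still works on a plain numeric vector
safe <- as.numeric(as.character(x))

# levels(x) is NULL, so this indexes into an empty vector
fast <- as.numeric(levels(x))[as.integer(x)]

# Hypothetical helper: only use the levels trick when v really is a factor
to_numeric <- function(v) {
  if (is.factor(v)) as.numeric(levels(v))[as.integer(v)] else as.numeric(v)
}

print(safe)
print(fast)
print(to_numeric(x))
print(to_numeric(factor(c("7", "3"))))
```

Checking `is.factor()` first means the same call is safe whichever way the import happened to go.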

To complete the joke, here’s an excerpt from the R manual:

In general, coercion from numeric to character and back again will not be exactly reversible, because of roundoff errors in the character representation.

That’s fair enough.  It’s not as if you have a good reason for doing this, except perhaps when you’re reading numbers in from a file.
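The round trip the manual warns about can be tried directly. This sketch uses 1/3 as an arbitrary example value; the exact digits printed may depend on your R version, so no output is shown.

```r
x <- 1/3

# Numeric to character: as.character() keeps a limited number of significant digits
s <- as.character(x)

# Character back to numeric
back <- as.numeric(s)

print(s)
print(back == x)                    # exact equality may not survive the round trip
print(isTRUE(all.equal(back, x)))   # equality within a small tolerance
```

Comparing with `all.equal()` rather than `==` is the usual way to live with this.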

11 Responses to “R and data”

  1. Scott Locklin Says:

    Does Matlab allow you to dump variable precision text? I don’t think it does. That would be hard. You can easily represent things in base10 text which don’t map exactly to 4 or 8 byte floats. I mean, you probably won’t unless you’re truncating somewhere when you print, but I think it’s a reasonable warning to statisticians who might not be down with the latest flop radix baloney.

    If you’re importing “square” data into R, read.table(), read.csv(), read.fwf() or read.delim() are going to work better than DIY. Data importation is a weakness, but it often works pretty handily this way. It’s way better than what you have to do in Lush, where regex is absurdly slow: you end up writing sed scripts to match to a C template. Fast, but awkward.

  2. Nina Zumel Says:

    It’s like that joke about boats: the worst thing you can do with them is put them in the water. The worst thing you can do with R is give it data…

    We make a point of doing as much of the data massaging and cleaning as we can outside of R (a point John conveniently ignored in his post), usually in Java. As Scott says, read.table() is probably the most reliable way to import data into R. The factor/numeric conversion issue generally only arises when the data is integer, so rounding error is not actually that much of an issue. Also, the conversion often occurs at the function level, not the import level (e.g. some functions seem to automatically convert integer y-data to levels, unless told otherwise).

    That said, I have to agree: there are certain operations that are easy in any other data analysis language that are insanely difficult in R, and R has the worst “help” system it’s ever been my misfortune to use. But its data visualization capability is a thing of beauty.

  3. Win-Vector Blog » Survive R Says:

    [...] some of our, perhaps wiser, colleagues ( see: Choose your weapon: Matlab, R or something else? and R and data ). While we do like R (see: Exciting Technique #1: The “R” language ) we also understand the [...]

  4. Rodney King Says:

    I’ve used R for eight years and never had numeric data from textfiles coerced to factors.

    Certainly there are real-life quirks and nuisances with R, but this example isn’t a common one.

  5. erehweb Says:

    Rodney, thanks for writing. I haven’t used R much, but this example came up the first time I used it, debugging code written by someone else. Is this example uncommon? Maybe I was just unlucky. OTOH, it has made it into the R FAQ, so the F part means it can’t be too uncommon. I also see people’s search terms for this post indicating they’re trying to solve this problem. I’ll give you that I don’t think the numeric to character and back again problem is too much of an issue, as Nina says in her comment. But rest assured, the example is very much a real life one.

    You do have a point though, that the example is uncommon enough for a solution to be hard to find on the Web. This is not really a point in R’s favor – it just makes the problem harder to solve when it does come up.

    My point is that reading data correctly and easily should be job 0 of any statistical language. SAS has many faults, and is a much uglier language to program in than R (or just about anything), but it does seem to realize this basic fact.

  6. Random conversion of imported data to factors in R | Techonomist Says:

    [...] extremely frustrated trying to import a simple dataset into R today I stumbled upon a post by Erehweb who sarcastically dissects the difficulty of importing numeric data into R and having it [...]

  7. Reflections on consulting part 5 – what languages and tools to learn? « Erehweb’s Blog Says:

    [...] Unlike my fellow bloggers at Win-Vector, I’m not a big fan of R.  But you can do a lot of statistics in it, and it’s free, so no need for your clients to [...]

  8. Things I wish I’d known before I started using R « Erehweb’s Blog Says:

    [...] is a really great function.  But it has some gotchas – e.g. default options may convert numbers to factors.  Dealing with data is a whole other post, but you can always convert back using [...]

  9. disgruntledphd Says:

    Granted, automatic conversion to factors is really annoying (though when it happened to me, it actually picked up errors in my dataset: I had replaced 00 with NA).

    But it’s worth noting (in case someone finds this page) that the read functions have a colClasses argument which can be used to ensure that data is imported correctly.

    • erehweb Says:

      Thanks for the comment, disgruntledphd. Yes, definitely worth looking through the read.table documentation and colClasses, stringsAsFactors. As you say, conversion to factors is often a sign that there’s some error in your data, and data cleaning is a whole other post….

  10. Abus Says:

    I had this issue importing tab-delimited text files into R that had been exported from Excel.

    Turned out #N/A (used by Excel) in numeric columns caused conversion from numeric to factor (even when there was no #N/A in the first five rows).

    Fixed with the argument na.strings="#N/A" in the read.delim command.
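The fixes suggested in the comments above (stringsAsFactors, colClasses and na.strings) can be sketched on a small made-up tab-delimited table. The data and column names are hypothetical, and `stringsAsFactors = TRUE` is set explicitly because it stopped being the default in R 4.0.0.

```r
# Hypothetical tab-delimited export with Excel's "#N/A" in a numeric column
raw <- "id\tvalue\n1\t3.5\n2\t#N/A\n3\t4.25\n"

# Old-style default: the stray "#N/A" turns the whole column into a factor
naive <- read.delim(text = raw, stringsAsFactors = TRUE)

# Declare "#N/A" as a missing-value marker: the column stays numeric
fixed <- read.delim(text = raw, na.strings = "#N/A")

# Declaring column types without na.strings turns the silent conversion into an error
err <- tryCatch(
  read.delim(text = raw, colClasses = c("integer", "numeric")),
  error = function(e) conditionMessage(e)
)

print(class(naive$value))
print(fixed$value)
print(err)
```

Using colClasses is the stricter choice: bad values stop the import instead of changing the column type behind your back.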
