R in production systems

R is great for prototyping models.  Not so great if those models have to run a business.  Here are some tips to help with that:

  1. Validate, alert, and monitor
  2. Sink
  3. Use 64-bit Linux
  4. Write your own functions
  5. tryCatch

Validate, alert, and monitor:  Sooner or later something is going to go wrong with your model.  Maybe some parameter will get the wrong sign and it will recommend selling iPads for a nickel.  You can guard against this by constrained optimization, but really you need to have an automated check on any results before they go into production.  If model results change a lot between runs, you should be automatically notified.  And even if the model is running fine, you should produce summaries of its performance, and how it’s changing over time.  To email yourself the string message_text with subject my_subject in Unix, do:

string_to_execute <- paste("echo -e \"", message_text, "\" | mutt -s \"", my_subject, "\" erehweb@madeupemail.com", sep = "")
system(string_to_execute)
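To make the "results change a lot between runs" check concrete, here is a minimal sketch; the coefficient names, the saved values, and the 20% threshold are all made up for illustration:

```r
# Compare this run's coefficients to the previous run's, and flag large
# relative swings or sign flips before the results reach production.
old_coefs <- c(intercept = 1.02, price = -0.48)   # saved from the previous run
new_coefs <- c(intercept = 1.05, price = -0.51)   # produced by the current run

rel_change <- abs(new_coefs - old_coefs) / abs(old_coefs)
alert <- any(rel_change > 0.20) || any(sign(new_coefs) != sign(old_coefs))

if (alert) {
  message_text <- "Suspicious coefficient change - holding results back"
  # ... then send message_text with the mutt snippet above
}
```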

Sink: When things go wrong, you’ll need to debug your code.  Make it easier by writing all R output from print, cat and errors to a log file – e.g.

log_file <- file("/mydir/my_log_file.txt")
sink(log_file)
sink(log_file, type = "message")    # So you catch the errors as well

# Your code goes here

sink(type = "message")
sink()
close(log_file)

If you want to get fancy, you can build the date/time into the file name:

log_time <- Sys.time()

file_suffix <- paste(format(log_time, "%m"), format(log_time, "%d"), format(log_time, "%y"), "_", format(log_time, "%H"), format(log_time, "%M"), sep = "")

log_file <- file(paste("/mydir/my_log_file_", file_suffix, ".txt", sep = ""))

Use 64-bit Linux: R is bad at memory management.  You can try to use smaller structures, trigger garbage collection with gc(), and rm() objects you no longer need, but the best solution is to run it on 64-bit Linux with lots of memory.  Anything else is gambling.
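When you do have to economize, the pattern is to keep only the summaries you need and release large intermediates promptly – a minimal sketch:

```r
# Build a large intermediate object, keep only the summary we need,
# then free the rest.
big_matrix <- matrix(rnorm(1e6), ncol = 100)  # ~8 MB of doubles
col_means  <- colMeans(big_matrix)            # the part we actually need
rm(big_matrix)                                # drop the big object...
invisible(gc())                               # ...and ask R to reclaim memory now
```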

Write your own functions: R has a lot of functions.  Unfortunately, many of them are buggy.  For example, here’s what the author of bayesglm has to say about it and glm:

… glm(), and its workhorse, glm.fit(), are a mess: They’re about 10 lines of functioning code, plus about 20 lines of necessary front-end, plus a couple hundred lines of naming, exception-handling, repetitions of chunks of code, pseudo-structured-programming-through-naming-of-variables, and general buck-passing. I still don’t know if my modifications [to produce bayesglm] are quite right–I did what was needed to the meat of the function but no way can I keep track of all the if-else possibilities.

Do you really want that code in a production system?  Copy it and call it my_glm or my_bayesglm.  That way it’s under your control, and will be easier to debug and fix.
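As a sketch of what "copy it" looks like in practice (my_glm is just a name for your private copy; the actual fix you would make to its body is omitted):

```r
# Take a private copy of glm. The copy keeps glm's original environment, so
# internal helpers like glm.fit still resolve; edits to my_glm are now yours.
my_glm <- stats::glm

# Check the copy behaves like the original before you start patching it:
fit <- my_glm(mpg ~ wt, data = mtcars, family = gaussian())
coef(fit)
```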

tryCatch: Well, at least if you do run into an error, you can send yourself a nice email saying what went wrong and where – a little more elegant than just relying on your log file.
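A minimal sketch of that pattern – my_model_run is a hypothetical stand-in for your model code, and the mutt call is commented out so the sketch stays self-contained:

```r
# Wrap the model run in tryCatch; on error, build an alert message and
# return NULL so downstream code knows the run failed.
run_result <- tryCatch(
  my_model_run(),                      # hypothetical: your model-fitting code
  error = function(e) {
    my_subject   <- "Model run FAILED"
    message_text <- paste("Error in model run:", conditionMessage(e))
    # system(paste("echo -e \"", message_text, "\" | mutt -s \"", my_subject,
    #              "\" erehweb@madeupemail.com", sep = ""))
    NULL
  }
)
```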

So should you use R in a production system?  Well, it’s free, and quick to develop in, so go ahead, but definitely keep your eyes open.

27 Responses to “R in production systems”

  1. Ben Bolker Says:

    Counterargument about glm: it may be a mess internally, but it has been tested a lot more thoroughly than anything you write to replace it will be! For anything this important to your business you’d better start writing unit tests — then, if you do decide to replace it, you’ll be able to do so in a sane way.

    • erehweb Says:

      Thanks for the comment, Ben. I’ll agree that rewriting glm from scratch would be madness. But I’m more optimistic on being able to improve it via small alterations – I did this to correct a bug I found where glm would just not converge at all. Non-core code is even worse, I think. And yes, unit tests are a good idea.

      • Ben Bolker Says:

        Just out of curiosity: did you try to submit a patch/bug report? I will be the first one to admit that the barriers to bug reporting in R are *way* too high — this is probably my primary complaint about the administration of R.

      • erehweb Says:

        Ben, I did not try to submit a bug report. Mostly b/c there would be a fair amount of work involved in checking if the bug still existed in the latest version of R (I was on 2.10.1), creating a repeatable example, etc. Having listened in on the R-help list for a while, I was also not particularly eager to get involved with that process.

        But, if you’re interested, the bug is that sometimes glm’s fitting step will take you to a region far away from the solution, and then it will get lost and fail to converge. Interestingly, glm.fit does have a way of dealing with this if deviance becomes infinite: just halve the step and try again. But it didn’t apply the same logic if deviance merely increases from step to step – in particular, my problem was that the first step took me to a much worse place than the initial position. So I just added that check in.

        In defence of the authors of glm.fit, this was a pretty rare situation, and I believe there was a problem with the starting position I was using. Still, it happened, and I had to deal with it.

      • Ben Bolker Says:

        Hmm. I don’t want to be a jerk, but … I feel that there’s a bit of an implied social contract with open source software, that if you find it useful and are capable of contributing to its improvement (at whatever level) that you should try to do so. Of course you’re right that creating a good bug report is not trivial, and as I stated above I think that the R community has a lot of room for improvement in leveraging this potential community feedback, but how much time and money did you save by using R in the first place?

      • erehweb Says:

        Fair point, Ben. I may yet do it – it was a comparatively recent thing, so I’ll put it back on my to-do list. Other reasons / excuses for not doing it are:

        1) I don’t actually have a patch – what I have is a hack that works for me and probably causes some unforeseen problems (definitely has some foreseen ones).

        2) How likely is it that I have actually found a bug in glm.fit, some pretty heavily-used code? I think so, but a Bayesian would proceed with caution 🙂 And the fact that no-one seems to have mentioned it, AFAIK, makes me think that it can’t be hitting too many people.

        3) I kind of feel like this sort of thing is best left to the experts in generalized linear models and their fitting – academics.

        4) There’s a sense in which it’s not really a bug. So it fails to converge? Lots of things fail to converge. The replies I have seen on r-help for similar problems have suggested changing scale, using a better model, better initial conditions etc.

        But I’ll concede your point.

      • Ben Bolker Says:

        One last response (I promise to shut up after this 🙂 )

        1. Fair enough.
        2. “No one else has mentioned it”– but perhaps they have all kept quiet just as you have (up until now)!
        3. Academics are great at fixing things, but not at finding weird edge cases. They tend to stick to a narrow subset of kinds of data.
        4. Fair enough, but again: if the software can be made more robust and work better for more problems, that’s better. It’s at least worth exploring.

        Maybe posting the problem, even incomplete, will (1) encourage other reports (and give a sense of how widespread the problem really is); (2) leave a record in the mailing list archives for someone else to find later; (3) possibly get the attention of someone (R-core or not) who is interested/could fix the problem. Obviously I can’t control whether you will get scolded for posting an “incomplete” bug report, but if you think your self-esteem can handle it, I would strongly encourage you to put together whatever you can do with a reasonable level of effort and post it to r-devel (you can blame it on me if you like).

        This kind of situation is why I get so irritated when people are scolded on the R lists for not preparing a sufficiently good bug report. Getting stupid bug reports from people who didn’t read the documentation is really irritating, but dissuading people from even mentioning problems that they have had is just pathological. It’s pretty easy to tell the difference between a clueless newbie and a savvy user who has encountered a problem.

    • John Mount Says:

      Definitely nothing you implement yourself will be as thoroughly tested as a core-R package (like glm). But for even fairly popular 3rd party R libraries you really get the feeling you are the first one to call the code (obvious errors, incompatibilities with standard ways of manipulating data, no ability to get results out in machine readable format and so on).

  2. Homer Strong Says:

    Another useful habit is to set your R process to dump frames on errors. They’re tremendously useful in finding issues. Of course logs are also helpful, but I still like to be able to poke around the environment when the shit hits the fan. I use something like

    options(error=function() { dump.frames(to.file=TRUE) })

  3. Alex Guazzelli Says:

    Interesting thoughts. I have used R for quite some time now and have found it quite flexible and powerful. Like with any other software though, one should always test results before moving a model to production.

    In terms of production deployment, R may not be the best way, but it does offer support for PMML, the Predictive Model Markup Language. PMML is the standard used to represent predictive models and is supported today by all the major commercial and open-source statistical packages. The R PMML package, available for download from CRAN, supports the export of many different modeling techniques, including export for models built using glm. For more, please see the article we published about the PMML package in the R Journal:

    http://journal.r-project.org/2009-1/RJournal_2009-1_Guazzelli+et+al.pdf

    Once a model is exported into PMML, it can easily be moved around. For example, the ADAPA scoring engine consumes PMML models. Predictive models in ADAPA can be executed in real-time or batch mode via a Web Console, Excel, or web services. With a tool such as ADAPA, the operational deployment of models built in R is a walk in the park. Since ADAPA is also available as a service on the Amazon Cloud (besides a traditional on-site license), it is being used by people all over the world to get the most out of their data for as little as $1 per hour. For more, please make sure to check the slides of a presentation I gave to the R users group in the bay area:

    http://prezi.com/nyxnpa4kqbdo/predictive-modeling-with-r-pmml-and-adapa/

    or visit zementis.com

    Thanks!

    Alex

  4. Scott Locklin Says:

    I’m a big proponent of Homer’s trick of dumping frames. This is real important if your code takes a long time to execute. Nothing like wondering what went wrong in an hour long process, and trying to guess at the offending function using mtrace.

    While I agree with John that stuff like GLM is going to be better tested in general, I’ve gotten hosed by core functions before: stuff as basic as getReturns!

    One huge one you didn’t mention is maintaining your own distribution of R for production purposes. This is particularly important if you’re depending on packages, as they often change drastically without warning. I’ve been hosed on this one twice now. If you’re not doing this: start doing it!

    • Ben Bolker Says:

      Interesting that in this community ‘getReturns’ is considered a core function. (I’m an ecologist/biologist/epidemiologist/statistician and had never heard of it.)

    • erehweb Says:

      Scott – yes, good thought, packages often do change drastically – I have been burned by that once.

  5. Wayne Says:

    I think you’re taking his complaint a bit out of context. This is a guy who is a superb statistician, complaining about how hard (and tedious) it is to program something. Several of the things he complains about make the function MORE robust and flexible, not less. It’s just that he wants to focus solely on the statistical aspects (the 10 lines that do something) and have the rest handled for him.

    Not a crazy request, and originally part of a discussion on a language to replace R, but I don’t think it really supports your idea that by taking over his code you’re making it better. Particularly since most heavy-duty R code will go to C at some point anyhow, making your task a lot more complex than simply maintaining 20 or 30 lines of R. (Which, of course, no really useful R function is.)

    • erehweb Says:

      Wayne, thanks for your comment. Here’s the example that prompted me to take over bayesglm.

      In the version of bayesglm I was using, the control options (to set trace, tolerance etc.) simply did not work. They would get superseded by glm’s own default control options, so you couldn’t see what was going on from step to step, or change the tolerance. Like John Mount says above, it looked like I was the first one to try this code. So I took it over and fixed it.

      Now, maybe this bug in bayesglm has been fixed in subsequent versions, but keeping up-to-date with a constantly changing package has its own problems, as Scott says above.

      Agreed, the heavy-duty R code goes to something else – the real work of glm.fit is done in a one-line fitting step. I’d have no intention of touching that, but there are improvements that can be made around the edges.

  6. Top Posts — WordPress.com Says:

    […] R in production systems R is great for prototyping models.  Not so great if those models have to run a business.  Here’s some tips to help […] […]

  7. mpiktas Says:

    I must be using a different R, since I rarely encounter some of the problems described here. I started using R in 2002 and it is basically my 2nd main tool for earning my living (1st being my brain :)) and I have found 3 bugs in core R packages, 2 typos in font encoding files, and a bug in predict.lmList from the nlme package. Of course add-on packages have more bugs, but strictly they cannot be considered R, since for bug fixing you need to contact the package maintainers.

    For memory management R really sucks under Windows, but that is a problem of Windows memory management according to R team. Under Linux it works very nicely. Of course if you do not want to run into swap, you need to be careful with your code. R is prone to duplicating the variables, but with some care it can be avoided.

    As another poster said, the complaint you cite is from a famous statistician (Andrew Gelman), not a programmer. Yes, at first glance R code looks unnecessarily messy, but this is because the idea of R is that you pass data and formulas which are easy to understand and get understandable results. R acts as a gateway between the algorithms and your data. If you have already formatted your data and you need to fit only one model, yes, R has an overhead. But it was designed to help people create models, and that process entails trying out many alternative models. This is where this overhead is useful, since you do not need to write a lot of house-keeping code to try the models out.

    Here is an example. Suppose I have data with two variables, y and x. Here is R code for trying out 3 different models:

    df <- data.frame(y=rnorm(100),x=rnorm(100))

    lm(y~x,data=df)
    lm(y~log(x),data=df)
    lm(log(y)~log(x),data=df)

    Now do this with lsfit.

    X <- df$x
    y <- df$y

    lsfit(X, y)

    X <- log(df$x)

    lsfit(X, y)

    y <- log(df$y)

    lsfit(X, y)

    I have to prepare my data each time and I do not get any of the additional features from summary.lm. On the other hand if I am interested in the coefficients only and I already know the model I want to fit, using lm is overkill.

    And to end my lengthy reply: it might be a good idea to give more context when you complain about something. As it stands, your post implies that R is not suited to running any business, when in fact it is not suited to running yours. It does run my business just fine. Business is a very vague term.

  8. erehweb Says:

    Thanks for your comments, mpiktas. Without getting into definitions too much, I think one really has to use add-on packages to get value from R.

    R’s memory management problems under Windows may well be Windows’ fault. That doesn’t really matter to the end user – I think both of us say that using Linux is the way to go with R. I’d still say you need a lot of memory, having run into memory problems even on Linux.

    I’ll note that when Gelman writes a package, he *is* a programmer. My point in the post is that even if you can’t improve on the statistics, you can often improve on someone else’s programming. See also my reply to Wayne above, talking about the bug in bayesglm.

    As for more context, I’m basically thinking of a case where you have finalized some models, and R code is automatically recalibrating them from new data every so often.

  9. Paul Martin Says:

    When I read the title, the application which occurred to me was quite different: rolling out SPC charts in a factory so that they are accessible to every operator at every tool. This is notoriously expensive to do with commercial statistical packages which go for $500 a seat and up. You are also using 1% of the package without any discount. This is an ideal application for R. I wonder to what extent R has penetrated the industrial quality control community.

    • Kirk Mettler Says:

      Paul-

      I am sad to say I have seen little use of R in factories. For 15 years before getting involved with R, I ran manufacturing companies. I would have loved to have a tool like R on the floor! Maybe some day.

  10. Things I wish I’d known before I started using R « Erehweb’s Blog Says:

    […] a lot of useful resources out there.  It’s my blog, so I’m going to point you to my “R in production systems” post, but John Mount and Nina Zumel’s posts on R are an excellent read, particularly Survive […]

  11. datanalytics » Consejos para utilizar R “en producción” Says:

    […] otro día di con una entrada en una bitácora con cinco consejos para utilizar R en producción. Cuatro de ellos son […]

  12. None Says:

    Thanks for the comment about sink! Before reading this, I hadn’t managed to get both output and warnings into the same file.

  13. None Says:

    However it seems that you can use
    file_suffix <- format(log_time,"%m%d%y_%H%M")
    instead of your complicated paste (not sure how well this works on other locales)
