No, shut up. What statistical programming languages can learn from Dropbox.

If you’ve ever worked with business users (people with an MBA, or who might as well have one), you’ll know that they want just two things: pivot tables, and graphs.  And the graphs always have to have two different scales for the y-axis, for some reason.  You’ll want a third thing – to read in and parse data.  So why is it so hard to do all of these in any one statistical programming language?

I’ve picked on R for a while (and it actually gets a passing grade with the plyr and doBy packages), so let’s look at Python and pivot tables.  This page shows how to create summary tables in Python – the key lines to define a summary function are:

from itertools import groupby
from operator import itemgetter

def summary(data, key=itemgetter(0), value=itemgetter(1)):
    """Summarise the supplied data.

    Produce a summary of the data, grouped by the given
    key (default: the first item), and giving totals of
    the given value (default: the second item).

    The key and value arguments should be functions which,
    given a data record, return the relevant value.
    """
    # groupby only merges adjacent records, so data must already be sorted by key.
    for k, group in groupby(data, key):
        yield (k, sum(value(row) for row in group))

What a mess. [*]  Obscure auxiliary functions?  Check.  Generator expression?  Check.  Iterator object?  Check.  Overcomplicated, overabstracted, and hard to understand?  Check, check and check.  (And see the comments on the original link for an extension to a .csv file, which needs lambda functions.)  This recipe extends it to multiple keys, with an example of use that looks like this:

for (region, city), total in summary(sales, key=set_keys(0, 1),
                                     value=itemgetter(3)):
    print("%-10s  %-10s : %8d" % (region, city, total))
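For anyone who wants to run the multi-key example, here is a minimal self-contained sketch. Two things in it are assumptions rather than part of the recipe as quoted: the `sales` rows are made up for illustration, and `set_keys` is not defined anywhere above, so the version here is a guess at its behaviour (a thin wrapper around `operator.itemgetter`). This variant of `summary` also sorts the data before grouping, because `itertools.groupby` only merges adjacent rows.

```python
from itertools import groupby
from operator import itemgetter


def set_keys(*indices):
    """Hypothetical helper: build a key function from column positions."""
    return itemgetter(*indices)


def summary(data, key=itemgetter(0), value=itemgetter(1)):
    """Yield (key, total) pairs, summing value(row) within each key."""
    # groupby only merges adjacent rows, so sort by the same key first.
    for k, group in groupby(sorted(data, key=key), key):
        yield (k, sum(value(row) for row in group))


# Made-up rows: (region, city, product, amount)
sales = [
    ("North", "Leeds", "widgets", 120),
    ("South", "Dover", "widgets", 75),
    ("North", "Leeds", "gadgets", 30),
    ("North", "York", "widgets", 50),
]

totals = list(summary(sales, key=set_keys(0, 1), value=itemgetter(3)))
for (region, city), total in totals:
    print("%-10s  %-10s : %8d" % (region, city, total))
```

With these rows the two Leeds records are merged into a single line, which is the whole of the "pivot" step.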

Why so convoluted? It’s not as if this is a new problem – SAS basically solved it in the 70s – compare the use of PROC MEANS (example adapted from “The Little SAS book”). Even if you don’t know SAS, you can make a pretty good guess at what it’s doing.

PROC MEANS NOPRINT DATA = sales;
    BY CustomerID;
    VAR Petunia SnapDragon Marigold;
    OUTPUT OUT = totals MEAN(Petunia SnapDragon Marigold) =
       MeanPetunia MeanSnapDragon MeanMarigold
       SUM(Petunia SnapDragon Marigold) =
       TotPetunia TotSnapDragon TotMarigold;
RUN;

(OK, I’m partly comparing the apple of a function definition with the orange of a function use – so what? The point is that the base SAS function is already defined, so you don’t have to worry about it, and that 30 years of “progress” in language development has given us something that is much less flexible and easy to use.  Don’t get too smug though, SAS – you’re expensive, and your graphing is awful.)
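For a closer like-for-like comparison, here is a rough sketch of what the same by-group means and totals might look like in plain Python. It is only an illustration under stated assumptions: the `sales` rows are invented, `group_stats` is a hypothetical helper name, and the output is a nested dictionary rather than a SAS data set. It makes no claim about how PROC MEANS works internally.

```python
from collections import defaultdict

# Made-up rows standing in for the `sales` data set.
sales = [
    {"CustomerID": "A1", "Petunia": 10, "SnapDragon": 4, "Marigold": 6},
    {"CustomerID": "B2", "Petunia": 3, "SnapDragon": 9, "Marigold": 0},
    {"CustomerID": "A1", "Petunia": 20, "SnapDragon": 6, "Marigold": 2},
]


def group_stats(rows, by, variables):
    """Hypothetical helper: per-group mean and total for each variable."""
    # Collect the rows belonging to each value of the grouping column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row)

    result = {}
    for group_key, members in groups.items():
        stats = {}
        for var in variables:
            values = [row[var] for row in members]
            stats["Tot" + var] = sum(values)
            stats["Mean" + var] = sum(values) / float(len(values))
        result[group_key] = stats
    return result


totals = group_stats(sales, "CustomerID", ["Petunia", "SnapDragon", "Marigold"])
for customer in sorted(totals):
    print(customer, totals[customer])
```

The helper is about fifteen lines that every analyst would have to write, test and remember, which is roughly the complaint: the SAS version needs none of it.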

Where does Dropbox come in?  Recently on Quora somebody asked “Why is Dropbox more popular than other tools with similar functionality?” The winning answer, by Michael Wolfe, was:

Well, let’s take a step back and think about the sync problem and what the ideal solution for it would do:

• There would be a folder.
• You’d put your stuff in it.
• It would sync.

They built that.

Why didn’t anyone else build that? I have no idea.

“But,” you may ask, “so much more you could do! What about task management, calendaring, customized dashboards, virtual white boarding. More than just folders and files!”

No, shut up. People don’t use that crap. They just want a folder. A folder that syncs.

“But,” you may say, “this is valuable data…certainly users will feel more comfortable tying their data to Windows Live, Apple Mobile Me, or a name they already know.”

No, shut up. Not a single person on Earth wakes up in the morning worried about deriving more value from their Windows Live login. People already trust folders. And Dropbox looks just like a folder. One that syncs.

“But,” you may say, “folders are so 1995. why not leverage the full power of the web? With HTML 5 you can drag and drop files, you can build intergalactic dashboards of stats showing how much storage you are using, you can publish your files as RSS feeds and tweets, and you can add your company logo!”

No, shut up. Most of the world doesn’t sit in front of their browser all day. If they do, it is IE 6 at work that they are not allowed to upgrade. Browsers suck for these kinds of things. Their stuff is already in folders. They just want a folder. That syncs.

That is what it does.

Memo to designers of statistical programming languages:  You may say “What about tuples, lambda functions, generator objects?”  No, shut up.  People don’t use that crap.  But they do want pivot tables, graphs, and to be able to read in and parse data.

[*] I should be clear that my complaint is with Python rather than the code as such.

30 Responses to “No, shut up. What statistical programming languages can learn from Dropbox.”

  1. BAU Says:

    Dude, get over yourself, what you are doing is playing with data.

    The real statistical computing is much more convoluted; in fact, real statisticians need to call C functions from Python via SWIG and whatnot.

    Just use Excel.

  2. Anon Says:

    Your example was downright terrible. The SAS code looks (at least) as complicated as the Python code.

  3. ELM Says:

    I’m with BAU on this one. R and Stata are used by PhD researchers in Economics, Mathematics, Statistics, Political Science and so forth, and none of them care about pivot tables or graphs.

  4. asdf Says:

    I love that your “Messy” python is two lines of logic. Two lines.

  5. Alessio Says:

    I couldn’t agree more. I have been doing a lot of thinking about this and my final outcome is that there is nothing out there that makes decent “business” reporting.

    People use Excel, but that is just because there is no other option. And the time wasted in doing the same stuff again and again, instead of using a script, could be easily reduced by a graph/table/easy script/ centered solution.

    I tried to develop this myself, but I am a bad designer. With Qt and Python, however, it should not be impossible to start something decent that can eventually grow.

    Ps Bau, you totally missed the point.

  6. Andrew Says:

    I agree with the sentiment re: simplicity, but I’m not sure it’s appropriate to compare a polished consumer-facing product like Dropbox with a language designed for extensibility… It’s like admiring the nice patterns a snake leaves in the sand, then complaining about your dog not leaving the same.

    • erehweb Says:

      Thanks for your thoughts, Andrew. I would say that the language designed for extensibility should ideally have a polished core.

  7. ggruschow Says:

    Why do you think your very specific problem is best solved with a programming language?

    • erehweb Says:

      The problem is very specific, but also very common. Programming languages have lots of advantages – can do other analysis, reuse the script etc. etc.

  8. euromix Says:

    I agree it’s hard to give business customers what they want. They want very, very simple tools that hide the underlying complexity.

    They want a database that gives reports as problems fall on their heads, and they don’t care at all if it’s consistent with the database schema or the rules they requested before for something else (as it’s for something else, isn’t it?).

    To balance this and show I still care for them nevertheless, I can say that software designers are not any better when asking an architect to design their house, a lawyer to design their legal protection, or a teacher to deal with their kids (as far as I could see).

    🙂

  9. Scott Locklin Says:

    R sucks at usability and documentation. It’s one of the worst things I’ve ever used in those regards, and I use an obscure machine learning Lisp written by 2 Frenchmen and a German (which is enormously better in both usability and documentation, demonstrating to me at least that R’s badness is completely unnecessary).
    There is Stata and SPSS; I hear they’re OK for this sort of thing.

  10. Alex Farquhar Says:

    I’ll just pitch in on R’s side – I agree that the standard R is a bit of a nightmare, and the documentation is…opaque, to say the least. I was about to give up on it until I got into reshape, plyr and ggplot2. They make the whole thing awesome, and I’ll never be going back to Excel. And as for business users only wanting pivot tables and charts, maybe that’s because it’s what they’ve been trained to expect?

  11. Adam Bard Says:

    I don’t think anyone’s ever claimed that python was a statistical programming language. It’s a general purpose language that happens to have tools that let it do so.

    Besides, as someone that’s pretty familiar with python, I have no problem reading that code. Many of the constructs used are done for efficiency, not readability, and are really meant for the python-centric. Besides, itertools is to a python enthusiast as a fixed-gear bike is to a Brooklyn hipster.

    Here’s what someone who didn’t know about list comprehensions, itertools, or lambda functions might write to do the same thing: https://gist.github.com/839705

    • erehweb Says:

      Thanks for the comment and the code, Adam. Well, people have claimed that it is a good idea to do statistical analysis in Python, and offered this as a good alternative to R. So that seems enough like a statistical programming language to me.

      Your code is definitely simpler, but I think would have problems generalizing to something like standard deviation – there I think you would want to use Python’s constructs. Partly your point? 🙂

      My main point is that creating a summary table is such basic functionality that it should be included in the core language, and in a very simple way.

      • Adam Bard Says:

        I disagree that statistical tools should be included in the core language. Python has pretty decent package-management tools (easy_install and pip) that make installing libraries painless, and there are definitely well-documented and very useful libraries out there.

        For example, NumPy (and its big brother SciPy) is a python library that goes a long way into making python a reasonable MATLAB replacement, and segues nicely into a solution for you:

        import numpy; numpy.std(data) # Do a standard deviation.

        I think someone’s oversold you on python. It’s a programming language — a scripting language even, comparable to ruby or perl or what-have-you — not a computational software package.

        Pivot tables are not something that everybody needs, and just maybe, python isn’t the right tool for you for this job.

  12. Mike Says:

    The funny thing is, I can take the answer you copied from Quora and change DropBox to Microsoft Excel and the answer stays the same.

    People are so hell-bent on replacing Microsoft Excel, but Microsoft Excel WORKS. I’ve been working with users for over a year now who do their ENTIRE job in Excel. NOTHING you build can replace the flexibility they get unless you rewrite Excel.

    What about a real programming language? No, shut up. Users care about formulas. They care about pivot tables.

    But don’t they want to see it on the web? No, shut up. They want to copy. Paste. Save. Print. And see pretty graphs. That’s IT.

    But… no. Shut up. I hate to say it, but Microsoft simply won’t be supplanted in the spreadsheet space because Excel really is a superior product. OpenOffice bites, Google’s spreadsheet solution is ok, but too limited, and Numbers is a toy. Excel uber alles.

  13. jhasdlkjfhas Says:

    I use tuples, anonymous functions and maybe not so much generators.

    My problem with R is the mess of mapply, sapply and apply and the data structures which are hard to convert into one another. It’s a fucking mess.

  14. Christian Gunning Says:

    I’m shocked to see the hate! I’ve used R daily for years and I love it. It gets better every year. The documentation for core R is great once you learn to read it (user-contributed packages are another story). Extending R via C or C++ is now easy, thanks to the Rcpp and inline packages. In short, R is a *programming language*.

    “R and Stata are used by PhD researchers in Economics, Mathematics, Statistics, Political Science and so forth, and none of them care about pivot tables or graphs.”

    Are you craaaaazy? Have you ever read a peer-reviewed paper? Have you read more than one, such that some included figures and some didn’t? Then you might have noticed that figures are *fundamental* to the scientific process. If not, do please consult Tukey.

    “Pivoting” is just a catchy way of saying **marginalize over a variable**.
    With the R apply function, and more recently the plyr package, I can write *readable* one-liners to compute pivot matrices on complex data structures. Who ever needed to know the mean and sd weight of all samples in an experiment by sex, for example?

    “I’ve been working with users for over a year now who do their ENTIRE job in Excel. NOTHING you build can replace the flexibility they get unless you rewrite Excel.”

    I’ve graded papers from students that use Excel, and it defaults to _ugly_. I’ve tried to open .csv files with more than a million rows: no dice. I don’t know what the limit is now, but I’m guessing it’s less than a billion rows? Not my definition of flexible.

    R is a programming language. It assumes a level of maturity and curiosity in order to make the hard possible.

  15. Reg Says:

    A) Python absolutely does not claim to be a statistical programming language. Or to be specific to any one domain, for that matter. Python sets out to be a beautiful and well-documented *general* purpose (Turing complete) computing language that one can also use, if so inclined, to do statistical modeling.

    B) If you have the wherewithal to want to use Python for statistics, you almost certainly have no problem reading that code snippet above and understanding it.

    C) Once the complexity of the task moves beyond basic summary tables, SAS is a black-box, pain in the ass, expensive, and poorly documented example of classic vendor lock-in. The only reason it is still so popular is because it is required almost by law in the pharmaceutical, insurance, and banking industries. The best combination is to use SAS or (much better) SQL to manipulate and query data then do the hardcore analysis in Python or (much better) R.

    D) For the specific class of users you mention, Microsoft Excel has that market cornered and covered. And for good reason. For a wide range of tasks, with small (less than 50,000 observations) data sets nothing beats Excel.

    • erehweb Says:

      Thanks for your thoughts, Reg:
      A) See my comment to Adam Bard above.

      B) Sure. And you can also work your way through multiple keys, different functions, etc. But why should you have to do the work? If you were just given the sine function, you could recreate cosine too.

      C) Agreed that SAS is hard to use for more advanced tasks. But the dirty little secret of statistical analysis is that there’s a lot of data cleaning, summary, simple regression, … tasks at which SAS is pretty good. Again, wouldn’t it be better if it were easier to do the data prep / summary stage in Python or R?

      D) Yes and no. The problem is that larger datasets are becoming more common. And if you’re a statistical user who still needs to produce pivot tables etc. for less technical end-users, it would be nice to do this all in the one language, rather than skipping from one to the other.

      • Reg Says:

        Knife, fork, spoon, chop-sticks….you use each when it is appropriate. What you want is a spork. Funny those haven’t really caught on either.

  16. ggruschow Says:

    “The problem is very specific, but also very common.”

    Right..

    When you get a new computer, you probably don’t write a new OS.

    When you browse a new web site, you probably don’t write a new browser.

    When you run into this very specific, very common problem again though, you write a new program?

    “Programming languages have lots of advantages”

    Advantages over what? What are the other choices?

    • erehweb Says:

      Not sure I follow you. Creating a pivot table is a pretty common problem. Let’s say I just used this code for it. Then I’d need to do something slightly different for multiple keys, or different summary functions, …

      The big other choices would be Excel or SQL.

  17. no comprehende Says:

    As RH posted on the referenced page, you’re better off doing just

    for k, group in groupby(data, key):
        yield (k, sum(imap(value, group)))

    Complain all you want but that’s pretty concise to me.

  18. Edward Says:

    @Erehweb, what about using Octave to do all of this?

  19. Robert Says:

    It seems to me your Python example is using features that are advanced and completely optional, like “yield” and generators, which is not entirely fair.

    Python has an unusual combination of usability (good learning curve, good readability) and expressive power, a combination that is very difficult for language designers to pull off.

    It’s unfortunate that the statistical libraries that are compatible with Python are not getting more attention. R seems to be sucking a lot of oxygen away from other projects.

    Robert

    • erehweb Says:

      Thanks for your comment, Robert. Adam points out that you can do the same task more simply. Not sure about “not entirely fair” – I was trying to figure out how to do pivot tables in Python and this is what I found, so it’s not an example just to bash Python. I do like a lot of what’s in Python. R probably wins out in the stat field as it has a lot of stuff built in, although I’m not an R cheerleader by any means.

  20. jerzysblog Says:

    The Dropbox example is cute … but irrelevant to what seems to be erehweb’s point: “wouldn’t it be better if it were easier to do the data prep / summary stage in Python or R?”

    The Dropbox story illustrates how a product can succeed because it ONLY does one thing and it does that one thing well… That may be true, but it has nothing to do with why R or Python do or do not succeed. They are NOT meant to do one thing — they are meant to be flexible enough to do many things. If you ONLY need pivot tables and graphs, Excel already exists, so there you go. For my work, Excel and SAS won’t do the job at all, so the question is which more-advanced tool (or set of tools) will let me get the advanced analysis done at all, even though it’s more trouble than Excel.

    However, do the basics HAVE to be more trouble than Excel? I agree that it’d be awesome if R and Python’s core libraries ALSO made this basic stuff intuitively easy. The more interesting question to ask here is, Are there any downsides to adding this functionality to core R and Python? Is there any reason why a “language designed for extensibility” can’t have a “polished core”? Maybe someone who knows more about designing programming languages could answer that question.

    (Or, equally valid: could the R documentation make it easier for novices to learn these simple tasks? I think that core R’s apply() and by() functions are easy enough already. But I don’t know about Python.)
