Culture, Statistics, and Society: R

Thursday, May 10, 2012

90+ Two-Minute Videos on R

I highly recommend Anthony Damico's excellent two-minute videos on programming in R. You can find the full list of 90+ videos here. This is the first of the series, which tells you how to download and install R:

More generally, Anthony's video collection is another reminder of the immense sociological benefits that come from sharing educational materials and expert knowledge in the style of the Khan Academy.

Sunday, May 06, 2012

I've Converted to R Full-Time

I've been using both R and Stata for over four years, but as of last week I've become an R convert. For several years I conducted statistical analyses in R (since many complex models can only be programmed in R), but I used Stata before and after the analyses. In essence, I'd merge and clean data sets in Stata, call R from Stata for the statistical analyses, export R objects into Stata, and then use Stata's graphics utilities to display the results. This setup quickly unraveled last month when I began merging and recoding data in R, a task made much easier by John Fox's fantastic "car" package.
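To give a flavor of what merging and recoding look like in R, here is a minimal sketch under stated assumptions: the two toy data frames and their column names are hypothetical, and the example sticks to base R (`merge()` and `cut()`) so that it runs without installing anything. The "car" package's `recode()` function offers a more compact recoding syntax, but it is not used here.

```r
# Hypothetical toy data: two sources keyed by respondent id
survey <- data.frame(id = c(1, 2, 3, 4),
                     educ_years = c(10, 12, 16, 20))
income <- data.frame(id = c(2, 3, 4, 5),
                     income = c(31000, 52000, 78000, 40000))

# Merge on the shared key; all.x = TRUE keeps every survey respondent
# and fills income with NA where there is no match
merged <- merge(survey, income, by = "id", all.x = TRUE)

# Recode years of education into categories with base R's cut();
# intervals are closed on the right: (-Inf, 11], (11, 15], (15, Inf)
merged$educ_cat <- cut(merged$educ_years,
                       breaks = c(-Inf, 11, 15, Inf),
                       labels = c("less than HS", "HS/some college", "BA or more"))

print(merged)
```

The merged data frame, the recoded variable, and both original sources all remain available in the same session, which is the main convenience over shuttling files between programs.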

The problem is that if you want to do Bayesian analysis or graph modeled coefficients (or work with complex data structures more generally), then R is much easier than Stata due to the object-oriented programming environment. It's unbelievably liberating to be able to save vectors, matrices, data frames, and so on from multiple data sources and manipulations in the same conceptual space. Additionally, R has fantastic graphics capabilities (3-D plots, rotating hyperplanes, social network graphs, and so on), offers excellent tools for analyzing and displaying so-called big data (for example, check out the "tabplot" package), and is (frankly) a fun, intuitive programming language. If you need additional reasons to be an R convert, keep in mind that R is completely free, open-source, and extensible, with over 5,300 statistical packages (as of April 2012).

Sunday, March 25, 2012

Popularity of Programming Languages

As you can see, R is relatively popular (but more so on StackOverflow than GitHub):

For the original graph, click here. This scatter plot is a reminder that R is useful to learn not only for statistical modeling (since there are so many excellent packages available), but also as a way to become familiar with programming more generally.

Tuesday, March 13, 2012

Scatter Plot Matrix in R

Stata has a large number of graphics capabilities (and I highly recommend Stata over other statistical packages for a variety of reasons), but in a few instances R is more useful. In particular, I find R useful for creating beautiful scatter plot matrices and 3-D graphical displays. To my knowledge, these kinds of graphics are currently very difficult (if not impossible) to create in Stata 12. What I like about scatter plot matrices is that they can have a high data-to-ink ratio, packing together fitted lines, scattered data, histograms, correlation coefficients (printed at a size proportional to their magnitude), and statistical significance "stars" (since reviewers seem to like them). Moreover, I like that all this information puts the "stars" associated with statistical significance in appropriate context: there is an incredible amount of variability in the size of the correlations and the distribution of the data among all the "three-star" correlations, underscoring the limited usefulness of statistical significance as a tool for understanding the social reality given to us by data.
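A scatter plot matrix along these lines can be sketched in base R with `pairs()` and custom panel functions. This is a minimal sketch, not the code behind the original figure: the built-in `mtcars` data, the helper names `sig_stars`, `panel_hist`, and `panel_cor`, and the star cutoffs (0.05, 0.01, 0.001) are all assumptions made for illustration.

```r
# Map a p-value to conventional significance stars (cutoffs are an assumption)
sig_stars <- function(p) {
  if (p < 0.001) "***" else if (p < 0.01) "**" else if (p < 0.05) "*" else ""
}

# Diagonal panels: a histogram of each variable, rescaled to fit the panel
panel_hist <- function(x, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))
  par(usr = c(usr[1:2], 0, 1.5))
  h <- hist(x, plot = FALSE)
  n_breaks <- length(h$breaks)
  heights <- h$counts / max(h$counts)
  rect(h$breaks[-n_breaks], 0, h$breaks[-1], heights, col = "grey80")
}

# Upper panels: Pearson correlation with stars, text size growing with |r|
panel_cor <- function(x, y, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))
  par(usr = c(0, 1, 0, 1))
  test <- cor.test(x, y)
  r <- unname(test$estimate)
  label <- paste0(formatC(r, format = "f", digits = 2), sig_stars(test$p.value))
  text(0.5, 0.5, label, cex = 0.8 + 2 * abs(r))
}

# Lower panels: scattered data with a lowess smooth (base R's panel.smooth)
pairs(mtcars[, c("mpg", "disp", "hp", "wt")],
      lower.panel = panel.smooth,
      upper.panel = panel_cor,
      diag.panel = panel_hist)
```

Scaling the text by the absolute correlation is what makes weak but "significant" correlations look visibly small next to strong ones.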

Saturday, March 03, 2012

R versus Stata Redux

I've used both R and Stata for a long time, but these days I use Stata much more frequently than R. While R is useful for some kinds of graphics (especially three-dimensional graphics) and some statistical procedures (for example, finite mixture models), in general I prefer Stata as the go-to statistical program. The reasons are clear: Stata has superior help files for almost all ado files, Stata graphics are excellent (even contour plots are available in Stata), cleaning data is a breeze in Stata but awkward in R, labeling data is much more efficient in Stata (in fact, as far as I can tell R does not allow for labeling variable names, while Stata allows for labeling levels of a variable, the variable itself, and the data set), and for many procedures Stata's syntax is much more parsimonious than R's.
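On the labeling point, base R does have rough analogues, although they are less integrated than Stata's labels. The following is a minimal sketch under stated assumptions: the toy data frame `d` and its contents are hypothetical, and the attribute name "label" is only a convention (one used by packages such as Hmisc), not something base R treats specially.

```r
# Hypothetical toy data
d <- data.frame(sex = c(1, 2, 2, 1), age = c(34, 51, 27, 45))

# Value labels: a factor maps the underlying codes to level labels
d$sex <- factor(d$sex, levels = c(1, 2), labels = c("male", "female"))

# Variable label: stored as an ordinary attribute named "label"
attr(d$age, "label") <- "Age in years at interview"

# Data set note: comment() attaches a note that print() does not display
comment(d) <- "Toy survey extract"

print(levels(d$sex))
print(attr(d$age, "label"))
print(comment(d))
```

One caveat: attributes set this way can be dropped by ordinary operations such as subsetting the vector, so they are more fragile than Stata's labels.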

Yet, R is worth learning because the 3-D graphics available are often extremely useful for exploring the data, and there will certainly be cases in which R will have statistical procedures that are unavailable or cumbersome in Stata (Bayesian analyses and finite mixture models come to mind, for example).

Wednesday, December 30, 2009

R in the NYT

The statistical package R received a positive overview in the New York Times recently.

Tuesday, December 29, 2009

Top Ten Must-Have R Packages for Social Scientists

The political scientist Drew Conway has come up with a useful list of his ten "must-have" R packages for social scientists. I agree with him for the most part, and his list highlights the usefulness of R (vis-a-vis Stata) for social network analysis (see statnet/igraph) and graphics (see ggplot2). In some respects, his list also underscores the fact that R is arguably more suited for sociological data analysis than Stata, given the former's unique packages not only for social network analysis but also multilevel modeling and a variety of non-parametric methods (including more recent forms of matching and classification techniques), which were especially popular in sociology before the "path analysis" revolution of the 1960s.

Tuesday, December 22, 2009

Multilevel and Longitudinal Modeling in Stata

For my "off-task" reading I recently perused an excellent book on multilevel and longitudinal modeling in Stata by Sophia Rabe-Hesketh and Anders Skrondal. The second edition (which I read) has been updated with several chapters providing an overview of regression modeling and ANOVA (analysis of variance) as well as additional background information on models with nonlinear outcomes (e.g., logistic regression). The authors even include a self-test near the beginning of the book to ensure that readers can confidently progress through the rest of the material. The book has many great features, including ease of data accessibility (simply go to this website and you instantly have all the datasets used in the book), clarity of presentation, and numerous applied examples with accompanying Stata code. The only problem, which is not a problem with the book, is that multilevel modeling in Stata (as the authors note) can be rather slow, especially for nonlinear outcomes with many levels. (For this reason, other statistical packages, such as R, may be preferable to Stata when modeling nonlinear outcomes.) Yet overall the book is an excellent overview of an important class of statistical models, and can even be viewed as a way to take advantage of Stata beyond the realm of "econometric" approaches (which seems to be Stata's strength) and toward the realm of putatively more "sociologic" methods of data analysis, in which clustered data are viewed as something important in their own right rather than as statistical nuisances.