Culture, Statistics, and Society: 2012

Thursday, May 31, 2012

Using Indirect Survey Techniques to Measure Zombie Outbreaks?

Zombies are now a common topic of discussion. In fact, the data we have available from Google Trends (for the phrase "zombie attack") strongly suggest an increasing risk of zombification across the world:

However, academic research on zombies is limited (i.e., non-existent), mainly because of the lack of high-quality data. For readers interested in studying zombies, I recommend Andrew Gelman's paper (co-written, apparently, by the great zombie film director George Romero) on how to measure zombie outbreaks via indirect survey techniques. You can find his article here. Even if you're not interested in zombies, his paper offers some good ideas on how to sample difficult-to-reach populations more generally.

Friday, May 11, 2012

The Promising Future of Mathematical Sociology

I'm now an occasional blogger at Permutations, the official blog of the Mathematical Sociology Section of the American Sociological Association. You can read my blog post here, in which I outline why I think global trends in information technology and the meta-theoretical foundations of sociology provide conditions for a promising future for sociology in general and mathematical sociology in particular.

Thursday, May 10, 2012

90+ Two-Minute Videos on R

I highly recommend Anthony Damico's excellent two-minute videos on programming in R. You can find the full list of 90+ videos here. This is the first of the series, which tells you how to download and install R:

More generally, Anthony's video collection is another reminder of the immense sociological benefits that come from sharing educational materials and expert knowledge in the style of the Khan Academy.

Tuesday, May 08, 2012

Global Online Conference on Statistics

The Consortium for the Advancement of Undergraduate Statistics Education is hosting a global online conference titled "eCOTS: Electronic Conference on Teaching Statistics." You can view the full program here. It only costs $15 to register and participate in the online conference. For at least the past five years I've thought that conferences are obsolete in many respects, so I'm delighted to see this conference developed. By not having a physical place, with food, beverages, and equipment, not to mention lodging and transportation costs, the costs of attendance are much lower, thus enabling more and more people to learn and contribute to knowledge production. (Of course, we'll still want some conferences for face-to-face socialization!)

Sunday, May 06, 2012

I've Converted to R Full-Time

I've been using both R and Stata for over four years, but as of last week I've become an R convert. For several years I had conducted statistical analyses in R (since many complex models can only be programmed in R), but I used Stata before and after the analyses. In essence I'd merge and clean data sets in Stata, call R from Stata for the statistical analyses, export R objects into Stata, and then use Stata's graphics utilities to display the results. This setup quickly unraveled last month when I began merging and recoding data in R, which is much aided by John Fox's fantastic "car" package.

The problem is that if you want to do Bayesian analysis or graph modeled coefficients (or work with complex data structures more generally), then R is much easier than Stata due to the object-oriented programming environment. It's unbelievably liberating to be able to save vectors, matrices, data frames, and so on from multiple data sources and manipulations in the same conceptual space. Additionally, R has fantastic graphics capabilities (3-D plots, rotating hyperplanes, social network graphs, and so on), offers excellent tools for analyzing and displaying so-called big data (for example, check out the "tabplot" command from Google), and is (frankly) a fun, intuitive programming language. If you need additional reasons to be an R convert, keep in mind that R is completely free, open-source, and extensible, with over 5,300 statistical packages (as of April 2012).

Friday, May 04, 2012

Complex Sociotechnical Systems

In a fascinating, informative talk, the interim director of the Engineering Systems Division at MIT makes the case for a new field of study on complex sociotechnical systems. I ask a question near the end of the video, pointing out that the core concepts of the proposed new field are in fact those endemic to sociology: mixed methods, open systems, social change, and so forth. You can watch the full video here.

Saturday, April 21, 2012

Feynman on Curiosity

This video is one of the most effective advertisements I've seen for the value of gathering and systematizing empirical knowledge, by none other than the late Richard Feynman:

Also, since you are probably wondering: the music is Primavera by Ludovico Einaudi.

Tuesday, April 17, 2012

The Future of the Academy in 2032

For a few years before he died, I helped the great sociologist Dan Bell use his computer, and as a result I got to know him very well. One thing I learned from him (besides the distinction between "criticism" and "critique") is the usefulness of prediction as an endeavor in itself (as opposed to explanation). In this spirit, I offer five predictions about the future of the academy in 2032:

First, despite opposition from many established institutions, there will be an enormous increase in open-source education. Classes on any topic will be available online for free, with lecture notes, videos, presentations, and chat services (with other students) available to anyone with a computer. Exemplars of this trend include MIT OpenCourseWare, Khan Academy, and videolectures.net.
Second, academic publishing will be increasingly online, with peer review a continuous process. Rather than books and articles published at one time in paper form after a process of peer review, academic projects will be ongoing, process-oriented, available online, and subjected to a continual process of peer review. In essence, everything that academics produce will be works-in-progress, and updated when errors are noted. Early indications of this trend include the NBER archive and arxiv.org.
Third, due to technological changes and increased monitoring of people's activity, academics will have to be adept at managing and analyzing big data. Common statistical methods will often be difficult to use on such large data sets, straining the computational capacities of computers. While not common in the academy yet, big data is one of the top buzzwords of 2012, and I expect this to spread to academic work relatively soon. An exemplar of this kind of academic work is the Google ngrams project. (One danger, however, is that private corporations might be hostile to information-sharing, and the values of profit-making may severely inhibit the availability of big data to academics.)
Fourth, big ideas will actually be in greater demand in the future. Precisely because there will increasingly be an excess of information, grand theories and master narratives will be increasingly desired to help guide attention, avoid fragmentation of different research traditions, and unify otherwise disparate theories. For example, Josh Tenenbaum's efforts at unifying artificial intelligence (which suffers from disciplinary fragmentation) with probabilistic graphical models are a promising endeavor.
Finally, the skills in demand will be increasingly modular rather than topical. For example, as part of the Cold War in the 1960s, the United States government funded various "area studies" programs to educate Americans on the traditions, customs, and practices of various geographic regions around the world. In the future, there will be less emphasis on this kind of topical knowledge, and greater emphasis on modular skills such as critical analysis of any kind of texts or arguments, understanding the basic structures of any set of languages, and gathering and analyzing various kinds of qualitative and quantitative data.

To the extent any of these predictions are correct, sociology is particularly well-suited to take advantage of these trends. Sociologists are generally supportive of the democratic, inclusive principles of open-source education and online publishing, and sociology has an unparalleled tradition of big ideas. Moreover, modularity is ingrained in the discipline; in fact, sociology is almost by definition a modular discipline, inasmuch as sociology is an approach to a particular subject matter rather than a particular subject matter per se.

Wednesday, April 11, 2012

The Quantified Self

This site on the quantified self shows a small but growing revolution: using quantitative data for self-improvement. I can only expect this to grow in importance. Despite their popularity, means, modes, and medians (in their conditional variants as well) simply capture central tendencies, and there is nearly always substantial heterogeneity within and across populations. Accordingly, basic proscriptions and prescriptions, such as "Take an aspirin a day," may not apply to all individuals, and thus individual tracking is potentially extremely useful. For example, see Seth Roberts's blog post on how eating butter might improve cognitive functioning (for him, at the very least).

Misc. Lectures Online

I highly recommend the following lectures for anyone interested in social science research using quantitative methods:

The late Sam Roweis (a brilliant educator who died unexpectedly several years ago) gives a superb introduction to machine learning and probabilistic graphical models here, complete with lecture slides. In case you aren't aware, probabilistic graphical models are in effect a unifying approach to a wide range of statistical models, from hidden Markov models to hierarchical Bayesian models.
Salman Khan, the MIT graduate who started the eponymous Khan Academy, offers a superb series of lectures on probability, available here. Probability is actually the foundation for quantitative research in the social sciences, since much of the goal of inference is to quantify uncertainty through the use of probability distributions such as the Gaussian, Poisson, Gamma, and so forth.
Although it is aimed at programmers who use Python, the computer scientist Allen Downey's overview of Bayesian analysis is thorough, intuitive, and entertaining; you can view it in its entirety here.

Tuesday, April 10, 2012

Biplots in Stata

I've been examining qualitative data using biplots, which are readily available in Stata using Ulrich Kohler's excellent package. For example, here is a biplot of a rich data set of poor white men on variables such as drug use and other risk factors:

There are several useful features of biplots: first, they concisely summarize a wealth of information in one graph, including relationships among both cases and variables; second, in line with Tufte's dictum, biplots have a high data-to-ink ratio; third, since cases are not directly modeled, biplots help with integrating qualitative and quantitative data (i.e., cases are not "hidden" by a hyperplane, as in a classical linear regression model); finally, there are absolutely no frequentist statistics to deceive the analyst.

Wednesday, April 04, 2012

Tuesday, April 03, 2012

Making Books

This video makes me wonder how, although technology has innumerable benefits, some aspects of culture will be lost if we don't retain at least some working knowledge of older technologies:

Monday, April 02, 2012

The Limits of Formal Theory in Sociology

Sociologists and economists often disagree about the role of so-called "formal" theory in understanding social behavior. For the most part, sociologists are much more skeptical that mathematical models (with little reference to data) can clearly and accurately describe, explain, and predict how humans act, think, and feel. I take a middle-of-the-road position: such models of human behavior can be helpful for illuminating arguments, but often they are such crude approximations of reality that they can obscure what is actually going on. I'm reminded of Max Tegmark's brilliant article on the mathematical universe hypothesis, in which he claims that the universe is a giant mathematical structure. In fact, the disciplines can be understood in reference to derivations from known mathematical laws, as shown in this diagram:

The problem, as Tegmark suggests in this diagram, is that until we understand how to reconcile mathematically general relativity and quantum field theory, as well as how this reconciled theory is related to other fields in physics and related fields, mathematizing sociology will at best be a set of (possibly crude) approximations of reality.

Friday, March 30, 2012

Physics Envy

The NYT published an op-ed today by a pair of political scientists on "physics envy" by sociologists, economists, and political scientists. The authors mainly argue that theory can be useful even when it is wrong or unsupported by data, and briefly mention that data analysis is useful even if theoretical contributions are not obvious. I disagree with the former, but not the latter. For a similar view, see this post by the theoretical physicist Sean Carroll.

Thursday, March 29, 2012

Irving Louis Horowitz

The eminent political sociologist died a few days ago, according to an obit in the NYT. Long ago I read, and took seriously, his book The Decomposition of Sociology, in which he argues (essentially) for more empirical analysis and less left-wing politics in sociology. Reflecting on his book, I think he neglects a fundamental, possible cultural contradiction: to the extent social reality exhibits facts consistent with liberalism and inconsistent with conservatism, empirical analysis will result in more liberal than conservative belief systems (but not values, since those cannot be proven "right" or "wrong" by scientific analysis). For example, evidence is accumulating that economic inequality (which is of little concern to most conservatives in the United States) has numerous deleterious effects, thus forcing conservatives either to hold beliefs inconsistent with the evidence (i.e., inequality is unrelated to deleterious effects) or alter their values (i.e., it is a "good" thing to have high rates of violence, low social mobility, and so forth).

Wednesday, March 28, 2012

MyPersonality

I highly recommend this website for learning about your attitudes, values, beliefs, and overall personality.

Sunday, March 25, 2012

Why are Economists so (Consistently) Led Astray About Inequality?

Ed Glaeser, a conservative urban economist at Harvard, recently wrote a Boston Globe article titled Why income disparity in Boston isn't a bad thing. Glaeser is right that inequality increases in a city such as Boston can be due to selection effects, since poor people are moving into Boston for economic and cultural opportunities. Yet these selection effects (i.e., poor people moving into a geographic area in the hopes of upward mobility, which is generally considered a good thing) are drastically different from the observed outcomes (i.e., large disparities in people's wealth due to their social positions in a system of occupations, which is generally considered a bad thing). Glaeser nevertheless conflates the two, confusing the reader and, perhaps, himself. A more accurate title for the article would have been "Why poor people moving into Boston isn't a bad thing." This raises a question: why are economists so (consistently) led astray about the causes and consequences of economic, social, and political inequality?

Popularity of Programming Languages

As you can see, R is relatively popular (but more so on StackOverflow than GitHub):

For the original graph, click here. This scatter plot is a reminder that R is useful to learn not only for statistical modeling (since there are so many excellent packages available), but also as a way to become familiar with programming more generally.

Saturday, March 24, 2012

Big Science and Sociology

I highly recommend this video featuring Dirk Helbing, a sociologist and erstwhile physicist who is (along with others) attempting to create a CERN-like society-simulating project for the social sciences by combining information from large data sets with simulated models of complex social systems:

Thursday, March 22, 2012

Statistical Lexicon

Anyone doing statistical analysis (or contemplating it) should read Andy Gelman's informative, humorous, and dead-on correct post on statistical lexicon.

McKinsey on Big Data

McKinsey has a full report (from March 2011) describing the meaning and potential impact of so-called big data. You can read the report here. One problem, which the authors of the report do not discuss in detail, is that since so much of what constitutes big data will be collected by private firms, there are possibilities of restricted information pockets. In other words, only certain private actors will have access to big data, and academics might very well be left with very few big data sources.

Wednesday, March 21, 2012

Inequality: Everyone's Thinking About It

I ran into the following articles on inequality, which has been increasing not only structurally but also culturally (in that more policy elites and journalists are discussing the topic openly). Here are some recent posts on inequality:

Reuters is reporting findings from a group of researchers showing that Sweden has undergone an enormous increase in inequality, especially since the rise of the center-right in the political system. For those of us in the United States who look to Sweden as a model of development, in recent years even this country has regressed from the ideals of social democracy.
Based on an online survey (with all the caveats about sampling procedures, of course), a group has surveyed wealthy Americans on their views on inequality. The biggest finding, which reinforces the importance of class-based analyses of electoral politics: among the wealthy there is a huge gap between self-identified Republicans and Democrats, with over 84% of the latter favoring policies taxing the rich compared with around 29% of the former.

Universal Limits in High-Dimensional Statistics

The MIT Center on Operations Research is hosting a talk tomorrow on universal limits in high-dimensional statistics. The basic idea is that, for all fields of empirical study from sociology to high-energy physics, some criterion for "statistical significance" is crucial for making decisions based on the data. (The current hunt for the Higgs boson is in fact based on a modified criterion for statistical significance.) The problem, however, is that we are entering a world of big data, in which data structures have many dimensions, thus altering the potential usefulness of such criteria for statistical significance.

Sunday, March 18, 2012

Rethinking Tragedy and Success

The social theorist Alain de Botton presents a creative rethinking of the meaning of tragedy and success in a TED talk, shown here:

In essence, he argues that success needs to be rethought using insights from sociology, including an understanding of the limits of the ideal of a meritocratic society (since there is always random chance involved in social mobility), a deeper awareness of how failure as a concept involves particular beliefs and values (so that we can conclude that Hamlet is not a "loser" even though he "lost"), and a sensitivity to the fact that even when particular social and cultural distinctions appear to be irrelevant economic differences certainly are not (so that comparing oneself to Bill Gates rather than the Queen of England is just as absurd, even though the former wears "business casual").

Saturday, March 17, 2012

Why Inequality Matters

The conservative magazine Commentary has published an article on how social inequality is on the political agenda and on the minds of most Americans, even though many conservatives would prefer the case to be otherwise. The authors argue that, in part, the discussion of inequality should be oriented toward social mobility and poverty, as well as the "injustices" of government policy. What the authors apparently fail to realize is the possibility that inequality causes poverty and immobility, not to mention "unjust" government policies perpetuating inequality. In particular, higher inequality can cause low social mobility by increasing socioeconomic distances between the highest and lowest rungs of society, higher rates of poverty by segregating groups and distorting resource allocations, and inequality-perpetuating government policies by shifting costs from the wealthy to the general population (through, for example, cutting funds for widely-available public services and increasing take-home profits from private organizations).

Friday, March 16, 2012

Inequality "Crisis" of Marriage

The Atlantic Monthly posted a fascinating article today on the inequality "crisis" of marriage. My favorite line in the article: "Gone are the days when the Harvard grad marries the girl with the high school degree simply because, well, she's pretty."

Thursday, March 15, 2012

Corporate Culture Revisited

Greg Smith has a popular post in the NYT titled Why I am Leaving Goldman Sachs. His reason is that the organizational culture is now "as toxic and destructive" as he has "ever seen it." In particular, Smith charges that the values and norms of the organization are oriented almost exclusively toward profit-making, with little or no regard for the well-being of other organizations and people, including their clients.

Misc. Links

MIT students are having a Pi Day recitation and celebration today (since today is 3.14, of course).
The Financial Times discusses Goldman Sachs' corporate culture without, unfortunately, describing what is meant by the phrase; however, I'm glad to see that cultural factors are mentioned, since clearly faulty beliefs, norms, and values contributed to the financial crisis.
The U.S. Census Bureau recently released a report describing the inequality levels (expressed as Gini coefficients) of all counties in the United States from 2006 to 2010; the findings show, as one would expect, that more populous counties are more unequal.
Finally, a new study suggests that first-generation immigrants face a disadvantage in attending college due to a "cultural mismatch" in values and norms between working-class youth and those from middle- and upper-class backgrounds.

Tuesday, March 13, 2012

MIT Inequality Talk

As part of the technology and culture forum at MIT, I attended a talk featuring the notable economists Frank Levy (Professor of Urban Economics at MIT), David Autor (Associate Chair of the MIT economics department), Peter Diamond (MIT Institute Professor Emeritus), and Arjun Jayadev (Assistant Professor of Economics at UMass-Boston). I've read quite a bit of their work, and they have all conducted important research on inequality, poverty, and policy; for instance, Frank Levy's The New Dollars and Dreams: American Incomes and Economic Change is still (over a decade after the last edition was published) one of the best overviews of trends in economic conditions in the United States since World War II. The panelists focused on the causes and consequences of income and wage inequality, as well as possible solutions, with moderation by David Autor.

Scatter Plot Matrix in R

Stata has a large number of graphics capabilities (and I highly recommend Stata over other statistical packages for a variety of reasons), but in a few instances R is more useful. In particular, I find R useful for creating beautiful scatter plot matrices and 3-D graphical displays. To my knowledge, these kinds of graphics are currently very difficult (if not impossible) to create in Stata 12. What I like about scatter plot matrices is that they can have a high data-to-ink ratio, packing together fitted lines, scattered data, histograms, correlations (with text size proportional to the size of the correlation), and statistical significance "stars" (since reviewers seem to like them). Moreover, I like that all the information effectively puts the "stars" associated with statistical significance in appropriate context: there is an incredible amount of variability in the size of correlations and distribution of data among all the "three-star" correlations, underscoring the limited usefulness of statistical significance as a tool for understanding the social reality given to us by data.
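To make the point about "three-star" correlations concrete, here is a minimal Python sketch. It is not the R code behind the graph in this post. It assumes the Fisher z-transformation as a large-sample approximation for the p-value of a correlation, uses the conventional 0.05/0.01/0.001 star cutoffs, and the (r, n) pairs are hypothetical values chosen only for illustration.

```python
import math
from statistics import NormalDist


def fisher_p_value(r: float, n: int) -> float:
    """Approximate two-sided p-value for H0: rho = 0 via the Fisher z-transform."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def stars(p: float) -> str:
    """Conventional star labels (the cutoffs are a convention, not a law)."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


# Hypothetical (correlation, sample size) pairs, for illustration only.
examples = [(0.15, 1000), (0.90, 20), (0.15, 50)]
for r, n in examples:
    p = fisher_p_value(r, n)
    print(f"r = {r:.2f}, n = {n:4d}, p = {p:.2g}, stars = '{stars(p)}'")
```

Under this approximation a weak correlation in a large sample and a strong correlation in a small sample both earn three stars, while the same weak correlation in a small sample earns none, which is why the stars say little about the size of a relationship.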

Monday, March 12, 2012

Taxes and Inequality

The economist Daron Acemoglu and his colleague James Robinson have an excellent article on the problems with inequality in the United States. You can find it here. In general, I agree with them entirely, and they are persuasive in outlining the negative aspects of political inequality.

3-D Scatter Plots Redux

One weakness of Stata versus R is the lack of 3-D graphing capabilities, in particular 3-D scatter plots. However, with some modifications, Stata can indeed provide a suitable substitute for R in most graphical problems, as shown here (I use the infamous auto data set available in Stata with the sysuse command). The main weakness is that the x-y and y-z planes do not have grid lines; nevertheless, this graph is another indication that Stata's graphing capabilities are much stronger than many R users (and perhaps even Stata users) realize. Here's the graph:

Saturday, March 10, 2012

Checking Weather in Stata

I added a useful Stata command to my computer today: Neal Caren's weathr command in Stata (note that there is no "e"). The command is great: now you can check your day's weather entirely within Stata! The command obtains the current weather conditions and forecast for the next 36 hours from yahoo.com for any zip code in the United States.

Friday, March 09, 2012

Is Everything Culture?

In my readings on culture, I've found a fascinating set of theories called digital physics. These theories posit that the universe fundamentally consists of information (i.e., the "it from bit" doctrine that every particle, atom, quark, and so on is describable as a dichotomous "yes or no" categorization), and thus that the universe is in principle computable. Opponents of digital physics claim that reality is continuous, but the rejoinder is that reality only appears continuous, and is fundamentally categorical (for example, the Planck length suggests that reality is quantized). More relevant to sociology, these perspectives suggest that everything is culture -- i.e., information -- and thus that societies can be usefully modeled as information systems.

Thursday, March 08, 2012

Ternary (or Triaxial) Plots

One rarely-used graphic is the ternary (or triaxial) plot, which is a very useful way of examining a tripartite decomposition of a variable. For example, the graph in this post displays the composition (which I constructed in Stata using Nicholas J. Cox's commands) of an economy over time. Note that the three percentages add to 100 (or, equivalently, the three proportions add to 1).
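The geometry behind a ternary plot is simple enough to sketch in a few lines of Python. This is not the Stata code used for the graph in this post. The function name `ternary_to_xy`, the corner layout, and the example shares are all assumptions made for illustration.

```python
import math


def ternary_to_xy(a: float, b: float, c: float) -> tuple[float, float]:
    """Map a three-part composition to a point inside an equilateral triangle.

    Corner layout (an arbitrary choice): a -> (0, 0), b -> (1, 0),
    c -> (0.5, sqrt(3) / 2).
    """
    total = a + b + c
    if total <= 0:
        raise ValueError("the three parts must have a positive sum")
    # Rescale so that percentages (summing to 100) and proportions both work.
    a, b, c = a / total, b / total, c / total
    return b + 0.5 * c, c * math.sqrt(3) / 2


# Hypothetical shares of an economy (agriculture, industry, services), in percent.
x, y = ternary_to_xy(50, 30, 20)
print(f"x = {x:.3f}, y = {y:.3f}")
```

Because the three parts are constrained to sum to a constant, only two of them are free, which is what lets a three-part composition sit on a flat, two-dimensional triangle.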

It's a bit surprising that this graph appears so infrequently; it would appear to be especially useful for political scientists showing voting fractions over time (with the three most prominent parties for each axis), economists examining the composition of an economy (such as above), or sociologists examining over-time trends in any three-part categorical variable (such as "agree," "disagree," or "neutral" on a question of values or attitudes).

However, note that simply because a graph looks like it's a ternary plot does not make it one! For example, Junk Charts dissects this pseudo-ternary plot in the New York Times.

Wednesday, March 07, 2012

Causality and Ethnography

The University of Chicago is hosting a conference on causality and ethnography on March 8th and 9th. Full details are available here. My own view on the relationship between causality and ethnography is that ethnographers should use counterfactuals, and in fact usually do whether or not they are explicit about them. In modern statistics (in particular, the work of Donald Rubin at Harvard, among others, on the potential outcomes model), the counterfactual model of causality clarifies the conditions under which any particular data set can be interpreted as causal, and shows that these assumptions are extremely strong. Contra the prevailing view of many economists, even instrumental variables regression, regression discontinuity design, and related methods require exceptionally (and often implausibly) strong assumptions for causal interpretation.
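For readers unfamiliar with the potential outcomes model, here is a small simulated illustration in Python. Everything in it is an assumption made for the sketch: the constant treatment effect of 1, the unobserved trait that drives both the outcome and self-selection into treatment, and the sample size. In real data only one of the two potential outcomes is ever observed for each unit.

```python
import random
from statistics import mean

random.seed(0)
N = 20_000
TRUE_EFFECT = 1.0  # assumed constant treatment effect

# Each unit carries BOTH potential outcomes, which only a simulation can see.
units = []
for _ in range(N):
    u = random.gauss(0, 1)        # unobserved trait
    y0 = u + random.gauss(0, 1)   # outcome if untreated
    y1 = y0 + TRUE_EFFECT         # outcome if treated
    units.append((u, y0, y1))


def difference_in_means(assignments):
    """Compare the observed outcomes of treated and untreated units."""
    treated = [y1 for (_, _, y1), t in zip(units, assignments) if t]
    control = [y0 for (_, y0, _), t in zip(units, assignments) if not t]
    return mean(treated) - mean(control)


# Self-selection: units with a high unobserved trait choose treatment.
self_selected = [u > 0 for u, _, _ in units]
# Randomization: a coin flip decides treatment.
randomized = [random.random() < 0.5 for _ in units]

naive_estimate = difference_in_means(self_selected)
randomized_estimate = difference_in_means(randomized)
print(f"true effect:          {TRUE_EFFECT:.2f}")
print(f"self-selected groups: {naive_estimate:.2f}")
print(f"randomized groups:    {randomized_estimate:.2f}")
```

The self-selected comparison mixes the treatment effect with pre-existing differences between the groups, so it recovers the true effect only under the strong assumption that treatment is unrelated to the potential outcomes.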

The Mystery of Power-Law Distributions

One criticism of sociology, and the macro social sciences more generally (such as political science, anthropology, and economics), is that there are very few "laws" of social reality. There are, however, some sociological regularities that are as yet not fully explained, and which seem bizarre. The most enduring and puzzling of these are power-law distributions (a well-known special case is "Zipf's Law"), in which "large" instances of things are extremely rare, while "small" occurrences of things are extremely common (where size can refer to frequency in a population, population size, geographic space, and so on). In practice this means that a handful of words are much more frequent than other words (and most words are rarely used), wealth is concentrated in a small number of people (and most people are poor), there are a handful of really popular songs (and a vast number of unpopular tunes), and so on. Even the sizes of sand particles on a beach follow a power-law distribution: how often have you seen a boulder on a beach?

What might explain the ubiquity of power-law distributions? As far as I can tell, nobody is entirely sure, although we have some good guesses. For example, the sociologist Herbert Simon outlined a theory of preferential growth attachment (also known as the "rich get richer" effect), in which songs that are already fairly popular will become more popular, cities that are already large will become even larger, and words already used widely will become even more widely used. Note that this explanation hinges on a positive feedback effect: the probability that any thing gets "larger" is directly proportional to the current "largeness" of the thing; or, to put it another way, large values get amplified rather than cancelled out (as in a normal distribution).
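The positive feedback mechanism can be sketched as a short simulation. This is a simplified Python version of a "rich get richer" process, not a faithful reproduction of Simon's original model. The function name, the probability `p_new` that a brand-new item appears, and the number of steps are all assumptions chosen for illustration.

```python
import random
from statistics import median


def simulate_rich_get_richer(steps: int, p_new: float, seed: int = 0) -> list[int]:
    """Grow item sizes one unit at a time with size-proportional attachment."""
    rng = random.Random(seed)
    sizes = [1]   # sizes[i] is the current size of item i
    tokens = [0]  # one entry per unit of size, so a uniform draw is size-proportional
    for _ in range(steps):
        if rng.random() < p_new:
            # A brand-new item (a new word, song, or city) enters with size 1.
            sizes.append(1)
            tokens.append(len(sizes) - 1)
        else:
            # An existing item grows, chosen in proportion to its current size.
            chosen = rng.choice(tokens)
            sizes[chosen] += 1
            tokens.append(chosen)
    return sizes


sizes = simulate_rich_get_richer(steps=20_000, p_new=0.1)
print("number of items:", len(sizes))
print("median size:    ", median(sizes))
print("largest size:   ", max(sizes))
```

In runs like this a few early items end up enormous while the typical item stays tiny, which is the qualitative signature of a heavy-tailed distribution. The sketch does not attempt to estimate or test the power-law exponent.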

Power-law distributions have important cultural, statistical, and political implications.

Culturally, there are several implications. First, most cultural constructs are rarely used and only a handful are common among any group of people. To put it another way, the shared part of culture is likely to be relatively small, while the particular part of culture is vast. Second, frequently used cultural constructs are particularly stable over time; that is, 500 years from now the word "the" will still be used, while "sesquipedalian" has a more uncertain future. Third, the stability of a cultural system is derived from the more frequently used cultural constructs, while the dynamism is among the less frequently used constructs. Fourth, initial conditions are extremely important for the frequency and hence durability of cultural constructs: for instance, small, random fluctuations led to the popularity of "the" in the English language. Finally, following from the previous point, the consequences of initial conditions are highly unpredictable; given small initial changes, English speakers today might be using the word "tha" or "se" instead of "the."

Statistically, the presence of power-law distributions is a reminder that classical linear regression (based on the normal distribution) is not always the appropriate fit to a scatter plot of two variables, and that summarizing a distribution as a mean or median can be highly misleading.
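A quick simulated comparison shows why a single summary number can mislead. This is a Python sketch using only the standard library. The Pareto shape parameter of 1.2, the normal distribution's mean and standard deviation, and the sample size are arbitrary assumptions chosen for illustration.

```python
import random
from statistics import mean, median

rng = random.Random(42)

# Heavy-tailed sample: Pareto with shape 1.2 (an arbitrary illustrative choice).
heavy_tailed = [rng.paretovariate(1.2) for _ in range(100_000)]
# Bell-shaped sample for comparison: normal with mean 10 and standard deviation 2.
bell_shaped = [rng.gauss(10, 2) for _ in range(100_000)]

for label, sample in [("heavy-tailed", heavy_tailed), ("bell-shaped", bell_shaped)]:
    print(f"{label:>12}: mean = {mean(sample):.2f}, median = {median(sample):.2f}")
```

For the bell-shaped sample the mean and median nearly coincide, so either one describes a "typical" case. For the heavy-tailed sample a few huge values pull the mean well above the median, and neither number conveys how extreme the largest cases are.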

Politically, power-law distributions have a unique implication for efforts to deal with wealth inequality: one effective way to alter the distribution of wealth is to remove the positive feedback effects from wealth. The desired distribution of wealth would thus be described by a normal rather than power law function. Importantly, removing the positive feedback effects of wealth would not lead to the removal of inequality, but rather a change in the distribution so that the mean, median, and mode are the same. From this perspective, policies should be in place so that (in principle) a person's change in wealth is independent of their current level of wealth. Such policies might include very high taxes on capital gains, restrictions on the influence of wealth in political decision-making, rules specifying equal monetary amounts from promotions for all occupational levels in a firm, and so on.

Monday, March 05, 2012

Visualizing a Correlation Table

Correlation tables are ubiquitous in social science research, but they are very rarely visualized. As I've emphasized in previous posts, I'm a strong advocate for visualizing data and models whenever possible. For example, for my research I graphed correlations using Adrian Mander's plotmatrix command in Stata. Using Mander's package, I could create a graph that clearly shows all the information in a parsimonious way; moreover, unlike a correlation table, correlation patterns are intuitively grasped from the shading of the cells, and there is an implicit emphasis on the correlation size rather than statistical significance.

Sunday, March 04, 2012

Why Models are Not Data

In doing research, sometimes it can be easy to think that the models one is using are in fact the data -- but this is clearly not true. Even the mean of a sample of data is a model of the central tendency of the data, and not the data itself. One clear example of why models are not data is Anscombe's quartet. For example, take the following:

What is remarkable about this quartet is that for all of these scatter plots the mean of x is the same (exactly), the variance of x is the same (exactly), the mean of y is the same (to two decimal places), the variance of y is the same (to three decimal places), the correlation between x and y is the same (to three decimal places), and the linear regression equation is the same (to two or three decimal places). In other words, the models of the data (e.g., mean, variance, correlation, etc.) are the same, but the data are not!
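For readers who want to check these summaries themselves, here is a short Python sketch that computes them from the published Anscombe values. The helper name summarize and the dictionary layout are my own choices for illustration.

```python
# Anscombe's quartet: sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return mean, sample variance, correlation, and least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx, "var_x": sxx / (n - 1),
        "mean_y": my, "var_y": syy / (n - 1),
        "r": sxy / (sxx * syy) ** 0.5,
        "slope": slope, "intercept": my - slope * mx,
    }


results = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, stats in results.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The four rows of output should be nearly indistinguishable, which is exactly why the summaries alone cannot stand in for the data.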

So what's the solution? As I've mentioned in previous posts, graphing the data is crucial, because we're forced to confront the actual data, and not models of the data.

Saturday, March 03, 2012

R versus Stata Redux

I've used both R and Stata for a long time, but these days I use Stata much more frequently than R. While R is useful for some kinds of graphics (especially three-dimensional graphics) and some statistical procedures (for example, finite mixture models), in general I prefer Stata as the go-to statistical program. The reasons are clear: Stata has superior help files for almost all ado files, Stata graphics are excellent (even contour plots are available in Stata), cleaning data is a breeze in Stata but awkward in R, labeling data is much more efficient in Stata (in fact, as far as I can tell R does not allow for labeling variable names, while Stata allows for labeling levels of a variable, the variable itself, and the data set), and for many procedures Stata's syntax is much more parsimonious than R's.

Yet, R is worth learning because the 3-D graphics available are often extremely useful for exploring the data, and there will certainly be cases in which R will have statistical procedures that are unavailable or cumbersome in Stata (Bayesian analyses and finite mixture models come to mind, for example).

Friday, March 02, 2012

Culture and Poverty

The New York Times has an article covering the concept of the culture of poverty here. The article is fairly accurate, and does a good job highlighting that the study of culture and poverty had its origins among left-wing Marxists (although I would have mentioned Bowles and Gintis, who emphasized that cultural values and norms of obedience to capitalist ideologies, rather than intelligence, contribute to the social reproduction of inequality). The author elides the fact that the problem with the concept of the "culture of poverty" is that such a thing does not exist, and never has: culture is everywhere, not just among a subset of the economically disadvantaged. The appropriate question, then, is: given that we know that culture is a constituent part of the human experience, how does it matter not just for poverty, but for happiness, well-being, inequality, wealth, and so on?

Thursday, March 01, 2012

Values and Politics

I'm a bit biased, but the front page of the Huffington Post highlighted a fascinating study on education, culture and politics today.

Wednesday, February 29, 2012

Reading the New York Times in Stata

One useful command for taking a break from research is Neal Caren's "nytimes" ado file. This command lists the most recent headlines with brief summaries from the New York Times. Best of all, no subscription is required!

Tuesday, February 28, 2012

Utility Theory as Naive Cultural Theory

Here's a fascinating presentation by the economist Steve Keen on utility theory and neoclassical economics. From the perspective of a cultural sociologist, what is of particular interest is that the utility theory underlying neoclassical economics has the appearance of a naive cultural theory. Specifically, the indifference curves that constitute supply and demand curves in neoclassical analysis are based on strong, disproved assumptions about how people value things in the world: first, completeness (i.e., that the individual knows their evaluative ranking of all combinations of things); second, transitivity (i.e., if thing A is valued over B, and B over C, then A is valued over C); third, non-satiation (i.e., more things are always valued over fewer); fourth, convexity (i.e., for each thing, the value of each additional unit falls); fifth, structural independence from culture (i.e., what an individual values is independent of how much income they have); finally, no curse of dimensionality (i.e., information processing abilities are unlimited). No cultural theory in sociology has even approached the suspension of disbelief required for these kinds of assumptions. Fortunately, some sociologists (for example, Michael Hechter) have sought to correct this naive cultural theory, and have advocated eloquently and convincingly for a richer understanding of values in economic models of human behavior.

Monday, February 27, 2012

The Phil Gramm Effect

I recently re-read Andrew Abbott's brilliant article on the problems with classical linear regression. One of the most persuasive criticisms is that statistical models are extremely difficult to use for examining small changes with big effects (but big changes with small effects can be modeled). I like to call this the "Phil Gramm Effect" because arguably one of the most important causes of the 2008 financial crisis (an undoubtedly big effect) was Phil Gramm (a small change), since he was the driving force for gutting the Glass-Steagall Act and shifting government regulations in favor of private companies (often called "deregulation," but more accurately termed "re-regulation").

Sunday, February 26, 2012

Big Science in Sociology

The search for the Higgs boson has captivated a wide range of people all over the world, and the construction of the Large Hadron Collider is the reason for this widespread interest. Is such a "big science" approach possible in the social sciences, including sociology? Although the details seem obscure to me, researchers in Europe have developed a proposal for what they call FuturICT, a "big science" project for the social sciences (ICT stands for "Information and Communication Technology") in the mode of the Manhattan Project, Apollo Project, and Large Hadron Collider. But what is it, exactly, that they are proposing? I get the sense it's a giant computer simulation, but it doesn't seem entirely clear.

Saturday, February 25, 2012

Social Learning is Efficient

I encountered this clever article by several social scientists, including the cultural anthropologist Rob Boyd. Through various data, they show that it is beneficial to copy others (i.e., engage in social learning) rather than innovate by oneself. This highlights clearly the fiction of the "self-made" man, and the importance of one's cultural and social environment in leading to human flourishing.

Friday, February 24, 2012

3-D Bar Graph "Masterpiece"

I encountered this post on how to turn a "boring" bar graph into a 3-D "masterpiece." What's striking to me is that most of the people commenting actually want to replicate this graph, even though it violates the basic principles of effective statistical graphics, according to Tufte and others. For example, the 3-D effect distorts the information displayed by the "boring" bar graph, making comparisons difficult, and the visualization effects distract from the underlying data as conveyed by the differing heights of the bars. Here's the "masterpiece" in its full glory:

Thursday, February 23, 2012

Violin Plots

Violin plots are an excellent way of displaying the distribution of a continuous variable by levels of a categorical variable. In essence, violin plots are box plots and kernel density plots combined. For instance, here is a set of violin plots from Stata's auto data:

These same data could also be displayed in tabular form, but again this is a case in which a graphical display is a more effective way to examine and convey the patterns in the data.

Wednesday, February 22, 2012

Big Data and the End of Theory?

An article in The Guardian gives appropriate caution to claims that data analysis (and only data analysis) is the solution for all or even most academic and research problems. As Max Weber observed in his brilliant essay on objectivity in the social sciences, even the process of data analysis depends on values that cannot be empirically proven as right or wrong: "The 'objectivity' of the social sciences depends [..] on the fact that the empirical data are always related to those value-ideas which alone make them worth knowing and the significance of the empirical data is derived from these value-ideas. But these data can never become the foundation for the empirically impossible proof of the validity of the value-ideas."

Saturday, February 11, 2012

Era of Big Data

The New York Times has a great article discussing the era of big data. This might have a Kurzweil-esque ring to it, but due to technological change big data is becoming increasingly available and ready for analysis: in fact, there are more data sets out there than brains to analyze them, especially when one notes the incredible number of combinations of analyses that could be conducted even on a single data set with 100 variables (in what is known as the curse of dimensionality). However, one problem with big data is that, since so much of it is collected by private entities, much of it may not be available to academics and independent researchers.
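To give a sense of scale, here is a small Python sketch that counts possible analyses of a 100-variable data set under one simplifying assumption of mine: each "analysis" is defined only by which subset of variables it uses. Real analyses differ in many other ways, so these counts are best read as a lower bound on the possibilities.

```python
from math import comb

n_vars = 100  # the number of variables mentioned above

# Number of distinct pairs and triples of variables.
n_pairs = comb(n_vars, 2)
n_triples = comb(n_vars, 3)

# Number of non-empty subsets of variables.
n_subsets = 2 ** n_vars - 1

print("pairs of variables:", n_pairs)
print("triples of variables:", n_triples)
print("non-empty subsets of variables:", n_subsets)

# At a hypothetical rate of one million analyses per second, how many
# years would it take to work through every subset?
seconds_per_year = 60 * 60 * 24 * 365
years_needed = n_subsets / 1_000_000 / seconds_per_year
print("years needed at 1e6 analyses per second:", f"{years_needed:.2e}")
```

Even at that hypothetical rate, working through every subset would take vastly longer than the age of the universe.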

Thursday, February 02, 2012

Theory of Everything?

A new journal in biology called Life has published an unusual article in its inaugural edition: a paper by Erik Andrulis titled the "Theory of the Origin, Evolution, and Nature of Life." You can find the paper here. At 105 pages and 800 references, his paper seems Sokal-like, except it apparently is not a hoax at all. As a result this paper is unusual, but especially so for two reasons: first, Andrulis is apparently a well-respected biological scientist who has done important work on RNA, and second, Life appears to have all the trappings of a well-respected, peer-reviewed scientific journal, including a well-respected editorial board.

In essence, Andrulis outlines a theoretical framework that (supposedly) unifies the microcosmic and macrocosmic realms, validates predicted laws of nature, and explains the origin and evolution of cellular life. Like most non-biologists encountering this paper, I've only skimmed it, but apparently reality consists of geometric entities called "gyres." Sounds good, except Andrulis provides no evidence (as far as I can tell) that these gyres exist.

It's easy to criticize this paper, if only for the ambition of his theory. In one section he purports to unify all laws of nature, while in another he addresses the meaning of life. Even more astounding is the offhand way he presents his theory. For example, on page 55, Andrulis briefly remarks: "Please note the unity of reality and life as revealed by this theory." Can the unity of reality even be "noted"? However, my favorite part of the paper is on page 61, simply because of the sheer grandiosity of his assertion: "I refer the reader to the Theory section for a complete presentation of theoretical answers to many of science’s most challenging questions."

Questions abound about how this paper was published despite peer review (perhaps it was a publicity stunt for the journal), and about the sanity of Erik Andrulis. From the position of a sociologist of culture, however, the more interesting question concerns why this paper was so heavily criticized, and whether or not papers such as these have a place in scientific journals. Andrulis' paper, I suspect, is filled with flaws and inconsistencies, but I contend there is often insight to be gained from theoretical frameworks that we "know" are generally wrong. Thus, the problem, from my perspective, is not that Andrulis wrote this paper, but rather that there is not a biological journal (to my knowledge) where scientists can publish speculative or half-formed theories that are probably "wrong" but nonetheless help us think about the world in a different way. (Sociology, in contrast, in part because of our methodological pluralism and historical connections with philosophy, has a number of journals in which theories, even those that are highly speculative, can be developed and publicized.)

Saturday, January 21, 2012

Murray on Cultural Inequality

The conservative sociologist Charles Murray has written a new book on cultural inequality, and he's written about his main arguments here in the Wall Street Journal. There are two glaring problems with his argument, however. First, although I appreciate his attempts to examine cultural factors of the economy, he frequently conflates behaviors with culture (which consists of values, attitudes, and beliefs, not the behaviors arising from these symbolic constructs). This muddles his argument, and leads to a profusion of ad hoc claims that are weakly supported by the data, if at all. Second, his explanation for cultural inequality falls short: in particular, he ignores how lack of public investments and conservative economic policies (for example, lack of investment in public transportation, public spaces, universal welfare systems, and the growth of car-based urban sprawl based on the profit-making concerns of private developers, among other things) are leading causes of the cultural fragmentation he is concerned about.

Thursday, January 12, 2012

Inequality versus Dispersion

I'm glad to see that Alan Krueger, chairman of the Council of Economic Advisers (a fancy name for a panel of three economists), discussed the problems with inequality in his address today. You can find his remarks and graphs here. I liked his graphs, and he shows convincingly many of the standard findings in sociology and political science on politics and inequality in the United States. However, I found the following comments puzzling:

Although I have done much research in my career on inequality, I used to have an aversion to using the term inequality. The Wall Street Journal ran an article in the mid-1990s that noted that I prefer to use the term “dispersion.” But the rise in income dispersion – along so many dimensions – has gotten to be so high, that I now think that inequality is a more appropriate term.

The mixing of the statistical concept of dispersion with the sociological concept of inequality muddles the discussion. It's true that any distribution is often described by some measure of dispersion (e.g., standard deviation) and central tendency (e.g., mean or mode). But inequality encompasses a concept of equity, as well as some concept of disparity (or disparities), neither of which is analogous to the statistical concept of dispersion. Moreover, if we use Krueger's logic it's unclear at what threshold "dispersion" is labeled "inequality"; for instance, his comments imply that Sweden currently has dispersion, while the United States has inequality, although many Swedes would probably disagree.

Tuesday, January 03, 2012

Congratulations to the Digging into Data Recipients

The round two award recipients for the 2011 Digging into Data challenge are listed here.