Sunday, March 04, 2012

Why Models are Not Data

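In doing research, it is easy to slip into treating the models one is using as if they were the data, but they are not. Even the mean of a sample is a model of the data's central tendency, and not the data itself. One clear example of why models are not data is Anscombe's quartet: four small datasets of eleven (x, y) points each, constructed by the statistician Francis Anscombe in 1973, which look completely different from one another when drawn as scatter plots. One is a noisy straight line, one is a smooth curve, one is a tight line with a single outlier, and one is a vertical stack of points with a single far-off point.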

What is remarkable about this quartet is that across all four datasets the mean of x is exactly the same (9), the sample variance of x is exactly the same (11), the mean of y is the same to two decimal places (7.50), the sample variance of y is nearly the same (about 4.12 to 4.13), the correlation between x and y is the same to three decimal places (0.816), and the fitted linear regression line is the same (y = 3.00 + 0.500x, with the intercept agreeing to two decimal places and the slope to three). In other words, the models of the data (mean, variance, correlation, regression line) are the same, but the data are not!
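To check these numbers yourself, here is a minimal sketch in Python using only the standard library. The values are Anscombe's published data as they are commonly reproduced; the names `QUARTET`, `summarize`, and `SUMMARIES` are my own and not part of any library.

```python
from statistics import mean, variance

# Anscombe's quartet (datasets I-III share the same x values).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary 'models' of one dataset (hypothetical helper)."""
    mx, my = mean(x), mean(y)
    # Sums of squares and cross-products about the means.
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx  # ordinary least-squares slope
    return {
        "mean_x": mx,
        "var_x": variance(x),  # sample variance (n - 1 denominator)
        "mean_y": my,
        "var_y": variance(y),
        "r": sxy / (sxx * syy) ** 0.5,  # Pearson correlation
        "slope": slope,
        "intercept": my - slope * mx,
    }


SUMMARIES = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

if __name__ == "__main__":
    for name, s in SUMMARIES.items():
        print(name, {k: round(v, 3) for k, v in s.items()})
```

The four printed rows should be nearly indistinguishable, which is exactly the problem: nothing in the summaries hints at how different the underlying points are.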

So what's the solution? As I've mentioned in previous posts, graphing the data is crucial, because it forces us to confront the actual data, and not just models of the data.