Monday, April 21, 2014

It's Official: Statistics is "Sexy"

If you've seen the new Captain America movie, you may have noticed that statistics (and data mining more generally) features prominently in the film. I can't imagine a more remarkable shift in the perception of statistics, which has long been dismissed as "dull" or "boring" (a view that is at odds -- pun intended -- with that of any practicing statistician, past or present). In fact, in a 1998 talk the statistician C. F. Jeff Wu even argued that "statistics" should be replaced with the phrase "data science," in part to shed the negative connotations attached to data analysis and statistical theory!

Yet now more and more people are realizing that statistics is "hot," as exemplified in the following clip, in which Scarlett Johansson suggests that the superhero Captain America go on a date with -- yes! -- a statistician:

And of course, if a movie trailer isn't enough to convince you that the public perception of statistics has been shifting, I refer you to Hal Varian, the Chief Economist at Google, who has been saying (correctly) for years that statistics is the "sexy" dream job of the 2010s:

Sunday, April 20, 2014

Python in R: Examples

How do you call Python from within R on Windows? This is a project I'd like to dedicate myself to once I have more time, since every native R user would love to have at least pseudo-connectivity with Python.

Here's a short overview of how to run some Python code in R:

# (1) basic python commands called from R
system('python -c "a = 2 + 2; print(a)"')
system('python -c "a = \'hello world\'; print(a); import pandas"')

# (2) if you have a python file you've already created (which I've
# referred to as "my.py"), then you can run it in R as follows:
system("python C:\\Users\\Name\\Desktop\\my.py")

# or alternatively:
system('python -c "import sys; sys.path.append(\'C:/Users/Name/Desktop\'); import my"')
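
The calls above only echo Python's output to the console. If you want that output back inside R, one possible approach is sketched below. It is a minimal sketch, not the only way to do this: `run_python` is a hypothetical helper name of my own, and it assumes an interpreter is reachable under the name passed as `python`.

```r
# Hypothetical helper (not from the original post): run a Python snippet
# and return whatever it prints as an R character vector, one element per line.
run_python <- function(code, python = "python") {
  # shQuote() wraps the snippet so the shell passes it to Python as one argument;
  # stdout = TRUE makes system2() capture the printed output instead of echoing it
  system2(python, args = c("-c", shQuote(code)), stdout = TRUE)
}
```

For example, `run_python("print(2 + 2)")` should hand back the printed result as a string, which can then be converted with `as.numeric()`.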

Saturday, April 19, 2014

Class-Conditional Response Probabilities

One issue that I've been trying to resolve is how to graph class-conditional response probabilities manually using the package 'poLCA' in R. I've figured out one approach using the code below:

# extracting response probabilities
# ('lc' is the fitted poLCA model)
probs <- lc$probs
y <- lc$y
P <- lc$P
R <- length(P)
pi.class <- matrix(NA, nrow=length(probs), ncol=R)
for (j in 1:length(probs))
  pi.class[j,] <- probs[[j]][,1]  # probability for the first category
dimnames(pi.class) <- list(names(y), round(P,4))
# if you want to specify your own rownames: rownames(pi.class) <- row.names

# extracting standard errors
probs.se <- lc$probs.se
se.class <- matrix(NA, nrow=length(probs.se), ncol=R)
for (j in 1:length(probs.se))
  se.class[j,] <- probs.se[[j]][,1]  # 1 specifies the category level
dimnames(se.class) <- list(names(y), round(P,4))
# if you want to specify your own rownames: rownames(se.class) <- row.names

## creating an augmented dataset
# class-conditional probabilities and standard errors
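# (the columns names, value and se are the ones the plots below rely on)
df.probs <- data.frame(Classes=as.vector(col(pi.class)),
                       Manifest.variables=as.vector(row(pi.class)),
                       names=rownames(pi.class)[as.vector(row(pi.class))],
                       value=as.vector(pi.class),
                       se=as.vector(se.class))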

## (1) LINE PLOT (No Std. Errors): line plot of latent classes
p <- ggplot(df.probs, aes(x=factor(Manifest.variables), y=value,
                          color=factor(Classes)))
p + geom_line(aes(group=Classes)) + #NB!!!
  geom_point(aes(group=Classes)) + #NB!!!
  scale_color_hue(name="Latent Class") + xlab("Manifest Variables") +
  ylab('P(Y = "Too Little")') +
  ggtitle("Class-Conditional Response Probabilities by Latent Class") +
  theme_bw() + scale_x_discrete(labels=unique(df.probs$names))
  # to add variable names manually for the manifest variables:
  # + scale_x_discrete(labels=c(""))

## (2) RIBBON PLOT (has Std. Errors): ribbon plot of response probabilities
# (with standard errors) using ggplot to graph the predicted probabilities

df.probs$lower <- df.probs$value - df.probs$se
df.probs$upper <- df.probs$value + df.probs$se
df.probs$Classes <- factor(df.probs$Classes)

# using ggplot to graph the predicted probabilities
ggplot(df.probs, aes(x = Manifest.variables, y = value, group=Classes)) +
  geom_ribbon(aes(ymin = lower, ymax = upper, fill=Classes),
              alpha = 0.2) +
  geom_line(aes(colour = Classes), size = 1) + theme_bw() +
  ggtitle("Class-Conditional Response Probabilities by Latent Class") +
  xlab("Manifest Variables") +   ylab('P(Y = "Too Little")') +
  scale_fill_discrete("Latent Class") +
  scale_linetype_discrete("Latent Class") +
  scale_shape_discrete("Latent Class") +
  scale_colour_discrete("Latent Class") +
  scale_x_continuous(breaks=seq_along(unique(df.probs$names)),
                     labels=unique(df.probs$names)) + coord_flip()
  # to add variable names manually for the manifest variables:
  # use labels=c("") in scale_x_continuous() above
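
To make the reshaping step behind both plots easier to follow, here is a small self-contained sketch that does not need poLCA or ggplot2. The matrices `pi.toy` and `se.toy` are made-up stand-ins for `pi.class` and `se.class`, and the clipping of the band to [0, 1] is an optional extra of mine, not something the code above does.

```r
# Toy stand-ins for pi.class and se.class (made-up numbers, 3 items x 2 classes)
pi.toy <- matrix(c(0.10, 0.40, 0.70,
                   0.80, 0.55, 0.20),
                 nrow = 3,
                 dimnames = list(c("item1", "item2", "item3"),
                                 c("class 1", "class 2")))
se.toy <- matrix(0.05, nrow = 3, ncol = 2, dimnames = dimnames(pi.toy))

# as.vector() unrolls a matrix column by column, so row() and col()
# give the item index and class index of every unrolled cell
long <- data.frame(Classes = factor(as.vector(col(pi.toy))),
                   Manifest.variables = as.vector(row(pi.toy)),
                   names = rownames(pi.toy)[as.vector(row(pi.toy))],
                   value = as.vector(pi.toy),
                   se = as.vector(se.toy))

# one-standard-error band, clipped so it stays inside the probability range
long$lower <- pmax(long$value - long$se, 0)
long$upper <- pmin(long$value + long$se, 1)
```

Each row of `long` is one (item, class) pair, which is the layout ggplot expects when it draws one line or ribbon per latent class.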

Wednesday, April 09, 2014

Twitter Extraction

For several weeks I've been examining tweets using R. Here's one approach to analyzing Twitter feeds:

# see how many unique Twitter accounts in the sample

# Create a new column of random numbers in place of the usernames and redraw the plots
# find out how many random numbers we need
n <- length(unique(df$screenName))
# generate a vector of random numbers to replace the names (four digits just for convenience)
randuser <- round(runif(n, 1000, 9999), 0)
# match up a random number to a username
screenName <- unique(df$screenName)
screenName <- sapply(screenName, as.character)
randuser <- cbind(randuser, screenName)
# Now merge the random numbers with the rest of the Twitter data, and match up the correct
# random numbers with multiple instances of the usernames:
rand.df <- merge(randuser, df, by="screenName")
# determine the frequency of tweets per account
counts <- table(rand.df$randuser)
# create an ordered data frame for further manipulation and plotting
# (sort first, so that each user stays paired with its own count)
countsOrdered <- sort(counts, decreasing = TRUE)
countsSort <- data.frame(user = names(countsOrdered),
                         count = as.vector(countsOrdered), row.names = NULL)

# create a subset of those who tweeted at least once
# (raise the threshold, e.g. count >= 5, to keep only frequent tweeters)
countsSortSubset <- subset(countsSort, countsSort$count > 0)

## extract counts of how many tweets from each account were retweeted
# (1) clean the twitter messages by removing odd characters
rand.df$text <- sapply(rand.df$text, function(row) iconv(row, to="UTF-8"))
# (2) remove @ symbol from user names
trim <- function(x) sub("@", "", x)
# (3) pull out who the message is to
rand.df$to <- sapply(rand.df$text, function(name) trim(name))
# (4) extract who has been retweeted (str_match() comes from the stringr package)
library(stringr)
rand.df$rt <- sapply(rand.df$text, function(tweet)
  trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))

# (5) replace names with corresponding anonymising number
randuser <- data.frame(randuser)
rand.df$rt.rand <- as.character(randuser$randuser)[match(as.character(rand.df$rt),
                                                         as.character(randuser$screenName))]

# (6) make a table with anonymised IDs and number of RTs for each account
countRT <- table(rand.df$rt.rand)
countRTSort <- sort(countRT)
# (7) subset those accounts retweeted more than twice
countRTSortSubset <- subset(countRTSort, countRTSort > 2)

# (8) create a data frame for plotting
countRTSortSubset.df <- data.frame(user = as.factor(unlist(dimnames(countRTSortSubset))),
                                   RT_count = as.numeric(unlist(countRTSortSubset)))

# (9) combine tweet and retweet counts into one data frame
countUser <- merge(randuser, countsSortSubset, by.x = "randuser", by.y = "user")
TweetRetweet <- merge(countUser, countRTSortSubset.df,
                      by.x = "randuser", by.y = "user", all.x = TRUE)

# (10) creating a random subset for the graph below
TweetRetweet.sub <- TweetRetweet[ ... ]
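
The anonymising and counting steps can be tried out without any real Twitter data. The sketch below uses a made-up `tweets` data frame with the same `screenName` and `text` columns. It swaps in `sample()` for the rounded `runif()` draw, which is my own substitution: sampling without replacement means two accounts can never end up with the same ID. It also uses base R `sub()` instead of `str_match()` so that no extra package is needed.

```r
set.seed(1)  # only so the toy example is reproducible

# Made-up tweets; the column names follow the ones used above
tweets <- data.frame(
  screenName = c("alice", "bob", "alice", "carol", "bob", "alice"),
  text = c("hello", "RT @alice: hello", "stats!", "RT @alice: stats!",
           "RT @carol: hi", "bye"),
  stringsAsFactors = FALSE
)

# One distinct four-digit ID per account
users <- unique(tweets$screenName)
ids <- sample(1000:9999, length(users))

# match() looks up each tweet's account in the vector of unique accounts
tweets$randuser <- ids[match(tweets$screenName, users)]

# Pull the retweeted account out of "RT @name" and anonymise it the same way
rt <- sub("^RT @([[:alnum:]_]+).*$", "\\1", tweets$text)
rt[!grepl("^RT @", tweets$text)] <- NA   # not a retweet
tweets$rt.rand <- ids[match(rt, users)]

# Tweets sent per account and retweets received per account, largest first
tweet_counts <- sort(table(tweets$randuser), decreasing = TRUE)
rt_counts <- sort(table(tweets$rt.rand), decreasing = TRUE)
```

Because `sort()` is applied to the table itself, the anonymised IDs stay attached to their counts as the table's names.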