If you've seen the new Captain America movie, you may have noticed that statistics (and data mining more generally) feature prominently in the film. I can't imagine a more remarkable shift in the perception of statistics, which has historically been dismissed as "dull" or "boring" (a view that is at odds -- pun intended -- with that of any practicing statistician, past or present). In fact, in a 1998 talk the statistician C.F. Jeff Wu even argued that "statistics" should be replaced with the phrase "data science," in part to remove the negative connotations associated with data analysis and statistical theory!
Yet now more and more people are realizing that statistics is "hot," as exemplified in the following clip, in which Scarlett Johansson suggests that the superhero Captain America go on a date with -- yes! -- a statistician:
And of course, if a movie trailer isn't enough to convince you that the public perception of statistics has been shifting, I refer you to the Chief Economist for Google, Hal Varian, who has been saying (correctly) for years that statistics is the "sexy" dream job of the 2010s:
How do you call Python from within R on Windows? This is a project I'd like to dedicate myself to once I have more time, since every native R user would love to have at least pseudo-connectivity with Python.
Here's a short overview of how to run some Python code in R:
# (1) basic Python commands called from R
# (print() with parentheses runs under both Python 2 and Python 3)
system('python -c "a = 2 + 2; print(a)"')
system('python -c "a = \'hello world\'; print(a); import pandas"')
# (2) if you have a Python file you've already created (which I've
# referred to as "my.py"), then you can run it in R as follows:
system("python C:\\Users\\Name\\Desktop\\my.py")
# or alternatively (forward slashes avoid backslash-escape problems
# inside the Python string):
system('python -c "import sys; sys.path.append(\'C:/Users/Name/Desktop\'); import my"')
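The calls above only print Python's output to the console. If you want that output back in R as a value, one option is to capture standard output. The following is a minimal sketch, assuming a `python` executable is on the PATH; `py_cmd`, `out`, and `result` are hypothetical names used only for illustration.

```r
# Sketch: capture what Python prints as an R character vector.
# Assumes `python` is on the PATH; py_cmd is a hypothetical example command.
py_cmd <- "print(2 + 2)"

if (nzchar(Sys.which("python"))) {
  # stdout = TRUE makes system2() return the output lines instead of printing them
  out <- system2("python", args = c("-c", shQuote(py_cmd)), stdout = TRUE)
  # convert the captured text to a number on the R side
  result <- as.numeric(out)
  print(result)
} else {
  message("python not found on the PATH; nothing was run")
}
```

`system('...', intern = TRUE)` captures output in much the same way if you prefer to stay with `system()`.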
One issue that I've been trying to resolve is how to graph class-conditional response probabilities manually using the package 'poLCA' in R. I've figured out one approach using the code below:
# extracting response probabilities
# (assumes lc is the fitted poLCA model object)
probs <- lc$probs
P <- lc$P
y <- lc$y
R <- length(P)   # number of latent classes
pi.class <- matrix(NA, nrow=length(probs), ncol=R)
for (j in 1:length(probs)) pi.class[j,] <- probs[[j]][,1]   # probability for first category
dimnames(pi.class) <- list(names(y), round(P,2))   # rounding to 2 digits is an assumption
# if you want to specify your own rownames:
# rownames(pi.class) <- row.names   (row.names being your own vector of labels)

# extracting standard errors
probs.se <- lc$probs.se
se.class <- matrix(NA, nrow=length(probs.se), ncol=R)
for (j in 1:length(probs.se)) se.class[j,] <- probs.se[[j]][,1]   # [,1] specifies the category level
dimnames(se.class) <- list(names(y), round(P,2))
# if you want to specify your own rownames:
# rownames(se.class) <- row.names

## creating an augmented dataset
# class-conditional probabilities and standard errors
df.probs <- data.frame(Classes=as.vector(col(pi.class)),
  Manifest.variables=as.vector(row(pi.class)), value=as.vector(pi.class),
  names=rownames(pi.class), se=as.vector(se.class))
## (1) LINE PLOT (No Std. Errors): line plot of latent classes
library(ggplot2)
win.graph()
p <- ggplot(df.probs, aes(x=factor(Manifest.variables), y=value,
  color=factor(Classes)))
p + geom_freqpoly(stat="identity",aes(group=Classes)) + #NB!!!
  geom_point(stat="identity",aes(group=Classes)) + #NB!!!
  scale_color_hue(name="Latent Class") +
  xlab("Manifest Variables") + ylab('P(Y = "Too Little")') +
  ggtitle("Class-Conditional Response Probabilities by Latent Class") +
  theme_bw() + scale_x_discrete(labels=unique(df.probs$names)) + coord_flip()
# to add variable names manually for the manifest variables:
# + scale_x_discrete(labels=c(""))
dev.off()
## (2) RIBBON PLOT (has Std. Errors): ribbon plot of response probabilities
# (with standard errors) using ggplot to graph the predicted probabilities
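A ribbon plot of this kind could be sketched roughly as follows. This is not the original code: it uses a made-up stand-in called `toy.probs` with the same column names as `df.probs`, and the band of plus or minus 1.96 standard errors (a normal-approximation 95% interval) is my own assumption.

```r
library(ggplot2)

# Toy stand-in for df.probs; the values are invented purely for illustration
toy.probs <- data.frame(
  Classes            = rep(1:2, each = 3),
  Manifest.variables = rep(1:3, times = 2),
  value              = c(0.80, 0.65, 0.70, 0.20, 0.35, 0.15),
  se                 = c(0.05, 0.04, 0.06, 0.03, 0.05, 0.04)
)

# Assumed band: value +/- 1.96 * se, clipped to the [0, 1] probability range
toy.probs$lower <- pmax(toy.probs$value - 1.96 * toy.probs$se, 0)
toy.probs$upper <- pmin(toy.probs$value + 1.96 * toy.probs$se, 1)

p.ribbon <- ggplot(toy.probs, aes(x = Manifest.variables, y = value,
                                  color = factor(Classes),
                                  fill = factor(Classes),
                                  group = Classes)) +
  # shaded band for the standard errors, without an outline
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2, color = NA) +
  geom_line() + geom_point() +
  scale_color_hue(name = "Latent Class") +
  scale_fill_hue(name = "Latent Class") +
  xlab("Manifest Variables") + ylab("Response probability") +
  theme_bw()
print(p.ribbon)
```

Note that `geom_ribbon()` needs a continuous x variable, which is why `Manifest.variables` is left numeric here rather than wrapped in `factor()` as in the line plot.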
For several weeks I've been working on examining tweets using R. Here's one approach to analyzing Twitter feeds:
# see how many unique Twitter accounts in the sample
length(unique(df$screenName))
# Create a new column of random numbers in place of the usernames and redraw the plots
# find out how many random numbers we need
n <- length(unique(df$screenName))
# generate a vector of random numbers to replace the names (four digits just for convenience)
randuser <- round(runif(n, 1000, 9999), 0)
# match up a random number to a username
screenName <- sapply(unique(df$screenName), as.character)
randuser <- cbind(randuser, screenName)
# Now merge the random numbers with the rest of the Twitter data, and match up the correct
# random numbers with multiple instances of the usernames:
rand.df <- merge(randuser, df, by="screenName")
# determine the frequency of tweets per account
# (sorted first, so that users and counts stay aligned in the data frame)
counts <- sort(table(rand.df$randuser), decreasing = TRUE)
# create an ordered data frame for further manipulation and plotting
countsSort <- data.frame(user = unlist(dimnames(counts)),
  count = as.vector(counts), row.names = NULL)
# create a subset by tweet count (a threshold of 0 keeps every account;
# use countsSort$count >= 5 for those who tweeted at least 5 times)
countsSortSubset <- subset(countsSort, countsSort$count > 0)
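The anonymisation step can be tried out on a small example. The sketch below is self-contained and uses invented screen names; it swaps in `sample()` without replacement in place of `round(runif())`, so that two accounts can never receive the same ID, and uses `match()` rather than `merge()` for the lookup. All `toy.*` names are hypothetical.

```r
# Toy stand-in for the tweet data; the screen names are made up
toy.df <- data.frame(
  screenName = c("alice", "bob", "alice", "carol", "bob", "alice"),
  stringsAsFactors = FALSE
)

set.seed(1)  # only so the sketch is reproducible
users <- unique(toy.df$screenName)

# sample() without replacement gives one distinct four-digit ID per account
ids <- sample(1000:9999, length(users))

# match() finds each row's account and assigns that account's ID
toy.df$randuser <- ids[match(toy.df$screenName, users)]

# tweets per anonymised account, most active first
toy.counts <- sort(table(toy.df$randuser), decreasing = TRUE)
toy.countsSort <- data.frame(user = names(toy.counts),
                             count = as.vector(toy.counts))
print(toy.countsSort)
```

Sorting the table before building the data frame keeps each ID on the same row as its own count.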
## extract counts of how many tweets from each account were retweeted
library(stringr)   # provides str_match()
# (1) clean the twitter messages by removing odd characters
rand.df$text <- sapply(rand.df$text, function(row) iconv(row, to="UTF-8"))
# (2) remove @ symbol from user names
trim <- function(x) sub('@', '', x)
# (3) pull out who the message is to
rand.df$to <- sapply(rand.df$text, function(name) trim(name))
# (4) extract who has been retweeted
rand.df$rt <- sapply(rand.df$text, function(tweet)
  trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
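To see what the retweet pattern in step (4) picks up, here is a small sketch on invented tweets. It uses base R's `regexec()` and `regmatches()` in place of `stringr::str_match()`, and moves the `@` outside the capture group so no separate trimming step is needed; `toy.tweets`, `rt.pattern`, and `toy.rt` are hypothetical names.

```r
# Invented example tweets
toy.tweets <- c("RT @alice: statistics is hot",
                "@bob have you seen this?",
                "RT @carol_99 great talk",
                "just a plain tweet")

# the capture group holds the username that follows "RT @" at the start
rt.pattern <- "^RT @([[:alnum:]_]+)"

# regexec() returns the full match plus capture groups for each tweet;
# tweets that do not match yield a zero-length result
m <- regmatches(toy.tweets, regexec(rt.pattern, toy.tweets))

# keep the captured username, or NA when the tweet is not a retweet
toy.rt <- sapply(m, function(x) if (length(x) >= 2) x[2] else NA_character_)
print(toy.rt)
```

Only tweets that begin with `RT @user` count as retweets here; a plain mention such as the second tweet gives `NA`.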
# (6) make a table with anonymised IDs and number of RTs for each account
# (rand.df$rt.rand is assumed to hold the anonymised ID of each retweeted account)
countRT <- table(rand.df$rt.rand)
countRTSort <- sort(countRT)
# (7) subset those people RTd more than twice
# (filter on the sorted table itself so that names and counts stay aligned)
countRTSortSubset <- subset(countRTSort, countRTSort > 2)
# (8) create a data frame for plotting
countRTSortSubset.df <- data.frame(user = as.factor(unlist(dimnames(countRTSortSubset))),
  RT_count = as.numeric(unlist(countRTSortSubset)))
# (9) combine tweet and retweet counts into one data frame
countUser <- merge(randuser, countsSortSubset, by.x = "randuser", by.y = "user")
TweetRetweet <- merge(countUser, countRTSortSubset.df,
  by.x = "randuser", by.y = "user", all.x = TRUE)
# (10) creating a random subset for the graph below
# TweetRetweet.sub <- ... (a subset of TweetRetweet)
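One common way to draw a random subset of rows from a data frame is `sample()` on the row indices. The sketch below uses an invented stand-in for `TweetRetweet` and a hypothetical subset size `n.keep`; it is not necessarily the expression used in step (10).

```r
# Toy stand-in for TweetRetweet; the values are made up
toy.tr <- data.frame(
  randuser = 1001:1010,
  count    = c(12, 9, 7, 7, 5, 4, 3, 2, 1, 1),
  RT_count = c(4, NA, 3, NA, NA, 5, NA, NA, 3, NA)
)

set.seed(42)   # only so the sketch is reproducible
n.keep <- 4    # hypothetical subset size

# sample row indices without replacement, then keep those rows
toy.sub <- toy.tr[sample(nrow(toy.tr), n.keep), ]
print(toy.sub)
```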