Wednesday, April 09, 2014

Twitter Extraction

For several weeks I've been examining tweets with R. Here's one approach to analyzing Twitter feeds:
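The script assumes a data frame called df with at least a screenName column (the account name) and a text column (the tweet body). Those column names match what the twitteR package's twListToDF() returns, which is presumably where df came from. To try the steps without API access, a small stand-in can be built by hand. This is only a sketch, and every handle and message in it is invented for illustration:

```r
# Hypothetical stand-in for the real Twitter data (invented handles and text).
df <- data.frame(
  screenName = c("alice", "bob", "alice", "carol", "bob", "alice"),
  text = c("Hello world",
           "RT @alice Hello world",
           "@bob thanks for the retweet",
           "RT @alice Hello world",
           "Reading about R",
           "RT @carol a good thread"),
  stringsAsFactors = FALSE
)
```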

# see how many unique Twitter accounts are in the sample
length(unique(df$screenName))

# Create a new column of random numbers in place of the usernames and redraw the plots
# find out how many random numbers we need
n <- length(unique(df$screenName))
# generate a vector of random numbers to replace the names (four digits just for convenience)
randuser <- round(runif(n, 1000, 9999), 0)
# match up a random number to a username
screenName <- sapply(unique(df$screenName), as.character)
randuser <- cbind(randuser, screenName)
# Now merge the random numbers with the rest of the Twitter data, and match up the correct
# random numbers with multiple instances of the usernames:
rand.df <- merge(randuser, df, by = "screenName")
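One caveat with round(runif(...)) is that two accounts can draw the same four-digit number, which would merge them in the later counts. A minimal sketch of an alternative, not the approach used in this post, draws the IDs without replacement using sample(). The names toy, key and toy.anon are made up for the example:

```r
set.seed(42)  # only so the example is repeatable

# Hypothetical toy data standing in for df
toy <- data.frame(screenName = c("alice", "bob", "alice", "carol"),
                  text = c("a", "b", "c", "d"),
                  stringsAsFactors = FALSE)

accounts <- unique(toy$screenName)

# sample() without replacement gives every account a distinct ID
ids <- sample(1000:9999, length(accounts))
key <- data.frame(screenName = accounts, randuser = ids,
                  stringsAsFactors = FALSE)

# attach the ID to every tweet from the same account
toy.anon <- merge(key, toy, by = "screenName")
```

With 9,000 possible IDs this works for up to 9,000 accounts; a larger sample would need a wider range.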
# determine the frequency of tweets per account
counts <- table(rand.df$randuser)
# create an ordered data frame for further manipulation and plotting
# sort first so each user stays paired with its own count
counts <- sort(counts, decreasing = TRUE)
countsSort <- data.frame(user = names(counts), count = as.vector(counts),
                         row.names = NULL)

# create a subset by tweet count; the threshold of 0 keeps every account,
# so raise it (e.g. count > 4) to keep only accounts with at least 5 tweets
countsSortSubset <- subset(countsSort, countsSort$count > 0)

## extract counts of how many tweets from each account were retweeted
# (1) clean the twitter messages by removing odd characters
rand.df$text <- sapply(rand.df$text, function(row) iconv(row, to = "UTF-8"))
# (2) remove @ symbol from user names
trim <- function(x) sub("@", "", x)
# (3) pull out who the message is to (as written, this strips the first @ from the tweet text)
rand.df$to <- sapply(rand.df$text, function(name) trim(name))
# (4) extract who has been retweeted
library(stringr)  # provides str_match()
rand.df$rt <- sapply(rand.df$text, function(tweet)
  trim(str_match(tweet, "^RT (@[[:alnum:]_]*)")[2]))
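To see what the pattern in step (4) captures, here is a small sketch that uses base R's regexec() and regmatches() in place of str_match(), so it runs without stringr. The example tweets are invented:

```r
# Hypothetical example tweets
tweets <- c("RT @alice_01 Great talk today",
            "Great talk today, RT @alice_01",
            "@bob see you there")

pattern <- "^RT (@[[:alnum:]_]*)"

# for each tweet, return the first capture group (the retweeted handle,
# with the @ removed), or NA when the tweet does not start with "RT @..."
rt <- sapply(tweets, function(tweet) {
  m <- regmatches(tweet, regexec(pattern, tweet))[[1]]
  if (length(m) >= 2) sub("@", "", m[2]) else NA_character_
}, USE.NAMES = FALSE)

rt
```

Because the pattern is anchored with ^, only tweets that begin with RT are counted; an "RT @..." quoted later in the text is ignored.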

# (5) replace names with corresponding anonymising number
randuser <- data.frame(randuser)
rand.df$rt.rand <- as.character(randuser$randuser)[match(as.character(rand.df$rt),
                                                         as.character(randuser$screenName))]

# (6) make a table with anonymised IDs and number of RTs for each account
countRT <- table(rand.df$rt.rand)
countRTSort <- sort(countRT)
# (7) subset those accounts retweeted more than twice
countRTSortSubset <- subset(countRTSort, countRTSort > 2)

# (8) create a data frame for plotting
countRTSortSubset.df <- data.frame(user = as.factor(unlist(dimnames(countRTSortSubset))),
                                   RT_count = as.numeric(unlist(countRTSortSubset)))

# (9) combine tweet and retweet counts into one data frame
countUser <- merge(randuser, countsSortSubset, by.x = "randuser", by.y = "user")
TweetRetweet <- merge(countUser, countRTSortSubset.df,
                      by.x = "randuser", by.y = "user", all.x = TRUE)

# (10) create a random subset for the graph below (the row-sampling expression is missing here)
TweetRetweet.sub <- TweetRetweet
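One common way to draw such a subset is to sample row indices. A minimal sketch, assuming a subset size of 3 and a toy stand-in for TweetRetweet (both are illustrative choices, not values from this analysis):

```r
set.seed(1)  # for a repeatable draw

# Hypothetical stand-in with the same kind of columns as TweetRetweet
toy <- data.frame(randuser = c(1234, 2345, 3456, 4567, 5678),
                  count    = c(10, 7, 5, 3, 1),
                  RT_count = c(4, NA, 2, NA, NA))

k <- 3  # assumed subset size

# pick k distinct rows at random
toy.sub <- toy[sample(nrow(toy), k), ]
```

The NA values in RT_count mirror what all.x = TRUE produces for accounts that tweeted but did not pass the retweet threshold.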