R Studio and Instagram Scraping
One of my recent projects requires me to scrape metadata from Instagram and then analyse it. I’ve written about that project in other places, and about the scraping process too, but I thought the way that I am analysing the data might be of interest in this space.
There are lots of different ways of analysing the data, and lots of tools for doing so. I have had some experience with Python, which is becoming more and more popular for data work, but perhaps the best-known language for statistical analysis is R. I have wanted to upskill in R for a while, so this seemed like a good opportunity to combine the two.
Having made that decision, I needed to decide which tools to use. I have heard good things about Jupyter notebooks, but in this instance I went with R Studio, because the LinkedIn Learning course that I was using to learn R suggested it as a good starting point.
- Importing the Data
My first challenge was to import the data that I had gathered into R Studio. This was actually fairly straightforward, using the import data function. I believe that this has required console commands in the past, but the inbuilt data importer in R Studio opens a file browser and allows you to select the Excel file, then choose how each column is interpreted (e.g. as a string, a number, etc.).
It also spits out the code from R (here’s mine):
library(readxl)
Scraping_Master <- read_excel("Desktop/Scraping Master.xlsx",
    col_types = c("text", "numeric", "skip", "skip", "text", "numeric",
        "numeric", "text", "numeric", "numeric", "numeric", "skip", "skip",
        "numeric", "skip", "numeric", "text", "skip", "text", "numeric",
        "date", "skip", "skip", "skip", "skip", "skip", "skip", "skip",
        "skip", "skip", "skip", "skip", "skip", "skip", "skip", "skip",
        "skip", "skip", "skip", "skip", "skip", "skip", "skip", "skip",
        "skip", "skip", "skip", "text"))
View(Scraping_Master)
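Once an import like this has run, it can be worth checking that each column arrived with the type chosen in the importer. The sketch below uses a small made-up data frame in place of the real spreadsheet; the object name `toy` and all of the values are invented for illustration, and only base R is used.

```r
# Toy stand-in for the imported sheet (invented names and values)
toy <- data.frame(
  Name = c("account_a", "account_b", "account_c"),
  `Total Followers` = c(1200, 560000, 89000),
  check.names = FALSE,      # keep the space in the column name
  stringsAsFactors = FALSE  # keep Name as character, as a "text" column would be
)

# One class per column: "character" for text, "numeric" for numbers
col_classes <- sapply(toy, class)
print(col_classes)

# str() gives a compact overview of the whole data frame
str(toy)
```

Columns marked "skip" in the importer would simply not appear in this overview.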
- Factorising the Character Variables
However, this is where my problems started. I wanted to do something that I thought was ridiculously simple: plot some names (as character data) against total followers on social media. However, I kept getting errors saying that this wasn’t numeric data and so couldn’t be plotted. The answer lay in turning the character data into factors.
This is how I did that:
df <- Scraping_Master2
names <- factor(df$Name)
In the example above, I create a variable called df (short for data frame) based on my imported Excel file. I then pass the Name column of df through the factor function and assign the result to the variable names. This then let me chart names against total followers, for example. See below:
And thank you to Stack Overflow for that answer: https://stackoverflow.com/questions/13009203/plot-a-character-vector-against-a-numeric-vector-in-r?answertab=votes#tab-top
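To make the factor step concrete, here is a minimal sketch on made-up data. The data frame `toy`, the variable `toy_names` and the account names are all invented for illustration; the real data frame comes from the Excel import.

```r
# Toy data (invented); stands in for the imported data frame
toy <- data.frame(
  Name = c("account_a", "account_b", "account_c", "account_a"),
  `Total Followers` = c(1200, 560000, 89000, 1350),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# A character column has no natural position on a plot axis;
# converting it to a factor gives plot() a set of categories
toy_names <- factor(toy$Name)
levels(toy_names)   # the distinct names
nlevels(toy_names)  # how many distinct names there are

# Prefer plain numbers over scientific notation on the axis
options(scipen = 999)

# With a factor on x and numbers on y, plot() draws one box per name
plot(toy_names, toy$`Total Followers`,
     xlab = "Name", ylab = "Total followers")
```

Because the x argument is a factor, plot() falls back to a box plot per category rather than a scatter plot, which is why the conversion makes the chart possible.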
- Stepping it up
Of course, I wanted to do a lot more with the data than that. My next step was to investigate, in a very basic and limited way, one of my hypotheses. I wanted to know if increased likes led to increased comments on a post. To do this, I wanted to plot my follower-to-like ratio against the follower-to-comment ratio in a scatter plot, with the points labelled. This is how I did it:
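df <- Scraping_Master2
df$Name                             # inspect the raw names
names <- factor(df$Name)            # convert the character column to a factor
names
nlevels(names)                      # number of distinct names
options(scipen=10000)               # avoid scientific notation on the axes
plot(names, df$`Total Followers`)   # box plot of followers per name
# barplot() takes the numeric heights first; the labels go in names.arg
barplot(df$`Total Followers`, names.arg = df$Name)
# hist() takes the numeric values to be binned
hist(df$`Total Followers`)
plot(df)                            # scatter plot matrix of all columns
plot(names)                         # bar chart of how often each name occurs
plot(table(names))                  # the same counts, drawn from a table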
And this is what it looked like:
Obviously, a long way to go, but a good start.
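For reference, the labelled scatter plot described above could be sketched in base R roughly as follows. This is not the code behind my chart: the data are made up, the column names (Followers, Likes, Comments) are hypothetical, and I am assuming that each ratio means followers divided by likes or by comments.

```r
# Toy data (invented); column names here are hypothetical
toy <- data.frame(
  Name = c("account_a", "account_b", "account_c", "account_d"),
  Followers = c(1200, 560000, 89000, 15000),
  Likes = c(150, 22000, 4100, 900),
  Comments = c(12, 640, 95, 30),
  stringsAsFactors = FALSE
)

# Ratios, assumed here to mean followers divided by likes / comments
toy$FollowersPerLike <- toy$Followers / toy$Likes
toy$FollowersPerComment <- toy$Followers / toy$Comments

# Scatter plot of one ratio against the other
plot(toy$FollowersPerLike, toy$FollowersPerComment,
     xlab = "Followers per like", ylab = "Followers per comment")

# Label each point with the account name, placed just above it
text(toy$FollowersPerLike, toy$FollowersPerComment,
     labels = toy$Name, pos = 3, cex = 0.8)
```

With the real data, the same two calls to plot() and text() would take the ratio columns computed from the imported spreadsheet.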