So, for my research, I’ve been looking into social media and learning – specifically, the learning that takes place on social media. There is a lot more to it than this, and I will write about that on my blog (www.blog.drheggart.com), but I thought I would document the process here.
The first thing I wanted to do was work out what kinds of things I could scrape from the internet. Obviously, there are some legal factors here, but I believe that, because I was only going to scrape publicly available data and was not going to publish anything from individuals, I was acting both legally and ethically. My preference would have been to use the Instagram API, but that’s no longer available.
I will need to make this case a little more thoroughly for the ethics panel at my university, but I’m in the process of doing that, so I’m confident I will be able to address their concerns. This post is more about the possibility of getting the data that I was seeking.
Scraping Instagram
The first step was to see what tools were out there for scraping Instagram. Not surprisingly, there are lots of off-the-shelf products, but I didn’t really want to use any of them – mainly because there’s a cost involved. Instead, I looked on GitHub and discovered a few options there. I tried a few out, but many of them were limited, required other dependencies, or didn’t seem to be actively maintained. However, I did find one that might meet my needs: Instagram-Scraper.
The first thing I did was clone the Instagram-Scraper script to my Dropbox folder. I realise now that I probably should have done all of this in my personal test environment, but that occurred to me far too late. Something to keep in mind for the future.
The second thing I did was try out what Instagram-Scraper could do. I did this on my own profile – keithheggart, which, to be honest, is pretty limited.
instagram-scraper keithheggart
This scraped my profile and downloaded the 70-odd photos that I had, putting them in a folder called keithheggart in the same folder as Instagram-Scraper. Good, but not really what I wanted. After all, these images are interesting, but of limited utility for my research.
However, Instagram-Scraper has a lot of different options too. My next effort was this:
instagram-scraper keithheggart -t none --media-metadata
This didn’t collect any of the pictures or stories, but it did collect all of the metadata and stored it in a JSON file. This was much more like what I wanted. However, the problem here was reading the data in JSON – while I could probably process it manually for small scrapes, I was planning on doing much bigger scrapes in the future.
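For bigger scrapes, it’s more practical to read the JSON programmatically than by eye. Here’s a minimal Python sketch of the idea; the file path, the GraphImages node, and the field names (shortcode, taken_at_timestamp) are assumptions based on what my own scrape produced, so I’ve built a tiny stand-in file rather than relying on a real scrape:

```python
import json

# A tiny stand-in for the metadata file that instagram-scraper writes.
# The "GraphImages" node and the field names are assumptions from my scrape.
sample = {
    "GraphImages": [
        {"shortcode": "abc123", "taken_at_timestamp": 1577836800},
        {"shortcode": "def456", "taken_at_timestamp": 1580515200},
    ]
}
with open("keithheggart.json", "w", encoding="utf-8") as f:
    json.dump(sample, f)

# Load the file back and walk the posts under the GraphImages node.
with open("keithheggart.json", encoding="utf-8") as f:
    data = json.load(f)

posts = data.get("GraphImages", [])
print(f"Loaded {len(posts)} posts")
for post in posts:
    print(post["shortcode"], post["taken_at_timestamp"])
```

With a real scrape, you would just point the open() call at the JSON file that Instagram-Scraper saved.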
Converting JSON to CSV
So, it was back to GitHub to find a script that would do this for me. I quickly found json_to_csv by XXXX, which was actively maintained and updated. The script processes the JSON file starting from a named node. A quick look at my JSON file showed me that the node was called GraphImages. So, to call the script, I did this:
python json_to_csv.py GraphImages ../Instagram_Scraper/keithheggart/keithheggart.json keithheggart.csv
This output a CSV file that I could then examine a bit more easily. There were still problems with the data – a great many columns that I didn’t really need – but it was a good start.
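Since most of those columns weren’t needed, an alternative is to flatten the JSON to CSV yourself and keep only the columns you care about. This is a sketch with Python’s standard csv module, not the json_to_csv script itself; again, the field names are assumptions from my own scrape, so I use a small inline sample:

```python
import csv

# Stand-in for the list of posts under the GraphImages node
# (field names are assumptions based on my scrape).
posts = [
    {"shortcode": "abc123", "taken_at_timestamp": 1577836800, "comments_disabled": False},
    {"shortcode": "def456", "taken_at_timestamp": 1580515200, "comments_disabled": True},
]

# Keep only the columns we actually want; extrasaction="ignore"
# silently drops every other field in each post.
wanted = ["shortcode", "taken_at_timestamp"]
with open("keithheggart.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=wanted, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(posts)

# Read it back to check the result.
with open("keithheggart.csv", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
print(rows)
```

The advantage over post-processing the full CSV is that the unwanted columns never make it into the file in the first place.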
Fixing the Date
However, there was one more problem. One piece of data that I was interested in was when these posts had been made. However, there was only a Unix timestamp (seconds since 1 January 1970) – not an actual date. Luckily enough, this is easily fixed in Excel with this formula:
=([CELL]/86400)+DATE(1970,1,1)
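The same conversion can be done in Python, which is handy once the scrapes get too big for a spreadsheet. This is just the standard epoch-to-date conversion from the standard library:

```python
from datetime import datetime, timezone

def unix_to_date(ts: int) -> str:
    """Convert a Unix timestamp (seconds since 1970-01-01 UTC) to a readable date string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

print(unix_to_date(0))           # 1970-01-01 00:00:00
print(unix_to_date(1577836800))  # 2020-01-01 00:00:00
```

This could be applied to the timestamp column while converting the JSON, so the CSV already contains readable dates.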
Scraping a hashtag
Of course, scraping an individual is all well and good, but I also wanted to be able to scrape a hashtag – that is, to find everyone using a particular hashtag. Fortunately, Instagram-Scraper offers this feature too, via the --tag flag. In the example below, I am scraping the hashtag fridays4thefuture just for the metadata, but also limiting the number of results to the latest 50 – mostly so I don’t overload my computer.
instagram-scraper fridays4thefuture --tag -t none --media-metadata -m 50
But what about USER IDs?
This worked fine, but I did notice one small problem. When I scraped the hashtag, I didn’t get Instagram usernames. Instead, I got user IDs. I thought, for a moment, that this meant the data might be anonymized, but in fact user IDs can be resolved to usernames quite simply via websites like this one: https://commentpicker.com/instagram-username.php
Some useful links here:
https://exceljet.net/formula/convert-unix-time-stamp-to-excel-date
https://github.com/vinay20045/json-to-csv
https://github.com/arc298/instagram-scraper
https://stackoverflow.com/questions/59997065/pip-python-normal-site-packages-is-not-writeable