Kyle Larsen
SA8905 – Cartography and Geovisualization
Fall 2019
Instagram is a wealth of information, for better or worse. If you’ve posted to Instagram and your profile is public, maybe even if it’s not, then your information is out there just waiting for someone, someone maybe like me, to scrape it and put it onto a map. You have officially been warned.
But I’m not here to preach privacy or procure your preciously posted personal pics. I’m here to scrape pictures from Instagram, take their coordinates, and arrange them in a grid layout over a map of Toronto. My target for this example is a very public entity that thrives off exposure: the notorious BlogTO (maybe only notorious if you live in Toronto). BlogTO is a Toronto-based blog about the goings-on in the 6ix and about Toronto life and culture, and they have an Instagram that is almost too perfect for this project – but more on that later. Before anything is underway, a huge thank-you-very-much to John Naujoks, whose Instagram scraping project created some of the framework for this one (go read his project here, and you can find all of my code here).
When scraping social media you can sometimes use an API to directly access the back end of a website; Twitter, for example, has an easily accessible API. Instagram’s API sits securely behind the brick wall that is Facebook, aka it’s hard to get access to. While it would be easier to scrape Twitter, we aren’t here because this is easy. Maybe it seems a little rebellious, but Instagram doesn’t want us scraping their data… so we’re going to scrape their data.
This will have to be done entirely through the front end, aka the same way that a normal person would access Instagram, but we’re going to do it with python and some fancy HTML stuff. To start, you should have python installed (3.8 was used for this, but any iteration of python 3 should give you access to the appropriate libraries), as well as some form of GIS software for the mapping and geo-processing. Alteryx would be a bonus but is not necessary.
We’re going to use a few python libraries for this:
- urllib – for accessing and working with URLs and HTML
- selenium – for scraping the web (make sure you have a browser driver installed, such as chromedriver)
- pandas – for writing to some files
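If you’re starting from scratch, a quick setup sketch (assuming you already have pip, Chrome, and a matching chromedriver installed) looks something like this:

```python
# Quick setup sketch -- assumes pip, Chrome, and a matching chromedriver are installed.
# Install the third-party libraries first:
#   pip install selenium pandas
import urllib.request          # ships with python, nothing extra to install

import pandas as pd
from selenium import webdriver

driver = webdriver.Chrome()    # fails loudly here if chromedriver can't be found
driver.quit()
```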
If you’ve never done scraping before, it is essentially writing code that opens a browser, does some stuff, takes some notes, and returns whatever notes you’ve asked it to take. But unlike a person, you can’t tell python to go recognize specific text or features, which is where the python libraries and HTML stuff come in. The code below (thanks John) takes a specific Instagram user, returns as many post URLs as you want, and adds them to a list, for your scraping pleasure. If you enable the browser head you can actually watch as python scrolls through the Instagram page, silently kicking ass and taking URLs. It’s important to use the time.sleep(x) function because otherwise Instagram might know what’s up and they can block your IP.
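Since the original snippet isn’t reproduced here, the sketch below shows roughly what that URL-collecting loop looks like. The account name, the number of scrolls, and the link filter are placeholder assumptions (and depending on when you try this, Instagram may demand a login before it shows you anything):

```python
# Rough sketch of the URL-collecting loop -- account name, scroll count,
# and link filter are placeholders; Instagram's markup changes often.
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                 # assumes chromedriver is discoverable
driver.get("https://www.instagram.com/blogto/")
time.sleep(5)                               # let the page load

post_urls = set()
for _ in range(20):                         # each pass scrolls one screen further down
    # grab every link currently on the page and keep the ones that look like posts
    for link in driver.find_elements(By.TAG_NAME, "a"):
        href = link.get_attribute("href")
        if href and "/p/" in href:
            post_urls.add(href)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)                           # pause so Instagram doesn't block your IP

print(len(post_urls), "post URLs collected")
```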
But what do I do with a list of URLs? Well, this is where you get into the scrappy parts of this project, the closest to criminal you can get without actually downloading a car. The essentials for this project are the image and the location, but this is where we need to get really crafty. Instagram is actually trying to hide the location information from you, at least if you’re scraping it. Nowhere in a post are coordinates saved. Look at the image below: you may know where the Distillery District is, but python can’t just give you X and Y because it’s “south of Front and at that street where I once lost my wallet.”
If you click on the location name you might get a little more information, but… alas, Instagram serves the map as a .png, so there are still no coordinates to be had.
BUT! If you can scrape one website, why not another? If you can use Google Maps to get directions to “that sushi restaurant that isn’t the sketchy one near Bill’s place” then you might as well use it to get coordinates, and Google actually makes it pretty easy – those suckers.
(https://www.google.com/maps/place/Distillery+District,+Toronto,+ON/@43.6503055,-79.35958,16.75z/data=!4m5!3m4!1s0x89d4cb3dc701c609:0xc3e729dcdb566a16!8m2!3d43.6503055!4d-79.35958 )
I spy with my little eye some X and Y coordinates. The first set after the ‘@’ would usually be the lat/long of your IP address, which I’ve obviously hidden because privacy is important (that’s the takeaway from this project, right?). The second lat/long that you can glean from the end of the URL is the location of the place that you just googled. Now all that’s left is to put all of this information together and create the script below. Earlier I said that it’s difficult to tell python what to look for; what you need is the XPath, which you can copy from the HTML (right-click an element to inspect it, then right-click the highlighted HTML and copy the XPath for that specific element). For this project we’re going to need the XPath for both the image and the location. The steps are essentially as follows (a rough sketch follows the list):
- go to Instagram post
- download the image
- copy the location name
- google the location
- scrape the URL for the coordinates
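Roughly speaking, the per-post logic looks like the sketch below. The two XPaths are placeholders that you’d copy from your own browser’s inspector (Instagram’s markup changes constantly), and the coordinate regexes just match the Google Maps URL pattern shown above:

```python
# Sketch of the per-post scrape. The XPaths are placeholders -- copy the real
# ones from your browser's inspector. The regexes match the Maps URL pattern
# from the Distillery District example above.
import re
import time
import urllib.parse
import urllib.request

from selenium.webdriver.common.by import By

IMAGE_XPATH = "//article//img"                                   # placeholder
LOCATION_XPATH = "//a[contains(@href, '/explore/locations/')]"   # placeholder

def scrape_post(driver, post_url):
    """Download a post's image and return (shortcode, location, lat, lng)."""
    driver.get(post_url)
    time.sleep(3)

    # 1. download the image, named after the post's shortcode
    img = driver.find_element(By.XPATH, IMAGE_XPATH)
    shortcode = post_url.rstrip("/").split("/")[-1]
    urllib.request.urlretrieve(img.get_attribute("src"), shortcode + ".jpg")

    # 2. copy the location name -- posts without a geotag won't have this
    #    element, and the resulting exception is how you skip them
    location = driver.find_element(By.XPATH, LOCATION_XPATH).text

    # 3. google the location and pull the coordinates out of the Maps URL
    query = urllib.parse.quote(location + " Toronto")
    driver.get("https://www.google.com/maps/search/" + query)
    time.sleep(3)
    match = (re.search(r"!3d(-?\d+\.\d+)!4d(-?\d+\.\d+)", driver.current_url)
             or re.search(r"@(-?\d+\.\d+),(-?\d+\.\d+)", driver.current_url))
    if match is None:
        raise ValueError("couldn't find coordinates for " + location)
    lat, lng = float(match.group(1)), float(match.group(2))
    return shortcode, location, lat, lng
```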
There are some setbacks to this: not all posts are going to have a location, and not all pictures are pictures – some are videos. In order for a picture to qualify for full scraping it has to have a location and not be a video, and the bonus criterion – it must be in Toronto. Way back I said that BlogTO is great for this project; that’s because they love to geotag their posts (even if it is mostly “Toronto, Ontario”) and they love to post about Toronto, go figure. With these scripts you’ve built up a library of commands for scraping whatever Instagram account your heart desires (as long as it isn’t private – but if you want to write a script to log in to your own account, then I guess you could scrape a private account that has accepted your follow request, you monster, how dare you).
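If you want the Toronto criterion as code, a toy check might look like this – the bounding box is a rough rectangle I’m assuming around the city, not an official boundary:

```python
# Toy check for the "must be in Toronto" criterion. The bounding box is a
# rough, assumed rectangle around the city, not an official boundary.
def in_toronto(lat, lng):
    return 43.58 <= lat <= 43.86 and -79.64 <= lng <= -79.11
```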
With the pics downloaded and the latitudes longed, it is now time to construct the map. Unfortunately this is the most manual part of the process, but there’s always the arcpy library if you want to try to automate it. I’ll outline my steps for creating the map (a rough sketch for scripting the first two follows the list), but feel free to go about it your own way.
- Create a grid of 2km squares over Toronto (I used the grid tool in Alteryx)
- Intersect all your pic-points with the grid and take the most recently posted pic as the most dominant for that grid square
- Mark each square with the image that is dominant in that square (I named my downloaded images as their URLs)
- Crop all dominant images to a square 1×1 aspect ratio (I used Google Photos)
- Take a deep breath, maybe a sip of water
- Manually drag each dominant image into its square and pray that your processor can handle it; save your work frequently.
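I did the grid and the intersect in Alteryx, but if you’d rather script those first two steps, a rough geopandas sketch might look like the one below. It assumes your scraped points ended up in a CSV with url, date, lat, and lng columns (and that the dates sort chronologically):

```python
# Rough sketch of the first two list items in geopandas instead of Alteryx.
# Assumes a CSV of scraped points with 'url', 'date', 'lat', and 'lng' columns,
# where 'date' sorts chronologically (e.g. ISO format).
import geopandas as gpd
import numpy as np
import pandas as pd
from shapely.geometry import box

points = pd.read_csv("blogto_posts.csv")
gdf = gpd.GeoDataFrame(
    points,
    geometry=gpd.points_from_xy(points.lng, points.lat),
    crs="EPSG:4326",
).to_crs("EPSG:32617")  # UTM zone 17N, so the grid can be built in metres

# build a 2 km fishnet covering the extent of the points
xmin, ymin, xmax, ymax = gdf.total_bounds
cell = 2000  # metres
squares = [box(x, y, x + cell, y + cell)
           for x in np.arange(xmin, xmax, cell)
           for y in np.arange(ymin, ymax, cell)]
grid = gpd.GeoDataFrame({"geometry": squares}, crs=gdf.crs)
grid["cell_id"] = grid.index

# intersect the pic-points with the grid and keep the most recent post per square
joined = gpd.sjoin(gdf, grid, how="inner", predicate="within")
dominant = joined.sort_values("date").drop_duplicates("cell_id", keep="last")
dominant[["cell_id", "url", "date"]].to_csv("dominant_per_cell.csv", index=False)
```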
This last part was definitely the one most in need of a more automated process, but after all your hard work you may end up with a result that looks like the map below. Enjoy!