How I built my first mobile app scraper

Scraping web pages is a well-documented process. There are plenty of guides on how to pull information using libraries like Python’s Beautiful Soup or browser extensions like Kimono. Many web applications even provide public APIs for gathering information, such as Facebook’s Graph API.

Yet, there is a growing set of popular mobile apps that do not have a public API. Apps like Yik Yak, Tinder, and others contain a wealth of information about the communities around us, but there are no common tools for easily collecting data from these platforms.

Information about these mobile communities has become increasingly relevant in understanding and reporting the news. Yik Yak, for example, recently played a role in highlighting the oppressive social tones at the University of Missouri.

So how can we scrape from mobile apps? After being inspired by this blog post about mining Yik Yaks from university areas, I decided to try creating my own scraper for Whatsgoodly. I’ll share my process.

Overview

For web scraping, the general strategy is to pull information that is within the HTML contained on a given web page. For mobile scraping, we take a different approach. Since there are no HTML files to crawl through, the strategy is to monitor the endpoints that the app loads data from so that we can use the same endpoints in our custom scraper.

More specifically, the steps I took were:

  1. Set up a Genymotion virtual device and install Whatsgoodly
  2. Monitor the app's network activity with a Charles proxy
  3. Discover the relevant endpoints and write a scraping script

Installing the application on a Genymotion virtual device

For this project, I used the latest version of Genymotion and added a virtual device running Android 5.1.0. Genymotion is helpful because it allows you to run virtual Android devices from your computer. Here’s a guide to help you get started. After you’ve added a device, start it!

The next step is to download the application you want to scrape. Generally, this is as easy as finding the Android Application Package (.apk file) for the application on one of many websites, such as APKPure or AndroidAPKsFree, and dragging it onto your device’s screen.

While trying to install Whatsgoodly using this method, I ran into some problems with getting the app to run. So instead, I installed Google Play by following anp8850’s answer on this Stack Overflow post. When following these instructions, I found that I did not need to run any of the terminal commands. Instead, I just restarted the virtual device after loading files. Once Google Play was on the device, I simply logged in and downloaded Whatsgoodly.

Monitoring Network Activity with Charles

The next step is to learn how the application loads data by tracking the HTTP/HTTPS requests it sends. To do this, I used a free trial of Charles, a web debugging proxy application.

After opening Charles, you should be able to see activity coming from the pages that are open in your web browser, but you will not be able to see any traffic from your Genymotion virtual device. This is because Genymotion’s virtual network adapter operates independently from your computer’s internet protocol stack. We can remedy this by using a Charles proxy to intercept the traffic from the virtual device. I followed Scrums of Anarchy’s first few instructions on how to connect the device to the Charles proxy. While following the instructions, remember to use the computer’s IP address for the “Proxy Hostname” field.
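If you are not sure which address to enter in the “Proxy Hostname” field, a short script can look it up. This is a minimal sketch using only the Python standard library; the probe address 8.8.8.8 is an arbitrary choice of mine, and the port shown is Charles’ default of 8888, which you should confirm in Charles’ proxy settings.

```python
import ipaddress
import socket


def local_ip_address(probe_host: str = "8.8.8.8", probe_port: int = 80) -> str:
    """Return the IP address of the interface used for outbound traffic.

    Connecting a UDP socket sends no packets. It only asks the operating
    system which local interface it would route through.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # No route available (for example, when offline): fall back to loopback.
        return "127.0.0.1"
    finally:
        sock.close()


if __name__ == "__main__":
    address = local_ip_address()
    ipaddress.ip_address(address)  # raises ValueError if the string is malformed
    print(f"Proxy Hostname: {address}")
    print("Proxy Port: 8888 (Charles' default, unless you changed it)")
```

If the script prints 127.0.0.1, the computer has no usable network route and the virtual device will not be able to reach Charles through that address.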

If everything works, you should be seeing something similar to the example below.

An example of Charles when it is blocked from capturing details about HTTPS requests from Whatsgoodly.

We’re almost there, but we’re not seeing much information about the requests. Notice that we only see CONNECT methods, and that there is no information in the Path field. This is because the app is using HTTPS requests, which Charles cannot decrypt until the device trusts its root certificate. To allow Charles to see details about HTTPS requests, open a browser on the virtual device and use it to navigate to the Charles SSL download page. This should automatically initiate the installation of a Charles Root Certificate onto your virtual device. After it’s installed, restart Genymotion and Charles. Charles should now be able to capture information about HTTPS requests.

Finding the relevant endpoints and writing a scraper

Now that we can see information about all of the requests coming from our app through Charles, all we need to do is track down the useful ones.

The first step here is to go through the actions you want to capture on the virtual device. Doing things like signing in, refreshing a page, or posting a comment while Charles is recording will help you to find out what endpoints handle what actions in the app.
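Once a session gets long, it can help to summarize which endpoints were hit and how often. The sketch below assumes you have exported the recorded session from Charles in HAR (HTTP Archive) format; the file layout it reads (`log`, `entries`, `request`, `method`, `url`) comes from the HAR format itself, while the sample URLs are hypothetical placeholders, not Whatsgoodly’s real endpoints.

```python
import json
from collections import Counter
from urllib.parse import urlsplit


def count_endpoints(har: dict) -> Counter:
    """Count recorded requests per (method, host, path) in a HAR document."""
    counts = Counter()
    for entry in har["log"]["entries"]:
        request = entry["request"]
        parts = urlsplit(request["url"])
        # Drop the query string so repeated calls to one endpoint group together.
        counts[(request["method"], parts.netloc, parts.path)] += 1
    return counts


def load_har(path: str) -> dict:
    """Read a HAR file exported from the proxy."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# A tiny hand-written HAR fragment with hypothetical URLs, for illustration.
SAMPLE_HAR = {
    "log": {
        "entries": [
            {"request": {"method": "POST", "url": "https://api.example.com/v1/login"}},
            {"request": {"method": "GET", "url": "https://api.example.com/v1/polls?lat=42.05"}},
            {"request": {"method": "GET", "url": "https://api.example.com/v1/polls?lat=41.88"}},
        ]
    }
}

if __name__ == "__main__":
    for (method, host, path), hits in count_endpoints(SAMPLE_HAR).most_common():
        print(f"{hits:3d}  {method:4s}  {host}{path}")
```

Performing one action at a time and exporting after each makes it easier to tell which endpoint belongs to which action.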

Charles’ Path field will be helpful once you’ve recorded some actions to analyze, as will the Request and Response tabs on the bottom half of the screen. We just need to look at the recorded requests, and then create custom versions of these requests programmatically from our scraper program.

An example of Charles when it is allowed to capture details about HTTPS requests from Whatsgoodly.

I chose to write my program for scraping Whatsgoodly in Python, and used the Requests library to create structured GET requests to get the polls at a specified location. The tricky part here is working out which HTTP headers to send with the requests. Using Charles’ Request tab, you can see the headers that were sent with each call so that you can use the same header structure in your program. This is a game of trial and error, but one thing that can help here is testing out your requests using a REST client like DHC!
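To give a feel for what such a request looks like in code, here is a minimal sketch. It is not the code from my scraper: the base URL, the `lat` and `lon` parameter names, and the header values are all hypothetical placeholders that you would replace with what Charles shows for your app. To stay dependency-free it uses the standard library’s `urllib` rather than Requests, and it assumes the endpoint returns JSON; with Requests, the equivalent call would be `requests.get(url, params=..., headers=...)`.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint and headers. Copy the real ones from Charles' Request tab.
BASE_URL = "https://api.example.com/v1/polls"
COPIED_HEADERS = {
    "User-Agent": "ExampleApp/1.0 (Android 5.1)",
    "Accept": "application/json",
}


def build_polls_request(latitude: float, longitude: float) -> Request:
    """Build a GET request that mirrors one recorded from the app."""
    query = urlencode({"lat": latitude, "lon": longitude})
    return Request(f"{BASE_URL}?{query}", headers=COPIED_HEADERS, method="GET")


def fetch_polls(latitude: float, longitude: float, timeout: float = 10.0):
    """Send the request and decode the body, assuming a JSON response."""
    request = build_polls_request(latitude, longitude)
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Inspect the request before sending it, which helps with trial and error.
    prepared = build_polls_request(42.05, -87.68)
    print(prepared.get_method(), prepared.full_url)
    print(prepared.header_items())
```

Building the request separately from sending it makes it easy to compare your headers and URL against what Charles recorded before you hit the server.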

That’s it! You can view the progress I have made as an example implementation at the Whatsgoodly Scraper repository. Please reach out if you have any comments or questions about the process!

 

 

About the author

Bomani McClendon

Student Fellow
