Google’s Published Data is as poor as it can be
Every now and then, Google releases the latest numbers on the Android platform (which you can find here).
According to Google, this dashboard:
“provides information about the relative number of devices that share a certain characteristic, such as Android version or screen size. This information may help you prioritize efforts for supporting different devices by revealing which devices are active in the Android and Google Play ecosystem”.
There's no doubt this summarized data helps developers, but given the amount of data Google has, don't you think those dashboards should be richer?
If Google won't release any detailed data, let's gather it ourselves
I recently tried to gather data on the web about the actual state of the Google Play Store so that I could generate some richer data for our fellow developers. Needless to say, I failed. There's no such data available online.
At that point I had all the questions in my head but no way to find the answers. That was when I decided to write my own crawler. For those who are not familiar with the word, crawlers are programs designed to scrape (or mine) data out of a source automatically, without any human intervention during the process.
The crawler should be able to visit app pages on the Google Play Store, parse the data on each page (number of ratings, price, developer, stars, downloads, version…) and store it in my own database in a structured way, so that I could query it however I wanted.
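To make the parse-and-store step concrete, here is a minimal sketch in Python. It is not the code from my repository: the sample markup with its data-field attributes is made up (real Play Store pages look different), the field list is a small subset, and SQLite stands in for whatever database you prefer.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical markup, for illustration only. Real Play Store pages differ.
SAMPLE_PAGE = """
<div class="app">
  <span data-field="name">Example App</span>
  <span data-field="developer">Example Dev</span>
  <span data-field="price">0.99</span>
  <span data-field="stars">4.2</span>
  <span data-field="ratings">1532</span>
</div>
"""

FIELDS = ("name", "developer", "price", "stars", "ratings")


class AppPageParser(HTMLParser):
    """Collects the text of every element that carries a known data-field."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        field = dict(attrs).get("data-field")
        if field in FIELDS:
            self._current = field

    def handle_data(self, data):
        if self._current is not None:
            self.record[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None


def parse_app_page(html):
    """Turn one page of HTML into a typed, structured record."""
    parser = AppPageParser()
    parser.feed(html)
    rec = parser.record
    return {
        "name": rec["name"],
        "developer": rec["developer"],
        "price": float(rec["price"]),
        "stars": float(rec["stars"]),
        "ratings": int(rec["ratings"]),
    }


def store_app(conn, app):
    """Insert (or overwrite) one app record in the apps table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS apps ("
        "name TEXT PRIMARY KEY, developer TEXT, price REAL, "
        "stars REAL, ratings INTEGER)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO apps "
        "VALUES (:name, :developer, :price, :stars, :ratings)",
        app,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
store_app(conn, parse_app_page(SAMPLE_PAGE))
# Once the data is structured, questions become simple queries.
print(conn.execute("SELECT developer, stars FROM apps WHERE price > 0").fetchall())
```

Converting the fields to numbers at parse time is what makes queries such as "average stars for paid apps" possible later on.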
After a week of work, I had my own server and crawler set up and running without any problems (you can find the code in my GitHub repository).
All I needed now was to wait for it to do its work, so I waited, and waited, and waited.
My Result: Google Play Store in Numbers
After running my crawler (in multiple processes) for 7 days, I started to get IP banned/blocked by Google's black magic. Even when using proxies and changing my crawler's behavior dynamically, I would still get blocked really fast. Slow as it was, I managed to mine data on roughly 1,010,000 apps (which is close to 95% of the Play Store, according to Wikipedia).
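For the curious, the general idea of rotating proxies and slowing down after a block can be sketched as below. This is only an illustration, not my crawler's actual logic: the fetch function, the Blocked exception and the proxy names are hypothetical, and as I said above, this kind of trick did not stop me from getting blocked.

```python
import itertools
import random
import time


class Blocked(Exception):
    """Raised by the (hypothetical) fetch function when a request is refused."""


def fetch_with_backoff(fetch, url, proxies, max_attempts=5,
                       base_delay=1.0, sleep=time.sleep):
    """Try fetch(url, proxy), switching proxy and waiting longer after each block."""
    proxy_cycle = itertools.cycle(proxies)
    for attempt in range(max_attempts):
        proxy = next(proxy_cycle)
        try:
            return fetch(url, proxy)
        except Blocked:
            # Exponential backoff: base_delay * 2^attempt, plus random jitter
            # so that parallel crawler processes do not retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    raise Blocked(f"gave up on {url} after {max_attempts} attempts")


# Demo with a stub fetch that is "blocked" twice before succeeding.
attempts = []

def stub_fetch(url, proxy):
    attempts.append(proxy)
    if len(attempts) < 3:
        raise Blocked()
    return f"<html>page for {url}</html>"

page = fetch_with_backoff(stub_fetch, "app-page", ["proxy-a", "proxy-b"],
                          sleep=lambda seconds: None)  # skip real waiting in the demo
print(attempts, page)
```

Passing fetch and sleep in as parameters keeps the retry logic independent of any particular HTTP library and makes it easy to try out without touching the network.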
If you're curious about the result, you can find the Excel file I created right here. Note that the database is public, so you can head to the GitHub page and connect to it to run your own queries if you like, or run the crawler at home to help “crowdsource” our own database of Play Store app data.
In the end, I learned that we shouldn't depend on information given to us by the “Big Players”. There are tons of public data sources on the web, waiting to be mined and studied. We might be surprised by some of the results we find.
If you have any questions, considerations, comments or suggestions, leave them in the comments or find me at About.Me
Update: Thanks to the help of this blog's readers, the crawler was able to populate the database with around 1.1 million apps from the Play Store, which may translate into roughly 95% reach. That's how big “Crowd Crawling” can be.
Update 2: By popular demand, I have decided to try to implement a crawler modeled on the Play Store one. You can find it in this public GitHub repository.
Update 3: As of late 2016, the project was sold to a company, and the GitHub repository is no longer available. If you want to chat with me about this project (or any future one), shoot me an email at: email@example.com