Google Play Store in Numbers – The numbers that Google won’t share


Google’s Published Data is as poor as it can be

Every now and then Google releases the latest numbers regarding the Android platform (which you can find here).

According to Google, this dashboard:

“provides information about the relative number of devices that share a certain characteristic, such as Android version or screen size. This information may help you prioritize efforts for supporting different devices by revealing which devices are active in the Android and Google Play ecosystem”.

This summarized data certainly helps developers but, given the amount of data Google has, don’t you think those dashboards should be richer?

If Google won’t release any detailed data, let’s gather it ourselves

I recently tried to gather data on the web about the current state of the Google Play Store so that I could generate some richer data for our fellow developers. Needless to say, I failed. There’s no such data available online.

At that point I had all the questions in my head but no way to find the answers. That was when I decided to build my own crawler. For those who are not familiar with the term, crawlers are programs designed to scrape (or mine) data out of a source automatically, without any human intervention during the process.

The crawler should be able to visit the pages of apps on the Google Play Store, parse data from each page (number of ratings, price, developer, stars, downloads, version…) and store it in my own database in a structured way so that I could query it however I wanted.
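To make the "parse, then store in a structured way" idea concrete, here is a minimal sketch. It is not the project's actual code: the sample markup and its `data-field` attributes are made up for illustration, and SQLite from the Python standard library stands in for the real database so that the example runs anywhere.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical markup: the real Play Store page layout is not described in this post.
SAMPLE_PAGE = """
<div class="app">
  <span data-field="name">Example App</span>
  <span data-field="developer">Example Dev</span>
  <span data-field="price">0.99</span>
  <span data-field="stars">4.3</span>
  <span data-field="ratings">1520</span>
</div>
"""


class AppPageParser(HTMLParser):
    """Collects the text of every element that carries a data-field attribute."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        # Remember which field (if any) the next text node belongs to.
        self._current = dict(attrs).get("data-field")

    def handle_data(self, data):
        if self._current and data.strip():
            self.fields[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None


def parse_app_page(html):
    """Turn one page into a typed record."""
    parser = AppPageParser()
    parser.feed(html)
    f = parser.fields
    return {
        "name": f["name"],
        "developer": f["developer"],
        "price": float(f["price"]),
        "stars": float(f["stars"]),
        "ratings": int(f["ratings"]),
    }


def store_app(conn, app):
    """Insert (or overwrite) one app record in a queryable table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS apps "
        "(name TEXT PRIMARY KEY, developer TEXT, price REAL, stars REAL, ratings INTEGER)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO apps VALUES (:name, :developer, :price, :stars, :ratings)",
        app,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
store_app(conn, parse_app_page(SAMPLE_PAGE))
# Once the data is structured, questions become simple queries.
rows = conn.execute("SELECT name, stars FROM apps WHERE price > 0").fetchall()
print(rows)
```

The point of the structure is the last two lines: once every page becomes a row with typed columns, "which paid apps have the best ratings" is a one-line query instead of another crawl.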

After a week of work I had my own server and crawler set up and running without any problems (you can find the code in my GitHub repository).

All I needed now was to wait for it to do its work, so I waited, and waited, and waited.

My Result: Google Play Store in Numbers

After running my crawler (in multiple processes) for 7 days, I started to get IP banned/blocked by Google’s black magic. Even using proxies and changing my crawler’s behavior dynamically, I would still get blocked really fast. As slow as it was, I managed to mine data out of roughly 1,010,000 apps (which is something near 95% of the Play Store according to Wikipedia).
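One common way to "change behavior dynamically" when a server starts refusing requests is exponential backoff with random jitter. The sketch below is a generic illustration of that technique, not what my crawler actually does, and every name and default value in it (`backoff_delays`, `base`, `factor`, `cap`) is an assumption chosen for the example.

```python
import random


def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=6, rng=None):
    """Return one randomized wait time (in seconds) per retry attempt.

    The upper bound grows exponentially with each attempt and is capped,
    and the actual delay is drawn uniformly below that bound ("full jitter")
    so that several crawler processes do not retry in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * factor ** attempt)
        delays.append(rng.uniform(0, ceiling))
    return delays


# A seeded generator keeps the example reproducible.
delays = backoff_delays(rng=random.Random(42))
for attempt, delay in enumerate(delays):
    print(f"retry {attempt}: wait up to {delay:.2f}s")
```

A real crawler would sleep for each delay before retrying the blocked request. As the paragraph above says, though, waiting longer did not save me from getting blocked.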

If you’re curious about the result, you can find the Excel file I created right here. Note that the database is public, so you can hit the GitHub page and connect to it to run your own queries if you like, or run the crawler at home to help “crowdsource” our own database of Play Store app data.

In the end I learned that we shouldn’t depend on information given to us by the “Big Players”. There are tons of public data sources on the web, waiting to be mined and studied. We might be surprised by some of the results we find.

If you have any questions, considerations, comments or suggestions, leave them in the comments or find me at About.Me

Update: Thanks to the help of this blog’s readers, the crawler was able to populate the database with around 1.1 million apps from the Play Store, which may translate into roughly 95% reach. That’s how big “crowd crawling” can be.

Update 2: By popular demand I have decided to try and implement a crawler on the same model as the Play Store one. You can find it in this public GitHub repository.

Update 3: As of late 2016, the project was sold to a company, and the GitHub repository is no longer available. If you want to chat about this project (or any future one) with me, shoot me an email at: marcello.grechi@gmail.com

6 thoughts on “Google Play Store in Numbers – The numbers that Google won’t share”

  1. Hi Marcello.

    You did a really good job on this crawler, and I agree with you that Google should provide more statistics about apps in the Play Store; if they don’t want to, we should collect this data ourselves.

    I’m currently starting a similar project which will collect data from Google Play and the Apple App Store. I’d like to ask you some questions if you don’t mind. Maybe you could share some experience 🙂

    1. How did you overcome blocks/bans from Google? You posted here that you managed to collect data about 300000 apps before bans/blocks started, but not so long ago you wrote on GitHub that there are 1.1 million apps in your database. So how did you manage to collect it all? Did you find a way to bypass Google’s blocks or was this a success of crowd crawling? 🙂

    2. How long did it take to collect all the data? Could you share some numbers (how many hours/days, how many processes/instances, what kind of AWS machine did you use, how many apps were collected by you and how many by others)?

    3. Do you have a backup of this database, or do you have some protection mechanism against data pollution? There are bad people out there and they could destroy your database (by polluting it with fake data) for no reason at all :/ It’s good to keep that in mind.

    4. Would you mind if I copied your whole database? As I said, I need this data and more from Apple in my project, but I’m going to use a different database structure. I will also do a lot of operations on this data, so I don’t want to use your machine so extensively. It would be great if I could copy your database, but I don’t want to do it without asking.

    Again: really good job and really good attitude (by sharing your results and opening database for others). Thanks 🙂

    • I’m glad you liked the whole project (which is available on GitHub by the way, in case you want to check it out).

      Let me try to answer your questions:

      1 – This number (1.1 million) was the result of “crowdcrawling” in action. As far as I know, there were 10-15 other people using the project (which I had to stop for a while, due to some problems with my AWS EC2 instance). So, in theory, that’s how I was able to “trick” Google. Their blocking system is very robust, so there’s no way (AFAIK) to crawl that much from a single PC.

      2 – The database is hosted on a t1.micro AWS instance (which is lame, I know, but hey, it’s also free for one year). Since I never imagined anyone would actually use this project, the t1.micro was doing its job until then. Two friends and I (so that’s 3 different IPs) were able to fetch roughly 400k apps; the rest got crawled by users spread around the globe (from India to America). I actually only realised that we had that many records last week, when my AWS disk got full and the MongoDB service shut down due to lack of space (it is up again now). It’s impossible for me to tell how long it took to reach this number, since I can’t pinpoint the exact moment people actually started using it.

      3 – I have no backup. To be honest, the database is just the result of a really well designed project, which people can’t break. So if some troll ruins the database, we can populate it again in no time. I do plan, though, to host a snapshot of it on AWS S3 and make it public so that people can use their own Mongo instances if they want to.

      4 – Go for it, it would be a pleasure. Also, if you don’t mind, I would like to take a look at your work/project. I might have some knowledge to share with you.

      Email me at marcello.grechi@gmail.com so we can discuss the best way for you to get your hands on this data. If you just read all my records and export them, it will cost me some cash since AWS charges me for outbound network traffic, so I will have to think about another way to do it. Maybe AWS S3.

      Again, I’m glad you liked it, and feel free to use it however you like.

  2. It is a very good project. It is nice of you to share the source code and provide the resulting database as well. I have one question for you, about the app reviews: is there any way to retrieve all the comments posted by users on each app?

    • Thanks for showing interest in this project, Rabe.

      Regarding the reviews, I haven’t written any code to parse this piece of information yet due to some issues:

      1 – The number of requests needed to parse out all the comments (given that I can only retrieve 10 at a time) would make it way easier for Google to block the crawler.

      2 – The storage needed to keep a record of all the comments of each app (or even the top 1000, for instance) would be too big for me to maintain a “low cost” AWS bill as I am managing to do right now. I could store the reviews as a compressed string, but one could not execute queries on top of it (like regexes or string matches).
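To put rough numbers on both issues, here is a small sketch. The 10 reviews per request, the top 1000 cutoff and the 1.1 million apps all come from this page; the sample review texts and the function name `requests_needed` are made up for illustration.

```python
import math
import re
import zlib

REVIEWS_PER_REQUEST = 10  # page size mentioned in the reply above


def requests_needed(review_count, per_request=REVIEWS_PER_REQUEST):
    """Number of page requests required to fetch review_count reviews."""
    return math.ceil(review_count / per_request)


# Issue 1: fetching only the top 1000 reviews of every app is a lot of traffic.
per_app = requests_needed(1000)
total = per_app * 1_100_000  # roughly the number of apps in the database
print(per_app, total)

# Issue 2: compression saves space but hides the text from queries.
# Made-up sample reviews, repeated to imitate a larger document.
reviews = ["Great app, love it", "Crashes on startup", "Crashes after the update"] * 50
raw = "\n".join(reviews).encode("utf-8")
packed = zlib.compress(raw)
print(len(raw), len(packed))

# A pattern search works directly on the raw text...
hits_raw = len(re.findall(rb"Crashes", raw))
# ...while the compressed blob has to be decompressed before any match is meaningful.
hits_unpacked = len(re.findall(rb"Crashes", zlib.decompress(packed)))
print(hits_raw, hits_unpacked)
```

In other words, a database storing compressed reviews would have to decompress every record on the client side before filtering, which defeats the purpose of querying it.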

      Let me know if you need help with that anyway; email me with your needs so I can at least point you in the right direction.

  3. 1. Why don’t you publish the db file on some server that the downloading side will pay for, like paid files on rapidshare or something?
    2. Maybe you can scrape and store in the db only bad reviews, as for competing apps only bad reviews are useful (i.e. what they can do better).

    • Hello @Fifi,

      1. I’m not familiar with paid file-sharing; I will definitely look closer at something like this. Nice tip 🙂

      2 – What do you mean by “bad” reviews? (One or two stars, maybe?)

      Thanks for the inputs 🙂
