Trying to Make Sense of PyPI Download Statistics

March 10, 2025

A few explanations for millions of downloads per day

If you are a Python user and enjoy diving into all kinds of data (I hear these two predicates are often found in combination), you have probably already looked up PyPI download statistics and stumbled upon the Most downloaded PyPI packages page of pypistats.org. If you were left scratching your head at this ranking, you are not alone. Let me take you through my journey of falsified assumptions and unexpected findings, a journey that changed the way I understand Python package popularity.

Bubble plot of Python packages sized by download counts, with non-optional dependencies represented as arrows from the dependent package to the dependency.

Assumption 1 (spectacularly wrong): download count equals popularity

Reality check: TensorFlow vs boto3 is one of many counterexamples

Close your eyes and think about three popular Python packages. Are NumPy, Pandas or Matplotlib among them? I would not be surprised. Especially if you are a large language model (and who would better represent a hypothetical average Pythonista?). When I asked three LLMs to rank the most popular packages, they all started their list with those three iconic packages, before branching out to mention the likes of scikit-learn, TensorFlow, Keras and Django. Of those household names, only NumPy makes the top 20 download chart, and it is far away from the top 5. Conversely, boto3 dominates the download ranking despite not being mentioned by any of the surveyed LLMs. The same goes for urllib3 and botocore. In fact, requests seems to be the only package that is both highly downloaded and widely recognized as “popular”. Clearly, popularity and download counts are not proportional.

Figure 1: Assumed popularity (based on an informal survey of ChatGPT, Anthropic Claude and DeepSeek R1) vs download counts for selected Python packages

Assumption 2 (not entirely true): package dependencies explain download counts

Reality check: requests vs certifi is one of multiple counterexamples

A simple explanation for a package being downloaded a lot without being directly popular among end users may be that it is required by other packages. In other words, high download counts for a package may be explained by its being a dependency of popular packages, or a dependency of a dependency, and so on. And if package D is a non-optional dependency of package P, then every installation of P should trigger a download of D, meaning D’s download count should be at least equal to P’s, right? Not quite. These assumptions may seem to hold when looking at Figure 2a, in which urllib3 accumulates more downloads than the packages that depend on it… but they are clearly contradicted by Figure 2b.
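
The expectation above (a required dependency should be downloaded at least as often as the package that needs it) can be written as a small check. This is only a sketch: the function name, the data layout and the numbers are assumptions made for illustration, not real PyPI statistics.

```python
def find_counterexamples(downloads, depends_on):
    """Return (dependent, dependency) pairs where the dependency
    has FEWER downloads than the package that requires it."""
    counterexamples = []
    for dependent, dependencies in depends_on.items():
        for dependency in dependencies:
            if downloads[dependency] < downloads[dependent]:
                counterexamples.append((dependent, dependency))
    return counterexamples


# Placeholder numbers for illustration only, not real download counts.
downloads = {"P": 100, "D1": 150, "D2": 80}
# P has two non-optional dependencies, D1 and D2.
depends_on = {"P": ["D1", "D2"]}

print(find_counterexamples(downloads, depends_on))
```

If Assumption 2 held everywhere, the returned list would always be empty. Any pair it reports is a case that dependencies alone cannot explain.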

Figure 2a: Dependency relationships and download counts for a set of packages in which each dependency is more downloaded than the packages depending on it.

If requests depends on certifi and charset-normalizer, how can it be more downloaded than these dependencies?

Figure 2b: Dependency relationships and download counts for a set of packages in which some dependencies (certifi, charset-normalizer, idna) are less downloaded than the package depending on them (requests).

This led me to another hypothesis: perhaps frequent updates to a package can trigger more downloads, compared to more stable dependencies which need not be downloaded anew for each release of the dependent package?

Assumption 3 (getting warmer?): package releases together with dependencies explain download counts

Reality check: boto3 and botocore and a few others defy explanation

Frequent releases could indeed explain why some packages outpace their dependencies in downloads.

For instance, botocore (over 700 releases since 2022) gets significantly more downloads than its dependencies python-dateutil (2 releases since 2022) and six (1 release since 2022).
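
Release counts like these can be derived from the JSON metadata that PyPI serves per project at https://pypi.org/pypi/<project>/json. The sketch below is not necessarily how the figures in this post were computed. The helper names are hypothetical, and dating a version by its earliest uploaded file is an assumption.

```python
import json
from datetime import datetime, timezone
from urllib.request import urlopen


def fetch_project_metadata(project):
    """Download the JSON metadata that PyPI publishes for a project."""
    with urlopen(f"https://pypi.org/pypi/{project}/json") as response:
        return json.load(response)


def count_releases_since(metadata, since):
    """Count versions whose first file was uploaded on or after `since`."""
    count = 0
    for files in metadata["releases"].values():
        upload_times = [
            datetime.fromisoformat(f["upload_time_iso_8601"].replace("Z", "+00:00"))
            for f in files
        ]
        # Versions with no uploaded files are ignored.
        if upload_times and min(upload_times) >= since:
            count += 1
    return count


# Synthetic payload for illustration only (not real release data).
sample = {
    "releases": {
        "1.0.0": [{"upload_time_iso_8601": "2021-06-01T12:00:00.000000Z"}],
        "1.1.0": [{"upload_time_iso_8601": "2023-02-15T08:30:00.000000Z"}],
        "1.2.0": [],
    }
}
since_2022 = datetime(2022, 1, 1, tzinfo=timezone.utc)
print(count_releases_since(sample, since_2022))
```

With network access, `count_releases_since(fetch_project_metadata("botocore"), since_2022)` would give the corresponding count for a real project. The result may differ slightly from hand counts depending on how yanked or pre-release versions are treated.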

Dependency relationships, download counts and numbers of releases for a set of packages in which the higher download count of botocore compared to its dependencies python-dateutil and six may be explained by more frequent releases.

This makes perfect sense. However, the theory collapses when we look at requests versus its dependencies certifi and idna, which actually had more frequent releases in recent years.

Dependency relationships, download counts and numbers of releases for requests and related packages, where the higher download count of requests remains unexplained.

As of March 2025, there has been no new release of requests since version 2.32.3 in May 2024, while both idna and certifi had three releases during this period. The time series below shows clear weekly patterns (with weekday spikes and weekend dips), interrupted only by the Christmas break—but no discernible impact from new releases.
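
One way to quantify the weekly pattern is to average daily downloads per weekday. The sketch below assumes a list of records with a `date` (ISO format) and a `downloads` field. That layout is an assumption chosen for illustration, and the sample values are placeholders, not real measurements.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def mean_downloads_by_weekday(records):
    """Average the daily download counts separately for each weekday."""
    totals = defaultdict(int)
    day_counts = defaultdict(int)
    for record in records:
        # date.weekday() returns 0 for Monday through 6 for Sunday.
        name = WEEKDAYS[date.fromisoformat(record["date"]).weekday()]
        totals[name] += record["downloads"]
        day_counts[name] += 1
    return {name: totals[name] / day_counts[name] for name in totals}


# Placeholder values for illustration only, not real download counts.
records = [
    {"date": "2025-03-03", "downloads": 100},  # a Monday
    {"date": "2025-03-08", "downloads": 40},   # a Saturday
    {"date": "2025-03-10", "downloads": 120},  # a Monday
]
print(mean_downloads_by_weekday(records))
```

A large gap between weekday and weekend averages would be consistent with downloads driven by working-hours activity such as automated builds, rather than by release dates.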

Time series of daily downloads and version releases of selected packages in the last 6 months

Assumption 4: the AWS effect and other hidden forces

Time series of daily downloads and version releases of botocore and boto3 in the last 6 months

The most perplexing puzzle to me remains the following: how can boto3 be downloaded twice as often as its dependency botocore, although both packages follow identical release schedules, with version 1.x.y of boto3 requiring version 1.x.y of botocore? My best guess is that botocore is often sourced from somewhere other than PyPI. In any case, boto3’s top position in the download charts reveals the large influence of Amazon’s cloud infrastructure on package statistics.
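
To verify which dependencies are non-optional, one can inspect the `requires_dist` list in a project's PyPI JSON metadata (under `info`). It contains requirement strings, with optional ones carrying an `extra == ...` marker. The following is a simplified sketch: the function name is hypothetical, the parsing is deliberately naive, and the requirement strings are illustrative rather than copied from a real release.

```python
import re


def required_dependencies(requires_dist):
    """Return the names of dependencies not gated behind an optional extra."""
    names = []
    for requirement in requires_dist or []:  # the field may be None
        spec, _, marker = requirement.partition(";")
        if re.search(r"\bextra\b", marker):
            continue  # only installed when an optional extra is requested
        match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", spec)
        if match:
            names.append(match.group(1).lower())
    return names


# Illustrative requirement strings (version bounds are made up).
example = [
    "botocore<1.38.0,>=1.37.0",
    "jmespath<2.0.0,>=0.7.1",
    'botocore[crt]<2.0a0,>=1.21.0; extra == "crt"',
]
print(required_dependencies(example))
```

Note that dependencies with other environment markers (for example a platform or Python version condition) are still listed here, even though they are not installed everywhere. A more robust version would parse the strings with a dedicated requirements parser.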

Conclusions

While you might expect widely recognized libraries like NumPy and Pandas to dominate PyPI download rankings, the actual data show more complex patterns. Going beyond the initial (wrong) assumption that download counts are proportional to package popularity, we saw that dependencies and release frequencies influence download counts. These factors help explain why six, a Python 2 and 3 compatibility library, still ranks among the most downloaded packages five years after Python 2 was sunset: it is used by python-dateutil, which in turn is used by many other packages.

Still, each new explanation was contradicted by new unexplained counter-examples. The Python Packaging User Guide does warn about various factors reducing the accuracy of download counts (including pip’s download cache and dishonest attempts at download count inflation). If alternative distribution channels and different caching or mirroring practices can explain these discrepancies, it is probably wiser for me to get out of this rabbit hole now and accept the limitations of download counts as a package metric. After all, as the same Python Packaging User Guide rightly remarks, “Just because a project has been downloaded a lot doesn’t mean it’s good; Similarly just because a project hasn’t been downloaded a lot doesn’t mean it’s bad!”. Still, if you know more about the hidden forces impacting download counts and explaining the botocore-vs-boto3 paradox, please tell me!

References

  • The section on Analyzing PyPI package downloads of the Python Packaging User Guide is the reference on the public PyPI download statistics dataset made available on Google BigQuery, and on why this dataset is not displayed or made available by PyPI directly (in summary, because it is “highly inaccurate” and “not particularly useful”).

  • Pypistats.org is the best place for easily accessing these PyPI download statistics, but only for the last 6 months.

  • A monthly dump of the 15,000 most downloaded packages from PyPI is available at https://hugovk.github.io/top-pypi-packages/.

  • This GitHub issue on Differentiating organic vs automated installations highlights the difficulty of distinguishing between different types of package downloads. Manually typing pip install and having the same package installed automatically in the context of a continuous integration (CI) pipeline are two different things, but there are lots of possible variations beyond these two cases and currently no good way to distinguish between those in download statistics.